Classification and Regression Trees

David S. Rosenberg, New York University
DS-GA 1003 / CSCI-GA 2567
April 3, 2018

Contents

1. Trees
2. Regression Trees
3. Classification Trees
4. Trees in General

Trees

Tree Terminology

[Figure: tree terminology diagram, from Criminisi et al., MSR-TR-2011-114, 28 October 2011.]

A Binary Decision Tree

A binary tree: each node has either 2 children or 0 children.

[Figure from Criminisi et al., MSR-TR-2011-114, 28 October 2011.]

Binary Decision Tree on $\mathbb{R}^2$

Consider a binary tree on $\{(X_1, X_2) : X_1, X_2 \in \mathbb{R}\}$.

[Figure from An Introduction to Statistical Learning, with applications in R (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.]

Types of Decision Trees

We'll only consider binary trees (vs. multiway trees, where nodes can have more than 2 children) in which decisions at each node involve only a single feature (i.e. input coordinate):
- for continuous variables, splits are always of the form $x_i \le t$
- for discrete variables, a split partitions the values into two groups

Other types of splitting rules:
- oblique decision trees, or binary space partition trees (BSP trees), have a linear split at each node
- sphere trees: space is partitioned by a sphere of a certain radius around a fixed point

Regression Trees

Binary Regression Tree on $\mathbb{R}^2$

Consider a binary tree on $\{(X_1, X_2) : X_1, X_2 \in \mathbb{R}\}$.

[Figure from An Introduction to Statistical Learning, with applications in R (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.]

Fitting a Regression Tree

The decision tree gives the partition of $\mathcal{X}$ into regions $\{R_1, \dots, R_M\}$. Recall that a partition is a disjoint union, that is:
$$\mathcal{X} = R_1 \cup R_2 \cup \cdots \cup R_M \quad\text{and}\quad R_i \cap R_j = \emptyset \text{ for } i \ne j.$$

Fitting a Regression Tree

Given the partition $\{R_1, \dots, R_M\}$, the final prediction is
$$f(x) = \sum_{m=1}^{M} c_m \mathbb{1}(x \in R_m).$$

How do we choose $c_1, \dots, c_M$? For the loss function $\ell(\hat{y}, y) = (\hat{y} - y)^2$, the best choice is
$$\hat{c}_m = \operatorname{ave}(y_i \mid x_i \in R_m).$$
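
To make the region-mean rule concrete, here is a minimal NumPy sketch (not from the lecture); the region assignments are assumed to be given, e.g. by an already-built tree, and the helper names fit_region_means / predict are made up for illustration.

```python
import numpy as np

def fit_region_means(y, region_ids):
    """For each region, the squared-error-optimal constant c_m is the mean
    of the training targets that fall in that region."""
    return {r: y[region_ids == r].mean() for r in np.unique(region_ids)}

def predict(region_means, region_ids):
    """Predict c_m for every point, where m is the region the point falls in."""
    return np.array([region_means[r] for r in region_ids])

# Toy example: 6 training points already assigned to regions 1 and 2.
y = np.array([1.0, 1.2, 0.8, 5.0, 4.8, 5.2])
region_ids = np.array([1, 1, 1, 2, 2, 2])

means = fit_region_means(y, region_ids)   # {1: 1.0, 2: 5.0}
print(means)
print(predict(means, region_ids))
```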

Trees and Overfitting

If we do enough splitting, every unique x value will be in its own partition. This very likely overfits. As usual, we need to control the complexity of our hypothesis space. CART (Breiman et al., 1984) uses the number of terminal nodes; tree depth is also common.

Complexity of a Tree

Let $|T| = M$ denote the number of terminal nodes in $T$. We will use $|T|$ to measure the complexity of a tree. For any given complexity, we want the tree minimizing square error on the training set. Finding the optimal binary tree of a given complexity is computationally intractable, so we proceed with a greedy algorithm: we build the tree one node at a time, without any planning ahead.

Root Node, Continuous Variables

Let $x = (x_1, \dots, x_d) \in \mathbb{R}^d$ (d features). A splitting variable $j \in \{1, \dots, d\}$ and split point $s \in \mathbb{R}$ define the partition
$$R_1(j, s) = \{x : x_j \le s\}, \qquad R_2(j, s) = \{x : x_j > s\}.$$

Root Node, Continuous Variables

For each splitting variable j and split point s,
$$\hat{c}_1(j, s) = \operatorname{ave}(y_i \mid x_i \in R_1(j, s)), \qquad \hat{c}_2(j, s) = \operatorname{ave}(y_i \mid x_i \in R_2(j, s)).$$

Find j, s minimizing the loss
$$L(j, s) = \sum_{i : x_i \in R_1(j, s)} (y_i - \hat{c}_1(j, s))^2 + \sum_{i : x_i \in R_2(j, s)} (y_i - \hat{c}_2(j, s))^2.$$

How?

Finding the Split Point

Consider splitting on the j-th feature $x_j$. If $x_{j(1)}, \dots, x_{j(n)}$ are the sorted values of the j-th feature, we only need to check split points between adjacent values. Traditionally we take split points halfway between adjacent values:
$$s_j \in \left\{ \tfrac{1}{2}\left(x_{j(r)} + x_{j(r+1)}\right) \;\middle|\; r = 1, \dots, n-1 \right\}.$$

So we only need to check the performance of $n-1$ splits.
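
As an illustration (not from the lecture), here is a minimal sketch of this search for a single feature, scoring each of the n-1 candidate split points by the squared-error loss L(j, s) above. It is a naive scan that recomputes the region means at every candidate; real implementations update running sums incrementally.

```python
import numpy as np

def best_split_1d(x, y):
    """Exhaustively search the n-1 candidate split points for one feature,
    scoring each by L(j, s) with region means as predictions.
    Returns (best split point s, best loss). Assumes distinct feature values."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_s, best_loss = None, np.inf
    for r in range(len(x) - 1):
        s = 0.5 * (x_sorted[r] + x_sorted[r + 1])   # halfway between adjacent values
        left, right = y_sorted[: r + 1], y_sorted[r + 1 :]
        loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if loss < best_loss:
            best_s, best_loss = s, loss
    return best_s, best_loss

# Toy example: a clear jump between x <= 3 and x > 3.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 0.9, 1.0, 5.0, 5.2])
print(best_split_1d(x, y))   # (3.5, ~0.04)
```

Repeating this search over all d features and taking the best (j, s) pair gives the root split.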

Then Proceed Recursively

1. We have determined $R_1$ and $R_2$.
2. Find the best split for the points in $R_1$.
3. Find the best split for the points in $R_2$.
4. Continue...

When do we stop?

Complexity Control Strategy

If the tree is too big, we may overfit; if too small, we may miss patterns in the data (underfit). Options:
- Limit the max depth of the tree.
- Require all leaf nodes to contain a minimum number of points.
- Require a node to have at least a certain number of data points before it can be split.
- Do backward pruning, the approach of CART (Breiman et al., 1984):
  1. Build a really big tree (e.g. until all regions have 5 points).
  2. Prune the tree back greedily, all the way to the root, assessing performance on a validation set.
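
For reference, a hedged sketch of how these pre-pruning controls map onto scikit-learn's decision trees; the lecture does not prescribe a library, and the specific values below are arbitrary.

```python
from sklearn.tree import DecisionTreeRegressor

# Pre-pruning knobs roughly corresponding to the strategies above
# (parameter names are scikit-learn's, not the lecture's).
tree = DecisionTreeRegressor(
    max_depth=4,           # limit the maximum depth of the tree
    min_samples_leaf=5,    # every leaf must contain at least 5 training points
    min_samples_split=10,  # a node needs at least 10 points to be considered for splitting
    ccp_alpha=0.0,         # > 0 enables cost-complexity (backward) pruning, discussed later
)
# tree.fit(X_train, y_train)   # X_train, y_train assumed to be provided
```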

Classification Trees

Classification Trees

Consider the classification case: $\mathcal{Y} = \{1, 2, \dots, K\}$. We need to modify the criteria for splitting nodes.

Classification Trees

Let node m represent region $R_m$, with $N_m$ observations. Denote the proportion of observations in $R_m$ with class k by
$$\hat{p}_{mk} = \frac{1}{N_m} \sum_{i : x_i \in R_m} \mathbb{1}(y_i = k).$$

The predicted classification for node m is
$$k(m) = \arg\max_k \hat{p}_{mk},$$
and the predicted class probability distribution is $(\hat{p}_{m1}, \dots, \hat{p}_{mK})$.

Misclassification Error

Consider node m representing region $R_m$, with $N_m$ observations. Suppose we predict
$$k(m) = \arg\max_k \hat{p}_{mk}$$
as the class for all inputs in region $R_m$. What is the misclassification rate on the training data? It's just $1 - \hat{p}_{m,k(m)}$.

What loss function to use for node splitting?

The natural loss function for classification is 0/1 loss. Is this tractable for finding the best split? Yes! Should we use it? Maybe not!

If we're only splitting once, then it makes sense to split using the ultimate loss function (say 0/1). But we can split nodes repeatedly, so we don't have to get it right all at once.

Splitting Example

Two-class problem: 4 observations in each class.
- Split 1: (3,1) and (1,3) [each region has 3 of one class and 1 of the other]
- Split 2: (2,4) and (2,0) [one region has 2 of one class and 4 of the other; the other region is pure]

The misclassification rate is the same for the two splits (2 errors each). But in Split 1 we'll want to split each node again, and we'll end up with a leaf node containing a single element. In Split 2, we're already done with the node (2,0).

Splitting Criteria

Eventually we want pure leaf nodes (i.e. as close to a single class as possible). We'll find the splitting variable and split point minimizing some node impurity measure.

Two-Class Node Impurity Measures

Consider binary classification and let p be the relative frequency of class 1.

[Figure: three node impurity measures as a function of p; HTF Figure 9.3.]

Classification Trees: Node Impurity Measures

Consider leaf node m representing region $R_m$, with $N_m$ observations. Three measures $Q_m(T)$ of node impurity for leaf node m:
- Misclassification error: $1 - \hat{p}_{m,k(m)}$.
- Gini index: $\sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})$.
- Entropy or deviance (equivalent to using information gain): $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$.
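
These measures are simple to compute from a node's class counts. A minimal sketch (not from the lecture), using the natural log for entropy and the convention $0 \log 0 = 0$:

```python
import numpy as np

def node_impurities(class_counts):
    """Compute the three impurity measures for a leaf node, given the
    vector of class counts (N_m1, ..., N_mK) of the points in the node."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()                       # \hat{p}_{mk}
    misclass = 1.0 - p.max()                        # 1 - \hat{p}_{m,k(m)}
    gini = np.sum(p * (1.0 - p))                    # Gini index
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))  # entropy, with 0 log 0 = 0
    return misclass, gini, entropy

print(node_impurities([3, 1]))  # impure node from the splitting example: (0.25, 0.375, ~0.562)
print(node_impurities([2, 0]))  # pure node (2,0): all three measures are 0
```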

Class Distributions: Pre-split

[Figure from Criminisi et al., MSR-TR-2011-114, 28 October 2011.]

Class Distributions: Split Search

(Maximizing information gain is equivalent to minimizing entropy.)

[Figure from Criminisi et al., MSR-TR-2011-114, 28 October 2011.]

Splitting nodes: How exactly do we do this?

Let $R_L$ and $R_R$ be the regions corresponding to a potential node split, with $N_L$ points in $R_L$ and $N_R$ points in $R_R$. Let $Q(R_L)$ and $Q(R_R)$ be the node impurity measures. Then find the split that minimizes the weighted sum of node impurities
$$N_L\, Q(R_L) + N_R\, Q(R_R)$$
(equivalently, their weighted average, since $N_L + N_R$ is fixed for a given node).

Classification Trees: Node Impurity Measures

For building the tree, Gini and entropy seem to be more effective: they push for purer nodes, not just a lower misclassification rate. A good split may not change the misclassification rate at all!

Two-class problem: 4 observations in each class.
- Split 1: (3,1) and (1,3) [each region has 3 of one class and 1 of the other]
- Split 2: (2,4) and (2,0) [one region has 2 of one class and 4 of the other; the other region is pure]

The misclassification rate is the same for the two splits, but Gini and entropy prefer Split 2.
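
To see why, here is the comparison worked out with the weighted criterion $N_L Q(R_L) + N_R Q(R_R)$ from the previous slide, using the Gini index:
$$
\begin{aligned}
\text{Split 1 (Gini):}\quad & 4 \cdot 2\cdot\tfrac{3}{4}\cdot\tfrac{1}{4} \;+\; 4 \cdot 2\cdot\tfrac{1}{4}\cdot\tfrac{3}{4} \;=\; 3,\\
\text{Split 2 (Gini):}\quad & 6 \cdot 2\cdot\tfrac{1}{3}\cdot\tfrac{2}{3} \;+\; 2 \cdot 0 \;=\; \tfrac{8}{3} \approx 2.67,\\
\text{Misclassification:}\quad & 4\cdot\tfrac{1}{4} + 4\cdot\tfrac{1}{4} \;=\; 2 \;=\; 6\cdot\tfrac{1}{3} + 2\cdot 0 \quad\text{(a tie)}.
\end{aligned}
$$
So Gini strictly prefers Split 2, while the misclassification count cannot distinguish the two splits.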

Trees in General

Missing Features

What to do about missing features?
- Throw out inputs with missing features.
- Impute missing values with feature means.
- If a categorical feature, let "missing" be a new category.

For trees we can do something else...

Surrogate Splits for Missing Data

For any non-terminal node that splits using feature f, we can find a surrogate split using each of the other features. To make a surrogate using f', we find the split using f' that best approximates the split using f, where "best" is defined in terms of 0/1 loss on the examples for which neither f nor f' is missing.

If there are d features, we'll have d-1 surrogate splits to approximate the split on f, and we can rank these splits by how well they approximate the original split. We repeat the above process for every non-terminal node, so each node has its primary split and d-1 surrogate splits.

If we're predicting on an example and the feature needed to evaluate a split is missing, we simply go down the list of surrogate splits until we get to one for which the feature is not missing.

(I found the CART book a bit vague on this, so this is my best guess for what is intended. If somebody finds a clear statement, please let me know.)
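
A minimal sketch of finding a single surrogate split under the description above (which is itself a best guess); measuring agreement by 0/1 loss follows the text, while allowing the surrogate's left/right orientation to flip is an extra assumption of mine.

```python
import numpy as np

def best_surrogate_split(x_other, goes_left):
    """Given the primary split's left/right assignment (goes_left) and the values
    of another feature on the same examples (both observed), find the threshold
    on the other feature whose split best agrees with the primary split (0/1 loss).
    Returns (threshold, send_small_values_left, agreement)."""
    xs = np.unique(x_other)
    candidates = 0.5 * (xs[:-1] + xs[1:])      # thresholds between adjacent values
    best = (None, True, -1.0)
    for t in candidates:
        agree = np.mean((x_other <= t) == goes_left)
        # Assumption: also allow the flipped orientation of the surrogate split.
        for orientation, a in ((True, agree), (False, 1.0 - agree)):
            if a > best[2]:
                best = (t, orientation, a)
    return best

# Toy usage: the primary split sends the first three examples left.
x_other = np.array([0.2, 0.4, 0.5, 1.3, 1.8, 2.0])
goes_left = np.array([True, True, True, False, False, False])
print(best_surrogate_split(x_other, goes_left))   # (0.9, True, 1.0)
```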

Categorical Features

Suppose we have a categorical feature with q possible values (unordered), and we want to find the best split into 2 groups. There are $2^{q-1} - 1$ distinct splits. Is this tractable? Maybe not in general. But for binary classification, $\mathcal{Y} = \{0, 1\}$, there is an efficient algorithm.

Categorical Features in Binary Classification

Assign each category a number: the proportion of class 0 among training examples with that category. Then find the optimal split as though it were a numeric feature. For binary classification, this is equivalent to searching over all splits, at least for node impurity measures of a certain class, including square error, Gini, and entropy. (This trick doesn't work for multiclass; there we would have to use approximations.)

Statistical issues with categorical features? If a feature has a very large number of categories, we can overfit. Extreme example: "Row Number" could lead to perfect classification with a single split.
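
A minimal sketch of the encoding step (the helper name is made up); once encoded, the threshold search is exactly the one used for continuous features.

```python
import numpy as np

def encode_by_class0_proportion(categories, y):
    """Map each category to the proportion of class 0 among training examples
    with that category; the categorical feature then behaves like a numeric one."""
    cats = np.asarray(categories)
    y = np.asarray(y)
    return {c: np.mean(y[cats == c] == 0) for c in np.unique(cats)}

# Toy example: a feature with 3 categories and binary labels.
categories = np.array(["a", "a", "b", "b", "b", "c", "c"])
y          = np.array([ 0,   0,   0,   1,   1,   1,   1 ])
encoding = encode_by_class0_proportion(categories, y)
print(encoding)                       # {'a': 1.0, 'b': 0.333..., 'c': 0.0}
x_numeric = np.array([encoding[c] for c in categories])
# Now search threshold splits on x_numeric exactly as for a continuous feature.
```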

Trees vs Linear Models

Trees have to work much harder to capture linear relations.

[Figure: decision boundaries of a linear model vs. a tree on two 2D classification problems (axes $X_1$ and $X_2$), from An Introduction to Statistical Learning, with applications in R (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.]

Interpretability

Trees are certainly easy to explain: you can show a tree on a slide. Small trees seem interpretable; for large trees, maybe not so easy.

Trees for Nonlinear Feature Discovery

Suppose tree T gives the partition $R_1, \dots, R_M$. Predictions are
$$f(x) = \sum_{m=1}^{M} c_m \mathbb{1}(x \in R_m).$$

Each region $R_m$ can be viewed as giving a feature function $x \mapsto \mathbb{1}(x \in R_m)$. We can use these nonlinear features in, e.g., lasso regression.
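
As a sketch of this idea using scikit-learn (the lecture does not prescribe a library; the data, tree size, and lasso penalty below are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)

# Fit a tree, then turn its leaf memberships into one-hot indicator features
# x -> 1(x in R_m), and fit a lasso on those nonlinear features.
tree = DecisionTreeRegressor(max_leaf_nodes=20, random_state=0).fit(X, y)
leaf_ids = tree.apply(X).reshape(-1, 1)                       # leaf index for each example
leaf_features = OneHotEncoder(handle_unknown="ignore").fit_transform(leaf_ids)
lasso = Lasso(alpha=0.01).fit(leaf_features, y)
print((np.abs(lasso.coef_) > 1e-8).sum(), "of", leaf_features.shape[1], "leaf features kept")
```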

Comments about Trees

- Trees make no use of geometry: no inner products or distances (they are called a nonmetric method), so feature scale is irrelevant.
- Prediction functions are not continuous: not so bad for classification, but may not be desirable for regression.

Appendix: Tree Pruning

Stopping Conditions for Building the Big Tree

The first step is to build the big tree. Keep splitting nodes until every node either has
- zero error, OR
- C or fewer examples (typically C = 5 or C = 1), OR
- all inputs in the node identical (and thus we cannot split more).

Pruning the Tree

Consider an internal node n. To prune the subtree rooted at n:
- eliminate all descendants of n
- n becomes a terminal node

Tree Pruning

[Figure: the full tree $T_0$, from An Introduction to Statistical Learning, with applications in R (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.]

Tree Pruning

[Figure: a subtree $T \subset T_0$, from An Introduction to Statistical Learning, with applications in R (Springer, 2013), with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.]

Empirical Risk and Tree Complexity

Suppose we want to prune a big tree $T_0$. Let $\hat{R}(T)$ be the empirical risk of T (i.e. square error on the training set). Clearly, for any subtree $T \subset T_0$, $\hat{R}(T) \ge \hat{R}(T_0)$.

Let $|T|$ be the number of terminal nodes in T; $|T|$ is our measure of complexity for a tree.

Cost Complexity (or Weakest Link) Pruning

Definition: the cost complexity criterion with parameter α is
$$C_\alpha(T) = \hat{R}(T) + \alpha |T|.$$
It trades off between the empirical risk and the complexity of the tree.

Cost complexity pruning:
- For each α, find the subtree $T \subset T_0$ minimizing $C_\alpha(T)$ (on training data).
- Use cross validation to find the right choice of α.
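
For practice, a hedged sketch of this procedure with scikit-learn, which implements minimal cost-complexity pruning through the ccp_alpha parameter; the dataset and settings are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Candidate alphas along the pruning path of the big tree grown on all the data.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas

# Pick alpha by cross-validation, then refit the pruned tree.
cv_scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                             X, y, cv=5).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(cv_scores))]
pruned = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print("best alpha:", best_alpha, "leaves:", pruned.get_n_leaves())
```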

Do we need to search over all subtrees?

The cost complexity criterion with parameter α is
$$C_\alpha(T) = \hat{R}(T) + \alpha |T|.$$

$C_\alpha(T)$ has the familiar regularized ERM form, but we cannot just differentiate with respect to the parameters of a tree T. To minimize $C_\alpha(T)$ over subtrees $T \subset T_0$, it seems we need to evaluate exponentially many subtrees $T \subset T_0$ (in particular, we can include or exclude any subset of internal nodes that are parents of leaf nodes; see "On subtrees of trees"). Amazingly, we only need to try $N_{\text{Int}}$, where $N_{\text{Int}}$ is the number of internal nodes of $T_0$.

Cost Complexity Greedy Pruning Algorithm

- Find a proper subtree $T_1 \subset T_0$ that minimizes $\hat{R}(T_1) - \hat{R}(T_0)$. We can get $T_1$ by removing a single pair of leaf nodes, so that their internal-node parent becomes a leaf node. This $T_1$ has 1 fewer internal node than $T_0$ (and 1 fewer leaf node).
- Then find the proper subtree $T_2 \subset T_1$ that minimizes $\hat{R}(T_2) - \hat{R}(T_1)$.
- Repeat until we have removed all internal nodes and are left with just a single node (a leaf node).

If $N_{\text{Int}}$ is the number of internal nodes of $T_0$, then we end up with a nested sequence of trees:
$$\mathcal{T} = \{\, T_0 \supset T_1 \supset T_2 \supset \cdots \supset T_{N_{\text{Int}}} \,\}.$$

($T_1$ is a proper subtree of $T_0$ if $T_1 \subseteq T_0$ and $T_1 \ne T_0$.)

Greedy Pruning is Sufficient

The cost complexity pruning algorithm gives us a set of nested trees:
$$\mathcal{T} = \{\, T_0 \supset T_1 \supset T_2 \supset \cdots \supset T_{N_{\text{Int}}} \,\}.$$

Breiman et al. (1984) proved that this is all you need. That is, for every $\alpha \ge 0$,
$$\arg\min_{T \subseteq T_0} C_\alpha(T) \in \mathcal{T},$$
so we only need to evaluate $N_{\text{Int}}$ trees.

Regularization Path for Trees on the SPAM Dataset (HTF Figure 9.4)

For each α, we find the optimal tree $T_\alpha$ on the training set; the corresponding tree size $|T_\alpha|$ is shown on the bottom axis. The blue curve gives error-rate estimates from cross-validation (the tree size in each fold may differ from $|T_\alpha|$), and the orange curve is the test error.