CART. Classification and Regression Trees. Rebecka Jörnsten. Mathematical Sciences University of Gothenburg and Chalmers University of Technology




2 CART

CART stands for Classification And Regression Trees. As with kNN, the underlying assumption is that the posterior $p(y = c \mid x)$ is locally constant, but the neighborhood of $x$ is defined differently. Instead of looking at the nearest neighbors of $x$, we partition the $X$ space into rectangular regions and assume a constant value for the posterior within each region. This leads to a region-specific class label prediction. For a CART with $M$ regions, the rule can be written as

$$\hat{c}(x) = \arg\max_c \sum_{m=1}^{M} 1\{x \in R_m\} \left( \frac{1}{N_m} \sum_{i \in R_m} 1\{y_i = c\} \right),$$

where $R_m$ denotes the $m$-th rectangular region, $N_m$ is the number of training observations in $R_m$, and the expression inside the brackets computes the class proportion within the region.
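The rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regions are given in advance as boxes of half-open intervals, and the names `predict_region_rule`, `regions`, `train_X` and `train_y` are hypothetical, not taken from the slides.

```python
from collections import Counter

def predict_region_rule(x, regions, train_X, train_y):
    """Return the majority class among training points in the region containing x.

    regions: list of boxes; each box is a list of (low, high) pairs, one per
    feature. A point lies in a box if low < value <= high for every feature
    (this half-open convention is an assumption of the sketch).
    """
    def in_box(point, box):
        return all(lo < v <= hi for v, (lo, hi) in zip(point, box))

    for box in regions:
        if in_box(x, box):
            labels = [y for p, y in zip(train_X, train_y) if in_box(p, box)]
            if not labels:
                raise ValueError("region contains no training data")
            # The class with the largest count also has the largest proportion.
            return Counter(labels).most_common(1)[0][0]
    raise ValueError("x is not covered by any region")

# Toy data: two regions obtained by one split on the first feature at 3.5.
inf = float("inf")
regions = [[(-inf, 3.5), (-inf, inf)], [(3.5, inf), (-inf, inf)]]
train_X = [(1.0, 2.0), (2.0, 0.5), (3.0, 1.0), (4.0, 2.0), (5.0, 0.0)]
train_y = ["a", "a", "b", "b", "b"]

left_label = predict_region_rule((2.5, 1.0), regions, train_X, train_y)
right_label = predict_region_rule((4.5, 1.0), regions, train_X, train_y)
```

In the toy data the left region holds two "a" points and one "b" point, so a new point falling there is labeled "a" even though its nearest training neighbor is a "b".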

3 CART

We do not allow arbitrary rectangular regions, because searching over all possible partitions is too complex (too many choices). Instead, we form the rectangular regions sequentially via binary splits of the data. (Some extensions of CART allow for more flexible regions, e.g. work by A. Molinaro.)

4 CART algorithm

1. Consider each feature $j$ and split the data into two parts,
   data 1 = $\{i : x_{ij} > T_j\}$ and data 2 = $\{i : x_{ij} \le T_j\}$,
   where the threshold $T_j$ is selected to make the two data sets as pure as possible in terms of class labels.
2. Choose the feature $j$ that is best in terms of splitting the data into pure parts.
3. Repeat steps 1-2 on data 1 and data 2 separately (i.e. treat each data split as a new data set).
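Steps 1 and 2 can be sketched as an exhaustive search. The sketch below makes two assumptions that the slides do not state: purity is measured by the number of misclassified points when each part predicts its majority class, and candidate thresholds are the midpoints between consecutive observed values. The function names are hypothetical.

```python
def misclassified(labels):
    """Number of points that are not in the majority class of this part."""
    if not labels:
        return 0
    return len(labels) - max(labels.count(c) for c in set(labels))

def best_split(X, y):
    """Return (errors, feature index j, threshold T_j) of the purest binary split."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            part1 = [yi for row, yi in zip(X, y) if row[j] > t]   # data 1
            part2 = [yi for row, yi in zip(X, y) if row[j] <= t]  # data 2
            errors = misclassified(part1) + misclassified(part2)
            if best is None or errors < best[0]:
                best = (errors, j, t)
    return best

# Toy data: the first feature separates the classes, the second does not.
X = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (4.0, 2.0)]
y = [0, 0, 1, 1]
result = best_split(X, y)  # (0, 0, 2.5): zero errors, feature 0, threshold 2.5
```

Step 3 then amounts to calling `best_split` again on each of the two parts.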

5 CART algorithm

You keep building the CART rule by iterating the algorithm, splitting the data into smaller and smaller parts. Stopping criteria:

- The number of data splits exceeds $M$, e.g. 30.
- The data set in a region comprises fewer than a set minimum number of observations, e.g. 5.
- The error rate improvement is below some cutoff, e.g. 1% of the error rate without any data splitting.

Running CART until one of these stopping criteria kicks in is called the growing phase and generates a so-called max tree.
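As a rough sketch, the three stopping criteria can be collected in a single check that a growing routine would call before each split. The function and parameter names are hypothetical; the defaults are the example values quoted above.

```python
def should_stop(n_splits, region_size, error_improvement, root_error,
                max_splits=30, min_region_size=5, min_rel_improvement=0.01):
    """Return True if any of the three stopping criteria applies.

    n_splits:          number of data splits made so far
    region_size:       number of observations in the region to be split
    error_improvement: reduction in error rate offered by the best split
    root_error:        error rate without any data splitting
    """
    if n_splits > max_splits:
        return True
    if region_size < min_region_size:
        return True
    if error_improvement < min_rel_improvement * root_error:
        return True
    return False

# With a root error rate of 30%, the improvement cutoff is 0.3 percentage points.
keep_growing = not should_stop(n_splits=10, region_size=50,
                               error_improvement=0.05, root_error=0.30)
```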

6 Visualizing CART

CART is very popular because the data splits that constitute the rule can be visualized as a decision tree. The length of each branch in the tree reflects how much that decision improves the classification error rate.

[Figure with two panels: (a) the data with the splits drawn; (b) the corresponding decision tree, with a first split on x1 < 3.52 and a second split on x2.]

(a): the data splits illustrated. (b): the first data split reduces the error rate from approx. 30% to 20%; the second data split reduces it to 0% and is therefore shown as a longer branch.

7 Splitting criterion

As mentioned above, you split the data into two parts with the goal of making each part as pure as possible in terms of class labels. Purity can be measured in different ways. One common criterion is the error rate (the number of mistakes you make). A more popular criterion is the so-called Gini index, which is geared at minimizing the variance in each region:

$$GI_m = \sum_{c=1}^{C} \hat{p}(y = c \mid x \in R_m)\, \hat{p}(y \neq c \mid x \in R_m),$$

where $\hat{p}(y \neq c \mid x \in R_m) = 1 - \hat{p}(y = c \mid x \in R_m)$, and $p(1 - p)$ is the variance of a binomial random variable with a single trial. The more mixed region $R_m$ is, the larger the Gini index. You split on feature $j$ at threshold $T_j$ to minimize the Gini index. In general, GI splitting is very aggressive in trying to form pure (single-class) regions quickly.
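A minimal sketch of the Gini index for one region, assuming the class proportions are estimated by simple relative frequencies of the labels in that region (the function name is hypothetical):

```python
from collections import Counter

def gini_index(labels):
    """Sum over classes of p * (1 - p), with p the class proportion in the region."""
    n = len(labels)
    proportions = [count / n for count in Counter(labels).values()]
    return sum(p * (1 - p) for p in proportions)

pure = gini_index([1, 1, 1, 1])      # 0.0: a single-class region
skewed = gini_index([0, 0, 0, 1])    # 0.375: 0.75 * 0.25 + 0.25 * 0.75
balanced = gini_index([0, 0, 1, 1])  # 0.5: the most mixed two-class region
```

The three values illustrate that the index grows as the region becomes more mixed and vanishes for a pure region.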

8 Validation of CART

As with other models, it is usually better not to let the rule be too flexible (too many splits of the data). A large tree with many splits corresponds to a local rule that is highly data adaptive and runs the risk of adapting to noise, i.e. variations in the data that are not reproducible on future data. Short branches toward the bottom of the tree are often an indication that you have over-trained your rule. To construct a more robust rule we prune the tree by cutting short branches (i.e. removing data splits and merging regions).

Does this sound like you are undoing work done in training? Not really: you may need to perform data splits with short branches to get to the ones that pay off (long branches). It is only after building the max tree that you can identify unnecessary data splits.

9 Validation of CART

We have discussed cross-validation. This is a more complex operation for CART: each validation data set can result in a very different max tree, with different features used. How do we compare the different trees from the different validation data sets? We need a measure that identifies which pruned max trees to compare across validation data sets. This is achieved via a complexity cost function and a pruning parameter that controls how much you prune.

10 Validation of CART

Consider a max tree. Now consider pruning this tree by cutting a branch at the bottom of the tree. There are many such bottom branches to consider, each resulting in a particular error rate. We define the cost function of a particular tree $T$ with $|T|$ regions as

$$C_\alpha(T) = \sum_{m=1}^{|T|} \sum_{i \in R_m} 1\{y_i \neq \hat{c}(x_i)\} + \alpha |T|.$$

For $\alpha = 0$, the max tree clearly minimizes this cost function (it has the smallest error rate). For each $\alpha > 0$ there is a unique pruned tree that minimizes the cost function. The point of this: $\alpha$ is how we match different trees to each other in order to compare across validation data sets.
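To see how $\alpha$ trades off error against tree size, the sketch below evaluates the cost function on a made-up sequence of pruned trees. The (errors, regions) pairs are purely hypothetical numbers chosen for illustration, and the function names are not from the slides.

```python
def cost_complexity(n_errors, n_regions, alpha):
    """Number of misclassified training points plus alpha times the number of regions."""
    return n_errors + alpha * n_regions

def best_tree(trees, alpha):
    """Pick the (n_errors, n_regions) pair with the smallest cost for this alpha."""
    return min(trees, key=lambda t: cost_complexity(t[0], t[1], alpha))

# Hypothetical pruning sequence, from the max tree (8 regions) down to the root.
trees = [(0, 8), (2, 5), (5, 3), (12, 1)]

chosen = {alpha: best_tree(trees, alpha) for alpha in (0, 1, 2, 5)}
# alpha = 0 keeps the max tree; larger alpha values select smaller trees.
```

For these numbers, $\alpha = 1$ gives costs 8, 7, 8 and 13, so the 5-region tree wins; at $\alpha = 5$ the single-region tree has the smallest cost.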

11 Validation of CART

1. Split the data into $B$ parts, or folds, for cross-validation.
2. For each fold $b$, build a max tree on the remaining data, holding out the $b$-th data part for testing.
3. Prune the shortest branches sequentially to minimize the cost-complexity function
   $$C_\alpha(T) = \sum_{m=1}^{|T|} \sum_{i \in R_m} 1\{y_i \neq \hat{c}(x_i)\} + \alpha |T|.$$
   This generates a sequence of trees $\{T^b_\alpha, \alpha = 0, \dots\}$.
4. For each $\alpha$, apply $T^b_\alpha$ to predict the data in the $b$-th test set, resulting in the error rate $TE^b_\alpha$.
5. For each $\alpha$, compute the average error rate $TE_\alpha = \frac{1}{B} \sum_{b=1}^{B} TE^b_\alpha$.
6. Identify the $\alpha$ that minimizes the average error rate: $\alpha^* = \arg\min_\alpha TE_\alpha$.

12 Validation of CART

7. Build the max tree on all the original data.
8. Prune this max tree using the sequence $\alpha = 0, \dots$ used on the CV data sets, resulting in a sequence of trees $T_\alpha$.
9. Final rule: $T_{\alpha^*}$, that is, the max tree on all the data pruned to minimize the cost-complexity function with $\alpha = \alpha^*$, the value that minimizes the average cross-validation error rate.

So, CV is not used to identify a particular type of tree in terms of size or features. CV is used to identify the $\alpha$ that tells you how to optimally balance tree size and error rate, and this balance is then applied to the tree built on all the data.
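The averaging and selection steps of the procedure reduce to a small computation once the per-fold test error rates are available. The sketch below assumes they have already been collected in a table; the error rates and the grid of $\alpha$ values are invented for illustration, and `select_alpha` is a hypothetical name.

```python
def select_alpha(alphas, fold_errors):
    """Average test error over folds for each alpha and return the minimizer.

    fold_errors[b][k] is the test error rate of the tree built without fold b,
    pruned at alphas[k], and evaluated on fold b.
    """
    n_folds = len(fold_errors)
    avg = [sum(fold[k] for fold in fold_errors) / n_folds
           for k in range(len(alphas))]
    k_star = min(range(len(alphas)), key=lambda k: avg[k])
    return alphas[k_star], avg

# Hypothetical results for B = 3 folds and four values of alpha.
alphas = [0.0, 0.5, 1.0, 2.0]
fold_errors = [
    [0.30, 0.20, 0.25, 0.40],
    [0.20, 0.10, 0.15, 0.30],
    [0.25, 0.15, 0.20, 0.35],
]
alpha_star, avg_errors = select_alpha(alphas, fold_errors)
```

Note that the trees behind each column may differ completely between folds; only the value of $\alpha$ is shared, which is what makes the average meaningful.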

13 Cautionary remarks

- Trees are notoriously unstable, meaning that small changes to the data can substantially change the appearance of the tree (its size and the features used). Be careful not to read too much into the tree.
- Trees are inappropriate if the true decision boundaries, or the relationships between the outcome and the predictors, are well approximated by linear combinations of the x-features. In that case discriminant analysis, logistic/linear regression, or an extension of CART called MARS is a better option. You can spot this problem fairly easily: if your tree splits on the same feature again and again, this is a tell-tale sign.
- Check the error rates of each class. Are we obtaining a low error rate because we are simply mislabeling one class completely?

14 Good things about trees

- CART can easily be adapted to the case of missing values, either by including missingness as another feature that you can split on or by using substitute variables.
- Even though CART is unstable, you can exploit this instability by using multiple trees as an ensemble to improve on the error rate of each single tree. Build not one rule but hundreds from random subsets of the data, and use a majority vote based on all the trees. This is called bagging and is an example of an ensemble method (see MSA220, Big Data).
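A minimal sketch of bagging, under several simplifying assumptions that are not in the slides: each "tree" is a one-split stump on a single numeric feature, the random subsets are bootstrap samples (drawn with replacement, the usual choice in bagging), and all function names are hypothetical.

```python
import random
from collections import Counter

def majority(labels, default):
    """Most frequent label, or default if the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def fit_stump(xs, ys):
    """Fit a one-split tree on a single feature and return a predict function."""
    overall = majority(ys, None)
    best = (len(ys) + 1, None, overall, overall)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        cl, cr = majority(left, overall), majority(right, overall)
        errors = sum(y != cl for y in left) + sum(y != cr for y in right)
        if errors < best[0]:
            best = (errors, t, cl, cr)
    _, t, cl, cr = best
    if t is None:  # all x values identical in this sample: constant rule
        return lambda x: overall
    return lambda x: cl if x <= t else cr

def bagged_predict(xs, ys, x_new, n_trees=100, seed=0):
    """Majority vote over stumps fitted to bootstrap samples of the data."""
    rng = random.Random(seed)
    n = len(xs)
    votes = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        votes.append(stump(x_new))
    return Counter(votes).most_common(1)[0][0]

xs = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
ys = ["a", "a", "a", "a", "b", "b", "b", "b"]
low_vote = bagged_predict(xs, ys, 1.5)
high_vote = bagged_predict(xs, ys, 8.5)
```

Individual stumps place their threshold differently from one bootstrap sample to the next, which is the instability mentioned above; the vote averages that variation out.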
