
Knowledge Discovery and Data Mining
Lecture 10 - Classification trees
Tom Kelsey
School of Computer Science, University of St Andrews
http://tom.home.cs.st-andrews.ac.uk
twk@st-andrews.ac.uk
ID5059-10-CT, 23 Feb 2015

Need-to-knows
1. how the algorithms progressively produce partitions
2. how the algorithms terminate tree growth
3. what resubstitution and generalisation errors are
4. what is meant by training data and validation
5. what role cross-validation and OOB sampling play in tree construction
6. (relatedly) what tree pruning is, and how it is performed
7. some basic limitations of regression and classification trees

Trees
Regression trees
1. There are no criteria for predictive accuracy
2. Select splits that reduce group heterogeneity via standard deviation
3. Determine when to stop splitting
4. Select the "right-sized" tree
Classification trees
1. Specify the criteria for predictive accuracy
2. Select splits that reduce group heterogeneity via information
3. Determine when to stop splitting
4. Select the "right-sized" tree
Each stage is "solved" heuristically: either there is no exact answer, or the exact answer is too hard to compute.

This lecture
Classification trees:
1. Grow/learn/build using entropy
2. Prune/validate using tree complexity, e.g. misclassification error
Same recipe as for regression trees. Same problems with inconsistent terminology within the discipline: CART is often used as the umbrella acronym for all such trees, but it is really just one successful & popular approach.

Classification trees
Teknomo, Kardi (2009). Tutorial on Decision Trees. http://people.revoledu.com/kardi/tutorial/decisiontree/

Example: split on Travel Cost
Teknomo, Kardi (2009). Tutorial on Decision Trees. http://people.revoledu.com/kardi/tutorial/decisiontree/

Information Gain
We want to compare the impurity of a table before and after a split (the Information Gain). It is based on the purity of the parent table and the purity of the sub-tables.
Recall the impurity change, which shows the change in impurity resulting from splitting a node t into a left node t_L and a right node t_R (or more):
Δi(t) = i(t) − p_L · i(t_L) − p_R · i(t_R)
where p_L and p_R are the proportions of the data in t that go to each split.
Our example, using entropy −Σ_i p_i log_2 p_i:
Δi(t) = 1.571 − (5/10 × 0.722 + 2/10 × 0 + 3/10 × 0) = 1.210
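As a concrete check of the arithmetic above, here is a minimal Python sketch that computes the entropy of a table from its class counts and the resulting impurity change for a split. The counts are assumed to match the transportation-mode table from the Teknomo tutorial referenced in these slides (10 rows, classes Bus/Car/Train, split on Travel cost into Cheap/Standard/Expensive); they are an illustrative assumption, not reproduced from the slides themselves.

```python
from math import log2

def entropy(counts):
    """i(t) = -sum_i p_i log2 p_i, from the per-class counts of a table."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def impurity_change(parent_counts, child_counts):
    """Delta i(t): parent entropy minus the size-weighted entropies of the sub-tables."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted

# Assumed class counts (Bus, Car, Train) for the parent table and the Travel cost split
parent = [4, 3, 3]                                   # entropy ~ 1.571
children = [[4, 0, 1], [0, 0, 3], [0, 2, 0]]         # Cheap, Standard, Expensive
print(round(entropy(parent), 3))                     # 1.571
print(round(impurity_change(parent, children), 3))   # 1.210
```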

Example: all splits
Teknomo, Kardi (2009). Tutorial on Decision Trees. http://people.revoledu.com/kardi/tutorial/decisiontree/

Example: all splits
Travel cost gives the highest information gain for this data, so we branch on this attribute first.
There is nothing to do now for two of the subtables (why?). We could iterate again on one subtable; see the online tutorial for details.
Teknomo, Kardi (2009). Tutorial on Decision Trees. http://people.revoledu.com/kardi/tutorial/decisiontree/

Example: first iteration
Teknomo, Kardi (2009). Tutorial on Decision Trees. http://people.revoledu.com/kardi/tutorial/decisiontree/

Example: Final tree
Teknomo, Kardi (2009). Tutorial on Decision Trees. http://people.revoledu.com/kardi/tutorial/decisiontree/

Model Validation
We have seen several ways to grow trees; different choices will, in general, lead to different trees.
The trees are bigger than we think we need: we deliberately overfit so that we can prune back and get a good balance between resubstitution & generalisation errors.
So the two important factors are tree size and tree error.
For all trees, our cost-complexity measure is given by
R_α(T) = R(T) + α · |T|
where R(T) is the resubstitution error rate for tree T, and |T| is the size/complexity of the tree (the number of terminal nodes).
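To make the trade-off concrete, here is a minimal sketch of the cost-complexity measure; the candidate subtrees and their resubstitution errors are hypothetical numbers chosen for illustration, not taken from the lecture.

```python
def cost_complexity(resub_error, n_leaves, alpha):
    """R_alpha(T) = R(T) + alpha * |T|, with |T| the number of terminal nodes."""
    return resub_error + alpha * n_leaves

# Hypothetical candidate subtrees: (resubstitution error, number of terminal nodes)
subtrees = {"full": (0.02, 40), "medium": (0.08, 12), "stump": (0.30, 2)}

for alpha in (0.0, 0.005, 0.05):
    best = min(subtrees, key=lambda name: cost_complexity(*subtrees[name], alpha))
    print(f"alpha = {alpha}: minimising subtree is '{best}'")
    # alpha = 0 picks the full (overfitted) tree; larger alpha favours smaller trees
```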

Setting tree complexity
If we were to choose a tree T to minimise R_α(T), then we can see that α balances the emphasis in terms of:
fidelity to the data, as measured by R(T): if α = 0, no constraint is imposed on complexity and our model would attempt to predict the training sample exactly;
model complexity, as measured by |T|: if α were very large, then the complexity term would dominate our minimisation and the simplest model would be favoured - no splits at all.

Setting tree complexity
For every value of α there will be a smallest subtree whose pruning provides the greatest reduction in R_α(T) (the smallest minimising subtree).
There is a set of values of α at which the node defining the smallest minimising subtree changes.
α can be determined from either:
a validation dataset
a k-fold cross-validation process
OOB sampling

Generalisation Errors
In order to evaluate choices for α, we need to define our generalisation error. Common choices are:
Misclassification error: (false negatives + false positives) / (all negatives + all positives)
Positive Predictive Value: (true positives) / (true positives + false positives)
log loss: −(1/n) Σ_{i=1}^{n} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ], where n is the number of observations, log is the natural logarithm, ŷ_i is the posterior probability that the i-th prediction is correct, and y_i is the ground truth; see http://www.kaggle.com/c/pf2012-diabetes/details/evaluatio
Others exist, and are derived by taking into account what type of prediction is important for a given dataset and environment.
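A minimal sketch of these three measures in Python, assuming binary labels in {0, 1} and predicted posterior probabilities; clipping the probabilities away from 0 and 1 is an added safeguard, not something the slide specifies.

```python
import math

def misclassification_error(y_true, y_label):
    """(false negatives + false positives) / (all negatives + all positives)."""
    return sum(t != p for t, p in zip(y_true, y_label)) / len(y_true)

def positive_predictive_value(y_true, y_label):
    """True positives / (true positives + false positives)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_label))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_label))
    return tp / (tp + fp)

def log_loss(y_true, y_prob, eps=1e-15):
    """-(1/n) sum_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ], natural log."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Hypothetical ground truth and posterior probabilities
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1]
y_label = [int(p >= 0.5) for p in y_prob]
print(misclassification_error(y_true, y_label),    # 0.2
      positive_predictive_value(y_true, y_label),  # 1.0
      round(log_loss(y_true, y_prob), 3))          # 0.372
```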

Tree Pruning for Hold-back Data
If data is plentiful, tree pruning is conceptually easy:
We have the resubstitution errors computed at each stage of the tree derivation.
We can work out the generalisation error for each tree pruned back by turning two leaves into one.
Plot both, and choose the optimal pruned tree.
As we did for generalised regression models.

Tree Pruning for Hold-back Data

Tree Pruning for Hold-back Data
In practice, there is more than one choice for where to prune.
In the example, there are many trees of size 10 obtainable from the original tree of size 70, so we have to systematically work through these.
In software packages, heuristics are used to reduce the amount of work needed.
In return, we lose any guarantee that the optimal final tree is chosen.
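As an illustration of this procedure, the following sketch prunes a tree back stage by stage and compares resubstitution and hold-back errors. It uses scikit-learn's minimal cost-complexity pruning path as a convenient way to enumerate pruned trees; the dataset, the train/validation split and scikit-learn itself are assumptions for the example, not part of the lecture.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Each alpha on the pruning path corresponds to the full tree pruned back a little further
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    resub_err = 1 - tree.score(X_train, y_train)   # resubstitution error
    gen_err = 1 - tree.score(X_valid, y_valid)     # generalisation (hold-back) error
    print(f"alpha={alpha:.5f}  leaves={tree.get_n_leaves():3d}  "
          f"resub={resub_err:.3f}  hold-back={gen_err:.3f}")
# Plot both error curves against tree size and choose the pruned tree with the best hold-back error.
```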

Cross-validation for α
If data is scarce, we can k-fold cross-validate as before.
k = 10...20 seems to work well.
A different choice of k will give a different optimal α, in general.
Grow k trees, and test with the k hold-back sets.
Assess by average performance.

Cross-validation for α
Two methods for choosing an optimal pruned tree:
1. Minimise the cross-validation relative error, i.e. do the same as for the training/test approach, but averaged out over the k models.
2. Use the 1-SE rule: instead of selecting the best α, select the α with the minimum number of leaves which performs no worse than one standard error above the best α.
In CART (and many packages such as rpart), 1-SE is preferred: the 1-SE line is plotted on charts, and we select the first α value that has relative generalisation error below that line.
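The slide describes the rpart convention; a rough equivalent in Python with scikit-learn (an assumption for illustration, since rpart is an R package) is to cross-validate each α on the pruning path, find the best one, and then apply a 1-SE-style rule to pick the simplest tree within one standard error of it. The dataset and k = 10 are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas

means, sems = [], []
for a in alphas:
    scores = cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=10)
    errors = 1 - scores                                  # per-fold generalisation error
    means.append(errors.mean())
    sems.append(errors.std(ddof=1) / np.sqrt(len(errors)))

means, sems = np.array(means), np.array(sems)
best = means.argmin()                                    # method 1: minimise the CV error
threshold = means[best] + sems[best]                     # 1-SE rule: within one SE of the best
one_se = max(i for i in range(len(alphas)) if means[i] <= threshold)  # largest alpha = fewest leaves
print(f"best alpha = {alphas[best]:.5f}, 1-SE alpha = {alphas[one_se]:.5f}")
```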

OOB and Bootstrapping for α
As for generalised linear models, we could do Out-Of-Bag sampling from the dataset:
Train on that sample, test on those data not selected.
Plus many variations on the same theme.
We're going to cover these techniques in detail later in the course.

Tree Disadvantages
Approximation of linear/smooth relationships takes a large number of splits/parameters, where a simple continuous function would suffice (with a relatively low number of parameters).
Covariates are not combined in any way.
Trees can be deceptive: a variable not included could have been masked by another.
Tree structures may be unstable: a change in the sample may give a different tree.
A tree is optimal at each split; it may not be globally optimal.
Truly optimal trees either don't exist, are too expensive to compute, or it is too expensive to find out that they don't exist.

Tree Disadvantages
I supply a dataset to 55 different data-miners. Each will, in general, report a different optimal tree if they make any different choices of:
1. Numeric discretisation: at all, supervised, unsupervised, ...
2. Measure of impurity: SD, Gini index, entropy, ...
3. Condensed output: mean, probability, class, linear model, ...
4. Resubstitution error: squared, absolute, (differential) misclassification, inverse proportional censored weighting, ...
5. Generalisation error: misclassification, log loss, PPV, ...
6. Validation method: hold-back test, k-fold cross-validation, OOB, ...

Tree Advantages
Can cope with any type of data.
Packages work straight out of the box.
Simple to interpret, simple to explain.
Use conditional information effectively.
Robust with respect to outliers.
Provide useful estimates of the generalisation error (usually misclassification rate).
"Localise" the data in a similar way to basis functions.
Covariates near the root are the most important predictors.

Summary
Classification trees are built and pruned in the same way as regression trees.
Information gain replaces SDR (and its variants).
For both theoretical and practical reasons, binary trees and entropy are preferred, but you don't have to use either.
Classification and regression trees are the weapon of choice in many data mining situations, although they should be avoided in areas such as good poker hands, or rare diseases in the general population, since rare occurrences get pruned out.

Decision Tree - Worked Example
Data: given name and gender.
x_1 - name begins with a vowel: Y or N
x_2 - number of vowels in name: 3 classes: One, Two & HigherThanTwo
x_3 - length of name: 3 classes: Short, Medium & Long
y - Male or Female
At each node:
calculate the information of the current table, i(t)
for each of the x_i, calculate the information gain for that attribute
branch on the attribute with the most information gain
Repeat for the nodes created. Stop when pure enough.
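A minimal sketch of the loop just described, in Python; the four-row name/gender table, the attribute encodings and the "pure enough" threshold are made up for illustration and are not the data used in the lecture.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """i(t): entropy of the class labels at a node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain from splitting the current table on one attribute."""
    n = len(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    return entropy(labels) - sum(len(ys) / n * entropy(ys) for ys in by_value.values())

def build(rows, labels, attrs, purity=0.9):
    """Branch on the highest-gain attribute; stop when 'pure enough' or out of attributes."""
    majority, count = Counter(labels).most_common(1)[0]
    if count / len(labels) >= purity or not attrs:
        return majority                                  # leaf node
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    node = {"split": best, "children": {}}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build([rows[i] for i in idx], [labels[i] for i in idx],
                                        [a for a in attrs if a != best], purity)
    return node

# Hypothetical table: x1 begins-with-vowel, x2 vowel count, x3 name length; y is the gender
rows = [{"x1": "Y", "x2": "Two", "x3": "Short"}, {"x1": "N", "x2": "One", "x3": "Short"},
        {"x1": "Y", "x2": "HigherThanTwo", "x3": "Long"}, {"x1": "N", "x2": "Two", "x3": "Medium"}]
labels = ["F", "M", "F", "M"]
print(build(rows, labels, ["x1", "x2", "x3"]))
```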