
Knowledge Discovery and Data Mining
Lecture 10 - Classification trees
Tom Kelsey
School of Computer Science, University of St Andrews
http://tom.home.cs.st-andrews.ac.uk
twk@st-andrews.ac.uk
ID5059-10-CT, 23 Feb 2015

Need-to-knows
1. how the algorithms progressively produce partitions
2. how the algorithms terminate tree growth
3. what resubstitution and generalisation errors are
4. what is meant by training data and validation
5. what role cross-validation and OOB sampling play in tree construction
6. (relatedly) what tree pruning is, and how it is performed
7. some basic limitations of regression and classification trees

Trees
Regression trees
1. There are no criteria for predictive accuracy
2. Select splits that reduce group heterogeneity via standard deviation
3. Determine when to stop splitting
4. Select the "right-sized" tree
Classification trees
1. Specify the criteria for predictive accuracy
2. Select splits that reduce group heterogeneity via information
3. Determine when to stop splitting
4. Select the "right-sized" tree
Each stage is "solved" heuristically: either there is no exact answer, or the exact answer is too hard to compute.

This lecture
Classification trees:
1. Grow/learn/build using entropy
2. Prune/validate using tree complexity, e.g. misclassification error
Same recipe as for regression trees. Same problems with inconsistent terminology within the discipline: CART is often used as the umbrella acronym for all such trees, but it is really just one successful & popular approach.

Classification trees
Teknomo, Kardi (2009). Tutorial on Decision Trees. http://people.revoledu.com/kardi/tutorial/decisiontree/

Example: split on Travel Cost
Teknomo, Kardi (2009). Tutorial on Decision Trees. http://people.revoledu.com/kardi/tutorial/decisiontree/

Information Gain
We want to compare the impurity of a table before and after a split (the Information Gain). It is based on the purity of the parent table and the purity of the sub-tables.
Recall the impurity change, which shows the change in impurity resulting from splitting a node t into a left node t_L and a right node t_R (or more):
Δi(t) = i(t) − p_L · i(t_L) − p_R · i(t_R)
where p_L and p_R are the proportions of the data in t that go to each split.
Our example, using entropy −Σ_i p_i log_2 p_i:
Δi(t) = 1.571 − (5/10 × 0.722 + 2/10 × 0 + 3/10 × 0) = 1.210
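As a concrete check of the arithmetic above, here is a minimal Python sketch that computes the entropy of a table from its class counts and the resulting impurity change for a split. The counts are assumed to match the transportation-mode table from the Teknomo tutorial referenced in these slides (10 rows, classes Bus/Car/Train, split on Travel cost into Cheap/Standard/Expensive); they are an illustrative assumption, not reproduced from the slides themselves.

```python
from math import log2

def entropy(counts):
    """i(t) = -sum_i p_i log2 p_i, from the per-class counts of a table."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def impurity_change(parent_counts, child_counts):
    """Delta i(t): parent entropy minus the size-weighted entropies of the sub-tables."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted

# Assumed class counts (Bus, Car, Train) for the parent table and the Travel cost split
parent = [4, 3, 3]                                   # entropy ~ 1.571
children = [[4, 0, 1], [0, 0, 3], [0, 2, 0]]         # Cheap, Standard, Expensive
print(round(entropy(parent), 3))                     # 1.571
print(round(impurity_change(parent, children), 3))   # 1.210
```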

Example: all splits
Teknomo, Kardi (2009). Tutorial on Decision Trees. http://people.revoledu.com/kardi/tutorial/decisiontree/

Example: all splits
Travel cost gives the highest information gain for this data, so we branch on this attribute first.
There is nothing to do now for two of the subtables (why?). We could iterate again on one subtable; see the online tutorial for details.
Teknomo, Kardi (2009). Tutorial on Decision Trees. http://people.revoledu.com/kardi/tutorial/decisiontree/

Example: first iteration
Teknomo, Kardi (2009). Tutorial on Decision Trees. http://people.revoledu.com/kardi/tutorial/decisiontree/

Example: Final tree
Teknomo, Kardi (2009). Tutorial on Decision Trees. http://people.revoledu.com/kardi/tutorial/decisiontree/

Model Validation
We have seen several ways to grow trees; different choices will, in general, lead to different trees.
The trees are bigger than we think we need: we deliberately overfit so that we can prune back and get a good balance between resubstitution & generalisation errors.
So the two important factors are tree size and tree error.
For all trees, our cost-complexity measure is given by
R_α(T) = R(T) + α · |T|
where R(T) is the resubstitution error rate for tree T, and |T| is the size/complexity of the tree (the number of terminal nodes).
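To make the trade-off concrete, here is a minimal sketch of the cost-complexity measure; the candidate subtrees and their resubstitution errors are hypothetical numbers chosen for illustration, not taken from the lecture.

```python
def cost_complexity(resub_error, n_leaves, alpha):
    """R_alpha(T) = R(T) + alpha * |T|, with |T| the number of terminal nodes."""
    return resub_error + alpha * n_leaves

# Hypothetical candidate subtrees: (resubstitution error, number of terminal nodes)
subtrees = {"full": (0.02, 40), "medium": (0.08, 12), "stump": (0.30, 2)}

for alpha in (0.0, 0.005, 0.05):
    best = min(subtrees, key=lambda name: cost_complexity(*subtrees[name], alpha))
    print(f"alpha = {alpha}: minimising subtree is '{best}'")
    # alpha = 0 picks the full (overfitted) tree; larger alpha favours smaller trees
```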

Setting tree complexity
If we were to choose a tree T to minimise R_α(T), then we can see that α balances the emphasis in terms of:
fidelity to the data, as measured by R(T): if α = 0, no constraint is imposed on complexity and our model would attempt to predict the training sample exactly;
model complexity, as measured by |T|: if α were very large, then the complexity term would dominate our minimisation and the simplest model would be favoured - no splits at all.

Setting tree complexity
For every value of α there will be a smallest subtree whose pruning provides the greatest reduction in R_α(T) (the smallest minimising subtree).
There is a set of values of α at which the node defining the smallest minimising subtree changes.
α can be determined from either:
a validation dataset
a k-fold cross-validation process
OOB sampling

Generalisation Errors
In order to evaluate choices for α, we need to define our generalisation error. Common choices are:
Misclassification error: (false negatives + false positives) / (all negatives + all positives)
Positive Predictive Value: (true positives) / (true positives + false positives)
log loss: −(1/n) Σ_{i=1}^{n} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ], where n is the number of observations, log is the natural logarithm, ŷ_i is the posterior probability that the i-th prediction is correct, and y_i is the ground truth; see http://www.kaggle.com/c/pf2012-diabetes/details/evaluatio
Others exist, and are derived by taking into account what type of prediction is important for a given dataset and environment.
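A minimal sketch of these three measures in Python, assuming binary labels in {0, 1} and predicted posterior probabilities; clipping the probabilities away from 0 and 1 is an added safeguard, not something the slide specifies.

```python
import math

def misclassification_error(y_true, y_label):
    """(false negatives + false positives) / (all negatives + all positives)."""
    return sum(t != p for t, p in zip(y_true, y_label)) / len(y_true)

def positive_predictive_value(y_true, y_label):
    """True positives / (true positives + false positives)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_label))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_label))
    return tp / (tp + fp)

def log_loss(y_true, y_prob, eps=1e-15):
    """-(1/n) sum_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ], natural log."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Hypothetical ground truth and posterior probabilities
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1]
y_label = [int(p >= 0.5) for p in y_prob]
print(misclassification_error(y_true, y_label),    # 0.2
      positive_predictive_value(y_true, y_label),  # 1.0
      round(log_loss(y_true, y_prob), 3))          # 0.372
```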

Tree Pruning for Hold-back Data
If data is plentiful, tree pruning is conceptually easy:
We have the resubstitution errors computed at each stage of the tree derivation.
We can work out the generalisation error for each tree pruned back by turning two leaves into one.
Plot both, and choose the optimal pruned tree.
As we did for generalised regression models.

Tree Pruning for Hold-back Data

Tree Pruning for Hold-back Data
In practice, there is more than one choice for where to prune.
In the example, there are many trees of size 10 obtainable from the original tree of size 70, so we have to systematically work through these.
In software packages, heuristics are used to reduce the amount of work needed.
In return, we lose any guarantee that the optimal final tree is chosen.
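As an illustration of this procedure, the following sketch prunes a tree back stage by stage and compares resubstitution and hold-back errors. It uses scikit-learn's minimal cost-complexity pruning path as a convenient way to enumerate pruned trees; the dataset, the train/validation split and scikit-learn itself are assumptions for the example, not part of the lecture.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Each alpha on the pruning path corresponds to the full tree pruned back a little further
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    resub_err = 1 - tree.score(X_train, y_train)   # resubstitution error
    gen_err = 1 - tree.score(X_valid, y_valid)     # generalisation (hold-back) error
    print(f"alpha={alpha:.5f}  leaves={tree.get_n_leaves():3d}  "
          f"resub={resub_err:.3f}  hold-back={gen_err:.3f}")
# Plot both error curves against tree size and choose the pruned tree with the best hold-back error.
```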

Cross-validation for α
If data is scarce, we can k-fold cross-validate as before.
k = 10...20 seems to work well.
A different choice of k will give a different optimal α, in general.
Grow k trees, and test with the k hold-back sets.
Assess by average performance.

Cross-validation for α
Two methods for choosing an optimal pruned tree:
1. Minimise the cross-validation relative error, i.e. do the same as for the training/test approach, but averaged out over the k models.
2. Use the 1-SE rule: instead of selecting the best α, select the α with the minimum number of leaves which performs no worse than one standard error above the best α.
In CART (and many packages such as rpart), 1-SE is preferred: the 1-SE line is plotted on charts, and we select the first α value that has relative generalisation error below that line.
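The slide describes the rpart convention; a rough equivalent in Python with scikit-learn (an assumption for illustration, since rpart is an R package) is to cross-validate each α on the pruning path, find the best one, and then apply a 1-SE-style rule to pick the simplest tree within one standard error of it. The dataset and k = 10 are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas

means, sems = [], []
for a in alphas:
    scores = cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=10)
    errors = 1 - scores                                  # per-fold generalisation error
    means.append(errors.mean())
    sems.append(errors.std(ddof=1) / np.sqrt(len(errors)))

means, sems = np.array(means), np.array(sems)
best = means.argmin()                                    # method 1: minimise the CV error
threshold = means[best] + sems[best]                     # 1-SE rule: within one SE of the best
one_se = max(i for i in range(len(alphas)) if means[i] <= threshold)  # largest alpha = fewest leaves
print(f"best alpha = {alphas[best]:.5f}, 1-SE alpha = {alphas[one_se]:.5f}")
```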

OOB and Bootstrapping for α
As for generalised linear models, we could do Out-Of-Bag sampling from the dataset:
Train on that sample, test on those data not selected.
Plus many variations on the same theme.
We're going to cover these techniques in detail later in the course.

Tree Disadvantages
Approximation of linear/smooth relationships takes a large number of splits/parameters, where a simple continuous function would suffice (with a relatively low number of parameters).
Covariates are not combined in any way.
Trees can be deceptive: a variable not included could have been masked by another.
Tree structures may be unstable: a change in the sample may give a different tree.
A tree is optimal at each split; it may not be globally optimal.
Truly optimal trees either don't exist, are too expensive to compute, or it is too expensive to find out that they don't exist.

Tree Disadvantages
I supply a dataset to 55 different data-miners. Each will, in general, report a different optimal tree if they make any different choices of:
1. Numeric discretisation: at all, supervised, unsupervised, ...
2. Measure of impurity: SD, Gini index, entropy, ...
3. Condensed output: mean, probability, class, linear model, ...
4. Resubstitution error: squared, absolute, (differential) misclassification, inverse proportional censored weighting, ...
5. Generalisation error: misclassification, log loss, PPV, ...
6. Validation method: hold-back test, k-fold cross-validation, OOB, ...

Tree Advantages
Can cope with any type of data.
Packages work straight out of the box.
Simple to interpret, simple to explain.
Use conditional information effectively.
Robust with respect to outliers.
Provide useful estimates of the generalisation error (usually misclassification rate).
"Localise" the data in a similar way to basis functions.
Covariates near the root are the most important predictors.

Summary
Classification trees are built and pruned in the same way as regression trees.
Information gain replaces SDR (and its variants).
For both theoretical and practical reasons, binary trees and entropy are preferred, but you don't have to use either.
Classification and regression trees are the weapon of choice in many data mining situations, although they should be avoided in areas such as good poker hands, or rare diseases in the general population, since rare occurrences get pruned out.

Decision Tree - Worked Example
Data: given name and gender.
x_1 - name begins with a vowel: Y or N
x_2 - number of vowels in name: 3 classes: One, Two & HigherThanTwo
x_3 - length of name: 3 classes: Short, Medium & Long
y - Male or Female
At each node:
calculate the information of the current table, i(t)
for each of the x_i, calculate the information gain for that attribute
branch on the attribute with the most information gain
Repeat for the nodes created. Stop when pure enough.
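A minimal sketch of the loop just described, in Python; the four-row name/gender table, the attribute encodings and the "pure enough" threshold are made up for illustration and are not the data used in the lecture.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """i(t): entropy of the class labels at a node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain from splitting the current table on one attribute."""
    n = len(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    return entropy(labels) - sum(len(ys) / n * entropy(ys) for ys in by_value.values())

def build(rows, labels, attrs, purity=0.9):
    """Branch on the highest-gain attribute; stop when 'pure enough' or out of attributes."""
    majority, count = Counter(labels).most_common(1)[0]
    if count / len(labels) >= purity or not attrs:
        return majority                                  # leaf node
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    node = {"split": best, "children": {}}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build([rows[i] for i in idx], [labels[i] for i in idx],
                                        [a for a in attrs if a != best], purity)
    return node

# Hypothetical table: x1 begins-with-vowel, x2 vowel count, x3 name length; y is the gender
rows = [{"x1": "Y", "x2": "Two", "x3": "Short"}, {"x1": "N", "x2": "One", "x3": "Short"},
        {"x1": "Y", "x2": "HigherThanTwo", "x3": "Long"}, {"x1": "N", "x2": "Two", "x3": "Medium"}]
labels = ["F", "M", "F", "M"]
print(build(rows, labels, ["x1", "x2", "x3"]))
```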