Data Mining Lecture 8: Decision Trees

Data Mining Lecture 8: Decision Trees Jo Houghton ECS Southampton March 8, 2019 1 / 30

Decision Trees - Introduction A decision tree is like a flow chart. E.g. "I need to buy a new car": Can I afford it? Is the car safe? How many doors does it have? What colour is it? Each answer either rules the car out (Nope) or leads to the next question, and the final outcomes range from Nope through OK and Good up to Excellent. 2 / 30

Decision Trees - Introduction Decision trees can be hand-crafted by experts, or built automatically using machine learning techniques. They are interpretable: it is easy to see how they made a certain decision. They are used in a wide range of contexts, for example medicine, financial analysis and astronomy. Especially in medicine, the explicit reasoning in decision trees means experts can understand why the algorithm has made its decision. 3 / 30

Decision Trees - Introduction Each node is a test, and each branch is an outcome of that test. Decision trees can be used for tabular data with a range of data types, e.g. numerical and categorical. 4 / 30

Decision Trees - Tree Growing Algorithms There are a good number of algorithms for building decision trees: CART (Classification And Regression Trees), ID3 (Iterative Dichotomiser 3), C4.5 (an improved ID3), C5.0 (an improved C4.5) and CHAID (Chi-squared Automatic Interaction Detector). 5 / 30

Decision Trees - CART The CART algorithm was published by Breiman et al. in 1984. For each feature, find the best split, i.e. the one that minimises an impurity measure; take the feature whose best split decreases impurity the most; use that split to split the node; then repeat the procedure on each of the leaf nodes. CART uses Gini impurity as its impurity measure. Gini impurity measures how often a randomly chosen element from a set would be incorrectly labelled if it were labelled randomly according to the distribution of labels in the set; the mistake probabilities for each label are summed. 6 / 30

Decision Trees - CART Gini impurity (I_G) sums the probability of a mistake for each label: for label i, the mistake probability is \sum_{k \neq i} p_k = 1 - p_i. For J classes: Gini(p) = \sum_{i=1}^{J} p_i \sum_{k \neq i} p_k = \sum_{i=1}^{J} p_i (1 - p_i). It reaches its minimum (zero) when all cases in the node fall into a single category. 7 / 30
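
A minimal sketch of this calculation in Python (the function name and the use of collections.Counter are my own choices, not from the slides):

    from collections import Counter

    def gini(labels):
        """Gini impurity of a list of class labels: sum_i p_i * (1 - p_i)."""
        n = len(labels)
        return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

    # a node holding two 'Yes' and one 'No' has impurity 2 * (2/3) * (1/3) = 4/9
    print(gini(["Yes", "Yes", "No"]))  # ~0.444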

Decision Trees - CART The improvement in impurity from a candidate split is: \Delta I = Gini(root) - ( (n_L / n) Gini(Left) + (n_R / n) Gini(Right) ), where Gini(root) is the impurity of the node to be split, Gini(Left) and Gini(Right) are the impurities of the left and right branches, and n_L and n_R are the numbers of samples in the left and right branches (n = n_L + n_R). The split giving the maximum improvement is chosen. 8 / 30
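
Continuing the sketch above, the weighted impurity decrease for a proposed left/right partition could be written as (again my own naming):

    def impurity_decrease(parent_labels, left_labels, right_labels):
        """Gini(parent) minus the weighted average of the child impurities."""
        n = len(parent_labels)
        return (gini(parent_labels)
                - (len(left_labels) / n) * gini(left_labels)
                - (len(right_labels) / n) * gini(right_labels))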

Decision Trees - CART Example:

    Car Make  Type    Colour  Price  Mileage  Bought?
    VW        Polo    Grey     2000    82000  Yes
    Ford      Fiesta  Purple   1795    95000  Yes
    Ford      Fiesta  Grey     1990    90000  No
    VW        Golf    Red      1800   120000  Yes
    VW        Polo    Grey      900   150000  No
    Ford      Ka      Yellow   1400   100000  Yes

We can go through and calculate the best split for each feature. 9 / 30

Decision Trees - CART Calculate the root node impurity (4 Yes, 2 No): Gini(root) = (2/3)(1 - 2/3) + (1/3)(1 - 1/3) = 2/9 + 2/9 = 4/9. The impurity decrease (or information gain, if using entropy) for each candidate split is then: \Delta I_G = Gini(root) - ( (n_L / n) Gini(Left) + (n_R / n) Gini(Right) ). 10 / 30

Decision Trees - CART Candidate splits and their impurity decreases (weighted child impurities in brackets):

    Car Make:  VW {Y, Y, N} vs Ford {Y, Y, N}              4/9 - (2/9 + 2/9)  = 0
    Type:      Golf {Y} vs Not Golf {Y, Y, Y, N, N}        4/9 - (0 + 0.400)  = 0.044
    Colour:    Grey {Y, N, N} vs Not Grey {Y, Y, Y}        4/9 - (2/9 + 0)    = 0.222
    Price:     > 1000 {Y, Y, N, Y, Y} vs < 1000 {N}        4/9 - (0.267 + 0)  = 0.178
    Mileage:   > 140000 {N} vs < 140000 {Y, Y, N, Y, Y}    4/9 - (0 + 0.267)  = 0.178

The largest decrease in impurity comes from Grey / Not Grey, which gives the first split. The same process is then repeated on each impure node until all nodes are pure. 11 / 30
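
A small sketch that reproduces this table using the gini and impurity_decrease helpers defined above (the data encoding and the split predicates are my own):

    # toy car data set from the slides: (make, type, colour, price, mileage, bought)
    cars = [
        ("VW",   "Polo",   "Grey",   2000,  82000, "Yes"),
        ("Ford", "Fiesta", "Purple", 1795,  95000, "Yes"),
        ("Ford", "Fiesta", "Grey",   1990,  90000, "No"),
        ("VW",   "Golf",   "Red",    1800, 120000, "Yes"),
        ("VW",   "Polo",   "Grey",    900, 150000, "No"),
        ("Ford", "Ka",     "Yellow", 1400, 100000, "Yes"),
    ]
    labels = [row[5] for row in cars]

    # candidate splits, expressed as predicates on a row
    splits = {
        "Make = VW":        lambda r: r[0] == "VW",
        "Type = Golf":      lambda r: r[1] == "Golf",
        "Colour = Grey":    lambda r: r[2] == "Grey",
        "Price > 1000":     lambda r: r[3] > 1000,
        "Mileage > 140000": lambda r: r[4] > 140000,
    }

    for name, pred in splits.items():
        left = [r[5] for r in cars if pred(r)]
        right = [r[5] for r in cars if not pred(r)]
        print(f"{name:18s} impurity decrease = {impurity_decrease(labels, left, right):.3f}")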

Decision Trees - CART ipynb demo 12 / 30

Decision Trees - CART [Slides 13-16: figure-only slides from the ipynb demo; no text content.]

Decision Trees - ID3 Similar to CART, Iterative Dichotomiser 3 (ID3) minimises entropy instead of Gini impurity. Entropy: H(S) = -\sum_{x \in X} p(x) \log_2 p(x), where S is the data set, X is the set of classes in S, and p(x) is the proportion of elements of S that belong to class x. 17 / 30
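
A minimal entropy sketch in the same style as the Gini helper (naming is mine):

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a list of class labels: -sum_x p(x) * log2(p(x))."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    print(entropy(["Yes", "Yes", "Yes", "Yes", "No", "No"]))  # ~0.918 bits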

Decision Trees - ID3 Information gain is measured for a split along each possible attribute A: I_G(S, A) = H(S) - \sum_{v \in values(A)} (|S_v| / |S|) H(S_v), where S_v is the subset of S for which attribute A has value v. ID3 is very similar to CART, though it doesn't natively support numerical attributes. 18 / 30
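
A corresponding information-gain sketch for one categorical attribute, reusing the entropy helper and the toy car data from the CART example (my own naming):

    def information_gain(rows, labels, attribute_index):
        """H(S) minus the weighted entropy of the subsets induced by one attribute."""
        n = len(labels)
        gain = entropy(labels)
        for v in set(row[attribute_index] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attribute_index] == v]
            gain -= (len(subset) / n) * entropy(subset)
        return gain

    # e.g. the gain from a multiway split of the car data on Colour (column index 2)
    print(information_gain(cars, labels, 2))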

Decision Trees - Pruning Overfitting is a serious problem with decision trees: are four splits really required here? The trees the algorithm creates are too complex. One solution is pruning, which can be done in a variety of ways, including reduced error pruning and entropy-based merging. 19 / 30

Decision Trees - Reduced Error Pruning Grow the decision tree fully, then remove branches that do not contribute to predictive accuracy, measured on a validation set: start at the leaf nodes; look up the branch to the last decision split and replace the split with a leaf node predicting the majority class; if classification accuracy on the validation set is not affected, keep the change. This is a simple and fast algorithm that can simplify overly complex decision trees. 20 / 30
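
A rough sketch of the idea, assuming a tree stored as nested dicts (internal nodes with 'feature', 'threshold', 'left', 'right' and the majority training class recorded as 'majority'; leaves with 'label'). This is illustrative only, not the algorithm as implemented in any particular library:

    def predict(node, x):
        """Follow the tests until a leaf is reached."""
        while "label" not in node:
            node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
        return node["label"]

    def accuracy(tree, val_X, val_y):
        return sum(predict(tree, x) == y for x, y in zip(val_X, val_y)) / len(val_y)

    def reduced_error_prune(node, tree, val_X, val_y):
        """Bottom-up: try replacing each split with a majority-class leaf."""
        if "label" in node:                      # already a leaf
            return
        reduced_error_prune(node["left"], tree, val_X, val_y)
        reduced_error_prune(node["right"], tree, val_X, val_y)
        before = accuracy(tree, val_X, val_y)
        saved = dict(node)                       # remember the subtree
        node.clear()
        node["label"] = saved["majority"]        # tentatively replace with a leaf
        if accuracy(tree, val_X, val_y) < before:
            node.clear()                         # accuracy dropped: undo the prune
            node.update(saved)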

Decision Trees - Entropy Based Pruning Grow a decision tree fully, then: choose a pair of leaf nodes with the same parent; compute the entropy gain from merging them; if it is lower than a threshold, merge the nodes. This doesn't require additional data. 21 / 30

Decision Trees - Missing Data ID3 ignores missing data; CART generally sends a missing value to the branch that has the largest number of samples of the same category. Alternatives: assign a branch specifically to unknown values; assign the instance to the branch containing the most examples with the same target value; or weight each branch using the known attributes and put the instance in the most similar branch. 22 / 30

Decision Trees - Regression CART - Classification and Regression Trees. How do we use decision trees for regression, i.e. to predict numerical values rather than a class? We could use classification, treating each numerical value as a class. Problems? How would you generalise? It loses all notion of ordering or similarity between values. Solution? Use variance as the impurity measure instead of Gini or entropy. 23 / 30

Decision Trees - Regression Maximise variance gain: split on the feature value that gives the maximum reduction in variance. This should make similar numbers group together, i.e. lower numbers on one side, higher on the other. 24 / 30
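
A minimal sketch of the variance-reduction criterion for a regression split (naming mine):

    def variance(values):
        """Population variance of the target values at a node."""
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    def variance_gain(parent_targets, left_targets, right_targets):
        """Variance of the parent minus the weighted variance of the children."""
        n = len(parent_targets)
        return (variance(parent_targets)
                - (len(left_targets) / n) * variance(left_targets)
                - (len(right_targets) / n) * variance(right_targets))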

Decision Trees - In General There are problems with decision trees: finding an optimal tree is NP-complete; they overfit, so they don't generalise well (hence the need to prune); information gain is biased towards features with more categories; and splits are axis-aligned. 25 / 30

Decision Trees - Ensemble Methods Bagging: Bootstrap aggregating. Uniformly sample the initial dataset with replacement to form m subsets. For example, if the data set has 5 samples (s_1, s_2, s_3, s_4, s_5), make a whole bunch of similar data sets: (s_5, s_2, s_2, s_1, s_5), (s_4, s_2, s_1, s_3, s_3), (s_2, s_5, s_3, s_1, s_1), (s_3, s_1, s_2, s_4, s_4). Train a different decision tree on each set. To classify, apply each classifier and choose the class by majority vote; for regression, take the mean of the predicted values. This improves generalisation, as it decreases variance without increasing bias. 26 / 30
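
A minimal bagging sketch using scikit-learn (assuming scikit-learn is available; the synthetic data set is just a stand-in):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 50 trees, each trained on a bootstrap resample of the training set;
    # predictions are combined by majority vote
    bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
    bagged.fit(X_train, y_train)
    print("bagged accuracy:", bagged.score(X_test, y_test))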

Decision Trees - Ensemble Methods Random Forest: apply bagging, but when learning the tree for each subset, choose each split by searching over only a random sample of the features. This reduces overfitting by decorrelating the trees. 27 / 30
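
The same idea via scikit-learn's RandomForestClassifier, reusing the data from the bagging sketch (max_features controls the size of the random feature subset considered at each split):

    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
    forest.fit(X_train, y_train)
    print("random forest accuracy:", forest.score(X_test, y_test))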

Decision Trees - Ensemble Methods Boosting: make a strong learner from a weighted sum of weak learners of a small fixed size n. Very shallow trees can be used; each learner performs only a little better than chance (if n = 1: stumps!). AdaBoost - a classic method (covered in AML). Gradient Boosting - now dominant (also covered in AML). XGBoost - Extreme Gradient Boosting! Gradient boosting uses the residual error from the previous learners as the target value for the next one. 28 / 30
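
A toy sketch of the residual-fitting idea behind gradient boosting for regression with squared error, using depth-1 trees (stumps); a simplification for illustration, not a full gradient boosting implementation:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

    learning_rate = 0.1
    prediction = np.full_like(y, y.mean())     # start from the mean
    stumps = []

    for _ in range(100):
        residual = y - prediction              # what the ensemble still gets wrong
        stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
        stumps.append(stump)
        prediction += learning_rate * stump.predict(X)

    print("training MSE:", np.mean((y - prediction) ** 2))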

Decision Trees - Ensemble Methods Summary Bagging: handles overfitting well; reduces variance; made from many independent classifiers voting. Boosting: can overfit; reduces both bias and variance; made from a linear combination of classifiers. 29 / 30

Decision Trees - Summary Advantages: interpretability; ability to work with both numerical and categorical features; good with mixed tabular data. Disadvantages: might not scale effectively to large numbers of classes; features that interact are problematic. 30 / 30
