Decision Tree Learning
Debapriyo Majumdar
Data Mining, Fall 2014
Indian Statistical Institute, Kolkata
August 25, 2014
Example: Age, Income and Owning a Flat
[Figure: training set plotted with Age (0-70) on the x-axis and Monthly income (0-250 thousand rupees) on the y-axis; points marked "Owns a house" / "Does not own a house", with two separating lines L1 and L2]
If the training data were as above, could we define some simple rules by observation?
Any point above the line L1 → Owns a house
Any point to the right of L2 → Owns a house
Any other point → Does not own a house
Example: Age, Income and Owning a Flat
[Figure: the same training set, partitioned by the splits described below]
In general, the data won't be as clean as above.
Root node: split at Income = 101
  Income ≥ 101: Label = Yes
  Income < 101: split at Age = 54
    Age ≥ 54: Label = Yes
    Age < 54: Label = No
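As a concrete illustration, here is a minimal Python sketch of how this small tree would classify a point; the thresholds 101 and 54 are the example splits above, and the function name is only illustrative.

```python
def owns_house(age, income):
    """Classify one person with the example tree above.
    income is monthly income in thousand rupees, age in years."""
    if income >= 101:        # root split on Income
        return "Yes"
    elif age >= 54:          # second split on Age
        return "Yes"
    else:
        return "No"

print(owns_house(age=30, income=150))  # Yes (high income)
print(owns_house(age=60, income=50))   # Yes (older, lower income)
print(owns_house(age=25, income=40))   # No
```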
Example: Age, Income and Owning a Flat
[Figure: the same training set of Age vs. Monthly income]
Approach: recursively split the data into partitions so that each partition becomes purer, until a stopping condition is met.
How to decide the split?
How to measure purity?
When to stop?
Approach for Splitting
What are the possible lines for splitting?
For each variable, the midpoints between pairs of consecutive values of that variable.
How many? If N = number of points in the training set and m = number of variables, about O(N m) candidate splits.
How to choose which line to use for splitting?
The line which reduces impurity (~ heterogeneity of composition) the most.
How to measure impurity?
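A minimal sketch of how the candidate split points could be enumerated for one numeric variable; the function name and the sample values are only illustrative.

```python
def candidate_splits(values):
    """Midpoints between consecutive distinct sorted values of one variable."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

ages = [22, 25, 30, 41, 54, 60]
print(candidate_splits(ages))   # [23.5, 27.5, 35.5, 47.5, 57.0]
# Doing this for each of the m variables gives roughly (N - 1) * m candidates, i.e. O(N m).
```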
Gini Index for Measuring Impurity
Suppose there are C classes.
Let p(i|t) = fraction of observations belonging to class i in rectangle (node) t.
Gini index:
$\mathrm{Gini}(t) = 1 - \sum_{i=1}^{C} p(i \mid t)^2$
If all observations in t belong to one single class, Gini(t) = 0.
When is Gini(t) maximum?
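A small sketch of the Gini computation, assuming the class labels of a node are given as a plain list; the function name is illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini(t) = 1 - sum_i p(i|t)^2 for the observations in node t."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["Yes"] * 10))               # 0.0  (pure node)
print(gini(["Yes"] * 5 + ["No"] * 5))   # 0.5  (maximum for two equally mixed classes)
```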
Entropy
Average amount of information contained.
From another point of view: average amount of information expected, hence amount of uncertainty.
We will study this in more detail later.
Entropy:
$\mathrm{Entropy}(t) = -\sum_{i=1}^{C} p(i \mid t)\, \log_2 p(i \mid t)$
where $0 \log_2 0$ is defined to be 0.
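A companion sketch for entropy, again with illustrative names; classes absent from the node simply do not appear in the sum, which matches the $0 \log_2 0 := 0$ convention.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(t) = -sum_i p(i|t) * log2 p(i|t) over the classes present in node t."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["Yes"] * 10))               # 0.0  (pure node, no uncertainty)
print(entropy(["Yes"] * 5 + ["No"] * 5))   # 1.0  (maximum for two classes)
```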
Classification Error
What if we stop the tree building at a node?
That is, do not create any further branches for that node; make that node a leaf.
Classify the node with the most frequent class present in the node.
Classification error as a measure of impurity (the rectangle / node is still impure):
$\mathrm{ClassificationError}(t) = 1 - \max_i\, p(i \mid t)$
Intuitively, the fraction of observations in the rectangle (node) that do not belong to the most frequent class.
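A sketch of the classification-error measure, using the same node representation as the two sketches above.

```python
from collections import Counter

def classification_error(labels):
    """ClassificationError(t) = 1 - max_i p(i|t): the fraction misclassified
    if the node is made a leaf labelled with its most frequent class."""
    counts = Counter(labels)
    return 1.0 - max(counts.values()) / len(labels)

node = ["Yes"] * 7 + ["No"] * 3
print(classification_error(node))   # 0.3 -> leaf labelled "Yes", 3 of 10 points misclassified
```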
The Full Blown Tree
Recursive splitting: suppose we don't stop until all nodes are pure.
We get a large decision tree with leaf nodes having very few data points.
Such leaves are statistically not significant and do not represent the classes well: overfitting.
Solution: stop earlier, or prune back the tree.
[Figure: example tree; the root with 1000 points splits into 400 and 600, then into 200, 200, 160 and 240, and so on down to leaves with as few as 2, 1 or 5 points]
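A minimal sketch of growing a tree by recursive splitting until every node is pure, assuming numeric features in a NumPy array and using the Gini index to pick splits; the min_samples parameter and all names are illustrative, and a real implementation would add more stopping rules (depth limits, significance tests, etc.).

```python
import numpy as np

def gini_np(y):
    """Gini impurity 1 - sum_i p(i|t)^2 of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - float(np.sum(p ** 2))

def grow(X, y, min_samples=1):
    """Recursively split (X, y) until each node is pure or has <= min_samples points.
    Returns nested dicts: {'label': c} for a leaf,
    {'var': j, 'threshold': s, 'left': ..., 'right': ...} for a split."""
    classes, counts = np.unique(y, return_counts=True)
    if len(classes) == 1 or len(y) <= min_samples:
        return {"label": classes[np.argmax(counts)]}      # leaf: most frequent class

    best = None                                           # (weighted impurity, var, threshold)
    for j in range(X.shape[1]):
        vals = np.unique(X[:, j])
        for s in (vals[:-1] + vals[1:]) / 2:              # midpoints as candidate splits
            mask = X[:, j] < s
            score = (mask.sum() * gini_np(y[mask]) +
                     (~mask).sum() * gini_np(y[~mask])) / len(y)
            if best is None or score < best[0]:
                best = (score, j, s)

    if best is None:                                      # no usable split left: make a leaf
        return {"label": classes[np.argmax(counts)]}
    _, j, s = best
    mask = X[:, j] < s
    return {"var": j, "threshold": s,
            "left": grow(X[mask], y[mask], min_samples),
            "right": grow(X[~mask], y[~mask], min_samples)}

# Toy usage (made-up data): grow until pure, i.e. the "full blown" tree.
X = np.array([[25, 40], [30, 150], [60, 50], [45, 90]], dtype=float)
y = np.array(["No", "Yes", "Yes", "No"])
print(grow(X, y))
```

Raising min_samples is the "stop earlier" option mentioned above; pruning back an already grown tree is discussed next.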
Prune Back
Pruning step: collapse leaf nodes and make the immediate parent a leaf node.
Effect of pruning: we lose purity of nodes, but were they really pure, or was that just noise?
Too many nodes → fitting the noise.
Trade-off between loss of purity and reduction in complexity.
[Figure: a decision node (Freq = 7) with two leaf children, label = Y (Freq = 5) and label = B (Freq = 2), is pruned into a single leaf with label = Y (Freq = 7)]
Prune Back: Cost Complexity
Cost complexity of a (sub)tree T: the classification error (based on the training data) plus a penalty for the size of the tree:
$\mathrm{tradeoff}(T) = \mathrm{Err}(T) + \alpha\, L(T)$
Err(T) is the classification error, L(T) = number of leaves in T.
The penalty factor α is between 0 and 1; if α = 0, there is no penalty for a bigger tree.
[Figure: the same pruning example as before, the decision node (Freq = 7) with leaves Y (Freq = 5) and B (Freq = 2) collapsed into one leaf Y (Freq = 7)]
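A small numeric sketch of this trade-off using the pruning example above (a node with 5 Y points and 2 B points); the value of α is chosen arbitrarily for illustration.

```python
def cost_complexity(err, n_leaves, alpha):
    """tradeoff(T) = Err(T) + alpha * L(T)."""
    return err + alpha * n_leaves

# Subtree kept: leaves (Y: 5) and (B: 2) classify all 7 training points correctly.
# Pruned: a single leaf labelled Y misclassifies the 2 B points (training error 2/7).
alpha = 0.2
print(cost_complexity(err=0.0,   n_leaves=2, alpha=alpha))   # 0.40  keep the split
print(cost_complexity(err=2 / 7, n_leaves=1, alpha=alpha))   # ~0.49 prune to one leaf
# With this alpha the unpruned subtree wins; a larger alpha would favour pruning.
```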
Different Decision Tree Algorithms
Chi-square Automatic Interaction Detector (CHAID): Gordon Kass (1980). Stops subtree creation if the split is not statistically significant by a chi-square test.
Classification and Regression Trees (CART): Breiman et al. Decision tree building by the Gini index.
Iterative Dichotomizer 3 (ID3): Ross Quinlan (1986). Splitting by information gain (difference in entropy).
C4.5: Quinlan's next algorithm, improved over ID3. Bottom-up pruning, both categorical and continuous variables, handling of incomplete data points.
C5.0: Ross Quinlan's commercial version.
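As one way to try these ideas in practice, here is a hedged usage sketch with scikit-learn's DecisionTreeClassifier (a CART-style implementation), assuming scikit-learn is installed; the criterion parameter switches between the Gini index and entropy, and the toy data is made up.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 40], [30, 150], [60, 50], [45, 90], [55, 120], [22, 30]]  # [age, income]
y = ["No", "Yes", "Yes", "No", "Yes", "No"]

# criterion="entropy" would split by information gain instead of Gini;
# ccp_alpha > 0 would apply cost-complexity pruning as on the earlier slide.
tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=1)
tree.fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[35, 130]]))
```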
Properties of Decision Trees
Non-parametric approach: does not require any prior assumptions about the probability distribution of the class and attributes.
Finding an optimal decision tree is an NP-complete problem. Heuristics used: greedy, recursive partitioning, top-down construction, bottom-up pruning.
Fast to generate, fast to classify.
Easy to interpret or visualize.
Error propagation: an error at the top of the tree propagates all the way down.
References
Introduction to Data Mining, by Tan, Steinbach, Kumar.
Chapter 4 is available online: http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf