Decision Tree Learning
1 Decision Tree Learning
2 Simple example of object classification
Instance | Size  | Color | Shape    | C(x)
x1       | small | red   | circle   | positive
x2       | large | red   | circle   | positive
x3       | small | red   | triangle | negative
x4       | large | blue  | circle   | negative
3 Decision Trees
Tree-based classifiers for instances represented as feature vectors. Nodes test features, there is one branch for each value of the feature, and leaves specify the category.
[Figure: two example trees, each rooted at a color test with branches red/blue/green. In the first, red leads to a shape test (circle: pos, square: neg, triangle: neg), blue to neg, green to pos. In the second, red leads to a shape test (circle: A, square: B, triangle: C), blue to B, green to C.]
Decision trees can represent arbitrary conjunctions and disjunctions (any Boolean function), and thus any classification function over discrete feature vectors.
A tree can be rewritten as a set of rules in disjunctive normal form (DNF), e.g. red AND circle -> pos for the first tree; red AND circle -> A, blue -> B, red AND square -> B, green -> C, red AND triangle -> C for the second.
4 General shape of a decision tree
Every non-leaf node is a test on the value of one feature. For each possible outcome of the test, an arrow is created that links to a subsequent test node or to a leaf node. Leaf nodes are DECISIONS concerning the value of the classification.
5 Another simple example: risk of giving a loan
Here we have 3 features: income, credit history, and debt. The classification function is Risk(x) with 3 values: high, moderate, low.
6 More complex example: application of the decision-tree model to character recognition based on visual features
[Figure: a large tree that can't be visualized without zooming; boxes (decisions) are alphabet letters, circles are tests on the values of several visual features.]
7 How does it work: Top-Down Decision Tree Induction
Recursively build a tree top-down by divide and conquer.
Initial learning set D: <big, red, circle>: +, <small, red, circle>: +, <small, red, square>: -, <big, blue, circle>: -.
Begin by considering the feature color. The subset of examples in D in which color=red is {<big, red, circle>: +, <small, red, circle>: +, <small, red, square>: -}; the subset in which color=blue is {<big, blue, circle>: -}.
At each step, we aim to find the best split of our data. What is a good split? One which reduces the uncertainty of classification in the resulting subsets.
8 Top-Down Decision Tree Induction
Recursively build a tree top-down by divide and conquer.
[Figure: the full set {<big, red, circle>: +, <small, red, circle>: +, <small, red, square>: -, <big, blue, circle>: -} is split on color; the red branch {<big, red, circle>: +, <small, red, circle>: +, <small, red, square>: -} is split further on shape (circle -> pos, square -> neg, triangle -> pos), while the blue branch ({<big, blue, circle>: -}) and the empty green branch lead to neg leaves.]
How do we decide the order in which we test attributes? How do we decide the class of nodes for which we have no examples? Let's ignore these 2 issues for now and describe the algorithm first.
9 Remember the house classification problem in L1
[Figure: scatter of homes by elevation, mostly SF above the 73 m line.] The data suggests homes higher than 73 meters are mostly in San Francisco.
10 Splitting data according to features
Split on height: in the branch height > 75, all instances are green (= SF), so we can output a decision; in the branch height < 75, we need to analyze more features, since no decision on the class is possible yet.
11 Decision Tree Induction Pseudocode
DTree(examples, features) returns a tree:
a) If all examples are in one category, return a leaf node with that category label.
b) Else if the set of features is empty, return a leaf node with the category label that is the most common in examples.
Else pick a feature f and create a node R for it (e.g. f = color).
For each possible value v_i of f (e.g. color = red):
  Let examples_i be the subset of examples that have value v_i for f (e.g. {<big, red, circle>: +, <small, red, circle>: +, <small, red, square>: -}).
  Add an outgoing edge E to node R labeled with the value v_i.
  If examples_i is empty, then attach a leaf node to edge E labeled with the category that is the most common in examples;
  else call DTree(examples_i, features - {f}) and attach the resulting tree as the subtree under edge E.
Return the subtree rooted at R.
a) and b) are the termination conditions.
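A minimal Python sketch of this recursion, assuming examples are (feature-dict, label) pairs and that the set of possible values per feature is known; the names dtree and majority_class are illustrative, not from the slides:

```python
from collections import Counter

def majority_class(examples):
    # Most common label among the given examples.
    return Counter(label for _, label in examples).most_common(1)[0][0]

def dtree(examples, features, feature_values):
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                 # a) all examples in one category
        return labels[0]
    if not features:                          # b) no features left
        return majority_class(examples)
    f = features[0]                           # placeholder choice; information gain comes later
    branches = {}                             # maps feature value -> subtree
    for v in feature_values[f]:
        subset = [(x, y) for x, y in examples if x[f] == v]
        if not subset:                        # empty split: majority label of the parent
            branches[v] = majority_class(examples)
        else:
            branches[v] = dtree(subset, [g for g in features if g != f], feature_values)
    return (f, branches)
```

With the feature chosen by information gain instead of the placeholder first-feature choice, this becomes the ID3-style learner described on the following slides.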
12 Example (1)
Examples: <big, red, circle>: +, <small, red, circle>: +, <small, red, square>: -, <big, blue, circle>: -.
Pick a feature F and create a node R for it, e.g. color.
For each possible value v_i of F: let examples_i be the subset of examples that have value v_i for F, and add an outgoing edge E to node R labeled with the value v_i.
For color=red, examples_red = {<big, red, circle>: +, <small, red, circle>: +, <small, red, square>: -}, which is neither empty nor uniform, so we call DTree(examples_red, <dimension, shape>) and attach the resulting tree as the subtree under that edge.
13 Example (2)
DTree({<big, red, circle>: +, <small, red, circle>: +, <small, red, square>: -}, <dimension, shape>): pick a feature F and create a node R for it, e.g. shape.
For shape=circle the subset {<big, red, circle>: +, <small, red, circle>: +} is all positive, so we return a leaf labeled positive (if all examples are in one category, return a leaf node with that category label).
For shape=square the subset {<small, red, square>: -} is all negative, so we return a leaf labeled negative.
For shape=triangle the subset is empty, so we attach a leaf labeled with the category that is the most common in examples, i.e. positive (if examples_i is empty, attach a leaf node to edge E labeled with the most common category in examples).
14 Example (3)
Adding <big, blue, circle>: - back in, the finished tree is: color at the root; red -> shape, blue -> neg, green -> neg; under shape: circle -> pos, square -> neg, triangle -> pos.
Now we know how to decide the class when we have no examples, but how do we decide the order in which we create nodes?
15 Picking a Good Split Feature
The goal is to have the resulting tree be as small as possible, per Occam's razor. Finding a minimal decision tree (in nodes, leaves, or depth) is an NP-hard optimization problem. The top-down divide-and-conquer method does a greedy search for a simple tree but is not guaranteed to find the smallest. General lesson in ML: greed is good.
We want to pick a feature that creates subsets of examples that are relatively pure in a single class, so they are closer to being leaf nodes. There are a variety of heuristics for picking a good test; a popular one is based on information gain and originated with the ID3 system of Quinlan (1979).
16 Entropy (recap)
The entropy (disorder, impurity) of a set of examples D, relative to a binary classification, is
$Entropy(D) = -p_1 \log_2(p_1) - p_0 \log_2(p_0)$
where $p_1$ is the fraction of positive examples in D and $p_0$ is the fraction of negatives. (Notice that if S is a SAMPLE of a population, Entropy(S) is an estimate of the population entropy.)
If all examples are in one category, entropy is zero (we define $0 \log 0 = 0$). If examples are equally mixed ($p_1 = p_0 = 0.5$), entropy is a maximum of 1.
Entropy can be viewed as the number of bits required on average to encode the class of an example in D. It is also an estimate of the initial disorder or uncertainty about a classification, given the set D.
For multi-class problems with c categories, entropy generalizes to
$Entropy(D) = -\sum_{i=1}^{c} p_i \log_2(p_i)$
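A small sketch of this definition in Python (the helper name entropy is an illustrative choice):

```python
import math

def entropy(labels):
    """Entropy of a list of class labels, in bits (0*log2(0) treated as 0)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy(['+', '+', '+', '+']))   # 0.0  (pure set)
print(entropy(['+', '+', '-', '-']))   # 1.0  (maximally mixed)
print(entropy(['+', '+', '-']))        # ~0.918
```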
17 Entropy Plot for Binary Classification
[Figure: entropy as a function of the fraction of positive examples, 0 at the extremes and a maximum of 1 at 0.5.]
18 Information Gain
The information gain of a feature f is the expected reduction in entropy resulting from splitting on this feature:
$Gain(D, f) = Entropy(D) - \sum_{v \in Values(f)} \frac{|D_v|}{|D|} Entropy(D_v)$
where $D_v$ is the subset of D having value v for feature f (e.g., if f = color and v = red). The entropy of each resulting subset is weighted by its relative size.
Example: <big, red, circle>: +, <small, red, circle>: +, <small, red, square>: -, <big, blue, circle>: -. Initial entropy is 1 (2+, 2-: E = 1).
Split on size: big 1+,1- (E_big = 1), small 1+,1- (E_small = 1); Gain = 1 - (2/4*1 + 2/4*1) = 0.
Split on color: red 2+,1- (E_red = 0.918), blue 0+,1- (E_blue = 0); Gain = 1 - (3/4*0.918 + 1/4*0) = 0.311.
Split on shape: circle 2+,1- (E_circle = 0.918), square 0+,1- (E_square = 0); Gain = 1 - (3/4*0.918 + 1/4*0) = 0.311.
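A Python sketch that reproduces these numbers; the helper names (entropy, info_gain) and the dict representation of examples are illustrative choices, not from the slides:

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(examples, feature):
    """Expected reduction in entropy from splitting `examples` on `feature`."""
    total = entropy([y for _, y in examples])
    n = len(examples)
    remainder = 0.0
    for v in set(x[feature] for x, _ in examples):
        subset = [y for x, y in examples if x[feature] == v]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder

D = [({'size': 'big',   'color': 'red',  'shape': 'circle'}, '+'),
     ({'size': 'small', 'color': 'red',  'shape': 'circle'}, '+'),
     ({'size': 'small', 'color': 'red',  'shape': 'square'}, '-'),
     ({'size': 'big',   'color': 'blue', 'shape': 'circle'}, '-')]

for f in ('size', 'color', 'shape'):
    print(f, round(info_gain(D, f), 3))   # size 0.0, color 0.311, shape 0.311
```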
19 New pseudo-code
DTree(examples, features) returns a tree:
a) If all examples are in one category, return a leaf node with that category label.
b) Else if the set of features is empty, return a leaf node with the category label that is the most common in examples.
Else pick the best feature f according to information gain and create a node R for it.
For each possible value v_i of f:
  Let examples_i be the subset of examples that have value v_i for f.
  Add an outgoing edge E to node R labeled with the value v_i.
  If examples_i is empty, then attach a leaf node to edge E labeled with the category that is the most common in examples;
  else call DTree(examples_i, features - {f}) and attach the resulting tree as the subtree under edge E.
Return the subtree rooted at R.
20 A small animated example
21 A complete example
22 A Decision Tree for the PlayTennis? data
The PlayTennis? data example:
Instance | Outlook  | Temperature | Humidity | Windy | Play
x1       | Sunny    | Hot         | High     | False | No
x2       | Sunny    | Hot         | High     | True  | No
x3       | Overcast | Hot         | High     | False | Yes
x4       | Rainy    | Mild        | High     | False | Yes
x5       | Rainy    | Cool        | Normal   | False | Yes
x6       | Rainy    | Cool        | Normal   | True  | No
x7       | Overcast | Cool        | Normal   | True  | Yes
x8       | Sunny    | Mild        | High     | False | No
x9       | Sunny    | Cool        | Normal   | False | Yes
x10      | Rainy    | Mild        | Normal   | False | Yes
x11      | Sunny    | Mild        | Normal   | True  | Yes
x12      | Overcast | Mild        | High     | True  | Yes
x13      | Overcast | Hot         | Normal   | False | Yes
x14      | Rainy    | Mild        | High     | True  | No
23 The Process of Constructing a Decision Tree
Select an attribute to place at the root of the decision tree and make one branch for every possible value. Repeat the process recursively for each branch.
24 Information Gained by Knowing the Result of a Decision
In the PlayTennis example, there are 9 instances for which the decision to play is yes and 5 instances for which the decision to play is no. The initial data entropy is therefore:
$-\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$
The information initially required to correctly separate the data is 0.940 bits.
25 Information Further Required If Outlook Is Placed at the Root
[Figure: splitting on Outlook gives sunny = {yes, yes, no, no, no}, overcast = {yes, yes, yes, yes}, rainy = {yes, yes, yes, no, no}.]
0.971 is the entropy of a 5-instance data set where 3 are positive and 2 negative (or vice versa); 0 is the entropy of a data set where all instances are positive (or negative). Weighting each branch by its probability (e.g. 5/14 for Outlook = sunny):
Information further required $= \frac{5}{14} \cdot 0.971 + \frac{4}{14} \cdot 0 + \frac{5}{14} \cdot 0.971 = 0.693$ bits.
26 Information Gained by Placing Each of the 4 Attributes
Gain(outlook) = 0.940 bits - 0.693 bits = 0.247 bits.
Gain(temperature) = 0.029 bits.
Gain(humidity) = 0.152 bits.
Gain(windy) = 0.048 bits.
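As a sanity check, a self-contained Python sketch (the data layout and helper names are illustrative) that recomputes these four gains from the table on slide 22:

```python
import math

data = [  # (Outlook, Temperature, Humidity, Windy) -> Play
    (('Sunny', 'Hot', 'High', 'False'), 'No'),      (('Sunny', 'Hot', 'High', 'True'), 'No'),
    (('Overcast', 'Hot', 'High', 'False'), 'Yes'),  (('Rainy', 'Mild', 'High', 'False'), 'Yes'),
    (('Rainy', 'Cool', 'Normal', 'False'), 'Yes'),  (('Rainy', 'Cool', 'Normal', 'True'), 'No'),
    (('Overcast', 'Cool', 'Normal', 'True'), 'Yes'),(('Sunny', 'Mild', 'High', 'False'), 'No'),
    (('Sunny', 'Cool', 'Normal', 'False'), 'Yes'),  (('Rainy', 'Mild', 'Normal', 'False'), 'Yes'),
    (('Sunny', 'Mild', 'Normal', 'True'), 'Yes'),   (('Overcast', 'Mild', 'High', 'True'), 'Yes'),
    (('Overcast', 'Hot', 'Normal', 'False'), 'Yes'),(('Rainy', 'Mild', 'High', 'True'), 'No'),
]
attributes = ['Outlook', 'Temperature', 'Humidity', 'Windy']

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n) for c in set(labels))

def gain(rows, attr_index):
    rest = 0.0
    for v in set(x[attr_index] for x, _ in rows):
        subset = [y for x, y in rows if x[attr_index] == v]
        rest += len(subset) / len(rows) * entropy(subset)
    return entropy([y for _, y in rows]) - rest

for i, a in enumerate(attributes):
    print(a, round(gain(data, i), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048
```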
27 The Strategy for Selecting an Attribute to Place at a Node
Select the attribute that gives us the largest information gain. In this example, it is the attribute Outlook.
[Figure: the Outlook split: sunny -> 2 yes, 3 no; overcast -> 4 yes; rainy -> 3 yes, 2 no.]
28 The Recursive Procedure for Constructing a Decision Tree
Apply the same procedure to each branch recursively to construct the decision tree. For example, for the branch Outlook = sunny, we evaluate the information gained by applying each of the remaining 3 attributes:
Gain(Outlook=sunny; Temperature) = 0.971 - 0.400 = 0.571
Gain(Outlook=sunny; Humidity) = 0.971 - 0 = 0.971
Gain(Outlook=sunny; Windy) = 0.971 - 0.951 = 0.020
29 Recursive selection
Similarly, we evaluate the information gained by applying each of the remaining 3 attributes for the branch Outlook = rainy:
Gain(Outlook=rainy; Temperature) = 0.971 - 0.951 = 0.020
Gain(Outlook=rainy; Humidity) = 0.971 - 0.951 = 0.020
Gain(Outlook=rainy; Windy) = 0.971 - 0 = 0.971
30 The resulting tree
Outlook at the root: sunny -> Humidity (high -> no, normal -> yes); overcast -> yes; rainy -> Windy (false -> yes, true -> no).
31 A DT can be represented as a set of rules
[Figure: the same tree as on the previous slide.]
IF Outlook=sunny AND humidity=high -> no
IF Outlook=sunny AND humidity=normal -> yes
IF Outlook=overcast -> yes
IF Outlook=rainy AND windy=false -> yes
IF Outlook=rainy AND windy=true -> no
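The same five rules can be written directly as a small Python function; the function name and the boolean windy argument are illustrative choices:

```python
def play_tennis(outlook, humidity, windy):
    # Direct transcription of the five rules read off the tree.
    if outlook == 'sunny':
        return 'yes' if humidity == 'normal' else 'no'
    if outlook == 'overcast':
        return 'yes'
    if outlook == 'rainy':
        return 'no' if windy else 'yes'
    raise ValueError('unknown outlook value')

print(play_tennis('sunny', 'high', False))   # no
print(play_tennis('rainy', 'normal', True))  # no
```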
32 Support and Confidence
Each rule has a support |D_v|/|D|, represented by the set of examples that fit the rule. Each rule also has a confidence, which might or might not be equal to the support: the confidence is the fraction of D_v which is correctly classified by the rule.
Remember one of the 2 exit conditions in the algorithm: else if the set of features is empty, return a leaf node with the category label that is the most common in examples. In that case the set of examples D_v does NOT have a uniform classification but contains, say, D_v+ positive and D_v- negative examples; if |D_v+| > |D_v-| the class is positive, the support is |D_v+|/|D| and the confidence is |D_v+|/|D_v|.
For example, if we consume all features in a branch of the tree and remain with 5 examples, of which 3 are positive and 2 negative, we append the decision positive to the tree branch (and its associated rule), with support 5 (or 5/|D|) and confidence 3/5.
33 Example (e.g. if we have just one feature)
[Figure: the Outlook split: sunny -> 2 yes, 3 no; overcast -> 4 yes; rainy -> 3 yes, 2 no; the sunny branch becomes a negative leaf.]
RULE: IF Outlook=sunny THEN neg, with Support = 3/14 and Confidence = 3/5.
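A tiny sketch reproducing the two numbers on this slide; the flat list of (outlook, play) pairs is an invented stand-in for the full table, keeping only the counts that matter:

```python
# Support and confidence of the rule "IF Outlook=sunny THEN No"
# on a 14-example set with 5 sunny instances (3 No, 2 Yes).
examples = [('Sunny', 'No')] * 3 + [('Sunny', 'Yes')] * 2 + [('Other', '?')] * 9

covered = [(o, p) for o, p in examples if o == 'Sunny']
correct = [(o, p) for o, p in covered if p == 'No']

support = len(correct) / len(examples)     # 3/14 ~= 0.214
confidence = len(correct) / len(covered)   # 3/5 = 0.6
print(support, confidence)
```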
34 Properties of Decision Tree Learning
Continuous (real-valued) features can be handled by allowing nodes to split a real-valued feature into two ranges based on a threshold (e.g. length < 3 and length >= 3).
Classification trees have discrete class labels at the leaves; regression trees allow real-valued outputs at the leaves.
Algorithms for finding consistent trees are efficient for processing large amounts of training data for data mining tasks.
Methods have been developed for handling noisy training data (both class and feature noise): the consistency constraint can be relaxed.
Methods have been developed for handling missing feature values.
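A sketch of the threshold idea for a single continuous feature, assuming the common approach of trying midpoints between consecutive sorted values (the names and toy data are illustrative):

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n) for c in set(labels))

def best_threshold(values, labels):
    """Try midpoints between consecutive sorted values; return (threshold, gain)."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue
        t = (a + b) / 2
        left = [y for x, y in pairs if x < t]
        right = [y for x, y in pairs if x >= t]
        g = base - (len(left) / len(pairs) * entropy(left)
                    + len(right) / len(pairs) * entropy(right))
        if g > best[1]:
            best = (t, g)
    return best

lengths = [1.0, 2.0, 2.5, 4.0, 5.0, 6.0]
classes = ['-', '-', '-', '+', '+', '+']
print(best_threshold(lengths, classes))   # (3.25, 1.0): a perfect binary split
```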
35 Hypothesis Space Search in Decision Tree Learning
Performs batch learning that processes all training instances at once, rather than incremental learning (e.g. like VS) that updates a hypothesis after each example.
Performs hill-climbing (greedy search) that may only find a locally optimal solution; it is greedy because it locally determines the best split node to be generated.
Guaranteed to find a tree consistent with any conflict-free training set (i.e. identical feature vectors are always assigned the same class), but not necessarily the simplest such tree.
Finds a single discrete hypothesis, so there is no way to provide confidences or create useful queries.
36 Bias in Decision-Tree Induction
Information gain gives a bias for trees with minimal depth. It implements a search (preference) bias instead of a language (restriction) bias (a language bias would be, e.g., conjunctive hypotheses only, as for VS).
37 History of Decision-Tree Research
Hunt and colleagues used exhaustive-search decision-tree methods (CLS) to model human concept learning in the 1960s.
In the late 70s, Quinlan developed ID3 with the information gain heuristic to learn expert systems from examples. Simultaneously, Breiman, Friedman and colleagues developed CART (Classification and Regression Trees), similar to ID3.
In the 1980s a variety of improvements were introduced to handle noise, continuous features, missing features, and improved splitting criteria; various expert-system development tools resulted.
Quinlan's updated decision-tree package (C4.5) was released in 1993. Weka includes a Java version of C4.5 called J48.
38 Summary
Decision tree algorithm; ordering nodes (Information Gain); testing and pruning.
39 Computational Complexity
Worst case builds a complete tree where every path tests every feature. Assume |D| = n examples and m features.
[Figure: a complete tree with features f_1 ... f_m tested at successive levels; a maximum of n examples is spread across all nodes at each of the m levels.]
At each level i in the tree, we must examine the remaining m - i features for each instance at that level to calculate information gains:
$\sum_{i=1}^{m} i \cdot n = O(nm^2)$
However, the learned tree is rarely complete (the number of leaves is usually << n). In practice, complexity is linear in both the number of features (m) and the number of training examples (n).
40 Overfitting
Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization to unseen data. There may be noise in the training data that the tree is erroneously fitting, and the algorithm may be making poor decisions towards the leaves of the tree that are based on very little data and may not reflect reliable trends (reliable rules should be supported by many examples, not just a handful).
A hypothesis h is said to overfit the training data if there exists another hypothesis h' such that h has less error than h' on the training data but greater error than h' on independent test data.
[Figure: accuracy vs. hypothesis complexity; accuracy on training data keeps rising while accuracy on test data peaks and then falls.]
41 Overfitting example (1)
42 Overfitting Example (2)
Testing Ohm's Law: V = IR (i.e. I = (1/R)V). Experimentally measure 10 points and fit a curve to the resulting data.
[Figure: current (I) vs. voltage (V) with a wiggly curve passing through all 10 points.]
Perfect fit to the training data with a 9th-degree polynomial (we can fit n points exactly with a degree n-1 polynomial). Ohm was wrong, we have found a more accurate function!
43 Overfitting Example (3)
Testing Ohm's Law: V = IR (I = (1/R)V).
[Figure: current (I) vs. voltage (V) with a straight-line fit.]
Better generalization with a linear function that fits the training data less accurately.
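A small illustration of the same point with NumPy, assuming noisy measurements generated from a true linear law (the resistance, noise level and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
R = 2.0
V = np.linspace(1, 10, 10)
I = V / R + rng.normal(scale=0.05, size=V.size)   # 10 noisy "measurements"

p9 = np.polyfit(V, I, 9)   # 9th-degree polynomial: near-zero training error
p1 = np.polyfit(V, I, 1)   # linear fit: small but nonzero training error

V_test = np.linspace(1, 10, 100)
I_test = V_test / R                                # noise-free ground truth
print("degree-9 test RMSE:", np.sqrt(np.mean((np.polyval(p9, V_test) - I_test) ** 2)))
print("linear   test RMSE:", np.sqrt(np.mean((np.polyval(p1, V_test) - I_test) ** 2)))
# The linear model typically generalizes better, even though the 9th-degree
# polynomial fits the training points exactly.
```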
44 Overfitting Noise in Decision Trees
Category or feature noise can easily cause overfitting. Add the noisy instance <medium, blue, circle>: pos (but really neg).
[Figure: the original tree — color: red -> shape (circle -> pos, square -> neg, triangle -> pos), blue -> neg, green -> neg.]
45 Overfitting Noise in Decision Trees
Category or feature noise can easily cause overfitting. Add the noisy instance <medium, blue, circle>: pos (but really neg).
[Figure: to accommodate the noisy instance, the blue branch is grown further over {<big, blue, circle>: -, <medium, blue, circle>: +}, ending in a size test with leaves small -> neg, med -> pos, big -> neg.]
Conflicting examples can also arise if the features are incomplete and inadequate to determine the class, or if the target concept is non-deterministic (hidden features not discovered during the initial phase or not available). For example, it is believed that psoriasis also depends on stress, but patients' stress condition is mostly not available in patient records.
46 Example (inadequate features)
X(has-4-legs, has-tail, has-long-nose, has-big-ears, can-be-ridden); classes: horse, camel, elephant.
The missing feature has-hump does not allow us to correctly classify a camel (a camel and a horse would be classified the same according to the above features).
47 Overfitting Prevention (Pruning) Methods
Two basic approaches for decision trees:
Prepruning: stop growing the tree at some point during top-down construction when there is no longer sufficient data to make reliable decisions (e.g. |D_v| < k).
Postpruning: grow the full tree, then remove sub-trees that do not have sufficient evidence. Label the leaf resulting from pruning with the majority class of the remaining data, or with a class probability distribution.
Methods for determining which sub-trees to prune:
Cross-validation: reserve some training data as a hold-out set (validation set, tuning set) to evaluate the utility of subtrees. Note: the tuning set is different from the test set! A test set is always needed at the end.
Statistical test: use a statistical test on the training data to determine if any observed regularity (e.g. a rule) can be dismissed as likely due to random chance.
Minimum description length (MDL): determine whether the additional complexity of the hypothesis is less complex than just explicitly remembering any exceptions resulting from pruning (e.g. for mammals: wings=false, except for bats).
48 Reduced Error Pruning
A post-pruning, cross-validation approach.
Partition the training data D into a learning set L and a validation set V. Build a complete tree from the L data.
Until accuracy on the validation set V decreases, do:
For each non-leaf node n in the tree: temporarily prune the subtree below n and replace it with a leaf labeled with the current majority class at that node; measure and record the accuracy of the pruned tree on the validation set.
Permanently prune the node that results in the greatest increase in accuracy on the validation set.
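A self-contained sketch of this loop in Python, reusing the tuple/dict tree representation from the earlier dtree sketch; for brevity the pruned leaf is labeled with a single majority label supplied by the caller rather than the per-node majority class described above, and the tiny validation set is invented for illustration:

```python
# A leaf is a class label; an internal node is (feature, {value: subtree}).
# A small tree overfit to the noisy blue examples (in the spirit of slides 45/49).
tree = ('color', {
    'red':   ('shape', {'circle': '+', 'square': '-', 'triangle': '+'}),
    'blue':  ('size',  {'small': '-', 'med': '+', 'big': '-'}),
    'green': '-',
})

def classify(tree, x):
    if not isinstance(tree, tuple):           # leaf
        return tree
    feature, branches = tree
    return classify(branches[x[feature]], x)

def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def internal_nodes(tree, path=()):
    """Yield the path (sequence of branch values) of every internal node."""
    if isinstance(tree, tuple):
        yield path
        for v, child in tree[1].items():
            yield from internal_nodes(child, path + (v,))

def prune_at(tree, path, label):
    """Copy of the tree with the node at `path` replaced by the leaf `label`."""
    if not path:
        return label
    feature, branches = tree
    branches = dict(branches)
    branches[path[0]] = prune_at(branches[path[0]], path[1:], label)
    return (feature, branches)

def reduced_error_prune(tree, validation, majority_label):
    best = tree
    while isinstance(best, tuple):
        base = accuracy(best, validation)
        candidates = [prune_at(best, p, majority_label) for p in internal_nodes(best)]
        pruned = max(candidates, key=lambda t: accuracy(t, validation))
        if accuracy(pruned, validation) < base:   # stop when accuracy would decrease
            break
        best = pruned
    return best

validation = [({'color': 'blue', 'shape': 'circle', 'size': 'small'}, '-'),
              ({'color': 'blue', 'shape': 'circle', 'size': 'med'},   '-'),
              ({'color': 'red',  'shape': 'circle', 'size': 'big'},   '+')]
print(reduced_error_prune(tree, validation, '-'))
# The noisy subtree grown under color=blue is pruned back to a single '-' leaf.
```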
49 Example
[Figure: the overfitted tree from slide 45, with the validation examples <big, blue, circle>: -, <medium, blue, circle>: +, <small, blue, circle>: - shown reaching the subtree grown under blue; that subtree is pruned back to a leaf.]
The majority of the examples under the pruned sub-tree is neg.
50 Issues with Reduced Error Pruning
The problem with this approach is that it potentially wastes training data on the validation set. The severity of this problem depends on where we are on the learning curve:
[Figure: a learning curve — test accuracy vs. number of training examples.]
A learning curve shows the accuracy (i.e. the learning rate) as the number of training examples grows.
51 Cross-Validating without Losing Training Data
If the algorithm is modified to grow trees breadth-first rather than depth-first (i.e. test all values of a feature first and create all nodes at level i), we can stop growing after reaching any specified tree complexity.
First, run several trials of reduced-error pruning using different random splits of learning and validation sets, and record the complexity of the pruned tree learned in each trial. Let C be the average pruned-tree complexity (C = number of nodes + edges of the tree). Then grow a final tree breadth-first from all the training data, but stop when the complexity reaches C.
A similar cross-validation approach can be used to set arbitrary algorithm parameters in general.
52 Summary
Decision tree algorithm; ordering nodes (Information Gain); testing and pruning.