Induction of Decision Trees
Blaž Zupan, Ivan Bratko
magix.fri.uni-lj.si/predavanja/uisp

An Example Data Set and Decision Tree
[Table: 10 examples described by the attributes Outlook (sunny / rainy), Company (big / med / small) and Sailboat (small / big), each labeled with the class Sail?]
[Decision tree for the data set: Outlook at the root (sunny / rainy), with Company and Sailboat (small / big) tested below it.]
Classification
[Table: two new examples with unknown class. 1: sunny, big, ?  2: rainy, big, small, ?]
[The induced decision tree (Outlook / Company / Sailboat) is used to assign the class Sail?]

Induction of Decision Trees
Data Set (Learning Set)
Each example = Attributes + Class
Induced description = Decision tree
TDIDT: Top Down Induction of Decision Trees
Recursive partitioning
Some TDIDT Systems
ID3 (Quinlan 79)
CART (Breiman et al. 84)
Assistant (Cestnik et al. 87)
C4.5 (Quinlan 93)
See5 (Quinlan 97)
Orange (Demšar, Zupan 98-03)

Analysis of Severe Trauma Patients Data
[Decision tree. Root: PH_ICU, the worst pH value at ICU (branches <7.2, 7.2-7.33, >7.33); internal node: APT_WORST, the worst activated partial thromboplastin time (branches <78.7, >=78.7); leaves: Death 0.0 (0/15), Well 0.88 (14/16), Well 0.82 (9/11), Death 0.0 (0/7).]
PH_ICU and APT_WORST are exactly the two factors (theoretically) advocated to be the most important ones in the study by Rotondo et al., 1997.
Breast Cancer Recurrence
[Decision tree induced by Assistant Professional. Root: Degree of Malig (<3 / >=3); internal nodes: Tumor Size (<15 / >=15), Involved Nodes (<3 / >=3), Age; leaves give class distributions, e.g. no_rec 125 / recurr 39, no_rec 30 / recurr 18, recurr 27 / no_rec 10, no_rec 4 / recurr 1, no_rec 32 / recurr 0.]
Interesting: the accuracy of this tree compared to medical specialists.

Prostate Cancer Recurrence
[Decision tree. Root: Secondary Gleason Grade (1,2 / 3 / 4 / 5); internal nodes: PSA Level (<=14.9 / >14.9), Stage (T1c,T2a,T2b,T2c / T1ab,T3), Primary Gleason Grade (2,3 / 4); leaves: No / Yes.]
TDIDT Algorithm
Also known as ID3 (Quinlan).
To construct decision tree T from learning set S:
If all examples in S belong to some class C, then make a leaf labeled C.
Otherwise:
select the most informative attribute A,
partition S according to A's values,
recursively construct subtrees T1, T2, ... for the subsets of S.
(A sketch of this recursion in plain Python is given below.)

TDIDT Algorithm
The resulting tree T is:
[Figure: a root node testing attribute A, with branches for A's values v1, v2, ..., vn leading to subtrees T1, T2, ..., Tn.]
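The recursion above can be written out in a few lines of plain Python. This is only a sketch of the TDIDT/ID3 scheme described on this slide, not the Orange implementation used later in the deck; representing examples as dictionaries and the helper names (entropy, gain, tdidt) are assumptions made for illustration.

from collections import Counter
import math

def entropy(examples, target):
    """Entropy of the class distribution in `examples` (a list of dicts)."""
    counts = Counter(ex[target] for ex in examples)
    total = float(len(examples))
    return -sum(n / total * math.log(n / total, 2) for n in counts.values())

def gain(examples, attr, target):
    """Information gain of splitting `examples` on attribute `attr`."""
    total = float(len(examples))
    res = 0.0
    for v in set(ex[attr] for ex in examples):
        subset = [ex for ex in examples if ex[attr] == v]
        res += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - res

def tdidt(examples, attrs, target):
    """Return a leaf (class label) or a node (attribute, {value: subtree})."""
    classes = set(ex[target] for ex in examples)
    if len(classes) == 1:                       # all examples belong to one class C
        return classes.pop()                    # -> make a leaf labeled C
    if not attrs:                               # no attributes left: majority-class leaf
        return Counter(ex[target] for ex in examples).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, a, target))  # most informative A
    branches = {}
    for v in set(ex[best] for ex in examples):  # partition S according to A's values
        subset = [ex for ex in examples if ex[best] == v]
        branches[v] = tdidt(subset, [a for a in attrs if a != best], target)
    return (best, branches)

Called as, for instance, tdidt(examples, ["Outlook", "Company", "Sailboat"], "Sail?"), it returns either a class label or an (attribute, branches) pair.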
Another Example
#   Outlook    Temperature  Humidity  Windy  Play
1   sunny      hot          high             N
2   sunny      hot          high             N
3   overcast   hot          high             P
4   rainy      moderate     high             P
5   rainy      cold         normal           P
6   rainy      cold         normal           N
7   overcast   cold         normal           P
8   sunny      moderate     high             N
9   sunny      cold         normal           P
10  rainy      moderate     normal           P
11  sunny      moderate     normal           P
12  overcast   moderate     high             P
13  overcast   hot          normal           P
14  rainy      moderate     high             N

Simple Tree
[Decision tree: Outlook at the root; the sunny branch tests Humidity (high: N, normal: P), the overcast branch is a P leaf, the rainy branch tests Windy (one value leads to N, the other to P).]
Complicated Tree
[A much larger tree for the same data: Temperature (cold / moderate / hot) at the root, with repeated tests of Outlook, Windy and Humidity below it, and even a null leaf.]

Attribute Selection Criteria
Main principle: select the attribute which partitions the learning set into subsets that are as pure as possible.
Various measures of purity:
information-theoretic, Gini index, chi-square, ReliefF
Various improvements:
probability estimates, normalization, binarization, subsetting
Information-Theoretic Approach
To classify an object, a certain amount of information is needed: I, the information.
After we have learned the value of attribute A, we only need some remaining amount of information to classify the object: Ires, the residual information.
Gain: Gain(A) = I - Ires(A)
The most informative attribute is the one that minimizes Ires, i.e., maximizes Gain.

Entropy
The average amount of information I needed to classify an object is given by the entropy measure:
I = -sum_c p(c) log2 p(c)
For a two-class problem:
I = -p(c1) log2 p(c1) - p(c2) log2 p(c2)
[Plot: entropy as a function of p(c1); it is 0 for p(c1) = 0 or 1 and maximal (1 bit) at p(c1) = 0.5.]
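As a quick numeric check of the two-class formula (plain Python; entropy2 is just a throwaway helper name):

import math

def entropy2(p1):
    """Entropy (in bits) of a two-class distribution with p(c1) = p1."""
    if p1 in (0.0, 1.0):
        return 0.0                       # a pure set needs no further information
    p2 = 1.0 - p1
    return -p1 * math.log(p1, 2) - p2 * math.log(p2, 2)

print(entropy2(0.5))        # 1.0 bit, the maximum
print(entropy2(5 / 14.0))   # about 0.940 bits (e.g. 5 triangles vs 9 squares)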
Residual Information
After applying attribute A, S is partitioned into subsets according to the values v of A.
Ires is equal to the weighted sum of the amounts of information for the subsets:
Ires(A) = -sum_v p(v) sum_c p(c|v) log2 p(c|v)

Triangles and Squares
#   Color    Outline  Dot  Shape
1   green    dashed        triangle
2   green    dashed        triangle
3   yellow   dashed        square
4   red      dashed        square
5   red      solid         square
6   red      solid         triangle
7   green    solid         square
8   green    dashed        triangle
9   yellow   solid         square
10  red      solid         square
11  green    solid         square
12  yellow   dashed        square
13  yellow   solid         square
14  red      dashed        triangle
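The weighted sum can be checked on the data above with a small plain-Python sketch; the Dot column is left out because its values are not listed here, and the helper names are illustrative only.

import math
from collections import Counter

# (Color, Outline, Shape) for the 14 examples above; Dot is omitted.
shapes = [
    ("green", "dashed", "triangle"), ("green", "dashed", "triangle"),
    ("yellow", "dashed", "square"),  ("red", "dashed", "square"),
    ("red", "solid", "square"),      ("red", "solid", "triangle"),
    ("green", "solid", "square"),    ("green", "dashed", "triangle"),
    ("yellow", "solid", "square"),   ("red", "solid", "square"),
    ("green", "solid", "square"),    ("yellow", "dashed", "square"),
    ("yellow", "solid", "square"),   ("red", "dashed", "triangle"),
]

def info(examples):
    """Entropy of the class (last element) distribution."""
    counts = Counter(ex[-1] for ex in examples)
    n = float(len(examples))
    return -sum(c / n * math.log(c / n, 2) for c in counts.values())

def i_res(examples, attr_index):
    """Weighted sum of subset entropies after splitting on one attribute."""
    n = float(len(examples))
    res = 0.0
    for v in set(ex[attr_index] for ex in examples):
        subset = [ex for ex in examples if ex[attr_index] == v]
        res += len(subset) / n * info(subset)
    return res

print(round(i_res(shapes, 0), 3))   # Ires(Color), about 0.694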
Triangles and Squares
Data Set: a set of classified objects (the 14 examples above).

Entropy
5 triangles, 9 squares
class probabilities: p(triangle) = 5/14, p(square) = 9/14
entropy: I = -(5/14) log2 (5/14) - (9/14) log2 (9/14) = 0.940 bits
Entropy reduction by data set partitioning
[Figure: the data set split by Color? into red, yellow and green subsets.]

Entropy of the values of attribute Color
[Figure: the entropy of each subset: red (2 triangles, 3 squares) 0.971 bits, yellow (4 squares) 0 bits, green (3 triangles, 2 squares) 0.971 bits.]
Information Gain
[Figure: Gain(Color) = I - Ires(Color), computed from the split of the data set by Color? into the red, yellow and green subsets.]

Information Gain of The Attribute
Attributes:
Gain(Color) = 0.246
Gain(Outline) = 0.151
Gain(Dot) = 0.048
Heuristic: the attribute with the highest gain is chosen.
This heuristic is local (local minimization of impurity); see the sketch below.
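A minimal sketch of the gain computation for this example, using the class counts per attribute value read off the data above (Dot is again left out; H, gain and splits are made-up names for illustration):

import math

# Class counts (triangles, squares) for each value of each attribute.
splits = {
    "Color":   {"red": (2, 3), "yellow": (0, 4), "green": (3, 2)},
    "Outline": {"dashed": (4, 3), "solid": (1, 6)},
}

def H(counts):
    """Entropy (bits) of a class-count vector."""
    n = float(sum(counts))
    return -sum(c / n * math.log(c / n, 2) for c in counts if c)

def gain(split, total=(5, 9)):
    n = float(sum(total))
    i_res = sum(sum(c) / n * H(c) for c in split.values())
    return H(total) - i_res

for name, split in splits.items():
    print(name, round(gain(split), 3))          # Color ~0.247, Outline ~0.152
best = max(splits, key=lambda a: gain(splits[a]))  # attribute with the highest gain
print("choose:", best)                             # -> Color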
[Figure: the tree after the root split on Color?, with the red, green and yellow branches shown.]
For the green subset:
Gain(Outline) = 0.971 - 0 = 0.971 bits
Gain(Dot) = 0.971 - 0.951 = 0.020 bits
For the red subset:
Gain(Outline) = 0.971 - 0.951 = 0.020 bits
Gain(Dot) = 0.971 - 0 = 0.971 bits
[Figure: the green branch is therefore split on Outline? (dashed / solid).]
[Figure: the tree under construction: Color? at the root, the red branch split on Dot?, the green branch split on Outline? (dashed / solid), the yellow branch a leaf.]

Decision Tree
[Final tree: Color at the root; yellow leads to a square leaf; green leads to Outline (dashed: triangle, solid: square); red leads to Dot (one value: square, the other: triangle).]
A small sketch of classification with this tree follows.
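For illustration, the finished tree can be applied to new shapes with a few lines of plain Python. The nested-tuple representation and the 'yes'/'no' values used for Dot are assumptions, since the actual Dot values are not shown above.

# The induced tree as nested (attribute, {value: subtree}) pairs; leaves are class labels.
# The 'yes'/'no' values of Dot are placeholders -- the actual values are not given above.
tree = ("Color", {
    "yellow": "square",
    "green":  ("Outline", {"dashed": "triangle", "solid": "square"}),
    "red":    ("Dot",     {"yes": "square", "no": "triangle"}),
})

def classify(tree, example):
    """Walk the tree until a leaf (a plain class label) is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[example[attr]]
    return tree

print(classify(tree, {"Color": "green", "Outline": "dashed", "Dot": "no"}))  # triangle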
A Defect of Ires
Ires favors attributes with many values.
Such an attribute splits S into many subsets, and if these are small, they will tend to be pure anyway.
One way to rectify this is through a corrected measure, the information gain ratio.

Information Gain Ratio
I(A) is the amount of information needed to determine the value of attribute A:
I(A) = -sum_v p(v) log2 p(v)
Information gain ratio:
GainRatio(A) = Gain(A) / I(A)
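A small plain-Python check of the gain ratio for Color on the shapes data (the class counts per Color value are taken from the data above; H is an illustrative helper):

import math

def H(counts):
    """Entropy (bits) of a count vector."""
    n = float(sum(counts))
    return -sum(c / n * math.log(c / n, 2) for c in counts if c)

# Color: 5 red, 4 yellow, 5 green examples with class counts (2,3), (0,4), (3,2).
gain_color = H((5, 9)) - (5/14.0 * H((2, 3)) + 4/14.0 * H((0, 4)) + 5/14.0 * H((3, 2)))
split_info = H((5, 4, 5))                  # I(Color): entropy of Color's value distribution
print(round(gain_color / split_info, 3))   # GainRatio(Color), about 0.156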
Information Gain Ratio
[Figure: GainRatio(Color) computed from the split of the data set by Color?]

Information Gain and Information Gain Ratio
A        v(A)  Gain(A)  GainRatio(A)
Color    3     0.247    0.156
Outline  2     0.152    0.152
Dot      2     0.048    0.049
Gini Index
Another sensible measure of impurity (i and j are classes):
Gini(S) = sum over i != j of p(i) p(j)
After applying attribute A, the resulting Gini index is:
Gini(A) = sum_v p(v) Gini(S_v)
Gini can be interpreted as the expected error rate.

Gini Index
[Figure: the Gini index computed for the triangles-and-squares data set.]
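A sketch of a Gini-based gain for Color on the shapes data, using the common 1 - sum p(i)^2 form of the index. For a two-class problem this equals 2 * p(c1) * p(c2), so the value below comes out at twice the GiniGain figures reported later in these slides, which appear to use the product form.

def gini(counts):
    """Gini impurity 1 - sum(p_i^2) of a class-count vector."""
    n = float(sum(counts))
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Shapes data: 5 triangles / 9 squares overall; Color splits the class counts
# into red (2, 3), yellow (0, 4) and green (3, 2).
gini_s = gini((5, 9))
gini_color = 5/14.0 * gini((2, 3)) + 4/14.0 * gini((0, 4)) + 5/14.0 * gini((3, 2))
print(round(gini_s - gini_color, 3))   # ~0.116 under this convention (2 * 0.058)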
Gini Index for Color
[Figure: the data set split by Color? into red, yellow and green subsets, with the Gini index of each subset combined into Gini(Color).]

Gain of Gini Index
[Figure: GiniGain(Color) = Gini(S) - Gini(Color).]
Three Impurity Measures
A        Gain(A)  GainRatio(A)  GiniGain(A)
Color    0.247    0.156         0.058
Outline  0.152    0.152         0.046
Dot      0.048    0.049         0.015

These impurity measures assess the effect of a single attribute.
The criterion of "most informative" that they define is local (and "myopic"): it does not reliably predict the effect of several attributes applied jointly.

Orange: Shapes Data Set
[shape.tab: a tab-delimited Orange data file with columns Color, Outline, Dot and Shape; the second line "d d d d" marks all four as discrete, the third line marks Shape as the class, and the remaining lines list the 14 examples shown above.]
Orange: Impurity Measures
import orange
data = orange.ExampleTable('shape')
gain = orange.MeasureAttribute_info
gainRatio = orange.MeasureAttribute_gainRatio
gini = orange.MeasureAttribute_gini

print
print "%15s %-8s %-8s %-8s" % ("name", "gain", "g ratio", "gini")
for attr in data.domain.attributes:
    print "%15s %4.3f %4.3f %4.3f" % \
        (attr.name, gain(attr, data), gainRatio(attr, data), gini(attr, data))

name       gain    g ratio  gini
Color      0.247   0.156    0.058
Outline    0.152   0.152    0.046
Dot        0.048   0.049    0.015

Orange: orngTree
import orange, orngTree
data = orange.ExampleTable('shape')
tree = orngTree.TreeLearner(data)
orngTree.printTxt(tree)
print '\nWith contingency vector:'
orngTree.printTxt(tree, internalNodeFields=['contingency'],
                  leafFields=['contingency'])

Color green:
    Outline dashed: triangle (100.0%)
    Outline solid: square (100.0%)
Color yellow: square (100.0%)
Color red:
    Dot : square (100.0%)
    Dot : triangle (100.0%)

With contingency vector:
Color (<5, 9>) green:
    Outline (<3, 2>) dashed: triangle (<3, 0>)
    Outline (<3, 2>) solid: square (<0, 2>)
Color (<5, 9>) yellow: square (<0, 4>)
Color (<5, 9>) red:
    Dot (<2, 3>) : square (<0, 3>)
    Dot (<2, 3>) : triangle (<2, 0>)
Orange: Saving to DOT
import orange, orngTree
data = orange.ExampleTable('shape')
tree = orngTree.TreeLearner(data)
orngTree.printDot(tree, 'shape.dot', leafShape='box', internalNodeShape='ellipse')

> dot -Tgif shape.dot > shape.gif

DB Miner: visualization
[Screenshot: decision tree visualization in DB Miner.]
SGI MineSet: visualization
[Screenshot: decision tree visualization in SGI MineSet.]