Decision Trees. Query Selection
CART Query Selection. Key question: given a partial tree down to node N, what split s should we choose for the property test T? The obvious heuristic is to choose the split that yields as large a decrease in impurity as possible. The impurity gradient is

Δi(N) = i(N) − P_L · i(N_L) − (1 − P_L) · i(N_R),   (5)

where N_L and N_R are the left and right descendants, respectively, and P_L is the fraction of the data at N that will go to the left sub-tree when the property test T is used. The strategy is then to choose the split that maximizes Δi(N).
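As a concrete illustration (not from the lecture), here is a minimal Python sketch of this greedy query selection for numeric features, using the Gini impurity; the function and variable names are illustrative, and real implementations sort each feature rather than trying every unique value.

```python
import numpy as np

def gini(labels):
    """Gini impurity i(N) of the class labels reaching a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedy CART-style query selection: for every feature and every candidate
    threshold, compute the impurity drop of Eq. (5) and keep the maximizer."""
    best_feature, best_threshold, best_drop = None, None, -np.inf
    parent = gini(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            p_left = len(left) / len(y)
            drop = parent - p_left * gini(left) - (1 - p_left) * gini(right)
            if drop > best_drop:
                best_feature, best_threshold, best_drop = j, t, drop
    return best_feature, best_threshold, best_drop
```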
CART: What can we say about this strategy?

For the binary case, it yields a one-dimensional optimization problem (which may have non-unique optima). For higher branching factors, it yields a higher-dimensional optimization problem.

In multi-class binary tree creation, we would want to use the twoing criterion. The goal is to find the split that best separates groups of the c categories: a candidate supercategory C_1 consists of all patterns in some subset of the categories, and C_2 has the remainder. When searching for the split s, we therefore also need to search over possible category groupings (see the sketch below).

This is a local, greedy optimization strategy, so there is no guarantee that we obtain either the global optimum (in classification accuracy) or the smallest tree.

In practice, it has been observed that the particular choice of impurity function rarely affects the final classifier and its accuracy.
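As a hedged aside, the explicit search over category groupings can be sidestepped using the twoing value from Breiman et al.'s CART book, (P_L · P_R / 4) · (Σ_j |p(j|L) − p(j|R)|)². This closed form is not stated on the slides, so treat the sketch below as an assumption-laden illustration rather than the lecture's method.

```python
import numpy as np

def twoing_value(left_labels, right_labels, classes):
    """Twoing value of a binary split (Breiman et al.): larger values mean the
    split better separates the c classes into two supercategories."""
    left_labels = np.asarray(left_labels)
    right_labels = np.asarray(right_labels)
    n_left, n_right = len(left_labels), len(right_labels)
    n = n_left + n_right
    if n_left == 0 or n_right == 0:
        return 0.0
    p_left, p_right = n_left / n, n_right / n
    # Class-conditional proportions p(j|L) and p(j|R) in the two children.
    pj_left = np.array([np.mean(left_labels == c) for c in classes])
    pj_right = np.array([np.mean(right_labels == c) for c in classes])
    return (p_left * p_right / 4.0) * np.sum(np.abs(pj_left - pj_right)) ** 2
```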
CART: A Note About Multiway Splits. When selecting a multiway split with branching factor B, the direct generalization of the impurity gradient is

Δi(s) = i(N) − Σ_{k=1}^{B} P_k · i(N_k).   (6)

This direct generalization is biased toward higher branching factors; to see this, consider the case of uniform splitting. We therefore normalize each candidate by the entropy of the split itself:

Δi_B(s) = Δi(s) / ( − Σ_{k=1}^{B} P_k log P_k ).   (7)

We can then again choose the split that maximizes this normalized criterion.
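A minimal sketch of Eq. (7) in Python, assuming the entropy impurity and non-empty branches; the helper names are illustrative:

```python
import numpy as np

def entropy(labels):
    """Entropy impurity of a set of class labels (base-2 log)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def normalized_gain(parent_labels, child_label_lists):
    """Impurity drop of a multiway split, normalized by the split entropy
    (Eq. 7) so that higher branching factors are not automatically favored."""
    n = len(parent_labels)
    probs = np.array([len(c) / n for c in child_label_lists])
    drop = entropy(parent_labels) - sum(p * entropy(c)
                                        for p, c in zip(probs, child_label_lists))
    split_info = -np.sum(probs * np.log2(probs))  # assumes every branch is non-empty
    return drop / split_info
```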
CART: When to Stop Splitting? If we continue to grow the tree until each leaf node reaches its lowest impurity (a single training sample per leaf), we will likely have overfit the training data, and the tree will almost certainly not generalize well. Conversely, if we stop growing the tree too early, the error on the training data will not be sufficiently low and performance will again suffer. So, how do we decide when to stop splitting?
CART: Stopping by Thresholding the Impurity Gradient. Splitting is stopped if the best candidate split at a node reduces the impurity by less than a preset amount β:

max_s Δi(s) ≤ β.   (8)

Benefit 1: Unlike cross-validation, the tree is trained on the complete training data set.
Benefit 2: Leaf nodes can lie at different levels of the tree, which is desirable whenever the complexity of the data varies throughout the range of values.
Drawback: how do we set the value of the threshold β?
CART: Stopping with a Complexity Term. Define a new global criterion function

α · size + Σ_{leaf nodes} i(N),   (9)

which trades complexity against accuracy. Here, size could represent the number of nodes or links, and α is some positive constant. The strategy is then to split until a minimum of this global criterion function has been reached. With the entropy impurity, this global measure is related to the minimum description length principle: the sum of the impurities at the leaf nodes is a measure of the uncertainty in the training data given the model represented by the tree, and the size term penalizes the complexity of that model.
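A tiny sketch of evaluating Eq. (9) while growing the tree; the value of α and the bookkeeping of leaf impurities are assumptions for illustration:

```python
def global_criterion(tree_size, leaf_impurities, alpha=0.01):
    """Complexity-penalized criterion of Eq. (9): alpha * size plus the total
    impurity over the current leaf nodes."""
    return alpha * tree_size + sum(leaf_impurities)
```

Under this scheme, a candidate split that replaces a leaf with two children is accepted only while it lowers the criterion; growth stops once no split does.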
CART: Stopping by Testing Statistical Significance. During construction, estimate the distribution of the impurity gradients Δi for the current collection of nodes. For any candidate split, test whether its Δi is statistically significantly different from zero; one possibility is the chi-squared test. More generally, we can take a hypothesis-testing approach to stopping: we seek to determine whether a candidate split differs significantly from a random split. Suppose we have n samples at node N. A particular split s sends Pn patterns to the left branch and (1 − P)n patterns to the right branch. A random split with the same proportions would send P·n_1 of the ω_1 samples and P·n_2 of the ω_2 samples to the left, with the corresponding amounts going to the right.
CART: The chi-squared statistic measures the deviation of a particular split s from this random one:

χ² = Σ_{i=1}^{2} (n_{iL} − n_{ie})² / n_{ie},   (10)

where n_{iL} is the number of ω_i patterns sent to the left under s, and n_{ie} = P·n_i is the number expected under the random rule. The larger the chi-squared statistic, the more the candidate split deviates from a random one. When it exceeds a critical value (based on the desired significance level), we reject the null hypothesis (the random split) and proceed with s.
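A minimal sketch of this test, assuming the scipy chi-squared distribution for the critical value and c − 1 degrees of freedom (a choice not stated on the slides):

```python
from scipy.stats import chi2

def split_is_significant(n_left_per_class, n_total_per_class, p_left, alpha=0.05):
    """Chi-squared test of Eq. (10): compare the per-class counts a candidate
    split sends left with the counts a random split (fraction p_left to the
    left) would be expected to send; reject the null hypothesis of a random
    split when the statistic exceeds the critical value."""
    stat = 0.0
    for n_il, n_i in zip(n_left_per_class, n_total_per_class):
        n_ie = p_left * n_i                  # expected count under the random rule
        stat += (n_il - n_ie) ** 2 / n_ie
    dof = len(n_total_per_class) - 1         # assumed degrees of freedom
    return stat > chi2.ppf(1 - alpha, dof)
```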
CART Pruning. Tree construction based on deciding when to stop splitting biases the learning algorithm toward trees in which the greatest impurity reduction occurs near the root; it makes no attempt to look ahead at what splits may occur at the leaves and beyond. Pruning is the principal alternative strategy for tree construction. In pruning, we first grow the tree exhaustively. Then, all pairs of neighboring leaf nodes are considered for elimination: any pair whose removal yields only a satisfactorily small increase in impurity is eliminated, and the common ancestor node is declared a leaf. Unbalanced trees often result from this style of pruning/merging. Pruning avoids the local-ness of the earlier methods and uses all of the training data, but it does so at added computational cost during tree construction.
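A rough sketch of this merge-based pruning on a hypothetical dict-based tree representation; the node layout and threshold are assumptions, and a full implementation would repeat the pass until no more merges occur:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def prune(node, threshold):
    """Collapse a node whose two children are leaves when merging them raises
    the weighted impurity by less than `threshold`. Nodes are assumed to be
    dicts: leaves are {'leaf': True, 'labels': [...]}, internal nodes carry
    'left' and 'right' children."""
    if node.get('leaf'):
        return node
    node['left'] = prune(node['left'], threshold)
    node['right'] = prune(node['right'], threshold)
    if node['left'].get('leaf') and node['right'].get('leaf'):
        merged = node['left']['labels'] + node['right']['labels']
        p_left = len(node['left']['labels']) / len(merged)
        before = (p_left * gini(node['left']['labels'])
                  + (1 - p_left) * gini(node['right']['labels']))
        if gini(merged) - before < threshold:
            return {'leaf': True, 'labels': merged}
    return node
```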
ID3 Method. ID3 is another tree-growing method. It assumes nominal inputs. Every split has a branching factor B_j, where B_j is the number of discrete attribute bins of the variable j chosen for splitting; splits are therefore seldom binary. The number of levels in the tree equals the number of input variables. The algorithm continues until all nodes are pure or there are no more variables on which to split. One can follow this with pruning. A sketch follows below.
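A compact sketch of ID3 on nominal data, assuming rows are dicts mapping attribute names to values and using information gain with the entropy impurity; this is illustrative, not the lecture's code:

```python
import math
from collections import Counter

def entropy(rows, target):
    """Entropy of the target labels in a list of row dicts."""
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def id3(rows, attributes, target):
    """Grow the tree until nodes are pure or no attributes remain; each split
    uses every value of the chosen attribute as a branch (branching factor B_j)."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    def gain(a):
        remainder = 0.0
        for v in set(r[a] for r in rows):
            subset = [r for r in rows if r[a] == v]
            remainder += len(subset) / len(rows) * entropy(subset, target)
        return entropy(rows, target) - remainder
    best = max(attributes, key=gain)
    return {best: {v: id3([r for r in rows if r[best] == v],
                          [a for a in attributes if a != best], target)
                   for v in set(r[best] for r in rows)}}
```

Applied to the PlayTennis table below, this should reproduce the tree shown later: Outlook at the root, then Humidity under Sunny and Wind under Rain.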
C4.5 Method (in brief). C4.5 is a successor to the ID3 method. It handles real-valued variables in the way CART does and uses ID3-style multiway splits for nominal data.
Example from T. Mitchell's book: PlayTennis

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
Example: Which attribute is the best classifier? The full set S contains [9+, 5−] examples, with entropy E = 0.940. Splitting on Humidity: High gives [3+, 4−] (E = 0.985) and Normal gives [6+, 1−] (E = 0.592), so Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151. Splitting on Wind: Weak gives [6+, 2−] (E = 0.811) and Strong gives [3+, 3−] (E = 1.00), so Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.00 = 0.048.
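The slide's numbers can be checked directly; a short Python script (the data layout is an assumption, but the values come from the table above):

```python
import math
from collections import Counter

data = [  # (Outlook, Temperature, Humidity, Wind, PlayTennis), D1..D14
    ("Sunny","Hot","High","Weak","No"),          ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),      ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),       ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),      ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),    ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),    ("Rain","Mild","High","Strong","No"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def gain(rows, attr):
    i = ATTRS[attr]
    remainder = 0.0
    for v in set(r[i] for r in rows):
        subset = [r for r in rows if r[i] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

for a in ATTRS:
    print(a, round(gain(data, a), 3))
# Prints Humidity 0.151 and Wind 0.048 as on the slide; Outlook has the
# largest gain, which is why it becomes the root test on the next slide.
```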
Example: the root tests Outlook, splitting {D1, D2, ..., D14} [9+, 5−] into Sunny: {D1, D2, D8, D9, D11} [2+, 3−], Overcast: {D3, D7, D12, D13} [4+, 0−] (pure, so a Yes leaf), and Rain: {D4, D5, D6, D10, D14} [3+, 2−]. Which attribute should be tested at the Sunny node? With S_sunny = {D1, D2, D8, D9, D11}:
Gain(S_sunny, Humidity) = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
Gain(S_sunny, Temperature) = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
Gain(S_sunny, Wind) = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019
Learned Tree (Decision Trees Example). The root tests Outlook: Sunny leads to a Humidity test (High → No, Normal → Yes); Overcast leads directly to Yes; Rain leads to a Wind test (Strong → No, Weak → Yes).
Example: Overfitting. Consider adding a new, noisy training example #15: Sunny, Hot, Normal, Strong, PlayTennis = No. What effect would it have on the tree learned earlier?