Chapter 5. Tree-based Methods
Wei Pan, Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455
Email: weip@biostat.umn.edu
PubH 7475/8475 © Wei Pan
Regression And Classification Tree (CART)
References: ESL 9.2; Breiman et al (1984); C4.5 (Quinlan 1993).
Main idea: approximate any f(x) by a piecewise-constant $\hat{f}(x)$, using recursive partitioning (Fig 9.2):
1) Partition the x space into two regions R_1 and R_2 by a split $x_j < c_j$;
2) Partition R_1 and R_2 in the same way;
3) Then their sub-regions, ..., until the model fits the data well.
$\hat{f}(x) = \sum_m c_m I(x \in R_m)$, which can be represented as a (decision) tree.
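The piecewise-constant form $\hat{f}(x) = \sum_m c_m I(x \in R_m)$ can be sketched directly. The course code is in R (ex5.1.r); the sketch below uses Python for illustration, with made-up cutpoints and constants for a single predictor x:

```python
# Illustrative sketch (not the lecture's code): a piecewise-constant
# fit over a hand-made partition of one predictor x, assuming the
# regions R_m are intervals and c_m is the constant for region m.
import numpy as np

def predict_piecewise(x, cutpoints, c):
    """Return c[m] for the region R_m containing each entry of x."""
    m = np.searchsorted(cutpoints, x, side="right")
    return c[m]

# Two cutpoints -> three regions R_1, R_2, R_3 with constants c_m.
cutpoints = np.array([0.3, 0.7])
c = np.array([1.0, 2.5, 0.5])
print(predict_piecewise(np.array([0.1, 0.5, 0.9]), cutpoints, c))
```

With axis-aligned splits on one variable, the "tree" is just a lookup of which interval x falls in; with more predictors the regions become boxes found by recursive binary splitting.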
Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap 9.
[FIGURE 9.2. Partitions and CART. Top right panel shows a partition of a two-dimensional feature space by recursive binary splitting, as used in CART, applied to some fake data. Top left panel shows a general partition.]
Regression Tree
Y: continuous. Key steps:
1) Determine the splitting variables and split points (e.g. $x_j < s$), giving regions R_1, R_2, ...;
2) Determine c_m in each R_m.
In 1), use a sequential (greedy) search: for each j and s, consider $R_1(j,s) = \{x : x_j < s\}$ and $R_2(j,s) = \{x : x_j \geq s\}$, and solve
$\min_{j,s} [\min_{c_1} \sum_{i: X_i \in R_1(j,s)} (Y_i - c_1)^2 + \min_{c_2} \sum_{i: X_i \in R_2(j,s)} (Y_i - c_2)^2]$.
In 2), given R_1 and R_2, $\hat{c}_k = \mathrm{Ave}(Y_i \mid X_i \in R_k)$ for k = 1, 2.
Repeat the process on R_1 and R_2 respectively, ...
When to stop? We must stop when the Y_i's in R_m are all equal or too few; the tree size gives a measure of model complexity!
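The greedy search in step 1) can be sketched for a single predictor as follows (Python for illustration; `best_split` and the toy data are hypothetical, not from ex5.1.r):

```python
# A minimal sketch of the greedy split search: for each candidate
# split point s of one predictor x, compute the two within-region
# sums of squares with c_k set to the region mean, and keep the best s.
import numpy as np

def best_split(x, y):
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (np.inf, None)
    for i in range(1, len(x)):           # split between x[i-1] and x[i]
        if x[i] == x[i - 1]:
            continue                     # no valid cutpoint here
        s = (x[i - 1] + x[i]) / 2
        yl, yr = y[:i], y[i:]
        rss = ((yl - yl.mean()) ** 2).sum() + ((yr - yr.mean()) ** 2).sum()
        if rss < best[0]:
            best = (rss, s)
    return best  # (minimal RSS, split point s)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.1, 5.0, 5.2])
print(best_split(x, y))   # splits between 2.0 and 3.0
```

With p predictors, the same scan runs over every j; the cost is modest because, for each j, the inner minimization over c_1 and c_2 is solved in closed form by the region means.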
A strategy: first grow a large tree, then prune it. Cost-complexity criterion for a tree T:
$C_\alpha(T) = \mathrm{RSS}(T) + \alpha|T| = \sum_m \sum_{i: X_i \in R_m} (Y_i - \hat{c}_m)^2 + \alpha|T|$,
where |T| is the number of terminal nodes (leaves) and α > 0 is a tuning parameter to be determined by CV.
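To see how α trades off fit against tree size, here is a small sketch with made-up (RSS, |T|) pairs for a nested sequence of pruned subtrees (the numbers are illustrative, not from any dataset):

```python
# Sketch of the cost-complexity trade-off: for nested subtrees with
# known RSS(T) and |T|, a larger alpha favors a smaller tree.
def c_alpha(rss, size, alpha):
    """Cost-complexity criterion C_alpha(T) = RSS(T) + alpha * |T|."""
    return rss + alpha * size

# (RSS, |T|) for a nested sequence of pruned subtrees (made up).
subtrees = [(10.0, 8), (12.0, 5), (18.0, 3), (30.0, 1)]

for alpha in (0.0, 2.0, 10.0):
    best = min(subtrees, key=lambda t: c_alpha(t[0], t[1], alpha))
    print(alpha, "-> |T| =", best[1])
```

At α = 0 the full tree wins; as α grows, the minimizer of C_α(T) moves through the pruned sequence toward the root, which is why CV over α selects the tree size.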
[FIGURE 9.4 (ESL Chap 9). Results for the spam example. The blue curve is the 10-fold cross-validation estimate of misclassification rate as a function of tree size, with standard error bars. The minimum occurs at a tree size with about 17 terminal nodes (using the one-standard-error rule). The orange curve is the test error, which tracks the CV error quite closely. The cross-validation is indexed by values of α, shown above. The tree sizes shown below refer to |T_α|, the size of the original tree.]
Classification Tree
$Y_i \in \{1, 2, ..., K\}$. Classify obs's in node m to the majority class:
$\hat{p}_{mk} = \sum_{i: X_i \in R_m} I(Y_i = k)/n_m$, $k(m) = \arg\max_k \hat{p}_{mk}$.
Impurity measure Q_m(T) (we used squared error in regression trees):
1. Misclassification error: $\frac{1}{n_m} \sum_{i: X_i \in R_m} I(Y_i \neq k(m)) = 1 - \hat{p}_{m,k(m)}$.
2. Gini index: $\sum_{k=1}^K \hat{p}_{mk}(1 - \hat{p}_{mk})$.
3. Cross-entropy or deviance: $-\sum_{k=1}^K \hat{p}_{mk} \log \hat{p}_{mk}$.
For K = 2, with $\hat{p}$ the proportion in one class, 1-3 reduce to $1 - \max(\hat{p}, 1 - \hat{p})$, $2\hat{p}(1 - \hat{p})$, and $-\hat{p}\log\hat{p} - (1 - \hat{p})\log(1 - \hat{p})$.
They look similar; see Fig 9.3.
Example: ex5.1.r
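The three impurity measures for K = 2 can be checked numerically as functions of $\hat{p}$ (a Python sketch; the function names are illustrative):

```python
# The three node-impurity measures for K = 2 as functions of p = p-hat:
# all vanish at p = 0 and p = 1 and peak at p = 1/2.
import numpy as np

def misclass(p):
    return 1 - np.maximum(p, 1 - p)

def gini(p):
    return 2 * p * (1 - p)

def entropy(p):
    # natural log, with the convention 0 * log 0 = 0
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -p * np.log(p) - (1 - p) * np.log(1 - p)
    return np.nan_to_num(h)

for p in (0.0, 0.25, 0.5):
    print(p, misclass(p), gini(p), entropy(p))
```

Gini and entropy are strictly concave in $\hat{p}$, while the misclassification error is piecewise linear; this is why Gini/entropy are preferred for growing the tree (they reward splits that make nodes purer even when the majority class does not change).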
Advantages:
1. Easy to incorporate unequal losses of misclassifications:
$\frac{1}{n_m} \sum_{i: X_i \in R_m} w_i I(Y_i \neq k(m))$ with $w_i = C_k$ if $Y_i = k$.
2. Handling missing data: use a surrogate splitting variable/value at each node (to best approximate the selected one).
Extensions:
1. May use non-binary splits;
2. Use a linear combination of multiple variables as a splitting variable: more flexible, but better?
+: easy interpretation (decision trees!)
-: unstable due to greedy search and discontinuity; predictive performance not the best.
R packages: tree, rpart; commercial CART. Other implementations: C4.5/C5.0; FIRM by Prof Hawkins (U of M): to detect interactions; by Prof Loh's group (UW-Madison): for count, survival, ... data; regression in each terminal node; ...
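Point 1 above (unequal losses) can be sketched as a weighted node error (Python for illustration; the loss weights are made up):

```python
# Sketch of the weighted misclassification error for one node:
# each class k carries a loss weight L[k] (values here are made up).
import numpy as np

def weighted_misclass(y, k_m, L):
    """Mean loss in a node classified to class k_m; y holds true labels."""
    w = np.array([L[k] for k in y])      # w_i = C_k if Y_i = k
    return np.mean(w * (y != k_m))

y = np.array([0, 0, 0, 1])    # node with 3 class-0 and 1 class-1 obs
L = {0: 1.0, 1: 5.0}          # missing a class-1 case is 5x worse
print(weighted_misclass(y, k_m=0, L=L))  # 5/4 = 1.25
print(weighted_misclass(y, k_m=1, L=L))  # 3/4 = 0.75
```

With equal losses the majority class (0) would be chosen, but the unequal weights flip the decision to class 1, which is exactly how loss weighting shifts trees toward the costly-to-miss class.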
[Figure (ESL Chap 9): the pruned classification tree for the spam example. The root node ("email", 600/1536) splits on ch$ < 0.0555; further splits use variables such as remove, hp, george, CAPAVE, ch!, 1999, free, CAPMAX, business, receive, edu, and our. Each node shows the predicted class (email/spam) and its misclassification counts.]
[FIGURE 9.6 (ESL Chap 9). ROC curves for the classification rules fit to the spam data. Curves that are closer to the northeast corner represent better classifiers. In this case the GAM classifier dominates the trees. The weighted tree achieves better sensitivity for higher specificity than the unweighted tree. The numbers in the legend (Tree 0.95, GAM 0.98, Weighted Tree 0.90) represent the area under the curve.]
Application: personalized medicine
Also called subgroup analysis (or precision medicine): identify the subgroups of patients that would benefit most from a treatment.
Statistical problem: detect (qualitative) treatment-predictor interactions!
Quantitative interactions: treatment effects differ in magnitude but are in the same direction; qualitative interactions: they differ in direction.
Many approaches ... one of them is to use trees.
Prof Loh's GUIDE: http://www.stat.wisc.edu/~loh/guide.html
An example: http://onlinelibrary.wiley.com/doi/10.1002/sim.6454/abstract
Another example: https://www.ncbi.nlm.nih.gov/pubmed/24983709