
1 Deep Boosting. Mehryar Mohri, Courant Institute & Google Research. Joint work with Corinna Cortes (Google Research), Vitaly Kuznetsov (Courant Institute), and Umar Syed (Google Research).

2 Deep Boosting Essence

3 Ensemble Methods in ML. Combining several base classifiers to create a more accurate one: Bagging (Breiman 1996); AdaBoost (Freund and Schapire 1997); Stacking (Smyth and Wolpert 1999); Bayesian averaging (MacKay 1996); other averaging schemes, e.g., (Freund et al. 2004). Often very effective in practice. Benefit of favorable learning guarantees.

4 Convex Combinations. Base classifier set $H$: boosting stumps, decision trees with limited depth or number of leaves. Ensemble combinations: convex hull of the base classifier set, $\mathrm{conv}(H) = \big\{\sum_{t=1}^{T} \alpha_t h_t : \alpha_t \geq 0;\ \sum_{t=1}^{T} \alpha_t \leq 1;\ \forall t, h_t \in H\big\}$.

5 Ensembles - Margin Bound (Koltchinskii and Panchenko, 2002). Theorem: let $H$ be a family of real-valued functions. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $f = \sum_{t=1}^{T} \alpha_t h_t \in \mathrm{conv}(H)$: $R(f) \leq \hat{R}_{S,\rho}(f) + \frac{2}{\rho}\,\mathfrak{R}_m(H) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}$, where $\hat{R}_{S,\rho}(f) = \frac{1}{m}\sum_{i=1}^{m} 1_{y_i f(x_i) \leq \rho}$.

6 Questions. Can we use a much richer or deeper base classifier set? Richer families are needed for difficult tasks, but the generalization bound indicates a risk of overfitting.

7 AdaBoost (Freund and Schapire, 1997). Description: coordinate descent applied to $F(\alpha) = \sum_{i=1}^{m} e^{-y_i f(x_i)} = \sum_{i=1}^{m} \exp\big(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\big)$. Guarantees: ensemble margin bound, but AdaBoost does not maximize the margin! Some margin-maximizing algorithms such as arc-gv are outperformed by AdaBoost! (Reyzin and Schapire, 2006)
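
For concreteness, here is a minimal Python sketch of this coordinate-descent view of AdaBoost, restricted to a finite pool of base classifiers; the function and argument names (`adaboost`, `base_classifiers`) are illustrative and not taken from the slides.

```python
import numpy as np

def adaboost(X, y, base_classifiers, T=100):
    """AdaBoost viewed as coordinate descent on F(alpha) = sum_i exp(-y_i f(x_i)).

    base_classifiers: finite pool of functions h with h(X) in {-1, +1}^m.
    y: labels in {-1, +1}.
    """
    m = len(y)
    D = np.ones(m) / m                                  # distribution over the sample
    alpha = np.zeros(len(base_classifiers))
    preds = np.array([h(X) for h in base_classifiers])  # (N, m) cached predictions

    for _ in range(T):
        # Best coordinate: the base classifier with smallest weighted error.
        errors = np.array([D[p != y].sum() for p in preds])
        j = int(np.argmin(errors))
        eps = errors[j]
        if eps == 0.0 or eps >= 0.5:                    # perfect or useless direction
            break
        # Closed-form step size for the exponential loss.
        step = 0.5 * np.log((1.0 - eps) / eps)
        alpha[j] += step
        # Multiplicative reweighting of the distribution, then renormalization.
        D *= np.exp(-step * y * preds[j])
        D /= D.sum()

    def predict(X_new):
        scores = np.zeros(X_new.shape[0])
        for a, h in zip(alpha, base_classifiers):
            if a != 0.0:
                scores += a * h(X_new)
        return np.sign(scores)

    return predict, alpha
```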

8 Suspicions. Complexity of the hypotheses used: arc-gv tends to use deeper decision trees to achieve a larger margin. Notion of margin: the minimal margin is perhaps not the appropriate notion; the margin distribution is key. Can we shed more light on these questions?

9 Question. Main question: how can we design ensemble algorithms that can succeed even with very deep decision trees or other complex sets? Theory, algorithms, experimental results, model selection.

10 Theory

11 Base Classifier Set H. Decomposition in terms of sub-families $H_1, H_2, \ldots, H_p$ or their unions $H_1,\ H_1 \cup H_2,\ \ldots,\ H_1 \cup \cdots \cup H_p$.

12 Ensemble Family. Non-negative linear ensembles $F = \mathrm{conv}\big(\cup_{k=1}^{p} H_k\big)$: $f = \sum_{t=1}^{T} \alpha_t h_t$ with $\alpha_t \geq 0$, $\sum_{t=1}^{T} \alpha_t \leq 1$, $h_t \in H_{k_t}$.

13 Ideas. Use hypotheses drawn from $H_k$'s with larger $k$'s, but allocate more weight to hypotheses drawn from smaller $k$'s. How can we determine quantitatively the amount of mixture weight apportioned to different families? Can we provide learning guarantees guiding these choices?

14 Learning Guarantee (Cortes, MM, and Syed, 2014). Theorem: Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $f = \sum_{t=1}^{T} \alpha_t h_t \in F$: $R(f) \leq \hat{R}_{S,\rho}(f) + \frac{4}{\rho}\sum_{t=1}^{T} \alpha_t\, \mathfrak{R}_m(H_{k_t}) + O\Big(\sqrt{\frac{\log p}{\rho^2 m}}\Big)$.

15 Consequences. Complexity term with an explicit dependency on the mixture weights: a quantitative guide for controlling the weights assigned to more complex sub-families. The bound can be used to inspire, or directly define, an ensemble algorithm.

16 Algorithms

17 Set-Up. $H_1, \ldots, H_p$: disjoint sub-families of functions taking values in $[-1, +1]$. Further assumption (not necessary): symmetric sub-families, i.e., $h \in H_k \Rightarrow -h \in H_k$. Notation: $r_j = \mathfrak{R}_m(H_{k_j})$, with $h_j \in H_{k_j}$.

18 Derivation. The learning bound suggests seeking to minimize $\frac{1}{m}\sum_{i=1}^{m} 1_{y_i \sum_{t=1}^{T}\alpha_t h_t(x_i) \leq \rho} + \frac{4}{\rho}\sum_{t=1}^{T} \alpha_t r_t$, with $\alpha_t \geq 0$ and $\sum_{t=1}^{T}\alpha_t \leq 1$.

19 Convex Surrogates. Let $\Phi$ be a decreasing convex function upper bounding $u \mapsto 1_{u \leq 0}$, with $\Phi$ differentiable. Two principal choices: Exponential loss: $\Phi(-u) = \exp(-u)$. Logistic loss: $\Phi(-u) = \log_2(1 + \exp(-u))$.
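
As a small illustration (not from the slides), the two surrogates can be written as follows; the assertions simply check that both dominate the 0-1 indicator of the margin $u = y\,f(x)$.

```python
import numpy as np

# Convex, decreasing surrogates Phi(-u) upper bounding the 0-1 indicator 1_{u <= 0}.
def exponential_surrogate(u):
    """Phi(-u) = exp(-u)."""
    return np.exp(-u)

def logistic_surrogate(u):
    """Phi(-u) = log_2(1 + exp(-u))."""
    return np.log2(1.0 + np.exp(-u))

u = np.linspace(-2.0, 2.0, 9)
indicator = (u <= 0).astype(float)
assert np.all(exponential_surrogate(u) >= indicator)
assert np.all(logistic_surrogate(u) >= indicator)
```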

20 Optimization Problem (Cortes, MM, and Syed, 2014). Moving the constraint to the objective and using the fact that the sub-families are symmetric leads to: $\min_{\alpha \geq 0} \frac{1}{m}\sum_{i=1}^{m}\Phi\Big(1 - y_i \sum_{j=1}^{N}\alpha_j h_j(x_i)\Big) + \sum_{j=1}^{N}(\lambda r_j + \beta)\,\alpha_j$, where $\lambda, \beta \geq 0$, and for each hypothesis, we keep either $h$ or $-h$.

21 DeepBoost Algorithm. Coordinate descent applied to the convex objective; since the objective is non-differentiable, a notion of maximum coordinate descent direction must be defined.

22 Direction & Step Maximum direction: definition based on the error t,j = E [y i h j (x i )], i D t where D t is the distribution over sample at iteration t. Step: closed-form expressions for exponential and logistic losses. general case: line search. page 22

23 Pseudocode. DeepBoost pseudocode, with per-hypothesis penalty $\Lambda_j = \lambda r_j + \beta$.
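
The full pseudocode appears on the original slide and in the paper; below is a minimal, illustrative Python sketch of a DeepBoost-style coordinate-descent loop for the exponential loss. The selection rule and the bounded line search are simplifications of the paper's exact update, and the names (`deepboost`, `lam`, `beta`) are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def deepboost(X, y, hypotheses, r, lam=1e-4, beta=1e-4, T=100):
    """Illustrative DeepBoost-style coordinate descent for the exponential loss.

    hypotheses: pool of base functions h with h(X) in {-1, +1}^m, drawn from
                sub-families of increasing complexity.
    r:          r[j] = complexity estimate (e.g. Rademacher bound) of the
                sub-family containing hypotheses[j].
    """
    m, N = len(y), len(hypotheses)
    preds = np.array([h(X) for h in hypotheses])         # (N, m) cached predictions
    penalties = lam * np.asarray(r, dtype=float) + beta  # Lambda_j = lam * r_j + beta
    alpha = np.zeros(N)

    def objective(a):
        margins = y * (a @ preds)
        return np.mean(np.exp(-margins)) + np.sum(penalties * np.abs(a))

    for _ in range(T):
        # Distribution over examples induced by the current ensemble.
        w = np.exp(-y * (alpha @ preds))
        D = w / w.sum()
        # Pick the coordinate whose penalized correlation with D is largest.
        corr = preds @ (D * y)                            # ~ epsilon_{t,j}
        scores = np.abs(corr) - penalties
        j = int(np.argmax(scores))
        if scores[j] <= 0.0:                              # no coordinate looks useful
            break
        # General-case step: one-dimensional line search along coordinate j.
        def obj_along_j(eta):
            a = alpha.copy()
            a[j] += eta
            return objective(a)
        step = minimize_scalar(obj_along_j, bounds=(-2.0, 2.0), method="bounded").x
        alpha[j] += step

    return alpha
```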

24 Connections with Previous Work. For $\lambda = \beta = 0$, DeepBoost coincides with: AdaBoost (Freund and Schapire 1997), run with the union of sub-families, for the exponential loss; additive Logistic Regression (Friedman et al., 1998), run with the union of sub-families, for the logistic loss. For $\lambda = 0$ and $\beta \neq 0$, DeepBoost coincides with: L1-regularized AdaBoost (Raetsch, Mika, and Warmuth 2001), for the exponential loss; L1-regularized Logistic Regression (Duchi and Singer 2009), for the logistic loss.

25 Experiments

26 Rad. Complexity Estimates. Benefit of the data-dependent analysis: empirical estimates of each $\mathfrak{R}_m(H_k)$. Example: for a kernel function $K_k$, $\hat{\mathfrak{R}}_S(H_k) \leq \frac{\sqrt{\mathrm{Tr}[K_k]}}{m}$. Alternatively, upper bounds in terms of growth functions: $\mathfrak{R}_m(H_k) \leq \sqrt{\frac{2\log \Pi_{H_k}(m)}{m}}$.

27 Experiments (1). Family of base classifiers defined by boosting stumps. $H_1^{\mathrm{stumps}}$: boosting stumps (threshold functions); in dimension $d$, $\Pi_{H_1^{\mathrm{stumps}}}(m) \leq 2md$, thus $\mathfrak{R}_m(H_1^{\mathrm{stumps}}) \leq \sqrt{\frac{2\log(2md)}{m}}$. $H_2^{\mathrm{stumps}}$: decision trees of depth 2 with the same question at the internal nodes of depth 1; in dimension $d$, $\Pi_{H_2^{\mathrm{stumps}}}(m) \leq (2m)^2 \frac{d(d-1)}{2}$, thus $\mathfrak{R}_m(H_2^{\mathrm{stumps}}) \leq \sqrt{\frac{2\log(2m^2 d(d-1))}{m}}$.
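
A minimal sketch (illustrative values, not from the slides) computing these growth-function-based complexity estimates via the generic bound $\mathfrak{R}_m(H) \leq \sqrt{2\log\Pi_H(m)/m}$; the function names are assumptions for illustration.

```python
import numpy as np

def rademacher_from_growth(growth_value, m):
    """Generic bound: R_m(H) <= sqrt(2 * log(Pi_H(m)) / m)."""
    return np.sqrt(2.0 * np.log(growth_value) / m)

def stump_family_bounds(m, d):
    """Growth-function bounds for the two stump-based families above."""
    growth_h1 = 2 * m * d                        # threshold stumps in dimension d
    growth_h2 = (2 * m) ** 2 * d * (d - 1) / 2   # depth-2 trees, same question at depth 1
    return (rademacher_from_growth(growth_h1, m),
            rademacher_from_growth(growth_h2, m))

# Example: the richer family H_2^stumps receives a larger complexity estimate.
r1, r2 = stump_family_bounds(m=1000, d=20)
print(r1, r2)   # r1 < r2
```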

28 Experiments (1). Base classifier set: $H_1^{\mathrm{stumps}} \cup H_2^{\mathrm{stumps}}$. Data sets: same UCI Irvine data sets as (Breiman 1999) and (Reyzin and Schapire 2006); OCR data sets used by (Reyzin and Schapire 2006): ocr17, ocr49; MNIST data sets: ocr17-mnist, ocr49-mnist. Experiments with the exponential loss. Comparison with AdaBoost and AdaBoost-L1.

29 Experiments - Stumps Exp Loss (Cortes, MM, and Syed, 2014). Table: results for boosted decision stumps and the exponential loss function, reporting error (with standard deviation), average tree size, and average number of trees for AdaBoost with $H_1^{\mathrm{stumps}}$, AdaBoost with $H_2^{\mathrm{stumps}}$, AdaBoost-L1, and DeepBoost on breastcancer, ionosphere, german, diabetes, ocr17, ocr49, ocr17-mnist, and ocr49-mnist.

30 Experiments (2). Family of base classifiers defined by decision trees of depth $k$. For trees with at most $n$ nodes: $\mathfrak{R}_m(T_n) \leq \sqrt{\frac{(4n+2)\log_2(d+2)\log(m+1)}{m}}$. Base classifier set: $\cup_{k=1}^{K} H_k^{\mathrm{trees}}$. Same data sets as in Experiments (1). Both exponential and logistic losses. Comparison with AdaBoost and AdaBoost-L1, Logistic Regression and L1-Logistic Regression.
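
A small illustrative sketch of this complexity estimate for the depth-$k$ tree families; the mapping from depth to node count ($2^k - 1$ internal nodes of a full binary tree) is an assumption used only for illustration.

```python
import numpy as np

def tree_rademacher_bound(n_nodes, d, m):
    """Bound from the slide: R_m(T_n) <= sqrt((4n + 2) * log2(d + 2) * log(m + 1) / m)."""
    return np.sqrt((4 * n_nodes + 2) * np.log2(d + 2) * np.log(m + 1) / m)

# Deeper sub-families H_k^trees get larger complexity estimates r_k.
m, d = 1000, 20
for k in range(1, 6):
    n_nodes = 2 ** k - 1          # assumed node count for a full binary tree of depth k
    print(k, tree_rademacher_bound(n_nodes, d, m))
```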

31 Experiments - Trees Exp Loss (Cortes, MM, and Syed, 2014). Table: error (with standard deviation), average tree size, and average number of trees for AdaBoost, AdaBoost-L1, and DeepBoost with the exponential loss on breastcancer, ionosphere, german, diabetes, ocr17, ocr49, ocr17-mnist, and ocr49-mnist.

32 Experiments - Trees Log Loss (Cortes, MM, and Syed, 2014). Table: error (with standard deviation), average tree size, and average number of trees for LogReg, LogReg-L1, and DeepBoost with the logistic loss on breastcancer, ionosphere, german, diabetes, ocr17, ocr49, ocr17-mnist, and ocr49-mnist.

33 Multi-Class Learning Guarantee (Kuznetsov, MM, and Syed, 2014), with $c$ the number of classes. Theorem: Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $f = \sum_{t=1}^{T}\alpha_t h_t \in F$: $R(f) \leq \hat{R}_{S,\rho}(f) + \frac{8c}{\rho}\sum_{t=1}^{T}\alpha_t\, \mathfrak{R}_m(\Pi_1(H_{k_t})) + \tilde{O}\Big(\sqrt{\frac{\log p}{\rho^2 m}}\Big)$, where $\Pi_1(H_k) = \{x \mapsto h(x, y) : y \in \mathcal{Y},\ h \in H_k\}$.

34 Extension to Multi-Class. Similar data-dependent learning guarantee proven for the multi-class setting: bound depending on the mixture weights and the complexity of the sub-families. Deep Boosting algorithm for multi-class: similar extension taking into account the complexities of the sub-families; several variants depending on the number of classes; different possible loss functions for each variant.

35 Experiments - Multi-Class. Table: empirical results for MDeepBoostSum, $\Phi = \exp$ (AB stands for AdaBoost), reporting error (with standard deviation), average tree size, and average number of trees for AB.MR, AB.MR-L1, and MDeepBoost on abalone, handwritten, letters, pageblocks, pendigits, satimage, statlog, and yeast.

36 Experiments - Multi-Class. Table: empirical results for MDeepBoostCompSum, compared with multinomial logistic regression, reporting error (with standard deviation), average tree size, and average number of trees for LogReg, LogReg-L1, and MDeepBoost on abalone, handwritten, letters, pageblocks, pendigits, satimage, statlog, and yeast.

37 Other Related Algorithms. Structural Maxent models (Cortes, Kuznetsov, MM, and Syed, ICML 2015): feature functions chosen from a union of very complex families. Deep Cascades (DeSalvo, MM, and Syed, ALT 2015): cascade of predictors with leaf predictors and node questions selected from very rich families.

38 Model Selection

39 Model Selection. Problem: how to select the hypothesis set $H$? $H$ too complex: no generalization bound, overfitting. $H$ too simple: generalization bound, but underfitting. Balance between estimation and approximation errors. Figure: hypothesis set $H$ with the best-in-class hypothesis $h^*$ and the Bayes classifier $h_{\mathrm{Bayes}}$.

40 Structural Risk Minimization (Vapnik and Chervonenkis, 1974; Vapnik, 1995). SRM: $H = \cup_{k=1}^{\infty} H_k$ with $H_1 \subset H_2 \subset \cdots$. Solution: $f^* = \mathrm{argmin}_{h \in H_k,\, k \geq 1}\ \hat{R}_S(h) + \mathrm{pen}(k, m)$, i.e., training error + penalty. Figure: error as a function of complexity, showing the training error, the penalty, and their sum.

41 Voted Risk Minimization. Ideas: no selection of a specific $H_k$; instead, use all $H_k$'s: $h = \sum_{k=1}^{p} \alpha_k h_k$, $h_k \in H_k$. Hypothesis-dependent penalty: $\sum_{k=1}^{p} \alpha_k\, \mathfrak{R}_m(H_k)$. Deep ensembles.

42 Conclusion. Deep Boosting: ensemble learning with increasingly complex families; data-dependent theoretical analysis; algorithm based on the learning bound; extension to multi-class; ranking and other losses; enhancement of many existing algorithms; compares favorably to AdaBoost and Logistic Regression or their L1-regularized variants in experiments.
