Globally Induced Forest: A Prepruning Compression Scheme

Size: px

Start display at page:

Download "Globally Induced Forest: A Prepruning Compression Scheme"

Toby Shelton
5 years ago
Views:

1 Globally Induced Forest: A Prepruning Compression Scheme Jean-Michel Begon, Arnaud Joly, Pierre Geurts Systems and Modeling, Dept. of EE and CS, University of Liege, Belgium ICML 2017

2 Goal and motivation What? Is it possible to build accurate yet lightweight random forests without building the whole model first? Why? Random forests are heavy models memory-wise : Number of nodes in a tree is (at worst) linear with the size of the data ; number of required trees grows with the problem complexity. What for? Big data ; small memory devices ; better interpretability, less overfitting, faster prediction,... How? Goal GIF algorithm Results Conclusion 1 / 16

3 GIF algorithm High level view Build an additive model corresponding to a forest by introducing decision nodes sequentially until a node budget constraint is met. Splits in decision nodes are optimized locally. Nodes are taken from a candidate list. The best node is added together with its weight. Goal GIF algorithm Results Conclusion 2 / 16

4 GIF algorithm High level view Build an additive model corresponding to a forest by introducing decision nodes sequentially until a node budget constraint is met. Splits in decision nodes are optimized locally. Nodes are taken from a candidate list. The best node is added together with its weight. Goal GIF algorithm Results Conclusion 2 / 16

5 GIF algorithm Additive model The model prediction ŷ (t) (x) at step t for instance x is given by : t ŷ (t) (x) = ŷ (t 1) (x) + λw t z t (x) = w 0 + λ w τ z τ (x) (1) τ=1 where w 0 is some initial bias. w τ is the weight of node j (1 τ t). λ is the learning rate. { 1, if x reaches node τ z τ (x) = 0, otherwise (1 τ t) i.e. node τ indicator function For classification, the sum of weights represents the class probability vector (i.e. the weights are multidimensional). Goal GIF algorithm Results Conclusion 3 / 16

GIF algorithm Additive model x x 0.0 0.0-1.

0 ŷ(x) = w 0 + λ(1.4 + 1.2) + λ( 0.7 + 1.

6 GIF algorithm Additive model x x ŷ(x) = w 0 + λ( ) + λ( ) (2) Goal GIF algorithm Results Conclusion 4 / 16

7 GIF algorithm High level view Build an additive model corresponding to a forest by introducing decision nodes sequentially until a node budget constraint is met. Splits in decision nodes are optimized locally. Nodes are taken from a candidate list. The best node is added together with its weight. Goal GIF algorithm Results Conclusion 5 / 16

8 GIF algorithm Illustration (regression) Node belonging to the model Hypothe=cal unpruned trees ŷ( x) = w 0 + λw 9 z 9 (x) Candidate node Goal GIF algorithm Results Conclusion 6 / 16

9 GIF algorithm Illustration (regression) Node belonging to the model Hypothe=cal unpruned trees ŷ( x) = w 0 + λw 9 z 9 (x) Candidate node Randomly preselected candidate node Goal GIF algorithm Results Conclusion 6 / 16

10 GIF algorithm Illustration (regression) Δerr Node belonging to the model Hypothe=cal unpruned trees ŷ( x) = w 0 + λw 9 z 9 (x) Candidate node Randomly preselected candidate node Goal GIF algorithm Results Conclusion 6 / 16

11 GIF algorithm Illustration (regression) Chosen node Δerr Node belonging to the model Hypothe=cal unpruned trees ŷ( x) = w 0 + λw 9 z 9 (x) Candidate node Randomly preselected candidate node Goal GIF algorithm Results Conclusion 6 / 16

12 GIF algorithm Illustration (regression) Node belonging to the model Hypothe=cal unpruned trees ŷ( x) = w 0 + λw 9 z 9 (x)+ λw 6 z 6 (x) Candidate node Goal GIF algorithm Results Conclusion 6 / 16

13 GIF algorithm Illustration (regression) Node belonging to the model Hypothe=cal unpruned trees ŷ( x) = w 0 + λw 9 z 9 (x)+ λw 6 z 6 (x) Candidate node Goal GIF algorithm Results Conclusion 6 / 16

14 GIF algorithm High level view Build an additive model corresponding to a forest by introducing decision nodes sequentially until a node budget constraint is met. Splits in decision nodes are optimized locally. Nodes are taken from a candidate list. The best node is added together with its weight. Candidate list Each time a node is added to the model, the learning instances reaching that node are split according to a local criterion. The resulting children are placed in the candidate list. Goal GIF algorithm Results Conclusion 7 / 16

15 GIF algorithm High level view Build an additive model corresponding to a forest by introducing decision nodes sequentially until a node budget constraint is met. Splits in decision nodes are optimized locally. Nodes are taken from a candidate list. The best node is added together with its weight. Goal GIF algorithm Results Conclusion 8 / 16

16 GIF algorithm Node selection The best node j, together with its optimal weight wj, are the ones that minimize some loss L over the training set (x i, y i ) N i=1 : (j, wj ) = arg min j C t,w R K N i=1 ( ) L y i, ŷ (t 1) (x i ) + wz j (x i ) where C t is the subsample of candidates. This problem is solved in two steps 1. for a candidate j, compute the best weight w j : w j = arg min w R K N i=1 ( ) L y i, ŷ (t 1) (x i ) + wz j (x i ) 2. select the best candidate with exhaustive search : N ( ) j = arg min j C t L y i, ŷ (t 1) (x i ) + wj z j (x i ) i=1 (3) (4) (5) Goal GIF algorithm Results Conclusion 9 / 16

17 GIF algorithm Boostrapping Candidate list Since all root nodes see all the learning examples, they would produce the same loss reduction. To increase diversity, we grow T stumps and use the leaves to form the initial candidate list. Initial bias The initial bias w 0 is the best constant that fits the training set w 0 = arg min y R K N L(y i, y) (6) i=1 Goal GIF algorithm Results Conclusion 10 / 16

18 Results Experimental setting 1. Grow a forest of a thousand fully-developed Extremely randomized trees (ET 100% ) and count the number of nodes M. 2. Compare how different methods fare (in average over ten runs) under a constraint of 1% and 10% of that budget. GIF x% grow the forest of a thousand trees until the node budget is met with the GIF algorithm. RAND x% grow a forest of a thousand trees randomly. ET x% grow only 10x fully-developed trees. We used the following values for the hyper-parameters : Number of trees : 1000 Candidate window size : 1 Learning rate : Splitting algorithm : Extremely randomized trees (ET) Goal GIF algorithm Results Conclusion 11 / 16

19 Results Candidate window size Error Friedman CT slice Twonorm Fitting time [s] Musk2 Candidate window size Goal GIF algorithm Results Conclusion 12 / 16

20 Results Regression 1% budget 10% budget CT Slice Abalone Hwang F5 Friedman1 Cadata ET x% GIF x% RAND x% Relative average mean square error to ET 100%. Goal GIF algorithm Results Conclusion 13 / 16

21 Results Binary classification 1% budget 10% budget Ringnorm Twonorm Hastie Musk2 Madelon Mnist 8vs ET x% GIF x% RAND x% Relative average misclassification rate to ET 100%. Goal GIF algorithm Results Conclusion 14 / 16

22 Results Multi-class problems 1% budget 10% budget Letter Vowel Mnist Waveform ET x% GIF x% RAND x% Relative average misclassification rate to ET 100%. Goal GIF algorithm Results Conclusion 15 / 16

23 Conclusion and future works Performances GIF allows for lightweight yet accurate forests. Global optimization usually helps. Surprisingly, optimizing the shapes hurts. TODOs Handle multiclass problems better. In depth comparison with Boosting methods. Goal GIF algorithm Results Conclusion 16 / 16

24 GIF algorithm 1: Input : D = (x i, y i ) N i=1, the learning set with x i R P and y i R K ; A, the tree learning algorithm ; L, the loss function ; B, the node budget ; T, the number of trees ; CW, the candidate window size ; λ, the learning rate. 2: Output : An ensemble S of B tree nodes with their corresponding weights. 3: Algorithm : 4: S = ; C = ; t = 1 5: ŷ (0) (.) = arg min N y R K i=1 L(y i, y) 6: Grow T stumps with A on D and add the left and right successors of all stumps to C. 7: repeat 8: C t is a subset of size min{cw, C } of C chosen uniformly at random. 9: Compute : N ( ) (j, wj ) = arg min L y i, ŷ (t 1) (x i ) + wz j (x i ) j C t,w R K i=1 10: S = S {(j, wj )} ; C = C \ {j } ; y (t) (.) = y (t 1) (.) + λwj z j (.) 11: Split j using A to obtain children j l and j r 12: C = C {j l, j r } ; t = t : until budget B is met 17 / 16

25 GIF algorithm Regularization Node split Although the weight are optimized globally, the splitting elements of a node are still determined locally. Learning rate A learning rate λ was introduced to prevent overfitting : y (t) (.) = y (t 1) (.) + λw j z j (.) (7) Candidate window Only a uniformly drawn subset of candidates are examined at each iterations. This also serves to speed up computations 18 / 16

26 Result Error rates droupout with iteration Learning rate = 10 3 = 10 1: 5 = 1 Error Budget Friedman1 : average test set error with respect to the budget B (CW = +, m= 10, T = 1000). 19 / 16

27 Result Cumulative node distribution Learning rate = 10 3 = 10 1: 5 = 1 Cumulative node ratio Ranks Friedman1 : cumulative node distribution with respect to the size-ranks (CW =, m= 10, T = 1000, B = 10%) 20 / 16

28 Datasets Table: Characteristics of the datasets. N is the learning sample size, TS stands for testing set, and p is the number of features. Dataset N TS p # classes Friedman Abalone CT slice Hwang F Cadata Ringnorm Twonorm Hastie Musk Madelon Mnist8vs Waveform Vowel Mnist Letter / 16

Computer Vision Group Prof. Daniel Cremers. 8. Boosting and Bagging

Computer Vision Group Prof. Daniel Cremers. 8. Boosting and Bagging Prof. Daniel Cremers 8. Boosting and Bagging Repetition: Regression We start with a set of basis functions (x) =( 0 (x), 1(x),..., M 1(x)) x 2 í d The goal is to fit a model into the data y(x, w) =w T