Gradient Boosted Feature Selection. Zhixiang (Eddie) Xu, Gao Huang, Kilian Q. Weinberger, Alice X. Zheng

1 Gradient Boosted Feature Selection
Zhixiang (Eddie) Xu, Gao Huang, Kilian Q. Weinberger, Alice X. Zheng

2 Goals of feature selection
- Reliably extract relevant features
- Identify non-linear feature dependency
- Scale linearly with the number of features and inputs
- Allow the incorporation of known sparsity patterns

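3 Feature selection with the $\ell_1$-norm [R. Tibshirani, 1996]
Data: $D = \{(x_1, y_1), \dots, (x_n, y_n)\} \subset \mathbb{R}^d \times \mathcal{Y}$
Optimization:
\[ \min_w \sum_{i=1}^{n} \ell(x_i^\top w, y_i) + \lambda \|w\|_1 \]
The $\ell_1$ term plays two roles that are always tied together:
- it serves to enforce sparsity
- it is also a regularizer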
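
The coupling of sparsity and shrinkage can be seen in one dimension. The sketch below assumes a squared loss $\frac{1}{2}(w - a)^2$ on a single weight, which is not on the slide; `soft_threshold`, `a`, and `lam` are illustrative names. Under that assumption the $\ell_1$-penalized minimizer has the closed form $\mathrm{sign}(a)\max(|a| - \lambda, 0)$: small inputs become exactly zero, and large inputs are still shrunk by $\lambda$.

```python
def soft_threshold(a, lam):
    """Minimizer of 0.5 * (w - a)**2 + lam * abs(w) over w (assumed 1-D squared loss)."""
    if a > lam:
        return a - lam      # large positive input: kept, but shrunk by lam
    if a < -lam:
        return a + lam      # large negative input: kept, but shrunk by lam
    return 0.0              # small input: set exactly to zero (sparsity)


for a in (0.3, 3.0, -3.0):
    print(a, soft_threshold(a, lam=0.5))
```

The same parameter `lam` controls both how many weights are zeroed and how strongly the surviving weights are shrunk, which is the sense in which the two effects cannot be tuned separately.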

4 The capped $\ell_1$-norm [T. Zhang, 2008]
The $\ell_1$ norm:
- serves to enforce sparsity
- is also a regularizer
The capped $\ell_1$ norm detangles these two effects (no additional penalty on large weights):
\[ \|w_i\|_{\bar{1}} = \min(|w_i|, \epsilon) \]
[Figure: the $\ell_1$ norm and the capped $\ell_1$ norm ($\epsilon = 1$) plotted against $w_i$]
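
A minimal sketch of the capped norm, summed over the entries of a weight vector. The function names and the example vectors are illustrative and not taken from the slides.

```python
def l1(w):
    """Plain l1 norm: sum of absolute values."""
    return sum(abs(wi) for wi in w)


def capped_l1(w, eps=1.0):
    """Capped l1 norm: each entry contributes min(|w_i|, eps)."""
    return sum(min(abs(wi), eps) for wi in w)


w_small = [0.2, 0.0, -0.5]
w_large = [20.0, 0.0, -50.0]

print(l1(w_small), capped_l1(w_small))   # both norms agree below the cap
print(l1(w_large), capped_l1(w_large))   # the capped norm stops growing at eps per entry
```

Once every non-zero entry exceeds `eps`, the capped norm equals `eps` times the number of non-zero entries, so it penalizes how many features are used rather than how large their weights are.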

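5 Feature selection with the capped $\ell_1$-norm
\[ \min_w \underbrace{\sum_{i=1}^{n} \ell(x_i^\top w, y_i)}_{\text{loss}} + \underbrace{\lambda \|w\|_1}_{\text{regularization}} + \underbrace{\mu \|w\|_{\bar{1}}}_{\text{sparsity}} \]
Here $\|w\|_{\bar{1}} = \sum_i \min(|w_i|, \epsilon)$ is the capped $\ell_1$ norm.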
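
The three terms can be evaluated separately, as in the sketch below. The slide leaves the loss $\ell$ generic; the squared loss used here is an assumption, and `objective`, `lam`, `mu`, and `eps` are illustrative names.

```python
def objective(w, X, y, lam, mu, eps=1.0):
    """Loss + lam * l1 norm + mu * capped l1 norm, with an assumed squared loss."""
    loss = 0.0
    for x, yi in zip(X, y):
        pred = sum(xj * wj for xj, wj in zip(x, w))   # x_i^T w
        loss += (pred - yi) ** 2
    regularization = lam * sum(abs(wj) for wj in w)
    sparsity = mu * sum(min(abs(wj), eps) for wj in w)
    return loss + regularization + sparsity


X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 0.0]
print(objective([1.0, 0.0], X, y, lam=0.5, mu=2.0, eps=0.1))
```

With this split, `lam` controls shrinkage and `mu` controls how strongly each additional non-zero weight is charged.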

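6 Equivalent constrained optimization
\[ \min_w \sum_{i=1}^{n} \ell(x_i^\top w, y_i) + \mu \|w\|_{\bar{1}} \quad \text{s.t.} \quad \|w\|_1 \le C \]
Here $\|w\|_{\bar{1}}$ is the capped $\ell_1$ norm.
Limitation: this linear formulation cannot discover non-linear feature dependency.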

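7 Gradient Boosted Regression Trees (GBRT)
GBRT solves an $\ell_1$-regularized problem in $h(x)$ space [L. Mason et al., 2000]. With $H(x_i) = h(x_i)^\top \beta$:
\[ \min_\beta \sum_{i=1}^{n} \ell(h(x_i)^\top \beta, y_i) \;\; \text{s.t.} \;\; \|\beta\|_1 \le C \quad \Longleftrightarrow \quad \min_\beta \sum_{i=1}^{n} \ell(H(x_i), y_i) \;\; \text{s.t.} \;\; \|\beta\|_1 \le C \]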

8 Gradient Boosted Regression Trees (GBRT)
\[ \min_\beta \sum_{i=1}^{n} \ell(H(x_i), y_i) \quad \text{s.t.} \quad \|\beta\|_1 \le C \]
Non-linear transformation: $x_i \mapsto h(x_i) = [h_1(x_i), h_2(x_i), h_3(x_i), \dots]^\top$
- $h(x_i)$ ranges over all possible regression trees and is extremely high dimensional
- each $h_t(x_i)$ is a limited-depth regression tree generated by CART [Breiman, L., 1984]
- each iteration adds a new tree $h_t(x_i)$ by activating its coefficient $\beta_t$, chosen to minimize the loss:
\[ h_{t+1}(x) = \arg\min_h \ell(H + h) \]
- for the squared loss, the summand is $(H(x_i) + h(x_i) - y_i)^2$; the new tree follows the steepest descent direction given by $\partial \ell(H) / \partial h_t(x_i)$
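
The boosting loop can be sketched in a few lines. This is a simplified illustration, not the presenters' implementation: it assumes the squared loss, uses depth-1 trees (stumps) in place of limited-depth CART trees, and fixes the step size; `fit_stump`, `boost`, `step`, and the toy data are all hypothetical.

```python
def fit_stump(X, r):
    """Least-squares depth-1 tree on targets r (needs a feature with two distinct values)."""
    n, d = len(X), len(X[0])
    best = None
    for f in range(d):
        values = sorted(set(x[f] for x in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r[i] for i in range(n) if X[i][f] <= thr]
            right = [r[i] for i in range(n) if X[i][f] > thr]
            ml = sum(left) / len(left)
            mr = sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr, ml, mr)
    return best[1:]  # (feature, threshold, left value, right value)


def predict_stump(stump, x):
    f, thr, ml, mr = stump
    return ml if x[f] <= thr else mr


def boost(X, y, n_trees=20, step=0.5):
    """Add one stump per iteration, each fitted to the current residuals."""
    H = [0.0] * len(X)
    trees = []
    for _ in range(n_trees):
        # For the loss 0.5 * (H - y)^2 the negative gradient is the residual y - H.
        residual = [yi - Hi for yi, Hi in zip(y, H)]
        stump = fit_stump(X, residual)
        trees.append(stump)
        H = [Hi + step * predict_stump(stump, x) for Hi, x in zip(H, X)]
    return trees, H


def mse(H, y):
    return sum((Hi - yi) ** 2 for Hi, yi in zip(H, y)) / len(y)


# Toy data: the target is a non-linear function of feature 0; feature 1 is a distractor.
X = [[i / 10.0, (i * 7 % 11) / 10.0] for i in range(11)]
y = [(x[0] - 0.5) ** 2 for x in X]

trees, H = boost(X, y)
print(mse([0.0] * len(y), y), mse(H, y))
```

Each stump corresponds to one coordinate of $h(x)$ being switched on, so the number of iterations bounds how many coefficients in $\beta$ are non-zero.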

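9 Gradient Boosted Feature Selection (GBFS)
\[ \min_\beta \underbrace{\sum_{i=1}^{n} \ell(H(x_i), y_i)}_{\text{loss}} + \underbrace{\mu \|\mathbf{h}\|_{\bar{1}}}_{\text{sparsity on original features}} \quad \text{s.t.} \quad \underbrace{\|\beta\|_1 \le C}_{\text{regularization}} \]
Here $\|\cdot\|_{\bar{1}}$ is the capped $\ell_1$ norm and $\mathbf{h}$ collects the trees in the ensemble.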

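10 Gradient Boosted Feature Selection (GBFS)
\[ \min_\beta \sum_{i=1}^{n} \ell(H(x_i), y_i) + \mu \|\mathbf{h}\|_{\bar{1}} \quad \text{s.t.} \quad \|\beta\|_1 \le C \]
A new tree is added by minimizing
\[ h_{t+1} = \arg\min_h \sum_{i=1}^{n} \ell(H(x_i) + h(x_i), y_i) + \mu \|\mathbf{h} + h\|_{\bar{1}} \]
- this selects the tree with the best trade-off between feature sparsity and descent on $\ell(H)$
- the minimizer is found using a modified CART
Here $\|\cdot\|_{\bar{1}}$ is the capped $\ell_1$ norm and $\mathbf{h}$ collects the trees selected so far.
(Details in the full paper.)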
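
One plausible reading of the modified CART step is sketched below; the slide defers the details to the full paper, so treat this as an assumption rather than the authors' algorithm. The sketch assumes the squared loss and depth-1 trees, and it adds a fixed cost `mu` to any split on a feature that no earlier tree has used. The names `best_stump`, `selected_features`, and the toy data are hypothetical.

```python
def best_stump(X, r, used, mu):
    """Least-squares stump on targets r; splitting on a not-yet-used feature costs an extra mu."""
    n, d = len(X), len(X[0])
    best = None
    for f in range(d):
        penalty = 0.0 if f in used else mu   # assumed sparsity charge for a new feature
        values = sorted(set(x[f] for x in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r[i] for i in range(n) if X[i][f] <= thr]
            right = [r[i] for i in range(n) if X[i][f] > thr]
            ml = sum(left) / len(left)
            mr = sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            score = sse + penalty            # descent quality traded against sparsity
            if best is None or score < best[0]:
                best = (score, f, thr, ml, mr)
    return best[1:]


def selected_features(X, y, mu, n_trees=5):
    """Run a few boosting rounds and return the set of features the stumps split on."""
    H = [0.0] * len(X)
    used = set()
    for _ in range(n_trees):
        r = [yi - Hi for yi, Hi in zip(y, H)]   # residuals under the squared loss
        f, thr, ml, mr = best_stump(X, r, used, mu)
        used.add(f)
        H = [Hi + (ml if x[f] <= thr else mr) for Hi, x in zip(H, X)]
    return used


# Toy data: y = 2 * x0 + x1, so feature 0 matters more than feature 1.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0.0, 1.0, 2.0, 3.0]

print(selected_features(X, y, mu=0.0))   # {0, 1}
print(selected_features(X, y, mu=2.0))   # {0}
```

With `mu=0.0` the second stump switches to feature 1 because it removes the remaining residual; with `mu=2.0` that gain is smaller than the charge for a new feature, so only feature 0 is ever used.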

11 Gradient Boosted Feature Selection
- Non-linear feature combination
- Scales linearly with the number of features and inputs
Algorithm                  Non-linearity   Training time complexity   Testing time complexity
Lasso                      NO              O(nd)                      O(d)
Kernel Feature Selection   YES             O(n^2 d)                   O(nd)
GBFS                       YES             O(nd)                      O(m)
m = number of trees (< 1,…)

12 Synthetic data
L1-LR feature selection
- Selected features: y
- Ignored features: x, z
- Test error: 4. %
GBFS feature selection ( iterations)
- Selected features: x, y
- Ignored features: z
- Test error: %
[Figure: two scatter plots of the data in the x-y plane, one per method]

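13 Structured feature selection
Colon data set [S. Ma et al., 2007]
[Figure: one panel per method (GBFS, Group Lasso, HSIC Lasso, Random Forest, L1-LR), each annotated with a count whose label ends in "bag" and a test error. Annotations as extracted, in order: ": 4, test error: 1.38%"; ": 23, test error: 36.1%"; ": 13, test error: 21.8%"; ": , test error: 1.38%"; ": , test error: 17.69%"]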

14 Benchmark data sets
[Figure: one panel per data set (pcmac, uspst, spam, isolet, mnist 3vs8, adult); "Better" marks the preferred direction. Methods compared: L1-LR (Lee et al., 2006), RF FS (Hastie et al., 2009), HSIC Lasso (Yamada et al., 2012), mRMR (Peng et al., 2005), GBFS]

15 Large data set
kddcup99: n = 4,898,431, d = 122
[Figure: results on kddcup99; "Better" marks the preferred direction. Methods compared: L1-LR (Lee et al., 2006), RF FS (Hastie et al., 2009), GBFS at six values of µ]

16 Feature quality
Feature selection only. Classifier is RBF-SVM.
[Figure: feature quality; "Better" marks the preferred direction. Methods compared: L1-LR (Lee et al., 2006), RF FS (Hastie et al., 2009), HSIC Lasso (Yamada et al., 2012), mRMR (Peng et al., 2005), GBFS]

17 Conclusion
Gradient Boosted Feature Selection (GBFS):
- Scales naturally to large data sets
- Discovers non-linear feature dependency
- Incorporates known feature dependency
- Outperforms the current state-of-the-art

21 Thank you! Questions?
