SpicyMKL Efficient multiple kernel learning method using dual augmented Lagrangian

Size: px

Start display at page:

Download "SpicyMKL Efficient multiple kernel learning method using dual augmented Lagrangian"

Felicia Baldwin
5 years ago
Views:

1 SpicyMKL Efficient multiple kernel learning method using dual augmented Lagrangian Taiji Suzuki Ryota Tomioka The University of Tokyo Graduate School of Information Science and Technology Department of Mathematical Informatics 23rd Oct., 1

2 Multiple Kernel Learning Fast Algorithm 2

3 Kernel Learning Regression, Classification Kernel Function 3

4 Gaussian, polynomial, chi-square,. Parameters:Gaussian width, polynomial degree Features Thousands of kernels Computer Vision:color, gradient, sift (sift, hsvsift, huesift, scaling of sift), Geometric Blur, image regions,... Multiple Kernel Learning (Lanckriet et al. 2004) Select important kernels and combine them 4

5 Multiple Kernel Learning (MKL) Single Kernel Multiple Kernel d m :convex combination, sparse (many 0 components) 5

6 Sparse learning grouping Lasso Group Lasso kernelize Multiple Kernel Learning (MKL) :N samples :M kernels L 1 -regularization sparse :RKHS associated with k m :Convex loss(hinge, square, logistic) 6

7 Representer theorem Finite dimensional problem :Gram matrix of mth kernel 7

8 SpicyMKL Proposed: SpicyMKL Scales well against # of kernels. Formulated in a general framework. 8

9 Outline Introduction Details of Multiple Kernel Learning Our method Proximal minimization Skipping inactive kernels Numerical Experiments Bench-mark datasets Sparse or dense 9

10 Details of Multiple Kernel Learning(MKL) 10

11 Generalized Formulation MKL : sparse regularization 11

12 hinge loss elastic net logistic loss L1regularization hinge: logistic: elastic net: L1regularization: 12

13 Rank 1 decomposition If g is monotonically increasing and f l is diff ble at the optimum, the solution of MKL is rank 1: such that convex combination of kernels Proof Derivative w.r.t. 13

14 Existing approach 1 Constraints based formulation: [Lanckriet et al. 2004, JMLR] [Bach, Lanckriet & Jordan 2004, ICML] primal Dual Lanckriet et al. 2004: SDP Bach, Lanckriet & Jordan 2004: SOCP 14

15 Existing approach 2 Upper bound based formulation: [Sonnenburg et al. 2006, JMLR] [Rakotomamonjy et al. 2008, JMLR] [Chapelle & Rakotomamonjy 2008, NIPS workshop] primal (Jensen s inequality) Sonnenburg et al. 2006: Cutting plane (SILP) Rakotomamonjy et al. 2008: Gradient descent (SimpleMKL) Chapelle & Rakotomamonjy 2008: Newton method (HessianMKL) 15

16 Problem of existing methods Do not make use of sparsity during the optimization. Do not perform efficiently when # of kernels is large. 16

17 Outline Introduction Details of Multiple Kernel Learning Our method Proximal minimization Skipping inactive kernels Numerical Experiments Bench-mark datasets Sparse or dense 17

18 Our method SpicyMKL 18

19 DAL (Tomioka & Sugiyama 09) Dual Augmented Lagrangian Lasso, group Lasso, trace norm regularization SpicyMKL = DAL + MKL Kernelization of DAL 19

20 Difficulty of MKL(Lasso) is non-diff ble at 0. 0 Relax by Proximal Minimization Proximal Minimization (Rockafellar, 1976) 20

21 Our algorithm SpicyMKL: Proximal Minimization (Rockafellar, 1976) 1 2 Set some initial solution. Choose increasing sequence. Iterate the following update rule until convergence: It can be shown that super linearly! Regularization toward the last update. Taking its dual, the update step can be solved efficiently 21

22 Fenchel s duality theorem Split into two parts 22

23 Non-diff ble dual dual Non-diff ble Smooth! Moreau envelope 23

24 Inner optimization Newton method is applicable. Twice Differentiable Twice Differentiable (a.e.) Even if is not differentiable, we can apply almost the same algorithm. 24

25 Rigorous Derivation of Convolution and duality 25

26 Moreau envelope Moreau envelope (Moreau 65) For a convex function, we define its Moreau envelope as 26

27 Moreau envelope L 1 -penalty dual Smooth! 27

28 We arrived at What we have observed Dual of update rule in proximal minimization Differentiable Only active kernels need to be calculated. (next slide) 28

29 Skipping inactive kernels Soft threshold Hard threshold If (inactive),, and its gradient also vanishes. 29

30 The objective function: Derivatives We can apply Newton method for the dual optimization. Its derivatives: where Only active kernels contributes their computations. Efficient for large number of kernels. 30

31 Convergence result Thm. For any proper convex functions and, and any non-decreasing sequence, Thm. If is strongly convex and is sparse reg., for any increasing sequence, the convergence rate is super linear. 31

32 Non-differentiable loss If loss is non-differentiable, almost the same algorithm is available by utilizing Moreau envelope of. primal dual Moreau envelope of dual 32

33 Dual of elastic net primal dual Smooth! Naïve optimization in the dual is applicable. But we applied prox. minimization method for small to avoid numerical instability. 33

34 Why Spicy? SpicyMKL = DAL + MKL. DAL also means a major Indian cuisine. hot, spicy SpicyMKL Dal from Wikipedia 34

35 Outline Introduction Details of Multiple Kernel Learning Our method Proximal minimization Skipping inactive kernels Numerical Experiments Bench-mark datasets Sparse or dense 35

36 Numerical Experiments 36

37 IDA data set, L1 regularization. CPU time against # of kernels Splice SpicyMKL Ringnorm Scales well against # of kernels. 37

38 IDA data set, L1 regularization. CPU time against # of samples Splice SpicyMKL Ringnorm Scaling against # of samples is almost the same as the existing methods. 38

39 UCI dataset # of kernels

40 Comparison in elastic net (Sparse v.s. Dense) 40

41 Sparse or Dense? Elastic Net where. 0 1 Sparse Dense Compare performances of sparse and dense solutions in a computer vision task. 41

42 caltech101 dataset (Fei-Fei et al., 2004) object recognition task anchor, ant, cannon, chair, cup 10 classification problems (10=5 (5-1)/2) # of kernels: 1760 = 4 (features) 22 (regions) あああああ2 (kernel functions) 10 (kernel parameters) sift features [Lowe99]: colordescriptor [van de Sande et al.10] hsvsift, sift (scale = 4px, 8px, automatic scale) regions: whole region, 4 division, 16 division, spatial pyramid. (computed histogram of features on each region.) kernel functions: Gaussian kernels, kernels with 10 parameters. 42

43 Performances are averaged over all classification problems. Best Performance of sparse solution is comparable with dense solution. dense Medium density 43

44 Conclusion まとめと今後の課題 and Future works Conclusion We proposed a new MKL algorithm that is efficient when # of kernels is large. proximal minimization neglect in-active kernels Medium density showed the best performance, but sparse solution also works well. Future work Second order update 44

45 Technical report T. Suzuki & R. Tomioka: SpicyMKL. arxiv: DAL R. Tomioka & M. Sugiyama: Dual Augmented Lagrangian Method for Efficient Sparse Reconstruction. IEEE Signal Proccesing Letters, 16 (12) pp ,

46 Thank you for your attention! 46

47 Some properties Moreau envelope proximal operator (Fenchel duality th.) (Fenchel duality th.) Differentiable! 0 47

LECTURE 13: SOLUTION METHODS FOR CONSTRAINED OPTIMIZATION. 1. Primal approach 2. Penalty and barrier methods 3. Dual approach 4. Primal-dual approach

LECTURE 13: SOLUTION METHODS FOR CONSTRAINED OPTIMIZATION 1. Primal approach 2. Penalty and barrier methods 3. Dual approach 4. Primal-dual approach Basic approaches I. Primal Approach - Feasible Direction