Nonsmooth Optimization for Statistical Learning with Structured Matrix Regularization


1 Nonsmooth Optimization for Statistical Learning with Structured Matrix Regularization. PhD defense of Federico Pierucci. Thesis supervised by Prof. Zaid Harchaoui, Prof. Anatoli Juditsky, and Dr. Jérôme Malick. Université Grenoble Alpes, Inria - Thoth and BiPoP, Inria Grenoble. June 23rd, 2017.

2 Application 1: Collaborative filtering. Collaborative filtering for recommendation systems; a matrix completion optimization problem. [Table: ratings matrix X with rows Albert, Ben, Celine, Diana, Elia, Franz and columns film 1, film 2, film 3.] Data: for user i and movie j, X_ij ∈ R, with (i, j) ∈ I the set of known ratings. Purpose: predict a future rating, i.e. X_ij = ? for a new pair (i, j).

3 Application 1: Collaborative filtering (continued). Low rank assumption: movies can be divided into a small number of types. For example, solve

min_{W ∈ R^{d×k}} (1/N) Σ_{(i,j) ∈ I} |W_ij − X_ij| + λ ‖W‖_{σ,1}

where ‖W‖_{σ,1} is the nuclear norm (the sum of the singular values), a convex surrogate of the rank.
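To make the objective concrete, here is a small NumPy sketch of this matrix-completion objective (L1 data fit over observed entries plus nuclear-norm penalty). The names W, X, mask and lam are illustrative; the thesis code itself was written in Matlab, so this is only an assumed sketch of the formula above.

```python
import numpy as np

def nuclear_norm(W):
    # Sum of singular values of W.
    return np.linalg.svd(W, compute_uv=False).sum()

def completion_objective(W, X, mask, lam):
    # L1 data-fitting term over the observed entries (i, j) in I, plus nuclear-norm penalty.
    data_fit = np.abs((W - X)[mask]).sum() / mask.sum()
    return data_fit + lam * nuclear_norm(W)

# Tiny usage example with synthetic data.
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(6, 3)).astype(float)   # ratings
mask = rng.random((6, 3)) < 0.5                      # observed entries I
W = rng.normal(size=(6, 3))
print(completion_objective(W, X, mask, lam=0.1))
```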

4 Application 2: Multiclass classification. Multiclass classification of images; example: the ImageNet challenge (image ↦ marmot, image ↦ hedgehog, image ↦ ?). Data (x_i, y_i) ∈ R^d × R^k: pairs of (image, category). Purpose: predict the category for a new image, x ↦ y = ?.

5 Application 2: Multiclass classification (continued). Low rank assumption: the features are assumed to be embedded in a lower dimensional space. Multiclass version of the support vector machine (SVM):

min_{W ∈ R^{d×k}} (1/N) Σ_{i=1}^N max{0, 1 + max_{r ≠ y_i} (W_r^⊤ x_i − W_{y_i}^⊤ x_i)} + λ ‖W‖_{σ,1}

where W_j ∈ R^d is the j-th column of W; the hinge term can also be written ‖(A_{x_i,y_i} W)_+‖ for a suitable linear map A_{x,y}.
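For intuition, a minimal NumPy sketch of this multiclass hinge (top-1) loss; W has one column per class, and the helper names (scores, margins) are illustrative, not from the thesis package.

```python
import numpy as np

def multiclass_hinge(W, X, y):
    # W: (d, k) weights, X: (N, d) features, y: (N,) class indices in {0, ..., k-1}.
    scores = X @ W                                   # (N, k) scores W_r^T x_i
    true = scores[np.arange(len(y)), y]              # W_{y_i}^T x_i
    scores[np.arange(len(y)), y] = -np.inf           # exclude r = y_i from the max
    margins = 1.0 + scores.max(axis=1) - true        # 1 + max_{r != y_i}(W_r^T x_i - W_{y_i}^T x_i)
    return np.maximum(0.0, margins).mean()

rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 5)), rng.integers(0, 3, size=8)
W = rng.normal(size=(5, 3))
print(multiclass_hinge(W, X, y))
```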

6 These two problems have the form

min_{W ∈ R^{d×k}} (1/N) Σ_{i=1}^N l(y_i, F(x_i, W)) + λ ‖W‖

where the first term is the empirical risk R(W), the second term is the regularization, and ŷ_i = F(x_i, W). Notation: F prediction function, l loss function. Matrix learning problem, data: N number of examples, x_i feature vector, y_i outcome, ŷ_i predicted outcome. [Figure: data points (x_i, y_i) and two prediction functions F(·, W_1), F(·, W_2).]

7 These two problems have the form above (empirical risk R(W) plus regularization λ‖W‖). Challenges: large scale (N, k, d); robust learning with nonsmooth losses such as g(ξ) = |ξ| or g(ξ) = max{0, ξ}; generalization → nonsmooth regularization; noisy data and outliers → nonsmooth empirical risk.

8 My thesis in one slide:

min_W (1/N) Σ_{i=1}^N l(y_i, F(x_i, W)) + λ ‖W‖

1st contribution (the empirical risk term): smoothing techniques. 2nd contribution (the minimization): conditional gradient algorithms. 3rd contribution (the regularizer): group nuclear norm.

9 Part 1: Unified view of smoothing techniques for first order optimization. Motivation: smoothing is a key tool in optimization; a smooth loss allows the use of gradient-based optimization. Typical nonsmooth examples: g(ξ) = |ξ| and g(ξ) = max{0, ξ}.

10 Part 1 (continued). Contributions: a unified view of smoothing techniques for nonsmooth functions; a new example, the smoothing of the top-k error (for list ranking and classification); a study of algorithms obtained as smoothing + state-of-the-art algorithms for smooth problems.

11 Part 2: Conditional gradient algorithms for doubly nonsmooth learning. Motivations: common matrix learning problems are formulated as

min_{W ∈ R^{d×k}} R(W) + λ ‖W‖

where both the empirical risk R and the regularization are nonsmooth. A nonsmooth empirical risk (e.g. the L1 norm) is robust to noise and outliers, but standard nonsmooth optimization methods are not always scalable (e.g. with the nuclear norm).

12 Part 2 (continued). Contributions: new algorithms based on the (composite) conditional gradient; convergence analysis with rates of convergence and guarantees; numerical experiments on real data.

13 Part 3: Regularization by group nuclear norm. Motivations: structured matrices can join information coming from different sources; low-rank models improve robustness and dimensionality reduction. [Figure: a matrix W written as the sum of group blocks W_1, W_2, W_3.]

14 Part 3 (continued). Contributions: definition of a new norm for matrices with underlying groups; analysis of its convexity properties; used as a regularizer, it provides low rank by groups and aggregates models.

15 Outline: 1 Unified view of smoothing techniques; 2 Conditional gradient algorithms for doubly nonsmooth learning; 3 Regularization by group nuclear norm; 4 Conclusion and perspectives.

16 Smoothing techniques. Purpose: to smooth a convex function g : R^n → R.

17 Smoothing techniques (continued). Purpose: to smooth a convex function g : R^n → R. Two techniques:

1) Product convolution [Bertsekas 1978] [Duchi et al. 2012]:
g^pc_γ(ξ) := ∫_{R^n} g(ξ − z) (1/γ) µ(z/γ) dz, with µ a probability density.

2) Infimal convolution [Moreau 1965] [Nesterov 2007] [Beck, Teboulle 2012]:
g^ic_γ(ξ) := inf_{z ∈ R^n} { g(ξ − z) + γ ω(z/γ) }, with ω a smooth convex function.

18 Smoothing techniques (continued). Result: with either technique, g_γ is a uniform approximation of g, i.e. there exist m, M ≥ 0 such that −γ m ≤ g_γ(x) − g(x) ≤ γ M; and g_γ is L_γ-smooth, i.e. g_γ is differentiable and convex with ‖∇g_γ(x) − ∇g_γ(y)‖_* ≤ L_γ ‖x − y‖ (L_γ proportional to 1/γ), where ‖·‖_* is the dual norm of ‖·‖.

19 Smoothing surrogates of nonsmooth functions. Purpose: obtain a g_γ to be used inside algorithms, with a (possibly) explicit expression that is easy to evaluate numerically. Elementary example (in R): the absolute value g(x) = |x|.

Product convolution with µ(x) = (1/√(2π)) e^{−x²/2}:
g^pc_γ(x) = x F(x/γ) − x F(−x/γ) + γ √(2/π) e^{−x²/(2γ²)},
where F(x) := (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt is the cumulative distribution function of the Gaussian.

Infimal convolution with ω(x) = (1/2) x²:
g^ic_γ(x) = x²/(2γ) if |x| ≤ γ, and |x| − γ/2 if |x| > γ (the Huber function).

Motivating nonsmooth function: the top-k loss (next).
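A small NumPy sketch of these two smoothings of |x| (the Gaussian product convolution in closed form and the Huber function); the helper names are illustrative assumptions, not the thesis code.

```python
import numpy as np
from math import erf, sqrt, pi

def gauss_cdf(x):
    # Standard normal CDF F(x).
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def abs_product_convolution(x, gamma):
    # Gaussian smoothing of |x|: x F(x/g) - x F(-x/g) + g sqrt(2/pi) exp(-x^2 / (2 g^2)).
    return (x * gauss_cdf(x / gamma) - x * gauss_cdf(-x / gamma)
            + gamma * sqrt(2.0 / pi) * np.exp(-x**2 / (2.0 * gamma**2)))

def abs_infimal_convolution(x, gamma):
    # Huber smoothing of |x|.
    return np.where(np.abs(x) <= gamma, x**2 / (2.0 * gamma), np.abs(x) - gamma / 2.0)

for x in (-2.0, -0.1, 0.0, 0.5, 3.0):
    print(x, abs_product_convolution(x, 0.5), abs_infimal_convolution(x, 0.5))
```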

20 Top-3 loss for classification. Motivating nonsmooth function: the top-k loss. Example: the top-3 loss.

21 Top-3 loss for classification (continued). Ground truth: Cat. Prediction: 1 Paper towel, 2 Wall, 3 Cat ⇒ loss = 0. The prediction is good if the true class is among the first 3 predicted.

22 Top-3 loss for ranking. Ground truth: 1 Janis Joplin, 2 David Bowie, 3 Eric Clapton, 4 Patti Smith, 5 Jean-Jacques Goldman, 6 Francesco Guccini, ... Prediction: 1 David Bowie, 2 Patti Smith, 3 Janis Joplin. We predict an ordered list, and the loss counts the mismatches with respect to the true list.

23 Smoothing of the top-k loss. The convex top-k error function can be written as a sublinear function:

g(x) = max_{z ∈ Z} ⟨x, z⟩, with Z := { z ∈ R^n : 0 ≤ z_i ≤ 1/k, Σ_{i=1}^n z_i ≤ 1 } = cube ∩ simplex.

Case k = 1 (top-1): g(x) = max{0, max_i x_i}. Infimal convolution with ω(x) = Σ_{i=1}^n (x_i ln(x_i) − x_i) gives

g_γ(x) = γ (1 + ln Σ_{i=1}^n e^{x_i/γ}) if Σ_{i=1}^n e^{x_i/γ} > 1, and g_γ(x) = γ Σ_{i=1}^n e^{x_i/γ} otherwise.

24 Smoothing of the top-k loss (continued). For classification, this recovers a well-known result from statistics [Hastie et al., 2008]: with γ = 1, the smoothed top-1 loss is the multinomial logistic loss.
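A minimal NumPy sketch of this smoothed top-1 function (entropy infimal-convolution smoothing), assuming the two-branch expression reconstructed above; names are illustrative.

```python
import numpy as np

def smoothed_top1(x, gamma):
    # Entropy-smoothed top-1 error g_gamma(x) for a score vector x.
    s = np.exp(x / gamma).sum()
    if s > 1.0:
        return gamma * (1.0 + np.log(s))
    return gamma * s

x = np.array([-1.0, 0.2, 0.5])
for gamma in (1.0, 0.1, 0.01):
    # As gamma decreases, the value approaches max{0, max_i x_i}.
    print(gamma, smoothed_top1(x, gamma), max(0.0, x.max()))
```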

25 Smoothing of the top-k loss, case k > 1. Infimal convolution with the smooth function ω leads to

g_γ(x) = Σ_{i=1}^n H_γ(x_i + λ*(x, γ)), with H_γ(t) = 0 if t < 0, (1/2) t² if t ∈ [0, 1/k], t/k − 1/(2k²) if t > 1/k,

where λ*(x, γ) is obtained by solving an auxiliary problem (a smooth dual problem). To evaluate g_γ(x) through the dual problem: define P_x := { x_i, x_i − 1/k : i = 1, ..., n }, the breakpoints of the piecewise-linear function Θ'(λ) = 1 − Σ_{t_j ∈ P_x} π_{[0,1/k]}(t_j + λ), where π_{[0,1/k]} is the projection onto [0, 1/k]; find a, b ∈ P_x such that Θ'(a) ≤ 0 ≤ Θ'(b); then

λ*(x, γ) = max{ 0, a − Θ'(a)(b − a) / (Θ'(b) − Θ'(a)) }.

[Figure: the piecewise-linear function Θ'(λ) with its breakpoints t_j and the bracketing points a, b.]

26 Outline: 1 Unified view of smoothing techniques; 2 Conditional gradient algorithms for doubly nonsmooth learning; 3 Regularization by group nuclear norm; 4 Conclusion and perspectives.

27 Matrix learning problem:

min_{W ∈ R^{d×k}} R(W) + λ Ω(W), with both terms nonsmooth.

Empirical risk R(W) := (1/N) Σ_{i=1}^N l(W, x_i, y_i): top-k loss for ranking and multiclass classification; L1 loss for regression; e.g. l(W, x, y) := ‖A_{x,y}W‖_1 or l(W, x, y) := ‖(A_{x,y}W)_+‖_1.

Regularizer Ω(W) (typically a norm): nuclear norm ‖W‖_{σ,1} (sparsity on singular values); L1 norm ‖W‖_1 := Σ_{i=1}^d Σ_{j=1}^k |W_ij| (sparsity on entries); group nuclear norm Ω_G(W) (contribution 3: sparsity, feature selection).

28 Existing algorithms for nonsmooth optimization of min_{W ∈ R^{d×k}} R(W) + λ Ω(W) (both terms nonsmooth): subgradient and bundle algorithms [Nemirovski, Yudin 1976] [Lemaréchal 1979]; proximal algorithms [Douglas, Rachford 1956]. These algorithms are not scalable for the nuclear norm: the iteration cost is a full SVD = O(dk²).

29 What if the loss were smooth? min_{W ∈ R^{d×k}} S(W) + λ Ω(W), with S smooth and Ω nonsmooth. Algorithms with faster convergence when S is smooth: proximal gradient algorithms [Nesterov 2005] [Beck, Teboulle 2009], still not scalable for the nuclear norm (iteration cost: full SVD); (composite) conditional gradient algorithms [Frank, Wolfe 1956] [Harchaoui, Juditsky, Nemirovski 2013], with efficient iterations for the nuclear norm (iteration cost: computing the largest singular value = O(dk)).

30 Composite conditional gradient algorithm (state of the art for min_{W ∈ R^{d×k}} S(W) + λ Ω(W), with S smooth and Ω nonsmooth):

Let W_0 = 0 and r_0 such that Ω(W*) ≤ r_0.
for t = 0, ..., T do
  Compute Z_t = argmin_{D : Ω(D) ≤ r_t} ⟨∇S(W_t), D⟩  [gradient step]
  (α_t, β_t) = argmin_{α, β ≥ 0, α+β ≤ 1} S(αZ_t + βW_t) + λ(α + β) r_t  [optimal stepsize]
  Update W_{t+1} = α_t Z_t + β_t W_t and r_{t+1} = (α_t + β_t) r_t
end for

where W_t, Z_t, D ∈ R^{d×k}. This is efficient and scalable for some Ω, e.g. the nuclear norm, where Z_t is a rank-one matrix u v^⊤ built from the leading singular vectors of the gradient.
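As an illustration, here is a rough NumPy sketch of this composite conditional gradient scheme for a smooth least-squares fit with a nuclear-norm regularizer; the rank-one linear minimization step uses the leading singular pair of the gradient. The helper names and the crude grid search standing in for the exact (α, β) step are assumptions for the sketch, not the thesis implementation.

```python
import numpy as np
from scipy.sparse.linalg import svds

def composite_cond_grad(grad_S, S, lam, shape, r0, T=50):
    """Composite conditional gradient for min S(W) + lam * ||W||_nuclear (sketch)."""
    W, r = np.zeros(shape), r0
    for _ in range(T):
        G = grad_S(W)
        u, s, vt = svds(G, k=1)                 # leading singular pair of the gradient
        Z = -r * np.outer(u[:, 0], vt[0])       # rank-one minimizer of <G, D> over ||D||_* <= r
        # Crude grid search replacing the exact (alpha, beta) stepsize problem.
        best, ab = np.inf, (0.0, 1.0)
        for a in np.linspace(0, 1, 21):
            for b in np.linspace(0, 1 - a, 21):
                val = S(a * Z + b * W) + lam * (a + b) * r
                if val < best:
                    best, ab = val, (a, b)
        a, b = ab
        W, r = a * Z + b * W, (a + b) * r
    return W

# Usage: matrix completion with a smooth squared loss on the observed entries.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20)); mask = rng.random(X.shape) < 0.3
S = lambda W: 0.5 * np.sum(((W - X) * mask) ** 2)
grad_S = lambda W: (W - X) * mask
W_hat = composite_cond_grad(grad_S, S, lam=1.0, shape=X.shape, r0=50.0, T=20)
```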

31 Conditional gradient despite a nonsmooth loss. Idea: use the conditional gradient, replacing ∇S(W_t) with a subgradient s_t ∈ ∂R(W_t).

32 Conditional gradient despite a nonsmooth loss (continued). This naive substitution fails. Simple counterexample in R²: min_{w ∈ R²} ‖Aw + b‖_1 + ‖w‖_1. [Figure: in the (w_1, w_2) plane, the atoms of A and the level sets of R_emp; the algorithm is stuck at (0, 0).] Here the algorithm using a subgradient of the nonsmooth empirical risk does not converge to a solution.

33 Smoothed composite conditional gradient algorithm. Idea: replace the nonsmooth loss with a smoothed loss,

min_{W ∈ R^{d×k}} R(W) + λ Ω(W)  (nonsmooth)  →  min_{W ∈ R^{d×k}} R_γ(W) + λ Ω(W)  (smooth),

where {R_γ}_{γ>0} is a family of smooth approximations of R.

Let W_0 = 0 and r_0 such that Ω(W*) ≤ r_0.
for t = 0, ..., T do
  Compute Z_t = argmin_{D : Ω(D) ≤ r_t} ⟨∇R_{γ_t}(W_t), D⟩
  (α_t, β_t) = argmin_{α, β ≥ 0, α+β ≤ 1} R_{γ_t}(αZ_t + βW_t) + λ(α + β) r_t
  Update W_{t+1} = α_t Z_t + β_t W_t and r_{t+1} = (α_t + β_t) r_t
end for

with α_t, β_t the stepsizes and γ_t the smoothing parameter. Note: we still want to solve the initial doubly nonsmooth problem.
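To connect this with the previous sketch, one can feed it a smoothed risk whose smoothing parameter decreases along the iterations. Below is a hedged NumPy sketch of a Huber-smoothed L1 risk for matrix completion together with a schedule γ_t = γ_0 / √(t + 1); the function names and constants are illustrative, the precise schedules being those of the convergence theorems that follow.

```python
import numpy as np

def huber(t, gamma):
    # Infimal-convolution (Huber) smoothing of |t|.
    return np.where(np.abs(t) <= gamma, t**2 / (2 * gamma), np.abs(t) - gamma / 2)

def huber_grad(t, gamma):
    # Gradient of the Huber smoothing: t/gamma in the quadratic zone, sign(t) outside.
    return np.clip(t / gamma, -1.0, 1.0)

def R_gamma(W, X, mask, gamma):
    # Smoothed L1 empirical risk over the observed entries.
    return huber((W - X)[mask], gamma).sum() / mask.sum()

def R_gamma_grad(W, X, mask, gamma):
    G = np.zeros_like(W)
    G[mask] = huber_grad((W - X)[mask], gamma) / mask.sum()
    return G

# Time-varying smoothing schedule (p = 1/2), to be used inside the conditional-gradient loop.
gamma_schedule = lambda t, gamma0=1.0: gamma0 / np.sqrt(t + 1)
```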

34 Convergence analysis for the doubly nonsmooth problem min_{W ∈ R^{d×k}} F(W) = R(W) + λ Ω(W), with W* an optimal solution and γ_t the smoothing parameter (playing the role of a stepsize).

Convergence theorems:
Fixed smoothing of R, γ_t = γ: F(W_t) − F(W*) ≤ γ m + M / (γ (t + 14)). Dimensionality freedom: M does not depend on the dimension, only on ω or µ; the best γ depends on the required accuracy ε.
Time-varying smoothing of R, γ_t = γ_0 / √(t + 1): F(W_t) − F(W*) ≤ C / √t. Dimensionality freedom: C does not depend on the dimension; it depends on ω or µ, γ_0 and W*.

35 Algorithm implementation. Package: all the Matlab code was written from scratch, in particular multiclass SVM, top-k multiclass SVM, and all the other smoothed functions. Memory: efficient memory management, tools to operate with low-rank variables, tools to work with sparse sub-matrices of low-rank matrices (collaborative filtering). Numerical experiments on the 2 motivating applications: fixed smoothing for matrix completion (regression); time-varying smoothing for top-5 multiclass classification.

36 Fixed smoothing: example with matrix completion (regression). Data: MovieLens, d users, k movies, observed ratings (= 1.3% of the entries) normalized into [0, 1]. Benchmark: the iterates W_t are generated on a training set; we observe R(W_t) on the validation set and choose the best γ, i.e. the one that minimizes R(W_t) on the validation set.

37 Fixed smoothing (continued). [Figure: empirical risk on the validation set vs. iterations, λ fixed, for γ ∈ {0.01, 0.1, 0.5, 1, 5, 10, 50} and the best γ.] Each γ gives a different optimization problem: tiny smoothing → slower convergence; large smoothing → an objective much different from the initial one.

38 Time-varying smoothing: example with top-5 multiclass classification. Data: ImageNet, k = 134 classes, N images; features: BOW, d = 4096 features. Benchmark: the iterates W_t are generated on a training set; we observe the top-5 misclassification error on the validation set; for comparison, the best fixed smoothing parameter is found using the other benchmark.

39 Time-varying smoothing (continued). Time-varying smoothing parameter γ_t = γ_0 (1 + t)^{−p}, with p ∈ {1/2, 1}. [Figure: top-5 misclassification error on the validation set vs. iterations, for γ_0 ∈ {0.01, 0.1, 1, 10, 100} with p = 1 and p = 0.5, compared to the non-adaptive choice γ = 0.1.] No need to tune γ_0: time-varying smoothing matches the performance of the best experimentally tuned fixed smoothing.

40 Outline: 1 Unified view of smoothing techniques; 2 Conditional gradient algorithms for doubly nonsmooth learning; 3 Regularization by group nuclear norm; 4 Conclusion and perspectives.

41 Group nuclear norm: a matrix generalization of the popular group lasso norm [Turlach et al., 2005] [Yuan and Lin, 2006] [Zhao et al., 2009] [Jacob et al., 2009]. Recall the nuclear norm ‖W‖_{σ,1}: the sum of the singular values of W.

42 Group nuclear norm (continued). [Figure: a matrix W decomposed into group blocks W_1, W_2, W_3 via immersions i_g and projections Π_g, with G = {1, 2, 3}.] Definition:

Ω_G(W) := min_{W = Σ_{g ∈ G} i_g(W_g)} Σ_{g ∈ G} α_g ‖W_g‖_{σ,1}.

[Tomioka, Suzuki 2013] considered the case of non-overlapping groups.
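For non-overlapping groups the decomposition in the definition is unique, so Ω_G reduces to a weighted sum of blockwise nuclear norms. Here is a small NumPy sketch under that assumption, with groups given as (row-slice, column-slice) pairs; this representation is illustrative, not the thesis package.

```python
import numpy as np

def group_nuclear_norm(W, groups, alphas):
    # groups: list of (row_slice, col_slice) defining non-overlapping blocks W_g.
    # alphas: positive weight alpha_g for each group.
    total = 0.0
    for (rows, cols), alpha in zip(groups, alphas):
        Wg = W[rows, cols]
        total += alpha * np.linalg.svd(Wg, compute_uv=False).sum()
    return total

W = np.random.default_rng(0).normal(size=(6, 6))
groups = [(slice(0, 3), slice(0, 3)), (slice(0, 3), slice(3, 6)), (slice(3, 6), slice(0, 6))]
print(group_nuclear_norm(W, groups, alphas=[1.0, 1.0, 1.0]))
```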

43 Group nuclear norm: convex analysis (theoretical study). Fenchel conjugate of Ω_G; dual norm of Ω_G; expression of Ω_G as a support function; convex hull of functions involving the rank.

44 Convex hull: results. In words, the convex hull of a function is the largest convex function lying below it. Properly restricted to a ball, the nuclear norm is the convex hull of the rank [Fazel 2001]. Generalization (Theorem): properly restricted to a ball, the group nuclear norm is the convex hull of both
the reweighted group rank function Ω^rank_G(W) := inf_{W = Σ_{g ∈ G} i_g(W_g)} Σ_{g ∈ G} α_g rank(W_g),
and the reweighted restricted rank function Ω^rank(W) := min_{g ∈ G} α_g rank(W) + δ_g(W), with δ_g an indicator function.
Consequence: learning with the group nuclear norm enforces a low-rank property on the groups.

45 Learning with the group nuclear norm. Usual optimization algorithms can handle the group nuclear norm: composite conditional gradient algorithms; (accelerated) proximal gradient algorithms.

46 Learning with the group nuclear norm (continued). Illustration with the proximal gradient optimization algorithm: the key computations are parallelized over the groups, which gives good scalability when there are many small groups. Prox of the group nuclear norm:

prox_{γΩ_G}((W_g)_g) = ( U_g D_γ(S_g) V_g^⊤ )_{g ∈ G},

where W_g = U_g S_g V_g^⊤ is the SVD of W_g and D_γ is the soft-thresholding operator D_γ(S) = Diag({max{s_i − γ, 0}}_{1 ≤ i ≤ r}).
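A minimal NumPy sketch of this blockwise prox (SVD followed by soft-thresholding of the singular values), assuming non-overlapping groups and unit weights; names are illustrative.

```python
import numpy as np

def prox_nuclear(Wg, gamma):
    # prox of gamma * ||.||_nuclear: soft-threshold the singular values.
    U, s, Vt = np.linalg.svd(Wg, full_matrices=False)
    return U @ np.diag(np.maximum(s - gamma, 0.0)) @ Vt

def prox_group_nuclear(W_blocks, gamma):
    # Apply the nuclear-norm prox independently (hence parallelizably) to each group block.
    return [prox_nuclear(Wg, gamma) for Wg in W_blocks]

blocks = [np.random.default_rng(g).normal(size=(4, 3)) for g in range(3)]
print([np.linalg.matrix_rank(B) for B in prox_group_nuclear(blocks, gamma=1.0)])
```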

47 Learning with the group nuclear norm (continued). Package in Matlab, in particular: the vector space of the group nuclear norm, with overloading of the + and * operators.

48 Numerical illustration: matrix completion. Ground truth data: a synthetic low-rank matrix X, the sum of 10 rank-1 groups, normalized to have µ = 0, σ = …

49 Numerical illustration: matrix completion (continued). Observations: uniform 10% sampling of the entries X_ij with (i, j) ∈ I, with additive Gaussian noise (σ = …).

50 Numerical illustration: matrix completion (continued). [Figure: ground truth X, observed data, and solution W*.] Optimization problem:

min_{W ∈ R^{d×k}} (1/N) Σ_{(i,j) ∈ I} (1/2) (W_ij − X_ij)² + λ Ω_G(W).

Recovery error: (1/N) ‖W* − X‖² = …
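A hedged NumPy sketch of a proximal gradient loop for this group-regularized matrix completion, reusing the blockwise prox above; the block layout, step size, and λ are illustrative choices, and the groups are assumed to be non-overlapping row blocks.

```python
import numpy as np

def prox_nuclear(Wg, gamma):
    U, s, Vt = np.linalg.svd(Wg, full_matrices=False)
    return U @ np.diag(np.maximum(s - gamma, 0.0)) @ Vt

def prox_grad_group_completion(X, mask, row_groups, lam, step, n_iter=200):
    # min (1/N) sum_{(i,j) in I} 0.5*(W_ij - X_ij)^2 + lam * Omega_G(W), groups = row blocks.
    W, N = np.zeros_like(X), mask.sum()
    for _ in range(n_iter):
        grad = (W - X) * mask / N                    # gradient of the smooth data-fitting term
        V = W - step * grad
        for rows in row_groups:                      # blockwise prox of the group nuclear norm
            V[rows] = prox_nuclear(V[rows], step * lam)
        W = V
    return W

rng = np.random.default_rng(0)
X = sum(np.outer(rng.normal(size=30), rng.normal(size=20)) for _ in range(3))
mask = rng.random(X.shape) < 0.1
W_hat = prox_grad_group_completion(X, mask, [slice(0, 10), slice(10, 20), slice(20, 30)],
                                   lam=0.05, step=1.0, n_iter=100)
print(np.linalg.norm(W_hat - X) / X.size)
```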

51 Outline: 1 Unified view of smoothing techniques; 2 Conditional gradient algorithms for doubly nonsmooth learning; 3 Regularization by group nuclear norm; 4 Conclusion and perspectives.

52 Summary. Smoothing: a versatile tool in optimization, with ways to combine it with many existing algorithms. Time-varying smoothing: in theory, a convergence analysis for the minimization; in practice, it recovers the performance of the best fixed smoothing with no need to tune γ. Group nuclear norm: theory and practice to combine group and rank sparsity, including overlapping groups.

53 Perspectives. Smoothing for faster convergence: Moreau-Yosida smoothing can be used to improve the condition number of poorly conditioned objectives before applying linearly convergent convex optimization algorithms [Hongzhou et al. 2017]. Smoothing for better prediction: smoothing can be adapted to the properties of the dataset and used to improve the prediction performance of machine learning algorithms. Learning group structure and weights for better prediction: the group structure in the group nuclear norm can be learned, to leverage underlying structure and improve prediction; extensions to a group Schatten norm. Potential applications of the group nuclear norm: multi-attribute classification; multiple tree hierarchies; dimensionality reduction and feature selection (e.g. concatenate features, avoid PCA).

54 Thank you.
