Nonsmooth Optimization for Statistical Learning with Structured Matrix Regularization


1 Nonsmooth Optimization for Statistical Learning with Structured Matrix Regularization. PhD defense of Federico Pierucci. Thesis supervised by Prof. Zaid Harchaoui, Prof. Anatoli Juditsky, and Dr. Jérôme Malick. Université Grenoble Alpes, Inria - Thoth and BiPoP, Inria Grenoble. June 23rd, 2017.

2 Application 1: Collaborative filtering. Collaborative filtering for recommendation systems; a matrix completion optimization problem. [Table: ratings matrix X with rows Albert, Ben, Celine, Diana, Elia, Franz and columns film 1, film 2, film 3.] Data: for user i and movie j, X_ij ∈ R, with (i, j) ∈ I the set of known ratings. Purpose: predict a future rating, i.e. X_ij = ? for a new pair (i, j).

3 Application 1: Collaborative filtering (continued). Low rank assumption: movies can be divided into a small number of types. For example, solve

min_{W ∈ R^{d×k}} (1/N) Σ_{(i,j) ∈ I} |W_ij − X_ij| + λ ‖W‖_{σ,1}

where ‖W‖_{σ,1} is the nuclear norm (the sum of the singular values), a convex surrogate of the rank.
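To make the objective concrete, here is a small NumPy sketch of this matrix-completion objective (L1 data fit over observed entries plus nuclear-norm penalty). The names W, X, mask and lam are illustrative; the thesis code itself was written in Matlab, so this is only an assumed sketch of the formula above.

```python
import numpy as np

def nuclear_norm(W):
    # Sum of singular values of W.
    return np.linalg.svd(W, compute_uv=False).sum()

def completion_objective(W, X, mask, lam):
    # L1 data-fitting term over the observed entries (i, j) in I, plus nuclear-norm penalty.
    data_fit = np.abs((W - X)[mask]).sum() / mask.sum()
    return data_fit + lam * nuclear_norm(W)

# Tiny usage example with synthetic data.
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(6, 3)).astype(float)   # ratings
mask = rng.random((6, 3)) < 0.5                      # observed entries I
W = rng.normal(size=(6, 3))
print(completion_objective(W, X, mask, lam=0.1))
```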

4 Application 2: Multiclass classification. Multiclass classification of images; example: the ImageNet challenge (image ↦ marmot, image ↦ hedgehog, image ↦ ?). Data (x_i, y_i) ∈ R^d × R^k: pairs of (image, category). Purpose: predict the category for a new image, x ↦ y = ?.

5 Application 2: Multiclass classification (continued). Low rank assumption: the features are assumed to be embedded in a lower dimensional space. Multiclass version of the support vector machine (SVM):

min_{W ∈ R^{d×k}} (1/N) Σ_{i=1}^N max{0, 1 + max_{r ≠ y_i} (W_r^⊤ x_i − W_{y_i}^⊤ x_i)} + λ ‖W‖_{σ,1}

where W_j ∈ R^d is the j-th column of W; the hinge term can also be written ‖(A_{x_i,y_i} W)_+‖ for a suitable linear map A_{x,y}.
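For intuition, a minimal NumPy sketch of this multiclass hinge (top-1) loss; W has one column per class, and the helper names (scores, margins) are illustrative, not from the thesis package.

```python
import numpy as np

def multiclass_hinge(W, X, y):
    # W: (d, k) weights, X: (N, d) features, y: (N,) class indices in {0, ..., k-1}.
    scores = X @ W                                   # (N, k) scores W_r^T x_i
    true = scores[np.arange(len(y)), y]              # W_{y_i}^T x_i
    scores[np.arange(len(y)), y] = -np.inf           # exclude r = y_i from the max
    margins = 1.0 + scores.max(axis=1) - true        # 1 + max_{r != y_i}(W_r^T x_i - W_{y_i}^T x_i)
    return np.maximum(0.0, margins).mean()

rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 5)), rng.integers(0, 3, size=8)
W = rng.normal(size=(5, 3))
print(multiclass_hinge(W, X, y))
```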

6 These two problems have the form

min_{W ∈ R^{d×k}} (1/N) Σ_{i=1}^N l(y_i, F(x_i, W)) + λ ‖W‖

where the first term is the empirical risk R(W), the second term is the regularization, and ŷ_i = F(x_i, W). Notation: F prediction function, l loss function. Matrix learning problem, data: N number of examples, x_i feature vector, y_i outcome, ŷ_i predicted outcome. [Figure: data points (x_i, y_i) and two prediction functions F(·, W_1), F(·, W_2).]

7 These two problems have the form above (empirical risk R(W) plus regularization λ‖W‖). Challenges: large scale (N, k, d); robust learning with nonsmooth losses such as g(ξ) = |ξ| or g(ξ) = max{0, ξ}; generalization → nonsmooth regularization; noisy data and outliers → nonsmooth empirical risk.

8 My thesis in one slide:

min_W (1/N) Σ_{i=1}^N l(y_i, F(x_i, W)) + λ ‖W‖

1st contribution (the empirical risk term): smoothing techniques. 2nd contribution (the minimization): conditional gradient algorithms. 3rd contribution (the regularizer): group nuclear norm.

9 Part 1: Unified view of smoothing techniques for first order optimization. Motivation: smoothing is a key tool in optimization; a smooth loss allows the use of gradient-based optimization. Typical nonsmooth examples: g(ξ) = |ξ| and g(ξ) = max{0, ξ}.

10 Part 1 (continued). Contributions: a unified view of smoothing techniques for nonsmooth functions; a new example, the smoothing of the top-k error (for list ranking and classification); a study of algorithms obtained as smoothing + state-of-the-art algorithms for smooth problems.

11 Part 2: Conditional gradient algorithms for doubly nonsmooth learning. Motivations: common matrix learning problems are formulated as

min_{W ∈ R^{d×k}} R(W) + λ ‖W‖

where both the empirical risk R and the regularization are nonsmooth. A nonsmooth empirical risk (e.g. the L1 norm) is robust to noise and outliers, but standard nonsmooth optimization methods are not always scalable (e.g. with the nuclear norm).

12 Part 2 (continued). Contributions: new algorithms based on the (composite) conditional gradient; convergence analysis with rates of convergence and guarantees; numerical experiments on real data.

13 Part 3: Regularization by group nuclear norm. Motivations: structured matrices can join information coming from different sources; low-rank models improve robustness and dimensionality reduction. [Figure: a matrix W written as the sum of group blocks W_1, W_2, W_3.]

14 Part 3 (continued). Contributions: definition of a new norm for matrices with underlying groups; analysis of its convexity properties; used as a regularizer, it provides low rank by groups and aggregates models.

15 Outline: 1 Unified view of smoothing techniques; 2 Conditional gradient algorithms for doubly nonsmooth learning; 3 Regularization by group nuclear norm; 4 Conclusion and perspectives.

16 Smoothing techniques. Purpose: to smooth a convex function g : R^n → R.

17 Smoothing techniques (continued). Purpose: to smooth a convex function g : R^n → R. Two techniques:

1) Product convolution [Bertsekas 1978] [Duchi et al. 2012]:
g^pc_γ(ξ) := ∫_{R^n} g(ξ − z) (1/γ) µ(z/γ) dz, with µ a probability density.

2) Infimal convolution [Moreau 1965] [Nesterov 2007] [Beck, Teboulle 2012]:
g^ic_γ(ξ) := inf_{z ∈ R^n} { g(ξ − z) + γ ω(z/γ) }, with ω a smooth convex function.

18 Smoothing techniques (continued). Result: with either technique, g_γ is a uniform approximation of g, i.e. there exist m, M ≥ 0 such that −γ m ≤ g_γ(x) − g(x) ≤ γ M; and g_γ is L_γ-smooth, i.e. g_γ is differentiable and convex with ‖∇g_γ(x) − ∇g_γ(y)‖_* ≤ L_γ ‖x − y‖ (L_γ proportional to 1/γ), where ‖·‖_* is the dual norm of ‖·‖.

19 Smoothing surrogates of nonsmooth functions. Purpose: obtain a g_γ to be used inside algorithms, with a (possibly) explicit expression that is easy to evaluate numerically. Elementary example (in R): the absolute value g(x) = |x|.

Product convolution with µ(x) = (1/√(2π)) e^{−x²/2}:
g^pc_γ(x) = x F(x/γ) − x F(−x/γ) + γ √(2/π) e^{−x²/(2γ²)},
where F(x) := (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt is the cumulative distribution function of the Gaussian.

Infimal convolution with ω(x) = (1/2) x²:
g^ic_γ(x) = x²/(2γ) if |x| ≤ γ, and |x| − γ/2 if |x| > γ (the Huber function).

Motivating nonsmooth function: the top-k loss (next).
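A small NumPy sketch of these two smoothings of |x| (the Gaussian product convolution in closed form and the Huber function); the helper names are illustrative assumptions, not the thesis code.

```python
import numpy as np
from math import erf, sqrt, pi

def gauss_cdf(x):
    # Standard normal CDF F(x).
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def abs_product_convolution(x, gamma):
    # Gaussian smoothing of |x|: x F(x/g) - x F(-x/g) + g sqrt(2/pi) exp(-x^2 / (2 g^2)).
    return (x * gauss_cdf(x / gamma) - x * gauss_cdf(-x / gamma)
            + gamma * sqrt(2.0 / pi) * np.exp(-x**2 / (2.0 * gamma**2)))

def abs_infimal_convolution(x, gamma):
    # Huber smoothing of |x|.
    return np.where(np.abs(x) <= gamma, x**2 / (2.0 * gamma), np.abs(x) - gamma / 2.0)

for x in (-2.0, -0.1, 0.0, 0.5, 3.0):
    print(x, abs_product_convolution(x, 0.5), abs_infimal_convolution(x, 0.5))
```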

20 Top-3 loss for classification. Motivating nonsmooth function: the top-k loss. Example: the top-3 loss.

21 Top-3 loss for classification (continued). Ground truth: Cat. Prediction: 1 Paper towel, 2 Wall, 3 Cat ⇒ loss = 0. The prediction is good if the true class is among the first 3 predicted.

22 Top-3 loss for ranking. Ground truth: 1 Janis Joplin, 2 David Bowie, 3 Eric Clapton, 4 Patti Smith, 5 Jean-Jacques Goldman, 6 Francesco Guccini, ... Prediction: 1 David Bowie, 2 Patti Smith, 3 Janis Joplin. We predict an ordered list, and the loss counts the mismatches with respect to the true list.

23 Smoothing of the top-k loss. The convex top-k error function can be written as a sublinear function:

g(x) = max_{z ∈ Z} ⟨x, z⟩, with Z := { z ∈ R^n : 0 ≤ z_i ≤ 1/k, Σ_{i=1}^n z_i ≤ 1 } = cube ∩ simplex.

Case k = 1 (top-1): g(x) = max{0, max_i x_i}. Infimal convolution with ω(x) = Σ_{i=1}^n (x_i ln(x_i) − x_i) gives

g_γ(x) = γ (1 + ln Σ_{i=1}^n e^{x_i/γ}) if Σ_{i=1}^n e^{x_i/γ} > 1, and g_γ(x) = γ Σ_{i=1}^n e^{x_i/γ} otherwise.

24 Smoothing of the top-k loss (continued). For classification, this recovers a well-known result from statistics [Hastie et al., 2008]: with γ = 1, the smoothed top-1 loss is the multinomial logistic loss.
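A minimal NumPy sketch of this smoothed top-1 function (entropy infimal-convolution smoothing), assuming the two-branch expression reconstructed above; names are illustrative.

```python
import numpy as np

def smoothed_top1(x, gamma):
    # Entropy-smoothed top-1 error g_gamma(x) for a score vector x.
    s = np.exp(x / gamma).sum()
    if s > 1.0:
        return gamma * (1.0 + np.log(s))
    return gamma * s

x = np.array([-1.0, 0.2, 0.5])
for gamma in (1.0, 0.1, 0.01):
    # As gamma decreases, the value approaches max{0, max_i x_i}.
    print(gamma, smoothed_top1(x, gamma), max(0.0, x.max()))
```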

25 Smoothing of the top-k loss, case k > 1. Infimal convolution with the smooth function ω leads to

g_γ(x) = Σ_{i=1}^n H_γ(x_i + λ*(x, γ)), with H_γ(t) = 0 if t < 0, (1/2) t² if t ∈ [0, 1/k], t/k − 1/(2k²) if t > 1/k,

where λ*(x, γ) is obtained by solving an auxiliary problem (a smooth dual problem). To evaluate g_γ(x) through the dual problem: define P_x := { x_i, x_i − 1/k : i = 1, ..., n }, the breakpoints of the piecewise-linear function Θ'(λ) = 1 − Σ_{t_j ∈ P_x} π_{[0,1/k]}(t_j + λ), where π_{[0,1/k]} is the projection onto [0, 1/k]; find a, b ∈ P_x such that Θ'(a) ≤ 0 ≤ Θ'(b); then

λ*(x, γ) = max{ 0, a − Θ'(a)(b − a) / (Θ'(b) − Θ'(a)) }.

[Figure: the piecewise-linear function Θ'(λ) with its breakpoints t_j and the bracketing points a, b.]

26 Outline: 1 Unified view of smoothing techniques; 2 Conditional gradient algorithms for doubly nonsmooth learning; 3 Regularization by group nuclear norm; 4 Conclusion and perspectives.

27 Matrix learning problem:

min_{W ∈ R^{d×k}} R(W) + λ Ω(W), with both terms nonsmooth.

Empirical risk R(W) := (1/N) Σ_{i=1}^N l(W, x_i, y_i): top-k loss for ranking and multiclass classification; L1 loss for regression; e.g. l(W, x, y) := ‖A_{x,y}W‖_1 or l(W, x, y) := ‖(A_{x,y}W)_+‖_1.

Regularizer Ω(W) (typically a norm): nuclear norm ‖W‖_{σ,1} (sparsity on singular values); L1 norm ‖W‖_1 := Σ_{i=1}^d Σ_{j=1}^k |W_ij| (sparsity on entries); group nuclear norm Ω_G(W) (contribution 3: sparsity, feature selection).

28 Existing algorithms for nonsmooth optimization of min_{W ∈ R^{d×k}} R(W) + λ Ω(W) (both terms nonsmooth): subgradient and bundle algorithms [Nemirovski, Yudin 1976] [Lemaréchal 1979]; proximal algorithms [Douglas, Rachford 1956]. These algorithms are not scalable for the nuclear norm: the iteration cost is a full SVD = O(dk²).

29 What if the loss were smooth? min_{W ∈ R^{d×k}} S(W) + λ Ω(W), with S smooth and Ω nonsmooth. Algorithms with faster convergence when S is smooth: proximal gradient algorithms [Nesterov 2005] [Beck, Teboulle 2009], still not scalable for the nuclear norm (iteration cost: full SVD); (composite) conditional gradient algorithms [Frank, Wolfe 1956] [Harchaoui, Juditsky, Nemirovski 2013], with efficient iterations for the nuclear norm (iteration cost: computing the largest singular value = O(dk)).

30 Composite conditional gradient algorithm (state of the art for min_{W ∈ R^{d×k}} S(W) + λ Ω(W), with S smooth and Ω nonsmooth):

Let W_0 = 0 and r_0 such that Ω(W*) ≤ r_0.
for t = 0, ..., T do
  Compute Z_t = argmin_{D : Ω(D) ≤ r_t} ⟨∇S(W_t), D⟩  [gradient step]
  (α_t, β_t) = argmin_{α, β ≥ 0, α+β ≤ 1} S(αZ_t + βW_t) + λ(α + β) r_t  [optimal stepsize]
  Update W_{t+1} = α_t Z_t + β_t W_t and r_{t+1} = (α_t + β_t) r_t
end for

where W_t, Z_t, D ∈ R^{d×k}. This is efficient and scalable for some Ω, e.g. the nuclear norm, where Z_t is a rank-one matrix u v^⊤ built from the leading singular vectors of the gradient.
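As an illustration, here is a rough NumPy sketch of this composite conditional gradient scheme for a smooth least-squares fit with a nuclear-norm regularizer; the rank-one linear minimization step uses the leading singular pair of the gradient. The helper names and the crude grid search standing in for the exact (α, β) step are assumptions for the sketch, not the thesis implementation.

```python
import numpy as np
from scipy.sparse.linalg import svds

def composite_cond_grad(grad_S, S, lam, shape, r0, T=50):
    """Composite conditional gradient for min S(W) + lam * ||W||_nuclear (sketch)."""
    W, r = np.zeros(shape), r0
    for _ in range(T):
        G = grad_S(W)
        u, s, vt = svds(G, k=1)                 # leading singular pair of the gradient
        Z = -r * np.outer(u[:, 0], vt[0])       # rank-one minimizer of <G, D> over ||D||_* <= r
        # Crude grid search replacing the exact (alpha, beta) stepsize problem.
        best, ab = np.inf, (0.0, 1.0)
        for a in np.linspace(0, 1, 21):
            for b in np.linspace(0, 1 - a, 21):
                val = S(a * Z + b * W) + lam * (a + b) * r
                if val < best:
                    best, ab = val, (a, b)
        a, b = ab
        W, r = a * Z + b * W, (a + b) * r
    return W

# Usage: matrix completion with a smooth squared loss on the observed entries.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20)); mask = rng.random(X.shape) < 0.3
S = lambda W: 0.5 * np.sum(((W - X) * mask) ** 2)
grad_S = lambda W: (W - X) * mask
W_hat = composite_cond_grad(grad_S, S, lam=1.0, shape=X.shape, r0=50.0, T=20)
```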

31 Conditional gradient despite a nonsmooth loss. Idea: use the conditional gradient, replacing ∇S(W_t) with a subgradient s_t ∈ ∂R(W_t).

32 Conditional gradient despite a nonsmooth loss (continued). This naive substitution fails. Simple counterexample in R²: min_{w ∈ R²} ‖Aw + b‖_1 + ‖w‖_1. [Figure: in the (w_1, w_2) plane, the atoms of A and the level sets of R_emp; the algorithm is stuck at (0, 0).] Here the algorithm using a subgradient of the nonsmooth empirical risk does not converge to a solution.

33 Smoothed composite conditional gradient algorithm. Idea: replace the nonsmooth loss with a smoothed loss,

min_{W ∈ R^{d×k}} R(W) + λ Ω(W)  (nonsmooth)  →  min_{W ∈ R^{d×k}} R_γ(W) + λ Ω(W)  (smooth),

where {R_γ}_{γ>0} is a family of smooth approximations of R.

Let W_0 = 0 and r_0 such that Ω(W*) ≤ r_0.
for t = 0, ..., T do
  Compute Z_t = argmin_{D : Ω(D) ≤ r_t} ⟨∇R_{γ_t}(W_t), D⟩
  (α_t, β_t) = argmin_{α, β ≥ 0, α+β ≤ 1} R_{γ_t}(αZ_t + βW_t) + λ(α + β) r_t
  Update W_{t+1} = α_t Z_t + β_t W_t and r_{t+1} = (α_t + β_t) r_t
end for

with α_t, β_t the stepsizes and γ_t the smoothing parameter. Note: we still want to solve the initial doubly nonsmooth problem.
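To connect this with the previous sketch, one can feed it a smoothed risk whose smoothing parameter decreases along the iterations. Below is a hedged NumPy sketch of a Huber-smoothed L1 risk for matrix completion together with a schedule γ_t = γ_0 / √(t + 1); the function names and constants are illustrative, the precise schedules being those of the convergence theorems that follow.

```python
import numpy as np

def huber(t, gamma):
    # Infimal-convolution (Huber) smoothing of |t|.
    return np.where(np.abs(t) <= gamma, t**2 / (2 * gamma), np.abs(t) - gamma / 2)

def huber_grad(t, gamma):
    # Gradient of the Huber smoothing: t/gamma in the quadratic zone, sign(t) outside.
    return np.clip(t / gamma, -1.0, 1.0)

def R_gamma(W, X, mask, gamma):
    # Smoothed L1 empirical risk over the observed entries.
    return huber((W - X)[mask], gamma).sum() / mask.sum()

def R_gamma_grad(W, X, mask, gamma):
    G = np.zeros_like(W)
    G[mask] = huber_grad((W - X)[mask], gamma) / mask.sum()
    return G

# Time-varying smoothing schedule (p = 1/2), to be used inside the conditional-gradient loop.
gamma_schedule = lambda t, gamma0=1.0: gamma0 / np.sqrt(t + 1)
```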

34 Convergence analysis for the doubly nonsmooth problem min_{W ∈ R^{d×k}} F(W) = R(W) + λ Ω(W), with W* an optimal solution and γ_t the smoothing parameter (playing the role of a stepsize).

Convergence theorems:
Fixed smoothing of R, γ_t = γ: F(W_t) − F(W*) ≤ γ m + M / (γ (t + 14)). Dimensionality freedom: M does not depend on the dimension, only on ω or µ; the best γ depends on the required accuracy ε.
Time-varying smoothing of R, γ_t = γ_0 / √(t + 1): F(W_t) − F(W*) ≤ C / √t. Dimensionality freedom: C does not depend on the dimension; it depends on ω or µ, γ_0 and W*.

35 Algorithm implementation. Package: all the Matlab code was written from scratch, in particular multiclass SVM, top-k multiclass SVM, and all the other smoothed functions. Memory: efficient memory management, tools to operate with low-rank variables, tools to work with sparse sub-matrices of low-rank matrices (collaborative filtering). Numerical experiments on the 2 motivating applications: fixed smoothing for matrix completion (regression); time-varying smoothing for top-5 multiclass classification.

36 Fixed smoothing: example with matrix completion (regression). Data: MovieLens, d users, k movies, observed ratings (= 1.3% of the entries) normalized into [0, 1]. Benchmark: the iterates W_t are generated on a training set; we observe R(W_t) on the validation set and choose the best γ, i.e. the one that minimizes R(W_t) on the validation set.

37 Fixed smoothing (continued). [Figure: empirical risk on the validation set vs. iterations, λ fixed, for γ ∈ {0.01, 0.1, 0.5, 1, 5, 10, 50} and the best γ.] Each γ gives a different optimization problem: tiny smoothing → slower convergence; large smoothing → an objective much different from the initial one.

38 Time-varying smoothing: example with top-5 multiclass classification. Data: ImageNet, k = 134 classes, N images; features: BOW, d = 4096 features. Benchmark: the iterates W_t are generated on a training set; we observe the top-5 misclassification error on the validation set; for comparison, the best fixed smoothing parameter is found using the other benchmark.

39 Time-varying smoothing (continued). Time-varying smoothing parameter γ_t = γ_0 (1 + t)^{−p}, with p ∈ {1/2, 1}. [Figure: top-5 misclassification error on the validation set vs. iterations, for γ_0 ∈ {0.01, 0.1, 1, 10, 100} with p = 1 and p = 0.5, compared to the non-adaptive choice γ = 0.1.] No need to tune γ_0: time-varying smoothing matches the performance of the best experimentally tuned fixed smoothing.

40 Outline: 1 Unified view of smoothing techniques; 2 Conditional gradient algorithms for doubly nonsmooth learning; 3 Regularization by group nuclear norm; 4 Conclusion and perspectives.

41 Group nuclear norm: a matrix generalization of the popular group lasso norm [Turlach et al., 2005] [Yuan and Lin, 2006] [Zhao et al., 2009] [Jacob et al., 2009]. Recall the nuclear norm ‖W‖_{σ,1}: the sum of the singular values of W.

42 Group nuclear norm (continued). [Figure: a matrix W decomposed into group blocks W_1, W_2, W_3 via immersions i_g and projections Π_g, with G = {1, 2, 3}.] Definition:

Ω_G(W) := min_{W = Σ_{g ∈ G} i_g(W_g)} Σ_{g ∈ G} α_g ‖W_g‖_{σ,1}.

[Tomioka, Suzuki 2013] considered the case of non-overlapping groups.
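For non-overlapping groups the decomposition in the definition is unique, so Ω_G reduces to a weighted sum of blockwise nuclear norms. Here is a small NumPy sketch under that assumption, with groups given as (row-slice, column-slice) pairs; this representation is illustrative, not the thesis package.

```python
import numpy as np

def group_nuclear_norm(W, groups, alphas):
    # groups: list of (row_slice, col_slice) defining non-overlapping blocks W_g.
    # alphas: positive weight alpha_g for each group.
    total = 0.0
    for (rows, cols), alpha in zip(groups, alphas):
        Wg = W[rows, cols]
        total += alpha * np.linalg.svd(Wg, compute_uv=False).sum()
    return total

W = np.random.default_rng(0).normal(size=(6, 6))
groups = [(slice(0, 3), slice(0, 3)), (slice(0, 3), slice(3, 6)), (slice(3, 6), slice(0, 6))]
print(group_nuclear_norm(W, groups, alphas=[1.0, 1.0, 1.0]))
```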

43 Group nuclear norm: convex analysis (theoretical study). Fenchel conjugate of Ω_G; dual norm of Ω_G; expression of Ω_G as a support function; convex hull of functions involving the rank.

44 Convex hull: results. In words, the convex hull of a function is the largest convex function lying below it. Properly restricted to a ball, the nuclear norm is the convex hull of the rank [Fazel 2001]. Generalization (Theorem): properly restricted to a ball, the group nuclear norm is the convex hull of both
the reweighted group rank function Ω^rank_G(W) := inf_{W = Σ_{g ∈ G} i_g(W_g)} Σ_{g ∈ G} α_g rank(W_g),
and the reweighted restricted rank function Ω^rank(W) := min_{g ∈ G} α_g rank(W) + δ_g(W), with δ_g an indicator function.
Consequence: learning with the group nuclear norm enforces a low-rank property on the groups.

45 Learning with the group nuclear norm. Usual optimization algorithms can handle the group nuclear norm: composite conditional gradient algorithms; (accelerated) proximal gradient algorithms.

46 Learning with the group nuclear norm (continued). Illustration with the proximal gradient optimization algorithm: the key computations are parallelized over the groups, which gives good scalability when there are many small groups. Prox of the group nuclear norm:

prox_{γΩ_G}((W_g)_g) = ( U_g D_γ(S_g) V_g^⊤ )_{g ∈ G},

where W_g = U_g S_g V_g^⊤ is the SVD of W_g and D_γ is the soft-thresholding operator D_γ(S) = Diag({max{s_i − γ, 0}}_{1 ≤ i ≤ r}).
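A minimal NumPy sketch of this blockwise prox (SVD followed by soft-thresholding of the singular values), assuming non-overlapping groups and unit weights; names are illustrative.

```python
import numpy as np

def prox_nuclear(Wg, gamma):
    # prox of gamma * ||.||_nuclear: soft-threshold the singular values.
    U, s, Vt = np.linalg.svd(Wg, full_matrices=False)
    return U @ np.diag(np.maximum(s - gamma, 0.0)) @ Vt

def prox_group_nuclear(W_blocks, gamma):
    # Apply the nuclear-norm prox independently (hence parallelizably) to each group block.
    return [prox_nuclear(Wg, gamma) for Wg in W_blocks]

blocks = [np.random.default_rng(g).normal(size=(4, 3)) for g in range(3)]
print([np.linalg.matrix_rank(B) for B in prox_group_nuclear(blocks, gamma=1.0)])
```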

47 Learning with the group nuclear norm (continued). Package in Matlab, in particular: the vector space of the group nuclear norm, with overloading of the + and * operators.

48 Numerical illustration: matrix completion. Ground truth data: a synthetic low-rank matrix X, the sum of 10 rank-1 groups, normalized to have µ = 0, σ = …

49 Numerical illustration: matrix completion (continued). Observations: uniform 10% sampling of the entries X_ij with (i, j) ∈ I, with additive Gaussian noise (σ = …).

50 Numerical illustration: matrix completion (continued). [Figure: ground truth X, observed data, and solution W*.] Optimization problem:

min_{W ∈ R^{d×k}} (1/N) Σ_{(i,j) ∈ I} (1/2) (W_ij − X_ij)² + λ Ω_G(W).

Recovery error: (1/N) ‖W* − X‖² = …
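A hedged NumPy sketch of a proximal gradient loop for this group-regularized matrix completion, reusing the blockwise prox above; the block layout, step size, and λ are illustrative choices, and the groups are assumed to be non-overlapping row blocks.

```python
import numpy as np

def prox_nuclear(Wg, gamma):
    U, s, Vt = np.linalg.svd(Wg, full_matrices=False)
    return U @ np.diag(np.maximum(s - gamma, 0.0)) @ Vt

def prox_grad_group_completion(X, mask, row_groups, lam, step, n_iter=200):
    # min (1/N) sum_{(i,j) in I} 0.5*(W_ij - X_ij)^2 + lam * Omega_G(W), groups = row blocks.
    W, N = np.zeros_like(X), mask.sum()
    for _ in range(n_iter):
        grad = (W - X) * mask / N                    # gradient of the smooth data-fitting term
        V = W - step * grad
        for rows in row_groups:                      # blockwise prox of the group nuclear norm
            V[rows] = prox_nuclear(V[rows], step * lam)
        W = V
    return W

rng = np.random.default_rng(0)
X = sum(np.outer(rng.normal(size=30), rng.normal(size=20)) for _ in range(3))
mask = rng.random(X.shape) < 0.1
W_hat = prox_grad_group_completion(X, mask, [slice(0, 10), slice(10, 20), slice(20, 30)],
                                   lam=0.05, step=1.0, n_iter=100)
print(np.linalg.norm(W_hat - X) / X.size)
```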

51 Outline: 1 Unified view of smoothing techniques; 2 Conditional gradient algorithms for doubly nonsmooth learning; 3 Regularization by group nuclear norm; 4 Conclusion and perspectives.

52 Summary. Smoothing: a versatile tool in optimization, with ways to combine it with many existing algorithms. Time-varying smoothing: in theory, a convergence analysis for the minimization; in practice, it recovers the performance of the best fixed smoothing with no need to tune γ. Group nuclear norm: theory and practice to combine group and rank sparsity, including overlapping groups.

53 Perspectives. Smoothing for faster convergence: Moreau-Yosida smoothing can be used to improve the condition number of poorly conditioned objectives before applying linearly convergent convex optimization algorithms [Hongzhou et al. 2017]. Smoothing for better prediction: smoothing can be adapted to the properties of the dataset and used to improve the prediction performance of machine learning algorithms. Learning group structure and weights for better prediction: the group structure in the group nuclear norm can be learned, to leverage underlying structure and improve prediction; extensions to a group Schatten norm. Potential applications of the group nuclear norm: multi-attribute classification; multiple tree hierarchies; dimensionality reduction and feature selection (e.g. concatenate features, avoid PCA).

54 Thank you.
