Optimization for Machine Learning
1 Optimization for Machine Learning (Problems; Algorithms - C) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017)
2 Course materials

Some references:
- Introductory Lectures on Convex Optimization (Nesterov)
- Convex Optimization (Boyd & Vandenberghe)
- Nonlinear Programming (Bertsekas)
- Convex Analysis (Rockafellar)
- Fundamentals of Convex Analysis (Hiriart-Urruty, Lemaréchal)
- Lectures on Modern Convex Optimization (Nemirovski)
- Optimization for Machine Learning (Sra, Nowozin, Wright)
- Theory of Convex Optimization for Machine Learning (Bubeck)
- NIPS 2016 Optimization Tutorial (Bach, Sra)

Some related courses:
- EE227A, Spring 2013 (Sra, UC Berkeley); Spring 2014 (Sra, CMU)
- EE364a,b (Boyd, Stanford)
- EE236b,c (Vandenberghe, UCLA)

Venues: NIPS, ICML, UAI, AISTATS, SIOPT, Math. Prog.

Suvrit Sra, Optimization for Machine Learning, 2 / 29
3 Lecture Plan: Introduction (3 lectures); Problems and algorithms (5 lectures); Non-convex optimization, perspectives (2 lectures)
4 Nonsmooth convergence rates

Theorem (Nesterov). Let $B = \{x : \|x - x_0\|_2 \le D\}$ and assume $x^* \in B$. There exists a convex function $f \in C^0_L(B)$ (with $L > 0$) such that, for $0 \le k \le n-1$, the lower bound
$$ f(x_k) - f(x^*) \ge \frac{LD}{2(1 + \sqrt{k+1})} $$
holds for any algorithm that generates $x_k$ by linearly combining the previous iterates and subgradients.

Exercise: So design problems where we can do better!
5 Composite problems
6 Composite objectives

Frequently ML problems take the regularized form
$$ \text{minimize} \quad f(x) := l(x) + r(x). $$

Example: $l(x) = \tfrac{1}{2}\|Ax - b\|^2$ and $r(x) = \lambda\|x\|_1$: Lasso, L1-LS, compressed sensing.

Example: $l(x)$ a logistic loss, and $r(x) = \lambda\|x\|_1$: L1-logistic regression, sparse LR.
7 Composite objective minimization

$$ \text{minimize} \quad f(x) := l(x) + r(x) $$

Subgradient method: $x^{k+1} = x^k - \alpha_k g^k$, where $g^k \in \partial f(x^k)$; converges slowly, at rate $O(1/\sqrt{k})$.

But $f$ is smooth plus nonsmooth: we should exploit the smoothness of $l$ for a better method!
8 Proximal Gradient Method

$\min_{x \in \mathcal{X}} f(x)$: projected (sub)gradient, $x \leftarrow P_{\mathcal{X}}(x - \alpha \nabla f(x))$.

$\min f(x) + h(x)$: proximal gradient, $x \leftarrow \operatorname{prox}_{\alpha h}(x - \alpha \nabla f(x))$, where $\operatorname{prox}_{\alpha h}$ denotes the proximity operator of $h$.

Why? If we can compute $\operatorname{prox}_h(x)$ easily, prox-grad converges as fast as gradient methods do for smooth problems!
9 Proximity operator

Projection:
$$ P_{\mathcal{X}}(y) := \operatorname{argmin}_{x \in \mathbb{R}^n} \ \tfrac{1}{2}\|x - y\|_2^2 + \mathbb{1}_{\mathcal{X}}(x). $$
Proximity: replace the indicator $\mathbb{1}_{\mathcal{X}}$ by a closed convex function $r$:
$$ \operatorname{prox}_r(y) := \operatorname{argmin}_{x \in \mathbb{R}^n} \ \tfrac{1}{2}\|x - y\|_2^2 + r(x). $$
10 Proximity operator. (Figure: projection onto the $\ell_1$-norm ball of radius $\rho(\lambda)$.)
11 Proximity operators

Exercise: Let $r(x) = \|x\|_1$. Solve $\operatorname{prox}_{\lambda r}(y)$:
$$ \min_{x \in \mathbb{R}^n} \ \tfrac{1}{2}\|x - y\|_2^2 + \lambda\|x\|_1. $$
Hint 1: The problem decomposes into $n$ independent subproblems of the form $\min_{x \in \mathbb{R}} \tfrac{1}{2}(x - y)^2 + \lambda|x|$.
Hint 2: Consider the two cases: either $x = 0$ or $x \neq 0$.

Exercise: Moreau decomposition: $y = \operatorname{prox}_h(y) + \operatorname{prox}_{h^*}(y)$, where $h^*$ is the Fenchel conjugate of $h$ (notice the analogy to $V = S \oplus S^{\perp}$ in linear algebra).
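The scalar subproblem in Hint 1 has the well-known closed-form solution, soft-thresholding: $x^* = \operatorname{sign}(y)\max(|y| - \lambda, 0)$. A minimal sketch in Python (my own illustration, not code from the slides):

```python
import math

def soft_threshold(y, lam):
    """Elementwise prox of lam*||.||_1: sign(y_i) * max(|y_i| - lam, 0)."""
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in y]

# Entries with |y_i| <= lam are set exactly to zero; the rest shrink toward 0.
x = soft_threshold([3.0, -0.5, 1.2], 1.0)
```

This operator reappears below as the prox step of every $\ell_1$-regularized example.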
12 How to cook up prox-grad?

Lemma. $x^* = \operatorname{prox}_{\alpha h}(x^* - \alpha \nabla f(x^*))$ for any $\alpha > 0$ iff $x^*$ is optimal:
$$ 0 \in \nabla f(x^*) + \partial h(x^*) $$
$$ 0 \in \alpha \nabla f(x^*) + \alpha \partial h(x^*) $$
$$ x^* \in \alpha \nabla f(x^*) + (I + \alpha \partial h)(x^*) $$
$$ x^* - \alpha \nabla f(x^*) \in (I + \alpha \partial h)(x^*) $$
$$ x^* = (I + \alpha \partial h)^{-1}(x^* - \alpha \nabla f(x^*)) $$
$$ x^* = \operatorname{prox}_{\alpha h}(x^* - \alpha \nabla f(x^*)) $$

The above fixed-point equation suggests the iteration
$$ x^{k+1} = \operatorname{prox}_{\alpha_k h}(x^k - \alpha_k \nabla f(x^k)). $$
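The fixed-point iteration above can be sketched on a toy one-dimensional lasso-type instance, $\min \tfrac{1}{2}(ax - b)^2 + \lambda|x|$ (an illustrative problem of my choosing, not from the lecture); here $\nabla f(x) = a(ax - b)$, $L = a^2$, and the prox is soft-thresholding:

```python
def prox_grad(a, b, lam, iters=200):
    """Proximal gradient for min 0.5*(a*x - b)**2 + lam*|x| (1-D sketch)."""
    L = a * a                # Lipschitz constant of f'(x) = a*(a*x - b)
    alpha = 1.0 / L          # constant step size 1/L
    x = 0.0
    for _ in range(iters):
        g = a * (a * x - b)  # gradient step on the smooth part
        y = x - alpha * g
        t = alpha * lam      # prox of alpha*lam*|.| is soft-thresholding by t
        x = max(abs(y) - t, 0.0) * (1.0 if y >= 0 else -1.0)
    return x

# With a=2, b=3, lam=1 the optimality condition 0 in a*(a*x - b) + lam*sign(x)
# gives x* = (a*b - lam) / a**2 = 1.25.
x_star = prox_grad(2.0, 3.0, 1.0)
```

In one dimension with the exact step $1/L$ the iteration lands on the fixed point immediately; in general it converges at the $O(1/k)$ rate analyzed below.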
13 Convergence
14 Proximal gradient works; why?

$$ x^{k+1} = \operatorname{prox}_{\alpha_k h}(x^k - \alpha_k \nabla f(x^k)) \iff x^{k+1} = x^k - \alpha_k G_{\alpha_k}(x^k). $$
Gradient mapping: the gradient-like object
$$ G_\alpha(x) = \tfrac{1}{\alpha}\big(x - \operatorname{prox}_{\alpha h}(x - \alpha \nabla f(x))\big). $$
Our lemma shows: $G_\alpha(x) = 0$ if and only if $x$ is optimal, so $G_\alpha$ is analogous to $\nabla f$. If $x$ is locally optimal, then $G_\alpha(x) = 0$ (nonconvex $f$).
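One can check numerically that $G_\alpha$ vanishes at a minimizer. On the same kind of toy scalar instance (my own example: $f(x) = \tfrac{1}{2}(2x - 3)^2$, $h(x) = |x|$, whose minimizer is $x^* = 1.25$):

```python
def grad_map(x, a, b, lam, alpha):
    """Gradient mapping G_alpha(x) = (x - prox_{alpha*h}(x - alpha*f'(x))) / alpha
    for f(x) = 0.5*(a*x - b)**2 and h(x) = lam*|x| (1-D sketch)."""
    y = x - alpha * a * (a * x - b)                       # forward gradient step
    t = alpha * lam
    p = max(abs(y) - t, 0.0) * (1.0 if y >= 0 else -1.0)  # prox = soft-threshold
    return (x - p) / alpha

# At the minimizer x* = 1.25 the mapping is zero; away from it, it is not.
g_at_opt = grad_map(1.25, 2.0, 3.0, 1.0, 0.25)
g_elsewhere = grad_map(0.0, 2.0, 3.0, 1.0, 0.25)
```

This mirrors the slide's claim that $G_\alpha$ plays the role of $\nabla f$ for the composite objective.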
15 Convergence analysis

Assumption: Lipschitz continuous gradient, denoted $f \in C^1_L$:
$$ \|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2. $$
- Gradient vectors of nearby points are close to each other.
- The objective function has bounded curvature.
- The speed at which the gradient varies is bounded.

Lemma (Descent). Let $f \in C^1_L$. Then
$$ f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{L}{2}\|y - x\|_2^2. $$
For convex $f$, compare with $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle$.
16 Descent lemma

Proof. Since $f \in C^1_L$, by Taylor's theorem, with $z_t = x + t(y - x)$ we have
$$ f(y) = f(x) + \int_0^1 \langle \nabla f(z_t), y - x \rangle \, dt. $$
Add and subtract $\langle \nabla f(x), y - x \rangle$ on the right-hand side:
$$ f(y) - f(x) - \langle \nabla f(x), y - x \rangle = \int_0^1 \langle \nabla f(z_t) - \nabla f(x), y - x \rangle \, dt $$
$$ \big| f(y) - f(x) - \langle \nabla f(x), y - x \rangle \big| \le \int_0^1 \big| \langle \nabla f(z_t) - \nabla f(x), y - x \rangle \big| \, dt $$
$$ \le \int_0^1 \|\nabla f(z_t) - \nabla f(x)\|_2 \, \|y - x\|_2 \, dt \ \le\ L \int_0^1 t \, \|x - y\|_2^2 \, dt \ =\ \tfrac{L}{2}\|x - y\|_2^2. $$
This bounds $f(y)$ around $x$ with quadratic functions.
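A quick numerical sanity check of the descent lemma (my own illustration, not part of the slides): the logistic-type function $f(x) = \log(1 + e^x)$ has $f'(x) = \sigma(x)$ and $f''(x) = \sigma(x)(1 - \sigma(x)) \le 1/4$, so it lies in $C^1_L$ with $L = 1/4$:

```python
import math

def f(x):
    """Softplus: log(1 + exp(x))."""
    return math.log1p(math.exp(x))

def df(x):
    """Its derivative, the sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

L = 0.25  # sup of f'' = sup sigma*(1 - sigma) = 1/4

# Verify f(y) <= f(x) + f'(x)*(y - x) + (L/2)*(y - x)**2 on a grid of points.
ok = all(
    f(y) <= f(x) + df(x) * (y - x) + 0.5 * L * (y - x) ** 2 + 1e-12
    for x in [-3.0, -1.0, 0.0, 2.0]
    for y in [-2.5, 0.5, 1.0, 4.0]
)
```

The lemma guarantees `ok` for any valid Lipschitz constant; trying `L = 0.01` instead would make the check fail for distant pairs.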
17 Descent lemma corollary

Let $y = x - \alpha G_\alpha(x)$. Then
$$ f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{L}{2}\|y - x\|_2^2 = f(x) - \alpha \langle \nabla f(x), G_\alpha(x) \rangle + \tfrac{\alpha^2 L}{2}\|G_\alpha(x)\|_2^2. $$
Corollary. So if $0 \le \alpha \le 1/L$, we have
$$ f(y) \le f(x) - \alpha \langle \nabla f(x), G_\alpha(x) \rangle + \tfrac{\alpha}{2}\|G_\alpha(x)\|_2^2. $$

Lemma. Let $y = x - \alpha G_\alpha(x)$. Then, for any $z$ we have
$$ f(y) + h(y) \le f(z) + h(z) + \langle G_\alpha(x), x - z \rangle - \tfrac{\alpha}{2}\|G_\alpha(x)\|_2^2. $$
Exercise: Prove! (Hint: $f$ and $h$ are convex, and $G_\alpha(x) - \nabla f(x) \in \partial h(y)$.)
18 Convergence analysis

We've actually shown that $x^+ = x - \alpha G_\alpha(x)$ is a descent method. Write $\phi = f + h$; plug $z = x$ into the lemma to obtain
$$ \phi(x^+) \le \phi(x) - \tfrac{\alpha}{2}\|G_\alpha(x)\|_2^2. $$
Exercise: Why does this inequality suffice to show convergence?

Use $z = x^*$ in the lemma to obtain progress in terms of iterates:
$$ \phi(x^+) - \phi^* \le \langle G_\alpha(x), x - x^* \rangle - \tfrac{\alpha}{2}\|G_\alpha(x)\|_2^2 $$
$$ = \tfrac{1}{2\alpha}\big[ 2\alpha \langle G_\alpha(x), x - x^* \rangle - \alpha^2 \|G_\alpha(x)\|_2^2 \big] $$
$$ = \tfrac{1}{2\alpha}\big[ \|x - x^*\|_2^2 - \|x - x^* - \alpha G_\alpha(x)\|_2^2 \big] = \tfrac{1}{2\alpha}\big[ \|x - x^*\|_2^2 - \|x^+ - x^*\|_2^2 \big]. $$
19 Convergence rate

Set $x \leftarrow x_{i-1}$, $x^+ \leftarrow x_i$, and $\alpha = 1/L$; then, for each $i \ge 1$,
$$ \phi(x_i) - \phi^* \le \tfrac{L}{2}\big[ \|x_{i-1} - x^*\|_2^2 - \|x_i - x^*\|_2^2 \big]. $$
Adding over $i = 1, \dots, k+1$, the right-hand side telescopes:
$$ \sum_{i=1}^{k+1} \big(\phi(x_i) - \phi^*\big) \le \tfrac{L}{2}\big[ \|x_0 - x^*\|_2^2 - \|x_{k+1} - x^*\|_2^2 \big] \le \tfrac{L}{2}\|x_0 - x^*\|_2^2. $$
Since $\phi(x_k)$ is a decreasing sequence, it follows that
$$ \phi(x_{k+1}) - \phi^* \le \tfrac{1}{k+1} \sum_{i=1}^{k+1} \big(\phi(x_i) - \phi^*\big) \le \tfrac{L}{2(k+1)}\|x_0 - x^*\|_2^2. $$
This is the well-known $O(1/k)$ rate. But for $C^1_L$ convex functions, the optimal rate is $O(1/k^2)$!
20 Accelerated Proximal Gradient

$$ \min \ \phi(x) = f(x) + h(x) $$
Let $x^0 = y^0 \in \operatorname{dom} h$. For $k \ge 1$:
$$ x^k = \operatorname{prox}_{\alpha_k h}\big(y^{k-1} - \alpha_k \nabla f(y^{k-1})\big) $$
$$ y^k = x^k + \tfrac{k-1}{k+2}\big(x^k - x^{k-1}\big). $$
Framework due to Nesterov (1983, 2004); see also Beck, Teboulle (2009). Simplified analysis: Tseng (2008).
- Uses extra memory for interpolation.
- Same computational cost as ordinary prox-grad.
- Convergence rate theoretically optimal:
$$ \phi(x^k) - \phi^* \le \tfrac{2L}{(k+1)^2}\|x^0 - x^*\|_2^2. $$
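The accelerated scheme can be sketched on the same toy 1-D lasso-type instance used earlier (again my own illustrative problem, $\min \tfrac{1}{2}(ax-b)^2 + \lambda|x|$), adding only the momentum step:

```python
def accel_prox_grad(a, b, lam, iters=100):
    """Accelerated proximal gradient (Nesterov-style momentum) for
    min 0.5*(a*x - b)**2 + lam*|x|, a 1-D sketch with step size 1/L."""
    L = a * a
    alpha = 1.0 / L
    x_prev = x = y = 0.0
    for k in range(1, iters + 1):
        g = a * (a * y - b)                  # gradient at the interpolated point
        z = y - alpha * g
        t = alpha * lam
        x_prev, x = x, max(abs(z) - t, 0.0) * (1.0 if z >= 0 else -1.0)
        y = x + (k - 1.0) / (k + 2.0) * (x - x_prev)   # momentum/interpolation
    return x

# Same fixed point as plain prox-grad: x* = (a*b - lam)/a**2 = 1.25 here.
x_star = accel_prox_grad(2.0, 3.0, 1.0)
```

The extra variable `y` is the "extra memory for interpolation" mentioned on the slide; the per-iteration cost is otherwise identical to plain prox-grad.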
21 Proximal splitting methods

$$ \min \ l(x) + f(x) + h(x) $$
Direct use of prox-grad is not easy: it requires computing $\operatorname{prox}_{\lambda(f+h)}$, i.e., $(I + \lambda(\partial f + \partial h))^{-1}$.

Example (1-D total-variation denoising with sparsity):
$$ \min \ \underbrace{\tfrac{1}{2}\|x - y\|_2^2 + \lambda\|x\|_1}_{f(x)} \ + \ \underbrace{\mu \sum_{i=1}^{n-1} |x_{i+1} - x_i|}_{h(x)}. $$
But a good feature: $\operatorname{prox}_f$ and $\operatorname{prox}_h$ are separately easier. Can we exploit that?
22 Proximal splitting: operator notation

If $(I + \partial f + \partial h)^{-1}$ is hard, but $(I + \partial f)^{-1}$ and $(I + \partial h)^{-1}$ are easy, let us derive a fixed-point equation that splits the operators.

Assume we are solving $\min f(x) + h(x)$, where both $f$ and $h$ are convex but potentially nondifferentiable. Notice: we implicitly assumed $\partial(f + h) = \partial f + \partial h$.
23 Proximal splitting

$$ 0 \in \partial f(x) + \partial h(x) $$
$$ 2x \in (I + \partial f)(x) + (I + \partial h)(x) $$
Key idea of splitting: a new variable!
$$ z \in (I + \partial h)(x) \implies x = \operatorname{prox}_h(z) $$
$$ 2x - z \in (I + \partial f)(x) \implies x = (I + \partial f)^{-1}(2x - z) $$
Not a fixed-point equation yet; we need one more idea.
24 Douglas-Rachford splitting

Reflection operator: $R_h(z) := 2\operatorname{prox}_h(z) - z$.

Douglas-Rachford derivation:
$$ z \in (I + \partial h)(x), \quad x = \operatorname{prox}_h(z) \implies R_h(z) = 2x - z $$
$$ 0 \in \partial f(x) + \partial h(x) \implies 2x \in (I + \partial f)(x) + (I + \partial h)(x) \implies 2x - z \in (I + \partial f)(x) $$
$$ \implies x = \operatorname{prox}_f(R_h(z)). $$
But $R_h(z) = 2x - z$ implies $z = 2x - R_h(z)$, so
$$ z = 2\operatorname{prox}_f(R_h(z)) - R_h(z) = R_f(R_h(z)). $$
Finally, $z$ is on both sides of the equation.
25 Douglas-Rachford method

$$ 0 \in \partial f(x) + \partial h(x) \quad \Longleftrightarrow \quad x = \operatorname{prox}_h(z), \ \ z = R_f(R_h(z)). $$
DR method: given $z^0$, iterate for $k \ge 0$:
$$ x^k = \operatorname{prox}_h(z^k) $$
$$ v^k = \operatorname{prox}_f(2x^k - z^k) $$
$$ z^{k+1} = z^k + \gamma_k (v^k - x^k). $$

Theorem. If $f + h$ admits minimizers, and $(\gamma_k)$ satisfies
$$ \gamma_k \in [0, 2], \qquad \sum\nolimits_k \gamma_k (2 - \gamma_k) = \infty, $$
then the DR iterates $v^k$ and $x^k$ converge to a minimizer.
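The three-line DR iteration can be sketched on a toy scalar problem of my choosing, $\min f(x) + h(x)$ with $f(x) = \tfrac{1}{2}(x - a)^2$ and $h(x) = \lambda|x|$, using $\gamma_k = 1$ and a unit prox parameter (both assumptions for the sketch); here $\operatorname{prox}_f(w) = (w + a)/2$ and $\operatorname{prox}_h$ is soft-thresholding:

```python
def douglas_rachford(a, lam, iters=200):
    """Douglas-Rachford (gamma_k = 1) for min 0.5*(x - a)**2 + lam*|x|,
    with unit prox parameter (1-D sketch)."""
    z = 0.0
    for _ in range(iters):
        x = max(abs(z) - lam, 0.0) * (1.0 if z >= 0 else -1.0)  # prox_h(z)
        v = ((2 * x - z) + a) / 2.0                             # prox_f(2x - z)
        z = z + v - x                                           # z-update
    return x

# The minimizer of 0.5*(x - 2)**2 + 0.5*|x| is soft-thresholding of 2 by 0.5,
# i.e. x* = 1.5, and the x-iterates converge to it.
x_star = douglas_rachford(2.0, 0.5)
```

Note that only the individual proxes of $f$ and $h$ are ever evaluated, which is exactly the point of splitting.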
26 Douglas-Rachford method

For $\gamma_k = 1$, we have $z^{k+1} = z^k + v^k - x^k$, i.e.,
$$ z^{k+1} = z^k + \operatorname{prox}_f\big(2\operatorname{prox}_h(z^k) - z^k\big) - \operatorname{prox}_h(z^k). $$
Dropping superscripts and writing $P \equiv \operatorname{prox}$, we have $z \leftarrow Tz$ with
$$ T = I + P_f(2P_h - I) - P_h. $$

Lemma. DR can be written as $z \leftarrow \tfrac{1}{2}(R_f R_h + I)z$, where $R_f$ denotes the reflection operator $2P_f - I$ (similarly $R_h$).
Exercise: Prove this claim.
27 Proximal methods cornucopia

- Douglas-Rachford splitting
- ADMM (a special case of DR applied to the dual)
- Proximal-Dykstra
- Proximal methods for $f_1 + f_2 + \cdots + f_n$
- Peaceman-Rachford
- Proximal quasi-Newton and Newton methods
- Many other variations...
28 Best approximation problem

$$ \min \ \delta_A(x) + \delta_B(x), \quad \text{where } A \cap B = \emptyset. $$
Can we use DR? Using a clever analysis of Bauschke & Combettes (2004), DR can still be applied! However, it generates diverging iterates, which can be projected back to obtain a solution to
$$ \min \ \|a - b\|_2 \quad \text{s.t. } a \in A, \ b \in B. $$
See Jegelka, Bach, Sra (NIPS 2013) for an example.
29 ADMM

Let us see a separable objective with constraints:
$$ \min \ f(x) + g(z) \quad \text{s.t. } Ax + Bz = c. $$
- The objective function separates into $x$ and $z$ variables.
- The constraint prevents a trivial decoupling.

Introduce the augmented Lagrangian (AL):
$$ L_\rho(x, z, y) := f(x) + g(z) + y^T(Ax + Bz - c) + \tfrac{\rho}{2}\|Ax + Bz - c\|_2^2. $$
Now, apply a Gauss-Seidel style update to the AL:
$$ x^{k+1} = \operatorname{argmin}_x \ L_\rho(x, z^k, y^k) $$
$$ z^{k+1} = \operatorname{argmin}_z \ L_\rho(x^{k+1}, z, y^k) $$
$$ y^{k+1} = y^k + \rho\big(Ax^{k+1} + Bz^{k+1} - c\big). $$
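The three ADMM updates can be sketched, in scaled (dual variable divided by $\rho$) form, on a toy consensus instance of my choosing: $\min f(x) + g(z)$ s.t. $x - z = 0$ (so $A = I$, $B = -I$, $c = 0$) with $f(x) = \tfrac{1}{2}(x - a)^2$ and $g(z) = \lambda|z|$; both subproblems then have closed forms:

```python
def admm(a, lam, rho=1.0, iters=100):
    """Scaled-form ADMM for min 0.5*(x - a)**2 + lam*|z|  s.t.  x - z = 0
    (i.e., A = I, B = -I, c = 0); 1-D sketch."""
    x = z = u = 0.0
    for _ in range(iters):
        # x-step: argmin_x 0.5*(x - a)**2 + (rho/2)*(x - z + u)**2
        x = (a + rho * (z - u)) / (1.0 + rho)
        # z-step: argmin_z lam*|z| + (rho/2)*(x - z + u)**2  -> soft-threshold
        w = x + u
        z = max(abs(w) - lam / rho, 0.0) * (1.0 if w >= 0 else -1.0)
        # scaled dual update: u accumulates the constraint residual x - z
        u = u + x - z
    return x, z

# For a=2, lam=0.5 both x and z approach the minimizer 1.5 of the coupled
# problem (soft-thresholding of a by lam), with x - z -> 0.
x_star, z_star = admm(2.0, 0.5)
```

The x- and z-steps touch $f$ and $g$ separately, which is why the constraint-coupled reformulation is worth the extra dual variable.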
More information25. NLP algorithms. ˆ Overview. ˆ Local methods. ˆ Constrained optimization. ˆ Global methods. ˆ Black-box methods.
CS/ECE/ISyE 524 Introduction to Optimization Spring 2017 18 25. NLP algorithms ˆ Overview ˆ Local methods ˆ Constrained optimization ˆ Global methods ˆ Black-box methods ˆ Course wrap-up Laurent Lessard
More informationA Multilevel Proximal Gradient Algorithm for a Class of Composite Optimization Problems
A Multilevel Proximal Gradient Algorithm for a Class of Composite Optimization Problems Panos Parpas May 9, 2017 Abstract Composite optimization models consist of the minimization of the sum of a smooth
More informationEfficient Euclidean Projections in Linear Time
Jun Liu J.LIU@ASU.EDU Jieping Ye JIEPING.YE@ASU.EDU Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA Abstract We consider the problem of computing the Euclidean
More informationCharacterizing Improving Directions Unconstrained Optimization
Final Review IE417 In the Beginning... In the beginning, Weierstrass's theorem said that a continuous function achieves a minimum on a compact set. Using this, we showed that for a convex set S and y not
More informationConvex Optimization: from Real-Time Embedded to Large-Scale Distributed
Convex Optimization: from Real-Time Embedded to Large-Scale Distributed Stephen Boyd Neal Parikh, Eric Chu, Yang Wang, Jacob Mattingley Electrical Engineering Department, Stanford University Springer Lectures,
More informationConvexity: an introduction
Convexity: an introduction Geir Dahl CMA, Dept. of Mathematics and Dept. of Informatics University of Oslo 1 / 74 1. Introduction 1. Introduction what is convexity where does it arise main concepts and
More informationAugmented Lagrangian Methods
Augmented Lagrangian Methods Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. IMA, August 2016 Stephen Wright (UW-Madison) Augmented Lagrangian IMA, August 2016 1 /
More informationNonsmooth Optimization for Statistical Learning with Structured Matrix Regularization
Nonsmooth Optimization for Statistical Learning with Structured Matrix Regularization PhD defense of Federico Pierucci Thesis supervised by Prof. Zaid Harchaoui Prof. Anatoli Juditsky Dr. Jérôme Malick
More informationACCELERATED DUAL GRADIENT-BASED METHODS FOR TOTAL VARIATION IMAGE DENOISING/DEBLURRING PROBLEMS. Donghwan Kim and Jeffrey A.
ACCELERATED DUAL GRADIENT-BASED METHODS FOR TOTAL VARIATION IMAGE DENOISING/DEBLURRING PROBLEMS Donghwan Kim and Jeffrey A. Fessler University of Michigan Dept. of Electrical Engineering and Computer Science
More informationIteratively Re-weighted Least Squares for Sums of Convex Functions
Iteratively Re-weighted Least Squares for Sums of Convex Functions James Burke University of Washington Jiashan Wang LinkedIn Frank Curtis Lehigh University Hao Wang Shanghai Tech University Daiwei He
More informationIntroduction to Convex Optimization
Introduction to Convex Optimization Lieven Vandenberghe Electrical Engineering Department, UCLA IPAM Graduate Summer School: Computer Vision August 5, 2013 Convex optimization problem minimize f 0 (x)
More informationAugmented Lagrangian Methods
Augmented Lagrangian Methods Mário A. T. Figueiredo 1 and Stephen J. Wright 2 1 Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal 2 Computer Sciences Department, University of
More informationSection 5 Convex Optimisation 1. W. Dai (IC) EE4.66 Data Proc. Convex Optimisation page 5-1
Section 5 Convex Optimisation 1 W. Dai (IC) EE4.66 Data Proc. Convex Optimisation 1 2018 page 5-1 Convex Combination Denition 5.1 A convex combination is a linear combination of points where all coecients
More informationConvex Optimization CMU-10725
Convex Optimization CMU-10725 Conjugate Direction Methods Barnabás Póczos & Ryan Tibshirani Conjugate Direction Methods 2 Books to Read David G. Luenberger, Yinyu Ye: Linear and Nonlinear Programming Nesterov:
More informationConic Optimization via Operator Splitting and Homogeneous Self-Dual Embedding
Conic Optimization via Operator Splitting and Homogeneous Self-Dual Embedding B. O Donoghue E. Chu N. Parikh S. Boyd Convex Optimization and Beyond, Edinburgh, 11/6/2104 1 Outline Cone programming Homogeneous
More informationA primal-dual framework for mixtures of regularizers
A primal-dual framework for mixtures of regularizers Baran Gözcü baran.goezcue@epfl.ch Laboratory for Information and Inference Systems (LIONS) École Polytechnique Fédérale de Lausanne (EPFL) Switzerland
More informationEE8231 Optimization Theory Course Project
EE8231 Optimization Theory Course Project Qian Zhao (zhaox331@umn.edu) May 15, 2014 Contents 1 Introduction 2 2 Sparse 2-class Logistic Regression 2 2.1 Problem formulation....................................
More informationThe Benefit of Tree Sparsity in Accelerated MRI
The Benefit of Tree Sparsity in Accelerated MRI Chen Chen and Junzhou Huang Department of Computer Science and Engineering, The University of Texas at Arlington, TX, USA 76019 Abstract. The wavelet coefficients
More informationMining Sparse Representations: Theory, Algorithms, and Applications
Mining Sparse Representations: Theory, Algorithms, and Applications Jun Liu, Shuiwang Ji, and Jieping Ye Computer Science and Engineering The Biodesign Institute Arizona State University What is Sparsity?
More informationProgramming, numerics and optimization
Programming, numerics and optimization Lecture C-4: Constrained optimization Łukasz Jankowski ljank@ippt.pan.pl Institute of Fundamental Technological Research Room 4.32, Phone +22.8261281 ext. 428 June
More informationMathematical Programming and Research Methods (Part II)
Mathematical Programming and Research Methods (Part II) 4. Convexity and Optimization Massimiliano Pontil (based on previous lecture by Andreas Argyriou) 1 Today s Plan Convex sets and functions Types
More informationIE521 Convex Optimization Introduction
IE521 Convex Optimization Introduction Instructor: Niao He Jan 18, 2017 1 About Me Assistant Professor, UIUC, 2016 Ph.D. in Operations Research, M.S. in Computational Sci. & Eng. Georgia Tech, 2010 2015
More informationFast Newton methods for the group fused lasso
Fast Newton methods for the group fused lasso Matt Wytock Machine Learning Dept. Carnegie Mellon University Pittsburgh, PA Suvrit Sra Machine Learning Dept. Carnegie Mellon University Pittsburgh, PA J.
More information15. Cutting plane and ellipsoid methods
EE 546, Univ of Washington, Spring 2012 15. Cutting plane and ellipsoid methods localization methods cutting-plane oracle examples of cutting plane methods ellipsoid method convergence proof inequality
More informationDavid G. Luenberger Yinyu Ye. Linear and Nonlinear. Programming. Fourth Edition. ö Springer
David G. Luenberger Yinyu Ye Linear and Nonlinear Programming Fourth Edition ö Springer Contents 1 Introduction 1 1.1 Optimization 1 1.2 Types of Problems 2 1.3 Size of Problems 5 1.4 Iterative Algorithms
More informationA C 2 Four-Point Subdivision Scheme with Fourth Order Accuracy and its Extensions
A C 2 Four-Point Subdivision Scheme with Fourth Order Accuracy and its Extensions Nira Dyn School of Mathematical Sciences Tel Aviv University Michael S. Floater Department of Informatics University of
More informationME 555: Distributed Optimization
ME 555: Distributed Optimization Duke University Spring 2015 1 Administrative Course: ME 555: Distributed Optimization (Spring 2015) Instructor: Time: Location: Office hours: Website: Soomin Lee (email:
More informationSolution Methods Numerical Algorithms
Solution Methods Numerical Algorithms Evelien van der Hurk DTU Managment Engineering Class Exercises From Last Time 2 DTU Management Engineering 42111: Static and Dynamic Optimization (6) 09/10/2017 Class
More informationA More Efficient Approach to Large Scale Matrix Completion Problems
A More Efficient Approach to Large Scale Matrix Completion Problems Matthew Olson August 25, 2014 Abstract This paper investigates a scalable optimization procedure to the low-rank matrix completion problem
More informationConvex Optimization and Machine Learning
Convex Optimization and Machine Learning Mengliu Zhao Machine Learning Reading Group School of Computing Science Simon Fraser University March 12, 2014 Mengliu Zhao SFU-MLRG March 12, 2014 1 / 25 Introduction
More informationSYSTEMS OF NONLINEAR EQUATIONS
SYSTEMS OF NONLINEAR EQUATIONS Widely used in the mathematical modeling of real world phenomena. We introduce some numerical methods for their solution. For better intuition, we examine systems of two
More informationIncremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Survey. Chapter 4 : Optimization for Machine Learning
Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Survey Chapter 4 : Optimization for Machine Learning Summary of Chapter 2 Chapter 2: Convex Optimization with Sparsity
More informationEfficient MR Image Reconstruction for Compressed MR Imaging
Efficient MR Image Reconstruction for Compressed MR Imaging Junzhou Huang, Shaoting Zhang, and Dimitris Metaxas Division of Computer and Information Sciences, Rutgers University, NJ, USA 08854 Abstract.
More information15.082J and 6.855J. Lagrangian Relaxation 2 Algorithms Application to LPs
15.082J and 6.855J Lagrangian Relaxation 2 Algorithms Application to LPs 1 The Constrained Shortest Path Problem (1,10) 2 (1,1) 4 (2,3) (1,7) 1 (10,3) (1,2) (10,1) (5,7) 3 (12,3) 5 (2,2) 6 Find the shortest
More informationCalifornia Institute of Technology Crash-Course on Convex Optimization Fall Ec 133 Guilherme Freitas
California Institute of Technology HSS Division Crash-Course on Convex Optimization Fall 2011-12 Ec 133 Guilherme Freitas In this text, we will study the following basic problem: maximize x C f(x) subject
More informationHomotopy Smoothing for Non-Smooth Problems with Lower Complexity than O(1/ɛ)
Homotopy Smoothing for Non-Smooth Problems with Lower Complexity than O(1/ɛ) Yi Xu, Yan Yan, Qihang Lin, Tianbao Yang Department of Computer Science, University of Iowa, Iowa City, IA 52242 QCIS, University
More informationIE598 Big Data Optimization Summary Nonconvex Optimization
IE598 Big Data Optimization Summary Nonconvex Optimization Instructor: Niao He April 16, 2018 1 This Course Big Data Optimization Explore modern optimization theories, algorithms, and big data applications
More informationLecture 15: Log Barrier Method
10-725/36-725: Convex Optimization Spring 2015 Lecturer: Ryan Tibshirani Lecture 15: Log Barrier Method Scribes: Pradeep Dasigi, Mohammad Gowayyed Note: LaTeX template courtesy of UC Berkeley EECS dept.
More informationSpicyMKL Efficient multiple kernel learning method using dual augmented Lagrangian
SpicyMKL Efficient multiple kernel learning method using dual augmented Lagrangian Taiji Suzuki Ryota Tomioka The University of Tokyo Graduate School of Information Science and Technology Department of
More informationLagrangian Relaxation: An overview
Discrete Math for Bioinformatics WS 11/12:, by A. Bockmayr/K. Reinert, 22. Januar 2013, 13:27 4001 Lagrangian Relaxation: An overview Sources for this lecture: D. Bertsimas and J. Tsitsiklis: Introduction
More informationContents. I Basics 1. Copyright by SIAM. Unauthorized reproduction of this article is prohibited.
page v Preface xiii I Basics 1 1 Optimization Models 3 1.1 Introduction... 3 1.2 Optimization: An Informal Introduction... 4 1.3 Linear Equations... 7 1.4 Linear Optimization... 10 Exercises... 12 1.5
More informationComputational Methods. Constrained Optimization
Computational Methods Constrained Optimization Manfred Huber 2010 1 Constrained Optimization Unconstrained Optimization finds a minimum of a function under the assumption that the parameters can take on
More informationProjection onto the probability simplex: An efficient algorithm with a simple proof, and an application
Proection onto the probability simplex: An efficient algorithm with a simple proof, and an application Weiran Wang Miguel Á. Carreira-Perpiñán Electrical Engineering and Computer Science, University of
More informationA C 2 Four-Point Subdivision Scheme with Fourth Order Accuracy and its Extensions
A C 2 Four-Point Subdivision Scheme with Fourth Order Accuracy and its Extensions Nira Dyn Michael S. Floater Kai Hormann Abstract. We present a new four-point subdivision scheme that generates C 2 curves.
More informationLower bounds on the barrier parameter of convex cones
of convex cones Université Grenoble 1 / CNRS June 20, 2012 / High Performance Optimization 2012, Delft Outline Logarithmically homogeneous barriers 1 Logarithmically homogeneous barriers Conic optimization
More informationMath 5593 Linear Programming Lecture Notes
Math 5593 Linear Programming Lecture Notes Unit II: Theory & Foundations (Convex Analysis) University of Colorado Denver, Fall 2013 Topics 1 Convex Sets 1 1.1 Basic Properties (Luenberger-Ye Appendix B.1).........................
More informationOn Efficient Optimal Transport: An Analysis of Greedy and Accelerated Mirror Descent Algorithms
On Efficient Optimal Transport: An Analysis of Greedy and Accelerated Mirror Descent Algorithms Tianyi Lin, Nhat Ho, Michael I Jordan, Department of Electrical Engineering and Computer Sciences Department
More informationRevisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. Author: Martin Jaggi Presenter: Zhongxing Peng
Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization Author: Martin Jaggi Presenter: Zhongxing Peng Outline 1. Theoretical Results 2. Applications Outline 1. Theoretical Results 2. Applications
More informationLecture 2 - Introduction to Polytopes
Lecture 2 - Introduction to Polytopes Optimization and Approximation - ENS M1 Nicolas Bousquet 1 Reminder of Linear Algebra definitions Let x 1,..., x m be points in R n and λ 1,..., λ m be real numbers.
More informationTwo-phase matrix splitting methods for asymmetric and symmetric LCP
Two-phase matrix splitting methods for asymmetric and symmetric LCP Daniel P. Robinson Department of Applied Mathematics and Statistics Johns Hopkins University Joint work with Feng, Nocedal, and Pang
More informationConvex Optimization over Intersection of Simple Sets: improved Convergence Rate Guarantees via an Exact Penalty Approach
Convex Optimization over Intersection of Simple Sets: improved Convergence Rate Guarantees via an Exact Penalty Approach Achintya Kundu Francis Bach Chiranjib Bhattacharyya Department of CSA INRIA - École
More informationConvex or non-convex: which is better?
Sparsity Amplified Ivan Selesnick Electrical and Computer Engineering Tandon School of Engineering New York University Brooklyn, New York March 7 / 4 Convex or non-convex: which is better? (for sparse-regularized
More informationConvexization in Markov Chain Monte Carlo
in Markov Chain Monte Carlo 1 IBM T. J. Watson Yorktown Heights, NY 2 Department of Aerospace Engineering Technion, Israel August 23, 2011 Problem Statement MCMC processes in general are governed by non
More informationApplied Lagrange Duality for Constrained Optimization
Applied Lagrange Duality for Constrained Optimization Robert M. Freund February 10, 2004 c 2004 Massachusetts Institute of Technology. 1 1 Overview The Practical Importance of Duality Review of Convexity
More informationIntroduction to Optimization Problems and Methods
Introduction to Optimization Problems and Methods wjch@umich.edu December 10, 2009 Outline 1 Linear Optimization Problem Simplex Method 2 3 Cutting Plane Method 4 Discrete Dynamic Programming Problem Simplex
More informationOptimization for Machine Learning
with a focus on proximal gradient descent algorithm Department of Computer Science and Engineering Outline 1 History & Trends 2 Proximal Gradient Descent 3 Three Applications A Brief History A. Convex
More informationSupplemental Material for Efficient MR Image Reconstruction for Compressed MR Imaging
Supplemental Material for Efficient MR Image Reconstruction for Compressed MR Imaging Paper ID: 1195 No Institute Given 1 More Experiment Results 1.1 Visual Comparisons We apply all methods on four 2D
More informationINTRODUCTION TO LINEAR AND NONLINEAR PROGRAMMING
INTRODUCTION TO LINEAR AND NONLINEAR PROGRAMMING DAVID G. LUENBERGER Stanford University TT ADDISON-WESLEY PUBLISHING COMPANY Reading, Massachusetts Menlo Park, California London Don Mills, Ontario CONTENTS
More informationSequential Coordinate-wise Algorithm for Non-negative Least Squares Problem
CENTER FOR MACHINE PERCEPTION CZECH TECHNICAL UNIVERSITY Sequential Coordinate-wise Algorithm for Non-negative Least Squares Problem Woring document of the EU project COSPAL IST-004176 Vojtěch Franc, Miro
More informationCMU-Q Lecture 9: Optimization II: Constrained,Unconstrained Optimization Convex optimization. Teacher: Gianni A. Di Caro
CMU-Q 15-381 Lecture 9: Optimization II: Constrained,Unconstrained Optimization Convex optimization Teacher: Gianni A. Di Caro GLOBAL FUNCTION OPTIMIZATION Find the global maximum of the function f x (and
More informationLecture 19: November 5
0-725/36-725: Convex Optimization Fall 205 Lecturer: Ryan Tibshirani Lecture 9: November 5 Scribes: Hyun Ah Song Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have not
More informationLearning with infinitely many features
Learning with infinitely many features R. Flamary, Joint work with A. Rakotomamonjy F. Yger, M. Volpi, M. Dalla Mura, D. Tuia Laboratoire Lagrange, Université de Nice Sophia Antipolis December 2012 Example
More informationAlternating Direction Method of Multipliers
Alternating Direction Method of Multipliers CS 584: Big Data Analytics Material adapted from Stephen Boyd (https://web.stanford.edu/~boyd/papers/pdf/admm_slides.pdf) & Ryan Tibshirani (http://stat.cmu.edu/~ryantibs/convexopt/lectures/21-dual-meth.pdf)
More informationIntroduction to Modern Control Systems
Introduction to Modern Control Systems Convex Optimization, Duality and Linear Matrix Inequalities Kostas Margellos University of Oxford AIMS CDT 2016-17 Introduction to Modern Control Systems November
More informationAVERAGING RANDOM PROJECTION: A FAST ONLINE SOLUTION FOR LARGE-SCALE CONSTRAINED STOCHASTIC OPTIMIZATION. Jialin Liu, Yuantao Gu, and Mengdi Wang
AVERAGING RANDOM PROJECTION: A FAST ONLINE SOLUTION FOR LARGE-SCALE CONSTRAINED STOCHASTIC OPTIMIZATION Jialin Liu, Yuantao Gu, and Mengdi Wang Tsinghua National Laboratory for Information Science and
More informationA fast algorithm for sparse reconstruction based on shrinkage, subspace optimization and continuation [Wen,Yin,Goldfarb,Zhang 2009]
A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization and continuation [Wen,Yin,Goldfarb,Zhang 2009] Yongjia Song University of Wisconsin-Madison April 22, 2010 Yongjia Song
More informationNesterov's Smoothing Technique and Minimizing Differences of Convex Functions for Hierarchical Clustering
Portland State University PDXScholar Mathematics and Statistics Faculty Publications and Presentations Fariborz Maseeh Department of Mathematics and Statistics 3-017 Nesterov's Smoothing Technique and
More informationA Multilevel Proximal Gradient Algorithm for Large Scale Optimization
A Multilevel Proximal Gradient Algorithm for Large Scale Optimization Panos Parpas June 30, 2016 Abstract Composite optimization models consist of the minimization of the sum of a smooth not necessarily
More informationEXTREME POINTS AND AFFINE EQUIVALENCE
EXTREME POINTS AND AFFINE EQUIVALENCE The purpose of this note is to use the notions of extreme points and affine transformations which are studied in the file affine-convex.pdf to prove that certain standard
More informationLagrangian methods for the regularization of discrete ill-posed problems. G. Landi
Lagrangian methods for the regularization of discrete ill-posed problems G. Landi Abstract In many science and engineering applications, the discretization of linear illposed problems gives rise to large
More informationDistributed Delayed Proximal Gradient Methods
Distributed Delayed Proximal Gradient Methods Mu Li, David G. Andersen and Alexander Smola, Carnegie Mellon University Google Strategic Technologies Abstract We analyze distributed optimization algorithms
More informationConvexity I: Sets and Functions
Convexity I: Sets and Functions Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 See supplements for reviews of basic real analysis basic multivariate calculus basic
More information