Optimization for Machine Learning


1 Optimization for Machine Learning (Problems; Algorithms - C)
Suvrit Sra, Massachusetts Institute of Technology
PKU Summer School on Data Science (July 2017)

2 Course materials
Some references:
- Introductory Lectures on Convex Optimization (Nesterov)
- Convex Optimization (Boyd & Vandenberghe)
- Nonlinear Programming (Bertsekas)
- Convex Analysis (Rockafellar)
- Fundamentals of Convex Analysis (Hiriart-Urruty, Lemaréchal)
- Lectures on Modern Convex Optimization (Nemirovski)
- Optimization for Machine Learning (Sra, Nowozin, Wright)
- Theory of Convex Optimization for Machine Learning (Bubeck)
- NIPS 2016 Optimization Tutorial (Bach, Sra)
Some related courses: EE227A, Spring 2013 (Sra, UC Berkeley); Spring 2014 (Sra, CMU); EE364a,b (Boyd, Stanford); EE236b,c (Vandenberghe, UCLA)
Venues: NIPS, ICML, UAI, AISTATS, SIOPT, Math. Prog.

3 Lecture Plan
- Introduction (3 lectures)
- Problems and algorithms (5 lectures)
- Non-convex optimization, perspectives (2 lectures)

4 Nonsmooth convergence rates
Theorem (Nesterov). Let $B = \{x : \|x - x_0\|_2 \le D\}$ and assume $x^* \in B$. There exists a convex function $f \in C^0_L(B)$ (with $L > 0$) such that for $0 \le k \le n-1$, the lower bound
$f(x_k) - f(x^*) \ge \frac{L D}{2(1 + \sqrt{k+1})}$
holds for any algorithm that generates $x_k$ by linearly combining the previous iterates and subgradients.
Exercise: So design problems where we can do better!

5 Composite problems

6-9 Composite objectives
Frequently ML problems take the regularized form
$\text{minimize } f(x) := l(x) + r(x)$
Example: $l(x) = \tfrac{1}{2}\|Ax - b\|_2^2$ and $r(x) = \lambda\|x\|_1$ (Lasso, L1-LS, compressed sensing)
Example: $l(x)$ is the logistic loss and $r(x) = \lambda\|x\|_1$ (L1-logistic regression, sparse LR)
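
To make the composite form concrete, here is a minimal Python sketch (an illustration added here, not from the slides) of the Lasso objective with the smooth loss $l$ and the nonsmooth regularizer $r$ kept separate; the function name and the random test data are ours.

```python
import numpy as np

def lasso_objective(A, b, lam, x):
    """Composite objective f(x) = l(x) + r(x) for the Lasso."""
    l = 0.5 * np.linalg.norm(A @ x - b) ** 2   # smooth loss l(x)
    r = lam * np.linalg.norm(x, 1)             # nonsmooth regularizer r(x)
    return l + r

# tiny usage example with random data
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
x = rng.standard_normal(5)
print(lasso_objective(A, b, lam=0.1, x=x))
```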

10-12 Composite objective minimization
$\text{minimize } f(x) := l(x) + r(x)$
Subgradient method: $x_{k+1} = x_k - \alpha_k g_k$, with $g_k \in \partial f(x_k)$.
The subgradient method converges slowly, at rate $O(1/\sqrt{k})$.
But f is smooth plus nonsmooth: we should exploit the smoothness of l for a better method!
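
A hedged sketch of one subgradient step for the Lasso objective above (an illustration, not from the slides; taking sign(0) = 0 gives a valid subgradient, and the slow $O(1/\sqrt{k})$ rate corresponds to a diminishing step size such as $\alpha_k \propto 1/\sqrt{k}$):

```python
import numpy as np

def lasso_subgradient(A, b, lam, x):
    """A subgradient g in the subdifferential of 0.5*||Ax-b||^2 + lam*||x||_1."""
    return A.T @ (A @ x - b) + lam * np.sign(x)  # np.sign(0) = 0 is a valid choice

def subgradient_step(A, b, lam, x, alpha):
    """One step of the subgradient method: x <- x - alpha * g."""
    return x - alpha * lasso_subgradient(A, b, lam, x)
```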

13-15 Proximal Gradient Method
For $\min_{x \in \mathcal{X}} f(x)$: projected (sub)gradient $x \leftarrow P_{\mathcal{X}}(x - \alpha \nabla f(x))$.
For $\min_x f(x) + h(x)$: proximal gradient $x \leftarrow \operatorname{prox}_{\alpha h}(x - \alpha \nabla f(x))$, where $\operatorname{prox}_{\alpha h}$ denotes the proximity operator of h.
Why? If we can compute $\operatorname{prox}_h(x)$ easily, prox-grad converges as fast as gradient methods for smooth problems!

16-17 Proximity operator
Projection: $P_{\mathcal{X}}(y) := \operatorname{argmin}_{x \in \mathbb{R}^n} \tfrac{1}{2}\|x - y\|_2^2 + \mathbb{1}_{\mathcal{X}}(x)$, where $\mathbb{1}_{\mathcal{X}}$ is the indicator function of $\mathcal{X}$.
Proximity: replace $\mathbb{1}_{\mathcal{X}}$ by a closed convex function r:
$\operatorname{prox}_r(y) := \operatorname{argmin}_{x \in \mathbb{R}^n} \tfrac{1}{2}\|x - y\|_2^2 + r(x)$
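
As a tiny worked instance of this definition (an illustration, not from the slides): for $r(x) = \tfrac{\mu}{2}\|x\|_2^2$ the proximity operator has the closed form $\operatorname{prox}_r(y) = y/(1+\mu)$, which the sketch below checks via the first-order optimality condition.

```python
import numpy as np

def prox_squared_l2(y, mu):
    """prox of r(x) = (mu/2)*||x||^2: the minimizer of 0.5*||x-y||^2 + (mu/2)*||x||^2."""
    return y / (1.0 + mu)

rng = np.random.default_rng(0)
y, mu = rng.standard_normal(5), 0.7
xhat = prox_squared_l2(y, mu)
# first-order optimality: (xhat - y) + mu * xhat = 0
assert np.allclose((xhat - y) + mu * xhat, 0.0)
```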

18 Proximity operator (figure: the $\ell_1$-norm ball of radius $\rho(\lambda)$)

19 Proximity operators
Exercise: Let $r(x) = \|x\|_1$. Solve $\operatorname{prox}_{\lambda r}(y)$:
$\min_{x \in \mathbb{R}^n} \tfrac{1}{2}\|x - y\|_2^2 + \lambda\|x\|_1.$
Hint 1: The above problem decomposes into n independent scalar subproblems of the form $\min_{x \in \mathbb{R}} \tfrac{1}{2}(x - y)^2 + \lambda|x|$.
Hint 2: Consider the two cases: either $x = 0$ or $x \neq 0$.
Exercise: Moreau decomposition $y = \operatorname{prox}_h(y) + \operatorname{prox}_{h^*}(y)$ (notice the analogy to $V = S \oplus S^\perp$ in linear algebra).
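
A hedged solution sketch for these exercises (an illustration: the prox of $\lambda\|\cdot\|_1$ is coordinate-wise soft-thresholding, and since the conjugate of $h = \lambda\|\cdot\|_1$ is the indicator of the $\ell_\infty$-ball of radius $\lambda$, $\operatorname{prox}_{h^*}$ is clipping, so the Moreau decomposition can be checked numerically):

```python
import numpy as np

def soft_threshold(y, lam):
    """prox_{lam*||.||_1}(y): coordinate-wise soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def clip_box(y, lam):
    """prox of h*(z) = indicator{||z||_inf <= lam}: projection onto [-lam, lam]^n."""
    return np.clip(y, -lam, lam)

# numerical check of the Moreau decomposition y = prox_h(y) + prox_{h*}(y)
rng = np.random.default_rng(0)
y, lam = rng.standard_normal(8), 0.5
assert np.allclose(y, soft_threshold(y, lam) + clip_box(y, lam))
```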

20-27 How to cook-up prox-grad?
Lemma. For any $\alpha > 0$, $x^*$ is optimal for $\min_x f(x) + h(x)$ if and only if $x^* = \operatorname{prox}_{\alpha h}(x^* - \alpha \nabla f(x^*))$:
$0 \in \nabla f(x^*) + \partial h(x^*)$
$\Leftrightarrow\ 0 \in \alpha \nabla f(x^*) + \alpha \partial h(x^*)$
$\Leftrightarrow\ x^* \in \alpha \nabla f(x^*) + (I + \alpha \partial h)(x^*)$
$\Leftrightarrow\ x^* - \alpha \nabla f(x^*) \in (I + \alpha \partial h)(x^*)$
$\Leftrightarrow\ x^* = (I + \alpha \partial h)^{-1}(x^* - \alpha \nabla f(x^*))$
$\Leftrightarrow\ x^* = \operatorname{prox}_{\alpha h}(x^* - \alpha \nabla f(x^*))$
The above fixed-point equation suggests the iteration
$x_{k+1} = \operatorname{prox}_{\alpha_k h}(x_k - \alpha_k \nabla f(x_k))$
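
The fixed-point iteration above is the proximal gradient method; a minimal sketch for the Lasso (an illustration assuming a constant step $\alpha = 1/L$ with $L = \|A\|_2^2$ and the soft-thresholding prox from the earlier exercise; names and iteration count are ours):

```python
import numpy as np

def soft_threshold(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def proximal_gradient_lasso(A, b, lam, num_iters=500):
    """x_{k+1} = prox_{alpha*h}(x_k - alpha*grad f(x_k)), f = 0.5||Ax-b||^2, h = lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad f
    alpha = 1.0 / L                        # fixed step size
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)                           # gradient of the smooth part
        x = soft_threshold(x - alpha * grad, alpha * lam)  # prox_{alpha*h} of the gradient step
    return x
```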

28 Convergence

29-31 Proximal-gradient works, why?
$x_{k+1} = \operatorname{prox}_{\alpha_k h}(x_k - \alpha_k \nabla f(x_k))$
$x_{k+1} = x_k - \alpha_k G_{\alpha_k}(x_k)$
Gradient mapping: the gradient-like object
$G_\alpha(x) = \frac{1}{\alpha}\bigl(x - \operatorname{prox}_{\alpha h}(x - \alpha \nabla f(x))\bigr)$
Our lemma shows: $G_\alpha(x) = 0$ if and only if x is optimal. So $G_\alpha$ is analogous to $\nabla f$. If x is locally optimal, then $G_\alpha(x) = 0$ (nonconvex f).
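
A small sketch of the gradient mapping used as a stopping criterion (an illustration, not from the slides; the callable convention for grad_f and prox_h is ours):

```python
def gradient_mapping(x, alpha, grad_f, prox_h):
    """G_alpha(x) = (1/alpha) * (x - prox_{alpha*h}(x - alpha*grad_f(x))).

    grad_f(x) returns the gradient of the smooth part; prox_h(v, t) returns
    prox_{t*h}(v). Both are passed as callables (a convention chosen here).
    """
    return (x - prox_h(x - alpha * grad_f(x), alpha)) / alpha

# At an optimum x*, gradient_mapping(x*, ...) is zero, so its norm is a natural
# stopping criterion for the prox-grad iteration x <- x - alpha * G_alpha(x).
```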

32-35 Convergence analysis
Assumption: Lipschitz continuous gradient, denoted $f \in C^1_L$:
$\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2$
Gradient vectors of close-by points are close to each other; the objective function has bounded curvature; the speed at which the gradient varies is bounded.
Lemma (Descent). Let $f \in C^1_L$. Then,
$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|_2^2$
For convex f, compare with $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle$.
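
A quick numerical illustration of the descent lemma (not from the slides): for $f(x) = \tfrac{1}{2}\|Ax - b\|_2^2$ the gradient is $A^T(Ax - b)$ and $L = \|A\|_2^2$ is a valid Lipschitz constant, so the quadratic upper bound should hold at arbitrary pairs (x, y); the random data below is ours.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of grad f

for _ in range(100):
    x, y = rng.standard_normal(10), rng.standard_normal(10)
    upper = f(x) + grad(x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x) ** 2
    assert f(y) <= upper + 1e-9  # descent lemma: f(y) <= quadratic model at x
```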

36-42 Descent lemma: proof
Proof. Since $f \in C^1_L$, by Taylor's theorem, for the vector $z_t = x + t(y - x)$ we have
$f(y) = f(x) + \int_0^1 \langle \nabla f(z_t), y - x \rangle \, dt.$
Add and subtract $\langle \nabla f(x), y - x \rangle$ on the rhs to obtain
$f(y) - f(x) - \langle \nabla f(x), y - x \rangle = \int_0^1 \langle \nabla f(z_t) - \nabla f(x), y - x \rangle \, dt$
$\bigl| f(y) - f(x) - \langle \nabla f(x), y - x \rangle \bigr| \le \Bigl| \int_0^1 \langle \nabla f(z_t) - \nabla f(x), y - x \rangle \, dt \Bigr|$
$\le \int_0^1 \|\nabla f(z_t) - \nabla f(x)\|_2 \, \|y - x\|_2 \, dt \le L \int_0^1 t \, \|x - y\|_2^2 \, dt = \frac{L}{2}\|x - y\|_2^2.$
This bounds f(y) around x with quadratic functions.

43-46 Descent lemma corollary
Let $y = x - \alpha G_\alpha(x)$. Then
$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|_2^2$
$f(y) \le f(x) - \alpha \langle \nabla f(x), G_\alpha(x) \rangle + \frac{\alpha^2 L}{2}\|G_\alpha(x)\|_2^2.$
Corollary. So if $0 \le \alpha \le 1/L$, we have
$f(y) \le f(x) - \alpha \langle \nabla f(x), G_\alpha(x) \rangle + \frac{\alpha}{2}\|G_\alpha(x)\|_2^2.$
Lemma. Let $y = x - \alpha G_\alpha(x)$. Then, for any z we have
$f(y) + h(y) \le f(z) + h(z) + \langle G_\alpha(x), x - z \rangle - \frac{\alpha}{2}\|G_\alpha(x)\|_2^2.$
Exercise: Prove! (Hint: f, h are convex and $G_\alpha(x) - \nabla f(x) \in \partial h(y)$.)

47-50 Convergence analysis
We've actually shown that $x^+ = x - \alpha G_\alpha(x)$ is a descent method. Write $\phi = f + h$; plug in $z = x$ to obtain
$\phi(x^+) \le \phi(x) - \frac{\alpha}{2}\|G_\alpha(x)\|_2^2.$
Exercise: Why does this inequality suffice to show convergence?
Use $z = x^*$ above to obtain progress in terms of iterates:
$\phi(x^+) - \phi^* \le \langle G_\alpha(x), x - x^* \rangle - \frac{\alpha}{2}\|G_\alpha(x)\|_2^2$
$= \frac{1}{2\alpha}\bigl[ 2\alpha \langle G_\alpha(x), x - x^* \rangle - \alpha^2 \|G_\alpha(x)\|_2^2 \bigr]$
$= \frac{1}{2\alpha}\bigl[ \|x - x^*\|_2^2 - \|x - x^* - \alpha G_\alpha(x)\|_2^2 \bigr]$
$= \frac{1}{2\alpha}\bigl[ \|x - x^*\|_2^2 - \|x^+ - x^*\|_2^2 \bigr].$

51-55 Convergence rate
Set $x \leftarrow x_i$, $x^+ \leftarrow x_{i+1}$, and $\alpha = 1/L$, and add up over $i = 0, \dots, k$:
$\sum_{i=0}^{k} (\phi(x_{i+1}) - \phi^*) \le \frac{L}{2} \sum_{i=0}^{k} \bigl[ \|x_i - x^*\|_2^2 - \|x_{i+1} - x^*\|_2^2 \bigr] = \frac{L}{2} \bigl[ \|x_0 - x^*\|_2^2 - \|x_{k+1} - x^*\|_2^2 \bigr] \le \frac{L}{2} \|x_0 - x^*\|_2^2.$
Since $(\phi(x_k))$ is a decreasing sequence, it follows that
$\phi(x_{k+1}) - \phi^* \le \frac{1}{k+1} \sum_{i=0}^{k} (\phi(x_{i+1}) - \phi^*) \le \frac{L}{2(k+1)} \|x_0 - x^*\|_2^2.$
This is the well-known $O(1/k)$ rate. But for $C^1_L$ convex functions, the optimal rate is $O(1/k^2)$!

56 Accelerated Proximal Gradient
$\min \phi(x) = f(x) + h(x)$
Let $x^0 = y^0 \in \operatorname{dom} h$. For $k \ge 1$:
$x^k = \operatorname{prox}_{\alpha_k h}(y^{k-1} - \alpha_k \nabla f(y^{k-1}))$
$y^k = x^k + \frac{k-1}{k+2}(x^k - x^{k-1}).$
Framework due to Nesterov (1983, 2004); also Beck, Teboulle (2009). Simplified analysis: Tseng (2008).
Uses extra memory for interpolation; same computational cost as ordinary prox-grad; convergence rate theoretically optimal:
$\phi(x^k) - \phi^* \le \frac{2L}{(k+1)^2}\|x^0 - x^*\|_2^2.$
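
A sketch of the accelerated proximal gradient iteration for the Lasso (an illustration with $\alpha_k = 1/L$ fixed; the $(k-1)/(k+2)$ interpolation weight is the one on the slide, everything else, names included, is ours):

```python
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def accelerated_prox_grad_lasso(A, b, lam, num_iters=500):
    """x^k = prox_{alpha*h}(y^{k-1} - alpha*grad f(y^{k-1})), y^k = x^k + (k-1)/(k+2)*(x^k - x^{k-1})."""
    L = np.linalg.norm(A, 2) ** 2
    alpha = 1.0 / L
    x_prev = np.zeros(A.shape[1])
    y = x_prev.copy()
    for k in range(1, num_iters + 1):
        grad = A.T @ (A @ y - b)
        x = soft_threshold(y - alpha * grad, alpha * lam)
        y = x + (k - 1.0) / (k + 2.0) * (x - x_prev)   # momentum / interpolation step
        x_prev = x
    return x
```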

57-59 Proximal splitting methods
$l(x) + f(x) + h(x)$
Direct use of prox-grad is not easy: it requires computing $\operatorname{prox}_{\lambda(f+h)}$, i.e., $(I + \lambda(\partial f + \partial h))^{-1}$.
Example: $\min\ \tfrac{1}{2}\|x - y\|_2^2 + \underbrace{\lambda \|x\|_1}_{f(x)} + \underbrace{\mu \sum_{i=1}^{n-1} |x_{i+1} - x_i|}_{h(x)}$
But a good feature: $\operatorname{prox}_f$ and $\operatorname{prox}_h$ are separately easier. Can we exploit that?

60-62 Proximal splitting: operator notation
If $(I + \partial f + \partial h)^{-1}$ is hard, but $(I + \partial f)^{-1}$ and $(I + \partial h)^{-1}$ are easy, let us derive a fixed-point equation that splits the operators.
Assume we are solving $\min f(x) + h(x)$, where both f and h are convex but potentially nondifferentiable.
Notice: we implicitly assumed $\partial(f + h) = \partial f + \partial h$.

63-69 Proximal splitting
$0 \in \partial f(x) + \partial h(x)$
$\Leftrightarrow\ 2x \in (I + \partial f)(x) + (I + \partial h)(x)$
Key idea of splitting: a new variable!
$z \in (I + \partial h)(x) \;\Rightarrow\; x = \operatorname{prox}_h(z)$
$2x - z \in (I + \partial f)(x) \;\Rightarrow\; x = (I + \partial f)^{-1}(2x - z)$
Not a fixed-point equation yet; we need one more idea.

70-77 Douglas-Rachford splitting
Reflection operator: $R_h(z) := 2\operatorname{prox}_h(z) - z$
Douglas-Rachford method:
$z \in (I + \partial h)(x)$, so $x = \operatorname{prox}_h(z)$ and $R_h(z) = 2x - z$.
$0 \in \partial f(x) + \partial h(x)$
$\Leftrightarrow\ 2x \in (I + \partial f)(x) + (I + \partial h)(x)$
$\Leftrightarrow\ 2x - z \in (I + \partial f)(x)$
$\Leftrightarrow\ x = \operatorname{prox}_f(R_h(z))$
But $R_h(z) = 2x - z$, so $z = 2x - R_h(z)$, hence
$z = 2\operatorname{prox}_f(R_h(z)) - R_h(z) = R_f(R_h(z)).$
Finally, z is on both sides of the equation: a fixed-point equation.

78-79 Douglas-Rachford method
$0 \in \partial f(x) + \partial h(x)$; fixed point: $x = \operatorname{prox}_h(z)$, $z = R_f(R_h(z))$.
DR method: given $z^0$, iterate for $k \ge 0$:
$x^k = \operatorname{prox}_h(z^k)$
$v^k = \operatorname{prox}_f(2x^k - z^k)$
$z^{k+1} = z^k + \gamma_k (v^k - x^k)$
Theorem. If $f + h$ admits minimizers, and $(\gamma_k)$ satisfies $\gamma_k \in [0, 2]$, $\sum_k \gamma_k (2 - \gamma_k) = \infty$, then the DR iterates $v^k$ and $x^k$ converge to a minimizer.
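
A minimal sketch of the DR iteration with $\gamma_k = 1$ on a toy problem $\min \tfrac{1}{2}\|x - b\|_2^2 + \lambda\|x\|_1$, chosen so both proxes are closed form and the answer, soft_threshold(b, $\lambda$), is known (an illustration; function names and the assignment of the two terms to f and h are ours):

```python
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def douglas_rachford(prox_f, prox_h, z0, num_iters=200):
    """DR with gamma_k = 1: x = prox_h(z), v = prox_f(2x - z), z <- z + (v - x)."""
    z = z0.copy()
    for _ in range(num_iters):
        x = prox_h(z)
        v = prox_f(2 * x - z)
        z = z + (v - x)
    return x

# toy problem: min 0.5*||x - b||^2 + lam*||x||_1, whose solution is soft_threshold(b, lam)
b = np.array([2.0, -0.3, 0.7])
lam = 0.5
prox_f = lambda v: (v + b) / 2.0           # prox of f(x) = 0.5*||x - b||^2
prox_h = lambda v: soft_threshold(v, lam)  # prox of h(x) = lam*||x||_1
x = douglas_rachford(prox_f, prox_h, z0=np.zeros(3))
assert np.allclose(x, soft_threshold(b, lam), atol=1e-6)
```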

80-82 Douglas-Rachford method
For $\gamma_k = 1$, we have
$z^{k+1} = z^k + v^k - x^k$
$z^{k+1} = z^k + \operatorname{prox}_f(2\operatorname{prox}_h(z^k) - z^k) - \operatorname{prox}_h(z^k)$
Dropping superscripts and writing $P \equiv \operatorname{prox}$, we have $z \leftarrow Tz$ with
$T = I + P_f(2P_h - I) - P_h$
Lemma. DR can be written as $z \leftarrow \tfrac{1}{2}(R_f R_h + I)z$, where $R_f$ denotes the reflection operator $2P_f - I$ (similarly $R_h$). Exercise: Prove this claim.

83 Proximal methods cornucopia
- Douglas-Rachford splitting
- ADMM (special case of DR on the dual)
- Proximal-Dykstra
- Proximal methods for $f_1 + f_2 + \dots + f_n$
- Peaceman-Rachford
- Proximal quasi-Newton, Newton
- Many other variations...

84-85 Best approximation problem
$\min\ \delta_A(x) + \delta_B(x)$, where $A \cap B = \emptyset$. Can we use DR?
Using a clever analysis of Bauschke & Combettes (2004), DR can still be applied! However, it generates diverging iterates, which can be projected back to obtain a solution to
$\min\ \|a - b\|_2$ s.t. $a \in A,\ b \in B.$
See Jegelka, Bach, Sra (NIPS 2013) for an example.

86-93 ADMM
Let us see a separable objective with constraints:
$\min\ f(x) + g(z)$ s.t. $Ax + Bz = c.$
The objective function is separated into x and z variables; the constraint prevents a trivial decoupling.
Introduce the augmented Lagrangian (AL)
$L_\rho(x, z, y) := f(x) + g(z) + y^T(Ax + Bz - c) + \frac{\rho}{2}\|Ax + Bz - c\|_2^2$
Now, a Gauss-Seidel style update to the AL:
$x^{k+1} = \operatorname{argmin}_x\ L_\rho(x, z^k, y^k)$
$z^{k+1} = \operatorname{argmin}_z\ L_\rho(x^{k+1}, z, y^k)$
$y^{k+1} = y^k + \rho(Ax^{k+1} + Bz^{k+1} - c)$
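
A sketch of these three updates for the Lasso written as $\min \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\|z\|_1$ s.t. $x - z = 0$ (an illustration in the scaled dual form $u = y/\rho$, which is our choice; with this splitting the x-update is a linear solve and the z-update a soft-threshold):

```python
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def admm_lasso(A, b, lam, rho=1.0, num_iters=200):
    """ADMM for min 0.5*||Ax-b||^2 + lam*||z||_1  s.t.  x - z = 0 (scaled dual u = y/rho)."""
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    AtA, Atb = A.T @ A, A.T @ b
    for _ in range(num_iters):
        x = np.linalg.solve(AtA + rho * np.eye(n), Atb + rho * (z - u))  # x-update
        z = soft_threshold(x + u, lam / rho)                             # z-update
        u = u + x - z                                                    # scaled dual update
    return z
```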
