Aspects of Convex, Nonconvex, and Geometric Optimization (Lecture 1) Suvrit Sra Massachusetts Institute of Technology

Size: px

Start display at page:

Download "Aspects of Convex, Nonconvex, and Geometric Optimization (Lecture 1) Suvrit Sra Massachusetts Institute of Technology"

Conrad Dennis
5 years ago
Views:

Institute of Technology Hausdorff Institute for

1 Aspects of Convex, Nonconvex, and Geometric Optimization (Lecture 1) Suvrit Sra Massachusetts Institute of Technology Hausdorff Institute for Mathematics (HIM) Trimester: Mathematics of Signal Processing January 2016

2 Outline Convex analysis, optimality First-order methods Proximal methods, operator splitting Stochastic optimization, incremental methods Nonconvex models, algorithms Geometric optimization Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 2 / 50

3 Convex analysis Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 3 / 50

4 Convex sets (vector space) Def. Set C R n called convex, if for any x, y C, the linesegment θx + (1 θ)y, where θ [0, 1], also lies in C. Observations Linear: if restrictions on θ 1, θ 2 are dropped Conic: if restriction θ 1 + θ 2 = 1 is dropped Convex: θ 1 x + θ 2 y C, where θ 1, θ 2 0 and θ 1 + θ 2 = 1. Theorem (Intersection). Let C 1, C 2 be convex sets. Then, C 1 C 2 is also convex. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 4 / 50

5 Convex sets (vector space) Let x 1, x 2,..., x m R n. Their convex hull is { co(x 1,..., x m ) := θ ix i θ i 0, } θ i = 1. i i Let A R m n, and b R m. The set {x Ax = b} is convex (it is an affine space over subspace of solutions of Ax = 0). halfspace { x a T x b }. polyhedron {x Ax b, Cx = d}. ellipsoid { x (x x 0 ) T A(x x 0 ) 1 }, (A: semidefinite) convex cone x K = αx K for α 0 (and K convex) Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 5 / 50

6 Challenge 1 Let A, B R n n be symmetric. Prove that { } R(A, B) := (x T Ax, x T Bx) x T x = 1 is a compact convex set for n 3. At the heart of S-lemma in control / optimization. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 6 / 50

7 Convex functions Def. Function f : I R on interval I called midpoint convex if f ( x+y ) 2 f (x)+f (y) 2, whenever x, y I. Read: f of AM is less than or equal to AM of f. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 7 / 50

8 Convex functions Def. Function f : I R on interval I called midpoint convex if f ( x+y ) 2 f (x)+f (y) 2, whenever x, y I. Read: f of AM is less than or equal to AM of f. Think: What is we use other means, e.g., GM-AM, GM-GM? Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 7 / 50

9 Convex functions Def. Function f : I R on interval I called midpoint convex if f ( x+y ) 2 f (x)+f (y) 2, whenever x, y I. Read: f of AM is less than or equal to AM of f. Think: What is we use other means, e.g., GM-AM, GM-GM? Theorem (J.L.W.V. Jensen). Let f : I R be continuous. Then, f is convex if and only if it is midpoint convex. Extends to f : X R n R; useful for proving convexity. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 7 / 50

10 Recognizing convex functions If f is continuous and midpoint convex, then it is convex. If f is differentiable, then f is convex if and only if dom f is convex and f (x) f (y) + f (y), x y for all x, y dom f. If f is twice differentiable, then f is convex if and only if dom f is convex and 2 f (x) 0 at every x dom f. By showing f to be a pointwise max of convex functions By showing f : dom(f ) R is convex if and only if its restriction to any line that intersects dom(f ) is convex. That is, for any x dom(f ) and any v, the function g(t) = f (x + tv) is convex (on its domain {t x + tv dom(f )}). Exercises (Ch. 3) in Boyd & Vandenberghe Several more ways... Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 8 / 50

11 Convex functions Indicator Let 1 X be the indicator function for X defined as: { 0 if x X, 1 X (x) := otherwise. Note: 1 X (x) is convex if and only if X is convex. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 9 / 50

12 Convex functions distance Example Let X be a convex set. Let x R n be some point. The distance of x to the set X is defined as dist(x, X ) := inf y X x y. Note: because x y is jointly convex in (x, y), the function dist(x, Y) is a convex function of x. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 10 / 50

13 Convex functions norms Example (l p -norm): Let p 1. x p = ( i x i p) 1/p Example (Frobenius-norm): Let A R m n. A F := ij a ij 2 Example Let A be any matrix. Then, the operator norm of A is A := Ax 2 sup x 2 0 x 2 = σ max (A). Example Operator q p norm (typically NP-Hard to compute!) Ax p A q p := sup x 0 x q Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 11 / 50

14 Fenchel conjugate Def. The Fenchel conjugate of a function f is f (z) := sup x dom f x T z f (x). Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 12 / 50

15 Fenchel conjugate Def. The Fenchel conjugate of a function f is f (z) := sup x dom f x T z f (x). Note: f is pointwise (over x) sup of linear functions of z. Hence, it is always convex (even if f is not convex). Example + and conjugate to each other. Think: What about other notions of conjugates? Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 12 / 50

16 Fenchel conjugate Def. The Fenchel conjugate of a function f is f (z) := sup x dom f x T z f (x). Note: f is pointwise (over x) sup of linear functions of z. Hence, it is always convex (even if f is not convex). Example + and conjugate to each other. Think: What about other notions of conjugates? [Artstein-Avidan, Milman (2007).] Up to linear terms, Fenchel conjugate only possible transform under basic axioms. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 12 / 50

17 Challenge 2 Consider the following functions on strictly positive variables: h 1 (x) := 1 x h 2 (x, y) := 1 x + 1 y 1 x + y h 3 (x, y, z) := 1 x + 1 y + 1 z 1 x + y 1 y + z 1 x + z + 1 x + y + z Prove that h 1, h 2, h 3, and in general h n are convex! Prove that in fact each 1/h n is concave 2 h n (x) 0 is not recommended Arose in studying expected random broadcast time in an unreliable star network (I. Affleck, 1994): h n(x) = E n[t (x)] := ( ) ( ) n x n σ(i) 1 n i=1 σ S n j=i x n σ(j) i=1 j=i x σ(j) Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 13 / 50

18 Challenge 3 Kummer function (D. Karp, 2009). Let c > a > 0, let x R. Prove that [ (a + µ) µ log 1 F 1 (a + µ, c + µ, x) = log k x k ], k 0 (c + µ) k k! is convex on (0, ). (Here (x) k := x(x 1) (x k + 1)). Open problem. (Arose while we were studying Directional Statistics using the Multivariate Watson Distribution on RP n 1 ) Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 14 / 50

19 Challenge 4 DPP entropy Determinantal Point Process (DPP): Essentially, probability measure over subsets of a ground set Y, wlog Y = {1,..., n} Suppose random subset A is drawn from 2 Y as per DPP Probability that any S 2 Y is covered by A is P(S A) = det(k S ), S [n], 0 K I the (marginal) DPP kernel; K S = K [S, S] Sometimes, we may write P(S) det(l S ) for L = K (I K ) 1 Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 15 / 50

20 Challenge 4 DPP entropy Determinantal Point Process (DPP): Essentially, probability measure over subsets of a ground set Y, wlog Y = {1,..., n} Suppose random subset A is drawn from 2 Y as per DPP Probability that any S 2 Y is covered by A is P(S A) = det(k S ), S [n], 0 K I the (marginal) DPP kernel; K S = K [S, S] Sometimes, we may write P(S) det(l S ) for L = K (I K ) 1 (R. Lyons, 2003). On {0 K I}, the negentropy H(K ) := P(S) log P(S) S [n] is convex, where P(S) = det(i K ) det([k (I K ) 1 ] S ). Open problem. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 15 / 50

21 Convex Optimization Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 16 / 50

22 Subgradients: global underestimators f(y) f(x) f(y) + f(y), x y y x f (x) f (y) + f (y), x y Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 17 / 50

23 Subgradients: global underestimators f(x) g 1 f(y) f(y) + g 1, x y g 2 g 3 y f (x) f (y) + g, x y Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 17 / 50

24 g 2 g 3 Subgradients: global underestimators f(x) g 1 f(y) f(y) + g 1, x y y subgradients at y g 1, g 2, g 3 are Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 18 / 50

25 Subgradients basic facts f is convex, differentiable: f (y) the unique subgradient at y A vector g is a subgradient at a point y if and only if f (y) + g, x y is globally smaller than f (x). Usually, one subgradient costs approx. as much as f (x) Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 19 / 50

26 Subgradients basic facts f is convex, differentiable: f (y) the unique subgradient at y A vector g is a subgradient at a point y if and only if f (y) + g, x y is globally smaller than f (x). Usually, one subgradient costs approx. as much as f (x) Determining all subgradients at a given point difficult. Subgradient calculus major achievement in convex analysis Fenchel-Young inequality: f (x) + f (s) s, x Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 19 / 50

27 Subgradients example f (x) := max(f 1 (x), f 2 (x)); both f 1, f 2 convex, differentiable Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 20 / 50

28 Subgradients example f (x) := max(f 1 (x), f 2 (x)); both f 1, f 2 convex, differentiable f 1 (x) Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 20 / 50

29 Subgradients example f (x) := max(f 1 (x), f 2 (x)); both f 1, f 2 convex, differentiable f 1 (x) f 2 (x) Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 20 / 50

30 Subgradients example f (x) := max(f 1 (x), f 2 (x)); both f 1, f 2 convex, differentiable f(x) f 1 (x) f 2 (x) Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 20 / 50

31 Subgradients example f (x) := max(f 1 (x), f 2 (x)); both f 1, f 2 convex, differentiable f(x) f 1 (x) f 2 (x) f 1 (x) > f 2 (x): unique subgradient of f is f 1 (x) f 1 (x) < f 2 (x): unique subgradient of f is f 2 (x) f 1 (y) = f 2 (y): subgradients, the segment [f 1 (y), f 2 (y)] (imagine all supporting lines turning about point y) y Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 20 / 50

32 Subdifferential Def. The set of all subgradients at y denoted by f (y). This set is called subdifferential of f at y Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 21 / 50

33 Subdifferential Def. The set of all subgradients at y denoted by f (y). This set is called subdifferential of f at y If f is convex, f (x) is nice: If x relative interior of dom f, then f (x) nonempty Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 21 / 50

34 Subdifferential Def. The set of all subgradients at y denoted by f (y). This set is called subdifferential of f at y If f is convex, f (x) is nice: If x relative interior of dom f, then f (x) nonempty If f differentiable at x, then f (x) = { f (x)} Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 21 / 50

35 Subdifferential Def. The set of all subgradients at y denoted by f (y). This set is called subdifferential of f at y If f is convex, f (x) is nice: If x relative interior of dom f, then f (x) nonempty If f differentiable at x, then f (x) = { f (x)} If f (x) = {g}, then f is differentiable and g = f (x) Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 21 / 50

36 Subdifferential example f (x) = x Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 22 / 50

37 Subdifferential example f (x) = x +1 f(x) 1 x Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 22 / 50

38 Subdifferential example f (x) = x +1 f(x) 1 x 1 x < 0, x = +1 x > 0, [ 1, 1] x = 0. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 22 / 50

39 More examples Example f (x) = x 2. Then, { x/ x 2 x 0, f (x) := {z z 2 1} x = 0. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 23 / 50

40 More examples Example f (x) = x 2. Then, { x/ x 2 x 0, f (x) := {z z 2 1} x = 0. Proof. z 2 x 2 + g, z x Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 23 / 50

41 More examples Example f (x) = x 2. Then, { x/ x 2 x 0, f (x) := {z z 2 1} x = 0. Proof. z 2 x 2 + g, z x z 2 g, z Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 23 / 50

42 More examples Example f (x) = x 2. Then, { x/ x 2 x 0, f (x) := {z z 2 1} x = 0. Proof. z 2 x 2 + g, z x z 2 g, z = g 2 1. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 23 / 50

43 Example Example A convex function need not be subdifferentiable everywhere. Let { (1 x 2 2 f (x) := )1/2 if x 2 1, + otherwise. f diff. for all x with x 2 < 1, but f (x) = whenever x 2 1. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 24 / 50

44 Subdifferential calculus If f is differentiable, f (x) = { f (x)} Scaling α > 0, (αf )(x) = α f (x) = {αg g f (x)} Addition : (f + k)(x) = f (x) + k(x) (set addition) Chain rule : Let A R m n, b R m, f : R m R, and h : R n R be given by h(x) = f (Ax + b). Then, h(x) = A T f (Ax + b). Chain rule : h(x) = f k, where k : X Y is diff. h(x) = f (k(x)) Dk(x) = [Dk(x)] T f (k(x)) Max function : If f (x) := max 1 i m f i (x), then f (x) = conv { f i (x) f i (x) = f (x)}, convex hull over subdifferentials of active functions at x Conjugation: z f (x) if and only if x f (z) * can fail to hold without precise assumptions. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 25 / 50

45 Example It can happen that (f 1 + f 2 ) f 1 + f 2 Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 26 / 50

46 Example It can happen that (f 1 + f 2 ) f 1 + f 2 Example Define f 1 and f 2 by { 2 x if x 0, f 1 (x) := + if x < 0, and f 2 (x) := { + if x > 0, 2 x if x 0. Then, f = max {f 1, f 2 } = 1 {0}, whereby f (0) = R But f 1 (0) = f 2 (0) =. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 26 / 50

47 Example It can happen that (f 1 + f 2 ) f 1 + f 2 Example Define f 1 and f 2 by { 2 x if x 0, f 1 (x) := + if x < 0, and f 2 (x) := { + if x > 0, 2 x if x 0. Then, f = max {f 1, f 2 } = 1 {0}, whereby f (0) = R But f 1 (0) = f 2 (0) =. However, f 1 (x) + f 2 (x) (f 1 + f 2 )(x) always holds. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 26 / 50

48 Optimality constrained For every x, y dom f, we have f (y) f (x) + f (x), y x. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 27 / 50

49 Optimality constrained For every x, y dom f, we have f (y) f (x) + f (x), y x. Thus, x is optimal if and only if f (x ), y x 0, for all y X. If X = R n, this reduces to f (x ) = 0 f(x) X x f(x ) x If f (x ) 0, it defines supporting hyperplane to X at x Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 27 / 50

50 Optimality nonsmooth Theorem (Fermat s rule): Let f : R n (, + ]. Then, argmin f = zer( f ) := { x R n 0 f (x) }. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 28 / 50

51 Optimality nonsmooth Theorem (Fermat s rule): Let f : R n (, + ]. Then, argmin f = zer( f ) := { x R n 0 f (x) }. Proof: x argmin f implies that f (x) f (y) for all y R n. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 28 / 50

52 Optimality nonsmooth Theorem (Fermat s rule): Let f : R n (, + ]. Then, argmin f = zer( f ) := { x R n 0 f (x) }. Proof: x argmin f implies that f (x) f (y) for all y R n. Equivalently, f (y) f (x) + 0, y x y, Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 28 / 50

53 Optimality nonsmooth Theorem (Fermat s rule): Let f : R n (, + ]. Then, argmin f = zer( f ) := { x R n 0 f (x) }. Proof: x argmin f implies that f (x) f (y) for all y R n. Equivalently, f (y) f (x) + 0, y x y, 0 f (x). Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 28 / 50

54 Optimality nonsmooth Theorem (Fermat s rule): Let f : R n (, + ]. Then, argmin f = zer( f ) := { x R n 0 f (x) }. Proof: x argmin f implies that f (x) f (y) for all y R n. Equivalently, f (y) f (x) + 0, y x y, 0 f (x). Nonsmooth optimality min f (x) s.t. x X min f (x) + 1 X (x). Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 28 / 50

55 Optimality nonsmooth Minimizing x must satisfy: 0 (f + 1 X )(x) Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 29 / 50

56 Optimality nonsmooth Minimizing x must satisfy: 0 (f + 1 X )(x) (CQ) Assuming ri(dom f ) ri(x ), 0 f (x) + 1 X (x) Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 29 / 50

57 Optimality nonsmooth Minimizing x must satisfy: 0 (f + 1 X )(x) (CQ) Assuming ri(dom f ) ri(x ), 0 f (x) + 1 X (x) Recall, g 1 X (x) iff 1 X (y) 1 X (x) + g, y x for all y. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 29 / 50

58 Optimality nonsmooth Minimizing x must satisfy: 0 (f + 1 X )(x) (CQ) Assuming ri(dom f ) ri(x ), 0 f (x) + 1 X (x) Recall, g 1 X (x) iff 1 X (y) 1 X (x) + g, y x for all y. So g 1 X (x) means x X and 0 g, y x y X. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 29 / 50

59 Optimality nonsmooth Minimizing x must satisfy: 0 (f + 1 X )(x) (CQ) Assuming ri(dom f ) ri(x ), 0 f (x) + 1 X (x) Recall, g 1 X (x) iff 1 X (y) 1 X (x) + g, y x for all y. So g 1 X (x) means x X and 0 g, y x y X. Normal cone: N X (x) := { g R n 0 g, y x y X } Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 29 / 50

60 Optimality nonsmooth Minimizing x must satisfy: 0 (f + 1 X )(x) (CQ) Assuming ri(dom f ) ri(x ), 0 f (x) + 1 X (x) Recall, g 1 X (x) iff 1 X (y) 1 X (x) + g, y x for all y. So g 1 X (x) means x X and 0 g, y x y X. Normal cone: N X (x) := { g R n 0 g, y x y X } Application. min f (x) s.t. x X : If f is diff., we get 0 f (x ) + N X (x ) Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 29 / 50

61 Optimality nonsmooth Minimizing x must satisfy: 0 (f + 1 X )(x) (CQ) Assuming ri(dom f ) ri(x ), 0 f (x) + 1 X (x) Recall, g 1 X (x) iff 1 X (y) 1 X (x) + g, y x for all y. So g 1 X (x) means x X and 0 g, y x y X. Normal cone: N X (x) := { g R n 0 g, y x y X } Application. min f (x) s.t. x X : If f is diff., we get 0 f (x ) + N X (x ) f (x ) N X (x ) f (x ), y x 0 for all y X. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 29 / 50

62 Problems Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 30 / 50

63 Regularized optimization inf x X f (x) + r(ax) s.t. Ax Y. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 31 / 50

64 Regularized optimization inf x X f (x) + r(ax) s.t. Ax Y. inf u Y Dual problem f ( A T u) + r (u). Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 31 / 50

65 Regularized optimization inf x X f (x) + r(ax) s.t. Ax Y. inf u Y Dual problem Introduce new variable z = Ax inf x X,z Y f ( A T u) + r (u). f (x) + r(z), s.t. z = Ax. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 31 / 50

66 Regularized optimization inf x X f (x) + r(ax) s.t. Ax Y. inf u Y Dual problem Introduce new variable z = Ax inf x X,z Y The (partial)-lagrangian is f ( A T u) + r (u). f (x) + r(z), s.t. z = Ax. L(x, z; u) := f (x) + r(z) + u T (Ax z), x X, z Y; Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 31 / 50

67 Regularized optimization inf x X f (x) + r(ax) s.t. Ax Y. inf u Y Dual problem Introduce new variable z = Ax inf x X,z Y The (partial)-lagrangian is f ( A T u) + r (u). f (x) + r(z), s.t. z = Ax. L(x, z; u) := f (x) + r(z) + u T (Ax z), x X, z Y; Associated dual function g(u) := inf L(x, z; u). x X,z Y Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 31 / 50

68 Regularized optimization inf x X inf y Y f (x) + r(ax) s.t. Ax Y. Dual problem f ( A T y) + r (y). The infimum above can be rearranged as follows g(y) = inf x X f (x) + y T Ax + inf z Y r(z) y T z Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 32 / 50

69 Regularized optimization inf x X inf y Y f (x) + r(ax) s.t. Ax Y. Dual problem f ( A T y) + r (y). The infimum above can be rearranged as follows g(y) = inf x X = sup x X f (x) + y T Ax + inf r(z) y T z z Y { } x T A T y f (x) sup z Y { z T y r(z) } Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 32 / 50

70 Regularized optimization inf x X inf y Y f (x) + r(ax) s.t. Ax Y. Dual problem f ( A T y) + r (y). The infimum above can be rearranged as follows g(y) = inf x X = sup x X f (x) + y T Ax + inf r(z) y T z z Y { } x T A T y f (x) sup z Y = f ( A T y) r (y) s.t. y Y. { z T y r(z) } Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 32 / 50

71 Regularized optimization inf x X inf y Y f (x) + r(ax) s.t. Ax Y. Dual problem f ( A T y) + r (y). The infimum above can be rearranged as follows g(y) = inf x X = sup x X f (x) + y T Ax + inf r(z) y T z z Y { } x T A T y f (x) sup z Y = f ( A T y) r (y) s.t. y Y. { z T y r(z) Dual problem computes sup u Y g(u); so equivalently, } inf y Y f ( A T y) + r (y). Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 32 / 50

72 Regularized optimization inf x Strong duality {f (x) + r(ax)} = sup y if either of the following conditions holds: { f ( A T y) + r (y) } Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 33 / 50

73 Regularized optimization inf x Strong duality {f (x) + r(ax)} = sup y if either of the following conditions holds: { f ( A T y) + r (y) 1 x ri(dom f ) such that Ax ri(dom r) 2 y ri(dom r ) such that A T y ri(dom f ) } Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 33 / 50

74 Regularized optimization inf x Strong duality {f (x) + r(ax)} = sup y if either of the following conditions holds: { f ( A T y) + r (y) 1 x ri(dom f ) such that Ax ri(dom r) 2 y ri(dom r ) such that A T y ri(dom f ) Condition 1 ensures inf attained at some x Condition 2 ensures sup attained at some y } Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 33 / 50

75 Example: norm regularized problems min f (x) + Ax Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 34 / 50

76 Example: norm regularized problems min f (x) + Ax Dual problem min y f ( A T y) s.t. y 1. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 34 / 50

77 Example: norm regularized problems min f (x) + Ax Dual problem min y f ( A T y) s.t. y 1. Say ȳ < 1, such that A T ȳ ri(dom f ), then we have strong duality (e.g., for instance 0 ri(dom f )) Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 34 / 50

78 Example: Lasso-like problem p := min x Ax b 2 + λ x 1. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 35 / 50

79 Example: Lasso-like problem p := min x Ax b 2 + λ x 1. { } x 1 = max x T v v 1 { } x 2 = max x T u u 2 1. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 35 / 50

80 Example: Lasso-like problem p = min x p := min x Ax b 2 + λ x 1. max u,v { } x 1 = max x T v v 1 { } x 2 = max x T u u 2 1. Saddle-point formulation { u T (b Ax) + v T x u 2 1, v λ } Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 35 / 50

81 Example: Lasso-like problem p = min x = max u,v p := min x Ax b 2 + λ x 1. max u,v min x { } x 1 = max x T v v 1 { } x 2 = max x T u u 2 1. Saddle-point formulation { } u T (b Ax) + v T x u 2 1, v λ { } u T (b Ax) + x T v u 2 1, v λ Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 35 / 50

82 Example: Lasso-like problem p = min x = max u,v p := min x Ax b 2 + λ x 1. max u,v min x { } x 1 = max x T v v 1 { } x 2 = max x T u u 2 1. Saddle-point formulation { } u T (b Ax) + v T x u 2 1, v λ { } u T (b Ax) + x T v u 2 1, v λ = max u,v ut b A T u = v, u 2 1, v λ Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 35 / 50

83 Example: Lasso-like problem p = min x = max u,v p := min x Ax b 2 + λ x 1. max u,v min x { } x 1 = max x T v v 1 { } x 2 = max x T u u 2 1. Saddle-point formulation { } u T (b Ax) + v T x u 2 1, v λ { } u T (b Ax) + x T v u 2 1, v λ = max u,v ut b A T u = v, u 2 1, v λ = max u T b u 2 1, A T v λ. u Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 35 / 50

84 Algorithms Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 36 / 50

85 Descent methods min x f (x) x k x k+1... x f(x ) = 0 Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 37 / 50

86 Descent methods x Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 38 / 50

87 Descent methods f(x) x f(x) Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 38 / 50

88 Descent methods f(x) x α f(x) x x δ f(x) f(x) Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 38 / 50

89 Descent methods f(x) x α f(x) x x + α 2 d d x δ f(x) f(x) Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 38 / 50

90 Algorithm 1 Start with some guess x 0 ; 2 For each k = 0, 1,... x k+1 x k + α k d k Check when to stop (e.g., if f (x k+1 ) = 0) Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 39 / 50

91 Gradient methods x k+1 = x k + α k d k, k = 0, 1,... stepsize α k 0, usually ensures f (x k+1 ) < f (x k ) Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 40 / 50

92 Gradient methods x k+1 = x k + α k d k, k = 0, 1,... stepsize α k 0, usually ensures f (x k+1 ) < f (x k ) Descent direction d k satisfies f (x k ), d k < 0 Numerous ways to select α k and d k Usually methods seek monotonic descent f (x k+1 ) < f (x k ) Giving up on this monotonicity led to breakthroughs! Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 40 / 50

93 Gradient methods direction x k+1 = x k + α k d k, k = 0, 1,... Different choices of direction d k Scaled gradient: d k = D k f (x k ), D k 0 Newton s method: (D k = [ 2 f (x k )] 1 ) Quasi-Newton: D k [ 2 f (x k )] 1 Steepest descent: D k = I Diagonally scaled: D k diagonal with D k ii ( ) 2 f (x k 1 ) ( x i ) 2 Discretized Newton: D k = [H(x k )] 1, H via finite-diff.... Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 41 / 50

94 Gradient methods stepsize Exact: α k := argmin f (x k + αd k ) α 0 Limited min: α k = argmin f (x k + αd k ) 0 α s Armijo-rule. Given fixed scalars, s, β, σ with 0 < β < 1 and 0 < σ < 1 (chosen experimentally). Set α k = β m k s where we try β m s for m = 0, 1,... until sufficient descent f (x k ) f (x + β m sd k ) σβ m s f (x k ), d k If f (x k ), d k < 0, stepsize guaranteed to exist Usually, σ small [10 5, 0.1], while β from 1/2 to 1/10 depending on how confident we are about initial stepsize s. Constant: α k = 1/L (for suitable value of L) Diminishing: α k 0 but k α k =. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 42 / 50

95 Gradient-descent Assumption: Lipschitz continuous gradient; denoted f C 1 L f (x) f (y) 2 L x y 2 Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 43 / 50

96 Gradient-descent Assumption: Lipschitz continuous gradient; denoted f CL 1 f (x) f (y) 2 L x y 2 Gradient vectors of closeby points are close to each other Objective function has bounded curvature Speed at which gradient varies is bounded Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 43 / 50

97 Gradient-descent Assumption: Lipschitz continuous gradient; denoted f CL 1 f (x) f (y) 2 L x y 2 Gradient vectors of closeby points are close to each other Objective function has bounded curvature Speed at which gradient varies is bounded Lemma (Descent). Let f C 1 L. Then, f (y) f (x) + f (x), y x + L 2 y x 2 2 Theorem Let f C 1 L and { x k} be sequence generated as above, with α k = 1/L. Then, f (x k+1 ) f (x ) = O(1/k). Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 43 / 50

98 Descent lemma Proof. Since f CL 1, by Taylor s theorem, for the vector z t = y + t(x y) we have f (x) = f (y) f (z t), x y dt. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 44 / 50

99 Descent lemma Proof. Since f CL 1, by Taylor s theorem, for the vector z t = y + t(x y) we have f (x) = f (y) f (z t), x y dt. Add and subtract f (y), x y on rhs we have f (x) f (y) f (y), x y = 1 0 f (z t) f (y), x y dt f (x) f (y) f (y), x y 1 0 f (z t) f (y), x y dt 1 0 f (z t) f (y), x y dt 1 0 f (z t) f (y) x y dt L 1 0 t x y 2 dt = L 2 x y 2. Bounds f (x) above and below with quadratic functions Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 44 / 50

100 Descent lemma corollaries Cor. 1 If f C 1 L, and 0 < α k < 2/L, then f (x k+1 ) < f (x k ) f (x k+1 ) f (x k ) + f (x k ), x k+1 x k + L 2 x k+1 x k 2 = f (x k ) α k f (x k ) α2 k L 2 f (x k ) 2 2 = f (x k ) α k (1 α k 2 L) f (x k ) 2 2 Thus, if α k < 2/L we have descent. Minimize over α k to get best bound: this yields α k = 1/L Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 45 / 50

101 Descent lemma corollaries Cor. 1 If f C 1 L, and 0 < α k < 2/L, then f (x k+1 ) < f (x k ) f (x k+1 ) f (x k ) + f (x k ), x k+1 x k + L 2 x k+1 x k 2 = f (x k ) α k f (x k ) α2 k L 2 f (x k ) 2 2 = f (x k ) α k (1 α k 2 L) f (x k ) 2 2 Thus, if α k < 2/L we have descent. Minimize over α k to get best bound: this yields α k = 1/L Cor. 2 If f C 1 L, then f (x) f (y), x y 1 L f (x) f (y) 2 Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 45 / 50

102 Linear convergence Assumption: Strong convexity; denote f S 1 L,µ f (x) f (y) + f (y), x y + µ 2 x y 2 2 Setting α k = 2/(µ + L) yields linear rate (µ > 0) Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 46 / 50

103 Strongly convex linear rate Theorem. If f SL,µ 1, 0 < α < 2/(L + µ), then the gradient method generates a sequence { x k} that satisfies ( x k x αµL ) k x 0 x 2. µ + L Moreover, if α = 2/(L + µ) then f (x k ) f L 2 ( ) κ 1 2k x 0 x 2 2 κ + 1, where κ = L/µ is the condition number. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 47 / 50

104 Convergence proof sketch Thm 2. Suppose f SL,µ 1. Then, for any x, y Rn f (x) f (y), x y µl µ + L x y µ + L f (x) f (y) 2 2 Proof. Recall descent lemma implies f (x) f (y), x y 1 L f (x) f (y) 2 Apply this result to φ(x) = f (x) µ 2 x 2 with L µ. φ(x) = f (x) µx φ(x) φ(y), x y 1 L µ φ(x) φ(y) 2 Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 48 / 50

105 Convergence proof sketch Let r k = x k x 2, and consider Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 49 / 50

106 Convergence proof sketch Let r k = x k x 2, and consider r 2 k+1 = x k x α f (x k ) 2 2 Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 49 / 50

107 Convergence proof sketch Let r k = x k x 2, and consider r 2 k+1 = x k x α f (x k ) 2 2 = r 2 k 2α f (x k ), x k x + α 2 f (x k ) 2 2 Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 49 / 50

108 Convergence proof sketch Let r k = x k x 2, and consider r 2 k+1 = x k x α f (x k ) 2 2 = rk 2 2α f (x k ), x k x + α 2 f (x k ) 2 2 ( 1 2αµL ) ( rk 2 µ + L + α α 2 ) f (x k ) 2 2 µ + L where we used Thm. 2 with f (x ) = 0 for last inequality. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 49 / 50

109 Convergence proof sketch Let r k = x k x 2, and consider r 2 k+1 = x k x α f (x k ) 2 2 = rk 2 2α f (x k ), x k x + α 2 f (x k ) 2 2 ( 1 2αµL ) ( rk 2 µ + L + α α 2 ) f (x k ) 2 2 µ + L where we used Thm. 2 with f (x ) = 0 for last inequality. To finish: Use Lipschitz gradient and f (x ) = 0 on last term to ultimately obtain r 2 k+1 γr 2 k, where γ 1 2αµL µ+l. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 49 / 50

110 Gradient methods lower bounds x k+1 = x k α k f (x k ) Theorem Lower bound I (Nesterov) For any x 0 R n, and 1 k 1 2 (n 1), there is a smooth f, s.t. f (x k ) f (x ) 3L x 0 x (k + 1) 2 Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 50 / 50

111 Gradient methods lower bounds x k+1 = x k α k f (x k ) Theorem Lower bound I (Nesterov) For any x 0 R n, and 1 k 1 2 (n 1), there is a smooth f, s.t. f (x k ) f (x ) 3L x 0 x (k + 1) 2 Theorem Lower bound II (Nesterov). strongly convex, i.e., SL,µ (µ > 0, κ > 1) For class of smooth, f (x k ) f (x ) µ 2 ( κ 1 κ + 1 ) 2k x 0 x 2 2. Suvrit Sra (MIT) Convex, nonconvex, and geometric optim. 50 / 50

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 2. Convex Optimization

Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 2 Convex Optimization Shiqian Ma, MAT-258A: Numerical Optimization 2 2.1. Convex Optimization General optimization problem: min f 0 (x) s.t., f i