Convex Optimization MLSS 2015 Constantine Caramanis The University of Texas at Austin
The Optimization Problem minimize: f(x) subject to: x ∈ X.
The Optimization Problem minimize: f(x) subject to: x ∈ X. What can this model? When can we solve it?
What Can We Model? Optimization: a frame of mind...
What Can We Model? Max Margin Classification Figure: Given labeled examples, find a classifier with the biggest margin of separation. Is this an optimization problem?
What Can We Model? Image Denoising Figure: Given the noisy camera man, can the denoising problem be cast as an optimization?
What Can We Model? Matrix Completion Figure: A partially observed Films × Users ratings matrix, with most entries missing. Recover a low-rank matrix from a few of its entries. This is a starting point for many recommendation engines. Is this an optimization problem?
What Can We Model? Optimal Inequalities in Probability. X an integer-valued random variable. Given some moment constraints: µ_i = E[X^i], i = 1, 2, 3, 4, 5, find the best upper and lower bounds for P{X ∈ [5,15]}.
What Can We Model?...and what can we solve?
Convex Optimization minimize: f(x) subject to: x ∈ X. f(x) a convex function, X a convex set.
Convex Sets Definition A set X is called a convex set if and only if the convex combination of any two points in the set belongs to the set; i.e., X ⊆ R^n is convex if for all x_1, x_2 ∈ X and λ ∈ [0,1], λx_1 + (1 - λ)x_2 ∈ X. Definition A convex combination of points x_1, ..., x_k is a point of the form Σ_{i=1}^k θ_i x_i, where θ_1 + ... + θ_k = 1 and θ_i ≥ 0.
Convex Sets Figure: Convexity can be checked by examining whether the line segment between any two points of the set lies in the set. Thus the figure on the left (circle) is convex, whereas the figure on the right (star) is not.
Convex Functions Definition The domain of a function f : R^n → R is denoted dom(f), and is defined as the set of points where f is finite: dom(f) = {x ∈ R^n : f(x) < ∞}.
Convex Functions: Definition 1 Definition A function f : R^n → R is convex if for any x_1, x_2 ∈ dom(f) ⊆ R^n and λ ∈ [0,1], we have: λf(x_1) + (1 - λ)f(x_2) ≥ f(λx_1 + (1 - λ)x_2). Figure: Convex functions.
Convex Functions: Definition 2 Definition Suppose a function f : R^n → R is differentiable. Then it is convex if and only if f(y) ≥ f(x) + ∇f(x)^T(y - x). Figure: The graph of f lies above its tangent plane at (x, f(x)): f(y) ≥ f(x) + ∇f(x)^T(y - x).
Convex Functions: Definition 3 Definition Suppose that a function f : R^n → R is twice differentiable. Then f is convex iff its Hessian is positive semidefinite: ∇²f(x) ⪰ 0, ∀x ∈ dom(f). The equivalence of the three definitions is proved in the lecture notes (Propositions 1 and 2); we state it here without proof.
Examples of Convex Functions Exponential f(x) = e^{ax}, a ∈ R. Powers f(x) = x^a is convex on R_++ when a ≥ 1 or a ≤ 0, concave otherwise. Negative Logarithm f(x) = -log x is convex on R_++. Norms The ℓ_p norms on R^n are convex: ‖x‖_p = (Σ_i |x_i|^p)^{1/p}, 1 ≤ p ≤ ∞. Max Function f(x) = max{x_1, x_2, ..., x_n} is convex on R^n. Some Matrix Functions The sum of the k largest singular values.
Intuition: Convex Optimization Easy Figure: Gradient Descent on convex functions: Rolling down hill will lead to convergence to the global optimum
Intuition: Non-Convex Optimization Hard Figure: Gradient Descent on non-convex functions: Rolling down hill may lead us to a local minimum, which can be far from the global minimum. Many problems have massive numbers of highly suboptimal local optima.
Outline From Here Modeling so how do we model some of the problems mentioned above? Algorithms how do we solve them? Theory what can we prove?
Optimal Inequalities in Probability X an integer-valued random variable. Given some moment constraints: µ_i = E[X^i], i = 1, 2, 3, 4, 5, find the best upper and lower bounds for P{X ∈ [5,15]}.
Convex Modeling of Optimal Inequalities in Probability Let P_j = P{X = j}. Then we formulate the optimization problem for finding upper/lower bounds as: max/min: Σ_{j=5}^{15} P_j s.t.: P_j ≥ 0 for every j; Σ_j P_j = 1; Σ_j j^i P_j = µ_i, i = 1, 2, 3, 4, 5. f(·) = ? X = ?
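A minimal sketch of this linear program, not from the original slides: to obtain a finite LP we must assume a support for X (here {0, ..., 20}), and the moments µ_i are taken, purely for illustration, from a hypothetical Binomial(20, 0.4) distribution.

```python
# Sketch: moment-bound LP over P_j = P{X = j}, assuming support {0,...,20}.
import numpy as np
from math import comb
from scipy.optimize import linprog

support = np.arange(21)  # assumed support of X (an assumption, not in the slides)

# Hypothetical "ground truth" moments: Binomial(20, 0.4).
q = np.array([comb(20, j) * 0.4**j * 0.6**(20 - j) for j in support])
mu = [np.sum(support.astype(float)**i * q) for i in range(1, 6)]  # mu_i = E[X^i]

# Equality constraints: sum_j P_j = 1 and sum_j j^i P_j = mu_i, i = 1..5.
A_eq = np.vstack([np.ones(21)] + [support.astype(float)**i for i in range(1, 6)])
b_eq = np.array([1.0] + mu)

# Objective: P{X in [5,15]} = sum_{j=5}^{15} P_j.
c = np.where((support >= 5) & (support <= 15), 1.0, 0.0)

lower = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun
upper = -linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun
true_prob = q[(support >= 5) & (support <= 15)].sum()
```

Since the Binomial distribution itself is feasible for the LP, its probability of [5,15] must lie between the computed lower and upper bounds.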
Image Denoising Figure: Given the noisy camera man, can the denoising problem be cast as an optimization?
Convex Modeling of Image Denoising Domain-specific insight: Natural images have structure: sharp edges with areas of near-constant intensity. Denote the image by its pixel intensity map: X : [0,1] × [0,1] → R, so X(a,b) is the intensity of pixel (a,b). Let X_clean and X_noisy denote the clean and noisy images. We denoise by finding an image close to the noisy image but with smooth areas and sharp edges: min over X : [0,1]² → [0,1] of Σ_{(a,b)} (X(a,b) - X_noisy(a,b))² + λ ‖∇X(a,b)‖²_2. f(·) = ? X = ?
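As a sketch (not from the slides), here is a 1-D analogue of this objective on a synthetic signal: minimize ‖x - y‖² + λ‖Dx‖², where D takes first differences. The noise level, λ, and test signal are all assumptions; the quadratic has the closed-form solution (I + λ DᵀD)x = y.

```python
# 1-D analogue of the denoising objective: min ||x - y||^2 + lam * ||D x||^2.
import numpy as np

rng = np.random.default_rng(0)
n = 200
clean = np.where(np.arange(n) < n // 2, 0.0, 1.0)   # a signal with one sharp edge
y = clean + 0.2 * rng.standard_normal(n)            # noisy observation (assumed noise level)

D = np.diff(np.eye(n), axis=0)                      # (n-1) x n first-difference matrix
lam = 1.0                                           # smoothing weight (assumed)
x_hat = np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

# The denoised signal should be closer to the clean one than the noisy input.
err_noisy = np.linalg.norm(y - clean)
err_denoised = np.linalg.norm(x_hat - clean)
```

Note the squared gradient penalty blurs the edge somewhat; this is exactly the trade-off the modeling discussion is about.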
Matrix Completion Figure: A partially observed Films × Users ratings matrix, with most entries missing. Recover a low-rank matrix from a few of its entries. This is a starting point for many recommendation engines. Is this an optimization problem?
Convex Modeling of Matrix Completion Direct Minimization of the Rank min: rank(X) s.t.: X_ij = M_ij for observed (i,j). However, rank minimization is a non-convex problem.
Convex Relaxations A simple idea with far-reaching consequences: if a problem is non-convex, solve the closest convex problem.
Convex Modeling of Matrix Completion Nuclear Norm Convex Relaxation of the Rank min: ‖X‖_* s.t.: X_ij = M_ij for observed (i,j). Here, ‖X‖_* is called the nuclear norm. It is the sum of the singular values of X. Exercise. Show that f(X) = ‖X‖_* is convex.
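A rough numerical sketch, not the exact constrained program above: a common first-order heuristic for nuclear-norm problems alternates soft-thresholding of the singular values with re-imposing the observed entries. The rank, sampling rate, threshold τ, and iteration count below are all illustrative assumptions.

```python
# Sketch: singular-value thresholding + data re-imposition for completion.
import numpy as np

rng = np.random.default_rng(1)
n, r = 30, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-2 ground truth
mask = rng.random((n, n)) < 0.6                                # 60% observed (assumed)

X = np.where(mask, M, 0.0)
tau = 2.0                                                      # threshold (assumed)
for _ in range(300):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt             # shrink singular values
    X[mask] = M[mask]                                          # keep observed entries

rel_err = np.linalg.norm(X - M) / np.linalg.norm(M)
```

The shrinkage step is exactly the prox of the nuclear norm, which ties this sketch to the proximal methods discussed later in the deck.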
Exercises and Software Try out these examples! Optimal Probability inequalities: try using linprog in Matlab. More general convex solver: CVX free download: http://cvxr.com/cvx/
Outline From Here Modeling so how do we model some of the problems mentioned above? Algorithms how do we solve them? Theory what can we prove?
Outline From Here Modeling so how do we model some of the problems mentioned above? Algorithms how do we solve them? Modern problems in machine learning are increasingly characterized by their massive size. We need iterative algorithms that have good convergence guarantees. Theory what can we prove? Many interesting problems (sparse regression, matrix completion, etc.) are inherently non-convex, but ideas of convex relaxation, as above, can be used. When can we prove that the solution of the convex problem is useful?
Algorithms for Convex Optimization min: f(x) s.t.: x ∈ X ⊆ R^n. Want: x̂ such that x̂ is close to x*, or f(x̂) is close to f(x*). What can we expect? How hard must we work? Answer depends on f(·), X, n, and the error tolerance ε.
Algorithms for Convex Optimization min: f(x) s.t.: x ∈ X ⊆ R^n. Question: which is better? Algorithm (A1) produces an ε-accurate solution in time O(n² log(1/ε)); Algorithm (A2) produces an ε-accurate solution in time O(n/ε²).
Second Order and Interior Point Methods
First Order Methods Oracle Model: given x, the oracle produces (f(x), ∇f(x)), and Π_X(x). How many calls to the oracle do we need to produce an ε-accurate solution? Discussion.
First Order Methods Oracle Model: given x, the oracle produces (f(x), ∇f(x)), and Π_X(x). How many calls to the oracle do we need to produce an ε-accurate solution? Basic iterative algorithm for smooth convex optimization: x+ = x - η∇f(x) (unconstrained optimization); x+ = Π_X(x - η∇f(x)) (constrained optimization). Assumptions on f(·)? Computation per iteration? Number of iterations?
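The iteration above can be sketched in a few lines. As an illustrative assumption, take f(x) = ‖Ax - b‖² on a synthetic least-squares problem and let X be the nonnegative orthant, so the projection is just clipping.

```python
# Sketch: projected gradient descent, x+ = Pi_X(x - eta * grad f(x)).
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 10))
x_true = np.abs(rng.standard_normal(10))      # nonnegative ground truth (assumed)
b = A @ x_true

grad = lambda x: 2 * A.T @ (A @ x - b)        # gradient oracle for ||Ax - b||^2
proj = lambda x: np.maximum(x, 0.0)           # projection onto X = {x >= 0}

L = 2 * np.linalg.norm(A, 2) ** 2             # Lipschitz constant of the gradient
eta = 1.0 / L                                 # step size from the analysis below
x = np.zeros(10)
for _ in range(2000):
    x = proj(x - eta * grad(x))

final_err = np.linalg.norm(A @ x - b)
```

The step size 1/L anticipates the convergence analysis on the following slides.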
Convergence of Gradient Descent min: f(x) s.t.: x ∈ X ⊆ R^n. Assumption: f(·) has L-Lipschitz gradients: ‖∇f(x) - ∇f(y)‖ ≤ L‖x - y‖. Upper bound on the curvature of f.
Convergence of Gradient Descent Recall the definition of f convex: Definition Suppose a function f : R^n → R is differentiable. Then it is convex if and only if f(y) ≥ f(x) + ∇f(x)^T(y - x). Figure: The graph of f lies above its tangent plane at (x, f(x)).
Convergence of Gradient Descent Now, in addition to: f(y) ≥ f(x) + ∇f(x)^T(y - x), we have Lemma If ∇f is L-Lipschitz, then f(y) ≤ f(x) + ∇f(x)^T(y - x) + (L/2)‖y - x‖².
Convergence of Gradient Descent Proof of Lemma: First note that the function g(x) = (L/2)‖x‖² - f(x) is convex. Exercise. If f(·) has second derivatives, ∇²g(x) = L·I - ∇²f(x) ⪰ 0. Prove g(x) is convex without that assumption.
Convergence of Gradient Descent Since g(x) is convex, by definition, g(y) ≥ g(x) + ∇g(x)^T(y - x). Rearranging, we get the statement of the lemma.
Convergence of Gradient Descent From the Lemma: f(y) ≤ f(x) + ∇f(x)^T(y - x) + (L/2)‖y - x‖², taking y = x - η∇f(x), we have: f(x - η∇f(x)) ≤ f(x) + ∇f(x)^T(-η∇f(x)) + (L/2)‖η∇f(x)‖² = f(x) + ((L/2)η² - η)‖∇f(x)‖². Corollary Choosing η ≤ 1/L, f(x - η∇f(x)) ≤ f(x) - (η/2)‖∇f(x)‖².
Convergence of Gradient Descent Now for x^(i+1) = x^(i) - η∇f(x^(i)), we have f(x^(i+1)) ≤ f(x^(i)) - (η/2)‖∇f(x^(i))‖² ≤ f(x*) + ∇f(x^(i))^T(x^(i) - x*) - (η/2)‖∇f(x^(i))‖² = f* + (1/2η)(‖x^(i) - x*‖² - ‖x^(i) - x* - η∇f(x^(i))‖²) = f* + (1/2η)(‖x^(i) - x*‖² - ‖x^(i+1) - x*‖²).
Convergence of Gradient Descent Summing over k iterations of the algorithm: Σ_{i=1}^k (f(x^(i)) - f*) ≤ (1/2η) Σ_{i=1}^k (‖x^(i-1) - x*‖² - ‖x^(i) - x*‖²) = (1/2η)(‖x^(0) - x*‖² - ‖x^(k) - x*‖²) ≤ (1/2η)‖x^(0) - x*‖². Since f(x^(i)) is non-increasing, f(x^(k)) - f* ≤ (1/k) Σ_{i=1}^k (f(x^(i)) - f*) ≤ (1/k)(1/2η)‖x^(0) - x*‖². Theorem Under the above assumptions, gradient descent converges at a rate of O(1/k); i.e., it has error ε in O(1/ε) iterations.
Convergence of Gradient Descent Discussion: This analysis is for the unconstrained setting. The result is the same for the constrained setting. Key to proof: using convexity and upper bound on curvature. Result: dimension independent!
What about Different Assumptions Given a function f(x) and error target ε such that f(x̂) - f* ≤ ε, and using gradient descent: Under the assumption that f(x) is smooth (upper bound on curvature) and convex: O(1/ε) iterations are needed. What if f(x) is not smooth? For example, f(x) = ‖Ax - b‖_2 + ‖x‖_1.
Subgradients and Subdifferential f(·) convex but not differentiable. Still have the basic definition of convexity: f(y) ≥ f(x) + g_x^T(y - x). Now there are (possibly) many under-estimates. Define the subdifferential: ∂f(x) = {g : f(y) ≥ f(x) + g^T(y - x), ∀y}.
Subgradient Algorithm Starting at x, given any g_x ∈ ∂f(x), and step size η, x+ = x - ηg_x. Convergence guarantees: We still have the convexity inequality. Do we have something like f(x - η∇f(x)) ≤ f(x) - (η/2)‖∇f(x)‖²???
Unfortunately Not: Example The subgradient method is not a descent method. f(x_1, x_2) = |x_1| + 10|x_2|. The current position is (x_1, x_2) = (10, 0). The two extreme subgradients at (10, 0) are (1, 10) and (1, -10), so the subdifferential is ∂f(10, 0) = {(1, 10v) : -1 ≤ v ≤ 1}.
Figure: The subdifferential at (10, 0), showing the two extreme subgradients and the new point after a step. Consider g_x = (1, 1) ∈ ∂f(10, 0). Then x+ = x - t g_x. The resulting point increases the function value.
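The example above is easy to check numerically: stepping from (10, 0) along the subgradient g = (1, 1) increases f for every positive step size tried.

```python
# Check: a subgradient step from (10, 0) along g = (1, 1) increases
# f(x1, x2) = |x1| + 10 |x2|  (f(10 - t, -t) = |10 - t| + 10 t > 10 for t > 0).
f = lambda x1, x2: abs(x1) + 10 * abs(x2)

x = (10.0, 0.0)
g = (1.0, 1.0)   # a valid subgradient at (10, 0): (1, 10v) with v = 0.1
increased = all(f(x[0] - t * g[0], x[1] - t * g[1]) > f(*x)
                for t in [0.01, 0.1, 1.0, 5.0])
```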
Convergence of Subgradient Method Assume f is convex, and ‖g‖ ≤ G for any g ∈ ∂f(x), ∀x. For x^(k+1) = x^(k) - ηg_{x^(k)} we have: ‖x^(k+1) - x*‖² = ‖x^(k) - ηg_{x^(k)} - x*‖² = ‖x^(k) - x*‖² - 2η g_{x^(k)}^T(x^(k) - x*) + η²‖g_{x^(k)}‖² ≤ ‖x^(k) - x*‖² - 2η(f(x^(k)) - f*) + η²G². Iterating, and letting x̂ denote the best iterate so far: ‖x^(k+1) - x*‖² ≤ ‖x^(0) - x*‖² - 2kη(f(x̂) - f*) + η²G²k.
Convergence of Subgradient Method Rearranging gives f(x̂) - f* ≤ (‖x^(0) - x*‖² + G²η²k) / (2kη). Minimizing over η we find the best step size η ∝ 1/√k, and we get f(x̂) - f* ≲ 1/√k. Theorem Under the above assumptions, the subgradient method converges at a rate of O(1/√k); i.e., it has error ε in O(1/ε²) iterations.
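A minimal sketch of the method on the earlier example f(x_1, x_2) = |x_1| + 10|x_2|, with the 1/√k step size from the theorem and a fixed horizon (both assumptions). Since the method is not a descent method, we track the best iterate x̂.

```python
# Sketch: subgradient method with step eta = 1/sqrt(k), tracking the best iterate.
import numpy as np

f = lambda x: abs(x[0]) + 10 * abs(x[1])
subgrad = lambda x: np.array([np.sign(x[0]) if x[0] else 1.0,
                              10 * np.sign(x[1]) if x[1] else 10.0])

k = 10000
eta = 1.0 / np.sqrt(k)          # fixed step tuned to the horizon
x = np.array([10.0, 0.0])
best = f(x)                      # f(x-hat), the best value seen so far
for _ in range(k):
    x = x - eta * subgrad(x)
    best = min(best, f(x))
```

The iterates oscillate around the minimizer (0, 0), but the best value approaches f* = 0 at the predicted slow rate.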
What about Different Assumptions Given a function f(x) and error target ε such that f(x̂) - f* ≤ ε, and using (sub)gradient descent: Under the assumption that f(x) is convex: O(1/ε²) iterations are needed. Under the assumption that f(x) is smooth (upper bound on curvature) and convex: O(1/ε) iterations are needed. Under the assumption that f(x) is smooth and strongly convex (also a lower bound on curvature): O(ln(1/ε)) iterations are needed. Is this the best we can do?
First Order Convergence Guarantees For the (sub)gradient algorithm, the above rates are the best possible; i.e., the analysis cannot be improved. There are, however, other first-order algorithms.
Proximal Algorithm Suppose we want to minimize: f(x) = g(x) + h(x), where g(x) is smooth, and h(x) is simple. Example: ℓ_1-regularized regression min: ‖Ax - b‖²_2 + λ‖x‖_1. ‖Ax - b‖²_2 is smooth, and ‖x‖_1 is simple. Can we do better?
Proximal Algorithm Briefly, the answer is yes, if we can easily evaluate the prox function: Prox_{ηh}(y) = argmin_x : h(x) + (1/2η)‖x - y‖². Proximal algorithm: x+ = Prox_{ηh}(x - η∇g(x)). Convergence rate: O(1/k).
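A sketch of this iteration on the ℓ_1-regularized regression example: for h = λ‖·‖_1 the prox is entrywise soft-thresholding. The problem data, λ, and iteration count below are illustrative assumptions.

```python
# Sketch: proximal gradient for min ||Ax - b||^2 + lam * ||x||_1.
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100); x_true[:5] = 3.0          # sparse ground truth (assumed)
b = A @ x_true

lam = 1.0                                         # regularization weight (assumed)
L = 2 * np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of grad g
eta = 1.0 / L
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)  # Prox of t*||.||_1

x = np.zeros(100)
for _ in range(3000):
    grad_g = 2 * A.T @ (A @ x - b)
    x = soft(x - eta * grad_g, eta * lam)         # x+ = Prox_{eta*h}(x - eta*grad g)

obj = np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum()
```

The soft-thresholding step is what makes each iteration cheap even though ‖x‖_1 is non-smooth.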
Accelerated Algorithms (Lower Bounds) Accelerated algorithms: x^(k+1) = x^(k) - α∇f(x^(k)) + β(x^(k) - x^(k-1)). Still fits in our oracle model. Convergence: If f is smooth, then for error ε, we need O(1/√ε) iterations, hence O(1/k²) convergence. Proximal analogs exist for the case f(x) = g(x) + h(x).
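A sketch comparing the two rates on an ill-conditioned quadratic f(x) = Σ_i d_i x_i². The eigenvalue spread and the momentum schedule β = k/(k+3) are illustrative assumptions (the schedule is one standard choice achieving the O(1/k²) guarantee).

```python
# Sketch: plain gradient descent (O(1/k)) vs. accelerated gradient (O(1/k^2)).
import numpy as np

d = np.logspace(-3, 0, 20)             # eigenvalues spread over 3 decades (assumed)
f = lambda x: np.sum(d * x**2)
grad = lambda x: 2 * d * x
L = 2 * d.max()                        # Lipschitz constant of the gradient
alpha = 1.0 / L
k_max = 500

# Plain gradient descent.
x_gd = np.ones(20)
for _ in range(k_max):
    x_gd = x_gd - alpha * grad(x_gd)

# Accelerated: gradient step at the extrapolated point y = x + beta*(x - x_prev).
x, x_prev = np.ones(20), np.ones(20)
for k in range(k_max):
    y = x + (k / (k + 3)) * (x - x_prev)   # momentum extrapolation
    x_prev, x = x, y - alpha * grad(y)
```

With the same step size and iteration count, the momentum term lets the accelerated iterates make far more progress on the poorly conditioned directions (here f* = 0).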
Outline From Here Modeling so how do we model some of the problems mentioned above? Algorithms how do we solve them? Theory what can we prove?
Duality Theory Figure: A convex set can be represented in two ways: a convex hull of extreme points, or an intersection of half-spaces that contain it.
Duality Theory Figure: Consider optimizing in a given direction over a convex set.
Duality Theory Figure: The optimal point is the point with the best value.
Duality Theory Figure:...but there is also a half-space that says you cannot go further. This is called a certificate of optimality. Non-convex optimization problems do not always have such easy certificates of optimality.
Duality Theory We can search for the best point (search over points). Or (and) we can search for the certificate of optimality (search over half-spaces).
Provably Close to Optimality Figure: If we solve a relaxation, sometimes we can characterize, via duality theory, how close the optimal point on the non-convex set will be to the optimal point on the convex relaxation. Sparse regression and compressed sensing. Low-rank matrix completion. Many other examples.
Summary and Directions Modeling with Convex Optimization: Intuition and creativity are absolutely essential. Successful problem modeling comes from understanding the problem, and hence what is important (e.g., boundary/smoothness of natural images, or the approximate low rank of rankings and preferences), and also the theory of convex optimization: which problems can be solved quickly/efficiently, and what can we say about convex approximations to non-convex problems? Algorithms: Today we discussed first order methods. These are suited for very large-scale problems, as we often see in large-scale ML/data mining. Not all methods are applicable or best for all problems. Understanding the demands of your application, and the performance of convex optimization algorithms in different settings, is very important; it can make or break large-scale applications.
Summary and Directions Theory and Duality: We discussed this least today, but it is important not only for analysis, but also for algorithmic development. We discussed only algorithms that search for optimal points. Dual algorithms search over half-spaces. Different problems may yield to better solution or approximation in one domain or another.
Some Useful References Convex Optimization, by Stephen Boyd and Lieven Vandenberghe (see also the slides from their courses). Convex Optimization Algorithms, by Dimitri Bertsekas. Optimization Models and Applications, by Giuseppe Calafiore and Laurent El Ghaoui. Introductory Lectures on Convex Optimization, by Yurii Nesterov. Lectures on Modern Convex Optimization, by Aharon Ben-Tal and Arkadi Nemirovski.
The End Thanks, and feel free to contact me with questions: constantine@utexas.edu