Convex Optimization MLSS 2015

Convex Optimization MLSS 2015 Constantine Caramanis The University of Texas at Austin

The Optimization Problem. minimize: f(x), subject to: x ∈ X.

The Optimization Problem. minimize: f(x), subject to: x ∈ X. What can this model? When can we solve it?

What Can We Model? Optimization: a frame of mind...

What Can We Model? Max Margin Classification Figure: Given labeled examples, find a classifier with the biggest margin of separation. Is this an optimization problem?

What Can We Model? Image Denoising Figure: Given the noisy camera man, can the denoising problem be cast as an optimization?

What Can We Model? Matrix Completion. Figure: a partially observed films-by-users ratings matrix. Recover a low-rank matrix from a few of its entries. This is a starting point for many recommendation engines. Is this an optimization problem?

What Can We Model? Optimal Inequalities in Probability. X an integer-valued random variable. Given some moment constraints, µ_i = E[X^i], i = 1, 2, 3, 4, 5, find the best upper and lower bounds for P{X ∈ [5, 15]}.

What Can We Model?...and what can we solve?

Convex Optimization. minimize: f(x), subject to: x ∈ X, where f(x) is a convex function and X is a convex set.

Convex Sets. Definition. A set X is called a convex set if and only if the convex combination of any two points in the set belongs to the set, i.e., X ⊆ R^n is convex if for all x_1, x_2 ∈ X and λ ∈ [0, 1], λx_1 + (1 − λ)x_2 ∈ X. Definition. A convex combination of points x_1, ..., x_k is a point of the form ∑_{i=1}^k θ_i x_i, where θ_1 + ... + θ_k = 1 and θ_i ≥ 0.

Convex Sets. Figure: Convexity can be checked by examining whether the line segment between any two points of the set lies in the set. Thus the figure on the left (circle) is convex, whereas the figure on the right (star) is not.

Convex Functions. Definition. The domain of a function f: R^n → R is denoted dom(f), and is defined as the set of points where f is finite: dom(f) = {x ∈ R^n : f(x) < ∞}.

Convex Functions: Definition 1. Definition. A function f: R^n → R is convex if for any x_1, x_2 ∈ dom(f) ⊆ R^n and λ ∈ [0, 1], we have: λf(x_1) + (1 − λ)f(x_2) ≥ f(λx_1 + (1 − λ)x_2). Figure: Convex functions.
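A quick numerical sanity check of Definition 1 (not part of the original slides): sample random pairs of points and random λ, and verify the inequality. The test function f and the sampling scheme below are arbitrary illustrative choices; passing the check is evidence of convexity, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Candidate function: a quadratic plus an l1 norm, both convex, so f is convex.
    return 0.5 * np.dot(x, x) + np.sum(np.abs(x))

def looks_convex(f, dim=5, trials=10_000, tol=1e-9):
    """Return False if some sampled pair violates Definition 1, else True."""
    for _ in range(trials):
        x1, x2 = rng.normal(size=dim), rng.normal(size=dim)
        lam = rng.uniform()
        lhs = lam * f(x1) + (1 - lam) * f(x2)
        rhs = f(lam * x1 + (1 - lam) * x2)
        if lhs < rhs - tol:          # tolerance guards against floating-point error
            return False
    return True

print(looks_convex(f))  # expected: True (no violation found)
```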

Convex Functions: Definition 2. Definition. Suppose a function f: R^n → R is differentiable. Then it is convex if and only if f(y) ≥ f(x) + ∇f(x)ᵀ(y − x). Figure: the graph of f lies above its tangent f(x) + ∇f(x)ᵀ(y − x) at the point (x, f(x)).

Convex Functions: Definition 3. Definition. Suppose that a function f: R^n → R is twice differentiable. Then f is convex if and only if its Hessian is positive semidefinite: ∇²f(x) ⪰ 0 for all x ∈ dom(f). The equivalence of the three definitions is proved in the lecture notes (Propositions 1 and 2); we omit the proofs here.

Examples of Convex Functions. Exponential: f(x) = e^{ax}, a ∈ R. Powers: f(x) = x^a is convex on R_{++} when a ≥ 1 or a ≤ 0, and concave otherwise. Negative logarithm: f(x) = −log x is convex on R_{++}. Norms: the ℓ_p norms on R^n are convex: ‖x‖_p = (∑_i |x_i|^p)^{1/p}, 1 ≤ p ≤ ∞. Max function: f(x) = max{x_1, x_2, ..., x_n} is convex on R^n. Some matrix functions: e.g., the sum of the k largest singular values.

Intuition: Convex Optimization is Easy. Figure: Gradient descent on a convex function: rolling downhill leads to convergence to the global optimum.

Intuition: Non-Convex Optimization is Hard. Figure: Gradient descent on a non-convex function: rolling downhill may lead to a local minimum, which can be far from the global minimum. Many problems have massive numbers of highly suboptimal local optima.

Outline From Here. Modeling: how do we model some of the problems mentioned above? Algorithms: how do we solve them? Theory: what can we prove?

Optimal Inequalities in Probability. X an integer-valued random variable. Given some moment constraints, µ_i = E[X^i], i = 1, 2, 3, 4, 5, find the best upper and lower bounds for P{X ∈ [5, 15]}.

Convex Modeling of Optimal Inequalities in Probability. Let p_j = P{X = j}. Then we formulate the optimization problem for finding the upper/lower bounds as: max/min ∑_{j=5}^{15} p_j, s.t. p_j ≥ 0 for every j, ∑_j p_j = 1, ∑_j j^i p_j = µ_i, i = 1, 2, 3, 4, 5. f(·) = ? X = ?
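A minimal sketch of this linear program in Python (the slides later suggest Matlab's linprog; scipy.optimize.linprog plays the same role). The support {0, ..., 30} and the moment values are illustrative assumptions; here the moments are generated from a Binomial(20, 0.35) distribution so the LP is guaranteed to be feasible.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import binom

N = 30                                   # assumed support {0, ..., N} (illustrative)
support = np.arange(N + 1)

# Placeholder moments mu_i = E[X^i], generated from a Binomial(20, 0.35) distribution.
pmf = binom.pmf(support, 20, 0.35)
mu = np.array([np.sum(pmf * support.astype(float) ** i) for i in range(1, 6)])

# Objective: sum of p_j over 5 <= j <= 15; linprog minimizes, so negate it for the max.
c = np.where((support >= 5) & (support <= 15), 1.0, 0.0)

# Equality constraints: probabilities sum to one, plus the five moment constraints.
A_eq = np.vstack([np.ones(N + 1)] + [support.astype(float) ** i for i in range(1, 6)])
b_eq = np.concatenate([[1.0], mu])

lower = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
upper = linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print("P{X in [5,15]} lies between", lower.fun, "and", -upper.fun)
```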

Image Denoising Figure: Given the noisy camera man, can the denoising problem be cast as an optimization?

Convex Modeling of Image Denoising. Domain-specific insight: natural images have structure: sharp edges with areas of near-constant intensity. Denote the image by a pixel intensity map X: [0,1] × [0,1] → R, so X(a, b) is the intensity of pixel (a, b). Let X_clean and X_noisy denote the clean and noisy images. We denoise by finding an image close to the noisy image but with smooth areas and sharp edges: min over X of ∫_{[0,1]²} (X(a, b) − X_noisy(a, b))² da db + λ ∫_{[0,1]²} ‖∇X(a, b)‖_2² da db. f(·) = ? X = ?
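A minimal sketch (not from the slides) of gradient descent on a discretized version of this objective, with finite differences standing in for ∇X; the step size, λ, and the toy image are illustrative assumptions.

```python
import numpy as np

def denoise(x_noisy, lam=1.0, eta=0.05, iters=300):
    """Gradient descent on sum((X - X_noisy)^2) + lam * sum of squared finite differences of X."""
    x = x_noisy.astype(float).copy()
    for _ in range(iters):
        dh = x[1:, :] - x[:-1, :]            # finite differences down the rows
        dv = x[:, 1:] - x[:, :-1]            # finite differences across the columns
        grad = 2.0 * (x - x_noisy)           # gradient of the data-fit term
        grad[:-1, :] -= 2.0 * lam * dh       # gradient of the smoothness term
        grad[1:, :] += 2.0 * lam * dh
        grad[:, :-1] -= 2.0 * lam * dv
        grad[:, 1:] += 2.0 * lam * dv
        x -= eta * grad
    return x

# Toy usage: a constant patch corrupted by noise.
rng = np.random.default_rng(0)
clean = np.ones((32, 32))
noisy = clean + 0.3 * rng.normal(size=clean.shape)
print("noisy MSE:", np.mean((noisy - clean) ** 2),
      "denoised MSE:", np.mean((denoise(noisy) - clean) ** 2))
```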

Matrix Completion. Figure: a partially observed films-by-users ratings matrix. Recover a low-rank matrix from a few of its entries. This is a starting point for many recommendation engines. Is this an optimization problem?

Convex Modeling of Matrix Completion. Direct minimization of the rank: min rank(X), s.t. X_ij = M_ij for observed (i, j). However, rank minimization is a non-convex problem.

Convex Relaxations A simple idea with far-reaching consequences: if a problem is non-convex, solve the closest convex problem.

Convex Modeling of Matrix Completion. Nuclear norm convex relaxation of the rank: min ‖X‖_*, s.t. X_ij = M_ij for observed (i, j). Here, ‖X‖_* is called the nuclear norm. It is the sum of the singular values of X. Exercise. Show that f(X) = ‖X‖_* is convex.
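A minimal sketch of this relaxation in CVXPY, the Python counterpart of the CVX package mentioned below (not from the slides); the random rank-2 matrix, the 50% observation pattern, and the problem size are illustrative assumptions.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, r = 20, 2
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))   # ground-truth low-rank matrix
mask = (rng.random((n, n)) < 0.5).astype(float)          # 1 where an entry is observed

X = cp.Variable((n, n))
problem = cp.Problem(cp.Minimize(cp.normNuc(X)),          # nuclear norm ||X||_*
                     [cp.multiply(mask, X) == mask * M])  # agree on the observed entries
problem.solve()

# A small relative error indicates (approximate) recovery of the hidden entries.
print("relative error:", np.linalg.norm(X.value - M) / np.linalg.norm(M))
```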

Exercises and Software Try out these examples! Optimal Probability inequalities: try using linprog in Matlab. More general convex solver: CVX free download: http://cvxr.com/cvx/

Outline From Here. Modeling: how do we model some of the problems mentioned above? Algorithms: how do we solve them? Theory: what can we prove?

Outline From Here. Modeling: how do we model some of the problems mentioned above? Algorithms: how do we solve them? Modern problems in machine learning are increasingly characterized by their massive size; we need iterative algorithms that have good convergence guarantees. Theory: what can we prove? Many interesting problems (sparse regression, matrix completion, etc.) are inherently non-convex, but ideas of convex relaxation, as above, can be used. When can we prove that the solution of the convex problem is useful?

Algorithms for Convex Optimization. min f(x), s.t. x ∈ X ⊆ R^n. Want: x̂ such that x̂ is close to x*, or f(x̂) is close to f(x*). What can we expect? How hard must we work? The answer depends on f(·), X, n, and the error tolerance ε.

Algorithms for Convex Optimization. min f(x), s.t. x ∈ X ⊆ R^n. Question: which is better? Algorithm (A1) produces an ε-accurate solution in time O(n² log(1/ε)); Algorithm (A2) produces an ε-accurate solution in time O(n/ε²).

Second Order and Interior Point Methods

First Order Methods. Oracle model: given x, the oracle produces (f(x), ∇f(x)), and Π_X(x), the projection of x onto X. How many calls to the oracle do we need to produce an ε-accurate solution? Discussion.

First Order Methods. Oracle model: given x, the oracle produces (f(x), ∇f(x)), and Π_X(x). How many calls to the oracle do we need to produce an ε-accurate solution? Basic iterative algorithm for smooth convex optimization: x⁺ = x − η∇f(x) (unconstrained optimization); x⁺ = Π_X(x − η∇f(x)) (constrained optimization). Assumptions on f(·)? Computation per iteration? Number of iterations?
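A minimal sketch of these two updates (not from the slides), with a least-squares objective playing the role of the first-order oracle and a Euclidean ball as X; the data, the ball radius, and the iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 10)), rng.normal(size=50)

def oracle(x):
    """First-order oracle: return (f(x), grad f(x)) for f(x) = ||Ax - b||^2."""
    r = A @ x - b
    return r @ r, 2 * A.T @ r

def project_ball(x, radius=1.0):
    """Projection onto the Euclidean ball {x : ||x|| <= radius}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else radius * x / nrm

eta = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)   # step size 1/L for this objective
x = np.zeros(10)
for _ in range(500):
    _, g = oracle(x)
    x = project_ball(x - eta * g)             # drop the projection for the unconstrained case
print("constrained objective value:", oracle(x)[0])
```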

Convergence of Gradient Descent. min f(x), s.t. x ∈ X ⊆ R^n. Assumption: f(·) has L-Lipschitz gradients: ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖. This is an upper bound on the curvature of f.

Convergence of Gradient Descent. Recall the definition of f convex: Definition. Suppose a function f: R^n → R is differentiable. Then it is convex if and only if f(y) ≥ f(x) + ∇f(x)ᵀ(y − x). Figure: the graph of f lies above its tangent f(x) + ∇f(x)ᵀ(y − x) at the point (x, f(x)).

Convergence of Gradient Descent. Now, in addition to f(y) ≥ f(x) + ∇f(x)ᵀ(y − x), we have: Lemma. If ∇f is L-Lipschitz, then f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2)‖y − x‖².

Convergence of Gradient Descent. Proof of Lemma: First note that the function g(x) = (L/2)‖x‖² − f(x) is convex. If f(·) has second derivatives, ∇²g(x) = L·I − ∇²f(x) ⪰ 0. Exercise. Prove that g(x) is convex without that assumption.

Convergence of Gradient Descent. From the lemma, g(x) convex means, by definition, g(y) ≥ g(x) + ∇g(x)ᵀ(y − x). Rearranging, we get the statement of the lemma.

Convergence of Gradient Descent. From the Lemma, f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2)‖y − x‖², with y = x − η∇f(x) we have: f(x − η∇f(x)) ≤ f(x) + ∇f(x)ᵀ(−η∇f(x)) + (L/2)‖η∇f(x)‖² = f(x) + ((L/2)η² − η)‖∇f(x)‖². Corollary. Choosing η < 1/L, f(x − η∇f(x)) ≤ f(x) − (η/2)‖∇f(x)‖².

Convergence of Gradient Descent. Now for x^(i+1) = x^(i) − η∇f(x^(i)), we have: f(x^(i+1)) ≤ f(x^(i)) − (η/2)‖∇f(x^(i))‖² ≤ f(x*) + ∇f(x^(i))ᵀ(x^(i) − x*) − (η/2)‖∇f(x^(i))‖² = f* + (1/(2η))(‖x^(i) − x*‖² − ‖x^(i) − x* − η∇f(x^(i))‖²) = f* + (1/(2η))(‖x^(i) − x*‖² − ‖x^(i+1) − x*‖²).

Convergence of Gradient Descent. Summing over k iterations of the algorithm: ∑_{i=1}^k (f(x^(i)) − f*) ≤ (1/(2η)) ∑_{i=1}^k (‖x^(i−1) − x*‖² − ‖x^(i) − x*‖²) = (1/(2η))(‖x^(0) − x*‖² − ‖x^(k) − x*‖²) ≤ (1/(2η))‖x^(0) − x*‖², and therefore f(x^(k)) − f* ≤ (1/k) ∑_{i=1}^k (f(x^(i)) − f*) ≤ (1/k)(1/(2η))‖x^(0) − x*‖². Theorem. Under the above assumptions, gradient descent converges at a rate of O(1/k); i.e., it has error ε in O(1/ε) iterations.

Convergence of Gradient Descent Discussion: This analysis is for the unconstrained setting. The result is the same for the constrained setting. Key to proof: using convexity and upper bound on curvature. Result: dimension independent!

What About Different Assumptions? Given a function f(x) and error target ε such that f(x̂) − f* ≤ ε, and using gradient descent: under the assumption that f(x) is smooth (upper bound on curvature) and convex, O(1/ε) iterations are needed. What if f(x) is not smooth? For example, f(x) = ‖Ax − b‖_2 + ‖x‖_1.

Subgradients and the Subdifferential. f(·) convex but not differentiable. We still have the basic inequality of convexity: f(y) ≥ f(x) + g_xᵀ(y − x). Now there are (possibly) many such under-estimators. Define the subdifferential: ∂f(x) = {g : f(y) ≥ f(x) + gᵀ(y − x), ∀y}.

Subgradient Algorithm. Starting at x, given any g_x ∈ ∂f(x) and step size η, x⁺ = x − ηg_x. Convergence guarantees: we still have the convexity inequality. Do we have something like f(x − ηg_x) ≤ f(x) − (η/2)‖g_x‖_2² ???

Unfortunately Not: Example. The subgradient method is not a descent method. f(x_1, x_2) = |x_1| + 10|x_2|. The current position is (x_1, x_2) = (10, 0). The two extreme subgradients at (10, 0) are (1, 10) and (1, −10); the subdifferential is then ∂f(10, 0) = {(1, 10v) for −1 ≤ v ≤ 1}.

Figure: the subdifferential ∂f(10, 0) in the (x_1, x_2) plane, with its two extreme subgradients and the new point after a subgradient step. Consider g_x = (1, 1) ∈ ∂f(10, 0). Then x⁺ = x − t·g_x. The resulting point increases the function value.

Convergence of the Subgradient Method. Assume f is convex, and ‖g‖ ≤ G for any g ∈ ∂f(x) and any x. For x^(i+1) = x^(i) − ηg_{x^(i)} we have: ‖x^(k+1) − x*‖² = ‖x^(k) − ηg_{x^(k)} − x*‖² = ‖x^(k) − x*‖² − 2η g_{x^(k)}ᵀ(x^(k) − x*) + η²‖g_{x^(k)}‖² ≤ ‖x^(k) − x*‖² − 2η(f(x̂) − f*) + η²G² ≤ ‖x^(0) − x*‖² − 2kη(f(x̂) − f*) + η²G²k, where x̂ denotes the best iterate so far.

Convergence of the Subgradient Method. Rearranging gives f(x̂) − f* ≤ (‖x^(0) − x*‖² + G²η²k) / (2kη). Minimizing over η, we find that the best step size scales as η = 1/√k, and we get f(x̂) − f* = O(1/√k). Theorem. Under the above assumptions, the subgradient method converges at a rate of O(1/√k); i.e., it has error ε in O(1/ε²) iterations.
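A minimal sketch of the subgradient method (not from the slides), run on the non-smooth example above, f(x_1, x_2) = |x_1| + 10|x_2|, with a diminishing 1/√k step size (a standard variant of the fixed-horizon 1/√k choice); it tracks the best iterate seen, since individual steps need not decrease f.

```python
import numpy as np

def f(x):
    return abs(x[0]) + 10 * abs(x[1])

def subgrad(x):
    # np.sign(0) = 0, which is a valid subgradient of |.| at 0.
    return np.array([np.sign(x[0]), 10 * np.sign(x[1])])

x = np.array([10.0, 0.0])
best = f(x)
K = 10_000
for k in range(1, K + 1):
    x = x - (1.0 / np.sqrt(k)) * subgrad(x)   # subgradient step with diminishing step size
    best = min(best, f(x))                    # keep the best iterate: not a descent method
print("best value after", K, "steps:", best)  # approaches the minimum f(0, 0) = 0
```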

What About Different Assumptions? Given a function f(x) and error target ε such that f(x̂) − f* ≤ ε, and using (sub)gradient descent: under the assumption that f(x) is convex, O(1/ε²) iterations are needed. Under the assumption that f(x) is smooth (upper bound on curvature) and convex, O(1/ε) iterations are needed. Under the assumption that f(x) is smooth and strongly convex (also a lower bound on curvature), O(ln(1/ε)) iterations are needed. Is this the best we can do?

First Order Convergence Guarantees. For the (sub)gradient algorithm, the above rates are the best possible, i.e., the analysis cannot be improved. There are, however, other first-order algorithms.

Proximal Algorithm. Suppose we want to minimize f(x) = g(x) + h(x), where g(x) is smooth and h(x) is simple. Example: ℓ_1-regularized regression, min ‖Ax − b‖_2² + λ‖x‖_1. Here ‖Ax − b‖_2² is smooth, and ‖x‖_1 is simple. Can we do better?

Proximal Algorithm. Briefly, the answer is yes, if we can easily evaluate the prox function: Prox_{ηh}(y) = argmin_x h(x) + (1/(2η))‖x − y‖². Proximal algorithm: x⁺ = Prox_{ηh}(x − η∇g(x)). Convergence rate: O(1/k).
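A minimal sketch of the proximal gradient iteration for the ℓ_1-regularized regression example (not from the slides): the prox of ηλ‖·‖_1 is coordinatewise soft-thresholding; the random problem data, λ, and the iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A, b, lam = rng.normal(size=(100, 50)), rng.normal(size=100), 1.0

def soft_threshold(y, t):
    """Prox of t * ||.||_1: shrink each coordinate toward zero by t."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

eta = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)   # step size 1/L for g(x) = ||Ax - b||^2
x = np.zeros(50)
for _ in range(500):
    grad_g = 2 * A.T @ (A @ x - b)                    # gradient step on the smooth part
    x = soft_threshold(x - eta * grad_g, eta * lam)   # prox step on the simple part

print("objective:", np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x)),
      "nonzeros:", np.count_nonzero(x))
```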

Accelerated Algorithms (Lower Bounds). Accelerated algorithms: x^(k+1) = x^(k) − α∇f(x^(k)) + β(x^(k) − x^(k−1)). This still fits in our oracle model. Convergence: if f is smooth, then for error ε we need O(1/√ε) iterations, hence O(1/k²) convergence. Proximal analogs exist for the case f(x) = g(x) + h(x).
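A minimal sketch of the momentum update above on a smooth least-squares objective (not from the slides); the constant choices of α and β are illustrative, whereas Nesterov's accelerated method uses a particular schedule for the momentum weight to attain the O(1/k²) rate.

```python
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(100, 50)), rng.normal(size=100)

def grad(x):                       # gradient of f(x) = ||Ax - b||^2
    return 2 * A.T @ (A @ x - b)

L = 2 * np.linalg.norm(A, 2) ** 2
alpha, beta = 1.0 / L, 0.9         # step size and momentum weight (illustrative constants)

x_prev = x = np.zeros(50)
for _ in range(300):
    x_next = x - alpha * grad(x) + beta * (x - x_prev)   # gradient step plus momentum
    x_prev, x = x, x_next

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print("distance to the least-squares solution:", np.linalg.norm(x - x_star))
```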

Outline From Here. Modeling: how do we model some of the problems mentioned above? Algorithms: how do we solve them? Theory: what can we prove?

Duality Theory Figure: A convex set can be represented in two ways: a convex hull of extreme points, or an intersection of half-spaces that contain it.

Duality Theory Figure: Consider optimizing in a given direction over a convex set.

Duality Theory Figure: The optimal point is the point with the best value.

Duality Theory Figure:...but there is also a half-space that says you cannot go further. This is called a certificate of optimality. Non-convex optimization problems do not always have such easy certificates of optimality.

Duality Theory. We can search for the best point (search over points), or we can search for the certificate of optimality (search over half-spaces), or both.

Provably Close to Optimality. Figure: If we solve a relaxation, we can sometimes characterize, via duality theory, how close the optimal point on the non-convex set will be to the optimal point on the convex relaxation. Sparse regression and compressed sensing. Low-rank matrix completion. Many other examples.

Summary and Directions. Modeling with convex optimization: intuition and creativity are absolutely essential. Successful problem modeling comes from understanding the problem, and hence what is important (e.g., boundaries/smoothness of natural images, or the approximate low rank of rankings and preferences), and also from the theory of convex optimization: which problems can be solved quickly/efficiently, and what can we say about convex approximations to non-convex problems? Algorithms: today we discussed first-order methods. These are suited for very large-scale problems, as we often see in large-scale ML/data mining. Not all methods are applicable or best for all problems. Understanding the demands of your application, and the performance of convex optimization algorithms in different settings, is very important, and can make or break large-scale applications.

Summary and Directions. Theory and duality: we discussed this least today, but it is important not only for analysis but also for algorithmic development. We discussed only algorithms that search for optimal points; dual algorithms search over half-spaces. Different problems may yield better solutions or approximations in one domain or the other.

Some Useful References. Convex Optimization, by Stephen Boyd and Lieven Vandenberghe (see also the slides from their courses). Convex Optimization Algorithms, by Dimitri Bertsekas. Optimization Models and Applications, by Giuseppe Calafiore and Laurent El Ghaoui. Introductory Lectures on Convex Optimization, by Yurii Nesterov. Lectures on Modern Convex Optimization, by Aharon Ben-Tal and Arkadi Nemirovski.

The End Thanks, and feel free to contact me with questions: constantine@utexas.edu