Machine Learning for Signal Processing Lecture 4: Optimization

13 Sep 2015
Instructor: Bhiksha Raj (slides largely by Najim Dehak, JHU)
Course: 11-755/18-797

Index
1. The problem of optimization
2. Direct optimization
3. Descent methods: Newton's method, gradient methods
4. Online optimization
5. Constrained optimization: Lagrange's method, projected gradients
6. Regularization
7. Convex optimization and Lagrangian duals

A problem we recently saw
(Figure: the matrices M, N and S.)
The projection matrix P is the matrix that minimizes the total error between the projected matrix S and the original matrix M.

The projection problem
$S = PM$
For individual vectors in the spectrogram: $S_i = P M_i$
Total projection error is $E = \sum_i \|M_i - P M_i\|^2$
The projection matrix projects onto the space of notes in N: $P = NC$
The problem of finding P: minimize $E = \sum_i \|M_i - P M_i\|^2$ such that $P = NC$.
This is a problem of constrained optimization.

Optimization
Optimization is finding the best value of a function (which can be the best minimum): $\min_x f(x)$.
The number of turning points of a function depends on the order of the function.
Not all turning points are minima. They can be minima, maxima or inflection points.
(Figure: a curve f(x) against x with its global maximum, inflection point, local minimum and global minimum marked.)

Examples of Optimization: Multivariate functions
Find the optimal point in these functions.

Simple Approach: Turning Point
(Figure: f(x) against x.)
The minimum of the function is always a turning point.
The function increases on either side.
How to identify these turning points?

The derivative of a curve
The derivative $\alpha$ of a curve is a multiplicative factor explaining how much y changes in response to a very small change in x: $\Delta y = \alpha \Delta x$.
For scalar functions of scalar variables $y = f(x)$, it is often expressed as $\frac{dy}{dx}$ or as $f'(x)$, so that $\Delta y = \frac{dy}{dx} \Delta x$.
We have all learned how to compute derivatives in basic calculus.

The derivative of a curve
(Figure: a curve annotated with the sign of its derivative: positive where $\frac{dy}{dx} > 0$, negative where $\frac{dy}{dx} < 0$, and zero where $\frac{dy}{dx} = 0$.)
In upward-rising regions of the curve, the derivative is positive: a small increase in x causes y to increase.
In downward-falling regions, the derivative is negative.
At turning points, the derivative is 0.
Assumption: the function is differentiable at the turning point.

Finding the minimum of a function
Find the value x at which $f'(x) = 0$, i.e. solve $\frac{df(x)}{dx} = 0$.
The solution is a turning point.
But is it a minimum?

Turning Points
(Figure: the sign of the derivative along a curve, passing through 0 at each turning point.)
Both maxima and minima have zero derivative.

Derivative of the derivative of the curve
(Figure: f(x), its derivative $f'(x)$ and its second derivative $f''(x)$ plotted against x.)
Both maxima and minima are turning points.
Both maxima and minima have zero derivative.
The second derivative $f''(x)$ is negative at maxima and positive at minima.
At maxima the derivative goes from positive to negative, so the derivative decreases as x increases.
At minima the derivative goes from negative to positive, so it increases as x increases.

Soln: Finding the minimum or maximum of a function
Find the value x at which $f'(x) = 0$: solve $\frac{df(x)}{dx} = 0$.
The solution $x_{soln}$ is a turning point.
Check the double derivative at $x_{soln}$: compute $f''(x_{soln}) = \frac{df'(x_{soln})}{dx}$.
If $f''(x_{soln})$ is positive, $x_{soln}$ is a minimum; if it is negative, $x_{soln}$ is a maximum. If it is zero, the test is inconclusive.
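The recipe above can be sketched in a few lines of Python. The function $f(x) = x^3 - 3x$ is an illustrative assumption, not an example from the lecture; its derivatives are written out by hand.

```python
# Illustrative function (an assumption, not from the lecture): f(x) = x^3 - 3x.
def f_prime(x):
    return 3 * x**2 - 3      # f'(x)

def f_double_prime(x):
    return 6 * x             # f''(x)

def classify(x_soln):
    """Classify a turning point using the sign of the second derivative."""
    second = f_double_prime(x_soln)
    if second > 0:
        return "minimum"
    if second < 0:
        return "maximum"
    return "inconclusive"

# Solving f'(x) = 0 by hand gives the turning points x = -1 and x = +1.
turning_points = [-1.0, 1.0]
labels = {x: classify(x) for x in turning_points}
```

Here `labels` maps each turning point to the outcome of the second-derivative check.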

What about functions of multiple variables?
The optimum point is still a turning point.
At a minimum, shifting in any direction will increase the value.
For smooth functions, minuscule shifts will not result in any change at all.
We must find a point where shifting in any direction by a microscopic amount will not change the value of the function.

A brief note on derivatives of multivariate functions

The Gradient of a scalar function
The gradient $\nabla f(X)$ of a scalar function $f(X)$ of a multivariate input X is a multiplicative factor that gives us the change in $f(X)$ for tiny variations in X:
$\Delta f(X) = \nabla f(X)^T \Delta X$

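Gradients of scalar functions with multivariate inputs
Consider $f(X) = f(x_1, x_2, \ldots, x_n)$.
$\nabla f(X) = \left[ \frac{\partial f(X)}{\partial x_1}, \frac{\partial f(X)}{\partial x_2}, \ldots, \frac{\partial f(X)}{\partial x_n} \right]^T$
Check: $\Delta f(X) = \nabla f(X)^T \Delta X = \frac{\partial f(X)}{\partial x_1} \Delta x_1 + \frac{\partial f(X)}{\partial x_2} \Delta x_2 + \cdots + \frac{\partial f(X)}{\partial x_n} \Delta x_n$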
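The check $\Delta f(X) \approx \nabla f(X)^T \Delta X$ can be tried numerically. The sketch below assumes an illustrative function $f(x_1, x_2) = x_1^2 + 3 x_1 x_2$ (not from the lecture) with a hand-derived gradient.

```python
def f(x):
    # Illustrative function (an assumption): f(x1, x2) = x1^2 + 3*x1*x2
    return x[0] ** 2 + 3 * x[0] * x[1]

def grad_f(x):
    # Hand-derived gradient: [df/dx1, df/dx2]
    return [2 * x[0] + 3 * x[1], 3 * x[0]]

x = [1.0, 2.0]
dx = [1e-6, -2e-6]                      # a tiny variation in X

# Actual change in f versus the change predicted by the inner product
actual = f([x[0] + dx[0], x[1] + dx[1]]) - f(x)
predicted = sum(g * d for g, d in zip(grad_f(x), dx))
```

The two numbers agree up to terms of second order in the size of `dx`.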

A well-known vector property
$u \cdot v = \|u\| \|v\| \cos\theta$
The inner product between two vectors of fixed lengths is maximum when the two vectors are aligned.

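Properties of Gradient
$\Delta f(X) = \nabla f(X)^T \Delta X$ is the inner product between $\nabla f(X)$ and $\Delta X$.
Fixing the length of $\Delta X$, e.g. $\|\Delta X\| = 1$, $\Delta f(X)$ is maximum if $\Delta X = c \nabla f(X)$ for some $c > 0$, i.e. when the angle between $\nabla f(X)$ and $\Delta X$ is 0.
The function $f(X)$ increases most rapidly if the input increment $\Delta X$ is perfectly aligned to $\nabla f(X)$.
The gradient is the direction of fastest increase in $f(X)$.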

Gradient
(Figure: the gradient vector $\nabla f(X)$ and its negative $-\nabla f(X)$ drawn on a surface.)
Moving in the direction of $\nabla f(X)$ increases $f(X)$ fastest.
Moving in the direction of $-\nabla f(X)$ decreases $f(X)$ fastest.

Gradient
(Figure: two marked points on the surface where the gradient is 0.)

Properties of Gradient: 2
The gradient vector $\nabla f(X)$ is perpendicular to the level curve.

Derivatives of vector function of vector input
$Y = f(X)$, with $X = [x_1, x_2, \ldots, x_n]^T$ and $Y = [y_1, y_2, \ldots, y_m]^T$.
As in the scalar case, the gradient $\nabla f(X)$ is a multiplicative factor that gives us the change in $f(X)$ for tiny variations in X:
$\Delta f(X) = \nabla f(X)^T \Delta X$

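Gradient of vector function of vector input
For $X = [x_1, x_2, \ldots, x_n]^T$ and $f(X) = [y_1, y_2, \ldots, y_m]^T$, the gradient collects all partial derivatives $\frac{\partial y_j}{\partial x_i}$. Written so that $\Delta f(X) = \nabla f(X)^T \Delta X$ holds, it is the $n \times m$ matrix
$$\nabla f(X) = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_n} & \frac{\partial y_2}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}$$
Properties and interpretations are similar to the case of scalar functions of vector inputs.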

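The Hessian
The Hessian of a function $f(x_1, x_2, \ldots, x_n)$ is given by the second derivative:
$$\nabla^2 f(x_1, \ldots, x_n) := \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$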

Returning to direct optimization

Finding the minimum of a scalar function of a multivariate input
The optimum point is a turning point: the gradient will be 0.

Unconstrained Minimization of function (Multivariate)
1. Solve for the X where the gradient equals zero: $\nabla f(X) = 0$.
2. Compute the Hessian matrix $\nabla^2 f(X)$ at the candidate solution and verify that:
   the Hessian is positive definite (eigenvalues positive), to identify local minima;
   the Hessian is negative definite (eigenvalues negative), to identify local maxima.

Unconstrained Minimization of function (Example)
Minimize $f(x_1, x_2, x_3) = (x_1)^2 + x_1 (1 - x_2) + (x_2)^2 - x_2 x_3 + (x_3)^2 + x_3$
Gradient:
$$\nabla f = \begin{bmatrix} 2x_1 + 1 - x_2 \\ -x_1 + 2x_2 - x_3 \\ -x_2 + 2x_3 + 1 \end{bmatrix}$$

Unconstrained Minimization of function (Example)
Set the gradient to null:
$$\nabla f = 0 \Rightarrow \begin{bmatrix} 2x_1 + 1 - x_2 \\ -x_1 + 2x_2 - x_3 \\ -x_2 + 2x_3 + 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$$
Solving the system of 3 equations with 3 unknowns:
$$x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} -1 \\ -1 \\ -1 \end{bmatrix}$$

Unconstrained Minimization of function (Example)
Compute the Hessian matrix:
$$\nabla^2 f = \begin{bmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{bmatrix}$$
Evaluate the eigenvalues of the Hessian matrix: $\lambda_1 = 3.414$, $\lambda_2 = 0.586$, $\lambda_3 = 2$.
All the eigenvalues are positive, so the Hessian matrix is positive definite.
The point $x = [x_1, x_2, x_3]^T = [-1, -1, -1]^T$ is a minimum.
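The worked example can be checked with a short script. This is only a sketch: it confirms that the gradient vanishes at $(-1, -1, -1)$, tests positive definiteness through the leading principal minors (Sylvester's criterion, a substitute for the eigenvalue computation on the slide), and verifies the quoted eigenvalues $2$ and $2 \pm \sqrt{2}$ by plugging them into the characteristic polynomial.

```python
import math

def gradient(x1, x2, x3):
    # Gradient of the example function, as given on the slide
    return [2 * x1 + 1 - x2, -x1 + 2 * x2 - x3, -x2 + 2 * x3 + 1]

H = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]   # Hessian of the example

def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def char_poly(lam):
    """det(H - lam * I); zero exactly at the eigenvalues of H."""
    shifted = [[H[r][c] - (lam if r == c else 0) for c in range(3)] for r in range(3)]
    return det3(shifted)

# Sylvester's criterion: all leading principal minors positive => positive definite
leading_minors = [H[0][0], H[0][0] * H[1][1] - H[0][1] * H[1][0], det3(H)]
eigenvalues = [2 + math.sqrt(2), 2 - math.sqrt(2), 2.0]   # approx. 3.414, 0.586, 2
```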

Closed Form Solutions are not always available
Often it is not possible to simply solve $\nabla f(X) = 0$.
The function to minimize/maximize may have an intractable form.
In these situations, iterative solutions are used: begin with a guess for the optimal X and refine it iteratively until the correct value is obtained.

Iterative solutions
(Figure: successive estimates $x_0, x_1, x_2, \ldots$ on a curve f(x), and $X_0, X_1, X_2, \ldots$ on a contour plot.)
Start from an initial guess $X_0$ for the optimal X.
Update the guess towards a (hopefully) better value of f(X).
Stop when f(X) no longer decreases.
Problems: which direction to step in, and how big the steps must be.

Descent methods
Iterative solutions that attempt to descend the function in steps to arrive at the minimum.
Based on the first order derivatives (gradient) and in some cases the second order derivatives (Hessian).
Newton's method is based on both first and second derivatives.
Gradient descent is based only on the first derivative.

Newton's Method
Newton's method is based on an iterative step that drives the derivative $f'(x)$ to zero:
$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$
where k is the current iteration.
The iterations continue until we achieve the stopping criterion $|x_{k+1} - x_k| < \varepsilon$.

Newton's method to find the zero of a function
Initialize the estimate.
Approximate the function by the tangent at the initial value.
Update the estimate to the location where the tangent becomes 0.
Iterate.

Newton's Method to optimize a function
(Figure: f(x) and its derivative $f'(x)$ plotted against x.)
Apply Newton's method to the derivative of the function! The derivative goes to 0 at the optimum.
Algorithm:
Initialize $x_0$.
k-th iteration: approximate $f'(x)$ by the tangent at $x_k$. Find the location $x_{intersect}$ where the tangent goes to 0. Set $x_{k+1} = x_{intersect}$.
Iterate.

Newton's method for univariate functions
Apply Newton's algorithm to find the zero of the derivative $f'(x)$:
$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$
where k is the current iteration.
The iterations continue until we achieve the stopping criterion $|x_{k+1} - x_k| < \varepsilon$.
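A minimal sketch of the univariate update and stopping criterion follows. The test function $f(x) = x^4/4 - x$, with $f'(x) = x^3 - 1$ and $f''(x) = 3x^2$, and the helper name `newton_minimize` are assumptions made for illustration.

```python
def newton_minimize(f_prime, f_double_prime, x0, eps=1e-10, max_iter=100):
    """Iterate x_{k+1} = x_k - f'(x_k) / f''(x_k) until |x_{k+1} - x_k| < eps."""
    x = x0
    for _ in range(max_iter):
        x_next = x - f_prime(x) / f_double_prime(x)
        if abs(x_next - x) < eps:
            return x_next
        x = x_next
    return x

# Illustrative function (an assumption): f(x) = x^4 / 4 - x, minimized at x = 1.
x_star = newton_minimize(lambda x: x**3 - 1, lambda x: 3 * x**2, x0=2.0)
```

Starting at a point where $f''(x) = 0$ (here $x_0 = 0$) would divide by zero, which is one concrete form of the sensitivity to the initial point.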

Newton's method for multivariate functions
1. Select an initial starting point $X_0$.
2. Evaluate the gradient $\nabla f(X_k)$ and Hessian $\nabla^2 f(X_k)$ at $X_k$.
3. Calculate the new $X_{k+1}$ using $X_{k+1} = X_k - [\nabla^2 f(X_k)]^{-1} \nabla f(X_k)$.
4. Repeat Steps 2 and 3 until convergence.
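As a sketch of one multivariate Newton step, assume the quadratic $f(x_1, x_2) = (x_1)^2 + x_1 x_2 + 4 (x_2)^2$ (an illustrative choice). Its Hessian is constant, so the $2 \times 2$ inverse can be written in closed form and a single step lands on the minimum.

```python
def grad(x):
    # Gradient of the assumed quadratic f(x1, x2) = x1^2 + x1*x2 + 4*x2^2
    return (2 * x[0] + x[1], x[0] + 8 * x[1])

HESSIAN = ((2.0, 1.0), (1.0, 8.0))      # constant Hessian of the quadratic

def newton_step(x):
    """X_{k+1} = X_k - H^{-1} grad f(X_k), using the closed-form 2x2 inverse."""
    (a, b), (c, d) = HESSIAN
    det = a * d - b * c
    g = grad(x)
    step = ((d * g[0] - b * g[1]) / det, (-c * g[0] + a * g[1]) / det)
    return (x[0] - step[0], x[1] - step[1])

x1 = newton_step((3.0, 3.0))            # one step from the starting point (3, 3)
```

For non-quadratic functions the Hessian changes from point to point and several steps are needed.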

Newton's Method
Newton's approach is based on the computation of both gradient and Hessian.
Fast to converge (few iterations).
Slow to compute.
(Figure: starting from $x_0$, Newton's method arrives at the optimum in a single step.)
This method is very sensitive to the initial point: if the initial point is very far from the optimal point, the optimization process may not converge.

The Approach of Gradient Descent
Iterative solution:
Start at some point.
Find the direction in which to shift this point to decrease error. This can be found from the derivative of the function:
  a positive derivative means moving left decreases error;
  a negative derivative means moving right decreases error.
Shift the point in this direction.

The Approach of Gradient Descent
Iterative solution: trivial algorithm
Initialize $x_0$
While $f'(x_k) \neq 0$:
  If $\mathrm{sign}(f'(x_k))$ is positive: $x_{k+1} = x_k - \text{step}$
  Else: $x_{k+1} = x_k + \text{step}$
But what must step be to ensure we actually get to the optimum?

The Approach of Gradient Descent
Iterative solution: trivial algorithm
Initialize $x_0$
While $f'(x_k) \neq 0$:
  $x_{k+1} = x_k - \mathrm{sign}(f'(x_k)) \cdot \text{step}$
Identical to the previous algorithm.

The Approach of Gradient Descent
Iterative solution: trivial algorithm
Initialize $x_0$
While $f'(x_k) \neq 0$:
  $x_{k+1} = x_k - \eta_k f'(x_k)$
$\eta_k$ is the step size.
What must the step size be?

Gradient descent/ascent (multivariate)
The gradient descent/ascent method finds the minimum or maximum of a function f iteratively.
To find a maximum, move in the direction of the gradient: $x_{k+1} = x_k + \eta_k \nabla f(x_k)$
To find a minimum, move exactly opposite the direction of the gradient: $x_{k+1} = x_k - \eta_k \nabla f(x_k)$
What is the step size $\eta_k$?

1. Fixed step size
Use a fixed value for $\eta_k$.

Influence of step size: example (constant step size)
$f(x_1, x_2) = (x_1)^2 + x_1 x_2 + 4 (x_2)^2$, $x^{initial} = [3, 3]^T$
(Figure: trajectories from $x_0$ for step sizes 0.1 and 0.2.)
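The effect of a constant step size can be reproduced with a short sketch. It assumes the example function is $f(x_1, x_2) = (x_1)^2 + x_1 x_2 + 4 (x_2)^2$ started at $(3, 3)$; the step sizes 0.1 and 0.2 match the slide, while 0.3 is an extra value added here to show divergence.

```python
def grad(x):
    # Gradient of f(x1, x2) = x1^2 + x1*x2 + 4*x2^2 (assumed form of the example)
    return (2 * x[0] + x[1], x[0] + 8 * x[1])

def gradient_descent(x0, eta, num_iters):
    """Fixed-step gradient descent: x_{k+1} = x_k - eta * grad f(x_k)."""
    x = x0
    for _ in range(num_iters):
        g = grad(x)
        x = (x[0] - eta * g[0], x[1] - eta * g[1])
    return x

def norm(x):
    return (x[0] ** 2 + x[1] ** 2) ** 0.5

x_small = gradient_descent((3.0, 3.0), eta=0.1, num_iters=200)
x_large = gradient_descent((3.0, 3.0), eta=0.2, num_iters=200)
# eta = 0.3 is not from the slide; it exceeds 2 / (largest Hessian eigenvalue) and diverges
x_too_large = gradient_descent((3.0, 3.0), eta=0.3, num_iters=50)
```

The Hessian eigenvalues of this quadratic are $5 \pm \sqrt{10}$, so a fixed step converges only when $\eta < 2 / (5 + \sqrt{10}) \approx 0.245$.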

2. Backtracking line search for step size
Two parameters: $\alpha$ (typically 0.5) and $\beta$ (typically 0.8).
At each iteration, estimate the step size as follows:
Set $\eta_k = 1$.
Update $\eta_k = \beta \eta_k$ until $f(x_k - \eta_k \nabla f(x_k)) \le f(x_k) - \alpha \eta_k \|\nabla f(x_k)\|^2$.
Update $x_{k+1} = x_k - \eta_k \nabla f(x_k)$.
Intuitively: at each iteration, take a unit step size and keep shrinking it until we arrive at a place where the function $f(x_k - \eta_k \nabla f(x_k))$ actually decreases sufficiently w.r.t. $f(x_k)$.
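The backtracking rule can be sketched as follows, using $\alpha = 0.5$ and $\beta = 0.8$ as quoted above. The test function $f(x_1, x_2) = (x_1)^2 + x_1 x_2 + 4 (x_2)^2$, the starting point and the small floor on $\eta$ are assumptions for illustration.

```python
def f(x):
    # Assumed test function: f(x1, x2) = x1^2 + x1*x2 + 4*x2^2
    return x[0] ** 2 + x[0] * x[1] + 4 * x[1] ** 2

def grad(x):
    return (2 * x[0] + x[1], x[0] + 8 * x[1])

def backtracking_step(x, alpha=0.5, beta=0.8):
    """Shrink eta from 1 until f decreases by at least alpha * eta * ||grad||^2."""
    g = grad(x)
    g_sq = g[0] ** 2 + g[1] ** 2
    eta = 1.0
    while True:
        candidate = (x[0] - eta * g[0], x[1] - eta * g[1])
        # The floor on eta is a safeguard added here against endless shrinking.
        if f(candidate) <= f(x) - alpha * eta * g_sq or eta < 1e-12:
            return candidate, eta
        eta *= beta

x = (3.0, 3.0)
etas = []
for _ in range(1000):
    g = grad(x)
    if (g[0] ** 2 + g[1] ** 2) ** 0.5 < 1e-8:   # stop once the gradient is tiny
        break
    x, eta = backtracking_step(x)
    etas.append(eta)
```

The list `etas` records the step size accepted at each iteration, which varies with the local shape of the function.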

2. Backtracking line search for step size
(Figures: the step size is shrunk until a good one is found; the estimate is then updated from $X_k$ to $X_{k+1}$ at the converged step size; the last figure shows the actual evolution of $x_k$.)
Keep shrinking the step size till we find a good one.
Update the estimate to the position at the converged step size.

3. Full line search for step size
At each iteration scan for the $\eta_k$ that minimizes $f(x_k - \eta_k \nabla f(x_k))$.
This can be computed by solving $\frac{d f(x_k - \eta_k \nabla f(x_k))}{d \eta_k} = 0$.
Update $x_{k+1} = x_k - \eta_k \nabla f(x_k)$.

Gradient descent convergence criteria
The gradient descent algorithm converges when one of the following criteria is satisfied:
$|f(x_{k+1}) - f(x_k)| < \varepsilon_1$
or
$\|\nabla f(x_k)\| < \varepsilon_2$

Gradient descent vs. Newton's
Gradient descent is typically much slower to converge than Newton's, but much faster to compute.
(Figure: trajectories of Newton's method and gradient descent from $x_0$.)
Newton's method is exponentially faster for convex problems, although derivatives and Hessians may be hard to derive.
It may not converge for non-convex problems.

Online Optimization
Often our objective function is an error.
The error is the cumulative error from many signals, e.g. $E(W) = \sum_x \|y - f(x, W)\|^2$.
Optimization will find the W that minimizes the total error across all x.
What if we wanted to update our parameters after each input x instead of waiting for all of them to arrive?

A problem we saw
(Figure: the music M, the score S, and the unknown notes N.)
Given the music M and the score S of only four of the notes, but not the notes themselves, find the notes.
$M = NS$, so $N = M \, \mathrm{Pinv}(S)$

The Actual Problem
(Figure: the music M, the score S, and the unknown notes N.)
Given the music M and the score S, find a matrix N such that the error of reconstruction $E = \sum_i \|M_i - N S_i\|^2$ is minimized.
This is a standard optimization problem.
The solution gives us $N = M \, \mathrm{Pinv}(S)$.
This requires seeing all of M and S to estimate N.

Online Updates
(Figure: M, S and the unknown N, with successive estimates $N_0, N_1, N_2, N_3, N_4$.)
What if we want to update our estimate of the notes after every input, i.e. after observing each vector of music and its score?
This situation arises in many similar problems.

Incremental Updates
Easy solution: to obtain the k-th estimate $N_k$, minimize the error on the k-th input.
The error on the k-th input is $E_k = \|M_k - N S_k\|^2$.
Differentiating it gives us $\nabla_N E_k = -2 (M_k - N S_k) S_k^T$.
Update the parameter to move against this gradient: $N_{k+1} = N_k + \eta (M_k - N_k S_k) S_k^T$.
$\eta$ must typically be very small to prevent the updates from being influenced entirely by the latest observation.
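A small sketch of the incremental update $N \leftarrow N + \eta (M_k - N S_k) S_k^T$ on synthetic, noiseless data is given below. The $2 \times 2$ matrix `N_true`, the random scores, the step size 0.1 and the helper name `online_update` are all hypothetical choices made for illustration.

```python
import random

def online_update(N, M_k, S_k, eta):
    """One incremental step: N <- N + eta * (M_k - N S_k) S_k^T."""
    rows, cols = len(N), len(N[0])
    residual = [M_k[i] - sum(N[i][j] * S_k[j] for j in range(cols)) for i in range(rows)]
    return [[N[i][j] + eta * residual[i] * S_k[j] for j in range(cols)] for i in range(rows)]

rng = random.Random(0)                       # fixed seed for reproducibility
N_true = [[1.0, -2.0], [0.5, 3.0]]           # hypothetical "notes" to recover
N_est = [[0.0, 0.0], [0.0, 0.0]]             # initial estimate N_0

for _ in range(2000):
    S_k = [rng.uniform(-1, 1), rng.uniform(-1, 1)]                        # one score vector
    M_k = [sum(N_true[i][j] * S_k[j] for j in range(2)) for i in range(2)]  # its music vector
    N_est = online_update(N_est, M_k, S_k, eta=0.1)

max_error = max(abs(N_est[i][j] - N_true[i][j]) for i in range(2) for j in range(2))
```

With noisy observations the estimate would keep fluctuating around the solution, which is one reason a small $\eta$ is preferred.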

Online update: Non-quadratic functions
The earlier problem has a linear predictor as the underlying model: $M_k = N S_k$.
The squared error $E_k = \|M_k - N S_k\|^2$ is quadratic and differentiable.
We often have non-linear predictors: $Y_k = g(W X_k)$, with $E_k = \|Y_k - g(W X_k)\|^2$.
The derivative of the error w.r.t. W is often ugly or intractable.
For such problems we will use the following generalization of the online update rule for linear predictors:
$W_{k+1} = W_k + \eta (Y_k - g(W_k X_k)) X_k^T$
This is the Widrow-Hoff rule.

CONSTRAINED optimization
Recall the projection problem: find P such that we minimize $E = \sum_i \|M_i - P M_i\|^2$ AND such that the projection is composed of the notes in N: $P = NC$.
This is a problem of constrained optimization.

Optimization problem with constraints
Finding the minimum of a function $f: \mathbb{R}^N \to \mathbb{R}$ subject to constraints:
$\min_x f(x)$ s.t. $g_i(x) \le 0$, $i \in \{1, \ldots, k\}$, and $h_j(x) = 0$, $j \in \{1, \ldots, l\}$
Constraints define a feasible region, which is nonempty.

Optimization without constraints
No constraints: $\min_{x,y} f(x, y) = x^2 + y^2$
(Figure: surface plot with the best minimum point marked.)

Optimization with constraints
With constraints: $\min_{x,y} f(x, y) = x^2 + y^2$ s.t. $2x + y \le -4$
(Figure: minima with and without constraints, marked "Minimum without constraints" and "Minimum with constraints".)

Solving for constrained optimization: the method of Lagrangians
Consider a function f(x, y) that must be maximized w.r.t. (x, y) subject to $g(x, y) = c$.
Note: we're using a maximization example to go with the figures, which have been obtained from Wikipedia.

The Lagrange Method
The purple surface is f(x, y), which must be maximized.
The red curve is the constraint $g(x, y) = c$. All solutions must lie on this curve.
Problem: find the position of the largest f(x, y) on the red curve!

The Lagrange Method
maximize $f(x, y)$ s.t. $g(x, y) = c$
Dotted lines are constant-value contours $f(x, y) = C$: f(x, y) has the same value C at all points on a contour.
The constrained optimum will be at the point where the highest constant-value contour touches the red curve. It will be tangential to the red curve.

The Lagrange Method
maximize $f(x, y)$ s.t. $g(x, y) = c$
The constrained optimum is where the highest constant-value contour is tangential to the red curve.
The gradient of $f(x, y) = C$ will be parallel to the gradient of $g(x, y) = c$.

The Lagrange Method
maximize $f(x, y)$ s.t. $g(x, y) = c$
At the optimum: $\nabla f(x, y) = \lambda \nabla g(x, y)$ and $g(x, y) = c$.
Find the x, y that satisfy both of the above conditions.

The Lagrange Method
$\nabla f(x, y) = \lambda \nabla g(x, y)$ and $g(x, y) = c$
Find the x, y that satisfy both of the above conditions.
Combine the above two into one equation: $L(x, y, \lambda) = f(x, y) - \lambda (g(x, y) - c)$
Optimize it for $x, y, \lambda$:
Solving for (x, y): $\nabla_{x,y} L(x, y, \lambda) = 0 \Rightarrow \nabla f(x, y) = \lambda \nabla g(x, y)$
Solving for $\lambda$: $\frac{\partial L(x, y, \lambda)}{\partial \lambda} = 0 \Rightarrow g(x, y) = c$
Formally:
to maximize f(x, y): $\max_{x,y} \min_\lambda L(x, y, \lambda)$
to minimize f(x, y): $\min_{x,y} \max_\lambda L(x, y, \lambda)$

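Generalizes to inequality constraints
Optimization problem with constraints:
$\min_x f(x)$ s.t. $g_i(x) \le 0$, $i = 1, \ldots, k$, and $h_j(x) = 0$, $j = 1, \ldots, l$
Lagrange multipliers: $\lambda_i \ge 0$, $\nu_j \in \mathbb{R}$
$L(x, \lambda, \nu) = f(x) + \sum_{i=1}^{k} \lambda_i g_i(x) + \sum_{j=1}^{l} \nu_j h_j(x)$
The necessary condition: $\nabla L(x, \lambda, \nu) = 0 \Leftrightarrow \frac{\partial L}{\partial x} = 0, \ \frac{\partial L}{\partial \lambda} = 0, \ \frac{\partial L}{\partial \nu} = 0$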

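Lagrange multiplier example
$\min_{x,y} f(x, y) = x^2 + y^2$ s.t. $2x + y \le -4$
Lagrange multiplier: $L = x^2 + y^2 + \lambda (2x + y + 4)$
Evaluate $\nabla L(x, y, \lambda) = 0 \Leftrightarrow \frac{\partial L}{\partial x} = 0, \ \frac{\partial L}{\partial y} = 0, \ \frac{\partial L}{\partial \lambda} = 0$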

Optimization with constraints: Lagrange multiplier results
$\min_{x,y} f(x, y) = x^2 + y^2$ s.t. $2x + y \le -4$
(Figure: the minimum with constraints is at $x = -8/5$, $y = -4/5$, where $f = 16/5$.)
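The result can be reproduced with exact arithmetic. The sketch below assumes the constraint is active at the optimum, so it is treated as the equality $2x + y = -4$, and uses the Lagrangian $L = x^2 + y^2 + \lambda (2x + y + 4)$; the helper name `min_norm_on_line` is hypothetical.

```python
from fractions import Fraction

def min_norm_on_line(a, b):
    """Minimize x^2 + y^2 subject to a[0]*x + a[1]*y = b.

    Stationarity of L = x^2 + y^2 + lam * (a[0]*x + a[1]*y - b) gives
    x = -lam * a[0] / 2 and y = -lam * a[1] / 2; substituting into the
    constraint gives lam = -2 * b / (a[0]^2 + a[1]^2).
    """
    a_sq = a[0] ** 2 + a[1] ** 2
    lam = Fraction(-2 * b, a_sq)
    x, y = -lam * a[0] / 2, -lam * a[1] / 2
    return x, y, lam

x, y, lam = min_norm_on_line((2, 1), -4)
f_value = x ** 2 + y ** 2
```

The multiplier comes out positive, which is consistent with the inequality constraint being active.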

An Alternate Approach: Projected Gradients
(Figure: the feasible set, a gradient step that leaves it, and the projection of that step back onto the boundary.)
The constraints specify a feasible set: the region of the space where the solution can lie.
From the current estimate, take a step using the conventional gradient descent approach.
If the update is inside the feasible set, no further action is required.
If the update falls outside the feasible set, find the closest point to the update on the boundary of the feasible set, and move the updated estimate to this new point.

The method of projected gradients
$\min_x f(x)$ s.t. $g_i(x) \le 0$, $i = 1, \ldots, k$
The constraints specify a feasible set: the region of the space where the solution can lie.
Update the current estimate using the conventional gradient descent approach.
If the update is inside the feasible set, no further action is required.
If the update falls outside the feasible set, find the closest point to the update on the boundary of the feasible set, and move the updated estimate to this new point.
The closest point "projects" the update onto the feasible set.
For many problems, however, finding this projection can be difficult or intractable.
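For a single linear inequality the projection has a closed form, which makes the method easy to sketch. The example below applies it to $\min x^2 + y^2$ s.t. $2x + y \le -4$; the step size 0.1, the starting point and the function names are assumptions.

```python
def project_onto_halfspace(z, a, b):
    """Closest point to z in the set {z : a[0]*z[0] + a[1]*z[1] <= b}."""
    violation = a[0] * z[0] + a[1] * z[1] - b
    if violation <= 0:
        return z                                  # already feasible: nothing to do
    scale = violation / (a[0] ** 2 + a[1] ** 2)
    return (z[0] - scale * a[0], z[1] - scale * a[1])

def projected_gradient_descent(z0, eta=0.1, num_iters=200):
    a, b = (2.0, 1.0), -4.0                       # constraint 2x + y <= -4
    z = z0
    for _ in range(num_iters):
        grad = (2 * z[0], 2 * z[1])               # gradient of x^2 + y^2
        step = (z[0] - eta * grad[0], z[1] - eta * grad[1])
        z = project_onto_halfspace(step, a, b)    # pull the update back if it left the set
    return z

x_star, y_star = projected_gradient_descent((-3.0, -3.0))
```

The iterates settle at the same point that the Lagrange multiplier method yields for this problem, $(-8/5, -4/5)$.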

Regularization
Sometimes we have additional "regularization" on the parameters. Note these are not hard constraints.
E.g.:
Minimize f(X) while requiring that the length $\|X\|_2$ is also minimum.
Minimize f(X) while requiring that $\|X\|_1$ is also minimal.
Minimize f(X) such that g(X) is maximum.
We will encounter problems where such requirements are logical.

Contour Plot of a Quadratic Objective
(Figure: two plots of f(x) over $x_1$, $x_2$.)
Left: actual 3D plot, $x = [x_1, x_2]$.
Right: constant-value contours. The innermost contour has the lowest value.
Unconstrained/unregularized solution: the center of the innermost contour.

Examples of regularization
(Image credit: Tibshirani)
Left: $L_1$ regularization, find x that minimizes f(x).
  Also minimize $\|x\|_1$.
  $\|x\|_1 = \text{const}$ is a diamond.
  Find x that also minimizes the diameter of the diamond.
Right: $L_2$ or Tikhonov regularization.
  Also minimize $\|x\|_2$.
  $\|x\|_2 = \text{const}$ is a circle (sphere).
  Find x that also minimizes the diameter of the circle.

Regularization
The problem: multiple simultaneous objectives.
Minimize f(X).
Also minimize $g_1(X), g_2(X), \ldots$ These are regularizers.
Solution: define $L(X) = f(X) + \lambda_1 g_1(X) + \lambda_2 g_2(X) + \cdots$
$\lambda_1, \lambda_2$, etc. are regularization parameters. These are set and not estimated, unlike Lagrange multipliers.
Minimize $L(X)$.

Contour Plot of a Quadratic Objective
(Figure: two plots of f(x) over $x_1$, $x_2$.)
Left: actual 3D plot, $x = [x_1, x_2]$.
Right: equal-value contours of f(x). The innermost contour has the lowest value.

With $L_1$ regularization
(Figure: contour plots of $f(x) + 0.5 \|x\|_1$ and $f(x) + \|x\|_1$ over $x_1$, $x_2$.)
$L_1$ regularized objective $f(x) + \lambda \|x\|_1$, for different values of the regularization parameter $\lambda$.
Note: the minimum value occurs on the $x_1$ axis for $\lambda = 1$, i.e. a sparse solution.

$L_2$ and $L_1$-$L_2$ regularization
(Figure: contour plots of $f(x) + \|x\|_2^2$ and $f(x) + \|x\|_1 + \|x\|_2^2$ over $x_1$, $x_2$.)
The $L_2$ regularized objective $f(x) + \lambda \|x\|_2^2$ results in a "shorter" optimum.
The $L_1$-$L_2$ regularized objective results in a sparse, short optimum.
$\lambda = 1$ for both regularizers in the example.

Sparse signal reconstruction
Signal x of length 100, with 10 non-zero components.
Reconstructing the original signal from 50 noisy measurements $b = Ax + \varepsilon$.
(Figure panels: Minimum Square Error; Regularization.)

Signal reconstruction: Minimum Square Error
Least squares problem: $\min \|Ax - b\|_2^2$

Signal reconstruction: L2-Regularization
Least squares problem: $\min \|Ax - b\|_2^2 + \gamma \|x\|_2^2$

Signal reconstruction: L1-Regularization
Least squares problem: $\min \|Ax - b\|_2^2 + \gamma \|x\|_1$
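Why the $L_1$ penalty produces sparse solutions while the $L_2$ penalty only shrinks them can be seen in the simplest possible case. The sketch below assumes a scalar problem with $A = 1$, so both regularized objectives have closed-form minimizers; the function names are hypothetical.

```python
def l1_solution(b, gamma):
    """Minimizer of (x - b)^2 + gamma * |x|: soft thresholding at gamma / 2."""
    shrunk = abs(b) - gamma / 2
    if shrunk <= 0:
        return 0.0                       # small measurements are set exactly to zero
    return shrunk if b > 0 else -shrunk

def l2_solution(b, gamma):
    """Minimizer of (x - b)^2 + gamma * x^2: uniform shrinkage, never exactly zero."""
    return b / (1 + gamma)

small, large, gamma = 0.3, 2.0, 1.0
l1_results = (l1_solution(small, gamma), l1_solution(large, gamma))
l2_results = (l2_solution(small, gamma), l2_solution(large, gamma))
```

With these values the $L_1$ solution zeroes the small measurement and keeps the large one (reduced by $\gamma / 2$), whereas the $L_2$ solution halves both.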

Convex optimization Problems
A convex optimization problem is defined by:
a convex objective function;
convex inequality constraints $f_i$;
affine equality constraints $h_j$.
$\min_x f_0(x)$ (convex function) s.t. $f_i(x) \le 0$ (convex sets), $h_j(x) = 0$ (affine)

Convex Sets
A set $C \subseteq \mathbb{R}^n$ is convex if for each $x, y \in C$ and $\alpha \in [0, 1]$, $\alpha x + (1 - \alpha) y \in C$.
(Figure: a non-convex set and a convex set, each with a segment joining two points x and y.)

Convex functions
A function $f: \mathbb{R}^N \to \mathbb{R}$ is convex if for each $x, y \in \mathrm{domain}(f)$ and $\alpha \in [0, 1]$:
$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y)$
(Figure: a convex and a non-convex function, each with the chord $\alpha f(x) + (1 - \alpha) f(y)$ between f(x) and f(y).)
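The defining inequality can be probed numerically on a grid of points. This is only a sketch: a finite check can expose a violation (showing a function is not convex) but cannot prove convexity. The grid, the tolerance and the two test functions are assumptions.

```python
def violates_convexity(f, points, alphas, tol=1e-12):
    """Return True if f(a*x + (1-a)*y) > a*f(x) + (1-a)*f(y) for some grid triple."""
    for x in points:
        for y in points:
            for a in alphas:
                if f(a * x + (1 - a) * y) > a * f(x) + (1 - a) * f(y) + tol:
                    return True
    return False

points = [i / 2 for i in range(-6, 7)]       # -3.0, -2.5, ..., 3.0
alphas = [i / 10 for i in range(11)]         # 0.0, 0.1, ..., 1.0

square_violates = violates_convexity(lambda x: x * x, points, alphas)        # convex
neg_square_violates = violates_convexity(lambda x: -x * x, points, alphas)   # concave
```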

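Concave functions
A function $f: \mathbb{R}^N \to \mathbb{R}$ is concave if for each $x, y \in \mathrm{domain}(f)$ and $\alpha \in [0, 1]$:
$f(\alpha x + (1 - \alpha) y) \ge \alpha f(x) + (1 - \alpha) f(y)$
(Figure: a concave function lying above the chord $\alpha f(x) + (1 - \alpha) f(y)$.)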

First order convexity conditions
A differentiable function $f: \mathbb{R}^N \to \mathbb{R}$ is convex if and only if for all $x, y \in \mathrm{domain}(f)$ the following condition is satisfied:
$f(y) \ge f(x) + \nabla f(x)^T (y - x)$
(Figure: the tangent $f(x) + \nabla f(x)^T (y - x)$ at the point $(x, f(x))$ is a lower bound on f.)

Second order convexity conditions
A twice-differentiable function $f: \mathbb{R}^N \to \mathbb{R}$ is convex if and only if for all $x \in \mathrm{domain}(f)$ the Hessian is positive semidefinite:
$\nabla^2 f(x) \succeq 0$

Properties of Convex Optimization
For convex objectives over convex feasible sets, the optimum value is unique.
There are no local minima/maxima that are not also the global minima/maxima.
Any gradient-based solution will find this optimum eventually.
Primary problem: speed of convergence to this optimum.

Lagrange multiplier duality
Optimization problem with constraints:
$\min_x f(x)$ s.t. $g_i(x) \le 0$, $i \in \{1, \ldots, k\}$, and $h_j(x) = 0$, $j \in \{1, \ldots, l\}$
Lagrange multipliers: $\lambda_i \ge 0$, $\nu_j \in \mathbb{R}$
$L(x, \lambda, \nu) = f(x) + \sum_{i=1}^{k} \lambda_i g_i(x) + \sum_{j=1}^{l} \nu_j h_j(x)$
The dual function:
$\inf_x L(x, \lambda, \nu) = \inf_x \left\{ f(x) + \sum_{i=1}^{k} \lambda_i g_i(x) + \sum_{j=1}^{l} \nu_j h_j(x) \right\}$

Lagrange multiplier duality
The original optimization problem: $\min_x \left\{ \sup_{\lambda \ge 0, \nu} L(x, \lambda, \nu) \right\}$
The dual optimization: $\max_{\lambda \ge 0, \nu} \left\{ \inf_x L(x, \lambda, \nu) \right\}$
Property of the dual for convex functions: $\sup_{\lambda \ge 0, \nu} \left\{ \inf_x L(x, \lambda, \nu) \right\} = f(x^*)$

Lagrange multiplier duality
Previous example: $\min_{x,y} f(x, y) = x^2 + y^2$ s.t. $2x + y \le -4$
f(x, y) is convex.
The constraint function is convex.

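Lagrange multiplier duality
Primal system: $\min_{x,y} f(x, y) = x^2 + y^2$ s.t. $2x + y \le -4$
Lagrange multiplier: $L = x^2 + y^2 + \lambda (2x + y + 4)$
$\frac{\partial L}{\partial x} = 2x + 2\lambda = 0 \Rightarrow x = -\lambda$
$\frac{\partial L}{\partial y} = 2y + \lambda = 0 \Rightarrow y = -\frac{\lambda}{2}$
Dual system: $\max_\lambda w(\lambda) = -\frac{5}{4} \lambda^2 + 4\lambda$ s.t. $\lambda \ge 0$
Property: $w(\lambda^*) = f(x^*, y^*)$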

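Lagrange multiplier duality
Dual system: $\max_\lambda w(\lambda) = -\frac{5}{4} \lambda^2 + 4\lambda$ s.t. $\lambda \ge 0$
This is a concave function: a convex function becomes a concave function in the dual problem.
$\frac{\partial w}{\partial \lambda} = -\frac{5}{2} \lambda + 4 = 0 \Rightarrow \lambda^* = \frac{8}{5}$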

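Lagrange multiplier duality
Evaluating $w(\lambda^*) = f(x^*, y^*)$:
Primal system: $\min_{x,y} f(x, y) = x^2 + y^2$ s.t. $2x + y \le -4$, with $x^* = -\frac{8}{5}$, $y^* = -\frac{4}{5}$:
$f(x^*, y^*) = \left(-\frac{8}{5}\right)^2 + \left(-\frac{4}{5}\right)^2 = \frac{16}{5}$
Dual system: $\max_\lambda w(\lambda) = -\frac{5}{4} \lambda^2 + 4\lambda$ s.t. $\lambda \ge 0$, with $\lambda^* = \frac{8}{5}$:
$w(\lambda^*) = -\frac{5}{4} \left(\frac{8}{5}\right)^2 + \frac{32}{5} = \frac{16}{5}$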
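The equality of primal and dual values in this example can be confirmed with exact rational arithmetic. The sketch assumes the Lagrangian $L = x^2 + y^2 + \lambda (2x + y + 4)$ and evaluates the dual function by substituting the minimizers $x = -\lambda$, $y = -\lambda / 2$; the grid of $\lambda$ values is an arbitrary choice.

```python
from fractions import Fraction

def lagrangian(x, y, lam):
    return x * x + y * y + lam * (2 * x + y + 4)

def dual(lam):
    """w(lam) = inf over (x, y) of L, attained at x = -lam, y = -lam / 2."""
    return lagrangian(-lam, -lam / 2, lam)

lam_star = Fraction(8, 5)                     # from dw/dlam = -5/2 * lam + 4 = 0
x_star, y_star = -lam_star, -lam_star / 2
primal_value = x_star ** 2 + y_star ** 2      # f(x*, y*)
dual_value = dual(lam_star)                   # w(lam*)

# The dual function never exceeds the primal optimum on a grid of lam >= 0.
grid = [Fraction(i, 10) for i in range(0, 41)]
dual_is_lower_bound = all(dual(lam) <= primal_value for lam in grid)
```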