Machine Learning for Signal Processing Lecture 4: Optimization 13 Sep 2015 Instructor: Bhiksha Raj (slides largely by Najim Dehak, JHU) 11-755/18-797 1
Index 1. The problem of optimization 2. Direct optimization 3. Descent methods Newton s method Gradient methods 4. Online optimization 5. Constrained optimization Lagrange s method Projected gradients 6. Regularization 7. Convex optimization and Lagrangian duals 11-755/18-797 2
Index 1. The problem of optimization 2. Direct optimization 3. Descent methods Newton s method Gradient methods 4. Online optimization 5. Constrained optimization Lagrange s method Projected gradients 6. Regularization 7. Convex optimization and Lagrangian duals 11-755/18-797 3
A problem we recently saw M = N = S The projection matrix P is the matrix that minimizes the total error between the projected matrix S and the original matrix M 11-755/18-797 4
The projection problem S = PM For individual vectors in the spectrogram S i = PM i Total projection error is E = M i PM i 2 i The projection matrix projects onto the space of notes in N P = NC The problem of finding P : Minimize E = M i PM i 2 i such that P = NC This is a problem of constrained optimization 11-755/18-797 5
Optimization Optimization is finding the best value of a function (which can be the best minimum) The number of turning points of a function depend on the order of the function. Not all turning points are minima. They can be minima, maxima or inflection points. f (x) f(x) global maximum min x f (x) inflection point global minimum local minimum x
Examples of Optimization : Multivariate functions Find the optimal point in these functions 11-755/18-797 7
Index 1. The problem of optimization 2. Direct optimization 3. Descent methods Newton s method Gradient methods 4. Online optimization 5. Constrained optimization Lagrange s method Projected gradients 6. Regularization 7. Convex optimization and Lagrangian duals 11-755/18-797 8
Simple Approach: Turning Point f(x) X The minimum of the function is always a turning point The function increases on either side How to identify these turning points? 11-755/18-797 9
The derivative of a curve y y x x The derivative α of a curve is a multiplicative factor explaining how much y changes in response to a very small change in x y = x For scalar functions of scalar variables, often expressed as dy y = dy dx x y = f (x) x dx or as f (x) We have all learned how to compute derivatives in basic calculus 11-755/18-797 10
Positive derivative dy dx > 0 + + + + + The derivative of a Curve 0 - - - - - - - Negative derivative dy dx < 0 0 + + + + + + + + Zero derivative dy dx = 0 0 - - - - - - - - - In upward-rising regions of the curve, the derivative is positive Small increase in X cause Y to increase In downward-falling regions, the derivative is negative At turning points, the derivative is 0 Assumption: the function is differentiable at the turning point 11-755/18-797 11
Finding the minimum of a function f(x) dy dx = 0 x Find the value x at which f (x) = 0 Solve df(x) dx = 0 The solution is a turning point But is it a minimum? 11-755/18-797 13
Turning Points + + + 0 - - - - - - + + + + + + 0 - - - - - - - - 0 Both maxima and minima have zero derivative 11-755/18-797 14
Derivatives of a curve f(x) x Both maxima and minima are turning points 11-755/18-797 15
Derivatives of a curve f (x) f(x) x Both maxima and minima are turning points Both maxima and minima have zero derivative 11-755/18-797 16
Derivative of the derivative of the curve f (x) f (x) f(x) x Both maxima and minima are turning points Both maxima and minima have zero derivative The second derivative f (x) is ve at maxima and +ve at minima! At maxima the derivative goes from +ve to ve, so the derivative decreases as x increases At minima the derivative goes from ve to +ve and increases as x increases 11-755/18-797 17
Soln: Finding the minimum or f(x) maximum of a function dy dx = 0 Find the value x at which f (x) = 0: Solve df(x) dx = 0 The solution x soln is a turning point Check the double derivative at x soln : compute f x soln = df (x soln) dx If f x soln is positive x soln is a minimum, otherwise it is a maximum x 11-755/18-797 18
What about functions of multiple variables? The optimum point is still turning point Shifting in any direction will increase the value For smooth functions, miniscule shifts will not result in any change at all We must find a point where shifting in any direction by a microscopic amount will not change the value of the function 11-755/18-797 19
A brief note on derivatives of multivariate functions 11-755/18-797 20
The Gradient of a scalar function f(x) The Gradient f(x) of a scalar function f(x) of a multi-variate input X is a multiplicative factor that gives us the change in f(x) for tiny variations in X X f(x) = f(x) T X 11-755/18-797 21
Gradients of scalar functions with multi-variate inputs Consider f X = f x 1, x 2,, x n f(x) f(x) = x 1 f(x) x 2 f(x) Check: f X = f X T X = f(x) x x 1 + f(x) 1 x 2 x n x 2 + + f(x) x n x n
A well-known vector property u. v = u v cosθ The inner product between two vectors of fixed lengths is maximum when the two vectors are aligned 11-755/18-797 23
f X Properties of Gradient = f X T X The inner product between f X and X Fixing the length of X E.g. X = 1 f X is max if X = c f X f X, X = 0 The function f(x) increases most rapidly if the input increment X is perfectly aligned to f X The gradient is the direction of fastest increase in f(x) 11-755/18-797 24
Gradient Gradient vector f(x) 11-755/18-797 25
Gradient Gradient vector f(x) Moving in this direction increases f X fastest 11-755/18-797 26
Gradient Gradient vector f(x) f(x) Moving in this direction decreases f X fastest Moving in this direction increases f(x) fastest 11-755/18-797 27
Gradient Gradient here is 0 Gradient here is 0 11-755/18-797 28
Properties of Gradient: 2 The gradient vector f(x) is perpendicular to the level curve 29
Derivatives of vector function of vector input Y = f(x) X = x 1 x 2 Y = y 1 y 2 X f(x) The Gradient f(x) of a scalar function f(x) of a multi-variate input X is a multiplicative factor that gives us the change in f(x) for tiny variations in X f(x) = f(x) T X 11-755/18-797 30
Gradient of vector function of vector input 11-755/18-797 31 x n x x X.. 2 1 n m n n n n x y x y x y x y x y x y x y x y x y X f................ ) ( 2 1 2 2 2 1 2 1 2 1 1 1 y m y y X f.. ) ( 2 1 Properties and interpretations are similar to the case of scalar functions of vector inputs
The Hessian The Hessian of a function f(x 1, x 2,, x n ) is given by the second derivative é ê ê ê ê ê ê Ñ 2 f (x 1,..., x n ) := ê ê ê ê ê ê ê ë 2 f x 1 2 2 f x 1 x 2.. 2 f 2 f.. 2 x 2 x 1 x 2 2 f x 1 x n 2 f x 2 x n.......... 2 f 2 f.. x n x 1 x n x 2 2 f x n 2 11-755/18-797 45 ù ú ú ú ú ú ú ú ú ú ú ú ú ú û
Returning to direct optimization 11-755/18-797 47
Finding the minimum of a scalar function of a multi-variate input The optimum point is a turning point the gradient will be 0 11-755/18-797 48
Unconstrained Minimization of function (Multivariate) 1. Solve for the X where the gradient equation equals to zero f ( X ) 2. Compute the Hessian Matrix 2 f(x) at the candidate solution and verify that Hessian is positive definite (eigenvalues positive) -> to identify local minima Hessian is negative definite (eigenvalues negative) -> to identify local maxima 0 11-755/18-797 49
Unconstrained Minimization of Minimize function (Example) f (x 1, x 2, x 3 ) = (x 1 ) 2 + x 1 (1- x 2 )-(x 2 ) 2 - x 2 x 3 +(x 3 ) 2 + x 3 Gradient é ê Ñf = ê ê ëê 2x 1 +1- x 2 -x 1 + 2x 2 - x 3 -x 2 + 2x 3 +1 ù ú ú ú ûú 11-755/18-797 50
Unconstrained Minimization of Set the gradient to null é ê Ñf = 0Þ ê ê ëê function (Example) 2x 1 +1- x 2 -x 1 + 2x 2 - x 3 -x 2 + 2x 3 +1 ù é ú ê ú = ê ú ê ûú ë Solving the 3 equations system with 3 unknowns é ê x = ê ê ëê x 1 x 2 x 3 ù é ú ê ú = ê ú ê ûú ë -1-1 -1 ù ú ú ú û 0 0 0 ù ú ú ú û 11-755/18-797 51
Unconstrained Minimization of function (Example) Compute the Hessian matrix Evaluate the eigenvalues of the Hessian matrix l 1 = 3.414, l 2 = 0.586, l 3 = 2 All the eigenvalues are positives => the Hessian matrix is positive definite The point é ê x = ê ê ëê x 1 x 2 x 3 ù é ú ê ú = ê ú ê ûú ë -1-1 -1 ù ú ú ú û é ê Ñ 2 f = ê ê ë is a minimum 2-1 0-1 2-1 0-1 2 11-755/18-797 52 ù ú ú ú û
Index 1. The problem of optimization 2. Direct optimization 3. Descent methods Newton s method Gradient methods 4. Online optimization 5. Constrained optimization Lagrange s method Projected gradients 6. Regularization 7. Convex optimization and Lagrangian duals 11-755/18-797 53
Closed Form Solutions are not always available f(x) Often it is not possible to simply solve f X = 0 The function to minimize/maximize may have an intractable form In these situations, iterative solutions are used Begin with a guess for the optimal X and refine it iteratively until the correct value is obtained X 11-755/18-797 54
Iterative solutions f(x) X X x 0 x 1 x 2 x 5 x 3 X 0 2 x X 4 1 Iterative solutions Start from an initial guess X 0 for the optimal X Update the guess towards a (hopefully) better value of f(x) Stop when f(x) no longer decreases Problems: Which direction to step in How big must the steps be 11-755/18-797 55
Descent methods Iterative solutions that attempt to descend the function in steps to arrive at the minimum Based on the first order derivatives (gradient) and in some cases the second order derivatives (Hessian). Newton s method is based on both first and second derivatives Gradient descent is based only on the first derivative 11-755/18-797 56
Descent methods Iterative solutions that attempt to descend the function in steps to arrive at the minimum Based on the first order derivatives (gradient) and in some cases the second order derivatives (Hessian). Newton s method is based on both first and second derivatives Gradient descent is based only on the first derivative 11-755/18-797 57
Newton s Method The Newton Method is based on an iterative step to minimize the derivative f (x) x k+1 = x k - f (xk ) f (x k ) k is the current iteration The iterations continue until we achieve the stopping criterion x k+1 x k < ε 11-755/18-797 58
Newton s method to find the zero of a function Newton s method to find the zero of a function Initialize estimate Approximate function by the tangent at initial value Update estimate to location where tangent becomes 0 Iterate 11-755/18-797 59
Newton s Method to optimize a function y f(x) x f (x) Apply Newton s method to the derivative of the function! The derivative goes to 0 at the optimum Algorithm: Initialize x 0 K th iteration: Approximate f (x) by the tangent at x k Find the location x intersect where the tangent goes to 0. Set x k+1 = x intersect Iterate 11-755/18-797 60
Newton s method for univariate functions Apply Newton s algorithm to find the zero of the derivative f (x) x k+1 = x k - f (xk ) f (x k ) k is the current iteration The iterations continue until we achieve the stopping criterion x k+1 x k < ε 11-755/18-797 61
Newton s method for multivariate functions 1. Select an initial starting point X 0 2. Evaluate the gradient f X k and Hessian 2 f X k at X k 3. Calculate the new X k+1 using the following X X 2 k 1 f ( X ). f ( X ) k 1 k k 4. Repeat Steps 2 and 3 until convergence 11-755/18-797 62
Newton s Method Newton s approach is based on the computation of both gradient and Hessian Fast to converge (few iterations) Slow to compute Newton s method (arrives at optimum in a single step) x 0 This method is very sensitive to the initial point If the initial point is very far from the optimal point, the optimization process may not converge 11-755/18-797 68
Descent methods Iterative solutions that attempt to descend the function in steps to arrive at the minimum Based on the first order derivatives (gradient) and in some cases the second order derivatives (Hessian). Newton s method is based on both first and second derivatives Gradient descent is based only on the first derivative 11-755/18-797 69
The Approach of Gradient Descent Iterative solution: Start at some point Find direction in which to shift this point to decrease error This can be found from the derivative of the function A positive derivative moving left decreases error A negative derivative moving right decreases error Shift point in this direction
The Approach of Gradient Descent Iterative solution: Trivial algorithm Initialize x 0 While f x k 0 If sign f x k is positive: Else x k+1 = x k step x k+1 = x k + step But what must step be to ensure we actually get to the optimum?
The Approach of Gradient Descent Iterative solution: Trivial algorithm Initialize x 0 While f x k 0 x k+1 = x k sign f x k. step Identical to previous algorithm
The Approach of Gradient Descent Iterative solution: Trivial algorithm Initialize x 0 Whilef x k 0 x k+1 = x k η k f x k η k is the step size What must the step size be?
Gradient descent/ascent (multivariate) The gradient descent/ascent method to find the minimum or maximum of a function f iteratively To find a maximum move in the direction of the gradient x k+1 = x k + η k f x k To find a minimum move exactly opposite the direction of the gradient x k+1 = x k η k f x k What is the step size η k 11-755/18-797 74
1. Fixed step size Fixed step size Use fixed value for η k 11-755/18-797 75
f Influence of step size example (constant step size) é 2 2 ( x1, x2) ( x1 ) x1x2 4( x2) x initial = ê 3 ë 3 0.1 0.2 ù ú û x 0 x 0 11-755/18-797 76
2. Backtracking line search for step size Two parameters α (typically 0.5) and β (typically 0.8) At each iteration, estimate step size as follows: Set η k = 1 Update η k = βη k until f x k η k f x k f x k αη k f x k 2 Update x k+1 = x k η k f x k Intuitively: At each iteration Take a unit step size and keep shrinking it until we arrive at a place where the function f x k η k f x k actually decreases sufficiently w.r.t f x k 77
2. Backtracking line search for step size Keep shrinking step size till we find a good one 78
2. Backtracking line search for step size X k+1 X k Keep shrinking step size till we find a good one Update estimate to the position at the converged step size 79
2. Backtracking line search for step size At each iteration, estimate step size as follows: Set η k = 1 Update η k = βη k until f x k η k f x k f x k αη k f x k 2 Update x k+1 = x k η k f x k Figure shows actual evolution of x k 11-755/18-797 80
3. Full line search for step size At each iteration scan for η k that minimizes f x k η k f x k Update x k = x k η k f x k 11-755/18-797 81
3. Full line search for step size At each iteration scan for η k that minimizes f x k η k f x k Can be computed by solving df x k η k f x k dη k = 0 Update x k = x k η k f x k 11-755/18-797 82
Gradient descent convergence criteria The gradient descent algorithm converges when one of the following criteria is satisfied f (x k+1 )- f (x k ) <e 1 Or Ñf (x k ) < e 2 11-755/18-797 83
Gradient descent vs. Newton s Gradient descent is typically much slower to converge than Newton s But much faster to compute x 0 Newton s method Gradient descent Newton s method is exponentially faster for convex problems Although derivatives and Hessians may be hard to derive May not converge for non-convex problems 97
Index 1. The problem of optimization 2. Direct optimization 3. Descent methods Newton s method Gradient methods 4. Online optimization 5. Constrained optimization Lagrange s method Projected gradients 6. Regularization 7. Convex optimization and Lagrangian duals 11-755/18-797 98
Online Optimization Often our objective function is an error The error is the cumulative error from many signals E.g. E(W) = y f(x, W) 2 x Optimization will find the W that minimizes total error across all x What if wanted to update our parameters after each input x instead of waiting for all of them to arrive? 11-755/18-797 99
A problem we saw M = S = N =? Given the music M and the score S of only four of the notes, but not the notes themselves, find the notes M = NS N = MPinv(S) 11-755/18-797 100
The Actual Problem M = S = N =? Given the music M and the score S find a matrix N such the error of reconstruction E = M i NS i 2 i is minimized This is a standard optimization problem The solution gives us N = MPinv(S) 11-755/18-797 101
The Actual Problem M = S = N =? Given the music M and the score S find a matrix N such the error of reconstruction E = M i NS i 2 i is minimized This is a standard optimization problem The solution gives us N = MPinv(S) This requires seeing all of M and S to estimate N 11-755/18-797 102
Online Updates M = S = N =? N 0 N 1 N 2 N 3 N 4 What if we want to update our estimate of the notes after every input After observing each vector of music and its score A situation that arises in many similar problems 11-755/18-797 103
Incremental Updates Easy solution: To obtain the k th estimate N k, minimize the error on the k th input The error on the k th input is: E k = M K NS K 2 Differentiating it gives us N = 2 M K NS K M K T Update the parameter to move in the direction of this update N k+1 = N k T + η M K NS K M K η must typically be very small to prevent the updates from being influenced entirely by the latest observation 11-755/18-797 104
Online update: Non-quadratic functions The earlier problem has a linear predictor as the underlying model M k = NS k The squared error E k = M k NS k 2 is quadratic and differentiable We often have non-linear predictors Y k = g WX k E k = Y k g WX k 2 The derivative of the error w.r.t W is often ugly or intractable For such problems we will use the following generalization of the online update rule for linear predictors This is the Widrow-Hoff rule W k+1 = W k + η Y k g W k X k Y k T 11-755/18-797 105
Index 1. The problem of optimization 2. Direct optimization 3. Descent methods Newton s method Gradient methods 4. Online optimization 5. Constrained optimization Lagrange s method Projected gradients 6. Regularization 7. Convex optimization and Lagrangian duals 11-755/18-797 106
A problem we recently saw M = N = S The projection matrix P is the matrix that minimizes the total error between the projected matrix S and the original matrix M 11-755/18-797 107
CONSTRAINED optimization Recall the projection problem: Find P such that we minimize E = M i PM i 2 i AND such that the projection is composed of the notes in N P = NC This is a problem of constrained optimization 11-755/18-797 108
Optimization problem with constraints Finding the minimum of a function f: R N R subject to constraints min f (x) x s.t. g i (x) 0 h j (x) = 0 i = 1,..., k { } j = 1,...,l { } Constraints define a feasible region, which is nonempty 11-755/18-797 109
Optimization without constraints No Constraints min x f (x, y, z)= x 2 + y 2 Best minimum point 11-755/18-797 110
Optimization with constraints With Constraints min x,y f (x, y)= x 2 + y 2 s.t. 2x + y -4 11-755/18-797 111
Optimization with constraints Minima w/ and w/o constraints min x,y f (x, y)= x 2 + y 2 s.t. 2x + y -4 Minimum Without constraints Minimum With constraints 11-755/18-797 112
Solving for constrained optimization: the method of Lagrangians Consider a function f(x, y) that must be maximized w.r.t (x, y) subject to g(x, y) = c Note, we re using a maximization example to go with the figures that have been obtained from Wikipedia 11-755/18-797 113
The Lagrange Method Purple surface is f(x, y) Must be maximized Red curve is constraint g(x, y) = c All solutions must line on this curve Problem: Find the position of the largest f(x, y) on the red curve! 11-755/18-797 114
The Lagrange Method maximize f x, y, s. t. g x, y = c Dotted lines are constant-value contours f(x, y) = C f x, y has the same value C at all points on a contour The constrained optimum will be at the point where the highest constant-value contour touches the red curve It will be tangential to the red curve 115
The Lagrange Method maximize f x, y, s. t. g x, y = c The constrained optimum is where the highest constant-value contour is tangential to the red curve The gradient of f(x, y) = C will be parallel to the gradient of g(x, y) = c 116
The Lagrange Method maximize f x, y, s. t. g x, y = c At the optimum f x, y = λ g x, y g x, y = c Find x, y that satisfies both above conditions 117
The Lagrange Method f x, y g x, y = λ g x, y = c Find x, y that satisfies both above conditions Combine the above two into one equation L x, y, λ = f x, y λ(g x, y c) Optimize it for x, y, λ Solving for (x, y), Solving for λ x,y L x, y, λ =0 f x, y = λ g x, y L x, y, λ λ = 0 g x, y = c 11-755/18-797 118
The Lagrange Method f x, y g x, y = λ g x, y = c Find x, y that satisfies both above conditions Combine the above two into one equation L x, y, λ = f x, y λ(g x, y c) Optimize it for x, y, λ Solving for (x, y), Formally: Solving for λ x,y L x, y, λ =0 f x, y = λ g x, y to maximize f(x, y): max L x, y, λ λ x,y to minimize f(x, y): min min λ L(x, y, λ) max L(x, y, λ) = 0 g x, y = c x,y λ 11-755/18-797 119
Generalizes to inequality constraints Optimization problem with constraints min s. t. g ( x) Lagrange multipliers l i ³ 0,n Î Â L(x, l,n) = f (x)+ å l i g i (x)+ n j h j (x) i=1 The necessary condition ÑL(x, l,n) = 0 Û L x x h j f ( x) i ( x) 0 i 0 k j 1,..., k 1,..., l l å j=1 L L = 0, = 0, l n = 0 11-755/18-797 120
Lagrange multiplier example min x, y s. t.2x f ( x, y) x y 4 2 y 2 Lagrange multiplier L x 2 y 2 (2x y 4) Evaluate ÑL(x, l,n) = 0 Û L x L L = 0, = 0, l n = 0 11-755/18-797 121
Optimization with constraints Lagrange Multiplier results min x,y f (x, y)= x 2 + y 2 s.t. 2x + y -4 Minimum With constraints (-8/5,-4/5,16/5) 11-755/18-797 123
An Alternate Approach: Projected Gradients Feasible Set The constraints specify a feasible set The region of the space where the solution can lie 11-755/18-797 124
An Alternate Approach: Projected Gradients Feasible Set From the current estimate, take a step using the conventional gradient descent approach If the update is inside the feasible set, no further action is required 11-755/18-797 125
An Alternate Approach: Projected Gradients Feasible Set If the update falls outside the feasible set, 11-755/18-797 126
An Alternate Approach: Projected Gradients Feasible Set If the update falls outside the feasible set, find the closest point to the update on the boundary of the feasible set 11-755/18-797 127
An Alternate Approach: Projected Gradients Feasible Set If the update falls outside the feasible set, find the closest point to the update on the boundary of the feasible set And move the updated estimate to this new point 11-755/18-797 128
The method of projected gradients min x s. t. g f ( x) i ( x) 0 i 1,..., k The constraints specify a feasible set The region of the space where the solution can lie Update current estimate using the conventional gradient descent approach If the update is inside the feasible set, no further action is required If the update falls outside the feasible set, find the closest point to the update on the boundary of the feasible set And move the updated estimate to this new point The closest point projects the update onto the feasible set For many problems, however, finding this projection can be difficult or intractable 11-755/18-797 129
Index 1. The problem of optimization 2. Direct optimization 3. Descent methods Newton s method Gradient methods 4. Online optimization 5. Constrained optimization Lagrange s method Projected gradients 6. Regularization 7. Convex optimization and Lagrangian duals 11-755/18-797 130
Regularization Sometimes we have additional regularization on the parameters E.g. Note these are not hard constraints Minimize f(x) while requiring that the length X 2 is also minimum Minimize f(x) while requiring that X 1 is also minimal Minimize f(x) such that g(x) is maximum We will encounter problems where such requirements are logical 11-755/18-797 131
Contour Plot of a Quadratic Objective f(x) f(x) x 2 x 2 x 1 x 1 Left: Actual 3D plot x = [x 1, x 2 ] Right: constant-value contours Innermost contour has lowest value Unconstrained/unregularized solution: The center of the innermost contour 11-755/18-797 132
Examples of regularization Image Credit: Tibshirani Left: L 1 regularization, find x that minimizes f(x) o Also minimize x 1 o x 1 = const is a diamond o Find x that also minimizes diameter of diamond Right: L 2 or Tikhonov regularization o Also minimize x 2 o x 2 = const is a circle (sphere) o Find x that also minimizes diameter of circle 11755/18797 133
Regularization The problem: multiple simultaneous objectives Minimize f(x) Also minimize g 1 X, g 2 X, These are regularizers Solution: Define L X = f X + λ 1 g 1 X + λ 2 g 2 X + λ 1, λ 2 etc are regularization parameters. These are set and not estimated Unlike Lagrange multipliers Minimize L X 11-755/18-797 134
Contour Plot of a Quadratic Objective f(x) f(x) x 2 x 2 x 1 x 1 Left: Actual 3D plot x = [x 1, x 2 ] Right: equal-value contours of f(x) Innermost contour has lowest value 11-755/18-797 135
With L 1 regularization f(x) + 0.5 x 1 f(x) + x 1 x 2 x 2 x 1 L 1 regularized objective f x + λ x 1, for different values of regularization parameter Note: Minimum value occurs on x 1 axis for = 1 Sparse solution 11-755/18-797 136 x 1
L 2 and L 1 -L 2 regularization f(x) + x 2 2 f(x) + x 1 + x 2 2 x 2 x 2 x 1 L 2 regularized objective f x + λ x 2 results in shorter optimum L 1 -L 2 regularized objective results in sparse, short optimum = 1 for both regularizers in example x 1 11-755/18-797 137
Sparse signal reconstruction Minimum Square Error Signal x of length 100 10 non-zero component Regularization Reconstructing the original signal from noisy 50 measurements b = Ax + ε 11-755/18-797 138
Signal reconstruction Minimum Square Error Signal reconstruction Least square problem min Ax - b 2 2 11-755/18-797 139
L2-Regularization Signal reconstruction Least squares problem min Ax - b 2 2 +g x 2 2 11-755/18-797 140
L1-Regularization Signal reconstruction Least square problem min Ax - b 2 2 +g x 1 11-755/18-797 141
Index 1. The problem of optimization 2. Direct optimization 3. Descent methods Newton s method Gradient methods 4. Online optimization 5. Constrained optimization Lagrange s method Projected gradients 6. Regularization 7. Convex optimization and Lagrangian duals 11-755/18-797 142
Convex optimization Problems An convex optimization problem is defined by convex objective function Convex inequality constraints Affine equality constraints h j min f 0 (x) (convex function) x s.t. f i (x) 0 (convex sets) h j (x) = 0 (Affine) f i 11-755/18-797 143
Convex Sets a set is convex, if for each x, y Î C and a Î 0,1 C Î Â n [ ] then ax +(1-a)y Î C x Non Convex Convex y y x 11-755/18-797 144
Convex functions A function f: R N R is convex if for each x, y domain(f) and α [0,1] f (ax +(1-a)y) a f (x)+(1-a) f (y) Convex f (x) Non Convex f (y) a f (x)+(1-a) f (y) a f (x)+(1-a) f (y) f (y) f (x) 11-755/18-797 145
Concave functions A function f: R N R is convex if for each x, y domain(f) and α [0,1] f (ax +(1-a)y) ³a f (x)+(1-a) f (y) f (y) Concave a f (x)+(1-a) f (y) 11-755/18-797 146
First order convexity conditions A differentiable function f: R N R is convex if and only if for x, y domain(f) the following condition is satisfied f (y) ³ f (x)+ñf (x) t (y - x) Lower Bound (x, f (x)) f (x)+ Ñf (x) T (y - x) 11-755/18-797 147
Second order convexity conditions A twice-differentiable function f: R N R is convex if and only if for all x, y domain(f) the Hessian is superior or equal to zero Ñ 2 f (x) ³ 0 11-755/18-797 148
Properties of Convex Optimization For convex objectives over convex feasible sets, the optimum value is unique There are no local minima/maxima that are not also the global minima/maxima Any gradient-based solution will find this optimum eventually Primary problem: speed of convergence to this optimum 11-755/18-797 149
Lagrange multiplier duality Optimization problem with constraints min f (x) x s.t. g i (x) 0 h j (x) = 0 i = { 1,..., k} j = { 1,...,l } Lagrange multipliers l i ³ 0,n Î Â L(x, l,n) = f (x)+ å l i g i (x)+ n j h j (x) The Dual function inf x L(x, l,n) = inf x k i=1 l å j=1 ì k l ï üï í f (x)+ å l i g i (x)+ ån j h j (x) ý îï i=1 j=1 þï 11-755/18-797 150
Lagrange multiplier duality The Original optimization problem min{ sup L(x, l,n) } x The Dual optimization Property of the Dual for convex function sup l³0,n l³0,n { inf L(x, l,n) } = f (x * ) x l³0,n max{ inf L(x, l,n) } x 11-755/18-797 151
Lagrange multiplier duality Previous Example f x, y is convex Constraint function is convex min x,y f (x, y)= x 2 + y 2 s.t. 2x + y -4 11-755/18-797 152
Lagrange multiplier duality Primal system min x,y f (x, y)= x 2 + y 2 s.t. 2x + y -4 Lagrange Multiplier L = x 2 + y 2 + l(2x + y - 4) L = 2x + 2l = 0 Þ x = -l x L y = 2y + l = 0 Þ y = - l 2 Dual system max l w(l) = 5 4 l 2 + 4l s.t. l ³ 0 Property w(l*) = f (x*, y*) 11-755/18-797 153
Lagrange multiplier duality Dual system max l w(l) = 5 4 l 2 + 4l s.t. l ³ 0 Concave function Convex function become concave function in dual problem w x = - 5 2 l + 4 = 0 Þ l* = 8 5 11-755/18-797 154
Lagrange multiplier duality Primal system min x,y Evaluating f (x, y)= x 2 + y 2 s.t. 2x + y -4 w(l*) = f (x*, y*) Dual system max l w(l) = 5 4 l 2 + 4l s.t. l ³ 0 x* = - 8 5, y* = - 4 5 l* = 8 5 æ f (x*, y*) = - 8 ö ç è 5ø 2 æ + - 4 ö ç è 5ø 2 w(l*) = - 5 æ8 ç ö 2 4 è 5ø + 32 5 f (x*, y*) = 16 5 w(l*) = 16 5 11-755/18-797 155