Machine Learning for Signal Processing Lecture 4: Optimization

1 Machine Learning for Signal Processing Lecture 4: Optimization 13 Sep 2015 Instructor: Bhiksha Raj (slides largely by Najim Dehak, JHU) /

2 Index 1. The problem of optimization 2. Direct optimization 3. Descent methods: Newton's method, Gradient methods 4. Online optimization 5. Constrained optimization: Lagrange's method, Projected gradients 6. Regularization 7. Convex optimization and Lagrangian duals

3 Index 1. The problem of optimization 2. Direct optimization 3. Descent methods: Newton's method, Gradient methods 4. Online optimization 5. Constrained optimization: Lagrange's method, Projected gradients 6. Regularization 7. Convex optimization and Lagrangian duals

4 A problem we recently saw M = N = S The projection matrix P is the matrix that minimizes the total error between the projected matrix S and the original matrix M /

5 The projection problem S = PM. For individual vectors in the spectrogram, S_i = P M_i. The total projection error is E = Σ_i ||M_i - P M_i||^2. The projection matrix projects onto the space of notes in N: P = NC. The problem of finding P: Minimize E = Σ_i ||M_i - P M_i||^2 such that P = NC. This is a problem of constrained optimization

6 Optimization Optimization is finding the "best" value of a function, e.g. its minimum. The number of turning points of a function depends on the order of the function. Not all turning points are minima: they can be minima, maxima, or inflection points. (Figure: a curve f(x) with a global maximum, an inflection point, a global minimum, and a local minimum; the goal is min_x f(x).)

7 Examples of Optimization : Multivariate functions Find the optimal point in these functions /

8 Index 1. The problem of optimization 2. Direct optimization 3. Descent methods: Newton's method, Gradient methods 4. Online optimization 5. Constrained optimization: Lagrange's method, Projected gradients 6. Regularization 7. Convex optimization and Lagrangian duals

9 Simple Approach: Turning Point (figure: f(x) vs. x) The minimum of the function is always a turning point: the function increases on either side. How to identify these turning points?

10 The derivative of a curve (figure: y vs. x with small increments δy and δx) The derivative of a curve is a multiplicative factor explaining how much y changes in response to a very small change in x: δy = (dy/dx) δx. For scalar functions of scalar variables it is often expressed as dy/dx or as f'(x). We have all learned how to compute derivatives in basic calculus

11 The derivative of a curve: positive derivative (dy/dx > 0), negative derivative (dy/dx < 0), zero derivative (dy/dx = 0). In upward-rising regions of the curve the derivative is positive: a small increase in x causes y to increase. In downward-falling regions the derivative is negative. At turning points the derivative is 0. Assumption: the function is differentiable at the turning point

12 Finding the minimum of a function (figure: f(x) with dy/dx = 0 at the bottom) Find the value x at which f'(x) = 0: solve df(x)/dx = 0. The solution is a turning point. But is it a minimum?

13 Turning Points Both maxima and minima have zero derivative /

14 Derivatives of a curve (figure: f(x) vs. x) Both maxima and minima are turning points

15 Derivatives of a curve (figure: f(x) and f'(x) vs. x) Both maxima and minima are turning points. Both maxima and minima have zero derivative

16 Derivative of the derivative of the curve (figure: f(x), f'(x), f''(x)) Both maxima and minima are turning points. Both maxima and minima have zero derivative. The second derivative f''(x) is -ve at maxima and +ve at minima! At maxima the derivative goes from +ve to -ve, so the derivative decreases as x increases. At minima the derivative goes from -ve to +ve and increases as x increases

17 Soln: Finding the minimum or maximum of a function Find the value x at which f'(x) = 0: solve df(x)/dx = 0. The solution x_soln is a turning point. Check the double derivative at x_soln: compute f''(x_soln) = df'(x_soln)/dx. If f''(x_soln) is positive, x_soln is a minimum; otherwise it is a maximum

18 What about functions of multiple variables? The optimum point is still a turning point: shifting in any direction away from a minimum will increase the value, and for smooth functions a minuscule shift will not result in any change at all (to first order). We must find a point where shifting in any direction by a microscopic amount will not change the value of the function

19 A brief note on derivatives of multivariate functions /

20 The Gradient of a scalar function f(X) The gradient ∇f(X) of a scalar function f(X) of a multi-variate input X is a multiplicative factor that gives us the change in f(X) for tiny variations in X: δf(X) = ∇f(X)^T δX

21 Gradients of scalar functions with multi-variate inputs Consider f(X) = f(x_1, x_2, ..., x_n). The gradient is the vector of partial derivatives: ∇f(X) = [∂f(X)/∂x_1, ∂f(X)/∂x_2, ..., ∂f(X)/∂x_n]. Check: δf(X) = ∇f(X)^T δX = (∂f(X)/∂x_1) δx_1 + (∂f(X)/∂x_2) δx_2 + ... + (∂f(X)/∂x_n) δx_n
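As a quick illustration (not part of the original slides), the sketch below compares an analytic gradient against central finite differences for an arbitrary two-variable function; the function g and the step h are my own choices.

```python
import numpy as np

def g(x):
    # Example scalar function of a vector input (arbitrary choice).
    return x[0]**2 + 3*x[0]*x[1] + np.sin(x[1])

def grad_g(x):
    # Analytic gradient: vector of partial derivatives.
    return np.array([2*x[0] + 3*x[1], 3*x[0] + np.cos(x[1])])

def numerical_gradient(f, x, h=1e-6):
    # Approximate each partial derivative with a central difference.
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

x = np.array([0.7, -1.2])
print(grad_g(x))                   # analytic gradient
print(numerical_gradient(g, x))    # finite-difference approximation (should agree closely)
```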

22 A well-known vector property u · v = |u||v| cos θ The inner product between two vectors of fixed lengths is maximum when the two vectors are aligned

23 Properties of Gradient δf(X) = ∇f(X)^T δX is the inner product between ∇f(X) and δX. Fixing the length of δX, e.g. |δX| = 1, δf(X) is maximum if δX = c ∇f(X) (and zero if δX is perpendicular to ∇f(X)). The function f(X) increases most rapidly if the input increment δX is perfectly aligned with ∇f(X). The gradient is the direction of fastest increase in f(X)

24 Gradient (figure) Gradient vector ∇f(x)

25 Gradient (figure) Gradient vector ∇f(x). Moving in this direction increases f(X) fastest

26 Gradient (figure) Gradient vector ∇f(x). Moving in the direction of ∇f(x) increases f(x) fastest; moving in the opposite direction decreases f(x) fastest

27 Gradient Gradient here is 0 Gradient here is /

28 Properties of Gradient: 2 The gradient vector ∇f(x) is perpendicular to the level curve

29 Derivatives of vector function of vector input Y = f(X), with X = [x_1, x_2, ...] and Y = [y_1, y_2, ...]. The gradient ∇f(X) of a vector function f(X) of a multi-variate input X is a multiplicative factor that gives us the change in Y for tiny variations in X: δY = ∇f(X) δX

30 Gradient of vector function of vector input ∇f(X) is the m×n matrix of partial derivatives whose (i, j)-th entry is ∂y_i/∂x_j. Properties and interpretations are similar to the case of scalar functions of vector inputs

31 The Hessian The Hessian of a function f(x_1, x_2, ..., x_n) is the matrix of second derivatives: ∇^2 f(x_1, ..., x_n) is the n×n matrix whose (i, j)-th entry is ∂^2 f / (∂x_i ∂x_j), i.e. the diagonal holds ∂^2 f/∂x_1^2, ..., ∂^2 f/∂x_n^2 and the off-diagonal entries hold the mixed partial derivatives

32 Returning to direct optimization /

33 Finding the minimum of a scalar function of a multi-variate input The optimum point is a turning point: the gradient will be 0

34 Unconstrained Minimization of function (Multivariate) 1. Solve for the X where the gradient equals zero: ∇f(X) = 0. 2. Compute the Hessian matrix ∇^2 f(X) at the candidate solution and verify that: the Hessian is positive definite (eigenvalues positive) to identify local minima; the Hessian is negative definite (eigenvalues negative) to identify local maxima

35 Unconstrained Minimization of function (Example) Minimize f(x_1, x_2, x_3) = (x_1)^2 + x_1(1 - x_2) + (x_2)^2 - x_2 x_3 + (x_3)^2 + x_3. Gradient: ∇f = [2x_1 + 1 - x_2, -x_1 + 2x_2 - x_3, -x_2 + 2x_3 + 1]^T

36 Unconstrained Minimization of function (Example) Set the gradient to null: ∇f = 0 ⇒ [2x_1 + 1 - x_2, -x_1 + 2x_2 - x_3, -x_2 + 2x_3 + 1]^T = [0, 0, 0]^T. Solving the system of 3 equations with 3 unknowns gives x = [x_1, x_2, x_3]^T = [-1, -1, -1]^T

37 Unconstrained Minimization of function (Example) Compute the Hessian matrix: ∇^2 f = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]. Evaluate the eigenvalues of the Hessian matrix: λ_1 = 3.414, λ_2 = 0.586, λ_3 = 2. All the eigenvalues are positive ⇒ the Hessian matrix is positive definite. The point x = [-1, -1, -1]^T is a minimum
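As an added illustration, here is a minimal numerical check of this example, assuming the function, gradient, Hessian, and stationary point as reconstructed on the slides above: the gradient vanishes at x = [-1, -1, -1] and all Hessian eigenvalues are positive.

```python
import numpy as np

def grad_f(x):
    x1, x2, x3 = x
    # Gradient of f = x1^2 + x1(1 - x2) + x2^2 - x2*x3 + x3^2 + x3
    return np.array([2*x1 + 1 - x2,
                     -x1 + 2*x2 - x3,
                     -x2 + 2*x3 + 1])

# Hessian of f (constant, because f is quadratic)
H = np.array([[ 2., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  2.]])

x_star = np.array([-1., -1., -1.])
print(grad_f(x_star))          # -> [0. 0. 0.]
print(np.linalg.eigvalsh(H))   # -> approx [0.586, 2.0, 3.414], all positive
```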

38 Index 1. The problem of optimization 2. Direct optimization 3. Descent methods: Newton's method, Gradient methods 4. Online optimization 5. Constrained optimization: Lagrange's method, Projected gradients 6. Regularization 7. Convex optimization and Lagrangian duals

39 Closed Form Solutions are not always available Often it is not possible to simply solve ∇f(X) = 0: the function to minimize/maximize may have an intractable form. In these situations, iterative solutions are used: begin with a guess for the optimal X and refine it iteratively until the correct value is obtained

40 Iterative solutions (figure: successive estimates x_0, x_1, x_2, ... on f(x)) Start from an initial guess X_0 for the optimal X. Update the guess towards a (hopefully) better value of f(X). Stop when f(X) no longer decreases. Problems: Which direction to step in? How big must the steps be?

41 Descent methods Iterative solutions that attempt to descend the function in steps to arrive at the minimum. Based on the first-order derivatives (gradient) and in some cases the second-order derivatives (Hessian). Newton's method is based on both first and second derivatives. Gradient descent is based only on the first derivative

42 Descent methods Iterative solutions that attempt to descend the function in steps to arrive at the minimum. Based on the first-order derivatives (gradient) and in some cases the second-order derivatives (Hessian). Newton's method is based on both first and second derivatives. Gradient descent is based only on the first derivative

43 Newton's Method The Newton method is based on an iterative step that uses the first and second derivatives of f(x): x_{k+1} = x_k - f'(x_k)/f''(x_k), where k is the current iteration. The iterations continue until we achieve the stopping criterion |x_{k+1} - x_k| < ε

44 Newton's method to find the zero of a function Initialize the estimate. Approximate the function by its tangent at the current estimate. Update the estimate to the location where the tangent becomes 0. Iterate

45 Newton's Method to optimize a function (figure: f(x) and its derivative f'(x)) Apply Newton's method to the derivative of the function! The derivative goes to 0 at the optimum. Algorithm: Initialize x_0. k-th iteration: approximate f'(x) by its tangent at x_k; find the location x_intersect where the tangent goes to 0; set x_{k+1} = x_intersect. Iterate

46 Newton's method for univariate functions Apply Newton's algorithm to find the zero of the derivative f'(x): x_{k+1} = x_k - f'(x_k)/f''(x_k), where k is the current iteration. The iterations continue until we achieve the stopping criterion |x_{k+1} - x_k| < ε

47 Newton's method for multivariate functions 1. Select an initial starting point X_0. 2. Evaluate the gradient ∇f(X_k) and Hessian ∇^2 f(X_k) at X_k. 3. Calculate the new X_{k+1} using X_{k+1} = X_k - [∇^2 f(X_k)]^(-1) ∇f(X_k). 4. Repeat steps 2 and 3 until convergence
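A minimal sketch (not from the slides) of this multivariate Newton iteration, reusing the quadratic example from the direct-optimization slides; because that f is quadratic, the iteration reaches the optimum in a single step, as the next slide notes.

```python
import numpy as np

def grad_f(x):
    # Gradient of the earlier example f(x1, x2, x3).
    return np.array([2*x[0] + 1 - x[1],
                     -x[0] + 2*x[1] - x[2],
                     -x[1] + 2*x[2] + 1])

def hess_f(x):
    # Hessian of the example (constant for a quadratic).
    return np.array([[ 2., -1.,  0.],
                     [-1.,  2., -1.],
                     [ 0., -1.,  2.]])

def newton(grad, hess, x0, tol=1e-8, max_iter=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        # Newton step: solve Hessian * step = gradient (avoid explicit inversion).
        step = np.linalg.solve(hess(x), grad(x))
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

print(newton(grad_f, hess_f, x0=[5.0, 5.0, 5.0]))   # -> [-1. -1. -1.]
```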

48 Newton's Method Newton's approach is based on the computation of both gradient and Hessian: fast to converge (few iterations) but slow to compute. (Figure: from x_0, Newton's method arrives at the optimum of a quadratic in a single step.) This method is very sensitive to the initial point: if the initial point is very far from the optimal point, the optimization process may not converge

49 Descent methods Iterative solutions that attempt to descend the function in steps to arrive at the minimum. Based on the first-order derivatives (gradient) and in some cases the second-order derivatives (Hessian). Newton's method is based on both first and second derivatives. Gradient descent is based only on the first derivative

50 The Approach of Gradient Descent Iterative solution: Start at some point. Find the direction in which to shift this point to decrease error; this can be found from the derivative of the function: a positive derivative means moving left decreases error, a negative derivative means moving right decreases error. Shift the point in this direction

51 The Approach of Gradient Descent Iterative solution: trivial algorithm. Initialize x_0. While f'(x_k) ≠ 0: if sign(f'(x_k)) is positive, x_{k+1} = x_k - step; else x_{k+1} = x_k + step. But what must step be to ensure we actually get to the optimum?

52 The Approach of Gradient Descent Iterative solution: trivial algorithm. Initialize x_0. While f'(x_k) ≠ 0: x_{k+1} = x_k - sign(f'(x_k)) · step. Identical to the previous algorithm

53 The Approach of Gradient Descent Iterative solution: trivial algorithm. Initialize x_0. While f'(x_k) ≠ 0: x_{k+1} = x_k - η_k f'(x_k). η_k is the step size. What must the step size be?

54 Gradient descent/ascent (multivariate) The gradient descent/ascent method finds the minimum or maximum of a function f iteratively. To find a maximum, move in the direction of the gradient: x_{k+1} = x_k + η_k ∇f(x_k). To find a minimum, move exactly opposite the direction of the gradient: x_{k+1} = x_k - η_k ∇f(x_k). What is the step size η_k?

55 1. Fixed step size Use a fixed value for η_k
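A sketch (my own, not from the slides) of fixed-step gradient descent on a two-variable quadratic, my reconstruction of the example on the following slide; the starting point and step sizes are arbitrary choices, meant only to show how the step size changes the behaviour of the iteration.

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x1, x2) = x1^2 + x1*x2 + 4*x2^2
    return np.array([2*x[0] + x[1], x[0] + 8*x[1]])

def gradient_descent(grad, x0, eta, tol=1e-8, max_iter=10_000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # convergence: gradient near zero
            break
        x = x - eta * g                  # fixed-step update
    return x

print(gradient_descent(grad_f, x0=[3.0, 3.0], eta=0.1))   # converges smoothly to ~[0, 0]
print(gradient_descent(grad_f, x0=[3.0, 3.0], eta=0.24))  # larger step: zig-zags along the steep direction, converges slowly
```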

56 Influence of step size: example (constant step size) on a quadratic f(x_1, x_2) = x_1^2 + x_1 x_2 + 4 x_2^2, starting from the initial point x_initial = [3, ...]^T (figure shows the descent trajectory)

57 2. Backtracking line search for step size Two parameters: α (typically 0.5) and β (typically 0.8). At each iteration, estimate the step size as follows: Set η_k = 1. Update η_k = βη_k until f(x_k - η_k ∇f(x_k)) ≤ f(x_k) - αη_k ||∇f(x_k)||^2. Update x_{k+1} = x_k - η_k ∇f(x_k). Intuitively: at each iteration, take a unit step size and keep shrinking it until we arrive at a place where the function value f(x_k - η_k ∇f(x_k)) actually decreases sufficiently w.r.t. f(x_k)

58 2. Backtracking line search for step size Keep shrinking the step size till we find a good one

59 2. Backtracking line search for step size (figure: step from x_k to x_{k+1}) Keep shrinking the step size till we find a good one. Update the estimate to the position at the converged step size

60 2. Backtracking line search for step size At each iteration, estimate the step size as follows: Set η_k = 1. Update η_k = βη_k until f(x_k - η_k ∇f(x_k)) ≤ f(x_k) - αη_k ||∇f(x_k)||^2. Update x_{k+1} = x_k - η_k ∇f(x_k). Figure shows the actual evolution of x_k
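A sketch (my own illustration) of gradient descent with the backtracking line search just described, using α = 0.5 and β = 0.8 as in the slides; the test objective is an arbitrary choice.

```python
import numpy as np

def f(x):
    # Arbitrary smooth test objective.
    return x[0]**2 + x[0]*x[1] + 4*x[1]**2

def grad_f(x):
    return np.array([2*x[0] + x[1], x[0] + 8*x[1]])

def backtracking_step(f, grad_f, x, alpha=0.5, beta=0.8):
    # Start with a unit step and shrink until sufficient decrease is achieved.
    eta, g = 1.0, grad_f(x)
    while f(x - eta * g) > f(x) - alpha * eta * np.dot(g, g):
        eta *= beta
    return eta

x = np.array([3.0, 3.0])
for _ in range(100):
    eta = backtracking_step(f, grad_f, x)
    x = x - eta * grad_f(x)
print(x)   # -> close to [0, 0]
```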

61 3. Full line search for step size At each iteration, scan for the η_k that minimizes f(x_k - η_k ∇f(x_k)). Update x_{k+1} = x_k - η_k ∇f(x_k)

62 3. Full line search for step size At each iteration, scan for the η_k that minimizes f(x_k - η_k ∇f(x_k)). This can be computed by solving d f(x_k - η_k ∇f(x_k)) / dη_k = 0. Update x_{k+1} = x_k - η_k ∇f(x_k)

63 Gradient descent convergence criteria The gradient descent algorithm converges when one of the following criteria is satisfied: |f(x_{k+1}) - f(x_k)| < ε_1, or ||∇f(x_k)|| < ε_2

64 Gradient descent vs. Newton's Gradient descent is typically much slower to converge than Newton's, but much faster to compute. (Figure: from x_0, Newton's method vs. gradient descent.) Newton's method is exponentially faster for convex problems, although derivatives and Hessians may be hard to derive, and it may not converge for non-convex problems

65 Index 1. The problem of optimization 2. Direct optimization 3. Descent methods: Newton's method, Gradient methods 4. Online optimization 5. Constrained optimization: Lagrange's method, Projected gradients 6. Regularization 7. Convex optimization and Lagrangian duals

66 Online Optimization Often our objective function is an error, the cumulative error from many signals, e.g. E(W) = Σ_x ||y - f(x, W)||^2. Optimization will find the W that minimizes the total error across all x. What if we wanted to update our parameters after each input x instead of waiting for all of them to arrive?

67 A problem we saw (figure: music M, score S, notes N = ?) Given the music M and the score S of only four of the notes, but not the notes themselves, find the notes: M = NS, so N = M Pinv(S)

68 The Actual Problem (figure: music M, score S, notes N = ?) Given the music M and the score S, find a matrix N such that the error of reconstruction E = Σ_i ||M_i - N S_i||^2 is minimized. This is a standard optimization problem. The solution gives us N = M Pinv(S)

69 The Actual Problem (figure: music M, score S, notes N = ?) Given the music M and the score S, find a matrix N such that the error of reconstruction E = Σ_i ||M_i - N S_i||^2 is minimized. This is a standard optimization problem. The solution gives us N = M Pinv(S). This requires seeing all of M and S to estimate N

70 Online Updates (figure: successive estimates N_0, N_1, N_2, N_3, N_4) What if we want to update our estimate of the notes after every input, i.e. after observing each vector of music and its score? A situation that arises in many similar problems

71 Incremental Updates Easy solution: To obtain the k-th estimate N_k, minimize the error on the k-th input. The error on the k-th input is E_k = ||M_k - N S_k||^2. Differentiating it gives us ∇_N E_k = -2 (M_k - N S_k) S_k^T. Update the parameter to move against this gradient: N_{k+1} = N_k + η (M_k - N_k S_k) S_k^T. η must typically be very small to prevent the updates from being influenced entirely by the latest observation

72 Online update: Non-quadratic functions The earlier problem has a linear predictor as the underlying model, M_k = N S_k, and the squared error E_k = ||M_k - N S_k||^2 is quadratic and differentiable. We often have non-linear predictors: Y_k = g(W X_k), with E_k = ||Y_k - g(W X_k)||^2. The derivative of the error w.r.t. W is often ugly or intractable. For such problems we will use the following generalization of the online update rule for linear predictors (this is the Widrow-Hoff rule): W_{k+1} = W_k + η (Y_k - g(W_k X_k)) X_k^T
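A sketch (my own, on synthetic data) of the online update for the linear case: each incoming (score, music) pair triggers one small update N <- N + η (M_k - N S_k) S_k^T. The dimensions, step size η, and number of iterations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "true" notes matrix and a stream of (score, music) column pairs.
N_true = rng.normal(size=(20, 5))               # 20-dim music vectors, 5 notes
N_est = np.zeros_like(N_true)
eta = 0.01                                      # small step so no single input dominates

for _ in range(20_000):
    s = rng.normal(size=(5, 1))                 # one score column S_k
    m = N_true @ s                              # corresponding music column M_k
    err = m - N_est @ s                         # prediction error on this input only
    N_est += eta * err @ s.T                    # online update on the latest observation
print(np.max(np.abs(N_est - N_true)))           # residual error, small after many updates
```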

73 Index 1. The problem of optimization 2. Direct optimization 3. Descent methods: Newton's method, Gradient methods 4. Online optimization 5. Constrained optimization: Lagrange's method, Projected gradients 6. Regularization 7. Convex optimization and Lagrangian duals

74 A problem we recently saw M = N = S The projection matrix P is the matrix that minimizes the total error between the projected matrix S and the original matrix M /

75 CONSTRAINED optimization Recall the projection problem: Find P such that we minimize E = Σ_i ||M_i - P M_i||^2 AND such that the projection is composed of the notes in N: P = NC. This is a problem of constrained optimization

76 Optimization problem with constraints Finding the minimum of a function f: R^N → R subject to constraints: min_x f(x) s.t. g_i(x) ≤ 0, i = 1, ..., k, and h_j(x) = 0, j = 1, ..., l. The constraints define a feasible region, which is nonempty

77 Optimization without constraints No constraints: min_{x,y} f(x, y) = x^2 + y^2. (Figure: the best minimum point, at the origin.)

78 Optimization with constraints With constraints: min_{x,y} f(x, y) = x^2 + y^2 s.t. 2x + y ≤ -4

79 Optimization with constraints Minima with and without constraints: min_{x,y} f(x, y) = x^2 + y^2 s.t. 2x + y ≤ -4. (Figure: the minimum without constraints is at the origin; the minimum with constraints lies on the constraint boundary.)

80 Solving for constrained optimization: the method of Lagrangians Consider a function f(x, y) that must be maximized w.r.t. (x, y) subject to g(x, y) = c. Note, we're using a maximization example to go with the figures, which have been obtained from Wikipedia

81 The Lagrange Method The purple surface is f(x, y), which must be maximized. The red curve is the constraint g(x, y) = c; all solutions must lie on this curve. Problem: Find the position of the largest f(x, y) on the red curve!

82 The Lagrange Method maximize f(x, y) s.t. g(x, y) = c. The dotted lines are constant-value contours f(x, y) = C: f(x, y) has the same value C at all points on a contour. The constrained optimum will be at the point where the highest constant-value contour touches the red curve; it will be tangential to the red curve

83 The Lagrange Method maximize f(x, y) s.t. g(x, y) = c. The constrained optimum is where the highest constant-value contour is tangential to the red curve: the gradient of f(x, y) = C will be parallel to the gradient of g(x, y) = c

84 The Lagrange Method maximize f(x, y) s.t. g(x, y) = c. At the optimum: ∇f(x, y) = λ ∇g(x, y) and g(x, y) = c. Find (x, y) that satisfies both conditions

85 The Lagrange Method ∇f(x, y) = λ ∇g(x, y), g(x, y) = c: find (x, y) that satisfies both conditions. Combine the two into one function: L(x, y, λ) = f(x, y) - λ(g(x, y) - c). Optimize it for x, y, λ. Solving for (x, y): ∇_{x,y} L(x, y, λ) = 0 ⇒ ∇f(x, y) = λ ∇g(x, y). Solving for λ: ∂L(x, y, λ)/∂λ = 0 ⇒ g(x, y) = c

86 The Lagrange Method ∇f(x, y) = λ ∇g(x, y), g(x, y) = c: find (x, y) that satisfies both conditions. Combine the two into one function: L(x, y, λ) = f(x, y) - λ(g(x, y) - c). Optimize it for x, y, λ. Solving for (x, y): ∇_{x,y} L(x, y, λ) = 0 ⇒ ∇f(x, y) = λ ∇g(x, y). Solving for λ: ∂L(x, y, λ)/∂λ = 0 ⇒ g(x, y) = c. Formally: to maximize f(x, y), solve max_{x,y} min_λ L(x, y, λ); to minimize f(x, y), solve min_{x,y} max_λ L(x, y, λ)

87 Generalizes to inequality constraints Optimization problem with constraints: min_x f(x) s.t. g_i(x) ≤ 0, i = 1, ..., k, and h_j(x) = 0, j = 1, ..., l. Lagrange multipliers λ_i ≥ 0, ν ∈ R: L(x, λ, ν) = f(x) + Σ_{i=1}^{k} λ_i g_i(x) + Σ_{j=1}^{l} ν_j h_j(x). The necessary condition: ∇L(x, λ, ν) = 0, i.e. ∂L/∂x = 0, ∂L/∂λ = 0, ∂L/∂ν = 0

88 Lagrange multiplier example min_{x,y} f(x, y) = x^2 + y^2 s.t. 2x + y ≤ -4. Lagrange multiplier: L = x^2 + y^2 + λ(2x + y + 4). Evaluate ∇L(x, y, λ) = 0, i.e. ∂L/∂x = 0, ∂L/∂y = 0, ∂L/∂λ = 0

89 Optimization with constraints Lagrange multiplier results: min_{x,y} f(x, y) = x^2 + y^2 s.t. 2x + y ≤ -4. Minimum with constraints: (x*, y*, f(x*, y*)) = (-8/5, -4/5, 16/5)
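A sketch (my own) that reproduces this example symbolically with sympy, assuming the constraint is active at the optimum (2x + y = -4), which the figure indicates.

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda')
# Lagrangian with the active constraint 2x + y + 4 = 0
L = x**2 + y**2 + lam*(2*x + y + 4)

# Stationarity in x, y and the constraint itself (derivative w.r.t. lambda).
sol = sp.solve([sp.diff(L, x), sp.diff(L, y), sp.diff(L, lam)], [x, y, lam])
print(sol)                          # {x: -8/5, y: -4/5, lambda: 8/5}
print((x**2 + y**2).subs(sol))      # 16/5, the constrained minimum value
```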

90 An Alternate Approach: Projected Gradients Feasible Set The constraints specify a feasible set The region of the space where the solution can lie /

91 An Alternate Approach: Projected Gradients Feasible Set From the current estimate, take a step using the conventional gradient descent approach If the update is inside the feasible set, no further action is required /

92 An Alternate Approach: Projected Gradients Feasible Set If the update falls outside the feasible set, /

93 An Alternate Approach: Projected Gradients Feasible Set If the update falls outside the feasible set, find the closest point to the update on the boundary of the feasible set /

94 An Alternate Approach: Projected Gradients Feasible Set If the update falls outside the feasible set, find the closest point to the update on the boundary of the feasible set And move the updated estimate to this new point /

95 The method of projected gradients min_x f(x) s.t. g_i(x) ≤ 0, i = 1, ..., k. The constraints specify a feasible set: the region of the space where the solution can lie. Update the current estimate using the conventional gradient descent approach. If the update is inside the feasible set, no further action is required. If the update falls outside the feasible set, find the closest point to the update on the boundary of the feasible set, and move the updated estimate to this new point. The closest point projects the update onto the feasible set. For many problems, however, finding this projection can be difficult or intractable
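A sketch (my own) of projected gradient descent for a case where the projection is easy: the feasible set is a box (an arbitrary choice), so the Euclidean projection is just a clip; the objective is the x^2 + y^2 example from earlier.

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x, y) = x^2 + y^2
    return 2 * x

def project_box(x, lo, hi):
    # Euclidean projection onto the box lo <= x <= hi (closed form: clip each coordinate).
    return np.clip(x, lo, hi)

lo, hi = np.array([1.0, -2.0]), np.array([3.0, 2.0])   # feasible box
x = np.array([2.5, 1.5])
for _ in range(200):
    x = x - 0.1 * grad_f(x)          # ordinary gradient step
    x = project_box(x, lo, hi)       # project back onto the feasible set
print(x)                             # -> approximately [1, 0], the closest feasible point to the origin
```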

96 Index 1. The problem of optimization 2. Direct optimization 3. Descent methods: Newton's method, Gradient methods 4. Online optimization 5. Constrained optimization: Lagrange's method, Projected gradients 6. Regularization 7. Convex optimization and Lagrangian duals

97 Regularization Sometimes we have additional regularization requirements on the parameters; note these are not hard constraints. E.g.: Minimize f(X) while requiring that the length ||X||_2 is also minimum. Minimize f(X) while requiring that ||X||_1 is also minimal. Minimize f(X) such that g(X) is maximum. We will encounter problems where such requirements are logical

98 Contour Plot of a Quadratic Objective Left: actual 3D plot of f(x), x = [x_1, x_2]. Right: constant-value contours; the innermost contour has the lowest value. Unconstrained/unregularized solution: the center of the innermost contour

99 Examples of regularization (Image credit: Tibshirani) Left: L_1 regularization. Find x that minimizes f(x) while also minimizing ||x||_1; ||x||_1 = const is a diamond, so find the x that also minimizes the diameter of the diamond. Right: L_2 or Tikhonov regularization. Also minimize ||x||_2; ||x||_2 = const is a circle (sphere), so find the x that also minimizes the diameter of the circle

100 Regularization The problem: multiple simultaneous objectives. Minimize f(X) while also minimizing g_1(X), g_2(X), ...; these are regularizers. Solution: Define L(X) = f(X) + λ_1 g_1(X) + λ_2 g_2(X) + ... λ_1, λ_2 etc. are regularization parameters; these are set and not estimated, unlike Lagrange multipliers. Minimize L(X)

101 Contour Plot of a Quadratic Objective Left: actual 3D plot of f(x), x = [x_1, x_2]. Right: equal-value contours of f(x); the innermost contour has the lowest value

102 With L_1 regularization (figure: contours of f(x) and of f(x) + ||x||_1) The L_1-regularized objective is f(x) + λ||x||_1, shown for different values of the regularization parameter. Note: the minimum value occurs on the x_1 axis for λ = 1, a sparse solution

103 L_2 and L_1-L_2 regularization (figure: contours of f(x) + ||x||_2^2 and of f(x) + ||x||_1 + ||x||_2^2) The L_2-regularized objective f(x) + λ||x||_2^2 results in a shorter optimum. The L_1-L_2-regularized objective results in a sparse, short optimum. λ = 1 for both regularizers in the example

104 Sparse signal reconstruction (figure panels: minimum square error vs. regularization) A signal x with only a few non-zero components is reconstructed from 50 noisy measurements b = Ax + ε

105 Signal reconstruction Minimum square error signal reconstruction: least squares problem min_x ||Ax - b||_2^2

106 L2-Regularization Signal reconstruction: least squares problem min_x ||Ax - b||_2^2 + γ||x||_2^2

107 L1-Regularization Signal reconstruction: least squares problem min_x ||Ax - b||_2^2 + γ||x||_1
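A sketch (my own) contrasting the L2- and L1-regularized reconstructions above: the ridge problem has a closed form, while the L1 problem is solved here with a simple proximal-gradient (ISTA) loop. The problem sizes, noise level, and γ values are arbitrary choices; the point is only that the L1 solution comes out sparse while the L2 solution does not.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 50, 200, 10                        # 50 measurements, 200-dim signal, 10 non-zeros
A = rng.normal(size=(n, d)) / np.sqrt(n)
x_true = np.zeros(d)
x_true[rng.choice(d, k, replace=False)] = rng.normal(size=k)
b = A @ x_true + 0.01 * rng.normal(size=n)   # noisy measurements b = Ax + eps

# L2 (ridge): closed form of min ||Ax - b||^2 + gamma2 ||x||_2^2
gamma2 = 0.1
x_l2 = np.linalg.solve(A.T @ A + gamma2 * np.eye(d), A.T @ b)

# L1 (lasso) via ISTA: min ||Ax - b||^2 + gamma1 ||x||_1
gamma1 = 0.05
step = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)  # 1/L, L = Lipschitz constant of the quadratic term's gradient
x_l1 = np.zeros(d)
for _ in range(2000):
    z = x_l1 - step * 2 * A.T @ (A @ x_l1 - b)                        # gradient step on the quadratic term
    x_l1 = np.sign(z) * np.maximum(np.abs(z) - step * gamma1, 0.0)    # soft-thresholding (prox of the L1 term)

print(np.sum(np.abs(x_l2) > 1e-3), "nonzeros in the L2 solution")    # many nonzeros (dense)
print(np.sum(np.abs(x_l1) > 1e-3), "nonzeros in the L1 solution")    # few nonzeros (sparse)
```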

108 Index 1. The problem of optimization 2. Direct optimization 3. Descent methods: Newton's method, Gradient methods 4. Online optimization 5. Constrained optimization: Lagrange's method, Projected gradients 6. Regularization 7. Convex optimization and Lagrangian duals

109 Convex optimization problems A convex optimization problem is defined by a convex objective function, convex inequality constraints, and affine equality constraints: min_x f_0(x) (convex function) s.t. f_i(x) ≤ 0 (convex sets), h_j(x) = 0 (affine)

110 Convex Sets A set C ⊆ R^n is convex if for each x, y ∈ C and α ∈ [0, 1], αx + (1 - α)y ∈ C. (Figure: a non-convex set and a convex set.)

111 Convex functions A function f: R^N → R is convex if for each x, y ∈ domain(f) and α ∈ [0, 1]: f(αx + (1 - α)y) ≤ αf(x) + (1 - α)f(y). (Figure: for a convex function the chord αf(x) + (1 - α)f(y) lies above the curve; for a non-convex function it does not.)

112 Concave functions A function f: R^N → R is concave if for each x, y ∈ domain(f) and α ∈ [0, 1]: f(αx + (1 - α)y) ≥ αf(x) + (1 - α)f(y). (Figure: for a concave function the chord αf(x) + (1 - α)f(y) lies below the curve.)

113 First order convexity conditions A differentiable function f: R^N → R is convex if and only if for all x, y ∈ domain(f) the following condition is satisfied: f(y) ≥ f(x) + ∇f(x)^T (y - x). (Figure: the tangent f(x) + ∇f(x)^T (y - x) at (x, f(x)) is a lower bound on f.)

114 Second order convexity conditions A twice-differentiable function f: R^N → R is convex if and only if for all x ∈ domain(f) the Hessian is positive semidefinite: ∇^2 f(x) ⪰ 0
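A small numerical sketch (my own, on a convex function of my own choosing) checking both conditions: the first-order lower bound holds at sampled point pairs, and the Hessian eigenvalues are non-negative.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # A convex function: quadratic term plus a log-sum-exp term.
    return x @ x + np.log(np.sum(np.exp(x)))

def grad_f(x):
    p = np.exp(x) / np.sum(np.exp(x))
    return 2 * x + p

# First-order condition: f(y) >= f(x) + grad_f(x)^T (y - x) for random x, y.
ok = all(f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-9
         for x, y in (rng.normal(size=(2, 4)) for _ in range(1000)))
print("first-order condition holds on samples:", ok)

# Second-order condition: Hessian 2I + diag(p) - p p^T is positive semidefinite.
x = rng.normal(size=4)
p = np.exp(x) / np.sum(np.exp(x))
H = 2 * np.eye(4) + np.diag(p) - np.outer(p, p)
print("Hessian eigenvalues:", np.linalg.eigvalsh(H))   # all >= 0
```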

115 Properties of Convex Optimization For convex objectives over convex feasible sets, the optimum value is unique There are no local minima/maxima that are not also the global minima/maxima Any gradient-based solution will find this optimum eventually Primary problem: speed of convergence to this optimum /

116 Lagrange multiplier duality Optimization problem with constraints: min_x f(x) s.t. g_i(x) ≤ 0, i = {1, ..., k}, and h_j(x) = 0, j = {1, ..., l}. Lagrange multipliers λ_i ≥ 0, ν ∈ R: L(x, λ, ν) = f(x) + Σ_{i=1}^{k} λ_i g_i(x) + Σ_{j=1}^{l} ν_j h_j(x). The dual function: inf_x L(x, λ, ν) = inf_x { f(x) + Σ_{i=1}^{k} λ_i g_i(x) + Σ_{j=1}^{l} ν_j h_j(x) }

117 Lagrange multiplier duality The original optimization problem: min_x { sup_{λ≥0, ν} L(x, λ, ν) }. The dual optimization: max_{λ≥0, ν} { inf_x L(x, λ, ν) }. Property of the dual for convex functions: sup_{λ≥0, ν} { inf_x L(x, λ, ν) } = f(x*)

118 Lagrange multiplier duality Previous example: f(x, y) is convex and the constraint function is convex: min_{x,y} f(x, y) = x^2 + y^2 s.t. 2x + y ≤ -4

119 Lagrange multiplier duality Primal system: min_{x,y} f(x, y) = x^2 + y^2 s.t. 2x + y ≤ -4. Lagrange multiplier: L = x^2 + y^2 + λ(2x + y + 4). ∂L/∂x = 2x + 2λ = 0 ⇒ x = -λ; ∂L/∂y = 2y + λ = 0 ⇒ y = -λ/2. Dual system: max_λ w(λ) = -(5/4)λ^2 + 4λ s.t. λ ≥ 0. Property: w(λ*) = f(x*, y*)

120 Lagrange multiplier duality Dual system: max_λ w(λ) = -(5/4)λ^2 + 4λ s.t. λ ≥ 0. This is a concave function: the convex function becomes a concave function in the dual problem. dw/dλ = -(5/2)λ + 4 = 0 ⇒ λ* = 8/5

121 Lagrange multiplier duality Primal system: min_{x,y} f(x, y) = x^2 + y^2 s.t. 2x + y ≤ -4. Dual system: max_λ w(λ) = -(5/4)λ^2 + 4λ s.t. λ ≥ 0. Evaluating w(λ*) = f(x*, y*): x* = -8/5, y* = -4/5, λ* = 8/5. f(x*, y*) = (-8/5)^2 + (-4/5)^2 = 16/5, and w(λ*) = -(5/4)(8/5)^2 + 4(8/5) = 16/5
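A small sketch (my own) verifying strong duality numerically for this example: maximizing w(λ) = -(5/4)λ^2 + 4λ over a grid of λ ≥ 0 gives the same value 16/5 as the constrained primal optimum; the grid resolution is an arbitrary choice.

```python
import numpy as np

lam = np.linspace(0, 5, 100_001)
w = -1.25 * lam**2 + 4 * lam           # dual function w(lambda) = -(5/4) lambda^2 + 4 lambda
i = np.argmax(w)
print(lam[i], w[i])                     # -> lambda* = 1.6 (= 8/5), w(lambda*) = 3.2 (= 16/5)

x_star, y_star = -1.6, -0.8             # primal solution from the Lagrange conditions
print(x_star**2 + y_star**2)            # -> 3.2, equal to the dual optimum (strong duality)
```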
