Lecture Notes: Constraint Optimization
Gerhard Neumann, January 6

1 Constraint Optimization Problems

In constraint optimization we want to maximize a function $f(x)$ under the constraints $g_i(x) = 0$. For simplicity, we will only consider equality constraints. We can formalize this problem as

$$\operatorname{argmax}_x f(x) \quad \text{s.t.: } g_i(x) = 0, \;\; \forall i. \qquad (1)$$

We look for a point $x^*$ that lies on the constraint surface and maximizes $f$. Let us first consider the simpler case with only one constraint $g(x) = 0$. The constraint can be approximated by a Taylor expansion around a point $x$, i.e., $g(x + \epsilon) \approx g(x) + \epsilon^T \nabla g(x)$. We want the point $x + \epsilon$ to also lie on the surface, so it follows that $\epsilon^T \nabla g(x) = 0$. Stated differently, $\nabla g(x)$ is normal to the constraint surface. From Figure 1, we can see that the gradient $\nabla f(x^*)$ also needs to be orthogonal to the constraint surface, i.e., $\nabla f(x^*)$ and $\nabla g(x^*)$ are (anti-)parallel. Stated differently, there needs to be a factor $\lambda$ such that

$$\nabla f(x^*) + \lambda \nabla g(x^*) = 0 \qquad (2)$$

1.1 The Lagrangian Function

Now we can introduce the Lagrangian function $L(x, \lambda) = f(x) + \lambda g(x)$, where $\lambda$ is a Lagrangian multiplier. We want to find a saddle point of this function, i.e.,

$$\frac{\partial L(x, \lambda)}{\partial x} = 0 \qquad (3)$$

$$\frac{\partial L(x, \lambda)}{\partial \lambda} = 0 \qquad (4)$$

Note that these two conditions reconstruct the condition from Equation 2 and the constraint $g(x) = 0$.
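As a brief worked example of these conditions (added here for illustration), consider maximizing $f(x) = -(x_1^2 + x_2^2)$ under the single constraint $g(x) = x_1 + x_2 - 1 = 0$. The Lagrangian is

$$L(x, \lambda) = -(x_1^2 + x_2^2) + \lambda (x_1 + x_2 - 1).$$

Setting $\partial L / \partial x_i = -2 x_i + \lambda = 0$ gives $x_1 = x_2 = \lambda/2$, and setting $\partial L / \partial \lambda = 0$ recovers the constraint, so $\lambda = 1$ and $x^* = (1/2, 1/2)$. At this point $\nabla f(x^*) = (-1, -1)$ and $\nabla g(x^*) = (1, 1)$, hence $\nabla f(x^*) + \lambda \nabla g(x^*) = 0$, as demanded by Equation 2.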
Figure 1: Illustration of the constraint surface and the gradient directions $\nabla g(x)$ and $\nabla f(x)$ of the constraint and the objective, respectively. Taken from [Bishop, 2006].

1.2 The Lagrangian Dual Function

Under certain conditions, it is easier to optimize the Lagrangian dual function of an optimization problem. The dual function is given by

$$h(\lambda) = \max_x L(x, \lambda).$$

The dual function is only a function of the Lagrangian multipliers $\lambda$ and, per definition, $h(\lambda) \geq f(x) + \lambda g(x)$. For an optimal point $x^*$, this condition simplifies to $h(\lambda) \geq f(x^*)$, as $g(x^*) = 0$. Since $h(\lambda) \geq f(x^*)$, the optimal point for $\lambda$ is obtained by minimizing the dual function

$$\lambda^* = \operatorname{argmin}_\lambda h(\lambda). \qquad (5)$$

Under certain conditions, called the Karush-Kuhn-Tucker (KKT) conditions [Wikipedia, a], optimizing the dual is equivalent to solving the primal optimization problem, i.e., $\lambda^*$ can be used to obtain $x^*$ by

$$x^* = \operatorname{argmax}_x L(x, \lambda^*). \qquad (6)$$

Why should we optimize the dual instead of the primal? Often, the dual is easier to solve: the number of dual variables equals the number of constraints, which is typically smaller than the dimensionality of $x$. We can also show that $h(\lambda)$ is convex, even if $f(x)$ is not. Some more notes on the dual:

- If the primal is maximized, the dual is minimized, and vice versa.
- Inequality constraints can be handled almost equivalently to equality constraints, with the exception that every inequality constraint in the primal problem also adds an inequality constraint to the dual; these dual constraints are placed on the Lagrangian multipliers of the inequality constraints.
- There are many different regularity (KKT) conditions under which an optimal solution of the dual yields an optimal solution of the primal. The simplest one, which we can use in most cases, is the Slater condition [Wikipedia, b].

1.3 Cookbook for constraint optimization

We consider the following problem with equality and inequality constraints

$$\operatorname{argmax}_x f(x) \quad \text{s.t.: } g_i(x) = 0, \;\; \forall i, \quad c_j(x) \geq 0, \;\; \forall j \qquad (7)$$

We need to consider the following steps in order to solve this constraint optimization problem:

1. Check the KKT conditions.
2. Write down the Lagrangian $L(x, \lambda, \eta) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \eta_j c_j(x)$. The parameters $\eta_j$ are again Lagrangian multipliers, here for the inequality constraints.
3. Maximize the Lagrangian over $x$, i.e., $x^* = l(\lambda, \eta) = \operatorname{argmax}_x L(x, \lambda, \eta)$. Note that $x^*$ is a function of the Lagrangian multipliers $\lambda$ and $\eta$. Hence, this step is typically only feasible if we can obtain $x^*$ analytically.
4. Set $x^*$ back into the Lagrangian to obtain the dual function $h(\lambda, \eta) = L(l(\lambda, \eta), \lambda, \eta)$.
5. Solve the dual optimization problem $[\lambda^*, \eta^*] = \operatorname{argmin}_{\lambda, \eta} h(\lambda, \eta)$, s.t.: $\eta_i \geq 0, \;\; \forall i$.
6. Obtain the optimal solution by $x^* = l(\lambda^*, \eta^*)$.
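To make the cookbook concrete, the following minimal numerical sketch (a Python/NumPy illustration with made-up problem data r, A, b, not part of the original notes) applies steps 2-6 to a toy problem with equality constraints only: maximize $f(x) = -\frac{1}{2} x^T x + r^T x$ subject to $A x - b = 0$.

import numpy as np

# Toy problem (all data made up for illustration):
#   maximize f(x) = -1/2 x^T x + r^T x   s.t.   A x - b = 0
# Cookbook step 2: L(x, lam) = f(x) + lam^T (A x - b)
rng = np.random.default_rng(0)
n, m = 5, 2
r = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Step 3: maximize L over x analytically: -x + r + A^T lam = 0  =>  x = r + A^T lam
def x_of_lam(lam):
    return r + A.T @ lam

# Step 4: plug x(lam) back in to obtain the dual function h(lam)
def h(lam):
    x = x_of_lam(lam)
    return -0.5 * x @ x + r @ x + lam @ (A @ x - b)

# Step 5: minimize the dual; dh/dlam = A x(lam) - b = 0 is a linear system in lam
lam_star = np.linalg.solve(A @ A.T, b - A @ r)

# Step 6: recover the primal solution from the multipliers
x_star = x_of_lam(lam_star)
print("constraint residual:", A @ x_star - b)                      # ~ 0
print("primal equals dual at the optimum:",
      np.isclose(-0.5 * x_star @ x_star + r @ x_star, h(lam_star)))

At the optimum the constraint residual vanishes and the primal and dual values coincide, as expected for this convex problem.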
2 Examples

In this section, we will quickly review several examples that are important in the context of robot learning.

2.1 Resolved Velocity Control aka. Jacobian Pseudo-Inverse

We want to solve the following problem

$$\operatorname{argmin}_{\dot{q}} \frac{1}{2} \dot{q}^T \dot{q}, \quad \text{s.t.: } \dot{x} = J \dot{q}.$$

This is a convex problem, so the KKT conditions are satisfied. The Lagrangian of this optimization problem is given by

$$L(\dot{q}, \lambda) = \frac{1}{2} \dot{q}^T \dot{q} + \lambda^T (\dot{x} - J \dot{q}).$$

We obtain the optimal solution for $\dot{q}$ as a function of $\lambda$ by setting the derivative of the Lagrangian to 0 and solving for $\dot{q}$, which yields

$$\frac{\partial L(\dot{q}, \lambda)}{\partial \dot{q}} = \dot{q}^T - \lambda^T J = 0, \qquad \dot{q}^T = l(\lambda) = \lambda^T J.$$

Setting the previous equation back into the Lagrangian results in the dual function

$$h(\lambda) = \frac{1}{2} \lambda^T J J^T \lambda + \lambda^T (\dot{x} - J J^T \lambda) = \lambda^T \dot{x} - \frac{1}{2} \lambda^T J J^T \lambda.$$

To solve the dual optimization problem, we have to find $\lambda^* = \operatorname{argmax}_\lambda h(\lambda)$. We do this again by setting the derivative of $h(\lambda)$ to zero, i.e.,

$$\frac{\partial h(\lambda)}{\partial \lambda} = \dot{x}^T - \lambda^T J J^T = 0,$$

which yields

$$\lambda^T = \dot{x}^T (J J^T)^{-1}.$$

Setting $\lambda^T$ back into the optimal solution $l(\lambda)$, we obtain the solution

$$\dot{q}^T = \dot{x}^T (J J^T)^{-1} J, \qquad \dot{q} = J^T (J J^T)^{-1} \dot{x},$$

which corresponds to the solution we know from the lecture.
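As a small check of this closed-form solution (a sketch added here, with a made-up Jacobian and task-space velocity), the dual solution $\dot{q} = J^T (J J^T)^{-1} \dot{x}$ satisfies the task constraint and coincides with applying the Moore-Penrose pseudo-inverse of $J$ to $\dot{x}$:

import numpy as np

# Hypothetical 3-DoF arm with a 2-D task space; J and x_dot are made-up numbers.
rng = np.random.default_rng(1)
J = rng.standard_normal((2, 3))        # task Jacobian
x_dot = np.array([0.1, -0.05])         # desired task-space velocity

# Closed-form solution from the dual: q_dot = J^T (J J^T)^{-1} x_dot
q_dot = J.T @ np.linalg.solve(J @ J.T, x_dot)

print("task constraint x_dot - J q_dot:", x_dot - J @ q_dot)       # ~ 0
print("matches numpy's pseudo-inverse:",
      np.allclose(q_dot, np.linalg.pinv(J) @ x_dot))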
2.2 Gradient Descent with Pre-Defined Metric

In gradient descent with a pre-defined metric we want to find the update direction $\Delta\theta$ for a parameter vector $\theta$ that is most similar to the standard gradient $g = \nabla_\theta f(\theta)$, where $f(\theta)$ is the function we want to optimize, while having a limited length under the metric $L_M(\Delta\theta) = \Delta\theta^T M \Delta\theta \leq \epsilon$, where $M$ is a positive definite, symmetric matrix. Hence, the optimization problem is defined by

$$\Delta\theta^* = \operatorname{argmax}_{\Delta\theta} \Delta\theta^T g \quad \text{s.t.: } \Delta\theta^T M \Delta\theta \leq \epsilon.$$

The objective as well as the constraint are convex; hence, the KKT conditions are satisfied. The Lagrangian of the optimization problem is given by

$$L(\Delta\theta, \lambda) = \Delta\theta^T g + \lambda (\epsilon - \Delta\theta^T M \Delta\theta).$$

We obtain the optimal solution for $\Delta\theta$ by setting the derivative of the Lagrangian to 0,

$$\frac{\partial L(\Delta\theta, \lambda)}{\partial \Delta\theta} = g^T - 2 \lambda \Delta\theta^T M = 0^T,$$

which yields

$$\Delta\theta = \frac{M^{-1} g}{2 \lambda}.$$

Note that we can always invert $M$, as we assumed that $M$ is positive definite. The previous equation already reveals a basic result for the gradient update: the update is always proportional to the standard gradient transformed by the inverse of the metric matrix $M$. We still have to compute the learning rate, which is determined by the Lagrangian multiplier $\lambda$. The multiplier is again obtained by optimizing the dual function. The dual function is given by

$$h(\lambda) = \frac{g^T M^{-1} g}{2 \lambda} + \lambda \epsilon - \frac{g^T M^{-1} M M^{-1} g}{4 \lambda} = \lambda \epsilon + \frac{g^T M^{-1} g}{4 \lambda}.$$

We optimize the dual function again by setting its derivative to 0,

$$\frac{\partial h(\lambda)}{\partial \lambda} = \epsilon - \frac{g^T M^{-1} g}{4 \lambda^2} = 0.$$

Rearranging terms yields

$$\lambda^2 = \frac{g^T M^{-1} g}{4 \epsilon}.$$

As the Lagrangian multiplier $\lambda$ belongs to an inequality constraint and is therefore restricted to be positive, we have a unique solution at

$$\lambda = \sqrt{\frac{g^T M^{-1} g}{4 \epsilon}}.$$

Hence, we can also obtain the learning rate for the specified metric in closed form. Note that the learning rate depends on the bound $\epsilon$, which intuitively has to be the case. This derivation is the basis for all natural gradient algorithms, where $M$ is typically the Fisher information matrix.
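The closed-form update and multiplier can be written down directly. The following sketch (added for illustration, with arbitrary made-up values for g, M and epsilon) checks that the resulting step exactly meets the bound $\Delta\theta^T M \Delta\theta = \epsilon$:

import numpy as np

# Made-up gradient g, metric M and bound eps; in natural gradient methods M would
# typically be the Fisher information matrix.
rng = np.random.default_rng(2)
dim = 4
g = rng.standard_normal(dim)                  # standard gradient
B = rng.standard_normal((dim, dim))
M = B @ B.T + dim * np.eye(dim)               # symmetric, positive definite metric
eps = 0.01                                    # bound on Delta^T M Delta

Minv_g = np.linalg.solve(M, g)                # M^{-1} g
lam = np.sqrt(g @ Minv_g / (4.0 * eps))       # optimal multiplier lambda
delta = Minv_g / (2.0 * lam)                  # update Delta-theta = M^{-1} g / (2 lambda)

print("step length Delta^T M Delta:", delta @ M @ delta, "vs. bound eps:", eps)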
2.3 Relative Entropy Policy Search

TODO

References

[Bishop, 2006] Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.

[Wikipedia, a] Wikipedia. Karush-Kuhn-Tucker conditions.

[Wikipedia, b] Wikipedia. Slater condition.