Recent Developments in Model-based Derivative-free Optimization
Seppo Pulkkinen
April 23, 2010
Introduction

Problem definition
The problem we consider is a nonlinear optimization problem with constraints:
  min_{x ∈ Rⁿ} f(x),   f : Rⁿ → R,
  subject to  l_i ≤ x_i ≤ u_i,  i = 1, …, n,  and  Ax ≤ b.
We also assume that the objective function f
- is nonconvex,
- is not necessarily differentiable,
- is expensive to evaluate.
A Motivating Example: Image Matching

Problem
Given two consecutive images and a region in the first image, find the matching region in the second image.

Practical considerations:
- It is difficult to find invariant measures between the images.
- The transformations between the images can be large.
- The images may be contaminated with noise.
Image Matching: A Simple Mathematical Formulation

Problem definition
The aim is to find the transformation parameters p giving the best fit between the matched regions by solving
  min_p Σ_{x ∈ Ω} |I₁(x) − I₂(x + T(x, p))|².
This is a nonconvex nonlinear optimization problem having
- a large number of local minima,
- a nondifferentiable objective function,
- possibly constraints enforcing the smoothness of the solution.
Implication: local gradient-based methods are not usable for such problems.
Problems with Noisy Data

[Figure: the same function sampled without noise (smooth) and with noise.]

Consider a forward-difference approximation of the form
  ∂f/∂x_i ≈ [f(x + h eᵢ) − f(x)] / h.
Clearly, any local gradient approximation becomes unusable in the presence of noise: a perturbation of size ε in the function values induces an error of order ε/h in the difference quotient.
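This breakdown is easy to demonstrate in a few lines. A minimal sketch, assuming nothing beyond the Python standard library (the test function and the noise amplitude are illustrative, not from the slides):

```python
import random

def forward_diff(f, x, h):
    """Forward-difference approximation of df/dx at x with step h."""
    return (f(x + h) - f(x)) / h

# Smooth test function with known derivative: f(t) = t^2, so f'(1) = 2.
f_smooth = lambda t: t * t

# The same function contaminated with uniform noise of amplitude 1e-3.
random.seed(0)
f_noisy = lambda t: t * t + random.uniform(-1e-3, 1e-3)

h = 1e-6
err_smooth = abs(forward_diff(f_smooth, 1.0, h) - 2.0)
err_noisy = abs(forward_diff(f_noisy, 1.0, h) - 2.0)

# A noise level eps in the function values produces an error of order
# eps/h in the difference quotient, which dwarfs the true derivative.
print(err_smooth)  # tiny pure discretization error
print(err_noisy)   # several orders of magnitude larger
```

With ε ≈ 10⁻³ and h = 10⁻⁶ the noise-induced error can reach order 10³, so the sign of the approximated gradient is essentially random.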
The Traditional Approach: Taylor Series Approximations

Most gradient-based algorithms employ the quadratic Taylor series approximation
  m(x + s) = f(x) + ∇f(x)ᵀs + ½ sᵀH(x)s.
This can be expressed in the more generic form
  m(x + s) = c + bᵀs + ½ sᵀAs,
where c ∈ R, b ∈ Rⁿ and A ∈ Rⁿˣⁿ.
Problem: can the model parameters c, b and A be estimated without evaluating derivatives?
Interpolation-based Methods

An alternative approach: determine the model parameters c, b and A from the interpolation equations
  m(x + yⁱ) = f(x + yⁱ),  i = 1, …, m,
where Y = {y¹, …, yᵐ} is the set of interpolation points.
A model defined by the above equations
- only requires that f can be evaluated at the given points; no derivatives are needed,
- is not restricted to a small neighbourhood of x.
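In one dimension the interpolation equations form a small linear system that can be solved directly. A minimal sketch under stated assumptions (pure Python; `quad_model_1d`, the solver helper, and the sample offsets are illustrative): fitting m(x + s) = c + b·s + ½·a·s² through three sampled points.

```python
def solve3(A, b):
    """Solve a small linear system A x = b by Gaussian elimination
    with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            t = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= t * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def quad_model_1d(f, x, offsets):
    """Determine c, b, a in m(x+s) = c + b*s + 0.5*a*s^2 from the
    interpolation conditions m(x+s_i) = f(x+s_i); no derivatives used."""
    A = [[1.0, s, 0.5 * s * s] for s in offsets]
    rhs = [f(x + s) for s in offsets]
    return solve3(A, rhs)

# For f(t) = 3 + 2t + t^2 the exact parameters at x = 0 are
# c = 3, b = 2 and a = 2 (the 0.5*a*s^2 convention doubles the
# coefficient of t^2).
c, b, a = quad_model_1d(lambda t: 3 + 2 * t + t * t, 0.0, [-1.0, 0.0, 1.0])
```

Because the test function is itself quadratic, the recovered model is exact; for a general f the model is only an approximation near x.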
Limitations of Quadratic Models

Quadratic interpolation models have several limitations:
- O(n²) parameters must be determined from the interpolation equations.
- Updating the model parameters has complexity O(n⁴).
- The interpolation set must be well-poised, which leads to complex geometric conditions.
- Quadratic models are essentially local: they cannot model the multimodal behaviour of a nonconvex function.
Problem: is there a model function that
- requires only O(n) parameters,
- requires only mild conditions for well-poisedness,
- can approximate functions with multiple local minima?
An Improvement: Underdetermined Quadratic Models

An alternative approach is to determine only the diagonal elements of the matrix A from the interpolation equations. The off-diagonal elements of A are approximated by the minimum Frobenius norm method (Powell, 2004; Wild, 2008).
- Requires only 2n + 1 model parameters.
- The amount of work per iteration is O(n³) (Powell, 2004).
- Analogous to quasi-Newton methods.
However, this approach still has the limitations of quadratic models: it gives only local approximations.
A Novel Approach: Radial Basis Function Models

RBF model function
A typical radial basis function model is of the form
  m(x + s) = Σ_{i=1}^{|Y|} λᵢ φ(‖s − yⁱ‖) + p(s),
where the λᵢ are weighting coefficients and p is a low-order polynomial.
Such a model function addresses the questions posed above:
- The minimum number of interpolation points is n + 2.
- An arbitrary number of interpolation points can be used.
- It is well suited to approximating functions with multiple minima.
Radial Basis Functions: Overview

The choice of the radial basis function φ is crucial for the accuracy and numerical stability of the approximation. Commonly used radial basis functions (r ≥ 0):
  φ(r) = r                  linear
  φ(r) = r³                 cubic
  φ(r) = r² log r           thin plate spline
  φ(r) = (γr² + 1)^(3/2)    multiquadric
  φ(r) = exp(−γr²)          Gaussian
Other important applications of radial basis functions include
- solving partial differential equations,
- neural networks.
An Illustrative Example

[Figure: contour plots of the Rastrigin function (left) and its RBF interpolant built from 30 randomly chosen interpolation points (right).]
An Illustrative Example

[Figure: surface plots of the function and the RBF model.]

Radial basis function models yield
- global approximations of the objective function,
- increasingly accurate approximations as the number of interpolation points increases.
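The interpolation system behind such a model (RBF block, polynomial tail, and side conditions on the coefficients λ) is easy to assemble in one dimension. A minimal sketch under stated assumptions: cubic φ(r) = r³, a linear tail, and illustrative sample points; this is pure Python, not the code behind the figures.

```python
import math

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            t = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= t * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def rbf_fit_1d(ys, fs, phi):
    """Fit m(s) = sum_i lam_i*phi(|s - y_i|) + c0 + c1*s subject to the
    side conditions sum_i lam_i = 0 and sum_i lam_i*y_i = 0, which make
    the interpolant with a linear tail unique."""
    m = len(ys)
    n = m + 2
    A = [[0.0] * n for _ in range(n)]
    rhs = [0.0] * n
    for i in range(m):
        for j in range(m):
            A[i][j] = phi(abs(ys[i] - ys[j]))
        A[i][m], A[i][m + 1] = 1.0, ys[i]    # polynomial tail block
        A[m][i], A[m + 1][i] = 1.0, ys[i]    # side conditions (transpose)
        rhs[i] = fs[i]
    sol = solve(A, rhs)
    lam, c0, c1 = sol[:m], sol[m], sol[m + 1]
    return lambda s: (sum(l * phi(abs(s - y)) for l, y in zip(lam, ys))
                      + c0 + c1 * s)

cubic = lambda r: r ** 3
ys = [-1.0, -0.4, 0.1, 0.6, 1.0]
f = lambda t: math.sin(3.0 * t) + t * t   # a multimodal test function
model = rbf_fit_1d(ys, [f(y) for y in ys], cubic)
# The model reproduces f exactly at every interpolation point.
```

Adding more sample points enlarges the linear system but leaves the structure unchanged, which is why RBF models accept an arbitrary number of interpolation points.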
Limiting Functions of Flat RBF Models (1)

Examples of RBF models with an adjustable shape parameter γ:
  φ(r) = (γr² + 1)^(3/2)   multiquadric
  φ(r) = exp(−γr²)         Gaussian
The limit γ → 0 (Fornberg et al., 2004): when |Y| = (n+1)(n+2)/2, the limit γ → 0 yields, under certain conditions, a quadratic polynomial, i.e.
  lim_{γ→0} [ Σ_{i=1}^{|Y|} λᵢ φ(‖s − yⁱ‖, γ) + p(s) ] = ½ sᵀAs + bᵀs + c.
Implication: RBF models yield accurate local approximations by letting γ → 0 near a minimum.
Limiting Functions of Flat RBF Models (2)

[Figure: contour plots of the function, the multiquadric RBF model with γ = 5, the multiquadric RBF model with γ = 0.05, and the quadratic model; as γ decreases, the RBF model approaches the quadratic model.]
Geometric Conditions for RBF Interpolation (1)

We are particularly interested in multiquadric RBF models
  m(x + s) = Σ_{i=1}^{|Y|} λᵢ (γ‖s − yⁱ‖² + 1)^(3/2) + gᵀs + c,
where the linear polynomial tail
- guarantees a unique interpolant (Powell, 1992),
- provides an estimate of the function gradient.
The interpolation equations uniquely determining the model parameters are
  m(x + yⁱ) = f(x + yⁱ),  i = 1, …, |Y|,
  Σ_{i=1}^{|Y|} λᵢ p_j(yⁱ) = 0,  j = 1, …, n + 1.
Geometric Conditions for RBF Interpolation (2)

We write the linear tail as p(s) = Σ_{i=1}^{n+1} cᵢ pᵢ(s), where {p₁, …, p_{n+1}} spans the linear polynomial space. The interpolation equations in matrix form are
  [ Φ   Π ] [ λ ]   [ F ]
  [ Πᵀ  0 ] [ c ] = [ 0 ],
where
  λ = (λ₁, …, λ_{|Y|})ᵀ,  c = (c₁, …, c_{n+1})ᵀ,  F = (f(x + y¹), …, f(x + y^{|Y|}))ᵀ,
and Φᵢⱼ = φ(‖yⁱ − yʲ‖), Πᵢⱼ = pⱼ(yⁱ).
Geometric Conditions for RBF Interpolation (3)

Necessary condition
A necessary condition for a unique solution of the interpolation equations is that rank(Π) = n + 1, or equivalently, that at least one subset of n + 1 interpolation points is affinely independent.
Two approaches for ensuring this poisedness condition can be found in the literature:
1. Apply correction steps that improve the quality of the model (Powell, 2004; Scheinberg et al., 2009).
2. Avoid inserting any bad interpolation points in the first place (Marazzi and Nocedal, 2002).
The Trust Region Framework (1)

Idea: define a region in which the model can be considered reliable.

[Figure: contour lines of f with the trust region around the current iterate.]

An iterative algorithm: each new iterate x_{k+1} is defined by
  x_{k+1} = x_k + s_k,
where s_k minimizes the model within the current trust region.
The Trust Region Framework (2)

Mathematical formulation: solve the minimization problem
  s_k = arg min_s { m_k(x_k + s) : x_k + s ∈ B_k },
where the spherical trust region B_k is defined as
  B_k = { x ∈ F : ‖x − x_k‖ ≤ Δ_k }
and F is the set of feasible points.
Also adjust the trust region radius Δ_k, if necessary:
- If the step s_k leads to a sufficiently smaller function value, increase the radius: Δ_{k+1} > Δ_k.
- Otherwise, shrink the trust region: Δ_{k+1} < Δ_k.
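The whole loop fits in a few lines in one dimension. A minimal sketch under stated assumptions: the model is the quadratic interpolating f at x − Δ, x, x + Δ, and the acceptance thresholds 0.25/0.75 and factors 0.5/2 are conventional textbook choices, not taken from the slides.

```python
def tr_minimize_1d(f, x0, delta0=1.0, iters=30):
    """Minimal derivative-free trust-region loop in one dimension."""
    x, delta = x0, delta0
    for _ in range(iters):
        # Build the interpolating quadratic m(s) = fc + b*s + 0.5*a*s^2.
        fl, fc, fr = f(x - delta), f(x), f(x + delta)
        b = (fr - fl) / (2 * delta)           # model gradient
        a = (fr - 2 * fc + fl) / delta ** 2   # model curvature
        # Minimize the model over the trust region [-delta, delta].
        if a > 0 and abs(b / a) <= delta:
            s = -b / a                        # interior minimizer
        else:
            s = -delta if b > 0 else delta    # boundary minimizer
        pred = -(b * s + 0.5 * a * s ** 2)    # predicted decrease
        fnew = f(x + s)
        ratio = (fc - fnew) / pred if pred > 0 else -1.0
        if ratio > 0:                         # accept the step
            x = x + s
        if ratio >= 0.75:                     # good agreement: grow
            delta *= 2.0
        elif ratio < 0.25:                    # poor agreement: shrink
            delta *= 0.5
    return x

# On f(t) = (t - 2)^2 + 1 the model is exact and the loop homes in
# on the minimizer t = 2.
xmin = tr_minimize_1d(lambda t: (t - 2.0) ** 2 + 1.0, x0=-3.0)
```

The agreement ratio (actual decrease over predicted decrease) drives both the accept/reject decision and the radius update, exactly the two cases listed above.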
Updating the Model under Geometric Constraints

The constraint condition:
- Let S = span({y¹, …, y^{n+1}} \ {y⁻}) and compute a unit vector n̂ orthogonal to S.
- The feasible region containing sufficiently linearly independent points is then
    F = { x ∈ B_k : xᵀn̂ > γ‖x‖ }.

[Figure: the trust region with the infeasible region shaded.]

The idea of the algorithm: replace some interpolation point y⁻ with a better point y⁺, for example y⁺ = s_k.
The Special Structure of RBF Models

Motivation
RBF models are linear combinations of convex and concave functions. Hence, it is natural to express the model function in the decomposed form
  m(x) = g(x) − h(x),
where g and h are convex.

Implications
This special structure allows the development of efficient d.c. (diff-convex) algorithms for minimizing the RBF model function.
Diff-convex Decompositions of RBF Models

The following decompositions of RBF models have been proposed in the literature (Hoai An; Vaz and Vicente, 2009):

Separation of convex and concave terms:
  g(x) = Σ_{λᵢ ≥ 0} λᵢ φ(‖x − yⁱ‖) + p(x),
  h(x) = Σ_{λᵢ < 0} (−λᵢ) φ(‖x − yⁱ‖).

Regularization approach:
  g(x) = (ρ/2)‖x‖² + p(x),
  h(x) = (ρ/2)‖x‖² − Σ_{i=1}^{|Y|} λᵢ φ(‖x − yⁱ‖).
The d.c. Algorithm: Preliminaries

[Figure: the convex part g(x), the concave part −h(x), and their sum g(x) − h(x).]

The idea of the d.c. algorithm: replace the concave term −h(x) of f(x) = g(x) − h(x) with its linear approximation:
  f(x) ≈ g(x) − (h(x₀) + ∇h(x₀)ᵀ(x − x₀)).
The d.c. Algorithm: Mathematical Formulation

Statement of the algorithm: iteratively solve the problem
  x_{k+1} = arg min_{x ∈ F} { g(x) − (h(x_k) + (x − x_k)ᵀy_k) },
where y_k = ∇h(x_k).
Using this formulation is beneficial if the new problem
- is easier to solve,
- can be solved more efficiently than the original problem.
Convexification: An Illustrative Example

Idea: convexify the function by adding to it a convex term (ρ/2)‖x‖² with a large enough parameter ρ.
The d.c. Algorithm: Regularization Approach

With the regularized d.c. decomposition, solving the linearized minimization problem
  x_{k+1} = arg min_{x ∈ F} { g(x) − (h(x_k) + (x − x_k)ᵀy_k) }
is equivalent to solving
  x_{k+1} = arg min_{x ∈ F} ‖ x − (y_k − g)/ρ ‖,
which is the projection of the point (y_k − g)/ρ onto the set F (here g denotes the gradient of the linear tail p).
We obtain
- a gradient-descent-type method requiring no line search,
- a convenient way to handle constraints.
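The projection form of the iteration is easy to demonstrate in one dimension. A minimal sketch under stated assumptions: the RBF part is replaced by a generic double-well stand-in m(x) = x⁴ − 2x², the linear tail is zero (so the tail gradient drops out of the projected point), F is an interval, and ρ = 25 convexifies h on it.

```python
def dca_regularized(dm, x0, rho, lo, hi, iters=200):
    """DCA with the regularized split g = (rho/2)x^2, h = (rho/2)x^2 - m.
    With a zero linear tail, nabla h(x) = rho*x - m'(x), so the convex
    subproblem reduces to projecting x_k - m'(x_k)/rho onto the feasible
    interval [lo, hi]: a gradient step with no line search."""
    x = x0
    for _ in range(iters):
        y = x - dm(x) / rho        # unconstrained minimizer of the linearization
        x = min(max(y, lo), hi)    # projection onto F = [lo, hi]
    return x

# Double-well model m(x) = x^4 - 2x^2 with m'(x) = 4x^3 - 4x; rho = 25
# exceeds max |m''| on the region of interest, so h is convex there.
x_star = dca_regularized(lambda t: 4 * t ** 3 - 4 * t, x0=0.3, rho=25.0,
                         lo=0.2, hi=1.5)
# Converges to the constrained minimizer x = 1.
```

Each iteration only evaluates one gradient of the smooth part and one projection, which is what makes the regularized decomposition attractive for constrained problems.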
How to Determine the Regularization Parameter ρ?

A sufficient condition for convexity of h
The convexity of h within the trust region B is guaranteed if
  ρ ≥ max_{x ∈ B} ‖∇²ĥ(x)‖,
where
  ĥ(x) = Σ_{i=1}^{|Y|} λᵢ φ(‖x − yⁱ‖).
It is possible to derive an upper bound for the minimum ρ that ensures convexity. When this bound gives an accurate (tight) estimate of ρ, the algorithm converges rapidly.
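In practice such a bound can also be estimated numerically. A crude sketch (sampling |ĥ''| over a one-dimensional trust region; the double-well stand-in for ĥ and the 10% safety factor are illustrative assumptions, not the analytic bound from the slides):

```python
def estimate_rho(ddh, lo, hi, samples=1000, safety=1.1):
    """Crude upper bound for rho: sample |h''| over the trust region
    [lo, hi] and take the maximum, inflated by a safety factor."""
    step = (hi - lo) / samples
    peak = max(abs(ddh(lo + i * step)) for i in range(samples + 1))
    return safety * peak

# For hhat(x) = x^4 - 2x^2 we have hhat''(x) = 12x^2 - 4; on [-1.5, 1.5]
# the maximum of |hhat''| is 23, attained at the endpoints.
rho = estimate_rho(lambda t: 12 * t * t - 4, -1.5, 1.5)
```

Too small a ρ loses convexity of h; too large a ρ shrinks the effective step (y_k − g)/ρ and slows convergence, which is why a tight estimate matters.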
Thank you! Questions?