Simplicial Global Optimization
Julius Žilinskas
Vilnius University, Lithuania
http://web.vu.lt/mii/j.zilinskas
Global optimization
Find f* = min_{x ∈ A} f(x) and x* ∈ A such that f(x*) = f*, where A ⊆ R^n.
Example: n = 1, A an interval of R; objective function f(x) = Σ_{j=1}^{5} j sin((j+1)x + j), with the global minimum f* attained at x* = 5.798.
Local optimization
A point x* is a local minimum point if f(x*) ≤ f(x) for x ∈ N, where N is a neighborhood of x*. A local minimum point can be found by stepping in the direction of steepest descent. Without additional information one cannot say whether a local minimum is global: how do we know if we are in the deepest hole?
Algorithms for global optimization
With guaranteed accuracy:
- Lipschitz optimization,
- branch and bound algorithms.
Randomized algorithms and heuristics:
- random search,
- multi-start,
- clustering methods,
- generalized descent methods,
- simulated annealing,
- evolutionary algorithms (genetic algorithms, evolution strategies),
- swarm-based optimization (particle swarm, ant colony).
Criteria of performance of global optimization algorithms
Speed: time of optimization, or number of objective function (and sometimes gradient, bounding, and other) evaluations. The two criteria are equivalent when the objective function is expensive: its evaluation takes much more time than the auxiliary computations of the algorithm.
Best function value found.
Reliability: how often problems are solved with prescribed accuracy.
Parallel global optimization
Global optimization problems are classified as difficult in the sense of algorithmic complexity theory, and global optimization algorithms are computationally intensive. When the computing power of ordinary computers is not sufficient to solve a practical global optimization problem, high performance parallel computers and computational grids may be helpful. An algorithm is more widely applicable when its parallel implementation is available, because larger practical problems may then be solved by means of parallel computations.
Criteria of performance of parallel algorithms
Efficiency of parallelization can be evaluated using criteria that take into account the optimization time and the number of processors. A commonly used criterion of parallel algorithms is the speedup: s_p = t_1 / t_p, where t_p is the time used by the algorithm implemented on p processors. The speedup divided by the number of processors is called the efficiency: e_p = s_p / p.
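The two ratios above can be computed directly; a minimal helper sketch (function names are illustrative, not from the slides):

```python
def speedup(t1, tp):
    """Speedup s_p = t_1 / t_p of a run on p processors
    relative to the single-processor time t_1."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Efficiency e_p = s_p / p; 1.0 means perfect parallelization."""
    return speedup(t1, tp) / p
```

For example, a run taking 100 s on one processor and 25 s on four processors has speedup 4 and efficiency 1.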
Covering methods
Detect and discard non-promising sub-regions using:
- interval arithmetic: f(X) ⊇ {f(x) : x ∈ X}, where X is an interval vector,
- the Lipschitz condition |f(x) − f(y)| ≤ L‖x − y‖,
- convex envelopes,
- statistical estimates,
- heuristic estimates,
- ad hoc algorithms.
May be implemented using branch and bound algorithms. May guarantee accuracy for some classes of problems: the found value f̃ satisfies f̃ ≤ min_{x ∈ D} f(x) + ε.
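A sketch of the interval-arithmetic idea on a hypothetical function f(x) = x² − 2x (not one of the slide examples): a naive interval extension encloses the true range, usually with overestimation, because each occurrence of x is treated independently (the dependency problem).

```python
def interval_add(a, b):
    # sum of intervals a = (lo, hi), b = (lo, hi)
    return (a[0] + b[0], a[1] + b[1])

def interval_mul(a, b):
    # product of intervals: extremes occur at endpoint products
    p = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(p), max(p))

def f_interval(X):
    """Naive interval extension of f(x) = x^2 - 2x over the interval X."""
    sq = interval_mul(X, X)
    lin = interval_mul((-2.0, -2.0), X)
    return interval_add(sq, lin)
```

On X = [0, 1] the true range of x² − 2x is [−1, 0]; the naive extension returns the wider but valid enclosure [−2, 1], which still suffices to discard sub-regions whose lower bound exceeds the best known value.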
Branch and bound
An iteration of a classical branch and bound algorithm processes a node in the search tree representing a not yet explored subspace. An iteration has three main components: selection of the node to process, branching, and bound calculation. The bound for the objective function over a subset of feasible solutions is evaluated and compared with the best objective function value found so far. If the evaluated bound is worse than the known function value, the subset cannot contain optimal solutions and the branch describing the subset can be pruned. The fundamental aspect of branch and bound is that the earlier a branch is pruned, the smaller the number of complete solutions that need to be explicitly evaluated.
General algorithm for partition and branch-and-bound
Cover D by L = {L_j : D ⊆ ∪ L_j, j = 1, …, m} using the covering rule.
while L ≠ ∅ do
  Choose I ∈ L using the selection rule, L ← L \ {I}.
  if the bounding rule is satisfied for I then
    Partition I into p subsets I_j using the subdivision rule.
    for j = 1, …, p do
      if the bounding rule is satisfied for I_j then
        L ← L ∪ {I_j}.
      end if
    end for
  end if
end while
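The rules above can be sketched for the one-dimensional Lipschitz case (a minimal illustration assuming best-first selection via a priority queue, bisection as the subdivision rule, and a conservative vertex-based lower bound; not the exact algorithm of any later slide):

```python
import heapq

def lipschitz_bnb(f, lo, hi, L, eps=1e-3):
    """Best-first branch and bound for a 1-D function with
    Lipschitz constant L. Lower bound on [a, b]:
    min(f(a), f(b)) - L*(b - a), valid though conservative."""
    best_x, best_f = (lo, f(lo)) if f(lo) <= f(hi) else (hi, f(hi))
    # selection rule: best first, keyed by the lower bound
    heap = [(min(f(lo), f(hi)) - L * (hi - lo), lo, hi)]
    while heap:
        bound, a, b = heapq.heappop(heap)
        if bound > best_f - eps:          # bounding rule: prune
            continue
        m = 0.5 * (a + b)                 # subdivision rule: bisect
        fm = f(m)
        if fm < best_f:
            best_x, best_f = m, fm
        for x0, x1 in ((a, m), (m, b)):
            lb = min(f(x0), f(x1)) - L * (x1 - x0)
            if lb <= best_f - eps:        # keep only promising children
                heapq.heappush(heap, (lb, x0, x1))
    return best_x, best_f
```

On f(x) = (x − 2)² over [0, 5] with L = 10 the search concentrates subdivisions around x = 2 and prunes the rest of the interval early.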
Rules of branch and bound algorithms for global optimization
The rules of covering and subdivision depend on the type of partitions used: hyper-rectangular, simplicial, etc.
Selection rules:
- Best first or statistical (the best candidate): priority queue.
- Depth first (the youngest candidate): stack, or without storing.
- Breadth first (the oldest candidate): queue.
The bounding rule describes how the bounds for the minimum are found. For the upper bound, the best currently found function value may be accepted. The lower bound may be estimated using:
- convex envelopes of the objective function,
- the Lipschitz condition |f(x) − f(y)| ≤ L‖x − y‖,
- interval arithmetic: f(X) ⊇ {f(x) : x ∈ X}.
Rules of covering a rectangular feasible region and branching
Lipschitz optimization
Lipschitz optimization is one of the most deeply investigated subjects of global optimization. It is based on the assumption that the slope of the objective function is bounded. The multivariate function f : D → R, D ⊂ R^n, is said to be Lipschitz if it satisfies the condition |f(x) − f(y)| ≤ L‖x − y‖, x, y ∈ D, where L > 0 is a constant called the Lipschitz constant, the domain D is compact, and ‖·‖ denotes a norm. A branch and bound algorithm with Lipschitz bounds may be built: if the evaluated bound is worse than the best known function value, the sub-region cannot contain optimal solutions and the branch describing it can be pruned.
Lipschitz bounds
The efficiency of the branch and bound technique depends on the bound calculation. The upper bound for the minimum is the smallest value of the function at the vertices x_v: UB(I) = min_{x_v} f(x_v). The lower bound for the minimum is evaluated exploiting the Lipschitz condition f(x) ≥ f(y) − L‖x − y‖. The lower bound can be derived as
μ = max_{x_v ∈ V(I)} { f(x_v) − L max_{x ∈ I} ‖x − x_v‖ }.
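A sketch of the vertex-based lower bound μ over one simplex, assuming the Euclidean norm: since ‖x − x_v‖ is convex, its maximum over the simplex is attained at a vertex, so the inner maximization can run over the vertex set.

```python
import math

def lipschitz_lower_bound(vertices, fvals, L):
    """mu = max over vertices v of f(v) - L * max_{x in simplex} ||x - v||.
    The farthest point of the simplex from v is one of its vertices,
    because ||x - v|| is convex and the simplex is their convex hull."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    mu = -math.inf
    for v, fv in zip(vertices, fvals):
        radius = max(dist(v, w) for w in vertices)
        mu = max(mu, fv - L * radius)
    return mu
```

For f(x) = x_1 + x_2 on the unit triangle (Lipschitz constant √2) the bound gives μ = −1, a valid underestimate of the true minimum 0.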
Comparison of selection strategies in Lipschitz optimization
Paulavičius, Žilinskas, Grothey. Optimization Letters, 4(2), 173–183, 2010, doi:10.1007/s11590-009-0156-3
Tables: average number of function evaluations and average total number of simplices, for several dimensions n, under the Best First, Statistical, Depth First, and Breadth First selection strategies.
Tables: average maximal number of simplices in the list, average execution time, and average ratio t_BestFound / t_AllTime, for the same dimensions and the four selection strategies.
Simplex
An n-simplex is the convex hull of a set of (n + 1) affinely independent points in n-dimensional Euclidean space. A one-simplex is a line segment, a two-simplex is a triangle, and a three-simplex is a tetrahedron.
Simplicial vs rectangular partitions
A simplex is the polyhedron in n-dimensional space which has the minimal number of vertices. Simplicial partitions are preferable when values of the objective function at the vertices of partitions are used to evaluate sub-regions.
Table: numbers of function evaluations in Lipschitz global optimization with rectangular and simplicial partitions on a set of test functions.
Subdivision of simplices
Into similar simplices, or into two through the midpoint of the longest edge.
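Bisection through the midpoint of the longest edge can be sketched as follows: each child keeps all vertices except one endpoint of the longest edge, which is replaced by the midpoint.

```python
import math

def bisect_longest_edge(simplex):
    """Split a simplex (list of vertex tuples) into two through
    the midpoint of its longest edge."""
    m = len(simplex)
    # locate the longest edge (i, j)
    best, best_len = (0, 1), -1.0
    for i in range(m):
        for j in range(i + 1, m):
            d = math.dist(simplex[i], simplex[j])
            if d > best_len:
                best_len, best = d, (i, j)
    i, j = best
    mid = tuple(0.5 * (a + b) for a, b in zip(simplex[i], simplex[j]))
    left = list(simplex); left[i] = mid    # replace one endpoint
    right = list(simplex); right[j] = mid  # replace the other endpoint
    return left, right
```

Splitting the unit triangle (0,0), (1,0), (0,1) bisects its hypotenuse at (0.5, 0.5), producing two triangles of equal area.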
Covering of feasible region
Very often a feasible region in global optimization is a hyper-rectangle defined by intervals of the variables. The feasible region is face-to-face vertex triangulated: it is partitioned into n-simplices whose vertices are also vertices of the feasible region. A feasible region defined by linear inequality constraints may also be vertex triangulated; in this way constraints are handled by the initial covering.
Combinatorial approach for vertex triangulation of a hyper-rectangle
- General for any n.
- Deterministic; the number of simplices is known in advance: n!.
- All simplices are of equal hyper-volume: 1/n! of the hyper-volume of the hyper-rectangle.
- By adding just one point at the middle of the diagonal of the hyper-rectangle, each simplex may be subdivided into two.
- May be easily parallelized.
Examples of combinatorial vertex triangulation of two- and three-dimensional hyper-rectangles
Algorithm for combinatorial vertex triangulation of a hyper-rectangle
for each permutation τ of (1, …, n) do
  for j = 1, …, n do
    v_1j ← D_j1
  end for
  for i = 1, …, n do
    for j = 1, …, n do
      v_(i+1)j ← v_ij
    end for
    v_(i+1)τ_i ← D_τ_i 2
  end for
end for
Here D_j1 and D_j2 denote the lower and upper bounds of the j-th variable.
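The loop above is the combinatorial (Kuhn) triangulation: each permutation τ traces a monotone path of hyper-rectangle vertices from the lower corner to the upper corner, and the path vertices form one simplex. A compact sketch:

```python
from itertools import permutations

def kuhn_triangulation(lower, upper):
    """Combinatorial vertex triangulation of [lower, upper]:
    one n-simplex per permutation, n! simplices in total."""
    n = len(lower)
    simplices = []
    for tau in permutations(range(n)):
        v = list(lower)
        simplex = [tuple(v)]
        for i in tau:          # raise coordinate tau_i to its upper bound
            v[i] = upper[i]
            simplex.append(tuple(v))
        simplices.append(simplex)
    return simplices
```

For the unit cube this produces 3! = 6 tetrahedra, each with 4 vertices, all sharing the diagonal from (0,0,0) to (1,1,1).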
Combinatorial vertex triangulation of a unit cube
The six permutations τ = (1,2,3), (1,3,2), (2,1,3), (2,3,1), (3,1,2), (3,2,1) each generate one simplex; its vertices v_1, …, v_4 are corners of the cube, starting at (0,0,0) and ending at (1,1,1).
Parallel algorithm for combinatorial vertex triangulation
for k = n!·rank/size, …, n!·(rank + 1)/size − 1 do
  for j = 1, …, n do
    τ_j ← j
  end for
  c ← 1
  for j = 2, …, n do
    c ← c·(j − 1)
    swap τ_(j − ⌊k/c⌋ mod j) with τ_j
  end for
  for j = 1, …, n do
    v_1j ← D_j1
  end for
  for i = 1, …, n do
    for j = 1, …, n do
      v_(i+1)j ← v_ij
    end for
    v_(i+1)τ_i ← D_τ_i 2
  end for
end for
Each process of rank rank out of size processes computes its share of the n! permutations directly from the index k, without communication.
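The inner swap loop unranks the k-th permutation via the factorial number system, so each process can generate its own block of permutations independently. A sketch of the unranking step (0-based indices):

```python
def unrank_permutation(k, n):
    """k-th permutation of (0, ..., n-1) in a fixed order, computed
    directly so that a process can generate only its share of indices
    k in [n!*rank/size, n!*(rank+1)/size) without enumerating the rest."""
    tau = list(range(n))
    c = 1
    for j in range(1, n):
        c *= j                        # c = j!
        # swap position j with a position chosen by the
        # factorial-number-system digit of k
        i = j - (k // c) % (j + 1)
        tau[i], tau[j] = tau[j], tau[i]
    return tau
```

Distinct indices k yield distinct permutations, so ranging k over 0, …, n! − 1 enumerates them all exactly once.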
Vertex triangulation of feasible region defined by linear inequality constraints I
min f(x), subject to bound constraints on the variables x_i and linear inequality constraints; the slide lists the vertices of the resulting simplices.
Vertex triangulation of feasible region defined by linear inequality constraints II
min f(x), subject to bound constraints on the variables x_i and a different pair of linear inequality constraints; the slide lists the vertices of the resulting simplices.
DISIMPL (DIviding SIMPLices) algorithm
The feasible region defined by linear constraints may be covered by simplices. Symmetries of objective functions may be exploited using linear inequality constraints.
Paulavičius, Žilinskas. Optimization Letters.
DISIMPL-v and DISIMPL-c algorithms
DISIMPL stands for DIviding SIMPLices.
Initial step: scale D to D = {x ∈ R^n : 0 ≤ x_i ≤ 1, i = 1, …, n}.
Covering step: cover D by n! simplices and evaluate f at all the vertices [DISIMPL-v] (at barycenters [DISIMPL-c]). For symmetric problems there may be one initial simplex.
Step 1: identify and select potentially optimal simplices.
Potentially optimal simplices (POS) [DISIMPL-v]: let S be the set of all simplices created by DISIMPL-v after k iterations, ε > 0, and f_min the current best function value. A simplex j ∈ S is POS if there exists L̃ > 0 such that
min_{v ∈ V(j)} f(v) − L̃ δ_j ≤ min_{v ∈ V(i)} f(v) − L̃ δ_i, for all i ∈ S,
min_{v ∈ V(j)} f(v) − L̃ δ_j ≤ f_min − ε |f_min|.
Here δ_j is the length of the longest edge of simplex j and V(j) is its vertex set.
Potentially optimal simplices (POS) [DISIMPL-c]: let S be the set of all simplices created by DISIMPL-c after k iterations, ε > 0, and f_min the current best function value. A simplex j ∈ S is POS if there exists L̃ > 0 such that
f(c_j) − L̃ δ_j ≤ f(c_i) − L̃ δ_i, for all i ∈ S,
f(c_j) − L̃ δ_j ≤ f_min − ε |f_min|.
Here c_j is the geometric center (centroid) of simplex j, and the measure δ_j is the maximal distance from c_j to its vertices.
Step 2: sample the function and divide the selected simplex into two by a hyper-plane passing through the midpoint of the longest edge and the vertices which do not belong to the longest edge, or into three simplices, dividing the longest edge too (DISIMPL-c).
Repeat Step 1 and Step 2 until some stopping criterion is satisfied.
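The POS test admits a direct check without searching over L̃: the two inequalities define an interval of admissible slopes, and a simplex is potentially optimal iff that interval is non-empty. A sketch for the centroid-based condition (helper name hypothetical; deltas and fvals hold δ_i and f(c_i) for all simplices):

```python
def potentially_optimal(deltas, fvals, f_min, eps=1e-4):
    """Return indices j for which some L > 0 satisfies both POS
    inequalities: f_j - L*d_j <= f_i - L*d_i for all i, and
    f_j - L*d_j <= f_min - eps*|f_min|."""
    pos = []
    for j, (dj, fj) in enumerate(zip(deltas, fvals)):
        if any(di == dj and fi < fj for di, fi in zip(deltas, fvals)):
            continue                       # dominated at equal measure
        # admissible L must exceed slopes to smaller simplices ...
        lo = max([0.0] + [(fj - fi) / (dj - di)
                          for di, fi in zip(deltas, fvals) if di < dj])
        # ... and stay below slopes to larger simplices
        hi = min([(fi - fj) / (di - dj)
                  for di, fi in zip(deltas, fvals) if di > dj],
                 default=float("inf"))
        # the epsilon condition additionally forces L >= need
        need = (fj - f_min + eps * abs(f_min)) / dj
        if max(lo, need) <= hi:
            pos.append(j)
    return pos
```

With measures (1.0, 0.5, 0.25) and centroid values (−1.0, −0.5, −2.0), the largest simplex and the one holding the best value are potentially optimal, while the middle one is dominated.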
Visualization of DISIMPL-v and DISIMPL-c
Figures: four successive iterations of both variants on an example problem; each slide reports the current f_min, the iteration number, and the number of function evaluations, with DISIMPL-v sampling vertices and DISIMPL-c sampling centroids.
Coping with symmetries of objective function
If interchanging the variables x_i and x_j does not change the value of the objective function, the function is symmetric over the hyper-plane x_i = x_j. Such symmetries can be avoided by setting linear constraints on these variables: x_i ≤ x_j. The resulting constrained search space may be vertex triangulated. By avoiding symmetries, the search space and the numbers of local and global minimizers are reduced.
Example of coping with symmetries of objective function
f(x) = Σ_{i=1}^n x_i²/4000 − Π_{i=1}^n cos(x_i/√i) + 1, D = [−5, 7]^n.
The objective function is symmetric over the hyper-planes x_i = x_j. Constraints for avoiding symmetries: x_1 ≤ x_2 ≤ … ≤ x_n. The resulting simplicial search space is the simplex with vertices
(−5, −5, …, −5), (−5, …, −5, 7), (−5, …, 7, 7), …, (7, 7, …, 7).
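Under the ordering constraints the search region is a single simplex whose vertices switch coordinates one by one from the lower to the upper bound; a sketch (bounds −5 and 7 taken from the slide, function name illustrative):

```python
def ordered_simplex_vertices(a, b, n):
    """Vertices of the symmetry-reduced region
    {x : a <= x_1 <= x_2 <= ... <= x_n <= b}:
    v_k = (a, ..., a, b, ..., b) with k trailing b's, k = 0..n."""
    return [tuple([a] * (n - k) + [b] * k) for k in range(n + 1)]
```

For n = 3 this yields the four vertices (−5,−5,−5), (−5,−5,7), (−5,7,7), (7,7,7), matching the matrix pattern above.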
Example of coping with symmetries: optimization of grillage-type foundations
A grillage foundation consists of separate beams, which are supported by piles or reside on other beams. The piles should be positioned so as to minimize the largest difference between the reactive forces and the limit magnitudes of the reactions.
Optimization of grillage-type foundations: formulation
A black-box problem: the values of the objective function are evaluated by an independent package which models the reactive forces in the grillage using the finite element method. The gradient may be estimated using sensitivity analysis implemented in the modelling package. The number of piles is n. The position of a pile is given by a real number, which is mapped to a two-dimensional position by the modelling package; possible values range from zero to the sum of the lengths of all beams, l. The feasible region is [0, l]^n. The characteristics of all piles are equal, so their interchange does not change the value of the objective function.
Optimization of grillage-type foundations: simplicial search space I
Let us constrain the problem to avoid symmetries of the objective function: x_1 ≤ x_2 ≤ … ≤ x_n. The simplicial search space is the simplex with vertices
(0, 0, …, 0), (0, …, 0, l), (0, …, l, l), …, (l, l, …, l).
The search space and the numbers of local and global minimizers are reduced n! times with respect to the original feasible region.
Optimization of grillage-type foundations: simplicial search space II
The distance between adjacent piles cannot be too small due to the specific capacities of a pile driver. Let us constrain the problem avoiding both symmetries and distances between two piles smaller than δ: x_2 − x_1 ≥ δ, …, x_n − x_{n−1} ≥ δ. The simplicial search space is the simplex with vertices
(0, δ, …, (n−1)δ), (0, δ, …, (n−2)δ, l), …, (l−(n−1)δ, l−(n−2)δ, …, l).
Least squares nonlinear regression
The optimization problem in nonlinear least squares regression:
min_{x ∈ D} Σ_{i=1}^m (y_i − φ(x, z_i))² = min_{x ∈ D} f(x),
where the measurements y_i at points z_i = (z_1i, z_2i, …, z_pi) should be fitted by the nonlinear function φ(x, ·), e.g.
φ(x, z) = x_1 + x_2 exp(−x_4 z) + x_3 exp(−x_5 z).
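A sketch of this objective for the example model (parameter indexing as written on the slide; the data here are synthetic, not the slide's measurements):

```python
import math

def phi(x, z):
    """Model from the slide: phi(x, z) = x1 + x2*exp(-x4*z) + x3*exp(-x5*z),
    with x = (x1, x2, x3, x4, x5)."""
    return x[0] + x[1] * math.exp(-x[3] * z) + x[2] * math.exp(-x[4] * z)

def lsq_objective(x, data):
    """Sum of squared residuals over measurements (z_i, y_i)."""
    return sum((y - phi(x, z)) ** 2 for z, y in data)
```

When the data are generated by the model itself, the objective is exactly zero at the true parameters and positive elsewhere, which is what the global optimizer searches for.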
Experimental investigation
Table: numbers of function evaluations (with counts of solved instances in parentheses) for DISIMPL-v, DISIMPL-c, and SymDIRECT at two accuracy levels pe on the test problems, with the totals of solved problems (TP) in the last row.
Motivation of globally-biased Direct-type algorithms
It is well known that the Direct algorithm quickly gets close to the optimum but can take much time to achieve a high degree of accuracy.
Figures: min_{v ∈ V(S)} f(v) (Disimpl-v) and f(c_i) (Disimpl-c) plotted against the simplex measures δ_i^k, distinguishing potentially optimal from non-potentially optimal simplices.
The globally-biased Disimpl (Gb-Disimpl)
Gb-Disimpl consists of two phases: a usual phase (as in the original Disimpl method) and a global one. The usual phase is performed until a sufficient number of subdivisions of simplices near the current best point has been executed. Once many subdivisions around the current best point have been made, its neighborhood contains only small simplices and all larger ones are located far from it. The two-phase approach thus forces the Gb-Disimpl algorithm to explore larger simplices, returning to the usual phase only when an improved minimal function value is obtained. Each of these phases can consist of several iterations.
Paulavičius, Sergeyev, Kvasov, Žilinskas. Journal of Global Optimization, 59(2–3), 545–567, 2014, doi:10.1007/s10898-014-0180-4. Journal of Global Optimization Best Paper Award 2014.
Visualization of Gb-Disimpl-v
Figure: min_{v ∈ V(S_i)} f(v) for the simplex groups, showing non-potentially optimal simplices, potentially optimal simplices using Disimpl-v, and potentially optimal simplices using Gb-Disimpl-v, together with the levels f_min^k and f_min − ε |f_min|; the horizontal axis shows the group indices l^k, …, L^k.
Visualization of Gb-Disimpl
Figure: geometric interpretation of potentially optimal simplices at iteration k of the Gb-Disimpl-v and Gb-Disimpl-c algorithms on a GKLS test problem, plotting min_{v ∈ V(S)} f(v) and f(c_i) against the measures δ_i^k.
Description of GKLS classes of test problems
The standard test problems are too simple for global optimization methods, therefore the GKLS generator is used. Eight classes (simple and hard, for several dimensions) are defined by the parameters: n, the dimension; m, the number of local minima; f*, the global minimum value; d, the distance from the global minimizer to the paraboloid vertex; r, the radius of the attraction region.
Table: parameter values of the eight GKLS classes used in the experiments.
Comparison criteria
1. The number (maximal and average over each class) of function evaluations (f.e.) executed by the methods until the satisfaction of the stopping rule.
Stopping rule: the global minimizer x* ∈ D = [l, u] was considered to be found when an algorithm generated a trial point x_j such that
|x_j(i) − x*(i)| ≤ ⁿ√Δ (u(i) − l(i)), 1 ≤ i ≤ n, (1)
where 0 < Δ < 1 is an accuracy coefficient. The algorithms also stopped if the maximal number of function evaluations T_max was reached.
2. The number of subregions generated until condition (1) is satisfied. This number indirectly reflects the degree of qualitative examination of D during the search for a global minimizer.
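The stopping rule can be sketched as a box test around the known minimizer (the ⁿ√Δ scaling follows the convention usually used with the GKLS generator; an assumption where the slide is ambiguous):

```python
def solved(x_trial, x_star, lower, upper, delta):
    """Check stopping rule (1): for every coordinate i,
    |x_trial(i) - x_star(i)| <= delta**(1/n) * (upper(i) - lower(i))."""
    n = len(x_star)
    tol = delta ** (1.0 / n)
    return all(abs(xt - xs) <= tol * (u - l)
               for xt, xs, l, u in zip(x_trial, x_star, lower, upper))
```

For n = 2 and Δ = 10⁻⁴ the tolerance per coordinate is 1% of the side of D, so a trial point within 0.005 of the minimizer on the unit square counts as a solution.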
Comparison on 8 classes of GKLS problems
Table: average number of function evaluations for Direct, Directl, and the original and globally-biased (Gb) versions of Disimpl-c and Disimpl-v on the eight GKLS classes; '>' marks entries where the evaluation budget was exceeded on some problems of the class.
Table: numbers of function evaluations needed to solve 50% and 100% of the problems in each class for the same six methods; FAIL (k) marks cases where k problems of the class were not solved.
Table: numbers of subregions generated to solve 50% and 100% of the problems in each class for the same six methods.
Operating characteristics for the algorithms
Figures: operating characteristics (number of solved problems versus number of function evaluations) for Direct, Directl, Disimpl-c, Gb-Disimpl-c, Disimpl-v, and Gb-Disimpl-v on the simple and hard GKLS class pairs of each dimension, up to n = 5.
Thank you for your attention