Local search
Heuristic algorithms
Giovanni Righini
University of Milan, Department of Computer Science (Crema)
Exchange algorithms
In Combinatorial Optimization every solution x is a subset of E. An exchange heuristic iteratively updates a subset x^(t):
1. it starts from a feasible solution x^(0) obtained in some way (often with a constructive heuristic);
2. it exchanges elements in the current solution with elements not in it, yielding other feasible solutions x_{A,D} = (x ∪ A) \ D, with A ⊆ E \ x and D ⊆ x;
3. at each step t, it selects which elements must be added and deleted according to a suitable criterion: (A*, D*) = arg min_{A,D} φ(x, A, D);
4. it generates the new current solution x^(t+1) := (x^(t) ∪ A*) \ D*;
5. when a suitable end test is satisfied, it terminates; otherwise, it goes back to step 2.
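As a concrete sketch of steps 1-5, the following Python fragment enumerates all exchanges with |A|, |D| ≤ 1; the problem interface (is_feasible, cost) and the iteration limit are illustrative assumptions, not part of the scheme above.

```python
# Minimal sketch of the generic exchange scheme (steps 1-5).
# is_feasible and cost are hypothetical problem-specific callables.

def exchange_heuristic(E, x0, is_feasible, cost, max_iter=1000):
    x = set(x0)                          # step 1: initial feasible solution
    for _ in range(max_iter):
        best, best_cost = None, cost(x)
        # step 2: all exchanges with |A|, |D| <= 1 (deletions, insertions, swaps)
        candidates = [(set(), {d}) for d in x]
        candidates += [({a}, set()) for a in E - x]
        candidates += [({a}, {d}) for a in E - x for d in x]
        for A, D in candidates:
            y = (x | A) - D              # x_{A,D} = (x ∪ A) \ D
            if is_feasible(y) and cost(y) < best_cost:   # step 3: selection
                best, best_cost = y, cost(y)
        if best is None:                 # step 5: end test (no improving move)
            return x
        x = best                         # step 4: move to the new solution
    return x
```

On the KP instance used in a later slide (w = [5 4 3 2], W = 10, maximizing total weight by minimizing its negation), starting from the empty set this sketch stops at {1, 2}: a feasible solution that no single exchange improves, even though better solutions exist.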
Neighborhood
An exchange heuristic is defined by:
- the set of subsets A and D that can be used, i.e. the subset of solutions that can be generated with an exchange;
- the selection criterion φ(x, A, D).
The neighborhood N : X → 2^X is a function associating a subset of neighbor solutions N(x) ⊆ X with each feasible solution x ∈ X.
One can define a search graph in which:
- nodes represent feasible solutions;
- arcs link each solution x with those in its neighborhood N(x).
Given a search graph:
- a run of the algorithm corresponds to a path;
- the traversal of an arc is called a move, because it transforms a solution into another one by moving some elements.
Distance-based neighborhoods
Every solution x ∈ X can be represented by its incidence vector:
x_i = 1 if i ∈ x, x_i = 0 if i ∈ E \ x
The Hamming distance between two incidence vectors x and x′ is the number of components in which they differ:
d_H(x, x′) = Σ_{i∈E} |x_i − x′_i|
In terms of subsets this means |x \ x′| + |x′ \ x|.
The set of solutions with Hamming distance from x within a given threshold is a possible definition of a neighborhood (parameterized on the threshold k):
N_{Hk}(x) = {x′ ∈ X : d_H(x, x′) ≤ k}
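With the subset representation, the Hamming distance is just the size of the symmetric difference; a short Python sketch (names are illustrative):

```python
# Hamming distance between solutions represented as subsets of E:
# d_H(x, x') = |x \ x'| + |x' \ x| = size of the symmetric difference.

def hamming(x, xp):
    return len(set(x) ^ set(xp))

def neighborhood_Hk(x, X, k):
    """All feasible solutions within Hamming distance k of x
    (note that x itself is included, at distance 0)."""
    return [xp for xp in X if hamming(x, xp) <= k]
```

On the KP instance of the next slide, filtering out x itself from N_{H2}({1, 3, 4}) leaves exactly the 7 neighbors mentioned there.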
An example: the KP
The instance of the KP with E = {1, 2, 3, 4}, w = [5 4 3 2] and W = 10 has 13 feasible solutions: the subsets {1, 2, 3, 4}, {1, 2, 3} and {1, 2, 4} are not feasible.
The solution x = {1, 3, 4} (in blue in the figure) has a neighborhood N_{H2}(x) of 7 elements (in pink). The subsets in black do not belong to the neighborhood, because their Hamming distance from x is larger than 2.
Operations-defined neighborhoods
Another common definition of neighborhood is operational. It is obtained by defining:
- a set O of operations that can be done on the solutions of the problem;
- the set of solutions generated by applying the operations of O to x:
N_O(x) = {x′ ∈ X : ∃o ∈ O : o(x) = x′}
For the KP, one can define O as:
- insertion of an element of E \ x into x;
- deletion of an element of x from x;
- exchange of an element in x with an element in E \ x.
The resulting neighborhood N_O is related to the distance-based neighborhoods, but it does not coincide with any of them:
N_{H1} ⊂ N_O ⊂ N_{H2}
These neighborhoods can be parameterized by executing sequences of k operations of O instead of one, as with distance-based neighborhoods.
Differences between neighborhoods
In general, operations-based neighborhoods produce solutions at different Hamming distances.
For the TSP one can define a neighborhood N_{S1} with the solutions that can be obtained by exchanging two vertices in the sequence of visits. The solution x = (3, 1, 4, 5, 2) has neighborhood:
N_{S1}(x) = {(1, 3, 4, 5, 2), (4, 1, 3, 5, 2), (5, 1, 4, 3, 2), (2, 1, 4, 5, 3), (3, 4, 1, 5, 2), (3, 5, 4, 1, 2), (3, 2, 4, 5, 1), (3, 1, 5, 4, 2), (3, 1, 2, 5, 4), (3, 1, 4, 2, 5)}
With respect to x, three arcs change if the two exchanged vertices are adjacent; four arcs change otherwise.
Relations between distance-based and operations-based neighborhoods
Sometimes the neighborhoods defined in the two ways coincide:
- for the MDP: N_{H2}, with solutions at distance 2, coincides with N_{S1}, defined by the exchange of an element;
- for the BPP: N_{H2}, with solutions at distance 2, coincides with N_{T1}, defined by moving an item to a different bin;
and many other examples are possible...
This is typical of problems where the cardinality of the solutions is fixed: if one runs a sequence of k exchanges, k elements enter and k elements leave the solution, so the Hamming distance between the first and the last solution is 2k.
Different neighborhoods for the same problem: the CMST
The same problem may allow for different operations-based neighborhoods. In the CMST one can:
- exchange edges: (i, j) leaves, (i, n) enters;
- exchange vertices: n is moved from subtree 2 to subtree 1 (recomputing the edges to reconnect all subtrees at minimum cost).
Different neighborhoods for the same problem: the PMSP
For the PMSP one can define:
- a transfer neighborhood N_{T1}, based on the set T1 of job moves from one machine to another;
- an exchange neighborhood N_{S1}, based on the set S1 of job exchanges between two different machines (one job for each machine).
Connectivity of the search space
An exchange heuristic can always find an optimal solution only if at least one optimal solution is reachable from any initial solution.
One says that the search graph is weakly connected to the optimum when for each solution x ∈ X a path from x to x* exists. Since x* is unknown, a stronger condition is often used: the search graph is strongly connected when for each pair of solutions x, y ∈ X a path from x to y exists.
An exchange heuristic should guarantee one of these conditions. This is not always possible:
- in the MDP, the neighborhood N_{S1} allows connecting any pair of solutions in at most k steps;
- in the KP and the SCP, no neighborhood N_{Sk} guarantees this, because the feasible solutions may have any cardinality;
- if we allow for deletions (in the KP) and insertions (in the SCP), then the search graph is connected.
Connectivity of the search space
If feasibility is defined in a sophisticated way, owing to the many constraints of the problem, then deletions, insertions and exchanges of elements may be insufficient: infeasible subsets may interrupt the paths between pairs of feasible solutions.
[Figure: a CMST instance with root r and vertices a, b, c, d, e, f, g, all of weight 1 except b, of weight 2, shown in three feasible configurations.]
Given W = 8, there are three feasible solutions, all with two subtrees of weight 4:
x = {(r, a), (a, b), (b, e), (r, d), (c, d), (d, g), (f, g)}
x′ = {(r, a), (a, e), (e, f), (f, g), (r, d), (c, d), (b, c)}
x″ = {(r, a), (a, b), (b, c), (r, d), (d, g), (f, g), (e, f)}
The three solutions can be reached from one another only by exchanging two edges at a time; exchanging one edge, only infeasible solutions are reached.
Steepest descent heuristic
The selection criterion φ(x) of the new solution in the neighborhood of the current solution is typically the objective function: at each step, the heuristic moves from the current solution to the best one in its neighborhood. To avoid cycling, one accepts only strictly improving moves.
Algorithm SteepestDescent(I, x^(0))
x := x^(0); Stop := false;
While Stop = false do
  x′ := arg min_{x̄ ∈ N(x)} f(x̄);
  If f(x′) ≥ f(x) then Stop := true;
  else x := x′;
EndWhile;
Return (x, f(x));
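A literal Python transcription of the pseudocode; neighborhood and f are problem-specific callables, assumed here for illustration:

```python
# Steepest descent: at each step move to the best neighbor,
# and stop when no neighbor strictly improves on x.

def steepest_descent(x0, neighborhood, f):
    x = x0
    while True:
        neighbors = neighborhood(x)
        if not neighbors:
            return x, f(x)
        xp = min(neighbors, key=f)       # x' := arg min f over N(x)
        if f(xp) >= f(x):                # only strictly improving moves
            return x, f(x)               # x is a local optimum
        x = xp
```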
Local and global optimality
A steepest descent heuristic terminates when it finds a locally optimal solution, that is a solution x̄ ∈ X such that (assuming minimization)
z(x̄) ≤ z(x) for each x ∈ N(x̄)
A globally optimal solution is also locally optimal, but the vice versa is not true in general.
[Figure: the objective along the search space; a local optimum x̄ minimizes f over its neighborhood N(x̄), while the global optimum x* minimizes f over the whole X.]
Exact neighborhood
An exact neighborhood is a neighborhood function N : X → 2^X such that every local optimum is also a global optimum.
A trivial case occurs when the neighborhood of every solution coincides with the whole feasible region (N(x) = X for each x ∈ X), but this is useless: the neighborhood is too large to explore.
Nontrivial exact neighborhoods are extremely rare: the only relevant case is the exchange of basic and non-basic variables used by the simplex algorithm for linear programming problems.
In general, a steepest descent heuristic finds a local optimum, not a global optimum. Its effectiveness depends on the properties of the search graph and the objective.
Properties of the search graph
Some relevant properties are:
- the size of the search space |X|;
- the connectivity of the search space (or the search graph);
- the diameter of the search graph, i.e. the number of arcs of the longest shortest path in it.
For instance, for the symmetric TSP on complete graphs:
- the search space contains |X| = (n − 1)! solutions;
- the vertex exchange neighborhood contains C(n, 2) = n(n − 1)/2 solutions;
- the diameter of the search graph is n − 2, because any solution can be transformed into any other by at most n − 2 exchanges.
For instance, x = (1, 5, 4, 2, 3) becomes x′ = (1, 2, 3, 4, 5) in 3 steps:
x = (1, 5, 4, 2, 3) → (1, 2, 4, 5, 3) → (1, 2, 3, 5, 4) → (1, 2, 3, 4, 5) = x′
Properties of the search graph
Other relevant properties are:
- the density of globally optimal solutions (|X*|/|X|) and locally optimal solutions (|X̄|/|X|): if local optima are many, it is difficult to find global optima;
- the quality of local optima compared with global optima, δ(x̄) = (z(x̄) − z(x*))/z(x*), possibly described by an SQD diagram: if local optima are good, it may be less important to find global optima;
- the distribution of local optima in the search space: if local optima are close to one another, it is not necessary to explore the whole space.
The exact evaluation of these indicators would require an exhaustive exploration of the search space. In practice, we limit ourselves to probing it: this analysis
- may require a lot of time;
- may provide misleading results.
Example: the TSP
Typical results with the TSP on complete graphs with Euclidean costs:
- the average Hamming distance between two local optima is ≈ n: local optima are concentrated in a small sub-region of X;
- the average Hamming distance between two local optima is larger than the one between local and global optima: global optima are likely to lie in between local optima;
- the FDC (Fitness-Distance Correlation) diagram links the quality δ(x̄) with the distance from the global optima d_H(x̄, X*): better local optima are closer to global optima.
Fitness-Distance Correlation
If the correlation between quality and closeness to global optima is strong:
- it is more convenient to search for good initial solutions, because they guide the local search to good local optima;
- it is better to intensify than to diversify.
On the contrary, if the correlation is weak:
- a good initialization is less important;
- it is better to diversify than to intensify.
This happens, for instance, with the Quadratic Assignment Problem (QAP).
Landscape
A landscape is a triple (X, N, z), where
- X is the search space, or the feasible region;
- N : X → 2^X is the neighborhood function;
- z : X → ℕ is the objective function.
One can see the search graph as a graph weighted on the vertices with the objective. The effectiveness of exchange heuristics depends on the landscape: rugged landscapes imply many local optima and hence less effective heuristics.
Different types of landscapes There is a wide variety of landscapes.
Autocorrelation coefficient
The complexity of a landscape can be estimated empirically by:
1. doing a random walk on the search graph;
2. determining the sequence of objective values z^(1), ..., z^(t_max);
3. computing their average value z̄ = (1/t_max) Σ_{t=1}^{t_max} z^(t);
4. computing the empirical autocorrelation coefficient
r(i) = [Σ_{t=1}^{t_max−i} (z^(t) − z̄)(z^(t+i) − z̄) / (t_max − i)] / [Σ_{t=1}^{t_max} (z^(t) − z̄)² / t_max]
It is a function of i that starts from r(0) = 1 and usually decreases.
If r(i) remains close to 1, the landscape is smooth:
- neighbor solutions have values similar to the current one;
- there are few local optima;
- the steepest descent heuristic is effective.
If r(i) decreases rapidly, the landscape is rugged:
- neighbor solutions have values quite different from the current one;
- there are many local optima;
- the steepest descent heuristic is not so effective.
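Step 4 can be computed directly from the sampled values; a short Python sketch, with the same normalization as the formula above:

```python
# Empirical autocorrelation r(i) of the objective values z^(1..t_max)
# sampled along a random walk: r(i) close to 1 for small i suggests
# a smooth landscape, a fast drop suggests a rugged one.

def autocorrelation(z, i):
    t_max = len(z)
    mean = sum(z) / t_max
    num = sum((z[t] - mean) * (z[t + i] - mean)
              for t in range(t_max - i)) / (t_max - i)
    den = sum((zt - mean) ** 2 for zt in z) / t_max
    return num / den
```

For instance, a slowly drifting sequence gives r(1) near 1, while a sequence that jumps at every step gives a negative r(1).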
Plateaux
One can analyse the search graph by dividing it into objective levels: a plateau of value z is a subset of solutions of value z that are adjacent in the search graph.
Large plateaux hamper the selection of the move, because they make it dependent on the order in which the neighbor solutions are visited. Hence a too smooth landscape is not an advantage!
Example (PMSP): all transfers and exchanges between machines 1 and 3 leave the objective function value unchanged (the other moves worsen it).
Attraction basins An alternative subdivision of the search graph is based on the concept of attraction basin of a local optimum x. It is the set of solutions x (0) X such that the steepest descent heuristic starting from x (0) terminates in x. The steepest descent heuristic is effective if attraction basins are few and large (especially when global optima have larger attraction basins); ineffective if attraction basins are many and small (especially if global optima have smaller attraction basins).
Complexity
Algorithm SteepestDescent(I, x^(0))
x := x^(0); Stop := false;
While Stop = false do
  x′ := arg min_{x̄ ∈ N(x)} z(x̄);
  If z(x′) ≥ z(x) then Stop := true;
  else x := x′;
EndWhile;
Return (x, z(x));
The complexity of the steepest descent heuristic depends on:
1. the number of steps: this depends on the structure of the search graph (width of the attraction basins), which is difficult to estimate a priori;
2. the selection of a best solution in the neighborhood: this depends on how the search is done.
Exploring the neighborhood
Two main strategies are used:
1. exhaustive search: all neighbor solutions are evaluated; the complexity of each iteration is the product of
- the number of neighbor solutions (|N(x)|);
- the cost for evaluating each of them (γ_N(|E|, x)).
Sometimes it is not easy to evaluate only neighbor solutions: one visits a superset of the neighborhood; for each element the feasibility is checked; for the feasible elements the cost is evaluated.
2. efficient exploration of the neighborhood: instead of visiting the whole neighborhood, one finds the optimal neighbor solution by solving an auxiliary problem. Only some special neighborhoods allow for this.
Exhaustive exploration of the neighborhood
Algorithm SteepestDescent(I, x^(0))
x := x^(0); Stop := false;
While Stop = false do
  x′ := x; { x′ := arg min_{x̄ ∈ N(x)} z(x̄) }
  For each x̄ ∈ N(x) do
    If z(x̄) < z(x′) then x′ := x̄;
  EndFor;
  If z(x′) ≥ z(x) then Stop := true;
  else x := x′;
EndWhile;
Return (x, z(x));
The complexity is the product of three terms:
1. the number of iterations t_max to reach the local optimum;
2. the number of solutions |N(x^(t))| visited at each iteration;
3. the time γ_N(x^(t), E) to evaluate the objective.
In general |N(x^(t))| and γ_N(x^(t), E) have a maximum which is independent of x^(t).
Evaluating the objective: the additive case
The first expedient to accelerate an exchange algorithm is minimizing the time needed to evaluate the objective. If an exchange inserts or deletes a small number of elements and the objective is additive, updating z(x) instead of recomputing it costs γ_N(|E|) ∈ O(1): it is enough
- to add φ_j for each element j inserted in x;
- to subtract φ_j for each element j deleted from x.
In the KP and the CMSTP one can define the neighborhood N_{S1} generated by the exchange of an element i ∈ x with an element j ∈ E \ x. Moving from x to x′ = x \ {i} ∪ {j}, the objective varies by
δ(x, i, j) = z(x \ {i} ∪ {j}) − z(x) = φ(j) − φ(i)
Note that δ(x, i, j) does not depend on x.
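For an additive objective the variation needs no access to x at all; a minimal Python sketch (phi is an assumed cost table):

```python
# Additive objective z(x) = sum of phi[i] for i in x.
# The exchange i -> j is evaluated in O(1): delta = phi[j] - phi[i].

def delta_exchange(phi, i, j):
    return phi[j] - phi[i]

def z(phi, x):
    """Full recomputation, O(|x|), shown only for comparison."""
    return sum(phi[i] for i in x)
```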
Example: the symmetric TSP
The neighborhood N_{R2} for the TSP:
- deletes two non-consecutive edges (s_i, s_{i+1}) and (s_j, s_{j+1});
- inserts the two edges (s_i, s_j) and (s_{i+1}, s_{j+1});
- reverses the direction of (s_{i+1}, ..., s_j) (modifying O(n) edges).
If the cost function is symmetric, the variation of z(x) is
δ(x, i, j) = c(s_i, s_j) + c(s_{i+1}, s_{j+1}) − c(s_i, s_{i+1}) − c(s_j, s_{j+1})
In many other cases, however, the function is not additive.
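A sketch of the O(1) evaluation for the symmetric case; c is a cost matrix and s the tour as a list of vertices (both hypothetical names):

```python
# 2-opt (N_R2) move evaluation for the symmetric TSP: only the four
# edges touching positions i and j change, so the delta costs O(1),
# even though the move reverses an O(n) segment of the tour.

def two_opt_delta(c, s, i, j):
    """Delta of removing edges (s[i],s[i+1]), (s[j],s[j+1]) and
    reconnecting with (s[i],s[j]), (s[i+1],s[j+1])."""
    n = len(s)
    a, b = s[i], s[(i + 1) % n]
    d, e = s[j], s[(j + 1) % n]
    return c[a][d] + c[b][e] - c[a][b] - c[d][e]

def tour_cost(c, s):
    """Full recomputation, O(n), shown only for comparison."""
    return sum(c[s[t]][s[(t + 1) % len(s)]] for t in range(len(s)))
```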
Quadratic functions
In the MDP the objective function is quadratic: using the neighborhood N_{S1}, moving from x to x′ = x \ {i} ∪ {j} the objective varies by
δ(x, i, j) = z(x \ {i} ∪ {j}) − z(x) = (1/2) Σ_{h,k ∈ x\{i}∪{j}} d_hk − (1/2) Σ_{h,k ∈ x} d_hk
There are O(n) different terms in the two sums. However, there is a general expedient that works with symmetric quadratic objective functions:
δ(x, i, j) = Σ_{k∈x} d_jk − Σ_{k∈x} d_ik − d_ij = D_j(x) − D_i(x) − d_ij
If one knows D_l(x) = Σ_{k∈x} d_lk for each l ∈ E, the computation requires O(1) time.
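A sketch of the O(1) evaluation with the aggregate values D_l, plus the O(|E|) refresh performed once, after the chosen move (names are illustrative):

```python
# MDP exchange evaluation in O(1) using the aggregate values
# D[l] = sum of d[l][k] for k in x: delta = D[j] - D[i] - d[i][j].

def mdp_delta(d, D, i, j):
    return D[j] - D[i] - d[i][j]

def mdp_update_D(d, D, i, j, E):
    """After doing the exchange x -> x \\ {i} + {j}, refresh D in O(|E|):
    D_l := D_l - d_li + d_lj for every l (each l sees d_li disappear
    and d_lj appear)."""
    for l in E:
        D[l] += d[l][j] - d[l][i]
```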
Example: the MDP
We want to evaluate the exchange x → x′ = x \ {i} ∪ {j}, with i ∈ x and j ∈ E \ x:
z′ = z − D_i + D_j − d_ij
- we lose the pairs including i (−D_i);
- we gain the pairs including j (+D_j);
- but the pair (i, j) is counted in excess (−d_ij).
Example: the MDP
Update of the data structures after the exchange:
D_l := D_l − d_li + d_lj, for each l ∈ E
Each element l sees d_li disappear and d_lj appear.
Use of auxiliary information
Other non-linear functions can also be updated by
- keeping aggregate information on the current solution;
- using this information to compute z efficiently;
- updating such information when moving to the next solution.
For the PMSP with the transfer neighborhood N_{T1} and the exchange neighborhood N_{S1}, one can evaluate the objective in constant time by keeping and updating
- the completion time of each machine;
- the indices of the machines with the two largest completion time values.
Example: the PMSP
Consider the exchange o = (i, j) of jobs i and j (i on machine M_i, j on machine M_j):
- the new completion times can be computed in constant time: one of them increases and the other one decreases (or they remain unchanged);
- one can check in constant time whether one of them exceeds the maximum completion time;
- if the maximum completion time decreases, one can check in constant time whether the other modified machine or the machine with the second largest completion time attains the new maximum.
Once the whole neighborhood has been visited and the move selected, it is necessary to update the completion times (in constant time: only two of them change).
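A possible constant-time evaluation of a transfer move, assuming the search maintains the indices m1, m2 of the machines with the two largest completion times (names are illustrative):

```python
# Constant-time evaluation of a PMSP transfer move: a job of length p
# moves from machine a to machine b. C[m] is the completion time of
# machine m; m1, m2 index the largest and second largest entries of C
# (maintained incrementally by the search; recomputed in the test).

def eval_transfer(C, m1, m2, p, a, b):
    """New makespan after the transfer, without scanning all machines."""
    if m1 not in (a, b):
        unchanged = C[m1]      # the makespan machine is untouched
    elif m2 not in (a, b):
        unchanged = C[m2]      # the second largest is untouched
    else:
        unchanged = 0          # a and b are the two largest machines:
                               # C[b] + p dominates every unchanged one
    return max(unchanged, C[a] - p, C[b] + p)
```

The corner case where both modified machines are the two largest is harmless: the receiving machine ends at C[b] + p > C[b] ≥ C[m] for every unchanged machine m.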
Use of auxiliary information
The auxiliary information can be about
- the current solution x;
- the previous solution in the neighborhood, according to a suitable ordering.
Consider the neighborhood N_{R2} for the symmetric TSP:
- the neighbor solutions differ from x by O(n) edges;
- the solutions in the neighborhood differ from one another by O(n) edges;
- if the edge pairs (s_i, s_{i+1}) and (s_j, s_{j+1}) follow the lexicographical order, the reversed path changes by only one edge.
Example: the asymmetric TSP
In general, the variation of z(x) is
δ(x, i, j) = c(s_i, s_j) + c(s_{i+1}, s_{j+1}) − c(s_i, s_{i+1}) − c(s_j, s_{j+1}) + c(s_j ... s_{i+1}) − c(s_{i+1} ... s_j)
where c(s_{i+1} ... s_j) and c(s_j ... s_{i+1}) are the costs of the reversed segment traversed forwards and backwards.
When we have considered exchange (i, j) and we consider exchange (i, j′) with j′ = j + 1:
- the first four terms change, but they are data;
- the last two terms can be updated in constant time:
c(s_{j′} ... s_{i+1}) = c(s_j ... s_{i+1}) + c(s_{j+1}, s_j)
c(s_{i+1} ... s_{j′}) = c(s_{i+1} ... s_j) + c(s_j, s_{j+1})
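The incremental update can be sketched as follows: for a fixed i, scanning j in increasing order keeps the forward and backward costs of the reversed segment up to date in O(1) per move (c and s are hypothetical names, as before):

```python
# Asymmetric TSP, neighborhood N_R2: reversing the segment changes its
# cost, so the delta also needs the forward and backward costs of the
# segment s[i+1..j], updated incrementally while j grows.

def scan_deltas(c, s, i):
    """Deltas of all moves (i, j) with j > i + 1, without recomputing
    the segment costs from scratch at each j."""
    n = len(s)
    fwd = bwd = 0        # costs of s[i+1..j] forward/backward; empty at j = i+1
    deltas = {}
    for j in range(i + 2, n - 1):
        fwd += c[s[j - 1]][s[j]]      # extend the segment with s[j-1] -> s[j]
        bwd += c[s[j]][s[j - 1]]      # ...and its reversed edge
        deltas[j] = (c[s[i]][s[j]] + c[s[i + 1]][s[j + 1]]
                     - c[s[i]][s[i + 1]] - c[s[j]][s[j + 1]]
                     + bwd - fwd)
    return deltas
```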
Feasibility
Some operations done to explore the neighborhood may yield infeasible solutions:
Ñ_O(x) = {x′ ∈ 2^E : ∃o ∈ O : o(x) = x′}    N_O(x) = Ñ_O(x) ∩ X
In this case, for each element of Ñ_O(x) one needs
- to check the feasibility;
- if feasible, to evaluate the cost.
To check feasibility one can use the same techniques used for the objective.
Example: the CMSTP
Consider the neighborhood N_{S1} that inserts an edge and deletes another:
- if the two edges are in the same branch, the solution remains feasible;
- if they belong to different branches, one loses weight, the other one gets it: the variation is equal to the weight of the transferred subtree.
If we keep the weight of the subtree rooted at each vertex, it is enough to compare such weight with the residual capacity of the branch that receives it.
This piece of information must be updated once the move has been done: it takes O(n) time.
Refined heuristic
The use of additional information implies
1. the initialization of suitable
- local data structures, related to the exploration of each neighborhood;
- global data structures, related to the whole search process;
2. their update from a solution to another or from an iteration to another.
Algorithm SteepestDescent(I, x^(0))
x := x^(0); GD := InitializeGD(); Stop := false;
While Stop = false do
  x′ := x; δ′ := 0; LD := InitializeLD();
  For each x̄ ∈ N(x) do
    If δ(x̄) < δ′ then x′ := x̄; δ′ := δ(x̄);
    LD := UpdateLD(LD);
  EndFor;
  If z(x′) ≥ z(x) then Stop := true;
  else x := x′; GD := UpdateGD(GD);
  EndIf;
EndWhile;
Return (x, z(x));
Partial conservation of the neighborhood
When an operation o ∈ O is executed on a solution x, it often happens that the variation δ(x, o) of the objective function does not depend on x, or it depends only on a part of x: many operations o′ ∈ O executed on x′ = o(x) produce δ(x′, o′) = δ(x, o′).
In this case, it is convenient
1. to store all values of δ(x, o) as they are computed;
2. to do the best move, generating x′;
3. to delete the values such that δ(x′, o) ≠ δ(x, o);
4. to recompute only the deleted values;
5. to go back to step 2.
Example: the CMST
Consider the neighborhood N_{S1} for the CMST:
- insert an edge j ∈ E \ x;
- delete an edge i ∈ x.
The exchanges involving only branches not affected by the move produce the same effect: δ(x′, i, j) = δ(x, i, j).
Therefore it is possible
- to keep the set of the feasible exchanges;
- to delete from the list the exchanges involving one or both the branches associated with the move;
- to recompute only the effect of those exchanges.
The efficiency-efficacy trade-off
The complexity depends on three factors:
1. the number of local search iterations;
2. the size of the neighborhood to be explored;
3. the complexity of evaluating each solution.
The former two are conflicting:
- a large neighborhood allows for few steps (or better solutions);
- a small neighborhood implies many steps.
The optimal trade-off is somewhere in between: we need a neighborhood
- large enough to allow reaching good solutions;
- small enough to allow for a quick selection of the move.
In general it is difficult to understand a priori what the best trade-off is.
Fine tuning the neighborhoods
It is also possible to fine tune the size of a given neighborhood N: one explores only a promising subset N′ ⊆ N. For instance, one can
- insert only elements j ∈ E \ x with a low enough cost φ(j);
- delete only elements i ∈ x with a high enough cost φ(i);
- terminate the search as soon as the best solution found is promising enough.
For instance, one can apply the first-improve strategy: the exploration of the neighborhood is stopped as soon as a solution better than the current one is found:
If z(x̄) < z(x) then Stop := true;
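A sketch of the first-improve variant; as before, neighborhood and f are assumed problem-specific callables:

```python
# First-improve descent: the scan of N(x) stops at the first neighbor
# strictly better than x, trading per-step solution quality for a
# cheaper iteration.

def first_improve_descent(x0, neighborhood, f):
    x = x0
    improved = True
    while improved:
        improved = False
        for xp in neighborhood(x):
            if f(xp) < f(x):      # first improving neighbor found:
                x = xp            # accept it and restart the scan
                improved = True
                break
    return x, f(x)
```

On a smooth landscape this usually reaches the same local optimum as steepest descent, with less work per iteration.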
Fine tuning the neighborhoods The effectiveness depends on the objective: if the cost of some elements heavily affects the objective, it may be worth fixing or forbidding them. It also depends on the neighborhood: if the landscape is smooth, the first improving neighbor solution is not likely to be much worse than the best improving; if the landscape is rugged, the best solution in the neighborhood can be much better than the first improving solution.