Primal-Dual Schema Approach to the Labeling Problem with Applications to TSP

Colin Brown, Simon Fraser University
Instructor: Ramesh Krishnamurti

The Metric Labeling Problem has many applications, especially in computer vision and image analysis, but in its general form it is NP-hard. Thus, the formulation of approximation algorithms for this problem is an important avenue of research. N. Komodakis and G. Tziritas constructed a framework, based on the Primal-Dual Schema, to find approximate solutions to the Labeling problem as an integer program (IP) via three distinct but related algorithms. For two of these algorithms, the condition of a metric distance function over labels can be relaxed, allowing more general problems to be solved. In this paper, we summarize and review this framework and then show how, under the relaxation to non-metric label distances, TSP can be reduced to the labeling problem.

Introduction

The Labeling problem is the problem of assigning a label to each node of a graph so as to minimize the total cost of the assignment. In its general form, given below, this problem is NP-hard. Many problems in image analysis and computer vision, such as image segmentation, stereo matching, and image restoration, can be solved as labeling problems. For a graph G(V,E) and a label set L, the problem can be formulated as the following minimization:

$$\min \sum_{p\in V}\sum_{a\in L} c_p(a)\,x_p(a) \;+\; \sum_{(p,q)\in E} w_{pq} \sum_{a,b\in L} d(a,b)\,x_{pq}(a,b) \qquad (1)$$

$$\text{s.t.}\quad \sum_{a\in L} x_p(a) = 1 \quad \forall p\in V \qquad (2)$$

$$\sum_{a\in L} x_{pq}(a,b) = x_q(b) \quad \forall b\in L,\ (p,q)\in E, \qquad \sum_{b\in L} x_{pq}(a,b) = x_p(a) \quad \forall a\in L,\ (p,q)\in E \qquad (3)$$

$$x_p(\cdot),\ x_{pq}(\cdot,\cdot) \in \{0,1\}$$

where $c_p(a)$ is the cost of assigning label $a$ to node $p$, $w$ is the weight function over E, $d(a,b)$ is the distance between any two labels, and $x_p(a) = 1$ iff node $p$ has label $a$, 0 otherwise. Similarly, $x_{pq}(a,b) = 1$ iff node $p$ has label $a$ and node $q$ has label $b$, 0 otherwise.
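To make the objective concrete, here is a minimal Python sketch that evaluates (1) for an integral labeling, where $x_{pq}(a,b) = x_p(a)\,x_q(b)$. The data layout (nested dicts, an edge attribute 'w') is our own convention, not from [1].

```python
import networkx as nx

def labeling_cost(G, c, d, labels):
    """Evaluate objective (1) for an integral labeling.

    G      -- networkx.Graph whose edges carry the weight attribute 'w'
    c      -- c[p][a]: cost of assigning label a to node p
    d      -- d[a][b]: distance between labels (d[a][a] == 0)
    labels -- labels[p]: the label assigned to node p
    """
    unary = sum(c[p][labels[p]] for p in G.nodes)           # label costs
    pairwise = sum(data['w'] * d[labels[p]][labels[q]]      # separation costs
                   for p, q, data in G.edges(data=True))
    return unary + pairwise

# Example: two nodes, one edge, two labels
G = nx.Graph(); G.add_edge('p', 'q', w=2.0)
c = {'p': {0: 1.0, 1: 3.0}, 'q': {0: 2.0, 1: 1.0}}
d = {0: {0: 0.0, 1: 1.0}, 1: {0: 1.0, 1: 0.0}}
print(labeling_cost(G, c, d, {'p': 0, 'q': 1}))   # 1.0 + 1.0 + 2.0*1.0 = 4.0
```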

It turns out that if d is a linear function, then a global optimum can be computed easily. If d is nonlinear but metric, i.e., the following hold:

$$d(a,b) = 0 \iff a = b, \qquad d(a,b) = d(b,a) \ge 0, \qquad (4)$$
$$d(a,c) \le d(a,b) + d(b,c) \qquad (5)$$

then the problem becomes hard, but approximation algorithms exist. In the framework of N. Komodakis and G. Tziritas, the conditions can be relaxed even further by not requiring (5), while still finding a good approximation. Specifically, the solution can be approximated within a known factor of the optimal solution, so we can tell how good it really is. To do this, they use the Primal-Dual Schema, which gives a sub-optimality bound at every iteration.

Primal-Dual Schema and Relaxed Complementary Slackness

The idea of the Primal-Dual Schema is to iteratively compute better feasible solutions in the primal and the dual until they are within some ratio f of one another. It hinges on the fact that the optimal solution to an integer program sits between the objective values of any pair of feasible primal and dual solutions (figure 1).

Figure 1: Feasible primal and dual solutions, with the optimal integral primal solution and the optimal linear program solution between them.

In particular, the Primal-Dual Principle states that if we have

$$\text{P:}\ \min\ c^T x \quad \text{s.t.}\ Ax = b,\ x \ge 0,\ x \in \mathbb{Z}^n \qquad\qquad \text{D:}\ \max\ b^T y \quad \text{s.t.}\ A^T y \le c$$

and a pair of feasible primal and dual solutions $(x, y)$ which satisfies $c^T x \le f\, b^T y$, then $x$ is an f-approximation to the optimal IP solution $x^*$, i.e., $c^T x^* \le c^T x \le f\, c^T x^*$.

In order to actually generate new feasible solutions, it is convenient to use the Relaxed Complementary Slackness (RCS) Theorem. It states that if

$$x_j > 0 \;\Rightarrow\; \sum_{i=1}^{m} a_{ij}\, y_i \ge c_j / f_j$$

then for $f = \max_j f_j$, $(x, y)$ is an f-approximation to the optimal IP solution. So given some feasible solution $(x, y)$, we can easily check whether it is an f-approximation, and we have some sense of which variables to modify to get there.
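The RCS theorem also yields a practical certificate: from any feasible pair $(x, y)$ one can read off the per-variable factors $f_j$ and hence the overall ratio f. A small numpy sketch of this check (the helper is ours, and it assumes $(A^T y)_j > 0$ for every active variable):

```python
import numpy as np

def rcs_ratio(A, c, x, y):
    """Certified approximation factor from relaxed complementary slackness.

    For each primal variable with x_j > 0, the dual constraint row gives
    f_j = c_j / (A^T y)_j; by the RCS theorem, max_j f_j certifies that
    (x, y) is an f-approximation.  A: (m, n), c: (n,), x: (n,), y: (m,).
    """
    Aty = A.T @ y
    active = x > 0
    return float(np.max(c[active] / Aty[active]))
```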

Dual of the Labeling Problem

In order to find relaxed complementary slackness conditions for the Labeling problem, we first need to formulate its dual:

$$\text{D:}\ \max \sum_p z_p \quad \text{s.t.}\quad z_p \le \min_{a\in L}\Big[c_p(a) + \sum_{q:q\sim p} y_{pq}(a)\Big], \qquad y_{pq}(a) + y_{qp}(b) \le w_{pq}\, d(a,b) \quad \forall a,b\in L,\ (p,q)\in E$$

For convenience, an additional variable, h, is introduced which denotes the height of each label at each node. Also, the inequality in the first constraint can be made an equality, since we are maximizing a minimum. So we have:

$$\text{D:}\ \max \sum_p z_p \quad \text{s.t.}\quad z_p = \min_{a\in L} h_p(a)$$
$$h_p(a) = c_p(a) + \sum_{q:(p,q)\in E} y_{pq}(a) \qquad (6)$$
$$y_{pq}(a) + y_{qp}(b) \le w_{pq}\, d(a,b) \quad \forall a,b\in L,\ (p,q)\in E \qquad (7)$$

where the $z_p$ are the dual variables to maximize, the $h_p$ are the introduced height variables, the $y_{pq}$ are 'balance variables' (one for each edge, for each label), the $w_{pq}$ are the edge weights of the graph, and d gives the distances between each pair of labels. Instead of enforcing (7) directly, a constraint is placed on each balance variable individually, for convenience:

$$y_{pq}(a) \le w_{pq}\, d_{min}/2 \qquad (8)$$

where $d_{min}$ is the smallest nonzero entry of d. Given our primal and dual for the labeling problem, the relaxed complementary slackness conditions become:

$$x_p(a) \ne 0 \;\Rightarrow\; z_p \ge c_p(x_p)/f_1 + \sum_{q:q\sim p} y_{pq}(x_p) \qquad (9)$$
$$x_{pq}(a,b) \ne 0,\ x_p \ne x_q \;\Rightarrow\; y_{pq}(x_p) + y_{qp}(x_q) \ge w_{pq}\, d(x_p, x_q)/f_2 \qquad (10)$$
$$a = x_p = x_q \;\Rightarrow\; y_{pq}(a) + y_{qp}(a) = 0 \qquad (11)$$

We can enforce (11) as we go by defining $y_{pq}(\cdot) = -y_{qp}(\cdot)$ for all $(p,q)\in E$. Defining the $f_j$'s for our f-approximation as $f_1 = 1$ and $f_2 = f_{app} \equiv 2\,d_{max}/d_{min}$, conditions (9) and (10) respectively become:

$$h_p(x_p) = \min_a h_p(a) \qquad (12)$$
$$x_p \ne x_q \;\Rightarrow\; y_{pq}(x_p) + y_{qp}(x_q) \ge w_{pq}\, d(x_p, x_q)/f_{app} \qquad (13)$$

The PD1 algorithm generates new feasible solutions which satisfy (13) until we have one that also satisfies (12).

PD1 Algorithm

The PD1 algorithm runs as follows. Initialize the primals via a random labeling, and the duals by setting each balance variable to the feasible upper bound for its edge using (8). PD1 then loops until re-labeling has ceased. In each iteration of this outer loop, the algorithm selects each label c in the label set, one at a time, and executes the main step, called a c-iteration. A structural sketch of this control flow follows; the flow computation inside a c-iteration is detailed in the next section.
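The Python skeleton below shows the outer loop under our own data layout; it is a sketch of the structure, not the authors' implementation. The per-label flow computation is passed in as `flow_update` (sketched after the next section), and `height` implements (6) under the antisymmetry convention $y_{qp} = -y_{pq}$.

```python
import random

def height(G, c, y, p, a):
    """h_p(a) = c_p(a) + sum_q y_pq(a), eq. (6); only one direction of each
    balance variable is stored, the reverse being its negation."""
    total = c[p][a]
    for q in G.neighbors(p):
        total += y[(p, q, a)] if (p, q, a) in y else -y[(q, p, a)]
    return total

def pd1(G, c, d, L, flow_update):
    """PD1 skeleton.  G: networkx.Graph with edge weights 'w';
    c[p][a]: label costs; d[a][b]: label distances; L: label list."""
    d_min = min(d[a][b] for a in L for b in L if a != b)
    x = {p: random.choice(L) for p in G.nodes}            # random primal init
    y = {(p, q, a): G[p][q]['w'] * d_min / 2              # upper bound (8)
         for p, q in G.edges for a in L}
    changed = True
    while changed:                                        # outer loop
        changed = False
        for lab in L:                                     # one c-iteration per label
            f = flow_update(G, c, d, x, y, lab)           # flows f_pq (next section)
            for p, q in G.edges:                          # y_pq += f_pq - f_qp
                y[(p, q, lab)] += f.get((p, q), 0.0) - f.get((q, p), 0.0)
            for p in G.nodes:                             # re-label rule
                if height(G, c, y, p, lab) < height(G, c, y, p, x[p]):
                    x[p], changed = lab, True
    return x
```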

In one c-iteration, the balance variables y are first updated via a max-flow calculation, detailed below. From this calculation we get a flow $f_{pq}$ between every two nodes, and the updated balance variable is computed as $y_{pq}(c) \leftarrow y_{pq}(c) + f_{pq} - f_{qp}$. This can modify the height of label c at each node, by the definition of h. With the new heights calculated, the primal variables x are re-labeled accordingly. The re-label rule for $x_p$ is: if $h_p(c) < h_p(x_p)$, then set $x_p = c$. This ensures that the label at p is always the lower of $x_p$ and c.

The max-flow calculation for modifying the balance variables requires two additional nodes, a source s and a sink t, together with flow capacities defined on each edge. Internal edges are those from the original graph, and external edges are those connecting either s to p or p to t, where p is any node in the original graph. We want to define edge capacities such that the resulting flow across each edge (p,q) determines a good incremental change to $y_{pq}(c)$, i.e., one which yields a new $y_{pq}(c)$ that both satisfies (13) and remains feasible by satisfying (8). The capacities on internal edges are defined (see [1] for the exact expressions) so that we can raise the height of a c label as long as doing so does not violate (8) and as long as c is not the label of $x_p$ or $x_q$ (since we want our selected labels to be lowest). The external edge capacities are defined so as to allow flow to raise a label c if it is below the selected label, and to lower it in order to reduce slack if it is above the selected label. If $x_p = c$, then there will be no flow at p, and so we set the capacity to 1 by convention. Note that any node p has nonzero capacity either from s or to t, but not both.

After running max-flow and re-labeling the primal variables, it is possible that some $y_{pq}(c)$ will be negative. In this case, set both $y_{pq}(c) = y_{qp}(c) = 0$. As mentioned above, the c-iteration repeats for each label, and then we check whether any $x_p$ was actually re-labeled. If no re-labeling occurred, the algorithm stops, having satisfied (12) and thus achieved an f-approximation; otherwise it continues iterating.
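Continuing the earlier PD1 skeleton, here is one way to realize the capacity rules just described, using networkx's max-flow. This is an illustrative version, not the paper's exact capacities (those are in [1]): internal edges get the slack remaining under bound (8), external edges get the height gaps, and `height` is the helper from the previous sketch. The function can be passed as the `flow_update` argument of `pd1`.

```python
import networkx as nx

def pd1_flow_update(G, c, d, x, y, lab):
    """Max-flow step of one c-iteration (illustrative capacities).
    Returns flows on internal edges as {(p, q): f_pq}."""
    d_min = min(d[a][b] for a in d for b in d[a] if b != a)
    F = nx.DiGraph()
    F.add_node('s'); F.add_node('t')
    for p, q in G.edges:                         # internal edges, both directions
        bound = G[p][q]['w'] * d_min / 2         # feasibility bound (8)
        for u, v in ((p, q), (q, p)):
            if lab in (x[u], x[v]):
                cap = 0.0                        # never raise a selected label
            else:
                y_uv = y[(u, v, lab)] if (u, v, lab) in y else -y[(v, u, lab)]
                cap = max(0.0, bound - y_uv)     # remaining slack under (8)
            F.add_edge(u, v, capacity=cap)
    for p in G.nodes:                            # external edges: s->p or p->t
        gap = height(G, c, y, p, x[p]) - height(G, c, y, p, lab)
        if gap > 0:
            F.add_edge('s', p, capacity=gap)     # lab is below x_p: raise it
        elif gap < 0:
            F.add_edge(p, 't', capacity=-gap)    # lab is above x_p: lower it
        else:
            F.add_edge('s', p, capacity=1.0)     # x_p == lab: capacity 1 by convention
    _, flow = nx.maximum_flow(F, 's', 't')
    return {(u, v): flow[u][v]
            for u, v in F.edges if u not in ('s', 't') and v not in ('s', 't')}
```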

PD2 and PD3 Algorithms

PD2 is very similar to PD1, with minor variations. It was presented by Komodakis et al. because it turns out to be the general form of a state-of-the-art labeling algorithm called alpha-expansion. Unlike PD1, PD2 works only for metric distance functions. Furthermore, in each iteration of PD2 the solution is allowed to become infeasible, but it stays close to feasible; at the end of iterating, the solution is scaled by a factor to become feasible again, in a process called dual fitting. Also different from PD1 are the definitions of the edge capacities: instead of simply restricting flow across all edges by a factor of $d_{min}$, the edge capacities are based directly on the distance matrix. Consequently, the duals must be 'pre-edited' before performing max-flow, to ensure that (13) is satisfied. There are no other important differences between PD1 and PD2.

PD3 is an extension of PD2 to non-metric distance functions (thereby improving on the alpha-expansion technique). Specifically, PD3 must deal with the case where (5) is violated, i.e., $d(a,c) > d(a,b) + d(b,c)$. Komodakis et al. present a few variants of PD3 which address this issue in different ways. One possibility is to set the capacity of the internal edge at (a,c) to 0 in these cases. It turns out that if the flow then assigns p to b and q to c, we will have $y_{pq}(c) + y_{qp}(b) = w_{pq}\,(d(a,c) - d(a,b))$, an 'overestimation' of the separation cost between the two labels. In this case, the problem can be rectified with an extra fix-up step after label reassignment. Other approaches to handling a violation of the triangle inequality are similar.

Reduction from TSP to the Labeling Problem

Given a graph G(N,E) with a set of finite, positive weights over E, the Traveling Salesperson Problem (TSP) is the problem of finding the shortest tour of all nodes, i.e., a Hamiltonian cycle minimizing the summed weight of all edges traversed. Here we assume that TSP is over a complete graph. Given an instance of TSP, P1, we can generate an instance of the labeling problem, P2, such that a solution of P2 is optimal if and only if the corresponding solution of P1 is optimal. The PD1 and PD3 algorithms can then be used to find approximate solutions to TSP.

The general idea is to construct our labeling problem instance so as to encourage labels to form a chain along the best edges of a tour, and to penalize non-adjacent labels from sitting next to one another. We cannot directly prevent labels from forming a non-tour, but we can construct the instance so that it is always non-optimal not to form a tour.

We first generate a new graph G' from G as follows. Add all nodes from G to G'. For each edge $e_i$ in G, add a path of two edges, $e_{i,1}$ and $e_{i,2}$, between the corresponding nodes (figure 1); G' thus has $n + n(n-1)/2$ nodes. We call a node in G' corresponding to a node in G an original node, and the remaining nodes of G' its new nodes. Given the largest weight $w_{max}$ in G, set new weights on G':

$$w'(e_{i,1}) = w'(e_{i,2}) = \big[\,w(e_i) - w_{max} - 1\,\big]/2$$

The idea is to shift the range of weights down below zero and divide each weight between the two edges between original nodes. We now have only negative weights in G', and the negative weights of greatest magnitude are the best to traverse. We have done this because $d(a,a) = 0$ and $d \ge 0$, so with positive weights the optimal labeling would assign every vertex the same label, which is not useful. A sketch of the construction follows.
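The construction in Python (networkx; the naming of the new nodes is our own convention):

```python
import networkx as nx

def tsp_to_labeling_graph(G):
    """Build G' from a complete weighted graph G (edge attribute 'w').

    Each edge e_i = (u, v) of G becomes a two-edge path u - m_i - v through
    a fresh new node m_i, with both halves carrying the shifted weight
    w'(e_i1) = w'(e_i2) = (w(e_i) - w_max - 1) / 2, which is negative."""
    w_max = max(data['w'] for _, _, data in G.edges(data=True))
    Gp = nx.Graph()
    Gp.add_nodes_from(G.nodes)                   # original nodes
    for u, v, data in G.edges(data=True):
        m = ('new', u, v)                        # the new node for edge (u, v)
        half = (data['w'] - w_max - 1) / 2       # strictly negative
        Gp.add_edge(u, m, w=half)
        Gp.add_edge(m, v, w=half)
    return Gp                                    # n + n(n-1)/2 nodes for complete G
```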

Figure 1: An example of a graph G and its counterpart G'.

Next, we must construct a matrix of distances d between labels which incentivizes chains of labels; 2n+1 labels are required to achieve this. The distance between any label and itself is 0. (This is a necessary constraint for distance functions, even non-metric ones, under the labeling framework of Komodakis et al.) The first 2n labels, hereafter referred to as chain labels, have distances defined as follows:

$$d(p,\ p+1 \bmod 2n) = d_{chain}, \qquad d(p, q) = d_{miss} \quad \text{for } q = 1,\dots,2n,\ q \ne p+1 \bmod 2n$$

where $d_{chain}$ is the maximum distance. Note that two adjacent chain labels on adjacent nodes have maximum distance, and thus minimize the value of the objective function over the edge between them, due to our negative weights (figure 2a). The remaining label, called the filler label, has distances:

$$d_{miss} < d(2n{+}1,\ l) = d_{filler} < d_{chain} \quad \text{for } l = 1,\dots,2n, \qquad d(2n{+}1,\ 2n{+}1) = 0$$

The actual values of $d_{chain}$, $d_{filler}$, and $d_{miss}$ will be discussed shortly. A sketch of the matrix construction follows the figure caption below.

Figure 2: a) Example of the distance matrix d(a,b) for G', where c = $d_{chain}$, f = $d_{filler}$ and m = $d_{miss}$. b) Example of a graph labeled with 2n different chain labels such that they form as many chain edges as possible (a c-tour). (Filler labels carry no numbers.)
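The matrix of figure 2a as a numpy array, up to our zero-based indexing (labels 0..2n−1 are the chain labels, with "next" taken cyclically; label 2n is the filler):

```python
import numpy as np

def chain_distance_matrix(n, d_chain, d_filler, d_miss):
    """(2n+1)x(2n+1) label-distance matrix for the TSP reduction.
    Requires d_miss < d_filler < d_chain."""
    k = 2 * n
    d = np.full((k + 1, k + 1), d_miss, dtype=float)   # default: missed
    d[k, :] = d[:, k] = d_filler                       # filler vs. anything
    for p in range(k):
        d[p, (p + 1) % k] = d_chain                    # adjacent chain labels
        d[(p + 1) % k, p] = d_chain                    # keep d symmetric
    np.fill_diagonal(d, 0.0)                           # d(a, a) = 0 is required
    return d
```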

Given a labeling of the nodes of G', we will call an edge between two adjacent chain labels a chain edge, and an edge between a filler node and any other node a filler edge. An edge between two non-adjacent chain nodes will be a missed edge. Intuitively, we want to maximize the number of chain edges over the best weights, use filler edges where necessary, and avoid missed edges, in order to minimize our objective function. It will be shown that by choosing the distances carefully, we can ensure that P1 is optimal iff P2 is optimal.

Proof: We will first define a c-tour as a path, only along chain edges, which visits all original nodes exactly once. We can see that this corresponds to a tour in G. So we need only show that an optimal labeling in G' must contain a c-tour which corresponds to the optimal tour in G. To ensure this, our distances need to be set so that two constraints hold:

1.) Any labeling containing a missed edge is less optimal than any labeling with no missed edges.
2.) The best labeling with no c-tour is less optimal than the worst labeling with a c-tour.

If 1.) and 2.) hold, then it follows that the best labeling contains a c-tour; the best labeling with a c-tour is then optimal and corresponds to the optimal tour in P1.

For now, let $d_{chain} = R\,d_{filler}$ and $d_{filler} = Q\,d_{miss}$ with $R, Q > 1$. We will first find constraints on R and Q such that 1.) holds. Because G is complete, it is impossible to have chain edges on every edge in G'. Thus, for any pair of edges between two non-adjacent nodes, we can either have two filler edges or one chain edge and one missed edge (see figure 2b). (We could have two missed edges, but that is clearly non-optimal.) So, for 1.) to hold, we need two filler edges to always be better than one chain edge and one missed edge. That is, for the two edges $e_{i,1}, e_{i,2}$ between any two original nodes, we need:

$$w'(e_{i,1})\,d_{filler} + w'(e_{i,2})\,d_{filler} < w'(e_{i,1})\,d_{chain} + w'(e_{i,2})\,d_{miss}$$
$$2\,w'(e_{i,1})\,d_{filler} < w'(e_{i,1})\,(d_{chain} + d_{miss})$$
$$2\,Q\,d_{miss} > R\,Q\,d_{miss} + d_{miss} \quad \text{(the inequality flips since } w'(e_{i,1}) < 0\text{)}$$
$$Q(2 - R) > 1 \qquad (14)$$

Now we must find values of R and Q such that 2.) is true while (14) holds. Since there are exactly 2n chain labels, and since we are not allowed to generate any missed edges, the only possible cycle of chain edges visits every original node once, and so is a c-tour; this c-tour has 2n chain edges. Any tree configuration of chain edges on G', i.e., one without any cycles, generates strictly fewer than 2n chain edges (at most 2n - 1, a property of any tree on a graph with 2n nodes). Thus, the only way to maximize the number of chain edges over G' is a c-tour labeling. We now choose R and Q such that by maximizing the number of chain edges, we also minimize the objective function. For any edge, we then only need:

$$w'(e_{i,j})\,d_{filler} > w'(e_{i,j})\,d_{chain} \;\Rightarrow\; 1 < R \qquad (15)$$

which we already had by definition, so we have not added any new constraints. We can now satisfy 1.) and 2.) by satisfying (14) and (15). For example, take $R = 3/2$ and $Q = 5/2$, and set $d_{miss} = 1$, $d_{filler} = 5/2$, $d_{chain} = 15/4$.
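The arithmetic behind (14) and (15) is easy to sanity-check with exact rationals, using the example values just chosen (a throwaway verification, not part of the construction):

```python
from fractions import Fraction as F

R, Q = F(3, 2), F(5, 2)
assert Q * (2 - R) > 1 and R > 1          # constraints (14) and (15)

d_miss = F(1)
d_filler = Q * d_miss                     # 5/2
d_chain = R * d_filler                    # 15/4

# Two filler edges must beat one chain edge plus one missed edge on any
# pair of (negative) half-edges; -3 stands in for an arbitrary w' < 0.
w = F(-3)
assert 2 * w * d_filler < w * d_chain + w * d_miss
```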

Given these distances, the optimal labeling of G' must contain a c-tour. This c-tour must be the optimal c-tour, and it corresponds to the optimal tour in G. Conversely, we have shown that an optimal tour in P1 corresponds to an optimal c-tour in P2 and implies the optimality of a labeling in P2. By constructing appropriate instances of P2, we can now use the PD1 and PD3 algorithms to approximate solutions to TSP.

We can see that we have essentially dualized the degree and subtour elimination constraints of the TSP. Thus, our approximate solutions may not be feasible tours in P1. However, as we have shown, any labeling corresponding to a tour has a lower associated objective value than any labeling not corresponding to a tour. So we should expect that the better our approximation gets, the more likely it is to correspond to a tour.

Results

As a proof of concept, PD1 has been implemented in Matlab and applied to the problem of image segmentation. Image segmentation is the problem of finding similar regions and/or regions of interest in an image. As is common in many image segmentation algorithms, each pixel in the image is represented by a node in the labeling graph, and the weights between nodes are defined by the intensity gradient between pixels; thus, neighbouring pixels with very different intensity values define strong boundaries. The distance matrix was defined in a non-metric way, such that different labels were desirable across strong image boundaries. Results are displayed in figure 3.
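For reference, the graph setup described here looks roughly as follows in Python (the implementation above was in Matlab; the 4-connected grid and the raw-gradient weights are our reading of the description, not the exact code):

```python
import numpy as np
import networkx as nx

def segmentation_graph(img):
    """Pixel-grid labeling graph from a grayscale image (2-D numpy array).

    Each pixel is a node; each 4-neighbour edge carries the intensity
    gradient as its weight w, so strong boundaries get large weights and
    a non-metric d can then make differing labels cheap across them."""
    h, w = img.shape
    G = nx.grid_2d_graph(h, w)                              # 4-connected grid
    for p, q in G.edges:
        G[p][q]['w'] = abs(float(img[p]) - float(img[q]))   # gradient weight
    return G
```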

Figure 3: a) Two grayscale images, top and bottom, to be segmented. b) The primal variables are given an initial labeling over a set of 3 labels each, shown in red, green and blue. c) The final output of the PD1 labeling algorithm.

The PD1 algorithm was able to find a reasonable segmentation in both cases, although not the optimal segmentation. This is clear in the second example of figure 3, where we would expect to see the center of the square segmented with the third label. Small examples were used because the naive implementation of the PD1 algorithm presented by Komodakis et al. is slow; to achieve the run times listed in their paper, optimizations must have been made. Furthermore, Komodakis et al. report calculated f values much lower than the stated f-approximation bounds. This may be due to the specific applications and data used in their tests, or perhaps better bounds are possible.

References

[1] N. Komodakis and G. Tziritas, "Approximate Labeling via Graph Cuts Based on Linear Programming," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.