Similar Elements and Metric Labeling on Complete Graphs

Pedro F. Felzenszwalb
Brown University
Providence, RI, USA
pff@brown.edu

arXiv:1803.08037 [cs.DS]
March 2018

We consider a problem that involves finding similar elements in a collection of sets. The problem is motivated by applications in machine learning and pattern recognition (see, e.g., [3]). Intuitively we would like to discover something in common among a collection of sets, even when the sets have empty intersection. A solution involves selecting an element from each set such that the selected elements are close to each other under an appropriate metric. We formulate an optimization problem that captures this notion and give an efficient approximation algorithm that finds a solution within a factor of 2 of the optimal solution.

The similar elements problem is a special case of the metric labeling problem defined in [2] and we also give an efficient 2-approximation algorithm for the metric labeling problem on complete graphs. Metric labeling on complete graphs generalizes the similar elements problem to include costs for selecting elements in each set.

The algorithms described here are similar to the center star method for multiple sequence alignment described in [1].

Beyond producing solutions with good theoretical guarantees, the algorithms described here are also practical. A version of the algorithm for the similar elements problem has been implemented and used to find objects in a collection of photographs [4].

1 Similar Elements

Let $X$ be a (possibly infinite) set and $d$ be a metric on $X$. Let $S_1, \ldots, S_n$ be $n$ finite subsets of $X$. The goal of the similar elements problem is to select an element from each set $S_i$ such that the selected elements are close to each other under the metric $d$. One motivation is for discovering something in common among the sets $S_1, \ldots, S_n$ even when they have empty intersection.

We formalize the problem as the minimization of the sum of pairwise distances among selected elements. Let $x = (x_1, \ldots, x_n)$ with $x_i \in S_i$. Define the similar elements objective as

$$c(x) = \sum_{1 \le i < j \le n} d(x_i, x_j). \tag{1}$$

Let $x^* = \arg\min_x c(x)$ be an optimal solution for the similar elements problem.

Optimizing $c(x)$ appears to be difficult, but we can define easier problems if we ignore some of the pairwise distances in the objective. In particular we define $n$ different star-graph objective functions as follows. For each $1 \le r \le n$ define the objective $c^r(x)$ to account only for the terms in $c(x)$ involving $x_r$,

$$c^r(x) = \sum_{j \ne r} d(x_r, x_j). \tag{2}$$

Let $x^r = \arg\min_x c^r(x)$ be an optimal solution for the optimization problem defined by $c^r(x)$. We can compute $x^r$ efficiently using a simple form of dynamic programming, by first computing $x^r_r$ and then computing $x^r_j$ for $j \ne r$,

$$x^r_r = \arg\min_{x_r \in S_r} \sum_{j \ne r} \min_{x_j \in S_j} d(x_r, x_j), \tag{3}$$

$$x^r_j = \arg\min_{x_j \in S_j} d(x^r_r, x_j). \tag{4}$$

Each of the $n$ star-graph objective functions leads to a possible solution. We then select from among the $n$ solutions $x^1, \ldots, x^n$ as follows,

$$\hat{r} = \arg\min_{1 \le r \le n} c^r(x^r), \tag{5}$$

$$\hat{x} = x^{\hat{r}}. \tag{6}$$

Theorem 1. The algorithm described above finds a 2-approximate solution for the similar elements problem. That is,

$$c(\hat{x}) \le 2c(x^*).$$

Proof. First note that

$$\sum_{r=1}^{n} c^r(x) = 2c(x).$$

Since the minimum of a set of values is at most the average, and $x^r$ minimizes $c^r(x)$,

$$\min_{1 \le r \le n} c^r(x^r) \le \frac{1}{n} \sum_{r=1}^{n} c^r(x^r) \le \frac{1}{n} \sum_{r=1}^{n} c^r(x^*) = \frac{2}{n} c(x^*).$$

By the triangle inequality, for any $r$ we have

$$c(x) = \sum_{1 \le i < j \le n} d(x_i, x_j) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} d(x_i, x_j) \le \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \bigl( d(x_i, x_r) + d(x_r, x_j) \bigr) = n \sum_{l=1}^{n} d(x_r, x_l) = n\, c^r(x).$$

Therefore

$$c(\hat{x}) \le n\, c^{\hat{r}}(\hat{x}) = n \min_{1 \le r \le n} c^r(x^r) \le 2c(x^*).$$

To analyze the running time of the algorithm we assume the distances $d(p, q)$ between pairs of elements in $S = S_1 \cup \cdots \cup S_n$ are either pre-computed and given as part of the input, or they can each be computed in $O(1)$ time. Let $k = \max_{1 \le i \le n} |S_i|$. The first stage of the algorithm involves $n$ optimization problems that can be solved in $O(nk^2)$ time each. The second stage of the algorithm involves selecting one of the $n$ solutions, and takes $O(n)$ time.
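The two stages described above, the star-graph minimizations (3) and (4) followed by the selection (5) and (6), can be sketched in Python. This is a minimal illustration rather than the implementation used in [4]; the function names `pairwise_cost` and `similar_elements`, and the representation of each set as a non-empty Python list with the metric passed in as a callable `d`, are assumptions made for the example.

```python
from itertools import combinations


def pairwise_cost(x, d):
    """Objective (1): sum of d over all unordered pairs of selected elements."""
    return sum(d(a, b) for a, b in combinations(x, 2))


def similar_elements(sets, d):
    """Star-graph 2-approximation sketch.

    sets: list of non-empty lists (hypothetical input format).
    d:    callable metric d(p, q), assumed to cost O(1) per call.
    Returns one selected element per set, in order.
    """
    n = len(sets)
    best_cost, best_x = None, None
    for r in range(n):
        # Value of c^r when the root element is xr and every other set
        # independently picks its element closest to xr.
        def star_cost(xr):
            return sum(min(d(xr, xj) for xj in sets[j])
                       for j in range(n) if j != r)

        # Equation (3): best root element for the star rooted at r.
        root = min(sets[r], key=star_cost)
        # Equation (4): each leaf takes its nearest neighbor to the root.
        x = [root if j == r else min(sets[j], key=lambda xj: d(root, xj))
             for j in range(n)]
        cost = star_cost(root)  # this is c^r(x^r)
        # Equations (5) and (6): keep the star solution of smallest c^r value.
        if best_cost is None or cost < best_cost:
            best_cost, best_x = cost, x
    return best_x
```

For example, with `d = lambda a, b: abs(a - b)` and `sets = [[0, 10], [1, 20], [2, 30]]`, the sketch selects `[0, 1, 2]`, whose objective value under (1) is 4. Note that the selection compares the star objectives $c^r(x^r)$, not the full objective $c$, which is what the proof of Theorem 1 relies on.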
Remark 2. If each of the sets $S_1, \ldots, S_n$ has size at most $k$ the running time of the approximation algorithm for the similar elements problem is $O(n^2 k^2)$.

The bottleneck of the algorithm is the evaluation of the minimizations over $x_j \in S_j$ in (3) and (4). This computation is equivalent to a nearest-neighbor computation, where we want to find a point from a set $S \subseteq X$ that is closest to a query point $q \in X$. When the nearest-neighbor computation can be done efficiently (with an appropriate data structure) the running time of the similar elements approximation algorithm can be reduced.

2 Metric Labeling on Complete Graphs

Let $G = (V, E)$ be an undirected simple graph on $n$ nodes $V = \{1, \ldots, n\}$. Let $L$ be a finite set of labels with $|L| = k$ and $d$ be a metric on $L$. For $i \in V$ let $m_i$ be a non-negative function mapping labels to real values. The unweighted metric labeling problem on $G$ is to find a labeling $x = (x_1, \ldots, x_n) \in L^n$ minimizing

$$c(x) = \sum_{i \in V} m_i(x_i) + \sum_{\{i,j\} \in E} d(x_i, x_j). \tag{7}$$

Let $x^* = \arg\min_x c(x)$. This optimization problem can be solved in polynomial time using dynamic programming if $G$ is a tree. Here we consider the case when $G$ is the complete graph and give an efficient 2-approximation algorithm based on the solution of several metric labeling problems on star graphs.

For each $r \in V$ define a different objective function, $c^r(x)$, corresponding to a metric labeling problem on a star graph with vertex set $V$ rooted at $r$,

$$c^r(x) = \sum_{i \in V} \frac{2 m_i(x_i)}{n} + \sum_{j \in V \setminus \{r\}} d(x_r, x_j). \tag{8}$$

Let $x^r = \arg\min_x c^r(x)$. We can solve this optimization problem in $O(nk^2)$ time using a simple form of dynamic programming. First compute an optimal label for the root vertex using one step of dynamic programming,

$$x^r_r = \arg\min_{x_r \in L} \left[ \frac{2 m_r(x_r)}{n} + \sum_{j \in V \setminus \{r\}} \min_{x_j \in L} \left( \frac{2 m_j(x_j)}{n} + d(x_r, x_j) \right) \right]. \tag{9}$$

Then compute $x^r_j$ for $j \in V \setminus \{r\}$,

$$x^r_j = \arg\min_{x_j \in L} \left( \frac{2 m_j(x_j)}{n} + d(x^r_r, x_j) \right). \tag{10}$$

Optimizing each $c^r(x)$ separately leads to $n$ possible solutions $x^1, \ldots, x^n$, and we select one of them as follows,

$$\hat{r} = \arg\min_{r \in V} c^r(x^r), \tag{11}$$

$$\hat{x} = x^{\hat{r}}. \tag{12}$$
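The star-graph computations (9) and (10) and the selection (11) and (12) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the author's code: the names `labeling_cost` and `metric_labeling_complete` are hypothetical, the unary costs are assumed to be given as a list `m` of dictionaries with `m[i][label]` non-negative, and the metric is a callable `d` on labels.

```python
from itertools import combinations


def labeling_cost(x, m, d):
    """Objective (7) on the complete graph: unary costs plus all pairwise distances."""
    unary = sum(m[i][x[i]] for i in range(len(x)))
    return unary + sum(d(a, b) for a, b in combinations(x, 2))


def metric_labeling_complete(m, labels, d):
    """Star-graph 2-approximation sketch for metric labeling on a complete graph.

    m:      list of dicts, m[i][label] >= 0 (hypothetical input format).
    labels: list of labels (the set L).
    d:      callable metric on labels.
    Returns a list with one label per node.
    """
    n = len(m)
    scale = 2.0 / n  # unary costs are weighted by 2/n in the star objective (8)
    best = None
    for r in range(n):
        # Cost contributed by leaf j when the root has label xr and j has label xj.
        def leaf_cost(j, xr, xj):
            return scale * m[j][xj] + d(xr, xj)

        # Value of c^r for root label xr with every leaf chosen optimally.
        def root_cost(xr):
            return scale * m[r][xr] + sum(
                min(leaf_cost(j, xr, xj) for xj in labels)
                for j in range(n) if j != r)

        # Equation (9): optimal label for the root.
        root = min(labels, key=root_cost)
        # Equation (10): optimal label for each leaf given the root label.
        x = [root if j == r
             else min(labels, key=lambda xj, j=j: leaf_cost(j, root, xj))
             for j in range(n)]
        cost = root_cost(root)  # this is c^r(x^r)
        # Equations (11) and (12): keep the star solution of smallest c^r value.
        if best is None or cost < best[0]:
            best = (cost, x)
    return best[1]
```

When every unary cost is zero the sketch assigns a single label to all nodes, since the star objective then reduces to the similar elements case with identical sets. Each root costs $O(nk^2)$ distance evaluations in this form, matching the per-star bound stated in the text.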
Theorem 3. The algorithm described above finds a 2-approximate solution for the metric labeling problem on a complete graph. That is,

$$c(\hat{x}) \le 2c(x^*).$$

Proof. First note that

$$\sum_{r \in V} c^r(x) = 2c(x).$$

Since the minimum of a set of values is at most the average, and $x^r$ minimizes $c^r(x)$,

$$\min_{r \in V} c^r(x^r) \le \frac{1}{n} \sum_{r \in V} c^r(x^r) \le \frac{1}{n} \sum_{r \in V} c^r(x^*) = \frac{2}{n} c(x^*).$$

Since $d$ is a metric and $m_i$ is non-negative, for any $r \in V$,

$$
\begin{aligned}
c(x) &= \sum_{i \in V} m_i(x_i) + \sum_{\{i,j\} \in E} d(x_i, x_j) \\
&= \sum_{i \in V} m_i(x_i) + \frac{1}{2} \sum_{(i,j) \in V^2} d(x_i, x_j) \\
&\le \sum_{i \in V} m_i(x_i) + \frac{1}{2} \sum_{(i,j) \in V^2} \bigl( d(x_i, x_r) + d(x_r, x_j) \bigr) \\
&= \sum_{i \in V} m_i(x_i) + n \sum_{l \in V \setminus \{r\}} d(x_r, x_l) \\
&\le 2 \sum_{i \in V} m_i(x_i) + n \sum_{l \in V \setminus \{r\}} d(x_r, x_l) \\
&= n\, c^r(x).
\end{aligned}
$$

Therefore

$$c(\hat{x}) \le n\, c^{\hat{r}}(\hat{x}) = n \min_{r \in V} c^r(x^r) \le 2c(x^*).$$

The first stage of the algorithm involves $n$ optimization problems that can be solved in $O(nk^2)$ time each. The second stage involves selecting one of the $n$ solutions, and takes $O(n)$ time.

Remark 4. The running time of the approximation algorithm for the metric labeling problem on complete graphs is $O(n^2 k^2)$.

Acknowledgments

We thank Caroline Klivans, Sarah Sachs, Anna Grim, Robert Kleinberg and Yang Yuan for helpful discussions about the contents of this report. This material is based upon work supported by the National Science Foundation under Grant No. 1447413.
References

[1] Dan Gusfield. Efficient methods for multiple sequence alignment with guaranteed error bounds. Bulletin of Mathematical Biology, 55(1):141-154, 1993.

[2] Jon Kleinberg and Eva Tardos. Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields. Journal of the ACM, 49(5):616-639, 2002.

[3] Oded Maron and Aparna Lakshmi Ratan. Multiple-instance learning for natural scene classification. In International Conference on Machine Learning, volume 98, pages 341-349, 1998.

[4] Sarah Sachs. Similar-part approximation using invariant feature descriptors. Undergraduate Honors Thesis, Brown University, 2016.