
On the Approximation of Largest Common Subtrees and Largest Common Point Sets

Tatsuya Akutsu (1) and Magnus M. Halldorsson (2)

(1) Department of Computer Science, Gunma University, Kiryu, Gunma 376, Japan
(2) School of Information Science, Japan Advanced Institute of Science and Technology, Tatsunokuchi, Ishikawa 923-12, Japan

Abstract. This paper considers the approximability of the largest common subtree and the largest common subgraph problems, which have applications in molecular biology. It is shown that approximating the problems within a factor of $n^{\epsilon}$ is NP-complete, while a general search algorithm which approximates both problems within a factor of $O(n/\log n)$ is presented. Moreover, several variants of the largest common subtree problem are studied.

1 Introduction

In computational biology and chemistry, there is a frequent need to extract a common pattern from multiple data. A good example is the automatic extraction of a common pattern from multiple amino acid sequences, which has been studied extensively from both theoretical and practical viewpoints.

We consider two such applications. One involves finding a common pattern from multiple molecules. Several methods have been developed for finding the largest common connected subgraph of multiple graphs, and have been applied to study the relationship between structures and chemical activities [11]. The other application is finding common substructures of a collection of 3-dimensional protein structures. Many methods have been given for finding common substructures of two protein structures [10, 12], and Russel and Barton have developed a method for multiple protein structures and have applied it to multiple sequence alignment [10]. However, neither of these two applications has seen many studies from a theoretical perspective. This paper gives a theoretical analysis of the complexity of finding nearly optimal solutions for the above-mentioned problems.
The problems are formalized under the names largest common subtree problem (LCST) and largest common point set problem (LCP). Given a collection of trees, possibly with labels on the vertices, the LCST problem is to find a tree of maximum size that is isomorphic to a subtree of each of the input trees. It is a special case of a problem considered in [11]. We are particularly interested in the bounded-degree case, since the maximum vertex degree is bounded by a constant in all molecules. Given a collection of geometric point sets, the LCP problem is to find a largest set of points that is congruent to a subset of each input point set. It is closely related to the problem of finding common substructures of multiple protein structures, since 3-dimensional protein structures are frequently treated as point sequences.

Our main result is that both problems are very hard to approximate when the number of input sets (trees/point sets) is unbounded. In particular, it is NP-complete to approximate the problems within a factor of $n^{\epsilon}$ for some $\epsilon > 0$, where $n$ is the size of the smallest input set. This holds for both labeled and unlabeled trees of bounded degree (LCST), and for one-dimensional point sets (LCP). From the other direction, we present a general search algorithm that yields a non-trivial performance ratio for both problems, namely $O(n/\log n)$.

We further consider several variants of the LCST problem, both in terms of the format of the input trees (e.g. unbounded/bounded number of input trees, labels, and degree; ordered trees), and in terms of the definition used for a subtree (e.g. induced subgraph; required to include the root; descendants inheriting membership), and prove hardness and approximability results.

Let us briefly review related results. LCST for two trees can be solved in polynomial time [6] and LCST for three trees is NP-hard [1]. LCST can be solved in polynomial time if the maximum vertex degree is bounded and the number of input trees is fixed [1]. Zhang and Jiang have shown that computing the largest tree that can be obtained from two unordered trees via few editing operations is MAX SNP-hard [13]. Approximate and exact matching problems between two point sets have been studied extensively in computational geometry, but we are not aware of results on larger collections of point sets.

2 The Hardness of Approximating Common Subtrees

In this and the following two sections, we consider the largest common subtree (LCST) problem.
LCST is defined formally as follows: given a collection of trees $TS = \{T_1, \ldots, T_K\}$, find a common tree with the maximum number of edges, where a tree $T$ is called a common tree of $TS$ if $T$ is isomorphic to a subtree (i.e. a connected induced subgraph) of each $T_i$. Let $n$ denote the size of the smallest tree of the input, $K$ the number of trees, and $\Delta$ the maximum degree of a vertex.

An approximation algorithm $A$ for a maximization problem approximates the optimal cost $opt(X)$ within a factor of $f(n)$ if, for all instances $X$ of the problem of size $n$, $opt(X)/A(X) \le f(n)$, where $A(X)$ is the size of the solution produced by the algorithm.

An independent set of a graph $G(V, E)$ is a set $I$ of mutually non-adjacent vertices, i.e. satisfying $(\forall v_i, v_j \in I)(\{v_i, v_j\} \notin E)$. Arora et al. proved that for some $\epsilon > 0$, the size of the maximum independent set of a graph cannot be approximated within a factor of $n^{\epsilon}$ in polynomial time unless P=NP [3].

We show here that LCST cannot be approximated within a factor of $n^{\epsilon}$ unless P=NP, by giving an approximation-preserving reduction from the Independent Set problem. We consider directed (i.e. rooted) trees, although the same argument also holds for undirected trees. First, consider the case when the number of labels is unbounded.
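
As a small illustration of the two definitions above, the following Python sketch checks the independent set condition, finds a maximum independent set by brute force (exponential time, so only for tiny graphs), and evaluates the ratio $opt(X)/A(X)$. The function names and the 5-cycle example are assumptions made for illustration; they do not appear in the paper.

```python
from itertools import combinations

def is_independent(vertices, edges):
    """True if no two vertices in `vertices` are joined by an edge."""
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset(p) not in edge_set for p in combinations(vertices, 2))

def max_independent_set(n, edges):
    """Brute-force maximum independent set of a graph on vertices 0..n-1."""
    for size in range(n, 0, -1):
        for cand in combinations(range(n), size):
            if is_independent(cand, edges):
                return set(cand)
    return set()

def performance_ratio(opt_size, found_size):
    """The quantity opt(X)/A(X) for a maximization problem."""
    return opt_size / found_size

# Hypothetical example: a cycle on five vertices.
cycle5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
best = max_independent_set(5, cycle5)
# Ratio achieved by a (hypothetical) algorithm that returns a single vertex.
ratio = performance_ratio(len(best), 1)
```

An algorithm approximates the problem within a factor of $f(n)$ exactly when this ratio never exceeds $f(n)$ over all instances of size $n$.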

Theorem 1. For some $\epsilon > 0$, LCST for labeled trees cannot be approximated within a factor of $n^{\epsilon}$ unless P=NP.

Proof. We show that if LCST can be approximated within a factor of $n^{\epsilon}$, then Maximum Independent Set can be approximated within a factor of $O(N^{2\epsilon})$ on graphs with $N$ vertices. The theorem then follows from [3]. The reduction is similar to that of [8], with an additional technique introduced.

Given a graph $G$, where $V(G) = \{v_1, \ldots, v_N\}$ and without loss of generality $N = |V(G)| = 2^h$ for some integer $h$, we construct a collection $TS$ of complete binary trees $\{T_1, T_2, \ldots, T_{N+1}\}$. The last tree $T_{N+1}$ is of height $h + H$, where $H$ will be specified later. Assuming some order among the children of each node, let $(w_1, w_2, \ldots, w_N)$ be the sequence of nodes at depth $h$ (with the root at depth 0), with node $w_j$ labeled by $j$. Other nodes in the tree are labeled by 0. The other trees $T_i$, $1 \le i \le N$, are of height $h + H + 1$ and consist of two subtrees $R_i$ and $S_i$ of height $h + H$, connected by an additional vertex. The vertices at height $h$ in each subtree, $(r^i_1, r^i_2, \ldots, r^i_N)$ and $(s^i_1, s^i_2, \ldots, s^i_N)$, are labeled by

$$\mathrm{label}(r^i_j) = \begin{cases} j, & \text{if } j = i \lor \{v_i, v_j\} \notin E,\\ -2, & \text{otherwise,} \end{cases} \qquad \mathrm{label}(s^i_j) = \begin{cases} -2, & \text{if } j = i,\\ j, & \text{otherwise,} \end{cases}$$

with all other nodes of the trees labeled by 0. Intuitively, the labels of vertices $r^i_j$ encode the adjacency list of graph vertex $v_i$.

Observe that a common subtree $T$ will be contained in either the $R_i$ or the $S_i$ portion of each input tree. A maximal common subtree will contain all the $L = 2^{H+1} - 1$ descendants of the non-zero labeled vertices it contains, as well as all the $N - 1$ ancestors of non-zero labeled vertices in $T_{N+1}$, $R_i$ or $S_i$. The vertices in $T$ with positive labels describe exactly an independent set in the graph. Namely, we can construct an independent set $I$ by

$$I = \{\, v_i \mid (\exists w \in V(T))(\mathrm{label}(w) = i) \,\}.$$

Thus, given a common subtree of size at least $kL + N - 2$, we obtain an independent set of size $k$.
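
The labeling rule of this reduction can be checked mechanically on a small graph. The Python sketch below is an abstraction of the construction, not the construction itself: it ignores the binary tree structure and only asks whether a set of positive labels occurs entirely inside $R_i$ or entirely inside $S_i$ for every $i$, which is the condition the proof relies on. The names `tree_labels` and `labels_fit_all_trees`, the example graph, and the use of $-2$ as the non-matching label (any value outside $0, \ldots, N$ would serve) are assumptions made for illustration.

```python
from itertools import combinations

NO_MATCH = -2  # assumed special label; it never equals a label 1..N or 0

def tree_labels(i, n, edges):
    """Labels of r^i_1..r^i_N and s^i_1..s^i_N for input tree T_i (1-based)."""
    edge_set = {frozenset(e) for e in edges}
    r = {j: (j if j == i or frozenset((i, j)) not in edge_set else NO_MATCH)
         for j in range(1, n + 1)}
    s = {j: (NO_MATCH if j == i else j) for j in range(1, n + 1)}
    return r, s

def labels_fit_all_trees(chosen, n, edges):
    """True if, for every T_i, all chosen labels occur in R_i or all occur in S_i."""
    for i in range(1, n + 1):
        r, s = tree_labels(i, n, edges)
        in_r = all(r[j] == j for j in chosen)
        in_s = all(s[j] == j for j in chosen)
        if not (in_r or in_s):
            return False
    return True

def is_independent(chosen, edges):
    """True if no two chosen vertices are adjacent."""
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset(p) not in edge_set for p in combinations(chosen, 2))

# Hypothetical example: path 1 - 2 - 3 plus an isolated vertex 4 (N = 4 = 2^2).
n, edges = 4, [(1, 2), (2, 3)]
agree = all(
    labels_fit_all_trees(c, n, edges) == is_independent(c, edges)
    for size in range(1, n + 1)
    for c in combinations(range(1, n + 1), size)
)
```

On this example the label sets that fit every tree are exactly the independent sets, which is the correspondence between common subtrees and independent sets used in the proof.
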
Furthermore, this can be mapped back; thus, the independence number of the graph is $m$ iff the LCST size is $mL + N - 2$. Now let $H = h$, thus $L = 2N - 1$. Then the approximation ratio of the Independent Set instance is bounded by

$$\frac{m}{k} \le 2\left(\frac{mL + N - 2}{kL + N - 2}\right) \le 2n^{\epsilon} = O(N^{2\epsilon}),$$

using that $n = |E(T_{N+1})| = NL + N - 2 \le 2N^2$, where $T_{N+1}$ is the last (and smallest) tree of the construction.

We now modify the construction for unlabeled trees.

Theorem 2. For some $\epsilon > 0$, LCST for unlabeled trees cannot be approximated within a factor of $n^{\epsilon}$ unless P=NP.

[Fig. 1. Forms of trees constructed in Thm. 2. Recoverable panel labels: trees $T$ and $T_i$; nodes $w_1, w_2, \ldots, w_N$ reached by paths of lengths $c\log n$, $2c\log n$, ..., $Nc\log n$; a complete binary tree of size $L$ below such a node.]

Proof. The proof is similar to that of Thm. 1, while the form of the constructed trees is modified as shown in Fig. 1. Vertices $r^j_i$ (and similarly $s^j_i$, $w_i$) are each replaced by a path of length $ic\log N$, starting with a new node $w^j_i$ and ending with node $\hat{w}^j_i$. Here, $c$ is some fixed integer satisfying $c\log N > 2\log N + \lceil \log L \rceil + 1$. The descendants of $\hat{w}^j_i$ are deleted if $\mathrm{label}(w^j_i) = -2$ in the proof of Thm. 1. Let $q$ denote the size of the tree excluding the descendants of the $\hat{w}$-nodes, i.e. $q = cN(N-1)\log N + 2N - 2$. It is easy to see that the size of LCST is $mL + q$ if and only if the size of the maximum independent set is $m$. Moreover, this can be obtained constructively. Given a common subtree $T$ with at least $kL + q$ edges, we extract an independent set $I$ with at least $k$ vertices as follows: $v_i \in I$ iff there are at least two leaves $w_1$ and $w_2$ of $T$ whose depth modulo $c\log N$ is $i$ and whose lowest common ancestor is of depth at least $ic\log N$. Then for $L = \Theta(N^2 \log N)$, the approximation ratio of the independent set instance is bounded by $m/k < 2(mL + q)/(kL + q)$, which is at most $n^{O(\epsilon)}$.

3 Algorithms for Approximating Common Subtrees

We now consider some polynomial time algorithms that find common subtrees of guaranteed size. Let us initially consider unlabeled trees.

One approach is to look for simple types of subtrees, for instance paths. Compute the longest path of each tree, and use the shortest of those as the common subtree solution. If the least maximum degree of the trees is bounded by $\Delta$, then there exists a path in each tree with at least $\log_\Delta n = \log n/\log \Delta$ vertices. An $O(n/\log_\Delta n)$ performance ratio then follows.

A similar approach can be used on high degree trees. Find the largest star in each tree, and let the smallest, of size $\Delta + 1$, be the common subtree solution. Combine this with the above path and use the larger solution, of size $\max(\Delta + 1, \log_\Delta n) \ge \log n/\log\log n$. A performance ratio of $O(n \log\log n/\log n)$ is then automatic.

Observe that in both cases we solved optimally a restricted version of the problem, where the solution is required to belong to one of these special subclasses of trees. Paths and stars are linearly ordered in terms of containment and thus easily solvable. More generally, we can solve this restricted problem for any family of trees that is polynomially bounded, or whose collection of maximal solutions is polynomially bounded.

We now turn our attention to a different approach that holds also for labeled trees. We consider a simple but general search algorithm, previously used for graph coloring approximation [4, 7]. It is applicable to any hereditary subgraph/subset problem, i.e. for finding a subgraph satisfying a property X which is also satisfied by any subsolution. The algorithm finds a subgraph of size $m = \lceil \log_k |V| \rceil$ whenever the graph contains an X-subgraph of size $\lceil |V|/k \rceil$.

The Common Subtree problem calls for the solution to be a tree, which is not a hereditary property. Nevertheless, because the input is a tree, there is a unique way to extend a subset of the vertices or edges of a tree into a proper subtree.

    CommonSubtreeSearch($TS = (T_1, T_2, \ldots, T_K)$, $k$)
    { $TS$ contains a common subtree on $\lceil n/k \rceil$ nodes }
        Partition $V(T_1)$ into subsets $B_1, B_2, \ldots, B_{\lceil n/(km) \rceil}$ of size at most $km$ each.
        for each $B_i$ do
            for each $S \subseteq B_i$ of $m$ elements do
                $S' \leftarrow$ unique extension of $S$ to a subtree in $T_1$
                if $S'$ is a common subtree of $TS$ then return $S'$; halt
            od
        od
        return "$TS$ did not contain a common subtree on $\lceil n/k \rceil$ nodes"
    end

To verify the correctness of the algorithm, suppose there is a common subtree $T$ of size $\lceil n/k \rceil$. By the pigeonhole principle, some $B_i$ must contain at least $\lceil n/k \rceil / \lceil n/(km) \rceil$ vertices of $T$, which is $m$ up to rounding. The unique extension of any $m$ of these vertices lies inside $T$ and is therefore itself a common subtree. Since we search each $B_i$ exhaustively, the search must succeed. As for the complexity, the number of repetitions of the inner loop per block is at most

$$\binom{km}{m} \le (ek)^m = e^{O(\log n)} = n^{O(1)}.$$
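
The search can be sketched in Python for unlabeled trees given as adjacency dictionaries. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `spanning_subtree`, `embeds`, `common_subtree_search` and `approx_lcst` are hypothetical, vertices are assumed to be sortable integers, and `embeds` is a brute-force backtracking test standing in for the efficient common-subtree check that the analysis assumes, so the sketch is only meant for very small inputs. The helper `approx_lcst` simply tries $k = 1, 2, 3, \ldots$ until the search first succeeds.

```python
from itertools import combinations

def spanning_subtree(tree, subset):
    """Vertices of the unique minimal subtree of `tree` containing `subset`."""
    keep = set(tree)
    deg = {v: len(tree[v]) for v in tree}
    stack = [v for v in tree if deg[v] <= 1 and v not in subset]
    while stack:  # repeatedly prune leaves that are not in `subset`
        v = stack.pop()
        if v not in keep:
            continue
        keep.discard(v)
        for u in tree[v]:
            if u in keep:
                deg[u] -= 1
                if deg[u] <= 1 and u not in subset:
                    stack.append(u)
    return keep

def induced(tree, verts):
    """Adjacency dictionary of the subgraph of `tree` induced by `verts`."""
    return {v: {u for u in tree[v] if u in verts} for v in verts}

def embeds(pattern, host):
    """Brute-force test: is `pattern` isomorphic to a subtree of `host`?"""
    root = next(iter(pattern))
    parent, order = {root: None}, [root]
    for v in order:  # BFS order, so every parent precedes its children
        for u in pattern[v]:
            if u not in parent:
                parent[u] = v
                order.append(u)
    image, used = {}, set()

    def extend(i):
        if i == len(order):
            return True
        v = order[i]
        # The root may go anywhere; other vertices go next to their parent's image.
        options = host if parent[v] is None else host[image[parent[v]]]
        for h in options:
            if h not in used:
                image[v] = h
                used.add(h)
                if extend(i + 1):
                    return True
                used.discard(h)
                del image[v]
        return False

    return extend(0)

def common_subtree_search(trees, k):
    """Search T_1 for a common subtree, given the promise of one on ceil(n/k) nodes."""
    t1, n = trees[0], len(trees[0])
    m = n if k == 1 else 1
    while k > 1 and k ** m < n:  # m = ceil(log_k n), at least 1
        m += 1
    verts = sorted(t1)
    for start in range(0, n, k * m):  # blocks B_i of size at most km
        for subset in combinations(verts[start:start + k * m], m):
            cand = induced(t1, spanning_subtree(t1, set(subset)))
            if all(embeds(cand, t) for t in trees[1:]):
                return cand
    return None

def approx_lcst(trees):
    """Try k = 1, 2, 3, ... until the search first succeeds."""
    trees = sorted(trees, key=len)  # T_1 is a smallest tree
    for k in range(1, len(trees[0]) + 1):
        found = common_subtree_search(trees, k)
        if found is not None:
            return found

# Hypothetical example: a path and a star, each on four vertices.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
result = approx_lcst([path, star])
```

On this example the search fails for $k = 1$ (the whole path does not embed in the star) and succeeds for $k = 2$ with a single edge, although the optimum is a path on three vertices. This is consistent with the guarantee, which only promises a solution built from $m = \lceil \log_k n \rceil$ vertices.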

Since we can efficiently check whether a given tree is a common subtree, it follows that the algorithm runs in polynomial time.

The randomized version of this method is particularly appealing: repeatedly select a random subset of size $m$ and verify whether it satisfies the property. By a similar analysis, it succeeds within a number of iterations polynomial in $n$, with high probability.

In the absence of knowing $k$ or the size of the optimal solution, we run CommonSubtreeSearch for $k = 1, 2, 3, \ldots$ until the first success is encountered. The size of the solution will then be $\log_{n/|OPT|} n$, where $OPT$ is the size of the optimal solution, and the performance ratio is $O(n/\log n)$.

Corollary 3. LCST can be approximated within a factor of $O(n/\log n)$.

4 Variants of LCST

We now consider several variations on the theme of finding a largest common subtree of a collection of trees. The variations involve on one hand the structure of the input trees, and on the other hand the definition of the concept of a 'subtree'.

The input can be restricted in three main ways by bounding crucial parameters: A) number of labels, B) maximum degree, and C) number of input trees. The previous hardness result already holds for the first two parameters optimally restricted: one (or zero) label, and degree at most two (at most two children). Incidentally, if the maximum degree is two, the problem reduces to the well-known Largest Common Substring problem.

In the case of a constant number of trees, the problem can be solved in polynomial time if the degree is also bounded, even for labeled trees [1]. The reduction of [1] that proves the NP-hardness for 3 trees of unbounded degree can be seen to prove MAX SNP-hardness as well, showing that the problem does not have a polynomial time approximation scheme.

Another variation of the input format concerns whether there is an order among the children that must be preserved. This is largely a concern for unlabeled graphs, since an ordering can always be simulated with labels.
Again, the previous hardness result holds.

Corollary 4. LCST is as hard to approximate as the Independent Set problem, independent of the number of labels, maximum degree, and ordering constraints. When the number of trees is bounded, it is solvable in polynomial time on bounded-degree inputs, but MAX SNP-hard for unbounded degree.

Now consider variations of the definition of the term subtree. The canonical definition that we have used is that of a tree induced by a set of vertices/edges. Because the input graphs are also trees, this coincides both with the definition of an induced connected subgraph and with subtrees uniquely obtained by specifying only the leaves.

Another natural version asks for an induced (not necessarily connected) subgraph, i.e. a forest. This comes in two flavors: induced by a set of edges, or by a set of vertices.

The edge (vertex) subset variant can be approximated within a factor of $\Delta + 1$ ($\Delta^2 + 1$, respectively) on unlabeled graphs with maximum degree $\Delta$ by the following simple approach: pick an edge in each tree and add it to the solution; remove the edge and all incident edges (respectively, remove the edge, its incident edges and the edges incident on those, i.e. all edges within distance 2); iterate until some tree is empty. Because at each step we add one edge to the solution but remove at most $\Delta + 1$ ($\Delta^2 + 1$) edges, this ratio is maintained. This can be generalized to trees with $L$ distinct labels by applying the above method, for each label, to the collection of forests induced by the edges with that label, and using the largest result.

This can be improved and generalized to graphs of unbounded degree by choosing disjoint stars instead of edges. We focus here on the edge subset version. We process each tree by repeating the following operation: pick an internal vertex whose children are all leaves; keep the star rooted at that vertex; delete the vertex, its children and all incident edges (including the edge to the vertex's parent) from the tree. The result for each tree is a collection of stars, which can be compactly represented as a multiset of integers. We now find a maximum common collection of substars from the collections of all the trees. The main argument is that for each star we select, we remove only that star and one additional edge, which for the current purpose can be thought of as a degenerate star. The star we select matches both of these; thus we match at least half of the optimal solution, for a performance ratio of 2.

In the case of trees with unbounded labels and unbounded degree, we can prove a strong hardness result by modifying the construction of the proof of Thm. 1. Eliminate all the unlabeled nodes (nodes with label 0), and make all $w^i_j$ nodes children of the nearest $r$ node.
A common subtree will now contain a set of $w$ nodes corresponding to an independent set and a single additional $r$ node. Hence, any approximation for these instances will lead to equally good approximations for the independent set problem. We have the following result.

Theorem 5. When a subtree is defined as a subgraph induced by a set of edges, the Common Subtree problem can be approximated within a factor of $2L$ on inputs with unbounded degree, unbounded number of input trees, and $L$ labels. If the labels are also unbounded, the problem is as hard to approximate as the Independent Set problem.

Finally, we consider two additional subtree variations. In one, the subtree must contain all the vertices headed by a given node $w$. That is, the subtree must consist of one of the two parts that appear when some edge of the tree is removed. This case turns out to be easy to solve because there are only $2n$ distinct such subtrees of a tree on $n$ vertices. This holds for all the input variations above. This can be generalized to leveled subtrees, where the common subtree must contain precisely the vertices of level at most $t$ below the node $w$. This includes the case when we restrict the solution to be a complete $(\Delta - 1)$-ary tree.

The other case is a slight modification of the canonical case, where the input graphs are rooted and the subtrees are required to contain the root. The hardness results from the previous section hold equally for this case by a slight modification of the construction. We modify the last tree of the construction (the one of height $h + H$) by adding a new root adjacent only to the old root. The new root will match the roots of the other trees, while the old root will match the root of either $R_i$ or $S_i$. As before, this holds for unlabeled trees of unbounded degree. However, in the case of ordered trees, or when the labels of children must be distinct, the problem can be solved by simple recursive checking.

Corollary 6. The LCST problem is hard to approximate even if the common subtree must match the roots of the trees, but can be solved in polynomial time if the subtrees are required to contain all the descendants of the highest vertex.

5 Largest Common Point Set Problem

We now consider the largest common point set problem (LCP), defined as follows: given a collection of $D$-dimensional point sets $SS = \{S_1, \ldots, S_K\}$, find a common point set with the maximum number of elements, where a point set $C$ is a common point set of $SS$ if $C$ is congruent to a subset of each $S_i$. We assume $D$ is fixed, and let $n$ denote the size of the smallest input point set. First, we show that LCP is hard to approximate, even on the real line.

Theorem 7. For some $\epsilon > 0$, the LCP problem cannot be approximated within a factor of $n^{\epsilon}$ unless P=NP.

Proof. The reduction is similar to that of Thm. 1, but considerably simpler. Given a graph $G$ on $N$ vertices, we construct a collection of point sets $\{S_1, S_2, \ldots, S_{N+1}\}$. Let $L_1$ and $L_2$ be sufficiently large numbers such that $L_1 \gg N^2$ and $L_2 \gg N L_1$, e.g. $L_1 = 100N^2$ and $L_2 = 100 N L_1$. Corresponding to each vertex $v_j$, define two points $P_j$, $R_j$ on the real line by: $P_1 = 0$, $P_j = P_{j-1} + L_1 + j - 2$ for $1 < j \le N$, and $R_j = L_2 + P_j$. Let $S_{N+1} = \{P_1, \ldots, P_N\}$ and, for $1 \le i \le N$,

$$S_i = \{\, P_j \mid j = i \lor \{v_i, v_j\} \notin E \,\} \cup \{\, R_j \mid j \ne i \,\}.$$

Note that if $S_i$ is not transformed by $x \mapsto x$ or $x \mapsto x - L_2$, the number of elements of the intersection between $S_{N+1}$ and the transformed point set can be at most 2.
Thus, $G$ has an independent set of size $m$ if and only if there exists a common point set of size $m$, where we assume w.l.o.g. $m > 2$. Moreover, we can construct an independent set of size $m$ in polynomial time given a common point set of size $m$. Finally, note that $n$, the size of the smallest point set, is $N$.

While the LCP problem normally requires exact correspondence, the above result holds even if a small gap is allowed between corresponding points. In addition, we can easily obtain the same results for two variants. In one, the point sets are ordered and the order in sequences must be preserved. In the other, instead of point sets we are given graphs where each vertex is a point in $D$-dimensional space and each edge is a line segment connecting its endpoints.
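
The point sets of the reduction in Thm. 7 are easy to generate, which gives a quick sanity check of the correspondence with independent sets. The Python sketch below uses the example constants $L_1 = 100N^2$ and $L_2 = 100NL_1$ from the proof. The names `lcp_instance` and `is_common` and the example graph are assumptions for illustration. The check considers only the two translations singled out in the proof (phrased as moving the chosen points $P_j$ by $0$ or by $+L_2$ into $S_i$), not arbitrary congruences.

```python
def lcp_instance(n, edges):
    """Points P_j, the shift L_2, and the sets S_1..S_N for a graph on vertices 1..n."""
    l1 = 100 * n * n
    l2 = 100 * n * l1
    edge_set = {frozenset(e) for e in edges}
    p = {1: 0}
    for j in range(2, n + 1):
        p[j] = p[j - 1] + l1 + j - 2  # consecutive gaps are pairwise distinct
    r = {j: l2 + p[j] for j in p}
    sets = []
    for i in range(1, n + 1):
        s_i = {p[j] for j in p if j == i or frozenset((i, j)) not in edge_set}
        s_i |= {r[j] for j in r if j != i}
        sets.append(s_i)
    return p, l2, sets

def is_common(chosen, p, l2, sets):
    """True if {P_j : j in chosen} fits every S_i under x -> x or x -> x + L_2."""
    pts = {p[j] for j in chosen}
    return all(pts <= s or {x + l2 for x in pts} <= s for s in sets)

# Hypothetical example: path 1 - 2 - 3 plus an isolated vertex 4.
p, l2, sets = lcp_instance(4, [(1, 2), (2, 3)])
fits_independent = is_common({1, 3, 4}, p, l2, sets)
fits_adjacent = is_common({1, 2}, p, l2, sets)
```

Here the independent set $\{v_1, v_3, v_4\}$ yields a common point set, while the adjacent pair $\{v_1, v_2\}$ does not, in line with the equivalence stated in the proof.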

In spite of this hardness result, the LCP problem can be solved in polynomial time if the number of point sets $K$ is bounded. Consider all combinations of $D + 1$ elements from each set, which, if matching, uniquely define an isometric transformation of the point sets, and count the points that match. This runs in time polynomial in the size of the input, where the degree depends on $K$ and $D$. (See also [2].)

Theorem 8. LCP can be solved in polynomial time for any fixed value of $K$.

Note that LCP can be approximated within a factor of $O(n/\log n)$ by the general algorithm described in Section 3 even if $K$ is unbounded.

References

1. T. Akutsu. An RNC algorithm for finding a largest common subtree of two trees. IEICE Transactions on Information and Systems, vol. E75-D, no. 1, pp. 95-101, Jan. 1992.
2. T. Akutsu. On determining the congruity of point sets in higher dimensions. These proceedings.
3. S. Arora, C. Lund, R. Motwani, M. Sudan and M. Szegedy. Proof verification and hardness of approximation problems. Proc. 33rd IEEE Symp. on Foundations of Computer Science, pp. 14-23, Oct. 1992.
4. B. Berger and J. Rompel. A better performance guarantee for approximate graph coloring. Algorithmica, vol. 5, no. 4, pp. 459-466, 1990.
5. C. Branden and J. Tooze. Introduction to Protein Structure. Garland Publishing Inc., New York, 1991.
6. M. R. Garey and D. S. Johnson. Computers and Intractability. Freeman, New York, 1979.
7. M. M. Halldorsson. A still better performance guarantee for approximate graph coloring. Inform. Process. Lett., vol. 45, pp. 19-23, 25 January 1993.
8. R. Maier. The complexity of some problems on subsequences and supersequences. J. ACM, vol. 25, pp. 322-336, 1978.
9. C. Papadimitriou and M. Yannakakis. Optimization, approximation, and complexity classes. J. Computer and System Sciences, vol. 43, no. 3, pp. 425-440, Dec. 1991.
10. R. B. Russel and G. J. Barton. Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. PROTEINS: Structure, Function, and Genetics, vol. 14, pp. 309-323, 1992.
11. Y. Takahashi, Y. Satoh, H. Suzuki and S. Sasaki. Recognition of largest common structural fragment among a variety of chemical structures. Analytical Sciences, vol. 3, pp. 23-28, 1987.
12. G. Vriend and C. Sander. Detection of common three-dimensional substructures in proteins. PROTEINS: Structure, Function, and Genetics, vol. 11, pp. 52-58, 1991.
13. K. Zhang and T. Jiang. Some MAX SNP-hard results concerning unordered labeled trees. To appear in Information Processing Letters.