Statistically based postprocessing of phylogenetic analysis by clustering

BIOINFORMATICS Vol. 18 Suppl., Pages S285-S293

Cara Stockham 1, Li-San Wang 2 and Tandy Warnow 2

1 Texas Institute for Computational and Applied Mathematics, University of Texas, ACES 6.412, Austin, TX 78712, USA and 2 Department of Computer Sciences, University of Texas, Austin, TX 78712, USA

Received on January 24, 2002; revised and accepted on March 29, 2002

ABSTRACT
Motivation: Phylogenetic analyses often produce thousands of candidate trees. Biologists resolve the conflict by computing the consensus of these trees. Single-tree consensus methods used for postprocessing can be unsatisfactory due to their inherent limitations.
Results: In this paper we present an alternative approach that applies clustering algorithms to the set of candidate trees. We propose bicriterion problems, in particular using the concept of information loss, and new consensus trees called characteristic trees that minimize the information loss. Our empirical study using four biological datasets shows that our approach provides a significant improvement in information content, while adding only a small amount of complexity. Furthermore, the consensus trees we obtain for each of our large clusters are more resolved than the single-tree consensus trees. We also provide some initial progress on theoretical questions that arise in this context.
Availability: Software available upon request from the authors. The agglomerative clustering is implemented using Matlab (MathWorks, 2000) with the Statistics Toolbox. The Robinson-Foulds distance matrices and the strict consensus trees are computed using PAUP* (Swofford, 2001) and Daniel Huson's tree library on Intel Pentium workstations running Debian Linux.
Contact: lisan@cs.utexas.edu
Supplementary Information:
Keywords: consensus methods; clustering; phylogenetics; information theory; maximum parsimony.

INTRODUCTION
Phylogenetic analysis can be divided into three stages.
In the first stage, a researcher collects data (such as DNA sequences) for each of the different taxa (genes, species, etc.) under study. In the second phase, she applies a tree reconstruction method to the data. Many tree reconstruction methods produce more than one candidate tree for the input dataset. For example, the maximum parsimony (Swofford et al., 1996) method returns those binary trees with the lowest parsimony score. (The parsimony score of a tree is the minimum tree length, i.e., the sum of distances between the two endpoints across all edges, obtained by any way of labeling the internal nodes.) Very often the number of trees can be in the hundreds or thousands. In the last phase, a consensus tree of the candidate trees is computed so as to resolve the conflict, summarize the information, and reduce the overwhelming number of possible solutions to the evolutionary history. Many consensus tree methods are available, but a feature common to all of them is that they produce a single tree. This approach has several shortcomings, including loss of information and sensitivity to outliers. In this paper we present a different approach to postprocessing. The set of candidate trees is divided into several subsets using clustering methods. Each cluster is then characterized by its own consensus tree. We pose several theoretical optimization problems for these kinds of outputs, and present some initial progress on these problems; these are presented in the section on Clustering Criteria. The bulk of our paper is focused on an empirical study, which is presented in the Experiments section. We conclude our study and propose additional research problems in the Conclusions section.

BACKGROUND
Phylogenetic trees
A leaf-labeled tree topology can be decomposed into a set of bipartitions in the following manner.
Each edge, when deleted from the tree, induces a bipartition of the leaves; thus, we can identify each edge with its induced bipartition. Let t1 and t2 be two trees on the same leaf set, and let E(t1) and E(t2) denote their sets of internal edges. The quantity |E(t1) Δ E(t2)| = |(E(t1) \ E(t2)) ∪ (E(t2) \ E(t1))| is called the Robinson-Foulds (RF) distance (Robinson and Foulds, 1981) between the two trees.

© Oxford University Press 2002
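The edge-as-bipartition view makes the RF distance a symmetric-difference count. A minimal sketch in Python, assuming trees are already given as sets of bipartitions; the helper names are ours, not from the paper:

```python
def bipartition(side, taxa):
    # Canonical form for an unordered split {A | B}: always return the side
    # that does NOT contain the smallest taxon, so {A|B} and {B|A} compare equal.
    side = frozenset(side)
    other = taxa - side
    return side if min(taxa) in other else other

def rf_distance(edges1, edges2):
    # |E(t1) Δ E(t2)|: bipartitions present in one tree but not the other.
    return len(edges1 ^ edges2)

taxa = frozenset(range(1, 6))  # n = 5 taxa; a binary tree has n - 3 = 2 internal edges
t1 = {bipartition({1, 2}, taxa), bipartition({4, 5}, taxa)}  # ((1,2),3,(4,5))
t2 = {bipartition({1, 2}, taxa), bipartition({3, 5}, taxa)}  # ((1,2),4,(3,5))
print(rf_distance(t1, t2))  # 2: each tree has one internal edge the other lacks
```

Note that the distance is always even for two binary trees on the same leaf set, since every unmatched edge in one tree pairs with an unmatched edge in the other.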

A tree t′ refines another tree t if E(t) ⊆ E(t′). If a tree t has n leaves, then the number of internal edges of t is between 0 and n - 3. If t has n - 3 internal edges, t is a binary tree; we also say it is fully refined.

Consensus tree methods
Let T be a set of trees on the same taxa. The strict consensus tree is the tree whose edges are those present in all trees in T; thus, all trees in T refine the strict consensus of T. The strict consensus is a conservative hypothesis about the true phylogeny suggested by the set T. By using the strict consensus to summarize the result of a phylogenetic analysis, we lose a lot of information about the whole set of candidate trees, including how the trees are distributed in the space of all binary trees and how similar the trees are to each other. There are other types of consensus tree methods that produce a consensus tree whose leaf set is the whole set of taxa (McMorris and Steel, 1993; Adams, 1986; Nelson, 1979; Phillips and Warnow, 1996; Kannan et al., 1998), or one whose leaf set is a subset of the taxa (Steel and Warnow, 1993; Ganapathysaravanabavan and Warnow, 2001).

Notation
Let S = {1, 2, ..., n} denote the n taxa being studied. Let T_n denote the set of all (unrooted) binary trees with S as their leaf set; |T_n| = (2n - 5)!! = 1 · 3 · 5 ··· (2n - 5). Let T denote the set of input trees (e.g., the most parsimonious trees from a maximum parsimony analysis). C is a clustering of T ⊆ T_n if C is a partition of a subset of T. If that subset is all of T, C is said to cover T, i.e., C is covering; otherwise, C is noncovering. Each member C of C is a cluster. Let SC(C) denote the strict consensus of all trees in C. The bounding ball B(C) of a cluster C is defined by B(C) = {t ∈ T_n : t refines SC(C)}. We let B(C) = ∪_{C∈C} B(C). We use d(t, t′) to denote the Robinson-Foulds distance between two trees t and t′.

CRITERIA FOR CLUSTERING IN THE TREE SPACE
In this section we describe the criteria used for clustering phylogenetic trees.

Biologically based criteria
Parameters for clustering.
Given a cluster C, we define the following parameters of C:
1. diam(C) = max_{t,t′∈C} d(t, t′) is the diameter of C.
2. λ(C) = |E(SC(C))| / (n - 3) is the specificity of C; it is the normalized number of internal edges of the strict consensus of C.
3. ρ(C) = |C| / |B(C)| is the density of C.
Biologists are interested in the specificity; the higher it is, the more information is present. This value is related to the diameter, since it is easy to show that 1 - diam(C)/(2(n - 3)) ≤ λ(C). The density is also important, since it shows how many trees refining the strict consensus are optimal, i.e., in the input. Based on these parameters we can define the parameters of the whole clustering C. Let f(C) be a parameter value of cluster C. We have
1. M(C; f) = max_{C∈C} f(C) (the maximum value of f over all clusters).
2. m(C; f) = min_{C∈C} f(C) (the minimum value of f over all clusters).
3. W(C; f) = Σ_{C∈C} |C| f(C) / Σ_{C∈C} |C| (the weighted value of f over all clusters, weighting each cluster by the number of trees it contains).
4. k = |C| (the number of clusters).

Bicriterion problems. We might want to cluster T in order to maximize the minimum specificity, but by putting each tree into its own cluster, we trivially solve that problem. So it is more interesting to try to optimize one parameter with respect to another, e.g., maximize the minimum specificity for a fixed number of clusters. As we will show when we discuss statistically based criteria, bicriterion problems involving k, the number of clusters, are the most natural and interesting. Also note that when we refine a clustering by dividing some of the clusters, the diameter of each new cluster is smaller than or equal to the diameter of the original cluster, so the minimum, maximum, and weighted sum of diameters decrease. Similarly, the minimum, maximum, and weighted sums of specificities increase.

OBSERVATION 1. The minimum, maximum, and weighted sums of diameters and specificities are monotone with respect to refinements of clusterings.

Therefore, by dividing clusters, we improve certain parameter values.
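The diameter and specificity of a cluster are directly computable from the bipartition sets of its trees (the density is harder, since it needs |B(C)|). A small sketch, again on trees represented as sets of bipartitions, with illustrative names of our own:

```python
from itertools import combinations

rf = lambda a, b: len(a ^ b)  # RF distance on sets of bipartitions

def strict_consensus_edges(cluster):
    # E(SC(C)): the bipartitions present in every tree of the cluster.
    common = set(cluster[0])
    for t in cluster[1:]:
        common &= t
    return common

def diameter(cluster, d=rf):
    # diam(C): largest pairwise distance within the cluster.
    return max((d(a, b) for a, b in combinations(cluster, 2)), default=0)

def specificity(cluster, n):
    # lambda(C) = |E(SC(C))| / (n - 3); 1.0 means a fully resolved consensus.
    return len(strict_consensus_edges(cluster)) / (n - 3)

# Three trees on n = 5 taxa, each given as a set of bipartition labels
# (the labels stand in for actual splits).
cluster = [{"e1", "e2"}, {"e1", "e3"}, {"e1", "e2"}]
print(diameter(cluster))        # 2
print(specificity(cluster, 5))  # 0.5: only e1 survives in the strict consensus
```

For this cluster the stated bound holds with equality: 1 - diam/(2(n - 3)) = 1 - 2/4 = 0.5 = λ.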
As we will see in the Experiments section, the score for each clustering obtained by agglomerative clustering improves as the number of clusters increases.

Statistically based criteria
Biologists assume the true tree is among the trees obtained during a phylogenetic analysis; without any additional information, all trees are considered equally likely to be the true tree. Thus, the set of trees defines a probability distribution on tree space. Consider the original set T of m binary trees, each of them having the same probability of being the true tree. The corresponding distribution is

f(t) = 1/m if t ∈ T, and f(t) = 0 if t ∉ T.

Because the number of trees can be overwhelming, biologists replace them with their strict consensus tree, and the original trees are then ignored. Knowing only that the true tree refines this consensus tree, we have another probability distribution, with every binary tree that refines the consensus tree considered equally likely to be the true tree. Let b = |B(T)|. Then

g(t) = 1/b if t ∈ B, and g(t) = 0 if t ∉ B.

If C is a clustering of T containing more than one cluster, let B = ∪_{C∈C} B(C) be the union of the bounding balls, and let b = |B|. Then define the probability distribution g as above. Our objective, then, is to increase the number of consensus trees to a still tolerably small number so that the probability distribution defined by these trees is closer to that of the original output. We call this the uniform distribution, and use it to evaluate the information conveyed in a clustering C. Note that the distributions f and g agree if C is such that every tree in T is in its own cluster, meaning there is no information loss in C. Using the number of clusters to represent the complexity of a clustering, we can then define bicriterion problems called complexity versus information loss.

Information loss. We define the information loss as the distance between the distributions of two clusterings. Let f and g be the distributions of the original set of trees and of the clustering of the input, respectively. Some popular distances are
1. L∞ distance: L∞(f, g) = max_t |f(t) - g(t)|.
2. L1 distance: L1(f, g) = Σ_t |f(t) - g(t)|.
3. L2 distance: L2(f, g) = (Σ_t (f(t) - g(t))²)^(1/2).
4. Kullback-Leibler (KL) distance (Kullback, 1987): H(g|f) = Σ_t f(t) ln(f(t)/g(t)).
Note that if T = B(C) then the above distances are 0 for the uniform distribution. The KL distance is not symmetric. The technical difficulty of the KL distance approach is that there may be trees t such that f(t) ≠ 0 but g(t) = 0, so the ratio f(t)/g(t) is not finite.
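In the covering case (T ⊆ B, so b ≥ m), all four distances between the two uniform distributions reduce to closed forms in m = |T| and b = |B| alone. A minimal sketch; the function name is ours:

```python
from math import log, sqrt

def uniform_losses(m, b):
    # f is uniform over the m input trees T; g is uniform over the b trees of
    # the union of bounding balls B, assuming the covering case T ⊆ B (b >= m).
    assert b >= m
    kl = log(b / m)                               # sum over T of (1/m) ln((1/m)/(1/b))
    l1 = m * (1 / m - 1 / b) + (b - m) * (1 / b)  # = 2 (1 - m/b)
    l2 = sqrt(m * (1 / m - 1 / b) ** 2 + (b - m) * (1 / b) ** 2)
    linf = max(1 / m - 1 / b, 1 / b) if b > m else 0.0
    return kl, l1, l2, linf

print(uniform_losses(10, 10))  # (0.0, 0.0, 0.0, 0.0): no information loss when B = T
```

All four losses vanish at b = m and grow with b, which is why minimizing |B(C)| is the common objective; the two terms inside the L∞ maximum balance exactly at b = 2m.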
We avoid this difficulty by assuming C is covering (T ⊆ B(C)). On the other hand, for trees t ∈ B(C) \ T, where f(t) = 0 and g(t) ≠ 0, we set f(t) ln(f(t)/g(t)) = 0 (this is based on the observation that lim_{x→0+} x ln x = 0).

THEOREM 1. Among all clusterings C satisfying T ⊆ B(C), a clustering C that minimizes |B(C)| has minimal KL distance with respect to the uniform distribution.

PROOF. We note

H(g|f) = Σ_{t∈T} f(t) ln(f(t)/g(t)) + Σ_{t∈B\T} f(t) ln(f(t)/g(t)) = Σ_{t∈T} (1/m) ln((1/m)/(1/b)) + 0 = m · (1/m) · ln(b/m) = ln(b/m).

Since we assume b ≥ m, the distance is minimized (at 0) when b = m.

COROLLARY 1. The Kullback-Leibler distance with the uniform distribution is monotone with respect to refinements of clusterings.

This is due to the fact that the number of trees in the union of the bounding balls stays the same or decreases when clusters are divided. Our definition of information loss is similar to the information bottleneck introduced in Tishby et al. (1999). In Thorley et al. (1998) the information content of a single consensus tree is discussed.

Characteristic tree
In this section we look at the characteristic k-set problem: we want to find a set of k trees such that the induced distribution is closest to the original distribution. We define the problem and show that it can be solved in polynomial time for the case k = 1, using the uniform distribution and the distances defined in the previous section. The single-characteristic tree problem is as follows. Assume we intend to use a tree t to replace the whole set of trees T. Let B(t) be the set of binary trees that refine t. The uniform distribution induced by t is such that all trees in B(t) have the same probability, and all trees outside B(t) have zero probability. The information loss is as defined in the previous section. Note that we allow the case where there exist tree(s) from T that do not refine t, i.e., T may not be covered.

THEOREM 2. Let T be a set of binary trees with the same set of leaves {1, 2, ..., n}.
We use the uniform distribution in measuring the information loss.
1. The strict consensus of T is the characteristic tree of T with respect to the L1, L2, and KL distances.
2. The characteristic tree of T with respect to the L∞ distance can be computed in O(n|T|) time.

PROOF. From the discussion of the KL distance we see immediately that the strict consensus minimizes the KL distance (for the KL distance the characteristic tree must cover T). Similarly one can prove that the strict consensus optimizes the

L1 and L2 distances if every tree in T is in the cluster and we allow only one cluster. For the L∞ distance, let C be the cluster, B the bounding ball of C, and let m = |C| and b = |B|. Then

L∞(f, g) = max_t |f(t) - g(t)| = max{ 1(T \ B) · (1/m), 1(B \ T) · (1/b), 1(B ∩ T) · |1/m - 1/b| }.

Here the indicator 1(X) is defined as follows: 1(X) = 1 if X ≠ ∅, and 1(X) = 0 if X = ∅. If T ⊆ B (T is covered) then b ≥ m and L∞(f, g) = max{ 1(B \ T) · (1/b), |1/m - 1/b| } ≤ 1/m. Thus a noncovering clustering (T \ B ≠ ∅) does not optimize the L∞ distance with respect to the uniform distribution. Assume C is covering. If B = T, then L∞(f, g) = 0; otherwise, the L∞ distance is minimized when b = 2m. A simple algorithm that finds the optimal characteristic tree under the L∞ distance is as follows. First compute the strict consensus, SC(T), and the corresponding density. If the density is below 0.5 the problem is solved; otherwise, find an edge of SC(T) such that, when contracted, the new tree has the smallest possible number of refinements. For an edge (u, v), this can be determined from the degrees of u and v. Let n be the number of leaves in each tree in T. Computing the strict consensus of T takes O(nm) time (Day, 1985), and finding the edge with minimal increase in the number of refinements takes O(n) time.

EXPERIMENTS
Clustering algorithms
K-means clustering. We implement two variants of the K-means algorithm, a well-studied clustering method in data mining. In the first variant (KmVec), we use binary vectors to represent trees. Let x_t be the vector corresponding to tree t. Every entry (x_t)_i of x_t corresponds to a bipartition i induced by some internal edge in at least one tree in T; (x_t)_i = 1 if i ∈ E(t), and (x_t)_i = 0 otherwise. The mean of a cluster is the average of the binary vectors of the trees in the cluster; it does not necessarily represent a tree. The distance between vectors is the Euclidean distance.
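The KmVec encoding can be sketched as follows; this is our illustration of the representation described above, not the paper's Matlab implementation, and the split labels are stand-ins:

```python
def tree_vectors(trees):
    # KmVec encoding: one 0/1 coordinate per bipartition observed in any
    # input tree; (x_t)_i = 1 iff tree t contains bipartition i.
    edges = sorted(set().union(*trees))
    return [[1.0 if e in t else 0.0 for e in edges] for t in trees]

def cluster_mean(vectors):
    # Coordinate-wise average; in general it is not the encoding of any tree.
    k = len(vectors)
    return [sum(col) / k for col in zip(*vectors)]

def sq_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Two trees sharing one bipartition; the split labels are illustrative.
trees = [{"ab|cde", "de|abc"}, {"ab|cde", "ce|abd"}]
x = tree_vectors(trees)
print(sq_euclidean(x[0], x[1]))  # 2.0: equals the RF distance between the trees
print(cluster_mean(x))           # [1.0, 0.5, 0.5]
```

Note that the squared Euclidean distance between two such 0/1 vectors coincides with the RF distance of the underlying trees, which links this variant to the RF-based one.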
In the other variant (Kmeans), we use the strict consensus of each cluster as its mean, and the RF distance to assign trees to clusters. In both versions, the objective function is the sum of the squared distances between each tree and its closest mean.

Agglomerative clustering. We make no changes to the agglomerative clustering algorithm, another popular algorithm in data mining. The pairwise distance is again the RF distance. The similarity measures between clusters are as follows:
1. Agg0: minimum pairwise distance: merge the two clusters C1 and C2 that minimize min_{t1∈C1, t2∈C2} d(t1, t2).
2. Agg1: maximum pairwise distance: merge the two clusters C1 and C2 that minimize max_{t1∈C1, t2∈C2} d(t1, t2).
3. Agg2: average pairwise distance: merge the two clusters C1 and C2 that minimize (1/(|C1||C2|)) Σ_{t1∈C1, t2∈C2} d(t1, t2).
When using the first and second similarity measures, the algorithms are called single linkage and complete linkage, respectively.

Phylogenetic islands. The only method currently used by biologists for clustering phylogenetic trees is phylogenetic islands (PhyIsl) (Maddison, 1991). A tree topological operation (or move) is called a Tree Bisection and Reconnection (TBR) move (Swofford et al., 1996) if it does the following. An edge (u, v) is removed from tree T to create two unrooted subtrees T_u and T_v. We then connect the two subtrees to form a new tree T′ by inserting a new edge e = (u′, v′), where u′ is on an edge in T_u and v′ is on an edge in T_v (if either of the two subtrees is a single-node tree, the endpoint is the node itself). A phylogenetic island of a set of trees T is defined as follows. We create a graph G where each vertex corresponds to a tree in T, and there is an edge (i, j) between two trees i and j if they are one TBR move apart. Each connected component of G is a phylogenetic island (cluster). The significance of this particular clustering method lies in its relation to maximum parsimony and heuristic search.
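The three agglomerative variants above differ only in the between-cluster score. A minimal sketch of complete linkage (Agg1) driven by a precomputed RF distance matrix; the function and variable names are ours, and a real run would use the pairwise RF distances of the input trees:

```python
def complete_linkage(dist, k):
    # Agglomerative clustering with the maximum-pairwise-distance rule (Agg1):
    # start from singletons and repeatedly merge the pair of clusters whose
    # worst-case (maximum) inter-cluster distance is smallest, until k remain.
    clusters = [[i] for i in range(len(dist))]

    def score(a, b):
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > k:
        ai, bi = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda p: score(clusters[p[0]], clusters[p[1]]),
        )
        clusters[ai] += clusters.pop(bi)
    return clusters

# Toy RF distance matrix: trees 0 and 1 are close, trees 2 and 3 are close.
d = [[0, 1, 8, 9],
     [1, 0, 9, 8],
     [8, 9, 0, 1],
     [9, 8, 1, 0]]
print(complete_linkage(d, 2))  # [[0, 1], [2, 3]]
```

Swapping `max` for `min` or for the size-normalized sum in `score` gives Agg0 (single linkage) and Agg2 (average linkage), respectively.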
In PAUP* 4.0 (Swofford, 2001) the heuristic search implements hill-climbing in tree space by using TBR moves to modify a given tree topology to obtain new candidate trees.

Settings for the experiment. We use 2, 3, ... clusters for both agglomerative clustering and K-means clustering. We also use the strict consensus trees of the clusters produced by complete linkage agglomerative clustering as the initial means in the K-means algorithm (KmAgg). The motive is to avoid being trapped in a local optimum due to the random effect of choosing initial means.

Datasets
We obtained four datasets for our empirical study: Camp (Cosner et al., 2000; Moret et al., 2001a), Caesal (Weeks et al., 2001), PEVCCA1, and PEVCCA2. The Camp dataset is obtained using the GRAPPA (Moret et

Fig. 1. Results of the clustering experiment using the Caesal dataset. See the section on Experiment Settings for details.

al., 2001b) software to reconstruct the breakpoint phylogeny of the Campanulaceae family (see Moret et al. (2001b) for an explanation of the breakpoint phylogeny). The dataset contains 216 trees on 13 leaves. The strict consensus tree for this dataset is 60% resolved. The Caesal dataset is obtained by maximum parsimony searches of the trnL-trnF intron and spacer regions of the chloroplast genome from the Caesalpinia family. The dataset has 450 trees on 51 leaves. The strict consensus tree for this dataset is 77% resolved. The PEVCCA dataset, provided by Derrick Zwickl, is obtained by maximum parsimony searches of small subunit ribosomal RNA sequences (Van de Peer et al., 2000); the dataset consists of 5630 trees on 129 leaves divided into 78 phylogenetic islands. PEVCCA stands for Porifera (sea sponges), Echinodermata (sea urchins, sea cucumbers), Vertebrata (fish, reptiles, mammals), Cnidaria (jellyfish), Crustacea (crabs, lobsters, shrimp), and Annelida (segmented worms). The PEVCCA1 dataset contains 168 most parsimonious trees of PEVCCA (1 island). The strict consensus tree for this dataset is 77% resolved. The PEVCCA2 dataset includes the next best trees as well, for a total of 654 trees (5 islands). The strict consensus tree is 72% resolved.

Correlation of the parameters
We collected the output generated by all clustering methods and computed the correlation coefficient between KL and the other parameters.
Since KL is highly correlated (0.9 or above) with specificity, diameter, and density, this suggests KL is an informative criterion for the quality of a clustering. (The correlation coefficient indicates how two parameters relate to each other linearly in the sample. Its value is between -1 and 1; a value close to -1 or 1 indicates the two parameters are close to being linearly correlated. See any standard statistics textbook for more details.) See our website for further details.

Comparison of different algorithms
We compute the parameters in Table 1 for each clustering produced by the algorithms being tested. The results are in Figures 1, 2, and 3. We add minus signs in front of those parameters for which we prefer larger values; therefore a lower value in the y-direction is favored. See our website for the full set of figures.

Caesal dataset. Most of the time the Kmeans clustering has the worst performance of all the methods. With a

Fig. 2. Results of the clustering experiment using the PEVCCA1 dataset. See the section on Experiment Settings for details.

Table 1. Clustering parameters for the experiments
Linf: the L∞ distance.
L1: the L1 distance.
L2: the L2 distance.
KL: the Kullback-Leibler distance.
wtddiam: W(C; diam), the weighted sum of diameters.
wtdspec: W(C; λ), the weighted sum of specificities.
wtdden: W(C; ρ), the weighted sum of densities.
logwtdden: base-10 logarithm of wtdden.

large enough number of clusters (5 or above), the KmVec algorithm can have very good scores in the parameters other than L1, L2, Linf, and KL, but has suboptimal scores in these information-loss measures. The Agg0 algorithm (single linkage) is unsatisfactory in all parameters, and increasing the number of clusters provides little improvement. Agg1 (complete linkage) has better overall performance in all parameters than Agg2, which is better than KmVec in all information-loss measures (except, for some numbers of clusters, the Linf distance). The PhyIsl clustering is the same as the Agg1 two-cluster clustering. It is interesting to note that by increasing the number of clusters, Kmeans and KmVec can become worse. Since increasing the number of clusters in agglomerative clustering means refining the clustering by dividing some clusters, agglomerative clustering improves with respect to the monotone parameters; the K-means clusterings, however, generally do not have this refinement relationship as the number of clusters increases.

PEVCCA1 dataset. In this dataset and the next (PEVCCA2), the L1, L2, and Linf distances are uninformative; they always return values close to or equal to the maximum allowed value.
This is due to the relatively low density of each cluster, which causes many trees to be in B(C) but not in T and thus contribute to the distance. KL, however, is very informative. There is only one phylogenetic island. In this dataset, all clustering methods other than PhyIsl have similar performance in all parameters. When the number of clusters increases, all the parameters improve.

PEVCCA2 dataset. Kmeans is inferior in performance to Agg1 and Agg2, but better than Agg0. Agg1 and Agg2 have similar performance. When the number of clusters is

Fig. 3. Results of the clustering experiment using the PEVCCA2 dataset. See the section on Experiment Settings for details.

low, Agg2 has better scores than Agg1; when the number of clusters is high (5 or more), Agg1 and Agg2 have similar performance. KmVec can be as good as Agg1 and Agg2 until the number of clusters is 7 or more, where its performance becomes suboptimal. The performance of PhyIsl is very bad in all the parameters considered, compared to all the other methods.

Camp dataset. We applied Agg1 to this dataset. The dataset contains 216 trees out of the 315 refinements of the strict consensus, which means the density is high. When we try to cluster the dataset, the specificities of the consensus trees improve slightly, but the densities drop dramatically. This suggests that one cluster is sufficient for this dataset, and that agglomerative clustering, by revealing this fact, is robust.

Summary. Agg1 and Agg2 have the best overall performance. Both Kmeans and KmVec are unreliable, and Agg0 and PhyIsl tend to have worse performance.

Comparing clustering outputs to single-tree consensus
In this section we compare the outputs of clustering to the single-consensus approach. The comparison is done using Caesal, PEVCCA1, and PEVCCA2. In each dataset, we compare the output of Agg1 with the strict consensus tree of the whole dataset. The number of clusters is determined by finding the number where the improvement starts to diminish; we use 3 clusters for Caesal and PEVCCA1, and 5 clusters for PEVCCA2. The results are in Table 2.
In each of the datasets, the strict consensus tree of each cluster is much more resolved than the strict consensus of the whole dataset. The Caesal dataset has one large cluster (cluster 2), one medium cluster (cluster 1), and one small cluster (cluster 3). The small cluster is sparse; it has more refinements than the medium cluster and relatively few trees, suggesting it is a collection of outliers in the whole set of trees. Similarly, cluster 2 in the PEVCCA1 dataset and clusters 3 and 5 in the PEVCCA2 dataset are sparse clusters. We remove these sparse clusters from the datasets. The percentages of trees dropped from Caesal, PEVCCA1,

Table 2. Comparison of the clustering approach and the single-consensus approach. We use Agg1 with 3 clusters for Caesal and PEVCCA1, and 5 clusters for PEVCCA2. The numtrees and numref fields are the number of trees in the cluster and the number of refinements of the strict consensus of the cluster, respectively. The 1clu row in each dataset corresponds to the strict consensus of the whole set of trees.

and PEVCCA2 are 4%, 21.4%, and 14.4%, respectively. The specificities of the strict consensus trees of Caesal, PEVCCA1, and PEVCCA2 have increased to 85.4%, 81.7%, and 75.4%, respectively. This suggests the Caesal dataset is dominated by two major clusters (clusters 1 and 2) that are closer to each other than to cluster 3; the small increase in the specificities in PEVCCA1 and PEVCCA2 suggests the larger clusters are far away from each other.

CONCLUSIONS
In this paper we studied the clustering approach as a replacement for single-consensus postprocessing methods in phylogenetic analysis. We set up a framework for clustering in the space of phylogenetic trees and proposed several optimization problems. Of particular merit is our new approach, which we call the complexity versus information content problem. We also proposed the single-characteristic tree problem using the concept of information loss, and showed that it can be solved in polynomial time for all four distribution distances discussed. We then applied the most popular clustering algorithms used by computer scientists, as well as the phylogenetic island method used by biologists, to real datasets.
We showed that complete linkage agglomerative clustering outperforms the other methods we examined with respect to most of our optimization criteria. We also looked at the best clusterings for each dataset and compared the strict consensus trees of the clusters to the strict consensus trees of the whole datasets, demonstrating an improvement of the multi-tree consensus over the single strict consensus. Our new approach can be used to improve the degree of resolution of the output trees and to provide more detail about how the candidate trees are distributed. From our experimental study we recommend complete linkage agglomerative clustering with different numbers of clusters, and the use of the Kullback-Leibler distance as the quality measure. A future research direction is to use nonuniform distributions in the information loss for clusterings and characteristic trees; for example, we can give weights to trees according to their likelihood scores, and weights to clusters according to their densities or the number of trees they contain from the input set. Other open problems include developing algorithms that solve tree clustering problems optimally. One example is finding the optimal k-clustering (k > 1) for the Kullback-Leibler distance, or, as we have shown, equivalently finding the clustering that has the smallest set of refinements.

ACKNOWLEDGMENTS
This research is funded by the David and Lucile Packard Foundation and the National Science Foundation (EIA and DEB ). We thank Beryl Simpson for providing us with the Caesal dataset, Derrick Zwickl for providing us with the PEVCCA dataset, and Robert Jansen and Bernard Moret for providing us with the Camp dataset. Some of the implementations in this paper use Daniel Huson's tree library software. We thank Nina Amenta, Inderjit Dhillon, Jeff Klingner, Usman Roshan, and the participants of the DIMACS Bioconsensus II Workshop (2001) for their comments and suggestions.

REFERENCES
Adams,E.
(1986) N-trees as nestings: complexity, similarity and consensus. J. Classification, 3,

Cosner,M., Jansen,R., Moret,B., Raubeson,L., Wang,L.-S., Warnow,T. and Wyman,S. (2000) A new fast heuristic for computing the breakpoint phylogeny and a phylogenetic analysis of a group of highly rearranged chloroplast genomes. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB 2000). AAAI Press, pp.
Day,W. (1985) Optimal algorithms for comparing trees with labelled leaves. J. Classification, 2,
Ganapathysaravanabavan,G. and Warnow,T. (2001) Finding a maximum compatible tree for a bounded number of trees with bounded degree is solvable in polynomial time. In Proceedings of the 1st Workshop on Algorithms in BioInformatics (WABI 2001), Lecture Notes in Computer Science, 2149, pp.
Kannan,S., Warnow,T. and Yooseph,S. (1998) Computing the local consensus of trees. SIAM J. Comput., 27,
Kullback,S. (1987) The Kullback-Leibler distance. Am. Stat., 41, 340.
Maddison,D. (1991) The discovery and importance of multiple islands of most parsimonious trees. Syst. Zool., 40,
MathWorks (2000) Matlab 6.1. Natick, MA, USA.
McMorris,F. and Steel,M. (1993) The complexity of the median procedure for binary trees. In Proceedings of the International Federation of Classification Societies. Springer.
Moret,B., Wang,L.-S., Warnow,T. and Wyman,S. (2001a) New approaches for reconstructing phylogenies from gene order data. In Proceedings of the 9th International Conference on Intelligent Systems for Molecular Biology (ISMB 2001). AAAI Press, pp.
Moret,B., Wyman,S., Bader,D., Warnow,T. and Yan,M. (2001b) A new implementation and detailed study of breakpoint analysis. In Proceedings of the 6th Pacific Symposium on Biocomputing (PSB 2001), pp.
Nelson,G. (1979) Cladistic analysis and synthesis: principles and definitions, with a historical note on Adanson's Familles des Plantes ( ). Syst. Zool., 28,
Phillips,C. and Warnow,T.
(1996) The asymmetric median tree: a new model for building consensus trees. Disc. App. Math., 71, Robinson,D. and Foulds,L. (1981) Comparison of phylogenetic trees. Math. Biosci., 53, Steel,M. and Warnow,T. (1993) Kaikoura tree theorems: the maximum agreement subtree problem. Inf. Process. Lett., 48, Swofford,D. (01) PAUP* 4.0. Sinauer Associates. Swofford,D., Olson,G., Waddell,P. and Hillis,D. (1996) Phylogenetic inference. In Hillis,D., Moritz,C. and Mable,B. (eds), Molecular Systematics, Chapter 11, 2nd edn, Sinauer Associates, pp Thorley,J., Wilkinson,M. and Charleston,M. (1998) The information content of consensus trees. Advances in Data Science and Classification. Springer, pp Tishby,N., Pereira,F. and Bialek,W. (1999) The information bottleneck method. The 37th annual Allerton Conference on Communication, Control, and Computing. pp Van de Peer,Y., De Rijk,P., Wuyts,J., Winkelmans,T. and De Wachter,R. (00) The European small subunit ribosomal RNA database. Nucleic Acids Res., 28, Weeks,A., Larkin,L. and Simpson,B. (01) A chloroplast DNA molecular study of the phylogenetic relationships of members of the Caesalpinia group. Botany 01 Abstracts. Botanical Society of America, pp S293


Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Feature Selection CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Dimensionality reduction Feature selection vs. feature extraction Filter univariate

More information

Parallelizing SuperFine

Parallelizing SuperFine Parallelizing SuperFine Diogo Telmo Neves ESTGF - IPP and Universidade do Minho Portugal dtn@ices.utexas.edu Tandy Warnow Dept. of Computer Science The Univ. of Texas at Austin Austin, TX 78712 tandy@cs.utexas.edu

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Codon models. In reality we use codon model Amino acid substitution rates meet nucleotide models Codon(nucleotide triplet)

Codon models. In reality we use codon model Amino acid substitution rates meet nucleotide models Codon(nucleotide triplet) Phylogeny Codon models Last lecture: poor man s way of calculating dn/ds (Ka/Ks) Tabulate synonymous/non- synonymous substitutions Normalize by the possibilities Transform to genetic distance K JC or K

More information

Lecture: Bioinformatics

Lecture: Bioinformatics Lecture: Bioinformatics ENS Sacley, 2018 Some slides graciously provided by Daniel Huson & Celine Scornavacca Phylogenetic Trees - Motivation 2 / 31 2 / 31 Phylogenetic Trees - Motivation Motivation -

More information

A practical O(n log 2 n) time algorithm for computing the triplet distance on binary trees

A practical O(n log 2 n) time algorithm for computing the triplet distance on binary trees A practical O(n log 2 n) time algorithm for computing the triplet distance on binary trees Andreas Sand 1,2, Gerth Stølting Brodal 2,3, Rolf Fagerberg 4, Christian N. S. Pedersen 1,2 and Thomas Mailund

More information

Clustering. Lecture 6, 1/24/03 ECS289A

Clustering. Lecture 6, 1/24/03 ECS289A Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Rotation Distance is Fixed-Parameter Tractable

Rotation Distance is Fixed-Parameter Tractable Rotation Distance is Fixed-Parameter Tractable Sean Cleary Katherine St. John September 25, 2018 arxiv:0903.0197v1 [cs.ds] 2 Mar 2009 Abstract Rotation distance between trees measures the number of simple

More information

Distance-based Phylogenetic Methods Near a Polytomy

Distance-based Phylogenetic Methods Near a Polytomy Distance-based Phylogenetic Methods Near a Polytomy Ruth Davidson and Seth Sullivant NCSU UIUC May 21, 2014 2 Phylogenetic trees model the common evolutionary history of a group of species Leaves = extant

More information

Genome 559: Introduction to Statistical and Computational Genomics. Lecture15a Multiple Sequence Alignment Larry Ruzzo

Genome 559: Introduction to Statistical and Computational Genomics. Lecture15a Multiple Sequence Alignment Larry Ruzzo Genome 559: Introduction to Statistical and Computational Genomics Lecture15a Multiple Sequence Alignment Larry Ruzzo 1 Multiple Alignment: Motivations Common structure, function, or origin may be only

More information