Statistically based postprocessing of phylogenetic analysis by clustering

BIOINFORMATICS Vol. 18 Suppl., Pages S285-S293

Cara Stockham 1, Li-San Wang 2 and Tandy Warnow 2

1 Texas Institute for Computational and Applied Mathematics, University of Texas, ACES 6.412, Austin, TX 78712, USA and 2 Department of Computer Sciences, University of Texas, Austin, TX 78712, USA

Received on January 24, 2002; revised and accepted on March 29, 2002

ABSTRACT
Motivation: Phylogenetic analyses often produce thousands of candidate trees. Biologists resolve the conflict by computing the consensus of these trees. Single-tree consensus methods used for postprocessing can be unsatisfactory due to their inherent limitations.
Results: In this paper we present an alternative approach that applies clustering algorithms to the set of candidate trees. We propose bicriterion problems, in particular using the concept of information loss, and new consensus trees called characteristic trees that minimize the information loss. Our empirical study using four biological datasets shows that our approach provides a significant improvement in information content, while adding only a small amount of complexity. Furthermore, the consensus trees we obtain for each of our large clusters are more resolved than the single-tree consensus trees. We also provide some initial progress on theoretical questions that arise in this context.
Availability: Software available upon request from the authors. The agglomerative clustering is implemented using Matlab (MathWorks, 2000) with the Statistics Toolbox. The Robinson-Foulds distance matrices and the strict consensus trees are computed using PAUP* (Swofford, 2001) and Daniel Huson's tree library on Intel Pentium workstations running Debian Linux.
Contact: lisan@cs.utexas.edu
Supplementary Information:
Keywords: consensus methods; clustering; phylogenetics; information theory; maximum parsimony.

INTRODUCTION
Phylogenetic analysis can be divided into three stages.
In the first stage, a researcher collects data (such as DNA sequences) for each of the different taxa (genes, species, etc.) under study. In the second phase, she applies a tree reconstruction method to the data. Many tree reconstruction methods produce more than one candidate tree for the input dataset. For example, the maximum parsimony (Swofford et al., 1996) method returns those binary trees with the lowest parsimony score. (The parsimony score of a tree is the minimum tree length, i.e., the sum of distances between the two endpoints across all edges, obtained by any way of labeling the internal nodes.) Very often the number of trees can be in the hundreds or thousands. In the last phase, a consensus tree of the candidate trees is computed so as to resolve the conflict, summarize the information, and reduce the overwhelming number of possible solutions to the evolutionary history. Many consensus tree methods are available, but a feature common to all of them is that they produce a single tree. This approach has several shortcomings, including loss of information and sensitivity to outliers. In this paper we present a different approach to postprocessing. The set of candidate trees is divided into several subsets using clustering methods. Each cluster is then characterized by its own consensus tree. We pose several theoretical optimization problems for these kinds of outputs, and present some initial progress on these problems; these are presented in the section on Clustering Criteria. The bulk of our paper is focused on an empirical study, which is presented in the Experiments section. We conclude our study and propose additional research problems in the Conclusions section.

BACKGROUND
Phylogenetic trees
A leaf-labeled tree topology can be decomposed into a set of bipartitions in the following manner.
Each edge, when deleted from the tree, induces a bipartition of the leaves; thus, we can identify each edge with its induced bipartition. Let t1 and t2 be two trees on the same leaf set, and let E(t1) and E(t2) denote their sets of internal edges. The quantity |E(t1) Δ E(t2)| = |(E(t1) \ E(t2)) ∪ (E(t2) \ E(t1))| is called the Robinson-Foulds (RF) distance (Robinson and Foulds, 1981) between the two trees.

© Oxford University Press 2002
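The edge-as-bipartition view makes the RF distance a symmetric-difference count. A minimal sketch in Python, assuming trees are already given as sets of bipartitions; the helper names are ours, not from the paper:

```python
def bipartition(side, taxa):
    # Canonical form for an unordered split {A | B}: always return the side
    # that does NOT contain the smallest taxon, so {A|B} and {B|A} compare equal.
    side = frozenset(side)
    other = taxa - side
    return side if min(taxa) in other else other

def rf_distance(edges1, edges2):
    # |E(t1) Δ E(t2)|: bipartitions present in one tree but not the other.
    return len(edges1 ^ edges2)

taxa = frozenset(range(1, 6))  # n = 5 taxa; a binary tree has n - 3 = 2 internal edges
t1 = {bipartition({1, 2}, taxa), bipartition({4, 5}, taxa)}  # ((1,2),3,(4,5))
t2 = {bipartition({1, 2}, taxa), bipartition({3, 5}, taxa)}  # ((1,2),4,(3,5))
print(rf_distance(t1, t2))  # 2: each tree has one internal edge the other lacks
```

Note that the distance is always even for two binary trees on the same leaf set, since every unmatched edge in one tree pairs with an unmatched edge in the other.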

A tree t′ refines another tree t if E(t) ⊆ E(t′). If a tree t has n leaves, then the number of internal edges of t is between 0 and n - 3. If t has n - 3 internal edges, t is a binary tree; we also say it is fully refined.

Consensus tree methods
Let T be a set of trees on the same taxa. The strict consensus tree is the tree whose edges are those present in all trees in T; thus, all trees in T refine the strict consensus of T. The strict consensus is a conservative hypothesis about the true phylogeny suggested by the set T. By using the strict consensus to summarize the result of a phylogenetic analysis, we lose a lot of information about the whole set of candidate trees, including how the trees are distributed in the space of all binary trees and how similar the trees are to each other. There are other types of consensus tree methods that produce a consensus tree whose leaf set is the whole set of taxa (McMorris and Steel, 1993; Adams, 1986; Nelson, 1979; Phillips and Warnow, 1996; Kannan et al., 1998), or one whose leaf set is a subset of the taxa (Steel and Warnow, 1993; Ganapathysaravanabavan and Warnow, 2001).

Notation
Let S = {1, 2, ..., n} denote the n taxa being studied. Let T_n denote the set of all (unrooted) binary trees with S as their leaf set; |T_n| = (2n - 5)!! = 1 · 3 · 5 ··· (2n - 5). Let T denote the set of input trees (e.g., the most parsimonious trees from a maximum parsimony analysis). C is a clustering of T ⊆ T_n if C is a partition of a subset of T. If that subset is all of T, C is said to cover T, i.e., C is covering; otherwise, C is noncovering. Each member C of C is a cluster. Let SC(C) denote the strict consensus of all trees in C. The bounding ball B(C) of a cluster C is defined by B(C) = {t ∈ T_n : t refines SC(C)}. We let B(C) = ∪_{C∈C} B(C). We use d(t, t′) to denote the Robinson-Foulds distance between two trees t and t′.

CRITERIA FOR CLUSTERING IN THE TREE SPACE
In this section we describe the criteria used for clustering phylogenetic trees.

Biologically based criteria
Parameters for clustering.
Given a cluster C, we define the following parameters of C:
1. diam(C) = max_{t,t′∈C} d(t, t′) is the diameter of C.
2. λ(C) = |E(SC(C))| / (n - 3) is the specificity of C; it is the normalized number of internal edges of the strict consensus of C.
3. ρ(C) = |C| / |B(C)| is the density of C.
Biologists are interested in the specificity; the higher it is, the more information is present. This value is related to the diameter, since it is easy to show that 1 - diam(C)/(2(n - 3)) ≤ λ(C). The density is also important, since it shows how many trees refining the strict consensus are optimal, i.e., in the input. Based on these parameters we can define the parameters of the whole clustering C. Let f(C) be a parameter value of cluster C. We have
1. M(C; f) = max_{C∈C} f(C) (the maximum value of f over all clusters).
2. m(C; f) = min_{C∈C} f(C) (the minimum value of f over all clusters).
3. W(C; f) = Σ_{C∈C} |C| f(C) / Σ_{C∈C} |C| (the weighted value of f over all clusters, weighting each cluster by the number of trees it contains).
4. k = |C| (the number of clusters).

Bicriterion problems. We might want to cluster T in order to maximize the minimum specificity, but by putting each tree into its own cluster, we trivially solve that problem. So it is more interesting to try to optimize one parameter with respect to another, e.g., maximize the minimum specificity for a fixed number of clusters. As we will show when we discuss statistically based criteria, bicriterion problems involving k, the number of clusters, are the most natural and interesting. Also note that when we refine a clustering by dividing some of the clusters, the diameter of each new cluster is smaller than or equal to the diameter of the original cluster, so the minimum, maximum, and weighted sum of diameters decrease. Similarly, the minimum, maximum, and weighted sums of specificities increase.

OBSERVATION 1. The minimum, maximum, and weighted sums of diameters and specificities are monotone with respect to refinements of clusterings.

Therefore, by dividing clusters, we improve certain parameter values.
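The diameter and specificity of a cluster are directly computable from the bipartition sets of its trees (the density is harder, since it needs |B(C)|). A small sketch, again on trees represented as sets of bipartitions, with illustrative names of our own:

```python
from itertools import combinations

rf = lambda a, b: len(a ^ b)  # RF distance on sets of bipartitions

def strict_consensus_edges(cluster):
    # E(SC(C)): the bipartitions present in every tree of the cluster.
    common = set(cluster[0])
    for t in cluster[1:]:
        common &= t
    return common

def diameter(cluster, d=rf):
    # diam(C): largest pairwise distance within the cluster.
    return max((d(a, b) for a, b in combinations(cluster, 2)), default=0)

def specificity(cluster, n):
    # lambda(C) = |E(SC(C))| / (n - 3); 1.0 means a fully resolved consensus.
    return len(strict_consensus_edges(cluster)) / (n - 3)

# Three trees on n = 5 taxa, each given as a set of bipartition labels
# (the labels stand in for actual splits).
cluster = [{"e1", "e2"}, {"e1", "e3"}, {"e1", "e2"}]
print(diameter(cluster))        # 2
print(specificity(cluster, 5))  # 0.5: only e1 survives in the strict consensus
```

For this cluster the stated bound holds with equality: 1 - diam/(2(n - 3)) = 1 - 2/4 = 0.5 = λ.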
As we will see in the Experiments section, the score for each clustering obtained by agglomerative clustering improves as the number of clusters increases.

Statistically based criteria
Biologists assume the true tree is among the trees obtained during a phylogenetic analysis; without any additional information, all trees are considered equally likely to be the true tree. Thus, the set of trees defines a probability distribution on tree space. Consider the original set T of m binary trees, each of them having the same probability of being the true tree. The corresponding distribution is

f(t) = 1/m if t ∈ T, and f(t) = 0 if t ∉ T.

Because the number of trees can be overwhelming, biologists replace them with their strict consensus tree, and the original trees are then ignored. Knowing only that the true tree refines this consensus tree, we have another probability distribution, with every binary tree that refines the consensus tree considered equally likely to be the true tree. Let b = |B(T)|. Then

g(t) = 1/b if t ∈ B, and g(t) = 0 if t ∉ B.

If C is a clustering of T containing more than one cluster, let B = ∪_{C∈C} B(C) be the union of the bounding balls, and let b = |B|. Then define the probability distribution g as above. Our objective, then, is to increase the number of consensus trees to a still tolerably small number so that the probability distribution defined by these trees is closer to that of the original output. We call this the uniform distribution, and use it to evaluate the information conveyed in a clustering C. Note that the distributions f and g agree if C is such that every tree in T is in its own cluster, meaning there is no information loss in C. Using the number of clusters to represent the complexity of a clustering, we can then define bicriterion problems called complexity versus information loss.

Information loss. We define the information loss as the distance between the distributions of two clusterings. Let f and g be the distributions of the original set of trees and of the clustering of the input, respectively. Some popular distances are
1. L∞ distance: L∞(f, g) = max_t |f(t) - g(t)|.
2. L1 distance: L1(f, g) = Σ_t |f(t) - g(t)|.
3. L2 distance: L2(f, g) = (Σ_t (f(t) - g(t))²)^(1/2).
4. Kullback-Leibler (KL) distance (Kullback, 1987): H(g|f) = Σ_t f(t) ln(f(t)/g(t)).
Note that if T = B(C) then the above distances are 0 for the uniform distribution. The KL distance is not symmetric. The technical difficulty of the KL distance approach is that there may be trees t such that f(t) ≠ 0 but g(t) = 0, so the ratio f(t)/g(t) is not finite.
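In the covering case (T ⊆ B, so b ≥ m), all four distances between the two uniform distributions reduce to closed forms in m = |T| and b = |B| alone. A minimal sketch; the function name is ours:

```python
from math import log, sqrt

def uniform_losses(m, b):
    # f is uniform over the m input trees T; g is uniform over the b trees of
    # the union of bounding balls B, assuming the covering case T ⊆ B (b >= m).
    assert b >= m
    kl = log(b / m)                               # sum over T of (1/m) ln((1/m)/(1/b))
    l1 = m * (1 / m - 1 / b) + (b - m) * (1 / b)  # = 2 (1 - m/b)
    l2 = sqrt(m * (1 / m - 1 / b) ** 2 + (b - m) * (1 / b) ** 2)
    linf = max(1 / m - 1 / b, 1 / b) if b > m else 0.0
    return kl, l1, l2, linf

print(uniform_losses(10, 10))  # (0.0, 0.0, 0.0, 0.0): no information loss when B = T
```

All four losses vanish at b = m and grow with b, which is why minimizing |B(C)| is the common objective; the two terms inside the L∞ maximum balance exactly at b = 2m.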
We avoid this difficulty by assuming C is covering (T ⊆ B(C)). On the other hand, for trees t ∈ B(C) \ T, where f(t) = 0 and g(t) ≠ 0, we set f(t) ln(f(t)/g(t)) = 0 (this is based on the observation that lim_{x→0+} x ln x = 0).

THEOREM 1. Among all clusterings C satisfying T ⊆ B(C), a clustering C that minimizes |B(C)| has minimal KL distance with respect to the uniform distribution.

PROOF. We note

H(g|f) = Σ_{t∈T} f(t) ln(f(t)/g(t)) + Σ_{t∈B\T} f(t) ln(f(t)/g(t)) = Σ_{t∈T} (1/m) ln((1/m)/(1/b)) + 0 = m · (1/m) · ln(b/m) = ln(b/m).

Since we assume b ≥ m, the distance is minimized (at 0) when b = m.

COROLLARY 1. The Kullback-Leibler distance with the uniform distribution is monotone with respect to refinements of clusterings.

This is due to the fact that the number of trees in the union of the bounding balls stays the same or decreases when clusters are divided. Our definition of information loss is similar to the information bottleneck introduced in Tishby et al. (1999). In Thorley et al. (1998) the information content of a single consensus tree is discussed.

Characteristic tree
In this section we look at the characteristic k-set problem: we want to find a set of k trees such that the induced distribution is closest to the original distribution. We define the problem and show that it can be solved in polynomial time for the case k = 1, using the uniform distribution and the distances defined in the previous section. The single-characteristic tree problem is as follows. Assume we intend to use a tree t to replace the whole set of trees T. Let B(t) be the set of binary trees that refine t. The uniform distribution induced by t is such that all trees in B(t) have the same probability, and all trees outside B(t) have zero probability. The information loss is as defined in the previous section. Note that we allow the case where there exist tree(s) from T that do not refine t, i.e., T may not be covered.

THEOREM 2. Let T be a set of binary trees with the same set of leaves {1, 2, ..., n}.
We use the uniform distribution in measuring the information loss.
1. The strict consensus of T is the characteristic tree of T with respect to the L1, L2, and KL distances.
2. The characteristic tree of T with respect to the L∞ distance can be computed in O(n|T|) time.

PROOF. From the discussion of the KL distance we see immediately that the strict consensus minimizes the KL distance (for the KL distance the characteristic tree must cover T). Similarly one can prove that the strict consensus optimizes the

L1 and L2 distances if every tree in T is in the cluster and we allow only one cluster. For the L∞ distance, let C be the cluster, B the bounding ball of C, and let m = |C| and b = |B|. Then

L∞(f, g) = max_t |f(t) - g(t)| = max{ 1(T \ B) · (1/m), 1(B \ T) · (1/b), 1(B ∩ T) · |1/m - 1/b| }.

Here the indicator 1(X) is defined as follows: 1(X) = 1 if X ≠ ∅, and 1(X) = 0 if X = ∅. If T ⊆ B (T is covered) then b ≥ m and L∞(f, g) = max{ 1(B \ T) · (1/b), |1/m - 1/b| } ≤ 1/m. Thus a noncovering clustering (T \ B ≠ ∅) does not optimize the L∞ distance with respect to the uniform distribution. Assume C is covering. If B = T, then L∞(f, g) = 0; otherwise, the L∞ distance is minimized when b = 2m. A simple algorithm that finds the optimal characteristic tree under the L∞ distance is as follows. First compute the strict consensus, SC(T), and the corresponding density. If the density is below 0.5 the problem is solved; otherwise, find an edge of SC(T) such that, when contracted, the new tree has the smallest possible number of refinements. For an edge (u, v), this can be determined from the degrees of u and v. Let n be the number of leaves in each tree in T. Computing the strict consensus of T takes O(nm) time (Day, 1985), and finding the edge with minimal increase in the number of refinements takes O(n) time.

EXPERIMENTS
Clustering algorithms
K-means clustering. We implement two variants of the K-means algorithm, a well-studied clustering method in data mining. In the first variant (KmVec), we use binary vectors to represent trees. Let x_t be the vector corresponding to tree t. Every entry (x_t)_i of x_t corresponds to a bipartition i induced by some internal edge in at least one tree in T; (x_t)_i = 1 if i ∈ E(t), and (x_t)_i = 0 otherwise. The mean of a cluster is the average of the binary vectors of the trees in the cluster; it does not necessarily represent a tree. The distance between vectors is the Euclidean distance.
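The KmVec encoding can be sketched as follows; this is our illustration of the representation described above, not the paper's Matlab implementation, and the split labels are stand-ins:

```python
def tree_vectors(trees):
    # KmVec encoding: one 0/1 coordinate per bipartition observed in any
    # input tree; (x_t)_i = 1 iff tree t contains bipartition i.
    edges = sorted(set().union(*trees))
    return [[1.0 if e in t else 0.0 for e in edges] for t in trees]

def cluster_mean(vectors):
    # Coordinate-wise average; in general it is not the encoding of any tree.
    k = len(vectors)
    return [sum(col) / k for col in zip(*vectors)]

def sq_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Two trees sharing one bipartition; the split labels are illustrative.
trees = [{"ab|cde", "de|abc"}, {"ab|cde", "ce|abd"}]
x = tree_vectors(trees)
print(sq_euclidean(x[0], x[1]))  # 2.0: equals the RF distance between the trees
print(cluster_mean(x))           # [1.0, 0.5, 0.5]
```

Note that the squared Euclidean distance between two such 0/1 vectors coincides with the RF distance of the underlying trees, which links this variant to the RF-based one.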
In the other variant (Kmeans), we use the strict consensus of each cluster as its mean, and the RF distance to assign trees to clusters. In both versions, the objective function is the sum of the squared distances between each tree and its closest mean.

Agglomerative clustering. We make no changes to the agglomerative clustering algorithm, another popular algorithm in data mining. The pairwise distance is again the RF distance. The similarity measures between clusters are as follows:
1. Agg0: minimum pairwise distance: merge the two clusters C1 and C2 that minimize min_{t1∈C1, t2∈C2} d(t1, t2).
2. Agg1: maximum pairwise distance: merge the two clusters C1 and C2 that minimize max_{t1∈C1, t2∈C2} d(t1, t2).
3. Agg2: average pairwise distance: merge the two clusters C1 and C2 that minimize (1/(|C1||C2|)) Σ_{t1∈C1, t2∈C2} d(t1, t2).
When using the first and second similarity measures, the algorithms are called single linkage and complete linkage, respectively.

Phylogenetic islands. The only method currently used by biologists for clustering phylogenetic trees is phylogenetic islands (PhyIsl) (Maddison, 1991). A tree topological operation (or move) is called a Tree Bisection and Reconnection (TBR) move (Swofford et al., 1996) if it does the following. An edge (u, v) is removed from tree T to create two unrooted subtrees T_u and T_v. We then connect the two subtrees to form a new tree T′ by inserting a new edge e = (u′, v′), where u′ is on an edge in T_u and v′ is on an edge in T_v (if either of the two subtrees is a single-node tree, the endpoint is the node itself). A phylogenetic island of a set of trees T is defined as follows. We create a graph G where each vertex corresponds to a tree in T, and there is an edge (i, j) between two trees i and j if they are one TBR move apart. Each connected component of G is a phylogenetic island (cluster). The significance of this particular clustering method lies in its relation to maximum parsimony and heuristic search.
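The three agglomerative variants above differ only in the between-cluster score. A minimal sketch of complete linkage (Agg1) driven by a precomputed RF distance matrix; the function and variable names are ours, and a real run would use the pairwise RF distances of the input trees:

```python
def complete_linkage(dist, k):
    # Agglomerative clustering with the maximum-pairwise-distance rule (Agg1):
    # start from singletons and repeatedly merge the pair of clusters whose
    # worst-case (maximum) inter-cluster distance is smallest, until k remain.
    clusters = [[i] for i in range(len(dist))]

    def score(a, b):
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > k:
        ai, bi = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda p: score(clusters[p[0]], clusters[p[1]]),
        )
        clusters[ai] += clusters.pop(bi)
    return clusters

# Toy RF distance matrix: trees 0 and 1 are close, trees 2 and 3 are close.
d = [[0, 1, 8, 9],
     [1, 0, 9, 8],
     [8, 9, 0, 1],
     [9, 8, 1, 0]]
print(complete_linkage(d, 2))  # [[0, 1], [2, 3]]
```

Swapping `max` for `min` or for the size-normalized sum in `score` gives Agg0 (single linkage) and Agg2 (average linkage), respectively.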
In PAUP* 4.0 (Swofford, 2001) the heuristic search implements hill-climbing in tree space by using TBR moves to modify a given tree topology to obtain new candidate trees.

Settings for the experiment. We use 2, 3, ... clusters for both agglomerative clustering and K-means clustering. We also use the strict consensus trees of the clusters produced by complete linkage agglomerative clustering as the initial means in the K-means algorithm (KmAgg). The motive is to avoid being trapped in a local optimum due to the random effect of choosing initial means.

Datasets
We obtained four datasets for our empirical study: Camp (Cosner et al., 2000; Moret et al., 2001a), Caesal (Weeks et al., 2001), PEVCCA1, and PEVCCA2. The Camp dataset is obtained using the GRAPPA (Moret et

Fig. 1. Results of the clustering experiment using the Caesal dataset. See the section on Experiment Settings for details.

al., 2001b) software to reconstruct the breakpoint phylogeny of the Campanulaceae family (see Moret et al. (2001b) for an explanation of the breakpoint phylogeny). The dataset contains 216 trees on 13 leaves. The strict consensus tree for this dataset is 60% resolved. The Caesal dataset is obtained by maximum parsimony searches of the trnL-trnF intron and spacer regions of the chloroplast genome from the Caesalpinia family. The dataset has 450 trees on 51 leaves. The strict consensus tree for this dataset is 77% resolved. The PEVCCA dataset, provided by Derrick Zwickl, is obtained by maximum parsimony searches of small subunit ribosomal RNA sequences (Van de Peer et al., 2000); the dataset consists of 5630 trees on 129 leaves divided into 78 phylogenetic islands. PEVCCA stands for Porifera (sea sponges), Echinodermata (sea urchins, sea cucumbers), Vertebrata (fish, reptiles, mammals), Cnidaria (jellyfish), Crustacea (crabs, lobsters, shrimp), and Annelida (segmented worms). The PEVCCA1 dataset contains 168 most parsimonious trees of PEVCCA (1 island). The strict consensus tree for this dataset is 77% resolved. The PEVCCA2 dataset includes the next best trees as well, for a total of 654 trees (5 islands). The strict consensus tree is 72% resolved.

Correlation of the parameters
We collected the output generated by all clustering methods and computed the correlation coefficient between KL and the other parameters.
Since KL is highly correlated (0.9 or above) with specificity, diameter, and density, this suggests KL is an informative criterion for the quality of a clustering. (The correlation coefficient indicates how two parameters relate to each other linearly in the sample. Its value is between -1 and 1; a value close to -1 or 1 indicates the two parameters are close to being linearly correlated. See any standard statistics textbook for more details.) See our website for further details.

Comparison of different algorithms
We compute the parameters in Table 1 for each clustering produced by the algorithms being tested. The results are in Figures 1, 2, and 3. We add minus signs in front of those parameters for which we prefer larger values; therefore a lower value in the y-direction is favored. See our website for the full set of figures.

Caesal dataset. Most of the time the Kmeans clustering has the worst performance of all the methods. With a

Fig. 2. Results of the clustering experiment using the PEVCCA1 dataset. See the section on Experiment Settings for details.

Table 1. Clustering parameters for the experiments
Linf: the L∞ distance.
L1: the L1 distance.
L2: the L2 distance.
KL: the Kullback-Leibler distance.
wtddiam: W(C; diam), the weighted sum of diameters.
wtdspec: W(C; λ), the weighted sum of specificities.
wtdden: W(C; ρ), the weighted sum of densities.
logwtdden: base-10 logarithm of wtdden.

large enough number of clusters (5 or above), the KmVec algorithm can have very good scores in the parameters other than L1, L2, Linf, and KL, but has suboptimal scores in these information-loss measures. The Agg0 algorithm (single linkage) is unsatisfactory in all parameters, and increasing the number of clusters provides little improvement. Agg1 (complete linkage) has better overall performance in all parameters than Agg2, which is better than KmVec in all information-loss measures (except, for some numbers of clusters, the Linf distance). The PhyIsl clustering is the same as the Agg1 two-cluster clustering. It is interesting to note that by increasing the number of clusters, Kmeans and KmVec can become worse. Since increasing the number of clusters in agglomerative clustering means refining the clustering by dividing some clusters, agglomerative clustering improves with respect to the monotone parameters; the K-means clusterings, however, generally do not have this refinement relationship as the number of clusters increases.

PEVCCA1 dataset. In this dataset and the next (PEVCCA2), the L1, L2, and Linf distances are uninformative; they always return values close to or equal to the maximum allowed value.
This is due to the relatively low density of each cluster, which causes many trees to be in B(C) but not in T and thus contribute to the distance. KL, however, is very informative. There is only one phylogenetic island. In this dataset, all clustering methods other than PhyIsl have similar performance in all parameters. When the number of clusters increases, all the parameters improve.

PEVCCA2 dataset. Kmeans is inferior in performance to Agg1 and Agg2, but better than Agg0. Agg1 and Agg2 have similar performance. When the number of clusters is

Fig. 3. Results of the clustering experiment using the PEVCCA2 dataset. See the section on Experiment Settings for details.

low, Agg2 has better scores than Agg1; when the number of clusters is high (5 or more), Agg1 and Agg2 have similar performance. KmVec can be as good as Agg1 and Agg2 until the number of clusters is 7 or more, where its performance becomes suboptimal. The performance of PhyIsl is very bad in all the parameters considered, compared to all the other methods.

Camp dataset. We applied Agg1 to this dataset. The dataset contains 216 trees out of the 315 refinements of the strict consensus, which means the density is high. When we try to cluster the dataset, the specificities of the consensus trees improve slightly, but the densities drop dramatically. This suggests that one cluster is sufficient for this dataset, and that agglomerative clustering, by revealing this fact, is robust.

Summary. Agg1 and Agg2 have the best overall performance. Both Kmeans and KmVec are unreliable, and Agg0 and PhyIsl tend to have worse performance.

Comparing clustering outputs to single-tree consensus
In this section we compare the outputs of clustering to the single-consensus approach. The comparison is done using Caesal, PEVCCA1, and PEVCCA2. In each dataset, we compare the output of Agg1 with the strict consensus tree of the whole dataset. The number of clusters is determined by finding the number where the improvement starts to diminish; we use 3 clusters for Caesal and PEVCCA1, and 5 clusters for PEVCCA2. The results are in Table 2.
In each of the datasets, the strict consensus tree of each cluster is much more resolved than the strict consensus of the whole dataset. The Caesal dataset has one large cluster (cluster 2), one medium cluster (cluster 1), and one small cluster (cluster 3). The small cluster is sparse; it has more refinements than the medium cluster and relatively few trees, suggesting it is a collection of outliers in the whole set of trees. Similarly, cluster 2 in the PEVCCA1 dataset and clusters 3 and 5 in the PEVCCA2 dataset are sparse clusters. We remove these sparse clusters from the datasets. The percentages of trees dropped from Caesal, PEVCCA1,

Table 2. Comparison of the clustering approach and the single-consensus approach. We use Agg1 with 3 clusters for Caesal and PEVCCA1, and 5 clusters for PEVCCA2. The numtrees and numref fields are the number of trees in the cluster and the number of refinements of the strict consensus of the cluster, respectively. The 1clu row in each dataset corresponds to the strict consensus of the whole set of trees.

and PEVCCA2 are 4%, 21.4%, and 14.4%, respectively. The specificities of the strict consensus trees of Caesal, PEVCCA1, and PEVCCA2 have increased to 85.4%, 81.7%, and 75.4%, respectively. This suggests the Caesal dataset is dominated by two major clusters (clusters 1 and 2) that are closer to each other than to cluster 3; the small increase in the specificities in PEVCCA1 and PEVCCA2 suggests the larger clusters are far away from each other.

CONCLUSIONS
In this paper we studied the clustering approach as a replacement for single-consensus postprocessing methods in phylogenetic analysis. We set up a framework for clustering in the space of phylogenetic trees and proposed several optimization problems. Of particular merit is our new approach, which we call the complexity versus information content problem. We also proposed the single-characteristic tree problem using the concept of information loss, and showed that it can be solved in polynomial time for all four distribution distances discussed. We then applied the most popular clustering algorithms used by computer scientists, as well as the phylogenetic island method used by biologists, to real datasets.
We showed that complete linkage agglomerative clustering outperforms the other methods we examined with respect to most of our optimization criteria. We also looked at the best clusterings for each dataset and compared the strict consensus trees of the clusters to the strict consensus trees of the whole datasets, demonstrating an improvement of the multi-tree consensus over the single strict consensus. Our new approach can be used to improve the degree of resolution of the output trees and to provide more detail about how the candidate trees are distributed. From our experimental study we recommend complete linkage agglomerative clustering with different numbers of clusters, and the use of the Kullback-Leibler distance as the quality measure. A future research direction is to use nonuniform distributions in the information loss for clusterings and characteristic trees; for example, we can give weights to trees according to their likelihood scores, and weights to clusters according to their densities or the number of trees they contain from the input set. Other open problems include developing algorithms that solve tree clustering problems optimally. One example is finding the optimal k-clustering (k > 1) for the Kullback-Leibler distance, or, as we have shown, equivalently finding the clustering that has the smallest set of refinements.

ACKNOWLEDGMENTS
This research is funded by the David and Lucile Packard Foundation and the National Science Foundation (EIA and DEB ). We thank Beryl Simpson for providing us with the Caesal dataset, Derrick Zwickl for providing us with the PEVCCA dataset, and Robert Jansen and Bernard Moret for providing us with the Camp dataset. Some of the implementations in this paper use Daniel Huson's tree library software. We thank Nina Amenta, Inderjit Dhillon, Jeff Klingner, Usman Roshan, and the participants of the DIMACS Bioconsensus II Workshop (2001) for their comments and suggestions.

REFERENCES
Adams,E.
(1986) N-trees as nestings: complexity, similarity and consensus. J. Classification, 3,

Cosner,M., Jansen,R., Moret,B., Raubeson,L., Wang,L.-S., Warnow,T. and Wyman,S. (2000) A new fast heuristic for computing the breakpoint phylogeny and a phylogenetic analysis of a group of highly rearranged chloroplast genomes. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB 2000). AAAI Press, pp.
Day,W. (1985) Optimal algorithms for comparing trees with labelled leaves. J. Classification, 2,
Ganapathysaravanabavan,G. and Warnow,T. (2001) Finding a maximum compatible tree for a bounded number of trees with bounded degree is solvable in polynomial time. In Proceedings of the 1st Workshop on Algorithms in BioInformatics (WABI 2001), Lecture Notes in Computer Science, 2149, pp.
Kannan,S., Warnow,T. and Yooseph,S. (1998) Computing the local consensus of trees. SIAM J. Comput., 27,
Kullback,S. (1987) The Kullback-Leibler distance. Am. Stat., 41, 340.
Maddison,D. (1991) The discovery and importance of multiple islands of most parsimonious trees. Syst. Zool., 40,
MathWorks (2000) Matlab 6.1. Natick, MA, USA.
McMorris,F. and Steel,M. (1993) The complexity of the median procedure for binary trees. In Proceedings of the International Federation of Classification Societies. Springer.
Moret,B., Wang,L.-S., Warnow,T. and Wyman,S. (2001a) New approaches for reconstructing phylogenies from gene order data. In Proceedings of the 9th International Conference on Intelligent Systems for Molecular Biology (ISMB 2001). AAAI Press, pp.
Moret,B., Wyman,S., Bader,D., Warnow,T. and Yan,M. (2001b) A new implementation and detailed study of breakpoint analysis. In Proceedings of the 6th Pacific Symposium on Biocomputing (PSB 2001), pp.
Nelson,G. (1979) Cladistic analysis and synthesis: principles and definitions, with a historical note on Adanson's Familles des Plantes ( ). Syst. Zool., 28,
Phillips,C. and Warnow,T.
(1996) The asymmetric median tree: a new model for building consensus trees. Disc. App. Math., 71, Robinson,D. and Foulds,L. (1981) Comparison of phylogenetic trees. Math. Biosci., 53, Steel,M. and Warnow,T. (1993) Kaikoura tree theorems: the maximum agreement subtree problem. Inf. Process. Lett., 48, Swofford,D. (01) PAUP* 4.0. Sinauer Associates. Swofford,D., Olson,G., Waddell,P. and Hillis,D. (1996) Phylogenetic inference. In Hillis,D., Moritz,C. and Mable,B. (eds), Molecular Systematics, Chapter 11, 2nd edn, Sinauer Associates, pp Thorley,J., Wilkinson,M. and Charleston,M. (1998) The information content of consensus trees. Advances in Data Science and Classification. Springer, pp Tishby,N., Pereira,F. and Bialek,W. (1999) The information bottleneck method. The 37th annual Allerton Conference on Communication, Control, and Computing. pp Van de Peer,Y., De Rijk,P., Wuyts,J., Winkelmans,T. and De Wachter,R. (00) The European small subunit ribosomal RNA database. Nucleic Acids Res., 28, Weeks,A., Larkin,L. and Simpson,B. (01) A chloroplast DNA molecular study of the phylogenetic relationships of members of the Caesalpinia group. Botany 01 Abstracts. Botanical Society of America, pp S293


Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Feature Selection CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Dimensionality reduction Feature selection vs. feature extraction Filter univariate

More information

Parallelizing SuperFine

Parallelizing SuperFine Parallelizing SuperFine Diogo Telmo Neves ESTGF - IPP and Universidade do Minho Portugal dtn@ices.utexas.edu Tandy Warnow Dept. of Computer Science The Univ. of Texas at Austin Austin, TX 78712 tandy@cs.utexas.edu

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Codon models. In reality we use codon model Amino acid substitution rates meet nucleotide models Codon(nucleotide triplet)

Codon models. In reality we use codon model Amino acid substitution rates meet nucleotide models Codon(nucleotide triplet) Phylogeny Codon models Last lecture: poor man s way of calculating dn/ds (Ka/Ks) Tabulate synonymous/non- synonymous substitutions Normalize by the possibilities Transform to genetic distance K JC or K

More information

Lecture: Bioinformatics

Lecture: Bioinformatics Lecture: Bioinformatics ENS Sacley, 2018 Some slides graciously provided by Daniel Huson & Celine Scornavacca Phylogenetic Trees - Motivation 2 / 31 2 / 31 Phylogenetic Trees - Motivation Motivation -

More information

A practical O(n log 2 n) time algorithm for computing the triplet distance on binary trees

A practical O(n log 2 n) time algorithm for computing the triplet distance on binary trees A practical O(n log 2 n) time algorithm for computing the triplet distance on binary trees Andreas Sand 1,2, Gerth Stølting Brodal 2,3, Rolf Fagerberg 4, Christian N. S. Pedersen 1,2 and Thomas Mailund

More information

Clustering. Lecture 6, 1/24/03 ECS289A

Clustering. Lecture 6, 1/24/03 ECS289A Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Rotation Distance is Fixed-Parameter Tractable

Rotation Distance is Fixed-Parameter Tractable Rotation Distance is Fixed-Parameter Tractable Sean Cleary Katherine St. John September 25, 2018 arxiv:0903.0197v1 [cs.ds] 2 Mar 2009 Abstract Rotation distance between trees measures the number of simple

More information

Distance-based Phylogenetic Methods Near a Polytomy

Distance-based Phylogenetic Methods Near a Polytomy Distance-based Phylogenetic Methods Near a Polytomy Ruth Davidson and Seth Sullivant NCSU UIUC May 21, 2014 2 Phylogenetic trees model the common evolutionary history of a group of species Leaves = extant

More information

Genome 559: Introduction to Statistical and Computational Genomics. Lecture15a Multiple Sequence Alignment Larry Ruzzo

Genome 559: Introduction to Statistical and Computational Genomics. Lecture15a Multiple Sequence Alignment Larry Ruzzo Genome 559: Introduction to Statistical and Computational Genomics Lecture15a Multiple Sequence Alignment Larry Ruzzo 1 Multiple Alignment: Motivations Common structure, function, or origin may be only

More information