Statistically based postprocessing of phylogenetic analysis by clustering
BIOINFORMATICS Vol. 18 Suppl., Pages S285-S293

Statistically based postprocessing of phylogenetic analysis by clustering

Cara Stockham 1, Li-San Wang 2 and Tandy Warnow 2

1 Texas Institute for Computational and Applied Mathematics, University of Texas, ACES 6.412, Austin, TX 78712, USA and 2 Department of Computer Sciences, University of Texas, Austin, TX 78712, USA

Received on January 24, 2002; revised and accepted on March 29, 2002

ABSTRACT

Motivation: Phylogenetic analyses often produce thousands of candidate trees. Biologists resolve the conflict by computing the consensus of these trees. Single-tree consensus methods used as postprocessing can be unsatisfactory due to their inherent limitations.

Results: In this paper we present an alternative approach that applies clustering algorithms to the set of candidate trees. We propose bicriterion problems, in particular using the concept of information loss, and new consensus trees called characteristic trees that minimize the information loss. Our empirical study using four biological datasets shows that our approach provides a significant improvement in the information content, while adding only a small amount of complexity. Furthermore, the consensus trees we obtain for each of our large clusters are more resolved than the single-tree consensus trees. We also provide some initial progress on theoretical questions that arise in this context.

Availability: Software available upon request from the authors. The agglomerative clustering is implemented in Matlab (MathWorks, 2000) with the Statistics Toolbox. The Robinson-Foulds distance matrices and the strict consensus trees are computed using PAUP* (Swofford, 2001) and Daniel Huson's tree library on Intel Pentium workstations running Debian Linux.

Contact: lisan@cs.utexas.edu

Keywords: consensus methods; clustering; phylogenetics; information theory; maximum parsimony.

INTRODUCTION

Phylogenetic analysis can be divided into three stages.
In the first stage, a researcher collects data (such as DNA sequences) for each of the different taxa (genes, species, etc.) under study. In the second phase, she applies a tree reconstruction method to the data. Many tree reconstruction methods produce more than one candidate tree for the input dataset. For example, the maximum parsimony method (Swofford et al., 1996) returns those binary trees with the lowest parsimony score. (The parsimony score of a tree is the minimum tree length, i.e., the sum of distances between the two endpoints across all edges, obtained by any labeling of the internal nodes.) Very often the number of trees is in the hundreds or thousands. In the last phase, a consensus tree of the candidate trees is computed so as to resolve the conflict, summarize the information, and reduce the overwhelming number of possible solutions to the evolutionary history. Many consensus tree methods are available, but a feature common to all of them is that they produce a single tree. This approach has several shortcomings, including loss of information and sensitivity to outliers. In this paper we present a different approach to postprocessing: the set of candidate trees is divided into several subsets using clustering methods, and each cluster is then characterized by its own consensus tree. We pose several theoretical optimization problems for these kinds of outputs and present some initial progress on them in the section on Clustering Criteria. The bulk of our paper is an empirical study, presented in the Experiments section. We conclude our study and propose additional research problems in the Conclusions section.

BACKGROUND

Phylogenetic trees

A leaf-labeled tree topology can be decomposed into a set of bipartitions in the following manner.
Each edge, when deleted from the tree, induces a bipartition of the leaves; thus, we can identify each edge with its induced bipartition. Let t1 and t2 be two trees on the same leaf set, and let E(t1) and E(t2) denote their sets of internal edges. The quantity |E(t1) Δ E(t2)| = |(E(t1) − E(t2)) ∪ (E(t2) − E(t1))| is called the Robinson-Foulds (RF) distance (Robinson and Foulds, 1981) between the two trees. A tree t' refines another tree t if E(t) ⊆ E(t'). If a tree t has n leaves, then the number of internal edges of t is between 0 and n − 3. If t has n − 3 internal edges, t is a binary tree; we also say it is fully refined.

Consensus tree methods

Let T be a set of trees on the same taxa. The strict consensus tree is the tree whose internal edges are those present in all trees in T; thus, all trees in T refine the strict consensus of T. The strict consensus is a conservative hypothesis about the true phylogeny suggested by the set T. By using the strict consensus to summarize a phylogenetic analysis, we lose much information about the whole set of candidate trees, including how the trees are distributed in the space of all binary trees and how similar the trees are to each other. There are other consensus tree methods that produce a consensus tree whose leaf set is the whole set of taxa (McMorris and Steel, 1993; Adams, 1986; Nelson, 1979; Phillips and Warnow, 1996; Kannan et al., 1998), or one whose leaf set is a subset of the taxa (Steel and Warnow, 1993; Ganapathysaravanabavan and Warnow, 2001).

Notation

Let S = {1, 2, ..., n} denote the n taxa being studied. Let T_n denote the set of all (unrooted) binary trees with S as their leaf set; |T_n| = (2n − 5)!! = 1 × 3 × ⋯ × (2n − 5). Let T denote the set of input trees (e.g., the most parsimonious trees from a maximum parsimony analysis). C is a clustering of T ⊆ T_n if C is a partition of a subset of T. If every tree of T belongs to some cluster, C is said to cover T, i.e., C is covering; otherwise, C is noncovering. Each member C of C is a cluster. Let SC(C) denote the strict consensus of all trees in C. The bounding ball B(C) of a cluster C is defined by B(C) = {t ∈ T_n : t refines SC(C)}. We let B(C) = ∪_{C∈C} B(C). We use d(t, t') to denote the Robinson-Foulds distance between two trees t and t'.

CRITERIA FOR CLUSTERING IN THE TREE SPACE

In this section we describe the criteria used for clustering phylogenetic trees.

Biologically based criteria

Parameters for clustering.
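The bipartition encoding of trees and the RF distance can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation (the helper names are ours): each tree is stored as the set of bipartitions of its internal edges, with each bipartition canonicalized as the side of the split that does not contain the smallest taxon, and the RF distance is the size of the symmetric difference of two such sets.

```python
def bipartition(side, leaves):
    """Canonical form of the bipartition induced by an internal edge:
    the side of the split that does NOT contain the smallest taxon."""
    side = frozenset(side)
    rest = frozenset(leaves) - side
    # min(leaves) lies in exactly one side; return the other one.
    return side if min(leaves) in rest else rest

def rf_distance(tree1, tree2):
    """Robinson-Foulds distance between two trees given as sets of
    canonical bipartitions: |E(t1) - E(t2)| + |E(t2) - E(t1)|."""
    return len(tree1 ^ tree2)
```

For example, two binary trees on five taxa that share the split {1,2}|{3,4,5} but disagree on their second internal edge are at RF distance 2.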
Given a cluster C, we define the following parameters of C:

1. diam(C) = max_{t,t'∈C} d(t, t') is the diameter of C.
2. λ(C) = |E(SC(C))| / (n − 3) is the specificity of C; it is the normalized number of internal edges of the strict consensus of C.
3. ρ(C) = |C| / |B(C)| is the density of C.

Biologists are interested in the specificity; the higher it is, the more information is present. This value is related to the diameter, since it is easy to show that 1 − diam(C)/(2(n − 3)) ≤ λ(C). The density is also important since it shows how many of the trees refining the strict consensus are optimal, i.e., in the input. Based on these parameters we can define parameters of the whole clustering C. Let f(C) be a parameter value of cluster C. We have:

1. M(C; f) = max_{C∈C} f(C) (the maximum value of f over all clusters).
2. m(C; f) = min_{C∈C} f(C) (the minimum value of f over all clusters).
3. W(C; f) = (1/|T|) Σ_{C∈C} |C| f(C) (the weighted value of f over all clusters).
4. k = |C| (the number of clusters).

Bicriterion problems. We might want to cluster T so as to maximize the minimum specificity, but putting each tree into its own cluster trivially solves that problem. It is therefore more interesting to optimize one parameter with respect to another, e.g., maximize the minimum specificity for a fixed number of clusters. As we will show when discussing statistically based criteria, bicriterion problems involving k, the number of clusters, are the most natural and interesting. Also note that when we refine a clustering by dividing some of its clusters, the diameter of each new cluster is at most the diameter of the original cluster, so the minimum, maximum, and weighted sum of diameters decrease; similarly, the minimum, maximum, and weighted sum of specificities increase.

OBSERVATION 1. The minimum, maximum, and weighted sum of diameters or specificities are monotone with respect to refinements of clusterings.

Therefore, by dividing clusters we improve certain parameter values.
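With trees stored as sets of bipartition identifiers, the cluster parameters defined above can be computed directly with set algebra. A sketch with our own helper names (the strict consensus is represented only by its shared bipartition set):

```python
from itertools import combinations

def strict_consensus(cluster):
    """Bipartitions shared by every tree in the cluster: E(SC(C))."""
    trees = iter(cluster)
    shared = set(next(trees))
    for t in trees:
        shared &= t
    return shared

def diameter(cluster):
    """diam(C): largest RF distance between any two trees in C."""
    return max((len(a ^ b) for a, b in combinations(cluster, 2)), default=0)

def specificity(cluster, n):
    """lambda(C): internal edges of SC(C), normalized by n - 3."""
    return len(strict_consensus(cluster)) / (n - 3)
```

On a toy cluster of two binary trees on n = 5 taxa that share one of their two internal edges, diam = 2 and λ = 0.5, consistent with the bound 1 − diam(C)/(2(n − 3)) ≤ λ(C).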
As we will see in the Experiments section, the score of each clustering obtained by agglomerative clustering improves as the number of clusters increases.

Statistically based criteria

Biologists assume the true tree is among the trees obtained during a phylogenetic analysis; without any additional information, all trees are considered equally likely to be the true tree. Thus, the set of trees defines a probability distribution on tree space. Consider the original set T of m binary trees, each of them having the same probability of being the true tree. The corresponding distribution is

f(t) = 1/m if t ∈ T, and f(t) = 0 if t ∉ T.
Because the number of trees can be overwhelming, biologists replace them with their strict consensus tree, and the original trees are then ignored. Knowing only that the true tree refines this consensus tree, we have another probability distribution, in which every binary tree that refines the consensus tree is equally likely to be the true tree. Let B = B(T) and b = |B|:

g(t) = 1/b if t ∈ B, and g(t) = 0 if t ∉ B.

If C is a clustering of T containing more than one cluster, let B = ∪_{C∈C} B(C) be the union of the bounding balls, and let b = |B|; the probability distribution is then defined as above. Our objective is to increase the number of consensus trees to a still tolerably small number so that the probability distribution defined by these trees is closer to that of the original output. We call this the uniform distribution, and use it to evaluate the information conveyed in a clustering C. Note that f and g agree if every tree in T is in its own cluster, in which case there is no information loss in C. Using the number of clusters to represent the complexity of a clustering, we can then define bicriterion problems called complexity versus information loss.

Information loss. We define the information loss as the distance between the distributions of two clusterings. Let f and g be the distributions of the original set of trees and of the clustering of the input, respectively. Some popular distances are:

1. L∞ distance: L∞(f, g) = max_t |f(t) − g(t)|.
2. L1 distance: L1(f, g) = Σ_t |f(t) − g(t)|.
3. L2 distance: L2(f, g) = ( Σ_t (f(t) − g(t))² )^{1/2}.
4. Kullback-Leibler (KL) distance (Kullback, 1987): H(g|f) = Σ_t f(t) ln( f(t)/g(t) ).

Note that if T = B(C) then all of the above distances are 0 for the uniform distribution. The KL distance is not symmetric. The technical difficulty with the KL distance is that there may be trees t such that f(t) ≠ 0 but g(t) = 0, so the term f(t) ln( f(t)/g(t) ) is not finite.
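For distributions with small support, the four distances can be written down directly. The sketch below (function and variable names are ours) takes sparse {tree: probability} maps, treats missing keys as probability 0, and uses the 0 · ln 0 = 0 convention for the KL distance, which therefore requires g(t) > 0 wherever f(t) > 0:

```python
import math

def distribution_distances(f, g):
    """L-infinity, L1, L2 and KL distances between two distributions
    given as {tree: probability} dicts (missing keys mean probability 0)."""
    support = set(f) | set(g)
    diffs = [f.get(t, 0.0) - g.get(t, 0.0) for t in support]
    linf = max(abs(d) for d in diffs)
    l1 = sum(abs(d) for d in diffs)
    l2 = math.sqrt(sum(d * d for d in diffs))
    # Terms with f(t) = 0 contribute 0; g(t) must be positive on f's support.
    kl = sum(p * math.log(p / g[t]) for t, p in f.items() if p > 0)
    return linf, l1, l2, kl
```

For instance, with f uniform on 2 trees and g uniform on a 4-tree bounding ball containing them, the KL distance is ln 2, matching the ln(b/m) value derived below for m = 2, b = 4.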
We avoid this difficulty by assuming C is covering (T ⊆ B(C)). On the other hand, for trees t ∈ B(C) − T we have f(t) = 0 and g(t) ≠ 0; for these we set f(t) ln( f(t)/g(t) ) = 0 (this is based on the observation that lim_{n→∞} (1/n) ln(1/n) = 0).

THEOREM 1. Among all clusterings C satisfying T ⊆ B(C), a clustering C that minimizes |B(C)| has minimal KL distance with respect to the uniform distribution.

PROOF. We note

H(g|f) = Σ_{t∈T} f(t) ln( f(t)/g(t) ) + Σ_{t∈B−T} f(t) ln( f(t)/g(t) )
       = m · (1/m) ln( (1/m)/(1/b) ) + 0
       = ln(b/m).

Since we assume b ≥ m, the distance is minimized (it equals 0) when b = m.

COROLLARY 1. The Kullback-Leibler distance with the uniform distribution is monotone with respect to refinements of clusterings.

This is due to the fact that the number of trees in the union of the bounding balls stays the same or decreases when clusters are divided. Our definition of information loss is similar to the information bottleneck introduced in Tishby et al. (1999). In Thorley et al. (1998) the information content of a single consensus tree is discussed.

Characteristic tree

In this section we look at the characteristic k-set problem: we want to find a set of k trees such that the induced distribution is closest to the original distribution. We define the problem and show it can be solved in polynomial time for the case k = 1, using the uniform distribution and the distances defined in the previous section. The single-characteristic-tree problem is as follows. Assume we intend to use a tree t to replace the whole set of trees T. Let B(t) be the set of binary trees that refine t. The uniform distribution induced by t is such that all trees in B(t) have the same probability, and all trees outside B(t) have zero probability. The information loss is as defined in the previous section. Note that we allow the case where some tree(s) in T do not refine t, i.e., T may not be covered.

THEOREM 2. Let T be a set of binary trees with the same set of leaves {1, 2, ..., n}.
We use the uniform distribution in measuring the information loss.

1. The strict consensus of T is the characteristic tree of T with respect to the L1, L2, and KL distances.
2. The characteristic tree of T with respect to the L∞ distance can be computed in O(n|T|) time.

PROOF. From the discussion of the KL distance we see immediately that the strict consensus minimizes the KL distance (for the KL distance the characteristic tree must cover T). Similarly one can prove the strict consensus optimizes the L1 and L2 distances if every tree in T is in the cluster and we allow only one cluster. For the L∞ distance, let C be the cluster, B the bounding ball of C, and let m = |C| and b = |B|. Then

L∞(f, g) = max_t |f(t) − g(t)| = max{ (1/m)·1(T − B), (1/b)·1(B − T), |1/m − 1/b|·1(T ∩ B) }.

Here the indicator function 1(X) is defined as follows: 1(X) = 1 if X ≠ ∅, and 1(X) = 0 if X = ∅. If T ⊆ B (T is covered) then b ≥ m and L∞(f, g) = max{ (1/b)·1(B − T), 1/m − 1/b } < 1/m, whereas a noncovering clustering (T − B ≠ ∅) has L∞(f, g) = 1/m. Thus a noncovering clustering does not optimize the L∞ distance with respect to the uniform distribution. Assume C is covering. If B = T, then L∞(f, g) = 0; otherwise, the L∞ distance is minimized when b = 2m. A simple algorithm that finds the optimal characteristic tree under the L∞ distance is as follows. First compute the strict consensus SC(T) and the corresponding density. If the density is below 0.5 the problem is solved; otherwise, find an edge of SC(T) which, when contracted, yields the new tree with the smallest possible number of refinements. For an edge (u, v), this can be determined from the degrees of u and v. Let n be the number of leaves in each tree in T. Computing the strict consensus of T takes O(nm) time (Day, 1985), and finding the edge with minimal increase in the number of refinements takes O(n) time.

EXPERIMENTS

Clustering algorithms

K-means clustering. We implement two variants of the K-means algorithm, a well-studied clustering method in data mining. First we use binary vectors to represent trees (KmVec). Let x_t be the vector corresponding to tree t. Every entry (x_t)_i in x_t corresponds to a bipartition i induced by some internal edge in at least one tree in T; (x_t)_i = 1 if i ∈ E(t), and (x_t)_i = 0 otherwise. The mean of a cluster is the average of the binary vectors of the trees in the cluster; it does not necessarily represent a tree. The distance between vectors is the Euclidean distance.
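The KmVec representation just described can be sketched as follows (a minimal illustration, not the authors' Matlab code; the function names are ours). Each tree becomes a 0/1 vector indexed by the union of all bipartitions seen in the input; a useful consequence is that the squared Euclidean distance between two such vectors equals the RF distance between the trees.

```python
def tree_vectors(trees):
    """Encode each tree (a set of bipartition ids) as a 0/1 vector over
    the union of all bipartitions appearing in the input."""
    universe = sorted(set().union(*trees))
    vectors = [[1 if b in t else 0 for b in universe] for t in trees]
    return vectors, universe

def cluster_mean(vectors):
    """Coordinate-wise average of the vectors in a cluster; in general
    this is not the vector of any tree."""
    k = len(vectors)
    return [sum(col) / k for col in zip(*vectors)]
```

Each squared coordinate difference of two 0/1 vectors is 1 exactly when the corresponding bipartition lies in one tree but not the other, which is why the squared Euclidean distance counts the symmetric difference.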
In the other variant (Kmeans), we use the strict consensus of each cluster as its mean and the RF distance to assign trees to clusters. In both versions, the objective function is the sum of the squared distances between each tree and its closest mean.

Agglomerative clustering. We make no changes to the agglomerative clustering algorithm, another popular algorithm in data mining. The pairwise distance is again the RF distance. The similarity measures between clusters are as follows:

1. Agg0: minimum pairwise distance: merge the two clusters C1 and C2 that minimize min_{t1∈C1, t2∈C2} d(t1, t2).
2. Agg1: maximum pairwise distance: merge the two clusters C1 and C2 that minimize max_{t1∈C1, t2∈C2} d(t1, t2).
3. Agg2: average pairwise distance: merge the two clusters C1 and C2 that minimize (1/(|C1||C2|)) Σ_{t1∈C1, t2∈C2} d(t1, t2).

When using the first and second similarity measures, the algorithms are called single linkage and complete linkage, respectively.

Phylogenetic islands. The only method currently used by biologists for clustering phylogenetic trees is phylogenetic islands (PhyIsl) (Maddison, 1991). A tree topological operation (or move) is called a Tree Bisection and Reconnection (TBR) move (Swofford et al., 1996) if it does the following. An edge (u, v) is removed from tree T to create two unrooted subtrees T_u and T_v. We then connect the two subtrees to form a new tree T' by inserting a new edge e = (u', v'), where u' is on an edge in T_u and v' is on an edge in T_v (if either of the two subtrees is a single-node tree, the endpoint is the node itself). A phylogenetic island of a set of trees T is defined as follows. We create a graph G where each vertex corresponds to a tree in T, and there is an edge (i, j) between two trees i and j if they are one TBR move apart. Each connected component of G is a phylogenetic island (cluster). The significance of this particular clustering method lies in its relation to maximum parsimony and heuristic search.
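The Agg1 (complete-linkage) merge rule described above can be sketched with a naive pure-Python agglomeration over a precomputed RF distance matrix. This is our own simplified O(k·n³) version for illustration; a real analysis would use an optimized hierarchical-clustering library.

```python
def complete_linkage(dist, k):
    """Naive complete-linkage (Agg1) agglomerative clustering.

    dist: symmetric distance matrix as a list of lists (e.g. pairwise
    RF distances); k: desired number of clusters. Repeatedly merges the
    pair of clusters whose MAXIMUM pairwise distance is smallest."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]   # merge the best pair
        del clusters[j]
    return clusters
```

Swapping the inner `max` for `min` or for the average gives Agg0 (single linkage) and Agg2, respectively.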
In PAUP* 4.0 (Swofford, 2001) the heuristic search implements hill-climbing in tree space by using TBR moves to modify a given tree topology to obtain new candidate trees.

Settings for the experiment. We use 2, 3, ... clusters for both agglomerative clustering and K-means clustering. We also use the strict consensus trees of the clusters produced by complete-linkage agglomerative clustering as the initial means in the K-means algorithm (KmAgg). The motive is to avoid being trapped in a local optimum due to the randomness in choosing the initial means.

Datasets

We obtained four datasets for our empirical study: Camp (Cosner et al., 2000; Moret et al., 2001a), Caesal (Weeks et al., 2001), PEVCCA1, and PEVCCA2.

Fig. 1. Results of the clustering experiment using the Caesal dataset. See the section on Experiment Settings for details.

The Camp dataset is obtained using the GRAPPA (Moret et al., 2001b) software to reconstruct the breakpoint phylogeny of the Campanulaceae family (see Moret et al. (2001b) for an explanation of the breakpoint phylogeny). The dataset contains 216 trees on 13 leaves. The strict consensus tree for this dataset is 60% resolved. The Caesal dataset is obtained by maximum parsimony searches of the trnL-trnF intron and spacer regions of the chloroplast genome from the Caesalpinia family. The dataset has 450 trees on 51 leaves. The strict consensus tree for this dataset is 77% resolved. The PEVCCA dataset, provided by Derrick Zwickl, is obtained by maximum parsimony searches of small subunit ribosomal RNA sequences (Van de Peer et al., 2000); the dataset consists of 5630 trees on 129 leaves divided into 78 phylogenetic islands. PEVCCA stands for Porifera (sea sponges), Echinodermata (sea urchins, sea cucumbers), Vertebrata (fish, reptiles, mammals), Cnidaria (jellyfish), Crustacea (crabs, lobsters, shrimp), and Annelida (segmented worms). The PEVCCA1 dataset contains 168 most parsimonious trees of PEVCCA (1 island). The strict consensus tree for this dataset is 77% resolved. The PEVCCA2 dataset includes the next best trees as well, for a total of 654 trees (5 islands). The strict consensus tree is 72% resolved.

Correlation of the parameters

We collected the output generated by all clustering methods and computed the correlation coefficient between KL and the other parameters.
Since KL is highly correlated (0.9 or above) with specificity, diameter, and density, this suggests KL is an informative criterion for the quality of a clustering. (The correlation coefficient indicates how two parameters relate to each other linearly in the sample; its value lies between −1 and 1, and a value close to −1 or 1 indicates the two parameters are close to linearly related. See any standard statistics textbook for more details.) See our website for more details.

Comparison of different algorithms

We compute the parameters in Table 1 for each clustering produced by the algorithms being tested. The results are in Figures 1, 2, and 3. We add minus signs in front of those parameters for which we prefer larger values; therefore a lower value in the y-direction is always favored. See our website for the full set of figures.

Fig. 2. Results of the clustering experiment using the PEVCCA1 dataset. See the section on Experiment Settings for details.

Table 1. Clustering parameters for the experiments

  Linf       the L∞ distance.
  L1         the L1 distance.
  L2         the L2 distance.
  KL         the Kullback-Leibler distance.
  wtddiam    W(C; diam), weighted sum of diameters.
  wtdspec    W(C; λ), weighted sum of specificities.
  wtdden     W(C; ρ), weighted sum of densities.
  logwtdden  logarithm of wtdden.

Caesal dataset. Most of the time the Kmeans clustering has the worst performance of all the methods. With a large enough number of clusters (5 or above), the KmVec algorithm can have very good scores in the parameters other than L1, L2, Linf, and KL, but has suboptimal scores in these information-loss measures. The Agg0 algorithm (single linkage) is unsatisfactory in all parameters, and increasing the number of clusters provides little improvement. Agg1 (complete linkage) has better overall performance in all parameters than Agg2, which in turn is better than KmVec in all information-loss measures (except for the Linf distance at certain numbers of clusters). The PhyIsl clustering is the same as the Agg1 two-cluster clustering. It is interesting to note that by increasing the number of clusters, Kmeans and KmVec can become worse. Since increasing the number of clusters in agglomerative clustering means refining the clustering by dividing some clusters, agglomerative clustering improves with respect to the monotone parameters; the K-means clusterings, however, generally do not have this refinement relationship as the number of clusters increases.

PEVCCA1 dataset. In this dataset and the next (PEVCCA2), the L1, L2, and Linf distances are uninformative; they always return values close to or equal to the maximum allowed value.
This is due to the relatively low density of each cluster, which causes many trees to be in B(C) but not in T and thus contribute to the distance. KL, however, is very informative. There is only one phylogenetic island. In this dataset, all clustering methods other than PhyIsl have similar performance in all parameters. As the number of clusters increases, all the parameters improve.

Fig. 3. Results of the clustering experiment using the PEVCCA2 dataset. See the section on Experiment Settings for details.

PEVCCA2 dataset. Kmeans is inferior in performance to Agg1 and Agg2, but better than Agg0. When the number of clusters is low, Agg2 has better scores than Agg1; when the number of clusters is high (5 or more), Agg1 and Agg2 have similar performance. KmVec can be as good as Agg1 and Agg2 until the number of clusters reaches 7 or more, at which point its performance becomes suboptimal. The performance of PhyIsl is very bad in all parameters considered when compared to all the other methods.

Camp dataset. We applied Agg1 to this dataset. The dataset contains 216 trees out of the 315 refinements of its strict consensus, which means the density is high. When we try to cluster the dataset, the specificities of the consensus trees improve only slightly, while the densities drop dramatically. This suggests that one cluster is sufficient for this dataset, and that agglomerative clustering, by illustrating this fact, is robust.

Summary. Agg1 and Agg2 have the best overall performance. Both Kmeans and KmVec are unreliable, and Agg0 and PhyIsl tend to have the worst performance.

Comparing clustering outputs to single-tree consensus

In this section we compare the outputs of clustering to the single-consensus approach. The comparison is done using Caesal, PEVCCA1, and PEVCCA2. In each dataset, we compare the output of Agg1 with the strict consensus tree of the whole dataset. The number of clusters is determined by finding the point where the improvement starts to diminish; we use 3 clusters for Caesal and PEVCCA1, and 5 clusters for PEVCCA2. The results are in Table 2.
Table 2. Comparison of the clustering approach and the single-consensus approach. We use Agg1 with 3 clusters for Caesal and PEVCCA1, and 5 clusters for PEVCCA2. The numtrees and numref fields are the number of trees in the cluster and the number of refinements of the strict consensus of the cluster, respectively. The 1clu row in each dataset corresponds to the strict consensus of the whole set of trees.

In each of the datasets, the strict consensus tree of each cluster is much more resolved than the strict consensus of the whole dataset. The Caesal dataset has one large cluster (cluster 2), one medium cluster (cluster 1), and one small cluster (cluster 3). The small cluster is sparse; it has more refinements than the medium cluster yet relatively few trees, suggesting it is a collection of outliers in the whole set of trees. Similarly, cluster 2 in the PEVCCA1 dataset and clusters 3 and 5 in the PEVCCA2 dataset are sparse. We remove these sparse clusters from the datasets. The percentages of trees dropped from Caesal, PEVCCA1, and PEVCCA2 are 4%, 21.4%, and 14.4%, respectively. The specificities of the strict consensus trees of Caesal, PEVCCA1, and PEVCCA2 then increase to 85.4%, 81.7%, and 75.4%, respectively. This suggests the Caesal dataset is dominated by two major clusters (clusters 1 and 2) that are closer to each other than to cluster 3; the small increase in the specificities of PEVCCA1 and PEVCCA2 suggests their larger clusters are far away from each other.

CONCLUSIONS

In this paper we studied the clustering approach as a replacement for single-consensus postprocessing methods in phylogenetic analysis. We set up a framework for clustering in the space of phylogenetic trees and proposed several optimization problems. Of particular merit is our new approach, which we call the complexity versus information content problem. We also proposed the single-characteristic-tree problem using the concept of information loss, and showed that it can be solved in polynomial time for all four distribution distances discussed. We then applied the most popular clustering algorithms used by computer scientists, as well as the phylogenetic island method used by biologists, to real datasets.
We showed that complete-linkage agglomerative clustering outperforms the other methods we examined with respect to most of our optimization criteria. We also looked at the best clusterings for each dataset and compared the strict consensus trees of the clusters to the strict consensus trees of the datasets, demonstrating an improvement of the multi-tree consensus over the single strict consensus. Our new approach can be used to improve the degree of resolution of the output trees and to provide more detail about how the candidate trees are distributed. From our experimental study we recommend complete-linkage agglomerative clustering with different numbers of clusters, and the use of the Kullback-Leibler distance as the quality measure. A future research direction is to use nonuniform distributions in the information loss for clusterings and characteristic trees; for example, we can weight trees according to their likelihood scores, and weight clusters according to their densities or the number of input trees they contain. Other open problems include developing algorithms that solve tree clustering problems optimally; one example is finding the optimal k-clustering (k > 1) for the Kullback-Leibler distance, or, as we have shown, finding the clustering that has the smallest set of refinements.

ACKNOWLEDGMENTS

This research is funded by the David and Lucile Packard Foundation and the National Science Foundation (EIA and DEB). We thank Beryl Simpson for providing us with the Caesal dataset, Derrick Zwickl for providing us with the PEVCCA dataset, and Robert Jansen and Bernard Moret for providing us with the Camp dataset. Some of the implementations in this paper use Daniel Huson's tree library software. We thank Nina Amenta, Inderjit Dhillon, Jeff Klingner, Usman Roshan, and the participants of the DIMACS Bioconsensus II Workshop (2001) for their comments and suggestions.

REFERENCES

Adams,E. (1986) N-trees as nestings: complexity, similarity and consensus. J. Classification, 3.
Cosner,M., Jansen,R., Moret,B., Raubeson,L., Wang,L.-S., Warnow,T. and Wyman,S. (2000) A new fast heuristic for computing the breakpoint phylogeny and a phylogenetic analysis of a group of highly rearranged chloroplast genomes. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB 2000). AAAI Press.
Day,W. (1985) Optimal algorithms for comparing trees with labelled leaves. J. Classification, 2.
Ganapathysaravanabavan,G. and Warnow,T. (2001) Finding a maximum compatible tree for a bounded number of trees with bounded degree is solvable in polynomial time. In Proceedings of the 1st Workshop on Algorithms in Bioinformatics (WABI 2001), Lecture Notes in Computer Science, 2149.
Kannan,S., Warnow,T. and Yooseph,S. (1998) Computing the local consensus of trees. SIAM J. Comput., 27.
Kullback,S. (1987) The Kullback-Leibler distance. Am. Stat., 41, 340.
Maddison,D. (1991) The discovery and importance of multiple islands of most parsimonious trees. Syst. Zool., 40.
MathWorks (2000) Matlab 6.1. Natick, MA, USA.
McMorris,F. and Steel,M. (1993) The complexity of the median procedure for binary trees. In Proceedings of the International Federation of Classification Societies. Springer.
Moret,B., Wang,L.-S., Warnow,T. and Wyman,S. (2001a) New approaches for reconstructing phylogenies from gene order data. In Proceedings of the 9th International Conference on Intelligent Systems for Molecular Biology (ISMB 2001). AAAI Press.
Moret,B., Wyman,S., Bader,D., Warnow,T. and Yan,M. (2001b) A new implementation and detailed study of breakpoint analysis. In Proceedings of the 6th Pacific Symposium on Biocomputing (PSB 2001).
Nelson,G. (1979) Cladistic analysis and synthesis: principles and definitions, with a historical note on Adanson's Familles des Plantes. Syst. Zool., 28.
Phillips,C. and Warnow,T. (1996) The asymmetric median tree: a new model for building consensus trees. Disc. Appl. Math., 71.
Robinson,D. and Foulds,L. (1981) Comparison of phylogenetic trees. Math. Biosci., 53.
Steel,M. and Warnow,T. (1993) Kaikoura tree theorems: the maximum agreement subtree problem. Inf. Process. Lett., 48.
Swofford,D. (2001) PAUP* 4.0. Sinauer Associates.
Swofford,D., Olsen,G., Waddell,P. and Hillis,D. (1996) Phylogenetic inference. In Hillis,D., Moritz,C. and Mable,B. (eds), Molecular Systematics, 2nd edn, Chapter 11. Sinauer Associates.
Thorley,J., Wilkinson,M. and Charleston,M. (1998) The information content of consensus trees. In Advances in Data Science and Classification. Springer.
Tishby,N., Pereira,F. and Bialek,W. (1999) The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing.
Van de Peer,Y., De Rijk,P., Wuyts,J., Winkelmans,T. and De Wachter,R. (2000) The European small subunit ribosomal RNA database. Nucleic Acids Res., 28.
Weeks,A., Larkin,L. and Simpson,B. (2001) A chloroplast DNA molecular study of the phylogenetic relationships of members of the Caesalpinia group. In Botany 2001 Abstracts. Botanical Society of America.