Distance based tree reconstruction. Hierarchical clustering (UPGMA) Neighbor-Joining (NJ)
- Ambrose Jennings
- 5 years ago
Transcription
1 Distance based tree reconstruction Hierarchical clustering (UPGMA) Neighbor-Joining (NJ)
2 All organisms have evolved from a common ancestor. Infer the evolutionary tree (tree topology and edge lengths) from molecular data. A gene tree inferred from a homologous gene from different species might be different from the species tree.
3 Distance based reconstruction Goal: Reconstruct an evolutionary tree from a distance matrix. Input: an n x n distance matrix d. Output: a weighted tree T with n leaves fitting d. If d is additive, this problem has an efficient solution (which we will come back to); otherwise we should aim at finding a tree that fits d as well as possible.
4 Distance based reconstruction The ideal case: the tree reflects the distance matrix, i.e. the distances induced by the tree are equal to the distances in the matrix. If this is possible, the distance matrix is called additive.
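To make "the distances induced by the tree" concrete, here is a minimal Python sketch. The edge-list representation, the function names `tree_distances` and `fits`, and the four-leaf example tree are all assumptions made for illustration; they are not taken from the slides.

```python
def tree_distances(edges, leaves):
    """Return {(a, b): path length} for all leaf pairs of a weighted tree.

    `edges` is a list of (u, v, length) tuples describing an undirected tree.
    """
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))

    def from_source(src):
        # Depth-first walk that accumulates path lengths from src.
        dist = {src: 0.0}
        stack = [src]
        while stack:
            node = stack.pop()
            for nxt, w in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + w
                    stack.append(nxt)
        return dist

    induced = {}
    for a in leaves:
        dist = from_source(a)
        for b in leaves:
            induced[(a, b)] = dist[b]
    return induced


def fits(edges, leaves, d, tol=1e-9):
    """True if the tree-induced distances equal the matrix d (a dict of dicts)."""
    induced = tree_distances(edges, leaves)
    return all(abs(induced[(a, b)] - d[a][b]) <= tol
               for a in leaves for b in leaves)


# Hypothetical tree: leaves a, b hang off internal node x; c, d hang off y.
edges = [("a", "x", 1), ("b", "x", 2), ("x", "y", 3), ("c", "y", 1), ("d", "y", 4)]
leaves = ["a", "b", "c", "d"]
d = {
    "a": {"a": 0, "b": 3, "c": 5, "d": 8},
    "b": {"a": 3, "b": 0, "c": 6, "d": 9},
    "c": {"a": 5, "b": 6, "c": 0, "d": 5},
    "d": {"a": 8, "b": 9, "c": 5, "d": 0},
}
print(fits(edges, leaves, d))
```

Because this particular tree reproduces every entry of `d`, the matrix in the example is additive in the sense defined above.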
5 A reconstruction method Least squares methods: Given a tree topology T, select edge lengths such that $Q_T = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (D_{ij} - d_{ij})^2$ is minimized, where $d_{ij}$ is given by the matrix, $D_{ij}$ is the distance induced by the tree and the selected edge lengths, and $w_{ij}$ is the confidence we have in $d_{ij}$, e.g. $w_{ij} = 1$. Optimal edge lengths can be found by standard least squares methods in time O(n^3). The larger version of the problem, where the tree is unknown, is NP-complete...
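As a small worked instance of the objective, the sketch below evaluates Q_T for given matrices. It only computes the error of one candidate tree; it does not search for optimal edge lengths. The function name and the 3 x 3 example values are hypothetical.

```python
def least_squares_error(D, d, w=None):
    """Q_T = sum over i, j of w[i][j] * (D[i][j] - d[i][j])**2.

    D holds the tree-induced distances, d the observed distance matrix.
    With w=None every weight is 1, as in the slide's example w_ij = 1.
    """
    n = len(d)
    total = 0.0
    for i in range(n):
        for j in range(n):
            weight = 1.0 if w is None else w[i][j]
            total += weight * (D[i][j] - d[i][j]) ** 2
    return total


# Hypothetical observed matrix and tree-induced distances.
d_observed = [[0, 3, 5], [3, 0, 6], [5, 6, 0]]
D_tree = [[0, 3, 5], [3, 0, 7], [5, 7, 0]]
print(least_squares_error(D_tree, d_observed))  # 2.0: one pair is off by 1, counted twice
```

A value of 0 means the tree fits the matrix exactly, which is the additive case.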
6 Counting rooted binary trees
7 Counting rooted binary trees
8 Counting rooted binary trees
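The content of these three slides did not survive transcription. The standard result is that the number of rooted binary trees with n labelled leaves is (2n-3)!! = 1 x 3 x 5 x ... x (2n-3): a tree with k-1 leaves has 2k-4 edges, and leaf k can be attached on any of them or above the root, giving 2k-3 choices. A minimal sketch (the function name is an assumption):

```python
def count_rooted_binary_trees(n):
    """Number of rooted binary tree topologies with n labelled leaves."""
    if n < 1:
        raise ValueError("n must be at least 1")
    count = 1
    for k in range(2, n + 1):
        # Leaf k can be attached in 2k - 3 places of a tree with k - 1 leaves.
        count *= 2 * k - 3
    return count


for n in (2, 3, 4, 5, 10):
    print(n, count_rooted_binary_trees(n))
```

The count grows super-exponentially, which is why exhaustive search over topologies is infeasible for all but very small n.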
9 Hierarchical clustering and UPGMA
10 Hierarchical clustering Input: A set of n species (1,2,...,n) and a distance matrix d giving the pairwise distances. Output: A rooted binary tree T with edge lengths and leaves labelled 1, 2,..., n such that similar species (according to the distance matrix) are grouped in the same subtree, and the tree distances correspond to the distance matrix.
11 Example
12 Example Group the two closest data points.
13 Example In each iteration, we join the two closest clusters.
14 Example
15 Example
16 Hierarchical clustering
Input: A set of n species (1, 2, ..., n) and a distance matrix d giving the pairwise distances.
Output: A rooted binary tree T with edge lengths and leaves labelled 1, 2, ..., n such that similar species (according to the distance matrix) are grouped in the same subtree, and the tree distances correspond to the distance matrix.
Algorithm:
Step 1: Let the n species 1, 2, ..., n be n clusters C1, ..., Cn of size 1. Let each cluster correspond to a leaf in a tree T. The distance D(Ci,Cj) between clusters Ci and Cj is d(i,j).
Step 2: Pick the pair of clusters Ci and Cj for which D(Ci,Cj) is minimized and form a new cluster Ck by merging Ci and Cj, i.e. Ck = Ci ∪ Cj. Make a new internal node Ck in tree T with children corresponding to Ci and Cj. Place internal node Ck at height D(Ci,Cj) / 2.
Step 3: Update the distance matrix D by computing the distance between the new cluster Ck and each remaining cluster.
Step 4: Repeat Steps 2 and 3 until only one cluster remains.
17 Hierarchical clustering (Same algorithm as on slide 16.) How do we define and compute the distances between clusters?
18 Distance measures between clusters
19 Distance measures between clusters
20 Distance measures between clusters
21 Distance measures between clusters
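The content of these four slides did not survive transcription. Three commonly used measures are single linkage (minimum pairwise distance), complete linkage (maximum), and average linkage (mean); whether the slides covered exactly these is an assumption. A minimal sketch with hypothetical function names, where clusters are lists of indices into the matrix d:

```python
def single_linkage(A, B, d):
    """Smallest distance between a member of A and a member of B."""
    return min(d[a][b] for a in A for b in B)


def complete_linkage(A, B, d):
    """Largest distance between a member of A and a member of B."""
    return max(d[a][b] for a in A for b in B)


def average_linkage(A, B, d):
    """Mean of all pairwise distances between A and B (the measure UPGMA uses)."""
    return sum(d[a][b] for a in A for b in B) / (len(A) * len(B))


# Hypothetical 4 x 4 distance matrix and two clusters.
d = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]
A, B = [0, 1], [2, 3]
print(single_linkage(A, B, d), complete_linkage(A, B, d), average_linkage(A, B, d))
# 5 10 7.5
```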
22 Hierarchical clustering (UPGMA) Input, output, and Steps 1, 2, and 4 are as on slide 16. Step 3: Update the distance matrix D such that the distance between the new cluster Ck and each remaining cluster Cl is computed as $D(C_k, C_l) = \frac{|C_i| \, D(C_i, C_l) + |C_j| \, D(C_j, C_l)}{|C_i| + |C_j|}$, i.e. the average of the pairwise distances between the members of Ck and the members of Cl. Running time? Space consumption?
23 Hierarchical clustering (UPGMA) Same algorithm as on slide 22. Running time: O(n^3). Space consumption: O(n^2).
24 An implementation of UPGMA in Python
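The code on this slide did not survive transcription. The following is a minimal sketch of Steps 1 to 4 with the naive O(n^2) minimum search in Step 2, so it runs in O(n^3) overall. It is not the author's implementation; the function name, the nested-tuple tree representation `(left, right, height)`, and the example matrix are assumptions.

```python
def upgma(d, names):
    """Build a UPGMA tree from distance matrix d (list of lists).

    Leaves are names; internal nodes are (left, right, height) tuples.
    """
    n = len(names)
    # Step 1: one cluster per species; cluster id -> (subtree, size).
    clusters = {i: (names[i], 1) for i in range(n)}
    D = {i: {j: float(d[i][j]) for j in range(n) if j != i} for i in range(n)}
    next_id = n
    while len(clusters) > 1:
        # Step 2: scan all pairs for the minimum distance.
        i, j = min(((a, b) for a in D for b in D[a] if a < b),
                   key=lambda p: D[p[0]][p[1]])
        height = D[i][j] / 2
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        # Step 3: size-weighted average distance to every remaining cluster.
        k = next_id
        next_id += 1
        D[k] = {}
        for l in clusters:
            dist = (size_i * D[i][l] + size_j * D[j][l]) / (size_i + size_j)
            D[k][l] = D[l][k] = dist
            del D[l][i], D[l][j]
        del D[i], D[j]
        clusters[k] = ((tree_i, tree_j, height), size_i + size_j)
    # Step 4 is the loop itself; one cluster remains.
    tree, _ = next(iter(clusters.values()))
    return tree


# Hypothetical clocklike example: (a,b) join at height 1, (c,d) at 2, root at 3.
names = ["a", "b", "c", "d"]
d = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]
print(upgma(d, names))
```

Ties in Step 2 are broken by scan order here, which is one of several reasonable choices.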
25 Properties of a UPGMA tree
26 Trees reflecting a molecular clock A clocklike tree is a rooted tree in which the total edge length from the root to any leaf is equal, i.e. there is a molecular clock that ticks at a constant pace (the mutation rate is identical for all species), and all the observed species are at an equal number of ticks from the root... A clocklike tree induces an ultrametric distance matrix...
27 Ultrametric matrices and trees Formally: A matrix d is ultrametric iff it is a metric and any three species x, y, z satisfy d(x,y) ≤ max{d(x,z), d(y,z)}, i.e. there is a tie for the maximum of d(x,y), d(x,z), and d(y,z). Theorem (Gusfield): A symmetric matrix D has an ultrametric tree iff D is an ultrametric matrix. Theorem (Gusfield): If D is ultrametric, then the unique ultrametric tree T for D can be constructed in time O(n^2).
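The three-point condition can be checked directly. The sketch below is a brute-force O(n^3) test, not the O(n^2) tree construction from the theorem; the function name and example matrices are assumptions. It checks symmetry, a zero diagonal, and non-negativity explicitly; the triangle inequality then follows from the three-point condition.

```python
def is_ultrametric(d, tol=1e-9):
    """Check d(x,y) <= max(d(x,z), d(y,z)) for all x, y, z."""
    n = len(d)
    # Basic metric properties: zero diagonal, symmetry, non-negative entries.
    for x in range(n):
        if abs(d[x][x]) > tol:
            return False
        for y in range(n):
            if d[x][y] < -tol or abs(d[x][y] - d[y][x]) > tol:
                return False
    # Three-point condition over all triples.
    for x in range(n):
        for y in range(n):
            for z in range(n):
                if d[x][y] > max(d[x][z], d[y][z]) + tol:
                    return False
    return True


# Hypothetical matrices: the first comes from a clocklike tree, the second does not.
clocklike = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]
not_clocklike = [
    [0, 3, 5],
    [3, 0, 6],
    [5, 6, 0],
]
print(is_ultrametric(clocklike), is_ultrametric(not_clocklike))  # True False
```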
28 Two examples Ultrametric doesn't imply molecular clock. Molecular clock implies ultrametric. [Figures: two example trees on the species a, b, c with their distance matrices.]
29 Improving the running time of UPGMA The UPGMA algorithm from slide 22 has running time O(n^3) and space consumption O(n^2).
31 Improving the running time of UPGMA Step 2 takes O(n^2) time per iteration. This can be improved to O(n), which yields a total running time of O(n^2).
32 Storing matrix D in a quad tree Building the quad tree for an n x n matrix takes time O(n^2). Once it is built, we can pick the minimum in constant time O(1). Not surprising.
33 Storing matrix D in a quad tree Question: How fast can we rebuild the quad-tree to reflect the modified distance matrix from Step 3?
34 Updating the quad-tree In Step 3, we only change two rows/columns of D.
37 An implementation of a Quad Tree in Python
38 An implementation of a Quad Tree in Python
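The code on these slides did not survive transcription. As a stand-in, here is a minimal sketch of one way to realize the idea: a pyramid of levels in which each cell stores the minimum of a 2 x 2 block of the level below. The class name, method names, and the choice to pad with infinity and ignore the diagonal are assumptions, not the author's implementation.

```python
import math


class MinQuadTree:
    """Quad tree of minima over a square matrix, padded to a power of two."""

    def __init__(self, matrix):
        self.n = len(matrix)
        size = 1
        while size < self.n:
            size *= 2
        base = [[math.inf] * size for _ in range(size)]
        for i in range(self.n):
            for j in range(self.n):
                if i != j:  # ignore the diagonal (distance to itself)
                    base[i][j] = matrix[i][j]
        self.levels = [base]
        while size > 1:  # build upwards; O(n^2) cells in total
            size //= 2
            below = self.levels[-1]
            self.levels.append([[self._min4(below, r, c) for c in range(size)]
                                for r in range(size)])

    @staticmethod
    def _min4(below, r, c):
        return min(below[2 * r][2 * c], below[2 * r][2 * c + 1],
                   below[2 * r + 1][2 * c], below[2 * r + 1][2 * c + 1])

    def minimum(self):
        """O(1): the root holds the overall minimum."""
        return self.levels[-1][0][0]

    def argmin(self):
        """O(log n): follow a child holding the minimum down to the base."""
        value, r, c = self.minimum(), 0, 0
        for below in reversed(self.levels[:-1]):
            r, c = next((rr, cc) for rr in (2 * r, 2 * r + 1)
                        for cc in (2 * c, 2 * c + 1) if below[rr][cc] == value)
        return r, c

    def set_row_and_column(self, i, values):
        """O(n): overwrite row/column i, then repair one row and column per level."""
        base = self.levels[0]
        for j in range(self.n):
            if j != i:
                base[i][j] = base[j][i] = values[j]
        idx = i
        for k in range(1, len(self.levels)):
            below, level = self.levels[k - 1], self.levels[k]
            idx //= 2
            for t in range(len(level)):
                level[idx][t] = self._min4(below, idx, t)
                level[t][idx] = self._min4(below, t, idx)


d = [[0, 2, 6, 10], [2, 0, 5, 9], [6, 5, 0, 4], [10, 9, 4, 0]]
qt = MinQuadTree(d)
print(qt.minimum(), qt.argmin())
qt.set_row_and_column(0, [math.inf] * 4)  # e.g. retire cluster 0 after a merge
print(qt.minimum(), qt.argmin())
```

Since level k has half as many rows as level k-1, repairing one row and one column per level costs n + n/2 + n/4 + ... = O(n) in total, which matches the observation that Step 3 only changes two rows/columns.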
39 Improving the running time of UPGMA With D stored in a quad-tree, the UPGMA algorithm from slide 22 has running time O(n^2) and space consumption O(n^2).
40 Improving the running time of UPGMA Cost per step: Step 1: O(n). Step 2: O(1) using a quad-tree, O(n^2) without a quad-tree. Step 3: O(n) for computing the new distances and rebuilding the quad-tree. Step 4: Repeat Steps 2 and 3 until only one cluster remains. Running time O(n^2) using a quad-tree to store D. Space consumption O(n^2).
41 Does it matter in practice? [Plots: the running times T_basic(n) and T_quad(n), the ratio T_basic(n) / T_quad(n), and the normalized times T_quad(n) / n^2 and T_basic(n) / n^3.]
42 Neighbor Joining
43 Neighbor Joining Input: A set of n species (1, 2, ..., n) and a distance matrix d giving the pairwise distances. Output: An unrooted binary tree T with edge lengths and leaves labelled 1, 2, ..., n such that similar species (according to the distance matrix) are grouped in the same subtree, and the tree distances correspond to the distance matrix. A popular distance based phylogenetic inference method with good accuracy. Introduced by Saitou and Nei and later reformulated by Studier and Keppler. Criticized by many but still widely used. Often used as a benchmark for other methods.
44 Neighbor Joining Let us assume that the distance matrix is additive, i.e. that it corresponds to the tree distances of some tree with unknown topology and edge lengths. Then this is the tree we want to infer. If we can infer from the distance matrix which leaves (and subtrees) are neighbors in the unknown (true) tree, we can reconstruct the tree iteratively...
45 Neighbor Joining
46 Neighbor Joining Not easy to infer neighboring leaves!!
47 Neighbor Joining Here D(i,j) = D(k,l) = D(j,l) = 13 and D(j,k) = 12, so the leaves with minimum distance in the distance matrix are not necessarily neighbors in the tree, i.e. UPGMA will not reconstruct the tree correctly. Not easy to infer neighboring leaves!!
48 Neighbor Joining
49 Neighbor Joining
50 Neighbor Joining Input and output as on slide 43. Algorithm: [The steps were shown as images on slides 50 to 53 and are not part of this transcription.]
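Since the algorithm steps themselves are missing from the transcription, the following is a minimal sketch of the standard textbook formulation of neighbor joining, using the selection criterion Q(i,j) = (n-2) d(i,j) - r_i - r_j, where r_i is the sum of distances from i to all other nodes. That the slides used exactly this formulation is an assumption; the function name, the edge-list output, and the internal node labels `u0`, `u1`, ... are hypothetical (and assumed not to clash with species names). At least three species are required.

```python
def neighbor_joining(d, names):
    """Return the unrooted NJ tree as a list of (node, node, length) edges."""
    nodes = list(names)
    D = {a: {b: float(d[i][j]) for j, b in enumerate(names) if b != a}
         for i, a in enumerate(names)}
    edges = []
    next_internal = 0
    while len(nodes) > 3:
        n = len(nodes)
        r = {a: sum(D[a].values()) for a in nodes}
        # Select the pair (i, j) minimizing Q(i,j) = (n-2) d(i,j) - r_i - r_j.
        i, j = min(((a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:]),
                   key=lambda p: (n - 2) * D[p[0]][p[1]] - r[p[0]] - r[p[1]])
        u = f"u{next_internal}"
        next_internal += 1
        # Edge lengths from i and j to the new internal node u.
        length_i = D[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
        length_j = D[i][j] - length_i
        edges += [(i, u, length_i), (j, u, length_j)]
        # Distances from u to every remaining node k.
        D[u] = {}
        for k in nodes:
            if k not in (i, j):
                D[u][k] = D[k][u] = (D[i][k] + D[j][k] - D[i][j]) / 2
                del D[k][i], D[k][j]
        del D[i], D[j]
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    # Termination: connect the last three nodes to one central node.
    a, b, c = nodes
    u = f"u{next_internal}"
    edges.append((a, u, (D[a][b] + D[a][c] - D[b][c]) / 2))
    edges.append((b, u, (D[a][b] + D[b][c] - D[a][c]) / 2))
    edges.append((c, u, (D[a][c] + D[b][c] - D[a][b]) / 2))
    return edges


# Hypothetical additive 5 x 5 matrix.
names = ["a", "b", "c", "d", "e"]
d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
for edge in neighbor_joining(d, names):
    print(edge)
```

For an additive input like this one, the path lengths in the returned tree reproduce the matrix entries. The selection step scans all pairs in every iteration, so the sketch runs in O(n^3) time and O(n^2) space.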
54 Running time and space consumption
55 Running time and space consumption Can we improve the running time of Step 1 by using a quad-tree? Why? Why not?
56 Properties of Neighbor Joining
57 How 'deterministic' is NJ? An experiment To see how much this matters in practice, I made a small experiment. I took 37 distance matrices based on Pfam sequences, with between 100 and 200 taxa. [...] For each matrix I constructed ten equivalent distance matrices by randomly permuting the order of taxa. Then I used these ten matrices to construct ten neighbour-joining trees and finally computed the Robinson-Foulds distance between all pairs of these trees. [...] As you can see, you can get quite different trees out of equivalent distance matrices, and on realistic data too. I didn't check if the differences between the trees are correlated with how much support the splits have in the data. Looking at bootstrap values for the trees might tell us something about that. Anyway, that is not the take-home message I want to give here. That is that it doesn't necessarily make sense to talk about the neighbour-joining tree for a distance matrix. Neighbour-joining is less deterministic than you might think!
58 How 'deterministic' is NJ?
59 RapidNJ RapidNJ is an implementation of NJ which uses heuristics to avoid searching large parts of the D matrix. No guarantees are given on how much of the matrix is avoided. RapidNJ was compared with three fast Neighbour-Joining implementations. QuickTree: a fast implementation of the canonical Neighbour-Joining method. QuickJoin: uses an advanced data structure to reduce the search space. Clearcut: relaxed Neighbour-Joining. R8s was used to simulate phylogenetic trees from which distance matrices were constructed. Alignments from Pfam were used to infer distance matrices using QuickTree. Also work on how to use external memory to allow construction of large trees (> leaves).
60 Results on Simulated Data
61 Results on small Pfam Data
62 Results on large Pfam Data
63 Running times on Pfam data
More informationABOUT THE LARGEST SUBTREE COMMON TO SEVERAL PHYLOGENETIC TREES Alain Guénoche 1, Henri Garreta 2 and Laurent Tichit 3
The XIII International Conference Applied Stochastic Models and Data Analysis (ASMDA-2009) June 30-July 3, 2009, Vilnius, LITHUANIA ISBN 978-9955-28-463-5 L. Sakalauskas, C. Skiadas and E. K. Zavadskas
More informationCLC Phylogeny Module User manual
CLC Phylogeny Module User manual User manual for Phylogeny Module 1.0 Windows, Mac OS X and Linux September 13, 2013 This software is for research purposes only. CLC bio Silkeborgvej 2 Prismet DK-8000
More informationCS 581. Tandy Warnow
CS 581 Tandy Warnow This week Maximum parsimony: solving it on small datasets Maximum Likelihood optimization problem Felsenstein s pruning algorithm Bayesian MCMC methods Research opportunities Maximum
More informationM-ary Search Tree. B-Trees. B-Trees. Solution: B-Trees. B-Tree: Example. B-Tree Properties. Maximum branching factor of M Complete tree has height =
M-ary Search Tree B-Trees Section 4.7 in Weiss Maximum branching factor of M Complete tree has height = # disk accesses for find: Runtime of find: 2 Solution: B-Trees specialized M-ary search trees Each
More informationExercises: Disjoint Sets/ Union-find [updated Feb. 3]
Exercises: Disjoint Sets/ Union-find [updated Feb. 3] Questions 1) Suppose you have an implementation of union that is by-size and an implementation of find that does not use path compression. Give the
More informationClustering Techniques
Clustering Techniques Bioinformatics: Issues and Algorithms CSE 308-408 Fall 2007 Lecture 16 Lopresti Fall 2007 Lecture 16-1 - Administrative notes Your final project / paper proposal is due on Friday,
More informationIntroduction to Triangulated Graphs. Tandy Warnow
Introduction to Triangulated Graphs Tandy Warnow Topics for today Triangulated graphs: theorems and algorithms (Chapters 11.3 and 11.9) Examples of triangulated graphs in phylogeny estimation (Chapters
More informationCluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1
Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods
More informationSPR-BASED TREE RECONCILIATION: NON-BINARY TREES AND MULTIPLE SOLUTIONS
1 SPR-BASED TREE RECONCILIATION: NON-BINARY TREES AND MULTIPLE SOLUTIONS C. THAN and L. NAKHLEH Department of Computer Science Rice University 6100 Main Street, MS 132 Houston, TX 77005, USA Email: {cvthan,nakhleh}@cs.rice.edu
More informationSeeing the wood for the trees: Analysing multiple alternative phylogenies
Seeing the wood for the trees: Analysing multiple alternative phylogenies Tom M. W. Nye, Newcastle University tom.nye@ncl.ac.uk Isaac Newton Institute, 17 December 2007 Multiple alternative phylogenies
More informationM-ary Search Tree. B-Trees. Solution: B-Trees. B-Tree: Example. B-Tree Properties. B-Trees (4.7 in Weiss)
M-ary Search Tree B-Trees (4.7 in Weiss) Maximum branching factor of M Tree with N values has height = # disk accesses for find: Runtime of find: 1/21/2011 1 1/21/2011 2 Solution: B-Trees specialized M-ary
More informationLecture 5 Finding meaningful clusters in data. 5.1 Kleinberg s axiomatic framework for clustering
CSE 291: Unsupervised learning Spring 2008 Lecture 5 Finding meaningful clusters in data So far we ve been in the vector quantization mindset, where we want to approximate a data set by a small number
More informationComparing Implementations of Optimal Binary Search Trees
Introduction Comparing Implementations of Optimal Binary Search Trees Corianna Jacoby and Alex King Tufts University May 2017 In this paper we sought to put together a practical comparison of the optimality
More informationGeneralized Neighbor-Joining: More Reliable Phylogenetic Tree Reconstruction
Generalized Neighbor-Joining: More Reliable Phylogenetic Tree Reconstruction William R. Pearson, Gabriel Robins,* and Tongtong Zhang* *Department of Computer Science and Department of Biochemistry, University
More informationLecture 5. Treaps Find, insert, delete, split, and join in treaps Randomized search trees Randomized search tree time costs
Lecture 5 Treaps Find, insert, delete, split, and join in treaps Randomized search trees Randomized search tree time costs Reading: Randomized Search Trees by Aragon & Seidel, Algorithmica 1996, http://sims.berkeley.edu/~aragon/pubs/rst96.pdf;
More informationCISC 636 Computational Biology & Bioinformatics (Fall 2016) Phylogenetic Trees (I)
CISC 636 Computational iology & ioinformatics (Fall 2016) Phylogenetic Trees (I) Maximum Parsimony CISC636, F16, Lec13, Liao 1 Evolution Mutation, selection, Only the Fittest Survive. Speciation. t one
More informationClustering analysis of gene expression data
Clustering analysis of gene expression data Chapter 11 in Jonathan Pevsner, Bioinformatics and Functional Genomics, 3 rd edition (Chapter 9 in 2 nd edition) Human T cell expression data The matrix contains
More informationComputational Biology Lecture 6: Affine gap penalty function, multiple sequence alignment Saad Mneimneh
Computational Biology Lecture 6: Affine gap penalty function, multiple sequence alignment Saad Mneimneh We saw earlier how we can use a concave gap penalty function γ, i.e. one that satisfies γ(x+1) γ(x)
More informationA Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony
A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony Jean-Michel Richer 1 and Adrien Goëffon 2 and Jin-Kao Hao 1 1 University of Angers, LERIA, 2 Bd Lavoisier, 49045 Anger Cedex 01,
More informationGenetics/MBT 541 Spring, 2002 Lecture 1 Joe Felsenstein Department of Genome Sciences Phylogeny methods, part 1 (Parsimony and such)
Genetics/MBT 541 Spring, 2002 Lecture 1 Joe Felsenstein Department of Genome Sciences joe@gs Phylogeny methods, part 1 (Parsimony and such) Methods of reconstructing phylogenies (evolutionary trees) Parsimony
More informationSYDE Winter 2011 Introduction to Pattern Recognition. Clustering
SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned
More informationPhylogenetics on CUDA (Parallel) Architectures Bradly Alicea
Descent w/modification Descent w/modification Descent w/modification Descent w/modification CPU Descent w/modification Descent w/modification Phylogenetics on CUDA (Parallel) Architectures Bradly Alicea
More informationPrior Distributions on Phylogenetic Trees
Prior Distributions on Phylogenetic Trees Magnus Johansson Masteruppsats i matematisk statistik Master Thesis in Mathematical Statistics Masteruppsats 2011:4 Matematisk statistik Juni 2011 www.math.su.se
More informationClustering Algorithms. Margareta Ackerman
Clustering Algorithms Margareta Ackerman A sea of algorithms As we discussed last class, there are MANY clustering algorithms, and new ones are proposed all the time. They are very different from each
More informationLesson 3. Prof. Enza Messina
Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical
More informationHard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering
An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other
More informationTree Models of Similarity and Association. Clustering and Classification Lecture 5
Tree Models of Similarity and Association Clustering and Lecture 5 Today s Class Tree models. Hierarchical clustering methods. Fun with ultrametrics. 2 Preliminaries Today s lecture is based on the monograph
More informationBackground: disk access vs. main memory access (1/2)
4.4 B-trees Disk access vs. main memory access: background B-tree concept Node structure Structural properties Insertion operation Deletion operation Running time 66 Background: disk access vs. main memory
More informationIntroduction to Computational Phylogenetics
Introduction to Computational Phylogenetics Tandy Warnow The University of Texas at Austin No Institute Given This textbook is a draft, and should not be distributed. Much of what is in this textbook appeared
More information46 Grundlagen der Bioinformatik, SoSe 11, D. Huson, May 16, 2011
46 Grundlagen der Bioinformatik, SoSe 11, D. Huson, May 16, 011 Phylogeny Further reading and sources for parts of this hapter: R. Durbin, S. Eddy,. Krogh & G. Mitchison, Biological sequence analysis,
More informationPHYLOGENETIC RECONSTRUCTION methods attempt to find the evolutionary history of a given set of extant
JOURNAL OF COMPUTATIONAL BIOLOGY Volume 14, Number 1, 2007 c Mary Ann Liebert, Inc. Pp. 1 15 DOI: 10.1089/cmb.2006.0115 Neighbor Joining Algorithms for Inferring Phylogenies via LCA Distances ILAN GRONAU
More informationEvolution of Tandemly Repeated Sequences
University of Canterbury Department of Mathematics and Statistics Evolution of Tandemly Repeated Sequences A thesis submitted in partial fulfilment of the requirements of the Degree for Master of Science
More informationClustering, cont. Genome 373 Genomic Informatics Elhanan Borenstein. Some slides adapted from Jacques van Helden
Clustering, cont Genome 373 Genomic Informatics Elhanan Borenstein Some slides adapted from Jacques van Helden Improving the search heuristic: Multiple starting points Simulated annealing Genetic algorithms
More informationHierarchical Clustering Lecture 9
Hierarchical Clustering Lecture 9 Marina Santini Acknowledgements Slides borrowed and adapted from: Data Mining by I. H. Witten, E. Frank and M. A. Hall 1 Lecture 9: Required Reading Witten et al. (2011:
More informationRelaxed Neighbor Joining: A Fast Distance-Based Phylogenetic Tree Construction Method
J Mol Evol (2006) 62:785 792 DOI: 10.1007/s00239-005-0176-2 Relaxed Neighbor Joining: A Fast Distance-Based Phylogenetic Tree Construction Method Jason Evans, Luke Sheneman, James Foster Department of
More informationTreaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19
CSE34T/CSE549T /05/04 Lecture 9 Treaps Binary Search Trees (BSTs) Search trees are tree-based data structures that can be used to store and search for items that satisfy a total order. There are many types
More informationMachine learning - HT Clustering
Machine learning - HT 2016 10. Clustering Varun Kanade University of Oxford March 4, 2016 Announcements Practical Next Week - No submission Final Exam: Pick up on Monday Material covered next week is not
More informationBinary Trees
Binary Trees 4-7-2005 Opening Discussion What did we talk about last class? Do you have any code to show? Do you have any questions about the assignment? What is a Tree? You are all familiar with what
More informationMarkovian Models of Genetic Inheritance
Markovian Models of Genetic Inheritance Elchanan Mossel, U.C. Berkeley mossel@stat.berkeley.edu, http://www.cs.berkeley.edu/~mossel/ 6/18/12 1 General plan Define a number of Markovian Inheritance Models
More informationCS521 \ Notes for the Final Exam
CS521 \ Notes for final exam 1 Ariel Stolerman Asymptotic Notations: CS521 \ Notes for the Final Exam Notation Definition Limit Big-O ( ) Small-o ( ) Big- ( ) Small- ( ) Big- ( ) Notes: ( ) ( ) ( ) ( )
More informationThe History Bound and ILP
The History Bound and ILP Julia Matsieva and Dan Gusfield UC Davis March 15, 2017 Bad News for Tree Huggers More Bad News Far more convincingly even than the (also highly convincing) fossil evidence, the
More informationImproved parameterized complexity of the Maximum Agreement Subtree and Maximum Compatible Tree problems LIRMM, Tech.Rep. num 04026
Improved parameterized complexity of the Maximum Agreement Subtree and Maximum Compatible Tree problems LIRMM, Tech.Rep. num 04026 Vincent Berry, François Nicolas Équipe Méthodes et Algorithmes pour la
More informationCS 441 Discrete Mathematics for CS Lecture 26. Graphs. CS 441 Discrete mathematics for CS. Final exam
CS 441 Discrete Mathematics for CS Lecture 26 Graphs Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Final exam Saturday, April 26, 2014 at 10:00-11:50am The same classroom as lectures The exam
More informationIndexing in Search Engines based on Pipelining Architecture using Single Link HAC
Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Anuradha Tyagi S. V. Subharti University Haridwar Bypass Road NH-58, Meerut, India ABSTRACT Search on the web is a daily
More informationLinked Structures Songs, Games, Movies Part III. Fall 2013 Carola Wenk
Linked Structures Songs, Games, Movies Part III Fall 2013 Carola Wenk Biological Structures Nature has evolved vascular and nervous systems in a hierarchical manner so that nutrients and signals can quickly
More informationStats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms
Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms Padhraic Smyth Department of Computer Science Bren School of Information and Computer Sciences University of California,
More informationMidterm Examination CS540-2: Introduction to Artificial Intelligence
Midterm Examination CS540-2: Introduction to Artificial Intelligence March 15, 2018 LAST NAME: FIRST NAME: Problem Score Max Score 1 12 2 13 3 9 4 11 5 8 6 13 7 9 8 16 9 9 Total 100 Question 1. [12] Search
More informationTrees. Q: Why study trees? A: Many advance ADTs are implemented using tree-based data structures.
Trees Q: Why study trees? : Many advance DTs are implemented using tree-based data structures. Recursive Definition of (Rooted) Tree: Let T be a set with n 0 elements. (i) If n = 0, T is an empty tree,
More information