Recent Research Results. Evolutionary Trees Distance Methods


Indo-European Languages After Tandy Warnow

What is the purpose? Understand evolutionary history (relationships between species). Understand how various functions evolved. Understand forces and constraints on evolution. To do multiple alignment.

Multiple Alignment to Evolutionary Tree
Input sequences:
Cat:  GCGAGTAGAGCGTA
Dog:  CCGGTCGACGAA
Frog: CAAGTCTCACGGAT
Wolf: CCTGCCGACGA
Bird: AGTCGCACGGTT
Aligned sequences:
Cat:  GCGAGTAG-AGCGTA-
Dog:  -CCGGTCG-A-CGAA-
Frog: -CAAGTCTCA-CGGAT
Wolf: -CCTGCCG-A-CG-A-
Bird: ---AGTCGCA-CGGTT
Steps: align, compute distances, construct tree.

Evolutionary Tree to Multiple Alignment. Align along the edges of the tree using pairwise alignment.

Solution Methods

Some Complications. Convergence or parallel evolution, e.g. the presence of wings in both birds and bats. Assume that there is no convergence; this can often be achieved by excluding the characters that cause problems. Reversals, e.g. the loss of legs in snakes. Assume that there are no reversals; this can often be achieved by excluding the characters that cause problems.

General Methods Distance-based methods. Maximum parsimony method. Maximum likelihood methods. Consensus methods.

Distance-Based Methods. Distances between all pairs of species are determined. If species are represented by DNA or protein sequences, the distances may depend on the number of substitutions, insertions and deletions needed to get one sequence from another. A tree is computed from the resulting distance matrix such that the distances in the tree fit the distances in the matrix as well as possible.

Maximum Parsimony Methods. Principle of Parsimony: when given the choice between two explanations, one simple and one complex, choose the simple one. Character-based methods; characters are, e.g., "number of legs", eukaryote vs. prokaryote organisms, bases in aligned DNA or amino acids in protein sequences. The problem is to find a tree that requires the minimum number of changes.
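
One very simple way to turn an alignment into pairwise distances is to count the alignment columns in which two rows differ. The sketch below does this for the five aligned sequences of the cat/dog/frog/wolf/bird example. The function name `column_mismatches` and the choice to count a gap against a base as one difference are assumptions made for illustration; the slides do not fix a particular distance measure.

```python
from itertools import combinations

# Aligned rows copied from the example alignment (all have the same length).
ALIGNED = {
    "Cat":  "GCGAGTAG-AGCGTA-",
    "Dog":  "-CCGGTCG-A-CGAA-",
    "Frog": "-CAAGTCTCA-CGGAT",
    "Wolf": "-CCTGCCG-A-CG-A-",
    "Bird": "---AGTCGCA-CGGTT",
}


def column_mismatches(row_a: str, row_b: str) -> int:
    """Count alignment columns in which the two rows differ.

    Assumption: a gap opposite a base counts as one difference;
    columns where both rows have a gap are equal and do not count.
    """
    if len(row_a) != len(row_b):
        raise ValueError("rows must come from the same alignment")
    return sum(1 for x, y in zip(row_a, row_b) if x != y)


if __name__ == "__main__":
    for a, b in combinations(ALIGNED, 2):
        print(f"{a:>4} - {b:<4} {column_mismatches(ALIGNED[a], ALIGNED[b])}")
```

Raw mismatch counts underestimate the true number of changes when sequences are distant, so in practice a correction based on an evolutionary model is usually applied before tree construction.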

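Maximum Likelihood Methods. Substitution matrix (row: current base, column: base after the change):
\[
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & 1 - u(a p_c + b p_g + c p_t) & u a p_c & u b p_g & u c p_t \\
C & u d p_a & 1 - u(d p_a + e p_g + f p_t) & u e p_g & u f p_t \\
G & u g p_a & u h p_c & 1 - u(g p_a + h p_c + i p_t) & u i p_t \\
T & u j p_a & u k p_c & u l p_g & 1 - u(j p_a + k p_c + l p_g)
\end{array}
\]
Base frequencies: $p_a, p_c, p_g, p_t$. Mutation rate: $u$. Frequencies of change of any base to any other: $a, b, c, d, e, f, g, h, i, j, k, l$.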

Maximum Likelihood Methods. n species: consider a topology with n leaves. Consider each base position for the given topology. Consider each possible assignment of bases to inner nodes. Compute the product of transition probabilities along all edges. Add up the products over all possible assignments; this is the likelihood of the position. Combine the positions by multiplying their likelihoods (equivalently, by adding up their log-likelihoods). This is the likelihood of the particular topology. Repeat for every topology and select the one with maximum likelihood.

Consensus Methods. A set of leaf-labeled trees (possibly weighted) is used to generate a tree. In some situations the trees have the same leaves but different topologies. For example, it frequently happens that choosing different DNA or protein sequences for the same family of species results in different evolutionary trees. In other situations, each tree spans only a (small) subset of the species. The objective is then to find a tree spanning all species and agreeing in some way with the small trees.

Introduction to Distance Methods. Distance methods reconstruct trees (rooted or unrooted) from a set of pairwise distances between the sequences. Introduced by Cavalli-Sforza and Edwards [1967] and by Fitch and Margoliash [1967]. Influenced by the clustering algorithms of Sokal and Sneath [1963].

Metric Spaces. $M$ is a set. $d: M \times M \to \mathbb{R}$ is a function. $d$ is a distance function iff $d(u,v) > 0$ for all $u, v \in M$, $u \ne v$; $d(u,u) = 0$ for all $u \in M$; $d(u,v) = d(v,u)$ for all $u, v \in M$; $d(u,v) \le d(u,w) + d(w,v)$ for all $u, v, w \in M$. A set $M$ with a distance function $d$ is called a metric space.
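
The four axioms can be checked mechanically on a finite distance matrix. The following is a minimal sketch; the function name `is_metric`, the list-of-lists matrix representation and the tolerance `tol` are illustrative assumptions.

```python
from itertools import permutations


def is_metric(d, tol=1e-9):
    """Check the metric axioms for a square matrix d (list of lists)."""
    n = len(d)
    for u in range(n):
        if abs(d[u][u]) > tol:                 # d(u,u) = 0
            return False
        for v in range(n):
            if u != v and d[u][v] <= 0:        # d(u,v) > 0 for u != v
                return False
            if abs(d[u][v] - d[v][u]) > tol:   # symmetry
                return False
    # Triangle inequality for every ordered triple of distinct elements.
    return all(d[u][v] <= d[u][w] + d[w][v] + tol
               for u, v, w in permutations(range(n), 3))


if __name__ == "__main__":
    print(is_metric([[0, 1, 2], [1, 0, 1], [2, 1, 0]]))  # satisfies all axioms
    print(is_metric([[0, 1, 3], [1, 0, 1], [3, 1, 0]]))  # 3 > 1 + 1
```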

Tree Metrics. $T$ is a tree with edge weights and with the elements of the set $M$ as its leaves. Define $d_T(u,v)$ for all $u, v \in M$ as the length of the unique path from $u$ to $v$ in $T$. It can be shown that $(M, d_T)$ is a metric space provided that the edge weights are strictly positive.

Additive Distance Functions. Given a distance function $d$ on a set $M$, does there exist a tree $T$ with the elements of $M$ as leaves that realizes $d$, i.e. $d_T(u,v) = d(u,v)$ for all $u, v \in M$? If so, $d$ is said to be additive.
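
As an illustration of the definition, the sketch below computes the path lengths between all leaves of a small weighted tree. The edge-list representation, the node names and the function name `tree_distances` are assumptions chosen for the example.

```python
def tree_distances(edges, leaves):
    """Return d[u][v] = length of the unique u-v path for all leaves u, v.

    edges: iterable of (node, node, weight) triples describing a tree.
    """
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))

    def from_source(src):
        # Depth-first traversal; in a tree every node is reached by one path.
        dist = {src: 0.0}
        stack = [src]
        while stack:
            node = stack.pop()
            for nxt, w in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + w
                    stack.append(nxt)
        return dist

    result = {}
    for u in leaves:
        dist = from_source(u)
        result[u] = {v: dist[v] for v in leaves}
    return result


if __name__ == "__main__":
    # Leaves a, b hang off inner node s; leaves c, d hang off inner node t.
    edges = [("a", "s", 1), ("b", "s", 2), ("s", "t", 3),
             ("c", "t", 4), ("d", "t", 5)]
    for leaf, row in tree_distances(edges, "abcd").items():
        print(leaf, row)
```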

Four Points Condition. $(M,d)$ is a metric space. $d$ is additive iff for every set of four different elements $i, j, k, l \in M$, two of the sums $d_{ij} + d_{kl}$, $d_{ik} + d_{jl}$, $d_{il} + d_{jk}$ are the same and greater than or equal to the third sum. This condition is called the four points condition. Necessity: in a tree whose four leaves split as $ij \mid kl$, the sums $d_{ik} + d_{jl}$ and $d_{il} + d_{jk}$ are equal, and each exceeds $d_{ij} + d_{kl}$ by twice the length of the path separating the two pairs. Sufficiency: constructive proof.

Sufficiency. Given a metric space $(M,d)$, is there a weighted tree $T$ with the elements of $M$ as its leaves such that $d_T(u,v) = d(u,v)$ for all $u, v \in M$? Obvious for $|M| = 2$. How about $|M| = 3$?

Sufficiency for m = 3. $x + y = d_{12}$, $x + z = d_{13}$, $y + z = d_{23}$.
$2x + y + z = d_{12} + d_{13} \Rightarrow x = [d_{12} + d_{13} - d_{23}] / 2$
$x + 2y + z = d_{12} + d_{23} \Rightarrow y = [d_{12} + d_{23} - d_{13}] / 2$
$x + y + 2z = d_{13} + d_{23} \Rightarrow z = [d_{13} + d_{23} - d_{12}] / 2$
Note that $x, y, z \ge 0$ by the triangle inequality. They can be 0! The solution is unique.

Sufficiency for m = 4. Let $s$ be the Steiner point of the tree for #1, #2 and #3. Solve for #1, #2, #4: the resulting Steiner point $s_4$ is somewhere on the path from #1 to #2. If $s_4 \ne s$, add an edge from $s_4$ to #4. If $s_4$ coincides with $s$, solve for $s$, #3 and #4. Uniqueness: assume the tree is not unique, so there are two trees with $s_4$ placed in different places. This implies that there are two different trees for #1, #3 and #4, a contradiction.

Sufficiency for m > 4. Assume that we have a unique tree for $k$ species, $k \ge 4$. The process of adding species $k+1$ is similar to what was done for 4 species.
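
The condition translates directly into a test over all four-element subsets. The sketch below is one way to write it; the name `satisfies_four_point`, the matrix representation and the tolerance are assumptions made for the example.

```python
from itertools import combinations


def satisfies_four_point(d, tol=1e-9):
    """Return True iff the symmetric matrix d meets the four points condition."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted((d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]))
        # After sorting, the two largest sums must coincide.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


if __name__ == "__main__":
    # Distances realized by a tree with pendant edges 1, 2, 4, 5
    # and an inner edge of length 3 separating {0, 1} from {2, 3}.
    additive = [[0, 3, 8, 9],
                [3, 0, 9, 10],
                [8, 9, 0, 9],
                [9, 10, 9, 0]]
    print(satisfies_four_point(additive))

    perturbed = [row[:] for row in additive]
    perturbed[0][2] = perturbed[2][0] = 7   # sums become 12, 17, 18
    print(satisfies_four_point(perturbed))
```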
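
The three-leaf case is small enough to compute directly. In this sketch, `three_leaf_tree` is a hypothetical helper name; it returns the lengths x, y, z of the edges joining leaves 1, 2, 3 to the single Steiner point.

```python
def three_leaf_tree(d12, d13, d23):
    """Edge lengths (x, y, z) from leaves 1, 2, 3 to the Steiner point.

    Solves x + y = d12, x + z = d13, y + z = d23.
    """
    x = (d12 + d13 - d23) / 2
    y = (d12 + d23 - d13) / 2
    z = (d13 + d23 - d12) / 2
    return x, y, z


if __name__ == "__main__":
    x, y, z = three_leaf_tree(3, 8, 9)
    print(x, y, z)
    # The solution reproduces the input distances.
    print(x + y, x + z, y + z)
```

When one of the three values is 0, the corresponding leaf lies on the path between the other two.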

Uniqueness. Is the solution unique? The topology is unique. Assume that there are 2 distinct topologies. There must exist 3 leaves x, y, z such that the partitions induced by them in these two topologies are different. Assume that x and a fourth leaf w are in the same partition subset in one topology while they are in different partition subsets in the second topology. This implies that there are 2 different trees realizing the distances for the four species x, y, z, w. A contradiction.

Uniqueness. Is the solution unique? Edge weights are unique. Assume that there are two solutions with the same topology where the edge incident with a leaf v has different lengths. Let s be the Steiner point incident to this edge. This s defines a 3-set partition (one set consisting of v alone). Taking any x and y from the other two sets gives two distinct solutions for the three leaves v, x, y, a contradiction. An interior edge in a fixed topology defines a 4-set partition. Assume that there are two solutions with the same topology but with different lengths of the selected interior edge. Taking any set of 4 leaves, one from each partition set, must give a unique solution, a contradiction.

Ultrametric Distance Functions. If $d$ is ultrametric, a tree $T$ realizing $(M,d)$ has the following property: there is a way to place a root on $T$ so that all leaves (the elements of $M$) are at the same distance from the root. Such a $T$ is referred to as a molecular clock tree. $d$ ultrametric $\Rightarrow$ $d$ additive. $(M,d)$ is ultrametric iff for every set of three elements $i, j, k \in M$, two of the distances $d_{ij}, d_{ik}, d_{jk}$ coincide and are greater than or equal to the third one. Equivalently, $(M,d)$ is ultrametric iff in the corresponding complete weighted graph $G$, the largest-weight edge in any cycle is not unique.

Sandwich Problem. Given two distance functions $d_l$ and $d_u$ on a set $M$ with $d_l(a,b) \le d_u(a,b)$ for all species $a, b$. Does there exist an ultrametric tree $T$ with the elements of $M$ as leaves such that $d_l(a,b) \le d_T(a,b) \le d_u(a,b)$ for all species $a, b$? Lower and upper bounds can be given by two weighted graphs $G_l$ and $G_u$. Edges not present in $G_l$ have weight 0. Edges not present in $G_u$ have weight $\infty$. If such an ultrametric $T$ exists, it can be found in polynomial time. M. Farach, S. Kannan and T. Warnow, A robust model for finding evolutionary trees, Algorithmica 13 (1995) 155-179.

Unweighted Pair Group Method Using Averages - UPGMA. Find the species (groups) i and j with the smallest distance M(i, j). Create a new node (ij) and connect it to i and j by branches so that the new node lies at height M(i, j) / 2 above the leaves. Compute the distance between the new group ij and every other group k (except i and j) by using $M_{ij,k} = \frac{n_i M_{i,k} + n_j M_{j,k}}{n_i + n_j}$, where $n_i$ and $n_j$ are the numbers of species in groups i and j. Delete the columns and rows of the data matrix that correspond to groups i and j, and add a column and row for group ij. If there is only one item left in the data matrix, stop. Otherwise repeat.
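
The three-element characterization gives a simple test for a finite distance matrix. A minimal sketch, with the function name `is_ultrametric` and the tolerance chosen for illustration:

```python
from itertools import combinations


def is_ultrametric(d, tol=1e-9):
    """Three-point test: in every triple the two largest distances coincide."""
    for i, j, k in combinations(range(len(d)), 3):
        a, b, c = sorted((d[i][j], d[i][k], d[j][k]))
        if abs(c - b) > tol:
            return False
    return True


if __name__ == "__main__":
    print(is_ultrametric([[0, 2, 6], [2, 0, 6], [6, 6, 0]]))  # 6 = 6 >= 2
    print(is_ultrametric([[0, 2, 5], [2, 0, 6], [5, 6, 0]]))  # 6 != 5
```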
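
The loop described above can be sketched in a few lines. This is an illustrative implementation rather than a reference one: the function name `upgma`, the nested-tuple tree representation and the returned list of merge heights are assumptions, and ties for the smallest distance are broken arbitrarily.

```python
def upgma(dist, names):
    """Cluster by UPGMA. Returns (tree, heights).

    tree is a nested tuple of names; heights lists, for each merge in order,
    the height of the new node (half the distance between the merged groups).
    """
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}      # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    heights = []
    next_id = n
    while len(clusters) > 1:
        (i, j), dij = min(d.items(), key=lambda item: item[1])
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        # Size-weighted average distance from the new group to every other group.
        new_rows = {}
        for k in clusters:
            d_ik = d[(min(i, k), max(i, k))]
            d_jk = d[(min(j, k), max(j, k))]
            new_rows[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        # Drop the rows and columns of i and j, then add the merged group.
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new_rows)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        heights.append(dij / 2)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree, heights


if __name__ == "__main__":
    matrix = [[0, 2, 6, 10],
              [2, 0, 6, 10],
              [6, 6, 0, 10],
              [10, 10, 10, 0]]
    print(upgma(matrix, ["A", "B", "C", "D"]))
```

The example matrix is ultrametric, so the heights reproduce the input distances exactly; on non-ultrametric data UPGMA still returns a tree, but it need not fit the distances.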

UPGMA - Example

UPGMA Another Example

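Neighbor Joining
Distances $d_{ij}$ for M = 6 species:
       2     3     4     5     6
1      8     3    14    10    12
2            9    10     6     8
3                 15    11    13
4                       10     8
5                              8
$r_i = \frac{1}{M-2} \sum_k d_{ik}$: $r_1 = 11\tfrac{3}{4}$, $r_2 = 10\tfrac{1}{4}$, $r_3 = 12\tfrac{3}{4}$, $r_4 = 14\tfrac{1}{4}$, $r_5 = 11\tfrac{1}{4}$, $r_6 = 12\tfrac{1}{4}$.
$D_{ij} = d_{ij} - r_i - r_j$:
       2      3      4      5      6
1    -14  -21.5    -12    -13    -12
2           -14  -14.5  -15.5  -14.5
3                  -12    -13    -12
4                       -15.5  -18.5
5                              -15.5

Neighbor Joining. Species 1 and 3 (the pair with the smallest $D_{ij}$, $-21.5$) are replaced by a new species 7 at distances:
$d_{17} = \tfrac{1}{2}(d_{13} + r_1 - r_3) = 1$
$d_{37} = \tfrac{1}{2}(d_{13} + r_3 - r_1) = 2$
$d_{27} = \tfrac{1}{2}(d_{12} + d_{23} - d_{13}) = 7$
$d_{47} = \tfrac{1}{2}(d_{14} + d_{34} - d_{13}) = 13$
$d_{57} = \tfrac{1}{2}(d_{15} + d_{35} - d_{13}) = 9$
$d_{67} = \tfrac{1}{2}(d_{16} + d_{36} - d_{13}) = 11$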
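
The arithmetic of this step can be reproduced with a short script. The sketch below performs a single neighbor-joining step on the six-species example; the function name `nj_step` and the 0-based indexing (species 1 is index 0) are assumptions made for the example.

```python
def nj_step(d):
    """One neighbor-joining step on a symmetric distance matrix d.

    Returns the selected pair (i, j), the vector r, the branch lengths from
    i and j to the new node, and the distances from the new node to the rest.
    """
    n = len(d)
    r = [sum(row) / (n - 2) for row in d]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    # Select the pair minimizing D_ij = d_ij - r_i - r_j.
    i, j = min(pairs, key=lambda p: d[p[0]][p[1]] - r[p[0]] - r[p[1]])
    branch_i = 0.5 * (d[i][j] + r[i] - r[j])
    branch_j = 0.5 * (d[i][j] + r[j] - r[i])
    to_new = {k: 0.5 * (d[i][k] + d[j][k] - d[i][j])
              for k in range(n) if k not in (i, j)}
    return (i, j), r, (branch_i, branch_j), to_new


# Upper triangle of the example matrix, row by row.
UPPER = [[8, 3, 14, 10, 12], [9, 10, 6, 8], [15, 11, 13], [10, 8], [8]]
EXAMPLE = [[0] * 6 for _ in range(6)]
for a, row in enumerate(UPPER):
    for offset, value in enumerate(row):
        EXAMPLE[a][a + 1 + offset] = EXAMPLE[a + 1 + offset][a] = value

if __name__ == "__main__":
    pair, r, branches, to_new = nj_step(EXAMPLE)
    print("join species", pair[0] + 1, "and", pair[1] + 1)
    print("r =", r)
    print("branch lengths to new node:", branches)
    print("distances from new node:", {k + 1: v for k, v in to_new.items()})
```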

Neighbor Joining

Multidimensional Scaling (MDS). A method of representing a given collection of dissimilarities between pairs of objects as distances between points in a multidimensional metric space. Obvious application: visual representation of objects in 2- or 3-dimensional Euclidean space such that the distances between points in the space match the original dissimilarities between objects as closely as possible. Our application: species are represented as points in a higher-dimensional Euclidean space so that dissimilarities are captured as closely as possible by distances. Low-cost Steiner trees could be good evolutionary trees.

Classical MDS - Overview. $\Delta$: matrix of dissimilarities that are actual distances in some higher-dimensional Euclidean space. $\Delta^{(2)}$: matrix of squared dissimilarities. $H = I_n - n^{-1} \mathbf{1}\mathbf{1}^T$: centering matrix. $B = -\tfrac{1}{2} H \Delta^{(2)} H$: inner product matrix. Spectral decomposition $B = V \Lambda V^T$, where $\Lambda$ is the diagonal matrix of eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$ and $V$ is the corresponding matrix of normalized eigenvectors. $d$: number of non-zero eigenvalues. $\Lambda_d$: diagonal matrix of the first $d$ eigenvalues. Let $V_d$ denote the first $d$ columns of $V$. Then $X = V_d \Lambda_d^{1/2}$ is the coordinate matrix of the objects.
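
The key step of classical MDS is the double centering, which turns squared Euclidean distances into inner products of the centered points; in the standard formulation the inner product matrix carries a factor of -1/2. The sketch below checks this on three planar points using only the standard library. It stops before the eigendecomposition, and the helper names are assumptions.

```python
def double_center(sq):
    """Return B = -1/2 * H * sq * H for a square matrix sq.

    Multiplying by the centering matrix H on both sides is the same as
    subtracting row and column means and adding back the grand mean.
    """
    n = len(sq)
    row_mean = [sum(row) / n for row in sq]
    col_mean = [sum(sq[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - col_mean[j] + grand)
             for j in range(n)] for i in range(n)]


def squared_distances(points):
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
            for p in points]


def centered_gram(points):
    """Inner products of the points after moving their centroid to the origin."""
    n, dim = len(points), len(points[0])
    centroid = [sum(p[c] for p in points) / n for c in range(dim)]
    shifted = [[p[c] - centroid[c] for c in range(dim)] for p in points]
    return [[sum(a * b for a, b in zip(p, q)) for q in shifted] for p in shifted]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
    B = double_center(squared_distances(pts))
    G = centered_gram(pts)
    print(max(abs(B[i][j] - G[i][j]) for i in range(3) for j in range(3)))
```

An eigendecomposition of B (for example with a numerical library) then yields the coordinates X as described on the slide.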

Δ: metric matrix of dissimilarities (not necessarily distances in some higher-dimensional Euclidean space). Classical MDS can be used to compute X (discarding negative eigenvalues). If negative eigenvalues are small (close to zero) then X is fairly accurate. In order to reduce the dimension d, small positive eigenvalues can be discarded.

Steiner Tree Problem Given n points in d-dimensional space, find a shortest network spanning them.

Heuristic. Construct a minimum spanning tree (MST). Local improvements: applied to pairs of edges meeting at angles of less than 120 degrees. Shortcutting: pick a pair of nonadjacent vertices. Consider the longest edge on the path between them in the current tree. If the distance between the nonadjacent vertices is less than the length of the longest edge, shortcut and apply local improvements. Postprocessing: terminals of degree greater than 1 and Steiner points of degree greater than 3 have their degrees reduced by introducing appropriate Steiner points.

MDS-Steiner Heuristic
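
The starting point of the heuristic, the minimum spanning tree of the points, can be sketched as follows. Prim's algorithm is used here as one standard choice; the slides do not say which MST algorithm is used, and the function name `euclidean_mst` is an assumption. The local improvement, shortcutting and postprocessing steps are not shown.

```python
import math


def euclidean_mst(points):
    """Prim's algorithm on the complete Euclidean graph, O(n^2) time.

    Returns the list of tree edges as (parent_index, child_index) pairs.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known connection to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                w = math.dist(points[u], points[v])
                if w < best[v]:
                    best[v] = w
                    parent[v] = u
    return edges


if __name__ == "__main__":
    square = [(0, 0), (1, 0), (1, 1), (0, 1)]
    tree = euclidean_mst(square)
    print(tree, sum(math.dist(square[a], square[b]) for a, b in tree))
```

`math.dist` requires Python 3.8 or later and works for points of any dimension, which matters here because the MDS coordinates are typically high-dimensional.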

Validation Methods. Reconstruction of known phylogenies: these are rare, small and/or easy. Computational simulation: given a tree and an initial root sequence, use an appropriate probabilistic model of evolution to generate leaf sequences. Apply any of the distance methods and compare the result with the tree used in the simulation. J. Stoye, D. Evers and F. Meyer, ROSE: Generating sequence families, Bioinformatics 14 (1998) 157-163.

Comparison of Trees. Partition metric: removing an edge in the simulation tree or in the solution tree gives a cut (a partition of the leaves into two sets). The metric is the number of cuts induced by one tree but not by the other tree. This can be at most 2n-6 for n leaves. It can be determined in linear time. Sensitive when applied to very similar trees.

Computational Experience. How well does MDS retain the information needed to obtain good phylogenetic trees? Are short Euclidean Steiner trees good phylogenetic trees? How do phylogenetic trees obtained by the Steiner method compare to those obtained by other methods?

Quality of MDS (see Fig. 1a). Apply Neighbor Joining to the original distance matrices and to the MDS matrices (for an increasing number of dimensions). For a sufficient number of dimensions, the same phylogenetic trees are obtained. The number of dimensions needed to obtain good phylogenetic trees seems to be bounded; e.g. 20 dimensions are typically needed for problem instances of size 50. ROSE's relatedness parameter seems not to affect the number of dimensions. Conclusion: the information needed to obtain good phylogenetic trees is preserved when using MDS.
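
The partition metric can be computed by collecting the nontrivial leaf bipartitions of each tree and counting those that appear in only one of them. The sketch below is a simple set-based version, not the linear-time algorithm mentioned above; the nested-tuple tree format and the function names are assumptions.

```python
def leaf_set(tree):
    """All leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def nontrivial_splits(tree):
    """Bipartitions induced by inner edges, each stored as the side that
    does not contain a fixed reference leaf."""
    leaves = leaf_set(tree)
    ref = min(leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = leaves - below if ref in below else below
        if 1 < len(side) < len(leaves) - 1:   # skip cuts that isolate one leaf
            found.add(side)
        return below

    visit(tree)
    return found


def partition_metric(tree_a, tree_b):
    """Number of cuts induced by exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must have the same leaves")
    return len(nontrivial_splits(tree_a) ^ nontrivial_splits(tree_b))


if __name__ == "__main__":
    t1 = ((("a", "b"), "c"), ("d", "e"))
    t2 = ((("a", "c"), "b"), ("d", "e"))
    t3 = ((("a", "d"), "b"), ("c", "e"))
    print(partition_metric(t1, t2), partition_metric(t1, t3))
```

With five leaves the bound 2n-6 equals 4, which is reached by the pair t1, t3 because they share no nontrivial cut.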

Steiner Trees as Phylogenetic Trees (Fig. 2a) As the lengths of Steiner trees decrease to 85% of the lengths of MSTs, their partition metric distances to the correct trees start to decrease. This behavior is observed for all dimensions higher than ½ of the maximal dimension.

Direct Comparison of NJ and Steiner (Fig. 1b) These two methods seem comparable. It is possible that the way instances are generated by ROSE favors NJ.

Homework. Critical assessment of the paper. Method: MDS, Steiner tree heuristic. Validation: use of ROSE. Evaluation: problem instances, interpretation of results.