Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony

Similar documents
Codon models. In reality we use codon model Amino acid substitution rates meet nucleotide models Codon(nucleotide triplet)

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst

Distance based tree reconstruction. Hierarchical clustering (UPGMA) Neighbor-Joining (NJ)

Evolutionary tree reconstruction (Chapter 10)

Terminology. A phylogeny is the evolutionary history of an organism

CSE 549: Computational Biology

EVOLUTIONARY DISTANCES INFERRING PHYLOGENIES

Distance-based Phylogenetic Methods Near a Polytomy

What is a phylogenetic tree? Algorithms for Computational Biology. Phylogenetics Summary. Di erent types of phylogenetic trees

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment

Phylogenetics on CUDA (Parallel) Architectures Bradly Alicea

Parsimony-Based Approaches to Inferring Phylogenetic Trees

human chimp mouse rat

Algorithms for Bioinformatics

Recent Research Results. Evolutionary Trees Distance Methods

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment

Distance Methods. "PRINCIPLES OF PHYLOGENETICS" Spring 2006

DISTANCE BASED METHODS IN PHYLOGENTIC TREE CONSTRUCTION

On the Optimality of the Neighbor Joining Algorithm

Construction of a distance tree using clustering with the Unweighted Pair Group Method with Arithmatic Mean (UPGMA).

Improvement of Distance-Based Phylogenetic Methods by a Local Maximum Likelihood Approach Using Triplets

Introduction to Trees

of the Balanced Minimum Evolution Polytope Ruriko Yoshida

Generation of distancebased phylogenetic trees

DIMACS Tutorial on Phylogenetic Trees and Rapidly Evolving Pathogens. Katherine St. John City University of New York 1

BMI/CS 576 Fall 2015 Midterm Exam

Genetics/MBT 541 Spring, 2002 Lecture 1 Joe Felsenstein Department of Genome Sciences Phylogeny methods, part 1 (Parsimony and such)

CS 581. Tandy Warnow

11/17/2009 Comp 590/Comp Fall

The worst case complexity of Maximum Parsimony

Olivier Gascuel Arbres formels et Arbre de la Vie Conférence ENS Cachan, septembre Arbres formels et Arbre de la Vie.

Scaling species tree estimation methods to large datasets using NJMerge

ML phylogenetic inference and GARLI. Derrick Zwickl. University of Arizona (and University of Kansas) Workshop on Molecular Evolution 2015

Lecture 20: Clustering and Evolution

Prior Distributions on Phylogenetic Trees

Stephen Scott.

Lecture 20: Clustering and Evolution

Phylogenetic Trees Lecture 12. Section 7.4, in Durbin et al., 6.5 in Setubal et al. Shlomo Moran, Ilan Gronau

Sistemática Teórica. Hernán Dopazo. Biomedical Genomics and Evolution Lab. Lesson 03 Statistical Model Selection

CISC 636 Computational Biology & Bioinformatics (Fall 2016) Phylogenetic Trees (I)

Lab 07: Maximum Likelihood Model Selection and RAxML Using CIPRES

Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters

Clustering of Proteins

4/4/16 Comp 555 Spring

Introduction to Computational Phylogenetics

Seeing the wood for the trees: Analysing multiple alternative phylogenies

46 Grundlagen der Bioinformatik, SoSe 11, D. Huson, May 16, 2011

Consider the character matrix shown in table 1. We assume that the investigator has conducted primary homology analysis such that:

Parsimony methods. Chapter 1

Phylogenetic Trees - Parsimony Tutorial #11

Parsimony Least squares Minimum evolution Balanced minimum evolution Maximum likelihood (later in the course)

Answer Set Programming or Hypercleaning: Where does the Magic Lie in Solving Maximum Quartet Consistency?

1 High-Performance Phylogeny Reconstruction Under Maximum Parsimony. David A. Bader, Bernard M.E. Moret, Tiffani L. Williams and Mi Yan

CLC Phylogeny Module User manual

Rapid Neighbour-Joining

Lab 8 Phylogenetics I: creating and analysing a data matrix

Multiple sequence alignment. November 2, 2017

UNIVERSITY OF CALIFORNIA. Santa Barbara. Testing Trait Evolution Models on Phylogenetic Trees. A Thesis submitted in partial satisfaction of the

Multiple Sequence Alignment. Mark Whitsitt - NCSA

MLSTest Tutorial Contents

Protein phylogenetics

Dynamic Programming for Phylogenetic Estimation

Basic Tree Building With PAUP

Hybrid Parallelization of the MrBayes & RAxML Phylogenetics Codes

Rapid Neighbour-Joining

A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony

Sequence length requirements. Tandy Warnow Department of Computer Science The University of Texas at Austin

ABOUT THE LARGEST SUBTREE COMMON TO SEVERAL PHYLOGENETIC TREES Alain Guénoche 1, Henri Garreta 2 and Laurent Tichit 3

SPR-BASED TREE RECONCILIATION: NON-BINARY TREES AND MULTIPLE SOLUTIONS

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

Lecture: Bioinformatics

A New Approach For Tree Alignment Based on Local Re-Optimization

Lesson 13 Molecular Evolution

Tutorial using BEAST v2.4.7 MASCOT Tutorial Nicola F. Müller

A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony

Multiple Sequence Alignment II

Heterotachy models in BayesPhylogenies

Relaxed Neighbor Joining: A Fast Distance-Based Phylogenetic Tree Construction Method

Definitions. Matt Mauldin

Reconstructing long sequences from overlapping sequence fragment. Searching databases for related sequences and subsequences

Genome 559: Introduction to Statistical and Computational Genomics. Lecture15a Multiple Sequence Alignment Larry Ruzzo

Generalized Neighbor-Joining: More Reliable Phylogenetic Tree Reconstruction

CLUSTERING IN BIOINFORMATICS

Multiple sequence alignment. November 20, 2018

"PRINCIPLES OF PHYLOGENETICS" Spring 2008

Assignment 3 Phylogenetic Tree Reconstruction and Motif Finding

Phylogenetic Inference on Cell Processors

An Experimental Analysis of Robinson-Foulds Distance Matrix Algorithms

FastJoin, an improved neighbor-joining algorithm

Basics of Multiple Sequence Alignment

Simulation of Molecular Evolution with Bioinformatics Analysis

Tutorial using BEAST v2.4.1 Troubleshooting David A. Rasmussen

1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998

A Collapsing Method for Efficient Recovery of Optimal. Edges in Phylogenetic Trees

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

Throughout the chapter, we will assume that the reader is familiar with the basics of phylogenetic trees.

STATISTICS IN THE BILLERA-HOLMES- VOGTMANN TREESPACE

Profiles and Multiple Alignments. COMP 571 Luay Nakhleh, Rice University

Systematics - Bio 615

Transcription:

Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony Basic Bioinformatics Workshop, ILRI Addis Ababa, 12 December 2017

Learning Objectives understand the complexity of the space of all phylogenetic trees on n taxa understand what are distance matrix methods for phylogenetic inference understand what is the maximum parsimony paradigm and how it is used to infer trees 2

Learning Outcomes be able to give reasonable estimates of the number of trees on n taxa, perhaps with the mathematical formula be able to build manually the UPGMA tree (distance method) from a small distance matrix be able to calculate the most parsimonious history on a given labeled tree 3

Molecular Evolution & Phylogenetics Strategies in the quest for the best phylogenetic tree 4

The tree inference problem Problem: Assuming common descent, how to derive the most probably correct tree from the knowledge of the traits in the extant taxa (leaves)? tax1: ACGG tax2: AAGG tax3: AAGT tax4: GAGG phylogenetic inference tax2 tax4? tax3 tax1 input data = aligned nucl. tax2 tax4 tax1 tax3 5

So many trees! Let s count them. Fast growth of the number of binary tree topologies with n taxa: 1 unrooted binary tree topology with 3 taxa: 3 unrooted binary tree topologies with 4 taxa: B A B C C D D C B A D 15 unrooted binary tree topologies with 5 taxa A (2n-5)!! = 3 x 5 x 7 x 9 x x (2n-5) unrooted binary tree topologies with n taxa (2n-3) x (2n-5)!! = (2n-3)!! rooted binary tree topologies with n taxa B A C 6

for n = 10 taxa, approx. 2 million unrooted binary trees for n = 20 taxa, approx. 2.2 x 1020 unrooted binary trees So many trees! and then extra information is in the branch lengths! looking for the right tree is searching an infinite space. one has to devise strategies, either exact (algorithms) or including some amount of suboptimal guessing (heuristics) to perform tree inference 7

Several families of methods for phylogenetic tree inference distance matrix methods try and devise a tree from an input being a matrix of pairwise distances between taxa minimum evolution (ME) a.k.a. maximum parsimony (MP) methods try and find the tree with minimal number of evolutionary events (character changes) along its branches maximum likelihood (ML) methods try and find the tree maximising the probability of the data observed on the leaves, under a certain evolutionary model Bayesian methods try and associate trees with posterior probabilities (of the model, i.e. the tree, having considered the data), and ultimately pick the tree maximizing that posterior probability 8

Molecular Evolution & Phylogenetics Distance matrix methods 9

Distance matrix methods: pros and cons advantage: can be used on virtually any data (traits), as long as one knows how to calculate pairwise distances from them advantage: distance matrix methods are fast, accommodate even large numbers of taxa shortcoming: most often, calculated distances differ from patristic distances (sum of brach lengths on the path) shortcoming: loss of signal from the data, (complexity of a taxa summarised into its pairwise distances with others) remark: some methods produce constrained trees (e.g. respecting the molecular clock) 10

UPGMA (Sokal&Michener 1958): Unweighted Pair Group Method using arithmetic Averages progressively grouping taxa into clusters, building the tree bottom-up (from leaves to root) the distance between two clusters is the average distance between pairs of sequences from each cluster initially as many clusters as individual sequences. Place corresponding nodes at height 0. Simplest distance matrix method: UPGMA iteratively choose the two clusters C i and C j with minimum distance d ij, define a new cluster C k containing all sequences from clusters C i and C j, place the corresponding node at height d ij /2 and link it with the two nodes representing C i and C j continue until the root is reached, when all clusters are connected. 11

UPGMA at work: 1/3 A B C D A - 9 7 3 1.0 1.0 B - 2 6 B C C - 5 D - new cluster BC d(bc, A) = (9+7)/2 = 8 d(bc,d) = (6+5)/2 = 5.5 12

UPGMA at work: 2/3 A BC D A - 8 3 BC - 5.5 D - 1.5 1.5 A D new cluster AD d(bc, AD) = (8+5.5)/2 = 6.75 13

UPGMA at work: 3/3 From previous slide: d(bc, AD) = 6.75 2.375 1.875 d(bc, AD)/2 = 6.75/2 = 3.375 1.0 1.0 1.5 1.5 B C A D 14

UPGMA warning: arithmetic averages The distance between two clusters is the average distance between pairs (i,j) of taxa with i in one cluster and j in the other. For instance, in the prep. of cluster ABC from clusters AB and C, we use a weighted average: d(abc, E) = 2/3 * d(ab,e) + 1/3 * d(c,e) 15

Properties of the UPGMA tree inference trees reconstructed by the UPGMA algorithm are ultrametric (same length from the root to any leaf) if the molecular clock hypothesis holds (constant mutation rate), then UPGMA reconstructs the true tree. An ultrametric tree 16 A non-ultrametric tree

Additivity and the NJ algorithm In case the distance between any pair of leaves is equal to the length of the path between them, the tree is said to be additive. UPGMA always yields an additive tree, but the final distances do not always equal the original distances from the distance matrix: if additivity holds but not ultrametricity, then UPGMA will not reconstruct the correct tree Saitou & Nei (1987) and Studier & Keppler (1988) developed a method that always reconstructs the correct tree if the set of distances is additive: the Neighbour-Joining (NJ) algorithm NJ more complex than UPGMA, using corrected (modified) distances most popular distance method, used e.g. to build quick guide trees for multiple sequence alignment algorithms refined implementation: BioNJ (Gascuel 1997) 17

Distance methods: a summary two main algorithms: UPGMA and NJ methods suitable for a wide range of data: only need pairwise distances between taxa these algorithms are fast: (n 1) internal nodes created one by one, i.e. linear complexity greedy algorithms: no costly exploration of the set of all tree topologies 18

Rooting a tree with an outgroup UPGMA yields rooted trees NJ and other methods to be seen later produce unrooted trees. How to root them? in case the molecular clock property holds, one can choose for the root the midpoint on the longest path between any two leaves otherwise, one can root using an outgroup an outgroup is an extra taxon for which we know its MRCA (Most Recent Common Ancestor) with any taxon in the tree clearly predates the MRCA of all other taxa in the tree the node where it connects with the rest of the tree is the inferred root 19

Molecular Evolution & Phylogenetics Maximum Parsimony (MP) a.k.a Minimum Evolution (ME) 20

Minimum Evolution Principle Minimum Evolution principle goes for the evolutionary history accounting for (i.e. explaining ) the extant sequences with as few evolutionary events (character substitutions, deletions and insertions) as possible. Minimum count of events: Parsimony cost or Parsimony steps similar to trying and get a tree with smallest total branch length a principle (dogma), not a scientific truth beware! true number of substitutions inferred parsimony cost (think of the possible evolutionary history A T A) be cautious with MP phylogenies on very divergent data 21

Different trees, different parsimony costs 22

Different sites support different trees a site is a column in a multiple sequence alignment (one aligned character per taxa) phylogeny built from a Multiple Sequence Alignment (MSA) must take into account all sites sitei total parsimony cost = cost i ( independent sites paradigm) 23

Some sites don t choose (1/2) Some sites are uninformative from the viewpoint of MP: they are useless in deciding about the most parcimonious tree. 24

Some sites don t choose (2/2) More uninformative sites: 25

Informative site for Maximum Parsimony A site (column from an alignment) is informative for MP inference methods if and only if it contains at least two different characters in at least two copies each. MP methods ignore potentially many sites 26

Core to Maximum Parsimony analysis: Fitch s algorithm published by Fitch in 1971 fast algorithm to calculate the parsimony score for any given site on any given tree. gives both the maximum parsimony score and a most parsimonious history (ancestral characters) in linear time. doesn t search the space of the different phylogenies: mere calculation on a given tree 27

Fitch s algorithm, step by step let s = 0 be the score counter initialization let c k be the set of acceptable characters on node k for leaves k, c k is the singleton containing the character seen in the alignment moving up from leaves to root (post-order traversal), when k is a node having sons i and j, let c k = c i c j if that intersection is not empty, otherwise c k = c i c j and s = s + 1 once the root is reached, s is the total parsimony score and the acceptable sets on the nodes define most parsimonious histories 28

Fitch s algorithm: example application score = 0 A C A T T T A 29

Fitch s algorithm: example application score = 1 {A,C} {T} A C A T T T A 30

Fitch s algorithm: example application score = 2 {A} {A,T} {A,C} {T} A C A T T T A 31

Fitch s algorithm: example application {A,T} {A,T} score = 3 {A} {A,T} {A,C} {T} A C A T T T A 32

Fitch s algorithm: a most parsimonious history A A score = 3 A A A T A C A T T T A 33

Programme for the next session... heuristics for an efficient travel in the space of trees the likelihood of a tree the Maximum Likelihood framework Bayesian methods assessing the quality of a tree: bootstrap method and local confidence statistics 34