CSE 549: Computational Biology

Similar documents
Evolutionary tree reconstruction (Chapter 10)

Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony

Phylogenetics on CUDA (Parallel) Architectures Bradly Alicea

Codon models. In reality we use codon model Amino acid substitution rates meet nucleotide models Codon(nucleotide triplet)

Parsimony-Based Approaches to Inferring Phylogenetic Trees

CS 581. Tandy Warnow

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst

Terminology. A phylogeny is the evolutionary history of an organism

What is a phylogenetic tree? Algorithms for Computational Biology. Phylogenetics Summary. Di erent types of phylogenetic trees

Recent Research Results. Evolutionary Trees Distance Methods

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering

Genetics/MBT 541 Spring, 2002 Lecture 1 Joe Felsenstein Department of Genome Sciences Phylogeny methods, part 1 (Parsimony and such)

Phylogenetic Trees Lecture 12. Section 7.4, in Durbin et al., 6.5 in Setubal et al. Shlomo Moran, Ilan Gronau

Lecture: Bioinformatics

Algorithms for Bioinformatics

Introduction to Trees

EVOLUTIONARY DISTANCES INFERRING PHYLOGENIES

On the Optimality of the Neighbor Joining Algorithm

11/17/2009 Comp 590/Comp Fall

ML phylogenetic inference and GARLI. Derrick Zwickl. University of Arizona (and University of Kansas) Workshop on Molecular Evolution 2015

Distance Methods. "PRINCIPLES OF PHYLOGENETICS" Spring 2006

of the Balanced Minimum Evolution Polytope Ruriko Yoshida

Parsimony methods. Chapter 1

CISC 636 Computational Biology & Bioinformatics (Fall 2016) Phylogenetic Trees (I)

Seeing the wood for the trees: Analysing multiple alternative phylogenies

Lecture 20: Clustering and Evolution

Prior Distributions on Phylogenetic Trees

4/4/16 Comp 555 Spring

Lecture 20: Clustering and Evolution

Sequence length requirements. Tandy Warnow Department of Computer Science The University of Texas at Austin

Olivier Gascuel Arbres formels et Arbre de la Vie Conférence ENS Cachan, septembre Arbres formels et Arbre de la Vie.

Consider the character matrix shown in table 1. We assume that the investigator has conducted primary homology analysis such that:

How do we get from what we know (score of data given a tree) to what we want (score of the tree, given the data)?

Introduction to Computational Phylogenetics

The worst case complexity of Maximum Parsimony

DISTANCE BASED METHODS IN PHYLOGENTIC TREE CONSTRUCTION

Lab 07: Maximum Likelihood Model Selection and RAxML Using CIPRES

Distance based tree reconstruction. Hierarchical clustering (UPGMA) Neighbor-Joining (NJ)

DIMACS Tutorial on Phylogenetic Trees and Rapidly Evolving Pathogens. Katherine St. John City University of New York 1

Distance-based Phylogenetic Methods Near a Polytomy

CLUSTERING IN BIOINFORMATICS

Construction of a distance tree using clustering with the Unweighted Pair Group Method with Arithmatic Mean (UPGMA).

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment

human chimp mouse rat

A Note on Scheduling Parallel Unit Jobs on Hypercubes

Scaling species tree estimation methods to large datasets using NJMerge

Evolutionary Trees. Fredrik Ronquist. August 29, 2005

UNIVERSITY OF CALIFORNIA. Santa Barbara. Testing Trait Evolution Models on Phylogenetic Trees. A Thesis submitted in partial satisfaction of the

Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters

Markovian Models of Genetic Inheritance

Tutorial using BEAST v2.4.7 MASCOT Tutorial Nicola F. Müller

Unique reconstruction of tree-like phylogenetic networks from distances between leaves

Improvement of Distance-Based Phylogenetic Methods by a Local Maximum Likelihood Approach Using Triplets

A Collapsing Method for Efficient Recovery of Optimal. Edges in Phylogenetic Trees

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment

A New Algorithm for the Reconstruction of Near-Perfect Binary Phylogenetic Trees

Three-Dimensional Cost-Matrix Optimization and Maximum Cospeciation

Dynamic Programming for Phylogenetic Estimation

arxiv: v2 [q-bio.pe] 8 Aug 2016

CLC Phylogeny Module User manual

Heterotachy models in BayesPhylogenies

Profiles and Multiple Alignments. COMP 571 Luay Nakhleh, Rice University

Phylogenetic Trees and Their Analysis

Introduction to Triangulated Graphs. Tandy Warnow

SPR-BASED TREE RECONCILIATION: NON-BINARY TREES AND MULTIPLE SOLUTIONS

Protein phylogenetics

Phylogenetic Trees - Parsimony Tutorial #11

Answer Set Programming or Hypercleaning: Where does the Magic Lie in Solving Maximum Quartet Consistency?

Multiple Sequence Alignment Sum-of-Pairs and ClustalW. Ulf Leser

A STEP-BY-STEP TUTORIAL FOR DISCRETE STATE PHYLOGEOGRAPHY INFERENCE

The History Bound and ILP

Improved parameterized complexity of the Maximum Agreement Subtree and Maximum Compatible Tree problems LIRMM, Tech.Rep. num 04026

1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998

"PRINCIPLES OF PHYLOGENETICS" Spring 2008

Reconstructing Reticulate Evolution in Species Theory and Practice

Notes 4 : Approximating Maximum Parsimony

Semi-Supervised Learning with Trees

Hybrid Parallelization of the MrBayes & RAxML Phylogenetics Codes

A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony

Reconstructing Evolution of Natural Languages: Complexity and Parameterized Algorithms

Undirected Graphs. V = { 1, 2, 3, 4, 5, 6, 7, 8 } E = { 1-2, 1-3, 2-3, 2-4, 2-5, 3-5, 3-7, 3-8, 4-5, 5-6 } n = 8 m = 11

A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony

Computational Genomics and Molecular Biology, Fall

Lab 8 Phylogenetics I: creating and analysing a data matrix

Copyright 2000, Kevin Wayne 1

PRINCIPLES OF PHYLOGENETICS Spring 2008 Updated by Nick Matzke. Lab 11: MrBayes Lab

Computing the Quartet Distance Between Trees of Arbitrary Degrees

Fast Algorithms for Large-Scale Phylogenetic Reconstruction

Discrete mathematics

Main Reference. Marc A. Suchard: Stochastic Models for Horizontal Gene Transfer: Taking a Random Walk through Tree Space Genetics 2005

Short-Cut MCMC: An Alternative to Adaptation

Throughout the chapter, we will assume that the reader is familiar with the basics of phylogenetic trees.

17 dicembre Luca Bortolussi SUFFIX TREES. From exact to approximate string matching.

1 High-Performance Phylogeny Reconstruction Under Maximum Parsimony. David A. Bader, Bernard M.E. Moret, Tiffani L. Williams and Mi Yan

Computer vision: models, learning and inference. Chapter 10 Graphical Models

Topics in Parsing: Context and Markovization; Dependency Parsing. COMP-599 Oct 17, 2016

Trees. Arash Rafiey. 20 October, 2015

Topology Inference from Co-Occurrence Observations. Laura Balzano and Rob Nowak with Michael Rabbat and Matthew Roughan

CS242: Probabilistic Graphical Models Lecture 3: Factor Graphs & Variable Elimination

CS301 - Data Structures Glossary By

Transcription:

CSE 549: Computational Biology Phylogenomics 1 slides marked with * by Carl Kingsford

Tree of Life 2 *

H5N1 Influenza Strains Salzberg, Kingsford, et al., 2007 3 *

H5N1 Influenza Strains The 2007 outbreak was a distinct strain bootstrap support Salzberg, Kingsford, et al., 2007 4 *

Questions Addressable by Phylogeny How many times has a feature arisen? been lost? How is a disease evolving to avoid immune system? What is the sequence of ancestral proteins? What are the most similar species? What is the rate of speciation? Is there a correlation between gain/loss of traits and environment? with geographical events? Which features are ancestral to a clade, which are derived? What structures are homologous, which are analogous? 5 *

Study Design Considerations Taxon sampling: how many individuals are used to represent a species? how is the outgroup chosen? Can individuals be collected or cultured? Marker selection: Sequence features should: be Representative of evolutionary history (unrecombined) have a single copy be able to be amplified using PCR able to be sequenced change enough to distinguish species, similar enough to perform MSA 6 *

Convergent Evolution Bird & bat wings arose independently (analogous) Has wings is thus a bad trait for phylogenetic inference Bone structure has common ancestor (homologous) 7 *

Divergent Evolution Obvious phenotypic traits are not necessarily good markers These are all one species! 8 *

Two phylogeny problems Note: Character below is not a letter (e.g. A,C,G,T), but a particular characteristic under which we consider the phylogeny (e.g. column of a MSA). Each character takes on a state (e.g. A,C,G,T). The small phylogeny problem Given: a set of characters at the leaves (extant species), a set of states for each character, the cost of transition from each state to every other, and the topology of the phylogenetic tree Find: a labeling for each internal node that minimizes the overall cost of transitions. The big phylogeny problem Given: a set of characters at the leaves (extant species), a set of states for each character, the cost of transition from each state to every other Find: a tree topology and labeling for each internal node that minimizes the overall cost (over all trees and internal states) 9

Small phylogeny problem parsimony One way to define the lowest cost set of transitions is to maximize parsimony. That is, posit as few transitions as necessary to produce the observed result. What characters should appear in the boxes? A A C C T 10

Small phylogeny problem parsimony One way to define the lowest cost set of transitions is to maximize parsimony. That is, posit as few transitions as necessary to produce the observed result. Assume transitions all have unit cost: A C G T A 0 1 1 1 C 1 0 1 1 G 1 1 0 1 T 1 1 1 0 A A C C T 11

Small phylogeny problem parsimony One way to define the lowest cost set of transitions is to maximize parsimony. That is, posit as few transitions as necessary to produce the observed result. Assume transitions all have unit cost: A C G T A 0 1 1 1 C 1 0 1 1 G 1 1 0 1 T 1 1 1 0 Fitch s algorithm provides a solution. A A C C T 12

Small phylogeny problem parsimony Fitch s algorithm (2-pass): Visit nodes in post-order traversal: store a set of characters at each node take the intersection of child s set if not empty; else take the union Visit nodes in pre-order traversal: If a child s character set has it s parent s label, choose it. Otherwise, select any character in this node s character set. {A,C,T} Phase 1 {A,C} {T} {A} {C} A A C C T 13

Small phylogeny problem parsimony Fitch s algorithm (2-pass): Visit nodes in post-order traversal: store a set of characters at each node take the intersection of child s set if not empty; else take the union Visit nodes in pre-order traversal: If a child s character set has it s parent s label, choose it. Otherwise, select any character in this node s character set. {A,C,T} Phase 2 {A,C} +1 {T} +1 {A} {C} cost = 2 A A C C T 14

Small phylogeny problem parsimony Fitch s algorithm (2-pass): Visit nodes in post-order traversal: store a set of characters at each node take the intersection of child s set if not empty; else take the union Visit nodes in pre-order traversal: If a child s character set has it s parent s label, choose it. Otherwise, select any character in this node s character set. {A,C,T} Note: There are generally many solutions of optimal cost. {A} +1 {A,C} {C} +1 {T} cost = 2 A A C C T 15

Small phylogeny problem parsimony What if there are different costs for each transition? Sankoff* provides a dynamic program to solve this case. For simplicity, consider only a single character, c Phase 1 (post-order): For each leaf v, state t, let S c t (v) = ( 0 if v c = t 1 otherwise For each internal v, state t, let S c t (v) =min i {C c ti + S c i (u)} +min j C c tj + S c j (w) Phase 2 (pre-order): Let the root take state r c = arg min t S c t (r) For all other v with parent u, let: v c = arg min t C c u c t + S c t (v) *Sankoff & Cedergren (1983) 16

Small phylogeny problem parsimony Consider the following tree and transition matrix: A C G T A 0 2.5 1 2.5 C 2.5 0 2.5 1 G 1 2.5 0 2.5 6 6 7 8 T 2.5 1 2.5 0 3.5 3.5 3.5 4.5 1 5 1 5 2.5 2.5 3.5 3.5 0 0 0 0 0 G A C A C example:http://evolution.gs.washington.edu/gs541/2010/lecture1.pdf 17

Small phylogeny problem parsimony Consider the following tree and transition matrix: A C G T A 0 2.5 1 2.5 C 2.5 0 2.5 1 A 6 6 7 8 G 1 2.5 0 2.5 T 2.5 1 2.5 0 A 1 5 1 5 A 3.5 3.5 3.5 4.5 A 2.5 2.5 3.5 3.5 0 0 0 0 0 G A C A C example:http://evolution.gs.washington.edu/gs541/2010/lecture1.pdf 18

Small phylogeny problem Maximum Likelihood Imagine we assume a specific, probabilistic model of sequence evolution. For example: Jukes-cantor or General Time Reversible Time reversible: Base frequencies: i Q ij = j Q ji Rate matrix (per unit time): α is the probability to mutate (per-unit time) Transition matrix at time t: https://en.wikipedia.org/wiki/models_of_dna_evolution 19

Small phylogeny problem Maximum Likelihood Imagine we assume a specific, probabilistic model of sequence evolution. Given a tree topology (with branch lengths), a set of states for each character, and the assumed model of state evolution Find the states at each internal node that maximizes the likelihood of the observed data (i.e. states at the leaves) Rather than choosing the best state at each site, we are summing over the possibility of all states (phylogenetic histories) 20

Consider the simple tree 6 d64 d65 d40 4 5 d41 d52 d53 0 1 2 3 For particular ancestral states s6, s4 and s5, we can score their likelihood as: L(s 6,s 4,s 5 )=p s6!s 4 (d 64 ) p s6!s 5 (d 65 ) p s4!s 0 (d 40 ) p s4!s 1 (d 41 ) p s5!s 2 (d 52 ) p s5!s 3 (d 53 ) Since we don t know these states, we must sum over them: L = X s 6 X s 4 X s 5 s6 L(s6,s 4,s 5 ) 21 Example from: http://www.bioinf.uni-freiburg.de/lehre/courses/ 2013_SS/V_Bioinformatik_1/phylo-maximum-likelihood.pdf

Small phylogeny problem Maximum Likelihood 6 d64 d65 d40 4 5 d41 d52 d53 0 1 2 3 It turns out that this objective (maximum likelihood) can also be optimized in polynomial time. This is done by re-arranging the terms and expressing them as conditional probabilities. The algorithm is due to Felsenstein* again, it is a dynamic program Felsenstein, Joseph. "Evolutionary trees from DNA sequences: a maximum likelihood approach." Journal of molecular evolution 17.6 (1981): 368-376. 22

Small phylogeny problem Maximum Likelihood Idea 1: Re-arrange the computation 6 to be more favorable d64 d65 4 5 L = X s 6 X s 4 X s 5 s6 L(s6,s 4,s 5 ) d40 d41 d52 d53 0 1 2 3 via. Horner s method (push summations to the right) 8P 9 = X < s 4 p s6!s 4 d(s 64 )(p s4!s 0 d(s 40 )p s4!s 1 d(s 41 )) = s6 : P ; s 6 s 5 p s6!s 5 d(s 65 )(p s5!s 2 d(s 52 )p s5!s 3 d(s 53 )) 23

Small phylogeny problem Maximum Likelihood 6 d64 d65 4 5 d40 d41 d52 d53 0 1 2 3 = X s 6 s6 8P 9 < s 4 p s6!s 4 d(s 64 )(p s4!s 0 d(s 40 )p s4!s 1 d(s 41 )) = : P ; s 5 p s6!s 5 d(s 65 )(p s5!s 2 d(s 52 )p s5!s 3 d(s 53 )) The structure of the equations here matches the structure of the tree ((.,.)(.,.)) see e.g. nested parenthesis encoding of trees. 24

Small phylogeny problem Maximum Likelihood Idea 2: define the total likelihood in terms of conditional likelihoods. 4 5 0 1 2 3 6 d64 d65 d40 d41 d52 d53 L k,s Conditional likelihood of the subtree rooted at k, assuming k takes on states s. 25

Small phylogeny problem Maximum Likelihood Now, we can define likelihood recursively! L k,s =Pr(s k = s) if k is a leaf L k,s = X s i p sk!s i (d ki )L i,si! 0 @ X sj 1 p sk!s j (d kj )L j,sj A k how can we do this efficiently? i dki dkj j

Small phylogeny problem Maximum Likelihood Now, we can define likelihood recursively! L k,s =Pr(s k = s) if k is a leaf L k,s = X s i p sk!s i (d ki )L i,si! 0 @ X sj 1 p sk!s j (d kj )L j,sj A k how can we do this efficiently? Dynamic programming: postorder traversal of the tree! i dki dkj j

Small phylogeny problem Maximum Likelihood At the root, we simply sum over all possible states to get the likelihood for the entire tree: L = X s r sr L r,sr r dri drj Using these likelihoods, we can ask questions like: i j What is the probability that node k had state A? What is the probability that node k didn t have state C? At node k, how likely was state A compared to state C?

Small phylogeny problem Maximum Likelihood This maximum likelihood framework is very powerful. It allows us to consider all evolutionary histories, weighted by their probabilities. Also lets us evaluate other tree parameters like branchlength. But we there can be many assumptions baked into our model (and such a model is part of our ML framework) What if our parameters are wrong? What if our assumptions about Markovian mutation are wrong? What if the structure of our model is wrong (correlated evolution)?

Large phylogeny problem searching for trees Distance-based methods: Sequences -> Distance Matrix -> Tree Neighbor joining, UPGMA Maximum Likelihood: Sequences + Model -> Tree + parameters Bayesian MCMC: Markov Chain Monte Carlo: random sampling of trees by random walk 30 *

Additivity (for distance-based methods) A distance matrix M is additive if a tree can be constructed such that dt(i,j) = path length from i to j = Mij. Such a tree faithfully represents all the distances 4-point condition: A metric space is additive if, given any 4 points, we can label them so that Mxy + Muv Mxu + Myv = Mxv + Myu If our metric is additive, there is exactly one tree realizing it, an it can be found by successive insertion # # lectures.molgen.mpg.de/phylogeny/reconstruction 31 *

Neighbor Joining Choose x, y to merge that minimize: Q(x, y) :=(n 2)D xy n X D xk + nx D yk! k=1 k=1 Update lengths: D xu := 1 2 D xy + 1 2(n 2) nx k=1 (D xk D yk ) x D yu := D xy D xu u y D uk := 1 2 (D xk + D yk D xy ) k 32 *

Neighbor Joining Example example from: https://en.wikipedia.org/wiki/neighbor_joining 33 *

What if our distances aren t so nice? UPGMA Find two most similar taxa (ie. such that Mij is smallest) Merge into new OTU (operational taxonomic unit) distance from k to to new OTU = average distance from k to each of OTUs members Repeat. Even if there is perfect tree, it may not find it. 34 *

Maximum Parsimony Input: n sequences of length k Output: A tree T = (V, E) and a sequence su of length k for each node u to minimize: X (u,v)2e Hamming(s u,s v ) NP-hard (reduction from Hamming distance Steiner tree) Can score a given tree in time O( nk). 35 *

Heuristic: Nearest Neighbor Interchange Walk from tree T to its neighbors, choosing best neighbor at each step. A B e A C e neighbors C D B D A e C D B 36 *

Maximum Likelihood Input: n sequences S 1,...,S n of length k; choice of model Output: Tree T and parameters p e for each edge to maximize: Pr[S 1,...,S n T, p] NP-hard if model is Jukes-Cantor; probably NP-hard for other models. 37 *

Bayesian MCMC Walk from tree T to its neighbors, choosing a particular neighbor at each step with probability related to its improvement in likelihood. This walk in the space of trees is a Markov chain. Under mild assumptions, and after taking many samples, trees are visited proportional to their true probabilities. # of times you visit a tree (after burn in )= probability of that topology Outputs a distribution of trees, not a single tree. 38 *

Bootstrapping How confident are we in a given edge? Bootstrapping: 1. Create (e.g.) 1,000 data sets of same size as input by sampling markers (MSA columns) with replacement. 2. Repeat phylogenetic inference on each set. 3. Support for edge is the % of trees containing this edge (bipartition). Interpretation: probability that edge would be inferred on a random data set drawn from the same distribution as the input set. 39 *

Going from an ensemble to a single tree Even if we can generate such an ensemble (e.g. a collection of trees where each is proportional to its true probability). How can we extract a single, meaningful, tree from this ensemble? 40

Splits abc def a b c d e f Every edge a split, a bipartition of the taxa taxa within a clade leading from the edge taxa outside the clade leading from the edge Example: this tree = {abc def, ab cdef + trivial splits} 41 *

Consensus Multiple trees: from bootstrap, from Bayesian MCMC, trees with sufficient likelihood, same parsimony: T = {T1,...,Tn} Splits of Ti := C(Ti) = { b(e) : e Ti } b(e) is the split (bipartition) for edge e. Majority consensus: tree given by splits which occur in > half inferred trees. 42 *

Incompatibility Tree 1 Tree 2 abc def abd cef aeb cdf a b c d e f a b c d f e Two splits are incompatible if they cannot be in the same tree. 43 *

Majority Consensus Always Exists Proof: 1. Let {sk} be the splits in > half the trees. 2. Pigeonhole: for each si, sj in {sk} there must be a tree containing both si and sj. 3. If si and sj are in same tree they are compatible. 4. Any set of compatible splits forms a tree. The {si} are pairwise compatible and form a tree. 44 *

Horizontal Gene Transfer Tree may not be best representation of evolutionary history. DNA uptake; retroviruses 45 *