46 Grundlagen der Bioinformatik, SoSe 11, D. Huson, May 16, 2011


Phylogeny

Further reading and sources for parts of this chapter:

R. Durbin, S. Eddy, A. Krogh & G. Mitchison, Biological sequence analysis, Cambridge, 1998.
Jones & Pevzner, Introduction to Bioinformatics, 2004.
J. Setubal & J. Meidanis, Introduction to computational molecular biology.
D.W. Mount, Bioinformatics: Sequences and Genome analysis, 2001.
D.L. Swofford, G.J. Olsen, P.J. Waddell & D.M. Hillis, Phylogenetic Inference, in: D.M. Hillis (ed.), Molecular Systematics, 2nd ed., Sunderland Mass.

The goal of phylogenetic analysis is to determine and describe the evolutionary relationship between a collection of extant species. In particular, this involves determining the order and approximate timing of speciation events.

It is generally assumed that speciation is a branching process: a population of organisms becomes separated into two sub-populations. Over time, these evolve into separate species that do not crossbreed. Because of this assumption, a tree is often used to represent a proposed phylogeny for a set of species, showing how the species evolved from a common ancestor.

0.1 Early evolutionary trees

Charles Darwin formulated the foundation of the theory of evolution. In his famous book The Origin of Species (published in 1859) Darwin wrote: "Species have changed, and are still slowly changing, by the preservation and accumulation of successive slight favorable variations." In a notebook he drew a picture illustrating the concept of an evolutionary tree linking different species.

[Figure: Darwin's notebook sketch of an evolutionary tree]

Darwin's work inspired Ernst Haeckel (in Jena) to draw many evolutionary trees.

[Figure: evolutionary tree drawn by Haeckel, 1866]


0.2 The Giant Panda riddle

Early phylogenetic studies were based solely on morphological characters, which did not always give a clear answer. For roughly 100 years scientists were unable to decide which family the giant panda belongs to. Giant pandas look like bears but have features that are unusual for bears and typical for raccoons, e.g., they do not hibernate. In 1985, Stephen O'Brien and colleagues [1] solved the giant panda classification problem using DNA sequences and algorithms.

0.3 Using sequences for trees

Once the role of DNA was discovered (1953), the idea developed that one could use molecular sequences to build phylogenetic trees.

In 1958, Crick proposed to use molecular sequences for phylogenetic tree reconstruction.

In the 1960s, Zuckerkandl and Pauling built the first phylogenetic tree from aligned amino acid sequences. They also formulated the molecular clock hypothesis.

[1] O'Brien SJ, Nash WG, Wildt DE, Bush ME, Benveniste RE. A molecular solution to the riddle of the giant panda's phylogeny. (1985) Nature 317(6033):140-4.

In 1967, Fitch & Margoliash published the first algorithm for computing phylogenetic trees from sequence data, together with a tree.

[Figure: phylogeny based on the protein sequence of cytochrome c]

In the 1970s, Carl Woese and colleagues identified SSU rRNA as an ideal universal "molecular chronometer":

It is abundant, coded for by organellar, nuclear and prokaryotic genomes.
It has slow- and fast-evolving portions.
It has a universally conserved structure.

The modern Tree of Life is based on SSU (16S and 18S) rRNA.

[Figure: the modern Tree of Life]

1 Basic concepts

A phylogeny describes the evolutionary history of a set of taxa, and the ultimate goal of any phylogenetic analysis is to reconstruct some part of the Tree of Life, the evolutionary history of all life on Earth.

Definition 1.1 (Related species) Two species are called related if they share a recent common ancestor. We say that a species a is more closely related to a species b than to a species c, if the last common ancestor of a and b is more recent than the last common ancestor of a and c.

For example, humans are more closely related to mice (last common ancestor about 100 million years ago) than they are to fruit flies (last common ancestor about 600 million years ago).

A set of species is called a cluster.

Definition 1.2 (Clusters and monophyletic groups) A cluster C is called a clade or monophyletic group, if all species in C are more closely related to each other than to any species outside of C; in other words, if all species in C are descended from a common ancestor, which is not an ancestor of any other species under consideration.

For example, the set of all marsupials forms a clade within the class of all mammals.

Given a set of species, the main goal of phylogenetic analysis is to compute a set of clusters such that each cluster is a monophyletic group, and then to represent the clusters as a rooted phylogenetic tree.

A tree consists of nodes connected by edges (also called branches). Terminal nodes (also called leaves) represent sequences or species for which we have data. Internal nodes represent hypothetical ancestors.

[Figure: rooted tree with the labels dichotomy, polytomy, terminal node (leaf), internal node (hypothetical ancestor), interior edge and root]

1.1 Phylogenetic trees

In the following, we will use $X = \{x_1, x_2, \dots, x_n\}$ to denote a set of taxa, where a taxon $x_i$ is simply a representative of a group of individuals defined in some way.

Definition 1.3 A phylogenetic tree (on X) is a system $T = (V, E, \lambda)$ consisting of a connected graph $(V, E)$ without cycles, together with a labeling $\lambda$ of the leaves by elements of X, such that:

1. every leaf is labeled by exactly one taxon, and
2. every taxon appears exactly once as a label.

1.2 Unrooted trees

An unrooted phylogenetic tree is obtained by placing a set of taxa on the leaves of a tree.

[Figure: taxa X + tree = phylogenetic tree T on X, for the taxa Pan_panisc, Gorilla, Homo_sap, Mus_mouse, Rattus_norv, harbor_sel, Bos_ta(cow), fin_whale and blue_whale]

1.3 Rooted trees

Example: Modern version of the tree of life.

[Figure: modern version of the tree of life]

1.4 The number of edges and nodes of an unrooted phylogenetic tree

Let T be an unrooted phylogenetic tree on n taxa, i.e., with n leaves. How many nodes and edges does T have?

Let us assume that T is binary (or bifurcating, each internal node being incident to precisely three edges). Any non-binary tree on n taxa will have fewer nodes and edges.

[Figure: unrooted binary trees on the leaves A, B, C, D, E]

Lemma 1.4 (Nodes and edges in unrooted tree) A binary phylogenetic tree T on n taxa has $2n-2$ nodes, $2n-3$ edges and $n-3$ interior edges for all $n \ge 3$.

1.5 The number of rooted phylogenetic trees

An unrooted tree T with n leaves has $2n-2$ nodes and $2n-3$ edges. A root can be added in any of the $2n-3$ edges, thus producing $2n-3$ different rooted trees from T.

[Figure: the three rooted trees obtained from the unrooted tree on leaves a, b, c]

For $n = 3$ there are three ways of adding a root. Similarly, there are 3 different ways of adding an extra edge with a new leaf to obtain an unrooted tree on 4 leaves. This new tree has $2n-3 = 5$ edges (for $n = 4$), so there are 5 ways to obtain a new tree with 5 leaves, and so on. Continuing this, we see that there are

$U(n) = (2n-5)!! := 3 \cdot 5 \cdot 7 \cdots (2n-5)$

unrooted trees on n leaves. There are

$R(n) = (2n-3)!! = U(n) \cdot (2n-3) = 3 \cdot 5 \cdots (2n-3)$

rooted trees.

    # taxa   # pairs         # edges   U(n)       R(n)
    n        $\binom{n}{2}$  2n-3      (2n-5)!!   (2n-3)!!

2 Constructing phylogenetic trees

There are four main approaches to constructing phylogenetic trees from molecular data.

1. Using a distance method, one first computes a distance matrix from a given set of biological data and then one computes a tree that represents these distances as closely as possible.
2. Maximum parsimony takes as input a set of aligned sequences and attempts to find a tree and a labeling of its internal nodes by auxiliary sequences such that the number of mutations along the tree is minimum.
3. Given a probabilistic model of evolution, maximum likelihood approaches aim at finding a phylogenetic tree that maximizes the likelihood of obtaining the given sequences.
4. Given a probabilistic model of evolution, Bayesian approaches aim at sampling the posterior distribution of trees using an MCMC algorithm.

3 Distances

Definition 3.1 (Distance matrix) Let $X = \{x_1, x_2, \dots, x_n\}$ be a set of taxa. A metric $D : X \times X \to \mathbb{R}_{\ge 0}$ assigns a distance $d(x_i, x_j) \ge 0$ to every pair of taxa $x_i, x_j \in X$. Sometimes we may abbreviate $d_{ij} = d(x_i, x_j)$. We usually require:

1. Symmetry: $d(x, y) = d(y, x)$ for all $x, y \in X$.
2. Identity: $d(x, y) = 0$ if and only if $x = y$.
3. Triangle inequality: $d(x, z) \le d(x, y) + d(y, z)$ for all $x, y, z \in X$.
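The three requirements of the distance definition can be checked mechanically for any concrete matrix. The following Python sketch is only an illustration: the function name `metric_violations` and the two example matrices are assumptions made here and do not come from the lecture notes.

```python
from itertools import product

def metric_violations(d, tol=1e-12):
    """Return the names of the metric properties that the square matrix d violates."""
    n = len(d)
    problems = []
    # Symmetry: d(x, y) = d(y, x)
    if any(abs(d[i][j] - d[j][i]) > tol for i, j in product(range(n), repeat=2)):
        problems.append("symmetry")
    # Identity: d(x, y) = 0 if and only if x = y
    if any((d[i][j] <= tol) != (i == j) for i, j in product(range(n), repeat=2)):
        problems.append("identity")
    # Triangle inequality: d(x, z) <= d(x, y) + d(y, z)
    if any(d[i][k] > d[i][j] + d[j][k] + tol
           for i, j, k in product(range(n), repeat=3)):
        problems.append("triangle")
    return problems

# Hypothetical example matrices
good = [[0, 2, 3], [2, 0, 4], [3, 4, 0]]
bad = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]   # 5 > 1 + 1 breaks the triangle inequality

print(metric_violations(good))
print(metric_violations(bad))
```

The check is cubic in the number of taxa because of the triangle inequality, which is acceptable for the small matrices used in examples.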

In phylogenetics, often only one direction of the identity property holds: $d(x, x) = 0$ for every taxon, but two distinct taxa may also have distance 0. Sometimes the triangle inequality is violated as well. To allow these two violations, we will use the less formal term distance matrix.

3.1 Hamming distance

Let a collection of taxa be given by a set of distinct sequences $A = \{A_1, A_2, \dots, A_n\}$ and assume we are given a multiple sequence alignment of the sequences.

Definition 3.2 (Hamming distance) We define the (normalized) Hamming distance $Ham(A_i, A_j)$ between two taxa $A_i$ and $A_j$ (also called the observed p-distance) as the number of mismatch positions in $A_i$ and $A_j$, divided by the number of comparisons.

We ignore any column in which both sequences contain a gap. If only one sequence has a gap in a column, then we can either ignore the column, or treat it as a match, or as a mismatch, depending on the type of data. Sometimes, one ignores all columns in which any of the n sequences contains a gap.

Example:

[Alignment of three sequences $A_1$, $A_2$, $A_3$ with gaps; sequences not reproduced]

Distances:

$Ham(A_1, A_2) = 4/12 = 0.33$
$Ham(A_1, A_3) = 5/11 = 0.45$
$Ham(A_2, A_3) = 3/11 = 0.27$

Hamming distances are only suitable for closely related sequences because multiple mutations at the same position and back-mutations are not counted. To compensate for this in practice, Hamming distances can be transformed based on some model of evolution such as the Jukes-Cantor model.

4 Distance-Based Tree Reconstruction

Problem 4.1 (Distance-Based Phylogeny Problem) Reconstruct an evolutionary tree on n leaves from an $n \times n$ distance matrix.
Input: An $n \times n$ distance matrix $D = (d_{ij})$ and a labeling $\lambda$.
Output: A phylogenetic tree T (rooted or unrooted) with n leaves and edge lengths.

5 UPGMA

We will now discuss a simple distance method called UPGMA, which stands for unweighted pair group method using arithmetic averages (Sokal & Michener 1958). Given a set of taxa X and a distance matrix D, UPGMA produces a rooted phylogenetic tree T with edge lengths.
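Returning briefly to the normalized Hamming distance defined earlier in this section, the gap conventions can be made concrete in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `hamming`, the `one_gap` parameter and the two example sequences are hypothetical and are not taken from the lecture notes.

```python
def hamming(a, b, one_gap="ignore"):
    """Normalized Hamming (observed p-) distance between two aligned sequences.

    one_gap controls columns in which exactly one sequence has a gap:
    "ignore", "match" or "mismatch".
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if one_gap not in ("ignore", "match", "mismatch"):
        raise ValueError("one_gap must be 'ignore', 'match' or 'mismatch'")
    mismatches = comparisons = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue                          # both gaps: column is always ignored
        if x == "-" or y == "-":
            if one_gap == "ignore":
                continue
            comparisons += 1
            mismatches += one_gap == "mismatch"
            continue
        comparisons += 1
        mismatches += x != y
    return mismatches / comparisons if comparisons else 0.0

# Hypothetical aligned sequences: one mismatch and one single-gap column
s1 = "ACGT-A"
s2 = "ACCTTA"
print(hamming(s1, s2))                        # single-gap column ignored
print(hamming(s1, s2, one_gap="mismatch"))    # single-gap column counted as mismatch
```

Which gap convention is appropriate depends on the data, as noted in the definition; the sketch only makes the three options explicit.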
It operates by clustering the given taxa, at each stage merging two clusters and at the same time creating a new node in the tree. The tree is assembled bottom-up, first clustering pairs of leaves, then pairs of clustered leaves, and so on. Each node is given a height, and the length of an edge is obtained as the difference of the heights of its two end nodes.

5.1 UPGMA example

Example: $X = \{1, 2, 3, 4, 5\}$, with distances given by distance in the plane.

[Figure: five points in the plane clustered in four steps. Cluster 1 and 2: new node 6 at height $t_1 = t_2 = \frac{1}{2} d(1,2)$. Cluster 4 and 5: new node 7 at height $t_4 = t_5 = \frac{1}{2} d(4,5)$. Cluster 7 and 3: new node 8 at height $\frac{1}{2} d(3,7)$. Cluster 6 and 8: root at height $\frac{1}{2} d(6,8)$.]

UPGMA produces a rooted, binary phylogenetic tree.

5.2 The distance between two clusters

Initially, we are given a distance $d(x, y)$ between any two taxa, i.e. leaves, x and y.

We define the distance $d(i, j) = d(C_i, C_j)$ between two clusters $C_i \subseteq X$ and $C_j \subseteq X$ to be the average distance between pairs of taxa from each cluster:

$d(i, j) = \frac{1}{|C_i| \, |C_j|} \sum_{x \in C_i,\, y \in C_j} d(x, y).$

Update formula: If $C_k$ is the union of two clusters $C_i$ and $C_j$, and $C_l$ is any other cluster, then

$d(k, l) = \frac{d(i, l)\,|C_i| + d(j, l)\,|C_j|}{|C_i| + |C_j|}.$

This is a useful update formula because, using it in the algorithm, we can obtain the distance between two clusters in constant time.

5.3 Edge length computation

Edge lengths are set as shown here:

[Figure: a node joining the clusters {i, j} and {k, l}. The edges down to i and j have length $d(i,j)/2$, the edges down to k and l have length $d(k,l)/2$, and the two edges below the new node have lengths $d(ij,kl)/2 - d(i,j)/2$ and $d(ij,kl)/2 - d(k,l)/2$.]

5.4 Application of the UPGMA algorithm

Example of UPGMA applied to 5S rRNA data.

Abbreviations:
Bsu: Bacillus subtilis
Bst: Bacillus stearothermophilus
Lvi: Lactobacillus viridescens
Amo: Acholeplasma modicum
Mlu: Micrococcus luteus

[Table: original pairwise distances between Bsu, Bst, Lvi, Amo and Mlu; values not reproduced]

The minimum is d(Bst, Bsu). We define the node connecting the nodes labeled Bst and Bsu, and place it at height d(Bst, Bsu)/2. Using the update formula we compute e.g. d(Bst+Bsu, Lvi) = 0.2569.

[Table: distances between Bsu+Bst, Lvi, Amo and Mlu; values not reproduced]

The minimum is d(Bst+Bsu, Mlu) = 0.2192. We define the node connecting Bst+Bsu and Mlu, and place it at height d(Bst+Bsu, Mlu)/2. Using the update formula we compute e.g. d(Bst+Bsu+Mlu, Lvi) = 0.3027.

[Table: distances between Bsu+Bst+Mlu, Lvi and Amo; surviving entry d(Lvi, Amo) = 0.2795]

The minimum is d(Lvi, Amo) = 0.2795. We define the node connecting the nodes labeled Lvi and Amo, and place it at height d(Lvi, Amo)/2. Using the update formula we compute e.g. d(Bst+Bsu+Mlu, Lvi+Amo) = 0.3310.

[Table: distance between Bsu+Bst+Mlu and Lvi+Amo: 0.3310]

The resulting tree:

[Figure: UPGMA tree with leaves Bsu, Bst, Mlu, Lvi, Amo]

(However, we will see later that this tree is probably biologically incorrect.)

5.5 The UPGMA algorithm

Algorithm 5.1 (UPGMA)
Input: A set of taxa $X = \{x_1, \dots, x_n\}$ and a corresponding distance matrix D
Output: A binary, rooted phylogenetic UPGMA tree $T = (V, E, \omega)$ on X
Initialization
    Set $\mathcal{C} = \{C_1 = \{x_1\}, \dots, C_n = \{x_n\}\}$
    Set $h(\{x_i\}) = 0$ for all i
    Set $V = \mathcal{C}$ and $E = \emptyset$
Iteration
    while $|\mathcal{C}| > 2$ do
        Determine two clusters $C_i$ and $C_j$ for which $d(i, j)$ is minimum
        Define a new cluster $C_k$ by $C_k = C_i \cup C_j$
        Set $\mathcal{C} = (\mathcal{C} \setminus \{C_i, C_j\}) \cup \{C_k\}$
        Set $d(k, l)$ for all clusters $C_l$ using the update formula
        Define node k with children i and j, and place it at height $h(k) = d(i, j)/2$
        Set $V = V \cup \{k\}$ and $E = E \cup \{(i, k), (j, k)\}$
        Set $\omega(i, k) = h(k) - h(i)$ and $\omega(j, k) = h(k) - h(j)$
Termination
    When only two clusters $C_i$ and $C_j$ remain, place the root at height $d(i, j)/2$.

5.6 The molecular clock hypothesis

Given a distance matrix D, the UPGMA method aims at building a rooted tree T with the property that all leaves have the same distance from the root $\rho$.

[Figure: rooted tree whose leaves all have the same distance from the root $\rho$]

This approach is suitable for sequence data that has evolved under circumstances in which the rate of mutations of sequences is constant over time and equal for all lineages in the tree.

Definition 5.2 The assumption that evolutionary events happen at a constant rate is called the molecular clock hypothesis.

5.7 UPGMA and the molecular clock

If the input distance matrix D was obtained from sequences that fulfill the molecular clock assumption, then the tree T reconstructed by UPGMA from D will be correct.

Otherwise, if the sequences deviate from this assumption, then UPGMA may fail to reconstruct the tree correctly, for example:

[Figure: correct tree $T_0$ and UPGMA tree T]

The problem here is that the closest leaves in $T_0$ are not neighboring leaves: they do not have a common parent node.

6 Neighbor-Joining

The most widely used distance method is Neighbor-Joining (NJ), originally introduced by Saitou and Nei (1987) [2], and modified by Studier and Keppler (1988).

Given a distance matrix D, Neighbor-Joining produces an unrooted phylogenetic tree T with edge lengths. It is more widely applicable than UPGMA, as it does not assume a molecular clock.

Starting from a star tree topology, we build a tree based on D by repeatedly pairing (or "joining") neighboring taxa.

How do we determine which nodes are neighbors? It is not correct simply to pick a pair i, j for which $d_{ij}$ is minimum.

Example: assume the true tree is the following.

[2] Saitou, N., Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425

[Figure: tree on the leaves $x_1, x_2, x_3, x_4$ in which $x_1$ and $x_3$ share a parent and $x_2$ and $x_4$ share a parent; the edges to $x_1$ and $x_2$ and the interior edge have length 0.1, the edges to $x_3$ and $x_4$ have length 0.4]

Given distances generated by this tree:

           x_2    x_3    x_4
    x_1    0.3    0.5    0.6
    x_2           0.6    0.5
    x_3                  0.9

$x_1$ and $x_2$ have minimum distance, but are not neighbors.

To avoid this problem, the trick is to subtract the averaged distances to all other leaves, thus compensating for long edges. We define a matrix $N = (N_{ij})$ with

$N_{ij} := d_{ij} - (r_i + r_j)$, where $r_i = \frac{1}{|L| - 2} \sum_{k \in L} d_{ik}$,

and L denotes the set of leaves. (Note that this is not precisely the average, as the number of non-zero summands is $|L| - 1$, not $|L| - 2$.)

Let us illustrate this using the previous example:

$r_1 = \frac{1}{4-2}(0.3 + 0.5 + 0.6) = 0.7$, and $r_2 = 0.7$, $r_3 = 1.0$ and $r_4 = 1.0$.

Then e.g. $N_{12} = d_{12} - (r_1 + r_2) = 0.3 - (0.7 + 0.7) = -1.1$.

    N =    x_2    x_3    x_4
    x_1   -1.1   -1.2   -1.1
    x_2          -1.1   -1.2
    x_3                 -1.1

The minima in the matrix N correspond to the pair i = 1, j = 3 and the pair i = 2, j = 4, as required, since in the tree $x_1, x_3$ and $x_2, x_4$ are neighbors.

Let i and j be two neighboring leaves that have the same parent node, k. Remove i, j from the list of nodes and add k to the current list of nodes. How do we have to set its distance to any given leaf m?

[Figure: leaves i and j attached to their parent k, which is connected to a leaf m]

We have

$d_{im} = d_{ik} + d_{km}$, $d_{jm} = d_{jk} + d_{km}$ and $d_{ij} = d_{ik} + d_{jk}$,

thus

$d_{im} + d_{jm} = d_{ik} + d_{km} + d_{jk} + d_{km} = d_{ij} + 2 d_{km}$,

which implies

$d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$.
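The correction step can be reproduced numerically. The sketch below recomputes $r_i$ and $N_{ij}$ for the four-taxon example; the matrix entries are the distances implied by the example tree (an assumption, since the original table is only partly legible), and the helper names `corrected_matrix` and `closest_pairs` are hypothetical.

```python
from itertools import combinations

# Assumed distances for the four-taxon example (x1..x4 stored as indices 0..3).
D = [[0.0, 0.3, 0.5, 0.6],
     [0.3, 0.0, 0.6, 0.5],
     [0.5, 0.6, 0.0, 0.9],
     [0.6, 0.5, 0.9, 0.0]]

def corrected_matrix(d):
    """Return the list r and the dictionary N[(i, j)] = d_ij - (r_i + r_j)."""
    n = len(d)
    r = [sum(row) / (n - 2) for row in d]          # r_i = 1/(|L|-2) * sum_k d_ik
    N = {(i, j): d[i][j] - (r[i] + r[j])
         for i, j in combinations(range(n), 2)}
    return r, N

def closest_pairs(values, tol=1e-9):
    """All index pairs whose value lies within tol of the minimum."""
    lowest = min(values.values())
    return sorted(p for p, v in values.items() if v - lowest <= tol)

r, N = corrected_matrix(D)
raw = {(i, j): D[i][j] for i, j in combinations(range(4), 2)}
print(closest_pairs(raw))   # pair(s) with the smallest raw distance
print(closest_pairs(N))     # pair(s) minimizing the corrected matrix N
```

A tolerance is used when comparing entries because the values are floating-point numbers; exact rational arithmetic would avoid this at the cost of slightly more code.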

6.1 The Neighbor-Joining algorithm

Algorithm 6.1 (Neighbor-Joining)
Input: Distance matrix D on X
Output: Phylogenetic tree T
Initialization
    Define T to be the set of leaf nodes, one for each taxon. Set L = T.
Iteration
    while $|L| > 2$ do
        Compute $N_{ij}$ from $D_{ij}$
        Pick a pair $i, j \in L$ for which $N_{ij}$ is minimum.
        Define a new node k and set $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$, for all $m \in L$. // update D
        Add k to T with edges of lengths $d_{ik} = \frac{1}{2}(d_{ij} + r_i - r_j)$ and $d_{jk} = d_{ij} - d_{ik}$, joining k to i and j, respectively.
        Remove i and j from L and add k to L
Termination
    When L consists of two nodes i and j, add the remaining edge between i and j, with length $d_{ij}$.

Why do we use $d_{ik} = \frac{1}{2}(d_{ij} + r_i - r_j)$ to set the edge lengths? By definition, $r_i = \frac{1}{|L|-2} \sum_{k \in L} d_{ik}$ equals the average distance $q_i = \frac{1}{|L|-2} \sum_{k \in L,\, k \ne i,j} d_{ik}$ from i to all other nodes $m \ne i, j$, plus $\frac{d_{ij}}{|L|-2}$.

[Figure: nodes i and j joined at k, with $d_{ij}$, $q_i$ and $q_j$ indicated as the average distances to the remaining leaves m]

As we see from this figure:

$d_{ik} = \frac{1}{2}(d_{ij} + q_i - q_j) = \frac{1}{2}\left(d_{ij} + q_i + \frac{d_{ij}}{|L|-2} - q_j - \frac{d_{ij}}{|L|-2}\right) = \frac{1}{2}(d_{ij} + r_i - r_j).$

6.2 Example

    Taxa   A    B    C    D
    A      -
    B      8    -
    C      7    9    -
    D      12   14   11   -

Initialisation: $L = X = \{A, B, C, D\}$

1. Iteration: Compute N. First compute the column sums $r_i$:

$r_A = 8 + 7 + 12 = 27$, $r_B = 8 + 9 + 14 = 31$, $r_C = 7 + 9 + 11 = 27$, $r_D = 12 + 14 + 11 = 37$

    N =
    Taxa   A                     B                     C
    B      8 - (r_A + r_B)/2
    C      7 - (r_A + r_C)/2     9 - (r_B + r_C)/2
    D      12 - (r_A + r_D)/2    14 - (r_B + r_D)/2    11 - (r_C + r_D)/2

    =
    Taxa   A     B     C
    B      -21
    C      -20   -20
    D      -20   -20   -21

One minimum is $N(A, B) = -21$. Define a new internal node k that pairs off A and B and compute the edge lengths from k to A and from k to B:

$d(A, k) = \frac{1}{2}\left(d(A, B) + \frac{r_A}{2} - \frac{r_B}{2}\right) = \frac{1}{2}(8 + 27/2 - 31/2) = 3$
$d(B, k) = d(A, B) - d(A, k) = 8 - 3 = 5$

Compute new distances D':

    Taxa   k          C
    C      d(k, C)
    D      d(k, D)    11

with

$d(k, C) = \frac{1}{2}(d(A, C) + d(B, C) - d(A, B)) = 4$
$d(k, D) = \frac{1}{2}(d(A, D) + d(B, D) - d(A, B)) = 9$

2. Iteration: Compute $N^1$. We have $L = \{k, C, D\}$, $|L| = 3$, $|L| - 2 = 1$ and

$r_k = 4 + 9 = 13$, $r_C = 4 + 11 = 15$, $r_D = 9 + 11 = 20$

    N^1 =
    Taxa   k                   C
    C      4 - (13 + 15)/1
    D      9 - (13 + 20)/1     11 - (15 + 20)/1

    =
    Taxa   k     C
    C      -24
    D      -24   -24

A minimum is e.g. $N^1(k, C)$. Define a new node l that pairs off k and C:

$d(C, l) = \frac{1}{2}(d(k, C) + r_C - r_k) = \frac{1}{2}(4 + 15 - 13) = 3$
$d(k, l) = d(k, C) - d(C, l) = 4 - 3 = 1$

Termination: Only two nodes, D and l, are left. Compute the length of the edge between l and D:

$d(l, D) = d(C, D) - d(l, C) = 11 - 3 = 8$

In the final tree, A and B are attached to k by edges of length 3 and 5, C and D are attached to l by edges of length 3 and 8, and k and l are connected by an edge of length 1.
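The worked example can be replayed with a compact implementation of the Neighbor-Joining iteration. The following Python code is a sketch rather than a reference implementation: the function name `neighbor_joining`, the internal node names `n1`, `n2`, ... and the choice to break ties by first occurrence are assumptions, and the input distances are those of the four-taxon example as reconstructed here.

```python
from fractions import Fraction
from itertools import combinations

def neighbor_joining(taxa, dist):
    """Neighbor-Joining on a dict dist[frozenset({a, b})] -> distance.

    Returns a list of edges (node, node, length). New internal nodes are
    named "n1", "n2", ... (hypothetical naming). Ties are broken by first
    occurrence, so the order of joins may differ from a hand calculation.
    """
    d = {pair: Fraction(value) for pair, value in dist.items()}

    def get(a, b):
        return d[frozenset((a, b))]

    nodes, edges, count = list(taxa), [], 0
    while len(nodes) > 2:
        m = len(nodes) - 2
        r = {a: sum(get(a, b) for b in nodes if b != a) / m for a in nodes}
        # pick a pair minimizing N_ij = d_ij - (r_i + r_j)
        i, j = min(combinations(nodes, 2),
                   key=lambda p: get(*p) - r[p[0]] - r[p[1]])
        count += 1
        k = f"n{count}"
        dij = get(i, j)
        dik = (dij + r[i] - r[j]) / 2            # edge length from i to k
        edges += [(i, k, dik), (j, k, dij - dik)]
        for x in nodes:                          # distances from k to the rest
            if x not in (i, j):
                d[frozenset((k, x))] = (get(i, x) + get(j, x) - dij) / 2
        nodes = [x for x in nodes if x not in (i, j)] + [k]
    a, b = nodes
    edges.append((a, b, get(a, b)))              # remaining edge
    return edges

example = {frozenset(p): v for p, v in {
    ("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
    ("B", "C"): 9, ("B", "D"): 14, ("C", "D"): 11}.items()}

for u, v, length in neighbor_joining("ABCD", example):
    print(u, v, length)
```

Exact fractions are used so that ties in N are detected exactly. In the second iteration all entries of N are equal, and this sketch happens to join C and D instead of k and C; for these distances both choices yield the same unrooted tree and the same edge lengths.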

[Figure: the final Neighbor-Joining tree on A, B, C, D with edge lengths 3, 5, 3 and 8 and an interior edge of length 1]

Example of Neighbor-Joining applied to 5S rRNA data.

Abbreviations:
Bsu: Bacillus subtilis
Bst: Bacillus stearothermophilus
Lvi: Lactobacillus viridescens
Amo: Acholeplasma modicum
Mlu: Micrococcus luteus

[Table: original pairwise distances between Bsu, Bst, Lvi, Amo and Mlu; values not reproduced]

The resulting NJ-tree:

[Figure: unrooted NJ tree on Amo, Lvi, Bsu, Bst and Mlu]

7 Rooting unrooted trees

In contrast to UPGMA, most tree reconstruction methods produce an unrooted tree. Indeed, determining the root of a tree using computational methods is very difficult.

In practice, the question of rooting a tree is addressed by adding an outgroup to the set of taxa under consideration. This is a taxon that is slightly more distantly related to each of the original taxa than the original taxa are to one another. The root is then assumed to be on the branch attaching the outgroup taxon to the rest of the tree.

In practice, selecting an appropriate outgroup can be difficult: if it is too similar to the other taxa, then it might be more closely related to some than to others. If it is too distant, then there might not be enough similarity to the other taxa to perform meaningful comparisons.

In the above Neighbor-Joining tree, Mlu is the outgroup. Hence, the rooted version of this tree looks something like this:

[Figure: rooted version of the NJ tree on Bsu, Bst, Lvi, Amo and Mlu, with Mlu as outgroup]

This tree is believed to be more correct than the one produced using UPGMA:

[Figure: UPGMA tree with leaves Bsu, Bst, Mlu, Lvi, Amo]

In particular, the UPGMA tree does not separate the outgroup from all other taxa by the root node. The reason why UPGMA produces an incorrect tree is that two of the sequences, those of L. viridescens (Lvi) and A. modicum (Amo), are much more diverged than the others.

8 Sequence-based methods

In sequence-based methods for phylogeny, the input consists of a multiple sequence alignment M on a set of taxa X, and a phylogenetic tree T is determined for M, usually by performing a search in tree space to find an optimal phylogenetic tree or trees.

The three main approaches are maximum parsimony, maximum likelihood and Bayesian inference. We will discuss the first of these three approaches.

9 Maximum parsimony

Maximum parsimony is one of the most widely used sequence-based tree reconstruction methods.

In science, the principle of maximum parsimony is a preference for the least complex explanation for an observation.

In phylogenetic analysis, the maximum parsimony problem is to find a phylogenetic tree T on X that explains a given multiple alignment of sequences on X using a minimum number of evolutionary events.

9.1 The parsimony score of a tree

The difference between two sequences $x = x_1 \dots x_L$ and $y = y_1 \dots y_L$ is simply their non-normalized Hamming distance $Ham(x, y) = |\{k : x_k \ne y_k\}|$.

Assume we are given a multiple sequence alignment $A = (a_1, a_2, \dots, a_n)$ on X and a corresponding phylogenetic tree T on X.

If we assign the aligned sequences to the leaves and hypothetical ancestor sequences to the internal nodes of T, then we can obtain a score for T together with this assignment, by summing over all differences $Ham(x, y)$, where x and y are any two sequences labeling two nodes that are joined by an edge in T.

Example:

[Figure: tree with sequences assigned to all nodes, the number of mismatches on each edge, and the resulting score]

The minimum value obtainable in this way is called the parsimony score $PS(T, A)$ of T and A:

Definition 9.1 (Parsimony score of a tree) Given a multiple alignment A of sequences of length L and a corresponding phylogenetic tree T on X, its parsimony score is defined as:

$PS(T, A) = \min_{\lambda} \sum_{\{u,v\} \in E} Ham(u, v),$

where the minimum is taken over all possible labelings $\lambda$ of the internal nodes of T by hypothetical ancestor sequences of length L.

9.2 The Small Parsimony Problem

Problem 9.2 (Small Parsimony Problem) The Small Parsimony Problem is to compute the parsimony score for an alignment A of sequences on a phylogenetic tree T.

Can this be solved efficiently?

As the parsimony score is obtained by summing over all columns, the columns are independent and so it suffices to discuss how to obtain an optimal assignment for one position.

Example:

[Figure: tree with a single character state at each leaf]

9.3 The Fitch algorithm

In 1971, Walter Fitch published a dynamic programming algorithm that solves the small parsimony problem efficiently [3].

Input: A phylogenetic tree T with n leaves, and a single character c with k possible states. Denote the state of the character for node v by c(v).

The Fitch algorithm consists of two passes:

1. The first (bottom up) pass computes the parsimony score and a set of possible states for each node.
2. The second (top down) pass chooses a state for each node.

The following algorithm ("bottom up pass") computes the parsimony score for T and a fixed column c in the sequence alignment. Initially it is called with v equal to the root node and $PS(T, c) = 0$.

Algorithm 9.3 (ParsimonyScore(v), Walter Fitch, 1971)
Input: A binary phylogenetic tree T, a state c(w) for each leaf w of T
Output: The parsimony score $PS(T, c)$ for T and c
Set score = 0
if v is a leaf node then
    Set $F(v) = \{c(v)\}$
else
    for the two children $w_1$ and $w_2$ of v do
        Call ParsimonyScore($w_i$) to compute $F(w_i)$ and add the result to score
    if $F(w_1) \cap F(w_2) \ne \emptyset$ then
        Set $F(v) = F(w_1) \cap F(w_2)$
    else
        Set $F(v) = F(w_1) \cup F(w_2)$ and increment score
return score

[3] Fitch, W. (1971) Toward defining the course of evolution: minimum change for a specific tree topology. Syst. Zoology 20:406-416

Example: use the Fitch algorithm to compute the parsimony score for the following labeled tree:

[Figure: labeled tree with the sets F(v) computed at each node]

The first pass computes the parsimony score.

An optimal labeling of the internal nodes is obtained via the second pass ("top down"): Starting at the root node r, we label r using any character in F(r). Then, for each child w, we use the same letter if it is contained in F(w); otherwise we use any letter in F(w) as the label for w. We then visit the children of w, and so on.

This algorithm also requires O(nL) steps in total.

9.4 Fitch: Second pass

Example:

[Figure: a Fitch labeling with sets such as {A,T}, one traceback result, another traceback result, and an optimal labeling that is not obtainable by traceback]

The following algorithm computes the assignments of the internal nodes for T and a fixed column c in the sequence alignment. It starts at the root node r.

Algorithm 9.4 (Fitch algorithm, second pass)
Input: A phylogenetic tree T and the set F(w) for every node w in T
Output: The fully labeled tree T
for each node v, in top-down order do
    if v is the root node then
        Choose $t \in F(v)$ and set c(v) = t
    else
        Let w be the parent node of v
        if $c(w) \in F(v)$ then
            Set c(v) = c(w)
        else
            Choose $t \in F(v)$ and set c(v) = t
end

9.5 Evaluating the score on different trees

Assume we are given four aligned sequences, Seq 1 to Seq 4.

[Alignment: Seq 1, Seq 2, Seq 3, Seq 4; sequences not reproduced]

There are three possible binary (unrooted) topologies on four taxa:

[Figure: the three unrooted topologies on the four sequences]

In each tree, label all internal nodes with sequences so as to minimize the score obtained by summing all mismatches along edges. Which tree minimizes this score?

9.6 The Fitch algorithm on unrooted trees

Above, we formulated the Fitch algorithm for rooted trees. However, the minimum parsimony cost is independent of where the root is located in the tree T.

To see this, consider a root node $\rho$ connected to two nodes v and w, and let $c(\cdot)$ denote the character assigned to a node in the Fitch algorithm:

1. If $c(v) = c(w)$, then $c(\rho) = c(v) = c(w)$, because any other choice would increase the score.
2. If $c(v) \ne c(w)$, then either $c(\rho) = c(v)$ or $c(\rho) = c(w)$, and both choices contribute 1 to the score.

In both cases, we could replace $\rho$ by a new edge e that connects v and w without changing the parsimony score of T.
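The two passes of the Fitch algorithm can be sketched compactly in Python for a single character. The representation below (nested 2-tuples for the rooted binary tree, a dictionary of leaf states), the function name `fitch`, and the example tree and states are all assumptions made for illustration; where the algorithm says "choose any state", this sketch picks the alphabetically smallest one.

```python
def fitch(tree, states):
    """Fitch small parsimony for one character on a rooted binary tree.

    tree   : nested 2-tuples with leaf names at the tips, e.g. (("a", "b"), "c")
    states : dict mapping leaf name -> character state
    Returns (score, labeling); the labeling is (state, left, right) for
    internal nodes and (leaf_name, state) for leaves.
    """
    score = 0

    def up(node):                                  # first pass: bottom up
        nonlocal score
        if not isinstance(node, tuple):            # leaf: F(v) = {c(v)}
            return ({states[node]}, node)
        left, right = up(node[0]), up(node[1])
        common = left[0] & right[0]
        if common:                                 # non-empty intersection
            return (common, left, right)
        score += 1                                 # empty: take union, count a change
        return (left[0] | right[0], left, right)

    def down(annotated, parent_state):             # second pass: top down
        F = annotated[0]
        # keep the parent's state if it is in F(v), otherwise choose one from F(v)
        state = parent_state if parent_state in F else min(F)
        if len(annotated) == 2:                    # leaf entry is (F, name)
            return (annotated[1], state)
        return (state, down(annotated[1], state), down(annotated[2], state))

    annotated = up(tree)
    return score, down(annotated, None)

# Hypothetical example: five leaves with states A, G, G, A, C
tree = ((("s1", "s2"), "s3"), ("s4", "s5"))
states = {"s1": "A", "s2": "G", "s3": "G", "s4": "A", "s5": "C"}
score, labeling = fitch(tree, states)
print(score)
print(labeling)
```

For a full alignment, the function would be called once per column and the scores summed, since the columns are independent.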

9.7 The Large Parsimony Problem

Definition 9.5 (Parsimony score of an alignment) Given a multiple alignment $A = (a_1, \dots, a_n)$ on X, its parsimony score is defined as

$PS(A) = \min\{PS(T, A) \mid T \text{ is a phylogenetic tree on } X\}.$

Problem 9.6 (Large Parsimony Problem) The Large Parsimony Problem is to compute $PS(A)$ and also

$T_{mp} = \arg\min\{PS(T, A) \mid T \text{ is a phylogenetic tree on } X\}.$

Potentially, we need to consider all $(2n-5)!!$ possible trees. Unfortunately, in general this cannot be avoided, and the maximum parsimony problem is known to be NP-hard.

Exhaustive enumeration of all possible tree topologies will only work for $n \le 10$, say.

10 Tree Searching Methods

The following list summarizes some of the main strategies for either solving the problem exactly or obtaining good approximations:

Exhaustive search (exact)
Branch and bound search (exact)
Stepwise addition (heuristic)
Star decomposition (heuristic)
Branch swapping search (heuristic)

10.1 Branch and bound

Recall how we obtained an expression for the number U(n) of unrooted phylogenetic tree topologies on n taxa:

For $n = 3$ there are three ways of adding an extra edge with a new leaf to obtain an unrooted tree on 4 leaves. This new tree has $2n-3 = 5$ edges (for $n = 4$), so there are 5 ways to obtain a new tree with 5 leaves, and so on.

To be precise, one can obtain any tree $T_{i+1}$ on the label set $\{1, \dots, i, i+1\}$ by adding an extra edge with a leaf labeled $i+1$ to some (unique) tree $T_i$ on $\{1, \dots, i\}$.

In other words, we can produce the set of all possible trees on n taxa by adding one leaf at a time in all possible ways, thus systematically generating a complete enumeration tree.
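To get a feeling for how quickly the search space grows, the counts $U(n) = (2n-5)!!$ and $R(n) = (2n-3)!!$ can be evaluated directly. The following Python sketch is only an illustration; the function names are hypothetical.

```python
def double_factorial_odd(m):
    """Product 1 * 3 * 5 * ... * m for odd m >= 1 (returns 1 for m <= 1)."""
    result = 1
    for factor in range(3, m + 1, 2):
        result *= factor
    return result

def num_unrooted(n):
    """U(n) = (2n-5)!!, the number of unrooted binary trees on n >= 3 labeled leaves."""
    if n < 3:
        raise ValueError("n must be at least 3")
    return double_factorial_odd(2 * n - 5)

def num_rooted(n):
    """R(n) = (2n-3)!! = U(n) * (2n-3), the number of rooted binary trees."""
    return num_unrooted(n) * (2 * n - 3)

for n in (3, 4, 5, 10, 20):
    print(n, num_unrooted(n), num_rooted(n))
```

Already for 10 taxa there are about two million unrooted topologies, which is consistent with the remark that exhaustive enumeration is only practical for roughly $n \le 10$.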

A simple but crucial observation is that adding a new sequence $a_{i+1}$ to a tree $T_i$ to obtain a new tree $T_{i+1}$ cannot lead to a smaller parsimony score.

This gives rise to the following bound criterion when generating the enumeration tree: if the local parsimony score of the current incomplete tree T is larger than or equal to the best global score for any complete tree seen so far, then we do not generate or search the enumeration subtree below T.

Example: Assume we are given an MSA $A = \{a_1, a_2, \dots, a_5\}$ and consider the characters $a_{1i}, a_{2i}, a_{3i}, a_{4i}$ and $a_{5i}$ at a given position i.

[Character states at position i; values not reproduced]

Assume that the Neighbor-Joining tree on A looks like this:

[Figure: Neighbor-Joining tree on $a_1, \dots, a_5$]

The parsimony score is 1, thus we initialize the global bound with 1 for position i.

The first step in generating the enumeration tree is this:

[Figure: the three trees obtained by attaching the next leaf. First tree: local = 1, continue. Second and third tree: local = 2, don't continue.]

Only the first of the three trees passes the bound test, and we do not pursue the other two trees.

The second step in generating the enumeration tree looks like this:

[Figure: the five trees obtained in the second step; the first three have local = 1, the last two have local = 2]

The first three trees are optimal.

Application of branch-and-bound to evolutionary trees was first suggested by Mike Hendy and Dave Penny (1982) [4].

In practice, using branch and bound one can obtain exact solutions for data sets of twenty or more sequences, depending on the sequence length and the messiness of the data.

A good starting strategy is to first compute a tree $T_0$ for the data, e.g. using Neighbor-Joining, and then to initialize the global bound to the parsimony score of $T_0$.

10.2 The stepwise-addition heuristic

Now we discuss a simple greedy heuristic (Felsenstein 1981) for approximating the optimal tree or score: Build the tree T by adding one leaf after the other, in each step greedily choosing the optimal position for the new leaf-edge:

Given a multiple sequence alignment $A = \{a_1, a_2, \dots, a_n\}$, start with a tree $T_2$ consisting of two leaves labeled $a_1$ and $a_2$.

Given $T_i$, for each edge e in $T_i$, obtain a new tree $T_i^e$ as follows: Insert a new node v in e and join it via a new edge f to a new leaf w with label $a_{i+1}$.

Set $T_{i+1} = \arg\min_e \{PS(T_i^e, A)\}$.

[Figure: a tree T with edges $e_1, e_2, e_3$ and the three trees $T^{e_1}$, $T^{e_2}$, $T^{e_3}$ obtained by attaching the new leaf to each edge]

This approach does not guarantee an optimal result. Moreover, the result obtained will depend on the order in which the sequences are processed.

[4] Hendy, M. D. and Penny, D. (1982). Branch and bound algorithms to determine minimal evolutionary trees. Mathematical Biosciences 59: 277-290

11 Phylogenetics Software

Recommended software packages in phylogenetics:

1. Dendroscope: An interactive Java program for viewing phylogenetic trees and rooted networks.
2. Paup: A powerful and widely used commercial program for computing phylogenetic trees.
3. Phylip: A collection of command-line programs for computing phylogenetic trees, which can also be run within a web application.
4. SplitsTree: An interactive Java program for computing phylogenetic trees and networks.

12 Summary

Phylogenetic trees are used to represent the evolutionary relationships between species, in terms of monophyletic groups.

There are a number of different types of methods for computing such trees from molecular data:

Distance methods such as UPGMA and NJ compute trees from distance matrices.

Maximum parsimony searches for a tree that explains a given sequence alignment using the smallest number of substitutions.

Additional methods include maximum likelihood estimation and Bayesian inference.

13 Tree traversals

In a tree traversal we visit and process every node in a tree exactly once. Such traversals are classified by the order in which the nodes are examined. In the following definition we assume that the trees are rooted and bifurcating. However, all the algorithms can be easily adapted to the case of non-bifurcating trees.

Definition 13.1 (Tree traversals) Let T be a bifurcating rooted tree. There are four types of tree traversals:

Preorder traversal: Examine the root of T. Then traverse its left subtree in preorder, then traverse its right subtree in preorder.

Postorder traversal: First traverse the left subtree of the root in postorder, then traverse its right subtree in postorder, and finally examine the root.

Inorder traversal: First traverse the left subtree of the root in inorder, then examine the root, and then traverse its right subtree in inorder.

Breadth-first traversal: Put the root in a queue. Repeat the following until the queue is empty: Pop a node off the beginning of the queue and examine it. Then, if the node has any children, push them onto the end of the queue.

[Figure: a bifurcating tree rooted at node $\rho$, with internal nodes g, h, i, j and leaves a, b, c, d, e, f]

The different traversals of the above tree give rise to different orders in which the nodes are examined. The order in which nodes are examined in each of the traversals is:

    Traversal        Order in which nodes are examined
    preorder:
    postorder:
    inorder:
    breadth-first:

A pre-order traversal is also called a top down algorithm. A post-order traversal is also called a bottom up algorithm.
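The four traversals translate almost literally into code. The following Python sketch assumes a minimal `Node` class with optional left and right children; the class, the function names and the small example tree are hypothetical, and the example is not the tree from the figure.

```python
from collections import deque

class Node:
    """A node of a rooted binary tree; leaves have no children."""
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right

def preorder(v):
    """Root first, then left subtree, then right subtree."""
    return [] if v is None else [v.label] + preorder(v.left) + preorder(v.right)

def postorder(v):
    """Left subtree, then right subtree, then root."""
    return [] if v is None else postorder(v.left) + postorder(v.right) + [v.label]

def inorder(v):
    """Left subtree, then root, then right subtree."""
    return [] if v is None else inorder(v.left) + [v.label] + inorder(v.right)

def breadth_first(root):
    """Examine nodes level by level using a queue."""
    order, queue = [], deque([root])
    while queue:
        v = queue.popleft()                 # pop from the beginning of the queue
        order.append(v.label)
        queue.extend(c for c in (v.left, v.right) if c is not None)
    return order

# Hypothetical example: root r with children x and c; x has children a and b
tree = Node("r", Node("x", Node("a"), Node("b")), Node("c"))
for traversal in (preorder, postorder, inorder, breadth_first):
    print(traversal.__name__, traversal(tree))
```

The bottom-up pass of the Fitch algorithm corresponds to a postorder traversal and its top-down pass to a preorder traversal.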
