Grundlagen der Bioinformatik, SoSe 11, D. Huson, May 16, 2011


Phylogeny

Further reading and sources for parts of this chapter:

- R. Durbin, S. Eddy, A. Krogh & G. Mitchison, Biological sequence analysis, Cambridge, 1998.
- Jones & Pevzner, An Introduction to Bioinformatics Algorithms, 2004.
- J. Setubal & J. Meidanis, Introduction to computational molecular biology.
- D.W. Mount, Bioinformatics: Sequence and Genome Analysis, 2001.
- D.L. Swofford, G.J. Olsen, P.J. Waddell & D.M. Hillis, Phylogenetic Inference, in: D.M. Hillis (ed.), Molecular Systematics, 2nd ed., Sunderland Mass.

The goal of phylogenetic analysis is to determine and describe the evolutionary relationship between a collection of extant species. In particular, this involves determining the order and approximate timing of speciation events.

It is generally assumed that speciation is a branching process: a population of organisms becomes separated into two sub-populations. Over time, these evolve into separate species that do not crossbreed.

Because of this assumption, a tree is often used to represent a proposed phylogeny for a set of species, showing how the species evolved from a common ancestor.

0.1 Early evolutionary trees

Charles Darwin formulated the foundation of the theory of evolution. In his famous book The Origin of Species (published in 1859) Darwin wrote: "Species have changed, and are still slowly changing, by the preservation and accumulation of successive slight favorable variations."

In a notebook he drew this picture illustrating the concept of an evolutionary tree linking different species:

[Figure: Darwin's notebook sketch of an evolutionary tree]

Darwin's work inspired Ernst Haeckel (in Jena) to draw many evolutionary trees, such as this one (1866):

[Figure: evolutionary tree drawn by Ernst Haeckel, 1866]

0.2 The Giant Panda riddle

Early phylogenetic studies were based solely on morphological characters, which did not always give a clear answer. For roughly 100 years scientists were unable to decide which family the giant panda belongs to. Giant pandas look like bears but have features that are unusual for bears and typical for raccoons, e.g., they do not hibernate.

In 1985, Steven O'Brien and colleagues [1] solved the giant panda classification problem using DNA sequences and algorithms.

0.3 Using sequences for trees

Once the role of DNA was discovered (1953), the idea developed that one could use molecular sequences to build phylogenetic trees.

1958: Crick proposed to use molecular sequences for phylogenetic tree reconstruction.

1960s: Zuckerkandl and Pauling built the first phylogenetic tree from aligned amino acid sequences. They also formulated the molecular clock hypothesis.

[1] O'Brien SJ, Nash WG, Wildt DE, Bush ME, Benveniste RE. A molecular solution to the riddle of the giant panda's phylogeny. (1985) Nature 317(6033):140-4.

1967: Fitch & Margoliash published the first algorithm for computing phylogenetic trees from sequence data, together with this tree:

[Figure: phylogeny based on the protein sequence of the gene cytochrome c]

Carl Woese and colleagues identified SSU rRNA as an ideal "universal molecular chronometer":

- It is abundant and coded for by organellar, nuclear and prokaryotic genomes.
- It has slow- and fast-evolving portions.
- It has a universally conserved structure.

The modern Tree of Life is based on SSU (16S and 18S) rRNA:

[Figure: Tree of Life based on SSU rRNA]

1 Basic concepts

A phylogeny describes the evolutionary history of a set of taxa. The ultimate goal of any phylogenetic analysis is to reconstruct some part of the Tree of Life, the evolutionary history of all life on Earth.

Definition 1.1 (Related species) Two species are called related if they share a recent common ancestor. We say that a species a is more closely related to a species b than to a species c, if the last common ancestor of a and b is more recent than the last common ancestor of a and c.

For example, humans are more closely related to mice (last common ancestor about 100 million years ago) than they are to fruit flies (last common ancestor about 600 million years ago).

A set of species is called a cluster.

Definition 1.2 (Clusters and monophyletic groups) A cluster $C$ is called a clade or monophyletic group, if all species in $C$ are more closely related to each other than to any species outside of $C$; in other words, if all species in $C$ are descended from a common ancestor, which is not an ancestor of any other species under consideration.

For example, the set of all marsupials forms a clade within the class of all mammals.
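The clade definition can be made concrete with a small check. The following is a minimal sketch, assuming a rooted tree stored as nested Python tuples with strings as leaves; the representation, the function names `leaf_set` and `is_clade`, and the toy four-taxon tree are illustrative assumptions, not part of the lecture notes. A cluster is a clade exactly when it equals the set of leaves below some node of the tree.

```python
def leaf_set(tree):
    """Return the set of leaf labels below a node of a nested-tuple tree."""
    if isinstance(tree, str):          # a leaf is represented by its label
        return {tree}
    return set().union(*(leaf_set(child) for child in tree))


def is_clade(tree, cluster):
    """True if `cluster` is exactly the leaf set of some node of `tree`."""
    cluster = set(cluster)
    if leaf_set(tree) == cluster:
        return True
    if isinstance(tree, str):
        return False
    return any(is_clade(child, cluster) for child in tree)


# Hypothetical toy tree: two marsupials and two placental mammals.
mammals = (("kangaroo", "koala"), ("human", "mouse"))

print(is_clade(mammals, {"kangaroo", "koala"}))   # marsupials form a clade
print(is_clade(mammals, {"koala", "human"}))      # this cluster does not
```

The check recomputes leaf sets at every node, which is fine for small trees; for large trees one would compute each leaf set once in a single bottom-up pass.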

Given a set of species, the main goal of phylogenetic analysis is to compute a set of clusters such that each cluster is a monophyletic group, and then to represent the clusters as a rooted phylogenetic tree.

A tree consists of nodes connected by edges (also called branches). Terminal nodes (also called leaves) represent sequences or species for which we have data. Internal nodes represent hypothetical ancestors.

[Figure: a rooted tree illustrating the terms root, internal node (hypothetical ancestor), terminal node (leaf), interior edge, dichotomy and polytomy]

1.1 Phylogenetic trees

In the following, we will use $X = \{x_1, x_2, \dots, x_n\}$ to denote a set of taxa, where a taxon $x_i$ is simply a representative of a group of individuals defined in some way.

Definition 1.3 A phylogenetic tree (on $X$) is a system $T = (V, E, \lambda)$ consisting of a connected graph $(V, E)$ without cycles, together with a labeling $\lambda$ of the leaves by elements of $X$, such that:

1. every leaf is labeled by exactly one taxon, and
2. every taxon appears exactly once as a label.

1.2 Unrooted trees

An unrooted phylogenetic tree is obtained by placing a set of taxa on the leaves of a tree:

[Figure: taxa X + tree = phylogenetic tree T on X. The taxa Pan_panisc, Gorilla, Homo_sap, Mus_mouse, Rattus_norv, harbor_sel, Bos_ta(cow), fin_whale and blue_whale are placed on the leaves of an unrooted tree.]

1.3 Rooted trees

Example: Modern version of the tree of life:

[Figure: modern version of the tree of life]

1.4 The number of edges and nodes of an unrooted phylogenetic tree

Let $T$ be an unrooted phylogenetic tree on $n$ taxa, i.e., with $n$ leaves. How many nodes and edges does $T$ have?

Let us assume that $T$ is binary (or bifurcating, each internal node being incident to precisely three edges). Any non-binary tree on $n$ taxa will have fewer nodes and edges.

[Figure: examples of unrooted binary trees on four and five taxa]

Lemma 1.4 (Nodes and edges in unrooted tree) A binary phylogenetic tree $T$ on $n$ taxa has $2n-2$ nodes, $2n-3$ edges and $n-3$ interior edges for all $n \ge 3$.

1.5 The number of rooted phylogenetic trees

An unrooted tree $T$ with $n$ leaves has $2n-2$ nodes and $2n-3$ edges. A root can be added in any of the $2n-3$ edges, thus producing $2n-3$ different rooted trees from $T$:

[Figure: the three rooted trees obtained from the unrooted tree on the leaves a, b, c]

For $n = 3$ there are three ways of adding a root. Similarly, there are 3 different ways of adding an extra edge with a new leaf to obtain an unrooted tree on 4 leaves. This new tree has $2n-3 = 5$ edges and there are 5 ways to obtain a new tree with 5 leaves, etc. Continuing this, we see that there are

$$U(n) = (2n-5)!! := 3 \cdot 5 \cdot 7 \cdots (2n-5)$$

unrooted trees on $n$ leaves. There are

$$R(n) = (2n-3)!! = U(n) \cdot (2n-3) = 3 \cdot 5 \cdots (2n-3)$$

rooted trees.

# taxa | # pairs        | # edges | U(n)    | R(n)
n      | $\binom{n}{2}$ | 2n-3    | (2n-5)!! | (2n-3)!!

2 Constructing phylogenetic trees

There are four main approaches to constructing phylogenetic trees from molecular data.

1. Using a distance method, one first computes a distance matrix from a given set of biological data and then one computes a tree that represents these distances as closely as possible.
2. Maximum parsimony takes as input a set of aligned sequences and attempts to find a tree and a labeling of its internal nodes by auxiliary sequences such that the number of mutations along the tree is minimum.
3. Given a probabilistic model of evolution, maximum likelihood approaches aim at finding a phylogenetic tree that maximizes the likelihood of obtaining the given sequences.
4. Given a probabilistic model of evolution, Bayesian approaches aim at sampling the posterior distribution of trees using an MCMC algorithm.

3 Distances

Definition 3.1 (Distance matrix) Let $X = \{x_1, x_2, \dots, x_n\}$ be a set of taxa. A metric $D : X \times X \to \mathbb{R}_{\ge 0}$ assigns a distance $d(x_i, x_j) \ge 0$ to every pair of taxa $x_i, x_j \in X$. Sometimes we may abbreviate $d_{ij} = d(x_i, x_j)$. We usually require:

1. Symmetry: $d(x, y) = d(y, x)$ for all $x, y \in X$.
2. Identity: $d(x, y) = 0$ if and only if $x = y$.
3. Triangle inequalities: $d(x, z) \le d(x, y) + d(y, z)$ for all $x, y, z \in X$.
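The three requirements are easy to test on a concrete matrix. Here is a minimal sketch, assuming the distances are given as a square list of lists; the function name `check_metric` and the two small example matrices are made up for illustration.

```python
from itertools import permutations


def check_metric(d, tol=1e-12):
    """Report which of the three metric properties hold for the matrix d."""
    n = len(d)
    idx = range(n)
    symmetry = all(abs(d[i][j] - d[j][i]) <= tol for i in idx for j in idx)
    # Identity: distance is zero exactly on the diagonal.
    identity = all((d[i][j] <= tol) == (i == j) for i in idx for j in idx)
    # Triangle inequality: d(x, z) <= d(x, y) + d(y, z) for all triples.
    triangle = all(d[i][k] <= d[i][j] + d[j][k] + tol
                   for i, j, k in permutations(idx, 3))
    return {"symmetry": symmetry, "identity": identity, "triangle": triangle}


good = [[0, 3, 5],
        [3, 0, 4],
        [5, 4, 0]]
bad = [[0, 1, 5],      # d(0, 2) = 5 > d(0, 1) + d(1, 2) = 2
       [1, 0, 1],
       [5, 1, 0]]

print(check_metric(good))
print(check_metric(bad))
```

The triangle test looks at all ordered triples and therefore takes cubic time, which is unproblematic for the matrix sizes that arise in hand-checked examples.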

In phylogenetics, often only one direction of the identity property holds: $x = y$ implies $d(x, y) = 0$, but two distinct taxa may also have distance 0. Also, sometimes the triangle inequalities are violated. To allow these two violations, we will use the less formal term distance matrix.

3.1 Hamming distance

Let a collection of taxa be given by a set of distinct sequences $\mathcal{A} = \{A_1, A_2, \dots, A_n\}$ and assume we are given a multiple sequence alignment of the sequences.

Definition 3.2 (Hamming distance) We define the (normalized) Hamming distance $\mathrm{Ham}(A_i, A_j)$ between two taxa $A_i$ and $A_j$ (also called the observed p-distance) as the number of mismatch positions in $A_i$ and $A_j$, divided by the number of comparisons.

We ignore any column in which both sequences contain a gap. If only one sequence has a gap in a column, then we can either ignore the column, or treat it as a match, or as a mismatch, depending on the type of data. Sometimes, one ignores all columns in which any of the $n$ sequences contains a gap.

Example:

[Alignment of three sequences $A_1$, $A_2$, $A_3$]

Distances:

$\mathrm{Ham}(A_1, A_2) = 4/12 \approx 0.33$
$\mathrm{Ham}(A_1, A_3) = 5/11 \approx 0.45$
$\mathrm{Ham}(A_2, A_3) = 3/11 \approx 0.27$

Hamming distances are only suitable for closely related sequences because multiple mutations at the same position and back-mutations are not counted. To compensate for this in practice, Hamming distances can be transformed based on some model of evolution such as the Jukes-Cantor model.

4 Distance-Based Tree Reconstruction

Problem 4.1 (Distance-Based Phylogeny Problem) Reconstruct an evolutionary tree on $n$ leaves from an $n \times n$ distance matrix.

Input: An $n \times n$ distance matrix $D = (d_{ij})$ and a labeling $\lambda$.
Output: A phylogenetic tree $T$ (rooted or unrooted) with $n$ leaves and edge lengths.

5 UPGMA

We will now discuss a simple distance method called UPGMA, which stands for unweighted pair group method using arithmetic averages (Sokal & Michener 1958).

Given a set of taxa $X$ and a distance matrix $D$, UPGMA produces a rooted phylogenetic tree $T$ with edge lengths.
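The input matrix D for such a method is often computed from an alignment with the normalized Hamming distance defined above. The following is a minimal sketch under two assumptions that are not fixed by the notes: columns with a gap in either sequence are ignored (one of the options mentioned in the definition), and the optional correction uses the standard Jukes-Cantor formula for nucleotides, $d = -\frac{3}{4}\ln(1 - \frac{4}{3}p)$. The function names and the two short sequences are hypothetical.

```python
import math


def hamming_distance(s, t, gap="-"):
    """Normalized Hamming distance (observed p-distance) of two aligned rows.

    Assumption: any column with a gap in either sequence is skipped.
    """
    if len(s) != len(t):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(s, t) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no comparable columns")
    mismatches = sum(a != b for a, b in pairs)
    return mismatches / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed p-distance p < 3/4."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = hamming_distance("ACGT-A", "AGGTTA")   # 1 mismatch in 5 compared columns
print(p, jukes_cantor(p))
```

The corrected value is always at least as large as the observed one, reflecting the multiple and back mutations that the plain count misses.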
UPGMA operates by clustering the given taxa, at each stage merging two clusters and at the same time creating a new node in the tree. The tree is assembled bottom-up, first clustering pairs of leaves, then pairs of clustered leaves, etc. Each node is given a height and the length of an edge is obtained as the difference of the heights of its two end nodes.

5.1 UPGMA example

Example: $X = \{1, 2, 3, 4, 5\}$, with distances given by distance in the plane:

[Figure: four steps of UPGMA on five points in the plane.
Cluster 1 and 2 into node 6, with $t_1 = t_2 = \frac{1}{2} d(1, 2)$.
Cluster 4 and 5 into node 7, with $t_4 = t_5 = \frac{1}{2} d(4, 5)$.
Cluster 7 and 3 into node 8, placed at height $\frac{1}{2} d(3, 7)$.
Cluster 6 and 8 into the root, placed at height $\frac{1}{2} d(6, 8)$.]

UPGMA produces a rooted, binary phylogenetic tree.

5.2 The distance between two clusters

Initially, we are given a distance $d(x, y)$ between any two taxa, i.e. leaves, $x$ and $y$.

We define the distance $d(i, j) = d(C_i, C_j)$ between two clusters $C_i \subseteq X$ and $C_j \subseteq X$ to be the average distance between pairs of taxa from each cluster:

$$d(i, j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i,\, y \in C_j} d(x, y).$$

Update formula: If $C_k$ is the union of two clusters $C_i$ and $C_j$, and $C_l$ is any other cluster, then

$$d(k, l) = \frac{d(i, l)\,|C_i| + d(j, l)\,|C_j|}{|C_i| + |C_j|}.$$

This update formula is useful because it lets the algorithm obtain the distance between the new cluster and any other cluster in constant time, without averaging over all pairs of taxa again.

5.3 Edge length computation

Edge lengths are set as shown here:

[Figure: clusters i and j are joined at a node ij of height $d(i,j)/2$, and clusters k and l at a node kl of height $d(k,l)/2$. The edges from the parent of ij and kl have lengths $d(ij,kl)/2 - d(i,j)/2$ and $d(ij,kl)/2 - d(k,l)/2$.]

5.4 Application of the UPGMA algorithm

Example of UPGMA applied to 5S rRNA data.

Abbreviations:

Bsu: Bacillus subtilis
Bst: Bacillus stearothermophilus
Lvi: Lactobacillus viridescens
Amo: Acholeplasma modicum
Mlu: Micrococcus luteus

[Table: original pairwise distances between Bsu, Bst, Lvi, Amo and Mlu]

The minimum is d(Bst, Bsu). We define the node connecting the nodes labeled Bst and Bsu, and place it at height d(Bst, Bsu)/2. Using the update formula we compute, e.g., d((Bst, Bsu), Lvi).

[Table: distances between Bsu+Bst, Lvi, Amo and Mlu]

The minimum is now d(Bst+Bsu, Mlu). We define the node connecting the node Bst+Bsu and the node labeled Mlu, and place it at height d(Bst+Bsu, Mlu)/2. Using the update formula we compute, e.g., d((Bst+Bsu+Mlu), Lvi).

[Table: distances between Bsu+Bst+Mlu, Lvi and Amo]

The minimum is now d(Lvi, Amo). We define the node connecting the nodes labeled Lvi and Amo, and place it at height d(Lvi, Amo)/2. Using the update formula we compute d(Bst+Bsu+Mlu, Lvi+Amo).

[Table: distance between Bsu+Bst+Mlu and Lvi+Amo]
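The merge steps traced in this example follow one simple loop. Below is a minimal sketch of that loop, assuming distances are stored in a dictionary keyed by unordered pairs; the function name `upgma`, the naming of each new cluster by the tuple of its two children, and the four-leaf distance data are illustrative assumptions, not the 5S rRNA data. The loop here runs until a single cluster is left, which is equivalent to stopping at two clusters and then placing the root.

```python
from itertools import combinations


def upgma(names, dist):
    """UPGMA clustering.

    names: leaf labels; dist: dict mapping frozenset({a, b}) to d(a, b).
    Returns (root, height, edges), where edges are (child, parent, length).
    """
    d = dict(dist)
    size = {x: 1 for x in names}          # number of taxa in each cluster
    height = {x: 0.0 for x in names}      # leaves sit at height 0
    clusters = list(names)
    edges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters.
        i, j = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        k = (i, j)                         # new node, named by its children
        size[k] = size[i] + size[j]
        height[k] = d[frozenset((i, j))] / 2
        edges.append((i, k, height[k] - height[i]))
        edges.append((j, k, height[k] - height[j]))
        # Update formula: size-weighted average of the two old distances.
        for other in clusters:
            if other not in (i, j):
                d[frozenset((k, other))] = (
                    d[frozenset((i, other))] * size[i]
                    + d[frozenset((j, other))] * size[j]
                ) / size[k]
        clusters = [c for c in clusters if c not in (i, j)] + [k]
    return clusters[0], height, edges


# Hypothetical clock-like distances on four leaves.
pairs = {("a", "b"): 2, ("c", "d"): 4,
         ("a", "c"): 8, ("a", "d"): 8, ("b", "c"): 8, ("b", "d"): 8}
toy_dist = {frozenset(p): v for p, v in pairs.items()}
root, height, edges = upgma(["a", "b", "c", "d"], toy_dist)
print(root, height[root])
```

Searching for the minimum over all pairs in every round makes this sketch cubic in the number of taxa; it is meant to mirror the description, not to be efficient.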

[Figure: the resulting UPGMA tree on Bsu, Bst, Mlu, Lvi and Amo]

(However, we will see later that this tree is probably biologically incorrect.)

5.5 The UPGMA algorithm

Algorithm 5.1 (UPGMA)
Input: A set of taxa $X = \{x_1, \dots, x_n\}$ and a corresponding distance matrix $D$
Output: A binary, rooted phylogenetic UPGMA tree $T = (V, E, \omega)$ on $X$
Initialization:
    Set $\mathcal{C} = \{C_1 = \{x_1\}, \dots, C_n = \{x_n\}\}$
    Set $h(\{x_i\}) = 0$ for all $i$
    Set $V = \mathcal{C}$ and $E = \emptyset$
Iteration:
    while $|\mathcal{C}| > 2$ do
        Determine two clusters $C_i$ and $C_j$ for which $d(i, j)$ is minimum
        Define a new cluster $C_k$ by $C_k = C_i \cup C_j$
        Set $\mathcal{C} = (\mathcal{C} \setminus \{C_i, C_j\}) \cup \{C_k\}$
        Set $d(k, l)$ for all clusters $C_l$ using the update formula
        Define node $k$ with children $i$ and $j$, and place it at height $h(k) = d(i, j)/2$
        Set $V = V \cup \{k\}$ and $E = E \cup \{(i, k), (j, k)\}$
        Set $\omega(i, k) = h(k) - h(i)$ and $\omega(j, k) = h(k) - h(j)$
Termination:
    When only two clusters $C_i$ and $C_j$ remain, place the root at height $d(i, j)/2$.

5.6 The molecular clock hypothesis

Given a distance matrix $D$, the UPGMA method aims at building a rooted tree $T$ with the property that all leaves have the same distance from the root $\rho$:

[Figure: a rooted tree in which all leaves have the same distance from the root]

This approach is suitable for sequence data that has evolved under circumstances in which the rate of mutations of sequences is constant over time and equal for all lineages in the tree.

Definition 5.2 The assumption that evolutionary events happen at a constant rate is called the molecular clock hypothesis.

5.7 UPGMA and the molecular clock

If the input distance matrix $D$ was obtained from sequences that fulfill the molecular clock assumption, then the tree $T$ reconstructed by UPGMA from $D$ will be correct.

Otherwise, if the sequences deviate from this assumption, then UPGMA may fail to reconstruct the tree correctly, for example:

[Figure: correct tree $T_0$ and the UPGMA tree $T$]

The problem here is that the closest leaves in $T_0$ are not neighboring leaves: they do not have a common parent node.

6 Neighbor-Joining

The most widely used distance method is Neighbor-Joining (NJ), originally introduced by Saitou and Nei (1987) [2], and modified by Studier and Keppler (1988).

[2] Saitou, N., Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4(4):406-425.

Given a distance matrix $D$, Neighbor-Joining produces an unrooted phylogenetic tree $T$ with edge lengths. It is more widely applicable than UPGMA, as it does not assume a molecular clock.

Starting from a star tree topology, we assume that we are building a tree based on $D$ by repeatedly pairing (or "joining") neighboring taxa:

[Figure: a star tree being resolved by repeatedly joining pairs of neighbors]

How do we determine which nodes are neighbors? It is not correct simply to pick a pair $i, j$ for which $d_{ij}$ is minimum.

Example: assume the true tree is

[Figure: unrooted tree on $x_1, x_2, x_3, x_4$ in which $x_1, x_3$ are neighbors and $x_2, x_4$ are neighbors; the edges leading to $x_3$ and $x_4$ are long (0.4)]

Given distances generated by this tree:

D     | x1  | x2  | x3  | x4
x1    | 0   | 0.3 | 0.5 | 0.6
x2    |     | 0   | 0.6 | 0.5
x3    |     |     | 0   | 0.9
x4    |     |     |     | 0

$x_1$ and $x_2$ have minimum distance, but are not neighbors.

To avoid this problem, the trick is to subtract the averaged distances to all other leaves, thus compensating for long edges. We define a matrix $N = (N_{ij})$ with

$$N_{ij} := d_{ij} - (r_i + r_j), \quad \text{where} \quad r_i = \frac{1}{|L| - 2} \sum_{k \in L} d_{ik},$$

and $L$ denotes the set of leaves. (Note that this is not precisely the average, as the number of non-zero summands is $|L| - 1$, not $|L| - 2$.)

Let us illustrate this using the previous example:

$r_1 = \frac{1}{4-2}(0.3 + 0.5 + 0.6) = 0.7$, and $r_2 = 0.7$, $r_3 = 1.0$ and $r_4 = 1.0$.

Then, e.g., $N_{12} = d_{12} - (r_1 + r_2) = 0.3 - (0.7 + 0.7) = -1.1$.

N     | x1  | x2   | x3   | x4
x1    |     | -1.1 | -1.2 | -1.1
x2    |     |      | -1.1 | -1.2
x3    |     |      |      | -1.1
x4    |     |      |      |

The minima in the matrix $N$ correspond to the pair $i = 1, j = 3$ and the pair $i = 2, j = 4$, as required, since in the tree $x_1, x_3$ and $x_2, x_4$ are neighbors.

Let $i$ and $j$ be two neighboring leaves that have the same parent node, $k$. Remove $i, j$ from the list of nodes and add $k$ to the current list of nodes. How do we have to set its distance to any given leaf $m$?

[Figure: leaves i and j attached to their parent k, which is connected to a leaf m]

We have $d_{im} = d_{ik} + d_{km}$, $d_{jm} = d_{jk} + d_{km}$ and $d_{ij} = d_{ik} + d_{jk}$, thus

$$d_{im} + d_{jm} = d_{ik} + d_{km} + d_{jk} + d_{km} = d_{ij} + 2 d_{km},$$

which implies

$$d_{km} = \tfrac{1}{2}(d_{im} + d_{jm} - d_{ij}).$$
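The correction by the averaged distances can be checked numerically on the four-taxon example. This is a minimal sketch; the function name `nj_matrix` is made up, and the matrix `D4` below is an assumption of the sketch: it holds distances consistent with the example ($d_{12} = 0.3$, $d_{13} = d_{24} = 0.5$, $d_{14} = d_{23} = 0.6$, $d_{34} = 0.9$), with index 0 standing for $x_1$.

```python
def nj_matrix(d):
    """Return (r, N) with r_i = sum_k d_ik / (|L| - 2), N_ij = d_ij - (r_i + r_j)."""
    n = len(d)
    r = [sum(row) / (n - 2) for row in d]
    N = [[d[i][j] - (r[i] + r[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    return r, N


D4 = [[0.0, 0.3, 0.5, 0.6],
      [0.3, 0.0, 0.6, 0.5],
      [0.5, 0.6, 0.0, 0.9],
      [0.6, 0.5, 0.9, 0.0]]

r, N = nj_matrix(D4)
pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
closest = min(pairs, key=lambda p: D4[p[0]][p[1]])   # smallest raw distance
joined = min(pairs, key=lambda p: N[p[0]][p[1]])     # smallest corrected value

print([round(x, 6) for x in r])
print(closest, joined)
```

The raw minimum picks the non-neighbors $x_1, x_2$, whereas the minimum of $N$ falls on one of the true neighbor pairs. Which of the two tied pairs `min` returns depends on floating-point rounding, so the sketch should only be read as "one of (0, 2) and (1, 3)".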

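6.1 The Neighbor-Joining algorithm

Algorithm 6.1 (Neighbor-Joining)
Input: Distance matrix $D$ on $X$
Output: Phylogenetic tree $T$
Initialization:
    Define $T$ to be the set of leaf nodes, one for each taxon. Set $L = T$.
Iteration:
    while $|L| > 2$ do
        Compute $N_{ij}$ from $D$
        Pick a pair $i, j \in L$ for which $N_{ij}$ is minimum
        Define a new node $k$ and set $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$ for all $m \in L$   // update D
        Add $k$ to $T$ with edges of lengths $d_{ik} = \frac{1}{2}(d_{ij} + r_i - r_j)$ and $d_{jk} = d_{ij} - d_{ik}$, joining $k$ to $i$ and $j$, respectively
        Remove $i$ and $j$ from $L$ and add $k$ to $L$
Termination:
    When $L$ consists of two nodes $i$ and $j$, add the remaining edge between $i$ and $j$, with length $d_{ij}$.

Why do we use $d_{ik} = \frac{1}{2}(d_{ij} + r_i - r_j)$ for the edge length? By definition, $r_i = \frac{1}{|L|-2} \sum_{k \in L} d_{ik}$ equals the average distance $q_i = \frac{1}{|L|-2} \sum_{k \in L,\, k \ne i, j} d_{ik}$ from $i$ to all other nodes $m \ne i, j$, plus $\frac{d_{ij}}{|L|-2}$:

[Figure: nodes i and j attached to k; $q_i$ and $q_j$ are the average distances from i and from j to the remaining nodes m]

Every path from $i$ or $j$ to a node $m$ passes through $k$, so $q_i - q_j = d_{ik} - d_{jk}$, and $d_{ij} = d_{ik} + d_{jk}$. As we see from this figure:

$$d_{ik} = \tfrac{1}{2}(d_{ij} + q_i - q_j) = \tfrac{1}{2}\left(d_{ij} + q_i + \tfrac{d_{ij}}{|L|-2} - q_j - \tfrac{d_{ij}}{|L|-2}\right) = \tfrac{1}{2}(d_{ij} + r_i - r_j).$$

6.2 Example

Initialisation: $L = X = \{A, B, C, D\}$, with distances

Taxa | A | B | C | D
A    | - | 8 | 7 | 12
B    |   | - | 9 | 14
C    |   |   | - | 11
D    |   |   |   | -

1. Iteration: Compute $N$. First compute the row sums $r_i$. In this example $r_i$ denotes the plain sum, so the division by $|L| - 2 = 2$ is written out explicitly: $N_{ij} = d_{ij} - (r_i + r_j)/(|L| - 2)$.

$r_A = 8 + 7 + 12 = 27$, $r_B = 8 + 9 + 14 = 31$, $r_C = 7 + 9 + 11 = 27$, $r_D = 12 + 14 + 11 = 37$.

N    | A | B                    | C                    | D
A    |   | 8 - (r_A + r_B)/2 = -21 | 7 - (r_A + r_C)/2 = -20 | 12 - (r_A + r_D)/2 = -20
B    |   |                      | 9 - (r_B + r_C)/2 = -20 | 14 - (r_B + r_D)/2 = -20
C    |   |                      |                      | 11 - (r_C + r_D)/2 = -21

One minimum is $N(A, B) = -21$. Define a new internal node $k$ that pairs off $A$ and $B$ and compute the edge lengths from $k$ to $A$ and from $k$ to $B$:

$d(A, k) = \frac{1}{2}\left(d(A, B) + \frac{r_A - r_B}{2}\right) = \frac{1}{2}(8 + 27/2 - 31/2) = 3$
$d(B, k) = d(A, B) - d(A, k) = 8 - 3 = 5$

Compute the new distances $D'$:

Taxa | k | C | D
k    | - | 4 | 9
C    |   | - | 11
D    |   |   | -

with

$d(k, C) = \frac{1}{2}(d(A, C) + d(B, C) - d(A, B)) = \frac{1}{2}(7 + 9 - 8) = 4$
$d(k, D) = \frac{1}{2}(d(A, D) + d(B, D) - d(A, B)) = \frac{1}{2}(12 + 14 - 8) = 9$

2. Iteration: Compute $N^1$. We have $L = \{k, C, D\}$, $|L| = 3$, $|L| - 2 = 1$.

$r_k = 4 + 9 = 13$, $r_C = 4 + 11 = 15$, $r_D = 9 + 11 = 20$.

N^1  | k | C                  | D
k    |   | 4 - (13 + 15)/1 = -24 | 9 - (13 + 20)/1 = -24
C    |   |                    | 11 - (15 + 20)/1 = -24

The minimum is, e.g., $N^1(k, C)$. Define a new node $l$ that pairs off $k$ and $C$:

$d(C, l) = \frac{1}{2}(d(k, C) + r_C - r_k) = \frac{1}{2}(4 + 15 - 13) = 3$
$d(k, l) = d(k, C) - d(C, l) = 4 - 3 = 1$

Termination: Only two nodes, $D$ and $l$, are left. Compute the length of the edge between $l$ and $D$:

$d(l, D) = d(C, D) - d(l, C) = 11 - 3 = 8$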
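The whole procedure fits in a short function. The sketch below is one possible rendering of the Neighbor-Joining steps, assuming distances are stored in a dictionary keyed by unordered pairs; the function name `neighbor_joining`, the generated internal node names (`node0`, `node1`, ...) and the first-minimum tie-breaking are assumptions of this sketch. It uses the normalized $r_i$ (sum divided by $|L| - 2$), so $N_{ij} = d_{ij} - (r_i + r_j)$.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Neighbor-Joining on leaf labels `names`.

    dist maps frozenset({a, b}) to d(a, b).
    Returns the unrooted tree as a list of edges (u, v, length).
    """
    d = dict(dist)
    nodes = list(names)
    edges = []
    counter = 0
    while len(nodes) > 2:
        size = len(nodes)
        # r_i = (sum of distances from i to all other nodes) / (|L| - 2)
        r = {i: sum(d[frozenset((i, x))] for x in nodes if x != i) / (size - 2)
             for i in nodes}
        # Pick a pair minimizing N_ij = d_ij - (r_i + r_j); first minimum wins.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[frozenset(p)] - (r[p[0]] + r[p[1]]))
        k = f"node{counter}"
        counter += 1
        d_ij = d[frozenset((i, j))]
        d_ik = 0.5 * (d_ij + r[i] - r[j])
        edges.append((i, k, d_ik))
        edges.append((j, k, d_ij - d_ik))
        # Distances from the new node k to every remaining node.
        for x in nodes:
            if x not in (i, j):
                d[frozenset((k, x))] = 0.5 * (
                    d[frozenset((i, x))] + d[frozenset((j, x))] - d_ij)
        nodes = [x for x in nodes if x not in (i, j)] + [k]
    i, j = nodes
    edges.append((i, j, d[frozenset((i, j))]))
    return edges


# Distances of the worked example on the taxa A, B, C, D.
example = {("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
           ("B", "C"): 9, ("B", "D"): 14, ("C", "D"): 11}
nj_edges = neighbor_joining("ABCD", {frozenset(p): v for p, v in example.items()})
for edge in nj_edges:
    print(edge)
```

In the second iteration all three entries of the matrix are tied, and this sketch may join a different pair than the worked example does. For these additive distances the resulting unrooted tree is the same either way: pendant edges of length 3, 5, 3 and 8 for A, B, C and D, and one interior edge of length 1.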

[Figure: the final tree, with edges A-k of length 3, B-k of length 5, k-l of length 1, C-l of length 3 and l-D of length 8]

Example of Neighbor-Joining applied to 5S rRNA data.

Abbreviations:

Bsu: Bacillus subtilis
Bst: Bacillus stearothermophilus
Lvi: Lactobacillus viridescens
Amo: Acholeplasma modicum
Mlu: Micrococcus luteus

[Table: original pairwise distances between Bsu, Bst, Lvi, Amo and Mlu]

The resulting NJ-tree:

[Figure: unrooted NJ tree on Lvi, Amo, Bsu, Bst and Mlu, with edge lengths]

7 Rooting unrooted trees

In contrast to UPGMA, most tree reconstruction methods produce an unrooted tree. Indeed, determining the root of a tree using computational methods is very difficult.

In practice, the question of rooting a tree is addressed by adding an outgroup to the set of taxa under consideration. This is a taxon that is slightly more distantly related to the original taxa than the original taxa are to each other. The root is then assumed to be on the branch attaching the outgroup taxon to the rest of the tree.

In practice, selecting an appropriate outgroup can be difficult: if it is too similar to the other taxa, then it might be more closely related to some than to others. If it is too distant, then there might not be enough similarity to the other taxa to perform meaningful comparisons.

In the above Neighbor-Joining tree, Mlu is the outgroup. Hence, the rooted version of this tree looks something like this:

[Figure: rooted version of the NJ tree on Bst, Mlu, Bsu, Lvi and Amo, with the root on the branch leading to Mlu]

This tree is believed to be more correct than the one produced using UPGMA:

[Figure: the UPGMA tree on Bsu, Bst, Mlu, Lvi and Amo]

In particular, the UPGMA tree does not separate the outgroup from all other taxa by the root node.

The reason why UPGMA produces an incorrect tree is that two of the sequences, those of L. viridescens (Lvi) and A. modicum (Amo), are much more diverged than the others.

8 Sequence-based methods

In sequence-based methods for phylogeny, the input consists of a multiple sequence alignment $M$ on a set of taxa $X$, and a phylogenetic tree $T$ is determined for $M$, usually by performing a search in tree space to find an optimal phylogenetic tree or trees.

The three main approaches are maximum parsimony, maximum likelihood and Bayesian inference. We will discuss the first of these three approaches.

9 Maximum parsimony

Maximum parsimony is one of the most widely used sequence-based tree reconstruction methods.

In science, the principle of maximum parsimony is a preference for the least complex explanation for an observation.

In phylogenetic analysis, the maximum parsimony problem is to find a phylogenetic tree $T$ on $X$ that explains a given multiple alignment $A$ of sequences on $X$ using a minimum number of evolutionary events.

9.1 The parsimony score of a tree

The difference between two sequences $x = x_1 \dots x_L$ and $y = y_1 \dots y_L$ is simply their non-normalized Hamming distance

$$\mathrm{Ham}(x, y) = |\{k : x_k \ne y_k\}|.$$

Assume we are given a multiple sequence alignment $A = (a_1, a_2, \dots, a_n)$ on $X$ and a corresponding phylogenetic tree $T$ on $X$.

If we assign the aligned sequences to the leaves and hypothetical ancestor sequences to the internal nodes of $T$, then we can obtain a score for $T$ together with this assignment, by summing over all differences $\mathrm{Ham}(x, y)$, where $x$ and $y$ are any two sequences labeling two nodes that are joined by an edge in $T$.

Example:

[Figure: a tree with sequences at the leaves and internal nodes, the number of differences along each edge, and the resulting score]

The minimum value obtainable in this way is called the parsimony score $PS(T, A)$ of $T$ and $A$:

Definition 9.1 (Parsimony score of a tree) Given a multiple alignment $A$ of sequences of length $L$ and a corresponding phylogenetic tree $T$ on $X$, its parsimony score is defined as:

$$PS(T, A) = \min_{\lambda} \sum_{\{u, v\} \in E} \mathrm{Ham}(\lambda(u), \lambda(v)),$$

where the minimum is taken over all possible labelings $\lambda$ of the internal nodes of $T$ by hypothetical ancestor sequences of length $L$.

9.2 The Small Parsimony Problem

Problem 9.2 (Small Parsimony Problem) The Small Parsimony Problem is to compute the parsimony score for an alignment $A$ of sequences on a phylogenetic tree $T$.

Can this be solved efficiently?

As the parsimony score is obtained by summing over all columns, the columns are independent and so it suffices to discuss how to obtain an optimal assignment for one position.

Example:

[Figure: a tree whose leaves are labeled by the characters of a single alignment column]

9.3 The Fitch algorithm

In 1971, Walter Fitch [3] published a dynamic programming algorithm that solves the small parsimony problem efficiently.

Input: A phylogenetic tree $T$ with $n$ leaves, and a single character $c$ with $k$ possible states. Denote the state of the character for node $v$ by $c(v)$.

The Fitch algorithm consists of two passes:

1. The first (bottom-up) pass computes the parsimony score and a set of possible states for each node.
2. The second (top-down) pass chooses a state for each node.

The following algorithm ("bottom-up pass") computes the parsimony score for $T$ and a fixed column $c$ in the sequence alignment. Initially it is called with $v$ equal to the root node and $PS(T, c) = 0$.

Algorithm 9.3 (ParsimonyScore(v), Walter Fitch, 1971)
Input: A binary phylogenetic tree $T$, a state $c(w)$ for each leaf $w$ of $T$
Output: The parsimony score $PS(T, c)$ for $T$ and $c$
Set score = 0
if $v$ is a leaf node then
    Set $F(v) = \{c(v)\}$
else
    for the two children $w_1$ and $w_2$ of $v$ do
        Call ParsimonyScore($w_i$) to compute $F(w_i)$ and add the result to score
    if $F(w_1) \cap F(w_2) \ne \emptyset$ then
        Set $F(v) = F(w_1) \cap F(w_2)$
    else
        Set $F(v) = F(w_1) \cup F(w_2)$ and increment score
return score

[3] Fitch, W. (1971) Toward defining the course of evolution: minimum change for a specified tree topology. Syst. Zoology 20:406-416.

Example: use the Fitch algorithm to compute the parsimony score for the following labeled tree:

[Figure: a labeled tree]

The first pass computes the parsimony score.

An optimal labeling of the internal nodes is obtained via the second pass ("top-down"): Starting at the root node $r$, we label $r$ using any character in $F(r)$. Then, for each child $w$, we use the same letter if it is contained in $F(w)$, otherwise we use any letter in $F(w)$ as label for $w$. We then visit the children of $w$, etc.:

[Figure: the tree with the state set F(v) at every node]

This algorithm also requires $O(nL)$ steps, in total.

9.4 Fitch: Second pass

Example:

[Figure: a Fitch labeling of a tree, two labelings obtained by traceback, and an optimal labeling that is not obtainable by traceback]

The following algorithm computes the assignments of the internal nodes for $T$ and a fixed column $c$ in the sequence alignment. It visits the nodes in top-down order, starting at the root node $r$.

Algorithm 9.4 (Fitch algorithm, second pass)
Input: A phylogenetic tree $T$ and the set $F(w)$ for every node $w$ in $T$
Output: The fully labeled tree $T$
for each node $v$, in top-down order do
    if $v$ is the root node then
        Choose $t \in F(v)$ and set $c(v) = t$
    else
        Let $w$ be the parent node of $v$
        if $c(w) \in F(v)$ then
            Set $c(v) = c(w)$
        else
            Choose $t \in F(v)$ and set $c(v) = t$
end

9.5 Evaluating the score on different trees

Assume we are given the following four aligned sequences:

[Alignment: four sequences, Seq 1 to Seq 4]

There are three possible binary (non-rooted) topologies on four taxa:

[Figure: the three unrooted binary topologies on the four sequences]

In each tree, label all internal nodes with sequences so as to minimize the score obtained by summing all mismatches along edges. Which tree minimizes this score?

9.6 The Fitch algorithm on unrooted trees

Above, we formulated the Fitch algorithm for rooted trees. However, the minimum parsimony cost is independent of where the root is located in the tree $T$.

To see this, consider a root node $\rho$ connected to two nodes $v$ and $w$, and let $c(\cdot)$ denote the character assigned to a node in the Fitch algorithm:

1. If $c(v) = c(w)$, then $c(\rho) = c(v) = c(w)$, because any other choice would increase the score.
2. If $c(v) \ne c(w)$, then either $c(\rho) = c(v)$ or $c(\rho) = c(w)$, and both choices contribute 1 to the score.

In both cases, we could replace $\rho$ by a new edge $e$ that connects $v$ and $w$ without changing the parsimony score of $T$.
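Both passes of the Fitch algorithm can be written compactly for a single alignment column. The following is a minimal sketch, assuming a rooted binary tree given as nested tuples whose leaves are one-letter strings holding the observed character; the function names, the tuple representation and the five-leaf example are illustrative assumptions. Where the second pass may "choose any" state, this sketch takes the alphabetically smallest one to stay deterministic.

```python
def fitch_bottom_up(tree):
    """First pass: return ((state_set, children), score) for one column."""
    if isinstance(tree, str):                       # leaf with observed state
        return (frozenset([tree]), ()), 0
    left, right = tree
    left_ann, left_score = fitch_bottom_up(left)
    right_ann, right_score = fitch_bottom_up(right)
    score = left_score + right_score
    common = left_ann[0] & right_ann[0]
    if common:                                      # intersection is non-empty
        states = common
    else:                                           # union, one more change
        states = left_ann[0] | right_ann[0]
        score += 1
    return (states, (left_ann, right_ann)), score


def fitch_top_down(annotated, parent_state=None):
    """Second pass: pick one state per node, returning (state, children)."""
    states, children = annotated
    state = parent_state if parent_state in states else min(states)
    return state, tuple(fitch_top_down(child, state) for child in children)


def count_changes(labeled):
    """Number of edges whose two end nodes carry different states."""
    state, children = labeled
    return sum((child[0] != state) + count_changes(child) for child in children)


column_tree = (("A", "G"), ("C", ("A", "G")))       # hypothetical column
annotated, score = fitch_bottom_up(column_tree)
labeled = fitch_top_down(annotated)
print(score, labeled[0], count_changes(labeled))
```

Counting the changes along the edges of the labeled tree reproduces the score of the first pass, which is a useful sanity check on any implementation of the traceback.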

9.7 The Large Parsimony Problem

Definition 9.5 (Parsimony score of an alignment) Given a multiple alignment $A = (a_1, \dots, a_n)$ on $X$, its parsimony score is defined as

$$PS(A) = \min\{PS(T, A) \mid T \text{ is a phylogenetic tree on } X\}.$$

Problem 9.6 (Large Parsimony Problem) The Large Parsimony Problem is to compute $PS(A)$ and also

$$T_{mp} = \arg\min\{PS(T, A) \mid T \text{ is a phylogenetic tree on } X\}.$$

Potentially, we need to consider all $(2n-5)!!$ possible trees. Unfortunately, in general this cannot be avoided, and the maximum parsimony problem is known to be NP-hard.

Exhaustive enumeration of all possible tree topologies will only work for $n \le 10$, say.

10 Tree Searching Methods

As noted above, we potentially need to consider all $(2n-5)!!$ possible trees. The following list summarizes some of the main strategies for either solving the problem exactly or obtaining good approximations:

- Exhaustive search (exact)
- Branch and bound search (exact)
- Stepwise addition (heuristic)
- Star decomposition (heuristic)
- Branch swapping search (heuristic)

10.1 Branch and bound

Recall how we obtained an expression for the number $U(n)$ of unrooted phylogenetic tree topologies on $n$ taxa:

For $n = 3$ there are three ways of adding an extra edge with a new leaf to obtain an unrooted tree on 4 leaves. This new tree has $2n-3 = 5$ edges and there are 5 ways to obtain a new tree with 5 leaves, etc.

To be precise, one can obtain any tree $T_{i+1}$ on the label set $\{A_1, \dots, A_i, A_{i+1}\}$ by adding an extra edge with a leaf labeled $A_{i+1}$ to some (unique) tree $T_i$ on $\{A_1, \dots, A_i\}$.

In other words, we can produce the set of all possible trees on $n$ taxa by adding one leaf at a time in all possible ways, thus systematically generating a complete enumeration tree.
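The size of this enumeration tree grows very quickly, which is easy to see by evaluating the counting formulas $U(n) = (2n-5)!!$ and $R(n) = (2n-3)!!$. A minimal sketch follows; the function names are made up for illustration.

```python
def double_factorial(m):
    """Product m * (m - 2) * (m - 4) * ... down to 1 (for odd m >= 1)."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def num_unrooted(n):
    """U(n) = (2n - 5)!!, the number of unrooted binary trees on n >= 3 leaves."""
    if n < 3:
        raise ValueError("U(n) is defined here for n >= 3")
    return double_factorial(2 * n - 5)


def num_rooted(n):
    """R(n) = (2n - 3)!! = U(n) * (2n - 3), the number of rooted binary trees."""
    if n < 2:
        raise ValueError("R(n) is defined here for n >= 2")
    return double_factorial(2 * n - 3)


for n in (3, 4, 5, 10, 20):
    print(n, num_unrooted(n), num_rooted(n))
```

Already for 10 taxa there are about two million unrooted topologies, which is why exhaustive search stops being practical around that size.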

A simple but crucial observation is that adding a new sequence $A_{i+1}$ to a tree $T_i$ to obtain a new tree $T_{i+1}$ cannot lead to a smaller parsimony score.

This gives rise to the following bound criterion when generating the enumeration tree: if the local parsimony score of the current incomplete tree $T$ is larger than or equal to the best global score for any complete tree seen so far, then we do not generate or search the enumeration subtree below $T$.

Example: Assume we are given an MSA $A = \{a_1, a_2, \dots, a_5\}$ and consider the characters $a_{1i}, a_{2i}, a_{3i}, a_{4i}$ and $a_{5i}$ at a given position $i$.

Assume that the Neighbor-Joining tree on $A$ looks like this:

[Figure: Neighbor-Joining tree on $a_1, \dots, a_5$]

The parsimony score of this tree at position $i$ is used to initialize the global bound for position $i$.

The first step in generating the enumeration tree is this:

[Figure: the tree on $a_1, a_2, a_3$ and the three trees obtained by adding $a_4$. For the first tree the local score is smaller than the global bound (continue); for the other two the local score is at least the global bound (don't continue).]

Only the first of the three trees has a local score below the global bound, so we continue with it and do not pursue the other two trees.

The second step in generating the enumeration tree looks like this:

[Figure: the five trees obtained by adding $a_5$, each with its local score. The first three trees are optimal.]

Application of branch-and-bound to evolutionary trees was first suggested by Mike Hendy and Dave Penny (1982) [4].

In practice, using branch and bound one can obtain exact solutions for data sets of twenty or more sequences, depending on the sequence length and the messiness of the data.

A good starting strategy is to first compute a tree $T_0$ for the data, e.g. using Neighbor-Joining, and then to initialize the global bound to the parsimony score of $T_0$.

10.2 The stepwise-addition heuristic

Now we discuss a simple greedy heuristic (Felsenstein 1981) for approximating the optimal tree or score: build the tree $T$ by adding one leaf after the other, in each step greedily choosing the optimal position for the new leaf-edge.

Given a multiple sequence alignment $A = \{a_1, a_2, \dots, a_n\}$, start with a tree $T_2$ consisting of two leaves labeled $a_1$ and $a_2$.

Given $T_i$, for each edge $e$ in $T_i$, obtain a new tree $T_i^e$ as follows: insert a new node $v$ in $e$ and join it via a new edge $f$ to a new leaf $w$ with label $a_{i+1}$. Set

$$T_{i+1} = \arg\min_e \{PS(T_i^e, A)\}.$$

[Figure: a tree $T$ with edges $e_1, e_2, e_3$ and the trees $T^{e_1}, T^{e_2}, T^{e_3}$ obtained by attaching the next leaf to each edge, with their parsimony scores]

Obviously, this approach does not guarantee an optimal result. Moreover, the result obtained will depend on the order in which the sequences are processed.

[4] Hendy, M. D. and Penny, D. (1982). Branch and bound algorithms to determine minimal evolutionary trees. Mathematical Biosciences 59: 277-290.
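The stepwise-addition heuristic described above can be sketched in a few lines. The sketch makes several assumptions that go beyond the notes: trees are rooted nested tuples of leaf names (the root position does not affect the parsimony score), every subtree stands for the edge above it, so some insertions yield the same unrooted tree more than once, ties are broken by taking the first minimum, and the four short sequences are invented for illustration.

```python
def fitch_score(tree, column):
    """Fitch parsimony score of one column; column maps leaf name -> state."""
    def up(node):
        if isinstance(node, str):
            return {column[node]}, 0
        (ls, lc), (rs, rc) = up(node[0]), up(node[1])
        common = ls & rs
        return (common, lc + rc) if common else (ls | rs, lc + rc + 1)
    return up(tree)[1]


def parsimony(tree, alignment):
    """Sum of the per-column Fitch scores; alignment maps name -> sequence."""
    length = len(next(iter(alignment.values())))
    return sum(fitch_score(tree, {n: s[pos] for n, s in alignment.items()})
               for pos in range(length))


def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one edge of `tree`."""
    yield (tree, leaf)                    # attach on the edge above this node
    if not isinstance(tree, str):
        left, right = tree
        for new_left in insertions(left, leaf):
            yield (new_left, right)
        for new_right in insertions(right, leaf):
            yield (left, new_right)


def stepwise_addition(alignment, order):
    """Greedily add the leaves in the given order; return (tree, score)."""
    tree = (order[0], order[1])
    for leaf in order[2:]:
        tree = min(insertions(tree, leaf), key=lambda t: parsimony(t, alignment))
    return tree, parsimony(tree, alignment)


toy_alignment = {"a": "AAAA", "b": "AAAT", "c": "CCGG", "d": "CCGT"}
best_tree, best_score = stepwise_addition(toy_alignment, ["a", "b", "c", "d"])
print(best_tree, best_score)
```

On this toy input the heuristic finds a tree that groups a with b and c with d, with score 5, whereas the topology grouping a with c scores 8. With other inputs or another processing order the greedy choice may end in a suboptimal tree.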

11 Phylogenetics Software

Recommended software packages in phylogenetics:

1. Dendroscope: An interactive Java program for viewing phylogenetic trees and rooted networks.
2. Paup: A powerful and widely used commercial program for computing phylogenetic trees.
3. Phylip: A collection of command-line programs for computing phylogenetic trees, which can also be run within a web application.
4. SplitsTree: An interactive Java program for computing phylogenetic trees and networks.

12 Summary

Phylogenetic trees are used to represent the evolutionary relationships between species, in terms of monophyletic groups.

There are a number of different types of methods for computing such trees from molecular data:

- Distance methods such as UPGMA and NJ compute trees from distance matrices.
- Maximum parsimony searches for a tree that explains a given sequence alignment using the smallest number of substitutions.
- Additional methods include maximum likelihood estimation and Bayesian inference.

Tree traversals

In a tree traversal we visit and process every node in a tree exactly once. Such traversals are classified by the order in which the nodes are examined. In the following definition we assume that the trees are rooted and bifurcating. However, all the algorithms can easily be adapted to the case of non-bifurcating trees.

Definition (Tree traversals) Let $T$ be a bifurcating rooted tree. There are four types of tree traversals:

Preorder traversal: Examine the root of $T$. Then traverse its left subtree in preorder, then traverse its right subtree in preorder.

Postorder traversal: First traverse the left subtree of the root in postorder, then traverse its right subtree in postorder, and finally examine the root.

Inorder traversal: First traverse the left subtree of the root in inorder, then examine the root, and then traverse its right subtree in inorder.

Breadth-first traversal: Put the root in a queue. Repeat the following until the queue is empty: pop a node off the beginning of the queue and examine it. Then, if the node has any children, push them onto the end of the queue.

A bifurcating tree rooted at node ρ, with other nodes labeled a, ..., j:

[Figure: tree rooted at ρ; node labels appearing in the figure: j, h, g, i, a, b, c, d, e, f.]

The different traversals of the above tree give rise to different orders in which the nodes are examined. The order in which nodes are examined in each of the traversals is:

Traversal        Order in which nodes are examined
preorder:
postorder:
inorder:
breadth-first:

A preorder traversal is also called a top-down algorithm. A postorder traversal is also called a bottom-up algorithm.
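The four traversals in the definition above can be sketched in Python as follows. The `Node` class and the small example tree are assumptions made for illustration; the example is not the tree from the figure, whose shape is not recoverable here.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def preorder(node):
    # Root first, then left subtree, then right subtree.
    if node is None:
        return []
    return [node.label] + preorder(node.left) + preorder(node.right)


def postorder(node):
    # Left subtree, then right subtree, then root.
    if node is None:
        return []
    return postorder(node.left) + postorder(node.right) + [node.label]


def inorder(node):
    # Left subtree, then root, then right subtree.
    if node is None:
        return []
    return inorder(node.left) + [node.label] + inorder(node.right)


def breadth_first(root):
    # Queue-based traversal: examine a node, then enqueue its children.
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node.label)
        queue.extend(c for c in (node.left, node.right) if c is not None)
    return order


# Hypothetical example: root r with an internal child x (leaves a, b) and a leaf c.
example = Node("r", Node("x", Node("a"), Node("b")), Node("c"))
print(preorder(example))       # ['r', 'x', 'a', 'b', 'c']
print(postorder(example))      # ['a', 'b', 'x', 'c', 'r']
print(inorder(example))        # ['a', 'x', 'b', 'r', 'c']
print(breadth_first(example))  # ['r', 'x', 'c', 'a', 'b']
```

The recursive versions mirror the definition directly; for very deep trees an explicit stack would avoid hitting Python's recursion limit.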

Evolutionary tree reconstruction (Chapter 10)

Evolutionary tree reconstruction (Chapter 10) Evolutionary tree reconstruction (Chapter 10) Early Evolutionary Studies Anatomical features were the dominant criteria used to derive evolutionary relationships between species since Darwin till early

More information

Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony

Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony Basic Bioinformatics Workshop, ILRI Addis Ababa, 12 December 2017 Learning Objectives understand

More information

Lecture 20: Clustering and Evolution

Lecture 20: Clustering and Evolution Lecture 20: Clustering and Evolution Study Chapter 10.4 10.8 11/11/2014 Comp 555 Bioalgorithms (Fall 2014) 1 Clique Graphs A clique is a graph where every vertex is connected via an edge to every other

More information

Lecture 20: Clustering and Evolution

Lecture 20: Clustering and Evolution Lecture 20: Clustering and Evolution Study Chapter 10.4 10.8 11/12/2013 Comp 465 Fall 2013 1 Clique Graphs A clique is a graph where every vertex is connected via an edge to every other vertex A clique

More information

4/4/16 Comp 555 Spring

4/4/16 Comp 555 Spring 4/4/16 Comp 555 Spring 2016 1 A clique is a graph where every vertex is connected via an edge to every other vertex A clique graph is a graph where each connected component is a clique The concept of clustering

More information

11/17/2009 Comp 590/Comp Fall

11/17/2009 Comp 590/Comp Fall Lecture 20: Clustering and Evolution Study Chapter 10.4 10.8 Problem Set #5 will be available tonight 11/17/2009 Comp 590/Comp 790-90 Fall 2009 1 Clique Graphs A clique is a graph with every vertex connected

More information

What is a phylogenetic tree? Algorithms for Computational Biology. Phylogenetics Summary. Di erent types of phylogenetic trees

What is a phylogenetic tree? Algorithms for Computational Biology. Phylogenetics Summary. Di erent types of phylogenetic trees What is a phylogenetic tree? Algorithms for Computational Biology Zsuzsanna Lipták speciation events Masters in Molecular and Medical Biotechnology a.a. 25/6, fall term Phylogenetics Summary wolf cat lion

More information

EVOLUTIONARY DISTANCES INFERRING PHYLOGENIES

EVOLUTIONARY DISTANCES INFERRING PHYLOGENIES EVOLUTIONARY DISTANCES INFERRING PHYLOGENIES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 28 th November 2007 OUTLINE 1 INFERRING

More information

Algorithms for Bioinformatics

Algorithms for Bioinformatics Adapted from slides by Leena Salmena and Veli Mäkinen, which are partly from http: //bix.ucsd.edu/bioalgorithms/slides.php. 582670 Algorithms for Bioinformatics Lecture 6: Distance based clustering and

More information

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst Phylogenetics Introduction to Bioinformatics Dortmund, 16.-20.07.2007 Lectures: Sven Rahmann Exercises: Udo Feldkamp, Michael Wurst 1 Phylogenetics phylum = tree phylogenetics: reconstruction of evolutionary

More information

Genetics/MBT 541 Spring, 2002 Lecture 1 Joe Felsenstein Department of Genome Sciences Phylogeny methods, part 1 (Parsimony and such)

Genetics/MBT 541 Spring, 2002 Lecture 1 Joe Felsenstein Department of Genome Sciences Phylogeny methods, part 1 (Parsimony and such) Genetics/MBT 541 Spring, 2002 Lecture 1 Joe Felsenstein Department of Genome Sciences joe@gs Phylogeny methods, part 1 (Parsimony and such) Methods of reconstructing phylogenies (evolutionary trees) Parsimony

More information

human chimp mouse rat

human chimp mouse rat Michael rudno These notes are based on earlier notes by Tomas abak Phylogenetic Trees Phylogenetic Trees demonstrate the amoun of evolution, and order of divergence for several genomes. Phylogenetic trees

More information

Parsimony-Based Approaches to Inferring Phylogenetic Trees

Parsimony-Based Approaches to Inferring Phylogenetic Trees Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 www.biostat.wisc.edu/bmi576.html Mark Craven craven@biostat.wisc.edu Fall 0 Phylogenetic tree approaches! three general types! distance:

More information

CSE 549: Computational Biology

CSE 549: Computational Biology CSE 549: Computational Biology Phylogenomics 1 slides marked with * by Carl Kingsford Tree of Life 2 * H5N1 Influenza Strains Salzberg, Kingsford, et al., 2007 3 * H5N1 Influenza Strains The 2007 outbreak

More information

The worst case complexity of Maximum Parsimony

The worst case complexity of Maximum Parsimony he worst case complexity of Maximum Parsimony mir armel Noa Musa-Lempel Dekel sur Michal Ziv-Ukelson Ben-urion University June 2, 20 / 2 What s a phylogeny Phylogenies: raph-like structures whose topology

More information

Codon models. In reality we use codon model Amino acid substitution rates meet nucleotide models Codon(nucleotide triplet)

Codon models. In reality we use codon model Amino acid substitution rates meet nucleotide models Codon(nucleotide triplet) Phylogeny Codon models Last lecture: poor man s way of calculating dn/ds (Ka/Ks) Tabulate synonymous/non- synonymous substitutions Normalize by the possibilities Transform to genetic distance K JC or K

More information

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering Sequence clustering Introduction Data clustering is one of the key tools used in various incarnations of data-mining - trying to make sense of large datasets. It is, thus, natural to ask whether clustering

More information

Phylogenetics on CUDA (Parallel) Architectures Bradly Alicea

Phylogenetics on CUDA (Parallel) Architectures Bradly Alicea Descent w/modification Descent w/modification Descent w/modification Descent w/modification CPU Descent w/modification Descent w/modification Phylogenetics on CUDA (Parallel) Architectures Bradly Alicea

More information

DISTANCE BASED METHODS IN PHYLOGENTIC TREE CONSTRUCTION

DISTANCE BASED METHODS IN PHYLOGENTIC TREE CONSTRUCTION DISTANCE BASED METHODS IN PHYLOGENTIC TREE CONSTRUCTION CHUANG PENG DEPARTMENT OF MATHEMATICS MOREHOUSE COLLEGE ATLANTA, GA 30314 Abstract. One of the most fundamental aspects of bioinformatics in understanding

More information

Recent Research Results. Evolutionary Trees Distance Methods

Recent Research Results. Evolutionary Trees Distance Methods Recent Research Results Evolutionary Trees Distance Methods Indo-European Languages After Tandy Warnow What is the purpose? Understand evolutionary history (relationship between species). Uderstand how

More information

DIMACS Tutorial on Phylogenetic Trees and Rapidly Evolving Pathogens. Katherine St. John City University of New York 1

DIMACS Tutorial on Phylogenetic Trees and Rapidly Evolving Pathogens. Katherine St. John City University of New York 1 DIMACS Tutorial on Phylogenetic Trees and Rapidly Evolving Pathogens Katherine St. John City University of New York 1 Thanks to the DIMACS Staff Linda Casals Walter Morris Nicole Clark Katherine St. John

More information

Distance based tree reconstruction. Hierarchical clustering (UPGMA) Neighbor-Joining (NJ)

Distance based tree reconstruction. Hierarchical clustering (UPGMA) Neighbor-Joining (NJ) Distance based tree reconstruction Hierarchical clustering (UPGMA) Neighbor-Joining (NJ) All organisms have evolved from a common ancestor. Infer the evolutionary tree (tree topology and edge lengths)

More information

Distance-based Phylogenetic Methods Near a Polytomy

Distance-based Phylogenetic Methods Near a Polytomy Distance-based Phylogenetic Methods Near a Polytomy Ruth Davidson and Seth Sullivant NCSU UIUC May 21, 2014 2 Phylogenetic trees model the common evolutionary history of a group of species Leaves = extant

More information

CISC 636 Computational Biology & Bioinformatics (Fall 2016) Phylogenetic Trees (I)

CISC 636 Computational Biology & Bioinformatics (Fall 2016) Phylogenetic Trees (I) CISC 636 Computational iology & ioinformatics (Fall 2016) Phylogenetic Trees (I) Maximum Parsimony CISC636, F16, Lec13, Liao 1 Evolution Mutation, selection, Only the Fittest Survive. Speciation. t one

More information

of the Balanced Minimum Evolution Polytope Ruriko Yoshida

of the Balanced Minimum Evolution Polytope Ruriko Yoshida Optimality of the Neighbor Joining Algorithm and Faces of the Balanced Minimum Evolution Polytope Ruriko Yoshida Figure 19.1 Genomes 3 ( Garland Science 2007) Origins of Species Tree (or web) of life eukarya

More information

On the Optimality of the Neighbor Joining Algorithm

On the Optimality of the Neighbor Joining Algorithm On the Optimality of the Neighbor Joining Algorithm Ruriko Yoshida Dept. of Statistics University of Kentucky Joint work with K. Eickmeyer, P. Huggins, and L. Pachter www.ms.uky.edu/ ruriko Louisville

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the

More information

Parsimony methods. Chapter 1

Parsimony methods. Chapter 1 Chapter 1 Parsimony methods Parsimony methods are the easiest ones to explain, and were also among the first methods for inferring phylogenies. The issues that they raise also involve many of the phenomena

More information

Olivier Gascuel Arbres formels et Arbre de la Vie Conférence ENS Cachan, septembre Arbres formels et Arbre de la Vie.

Olivier Gascuel Arbres formels et Arbre de la Vie Conférence ENS Cachan, septembre Arbres formels et Arbre de la Vie. Arbres formels et Arbre de la Vie Olivier Gascuel Centre National de la Recherche Scientifique LIRMM, Montpellier, France www.lirmm.fr/gascuel 10 permanent researchers 2 technical staff 3 postdocs, 10

More information

Consider the character matrix shown in table 1. We assume that the investigator has conducted primary homology analysis such that:

Consider the character matrix shown in table 1. We assume that the investigator has conducted primary homology analysis such that: 1 Inferring trees from a data matrix onsider the character matrix shown in table 1. We assume that the investigator has conducted primary homology analysis such that: 1. the characters (columns) contain

More information

Dynamic Programming for Phylogenetic Estimation

Dynamic Programming for Phylogenetic Estimation 1 / 45 Dynamic Programming for Phylogenetic Estimation CS598AGB Pranjal Vachaspati University of Illinois at Urbana-Champaign 2 / 45 Coalescent-based Species Tree Estimation Find evolutionary tree for

More information

Generalized Neighbor-Joining: More Reliable Phylogenetic Tree Reconstruction

Generalized Neighbor-Joining: More Reliable Phylogenetic Tree Reconstruction Generalized Neighbor-Joining: More Reliable Phylogenetic Tree Reconstruction William R. Pearson, Gabriel Robins,* and Tongtong Zhang* *Department of Computer Science and Department of Biochemistry, University

More information

Rapid Neighbour-Joining

Rapid Neighbour-Joining Rapid Neighbour-Joining Martin Simonsen, Thomas Mailund and Christian N. S. Pedersen Bioinformatics Research Center (BIRC), University of Aarhus, C. F. Møllers Allé, Building 1110, DK-8000 Århus C, Denmark.

More information

Scaling species tree estimation methods to large datasets using NJMerge

Scaling species tree estimation methods to large datasets using NJMerge Scaling species tree estimation methods to large datasets using NJMerge Erin Molloy and Tandy Warnow {emolloy2, warnow}@illinois.edu University of Illinois at Urbana Champaign 2018 Phylogenomics Software

More information

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment Courtesy of jalview 1 Motivations Collective statistic Protein families Identification and representation of conserved sequence features

More information

Improvement of Distance-Based Phylogenetic Methods by a Local Maximum Likelihood Approach Using Triplets

Improvement of Distance-Based Phylogenetic Methods by a Local Maximum Likelihood Approach Using Triplets Improvement of Distance-Based Phylogenetic Methods by a Local Maximum Likelihood Approach Using Triplets Vincent Ranwez and Olivier Gascuel Département Informatique Fondamentale et Applications, LIRMM,

More information

Introduction to Trees

Introduction to Trees Introduction to Trees Tandy Warnow December 28, 2016 Introduction to Trees Tandy Warnow Clades of a rooted tree Every node v in a leaf-labelled rooted tree defines a subset of the leafset that is below

More information

A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony

A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony Jean-Michel Richer 1 and Adrien Goëffon 2 and Jin-Kao Hao 1 1 University of Angers, LERIA, 2 Bd Lavoisier, 49045 Anger Cedex 01,

More information

Terminology. A phylogeny is the evolutionary history of an organism

Terminology. A phylogeny is the evolutionary history of an organism Phylogeny Terminology A phylogeny is the evolutionary history of an organism A taxon (plural: taxa) is a group of (one or more) organisms, which a taxonomist adjudges to be a unit. A definition? from Wikipedia

More information

A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony

A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony Jean-Michel Richer 1,AdrienGoëffon 2, and Jin-Kao Hao 1 1 University of Angers, LERIA, 2 Bd Lavoisier, 49045 Anger Cedex 01, France

More information

Evolutionary Trees. Fredrik Ronquist. August 29, 2005

Evolutionary Trees. Fredrik Ronquist. August 29, 2005 Evolutionary Trees Fredrik Ronquist August 29, 2005 1 Evolutionary Trees Tree is an important concept in Graph Theory, Computer Science, Evolutionary Biology, and many other areas. In evolutionary biology,

More information

Phylogenetic Trees Lecture 12. Section 7.4, in Durbin et al., 6.5 in Setubal et al. Shlomo Moran, Ilan Gronau

Phylogenetic Trees Lecture 12. Section 7.4, in Durbin et al., 6.5 in Setubal et al. Shlomo Moran, Ilan Gronau Phylogenetic Trees Lecture 12 Section 7.4, in Durbin et al., 6.5 in Setubal et al. Shlomo Moran, Ilan Gronau. Maximum Parsimony. Last week we presented Fitch algorithm for (unweighted) Maximum Parsimony:

More information

Parsimony Least squares Minimum evolution Balanced minimum evolution Maximum likelihood (later in the course)

Parsimony Least squares Minimum evolution Balanced minimum evolution Maximum likelihood (later in the course) Tree Searching We ve discussed how we rank trees Parsimony Least squares Minimum evolution alanced minimum evolution Maximum likelihood (later in the course) So we have ways of deciding what a good tree

More information

Lab 8 Phylogenetics I: creating and analysing a data matrix

Lab 8 Phylogenetics I: creating and analysing a data matrix G44 Geobiology Fall 23 Name Lab 8 Phylogenetics I: creating and analysing a data matrix For this lab and the next you will need to download and install the Mesquite and PHYLIP packages: http://mesquiteproject.org/mesquite/mesquite.html

More information

CS 581. Tandy Warnow

CS 581. Tandy Warnow CS 581 Tandy Warnow This week Maximum parsimony: solving it on small datasets Maximum Likelihood optimization problem Felsenstein s pruning algorithm Bayesian MCMC methods Research opportunities Maximum

More information

1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998

1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998 7 Multiple Sequence Alignment The exposition was prepared by Clemens GrÃP pl, based on earlier versions by Daniel Huson, Knut Reinert, and Gunnar Klau. It is based on the following sources, which are all

More information

Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters

Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters Erin Molloy University of Illinois at Urbana Champaign General Allocation (PI: Tandy Warnow) Exploratory Allocation

More information

7.1 Introduction. A (free) tree T is A simple graph such that for every pair of vertices v and w there is a unique path from v to w

7.1 Introduction. A (free) tree T is A simple graph such that for every pair of vertices v and w there is a unique path from v to w Chapter 7 Trees 7.1 Introduction A (free) tree T is A simple graph such that for every pair of vertices v and w there is a unique path from v to w Tree Terminology Parent Ancestor Child Descendant Siblings

More information

Trees. Q: Why study trees? A: Many advance ADTs are implemented using tree-based data structures.

Trees. Q: Why study trees? A: Many advance ADTs are implemented using tree-based data structures. Trees Q: Why study trees? : Many advance DTs are implemented using tree-based data structures. Recursive Definition of (Rooted) Tree: Let T be a set with n 0 elements. (i) If n = 0, T is an empty tree,

More information

Binary Trees

Binary Trees Binary Trees 4-7-2005 Opening Discussion What did we talk about last class? Do you have any code to show? Do you have any questions about the assignment? What is a Tree? You are all familiar with what

More information

BMI/CS 576 Fall 2015 Midterm Exam

BMI/CS 576 Fall 2015 Midterm Exam BMI/CS 576 Fall 2015 Midterm Exam Prof. Colin Dewey Tuesday, October 27th, 2015 11:00am-12:15pm Name: KEY Write your answers on these pages and show your work. You may use the back sides of pages as necessary.

More information

Special course in Computer Science: Advanced Text Algorithms

Special course in Computer Science: Advanced Text Algorithms Special course in Computer Science: Advanced Text Algorithms Lecture 8: Multiple alignments Elena Czeizler and Ion Petre Department of IT, Abo Akademi Computational Biomodelling Laboratory http://www.users.abo.fi/ipetre/textalg

More information

CLC Phylogeny Module User manual

CLC Phylogeny Module User manual CLC Phylogeny Module User manual User manual for Phylogeny Module 1.0 Windows, Mac OS X and Linux September 13, 2013 This software is for research purposes only. CLC bio Silkeborgvej 2 Prismet DK-8000

More information

12/5/17. trees. CS 220: Discrete Structures and their Applications. Trees Chapter 11 in zybooks. rooted trees. rooted trees

12/5/17. trees. CS 220: Discrete Structures and their Applications. Trees Chapter 11 in zybooks. rooted trees. rooted trees trees CS 220: Discrete Structures and their Applications A tree is an undirected graph that is connected and has no cycles. Trees Chapter 11 in zybooks rooted trees Rooted trees. Given a tree T, choose

More information

Multiple Sequence Alignment Sum-of-Pairs and ClustalW. Ulf Leser

Multiple Sequence Alignment Sum-of-Pairs and ClustalW. Ulf Leser Multiple Sequence Alignment Sum-of-Pairs and ClustalW Ulf Leser This Lecture Multiple Sequence Alignment The problem Theoretical approach: Sum-of-Pairs scores Practical approach: ClustalW Ulf Leser: Bioinformatics,

More information

FastJoin, an improved neighbor-joining algorithm

FastJoin, an improved neighbor-joining algorithm Methodology FastJoin, an improved neighbor-joining algorithm J. Wang, M.-Z. Guo and L.L. Xing School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, P.R. China

More information

Least Common Ancestor Based Method for Efficiently Constructing Rooted Supertrees

Least Common Ancestor Based Method for Efficiently Constructing Rooted Supertrees Least ommon ncestor ased Method for fficiently onstructing Rooted Supertrees M.. Hai Zahid, nkush Mittal, R.. Joshi epartment of lectronics and omputer ngineering, IIT-Roorkee Roorkee, Uttaranchal, INI

More information

Introduction to Computational Phylogenetics

Introduction to Computational Phylogenetics Introduction to Computational Phylogenetics Tandy Warnow The University of Texas at Austin No Institute Given This textbook is a draft, and should not be distributed. Much of what is in this textbook appeared

More information

Unique reconstruction of tree-like phylogenetic networks from distances between leaves

Unique reconstruction of tree-like phylogenetic networks from distances between leaves Unique reconstruction of tree-like phylogenetic networks from distances between leaves Stephen J. Willson Department of Mathematics Iowa State University Ames, IA 50011 USA email: swillson@iastate.edu

More information

March 20/2003 Jayakanth Srinivasan,

March 20/2003 Jayakanth Srinivasan, Definition : A simple graph G = (V, E) consists of V, a nonempty set of vertices, and E, a set of unordered pairs of distinct elements of V called edges. Definition : In a multigraph G = (V, E) two or

More information

Rapid Neighbour-Joining

Rapid Neighbour-Joining Rapid Neighbour-Joining Martin Simonsen, Thomas Mailund, and Christian N.S. Pedersen Bioinformatics Research Center (BIRC), University of Aarhus, C. F. Møllers Allé, Building 1110, DK-8000 Århus C, Denmark

More information

Evolution Module. 6.1 Phylogenetic Trees. Bob Gardner and Lev Yampolski. Integrated Biology and Discrete Math (IBMS 1300)

Evolution Module. 6.1 Phylogenetic Trees. Bob Gardner and Lev Yampolski. Integrated Biology and Discrete Math (IBMS 1300) Evolution Module 6.1 Phylogenetic Trees Bob Gardner and Lev Yampolski Integrated Biology and Discrete Math (IBMS 1300) Fall 2008 1 INDUCTION Note. The natural numbers N is the familiar set N = {1, 2, 3,...}.

More information

Analysis of Algorithms

Analysis of Algorithms Algorithm An algorithm is a procedure or formula for solving a problem, based on conducting a sequence of specified actions. A computer program can be viewed as an elaborate algorithm. In mathematics and

More information

Sistemática Teórica. Hernán Dopazo. Biomedical Genomics and Evolution Lab. Lesson 03 Statistical Model Selection

Sistemática Teórica. Hernán Dopazo. Biomedical Genomics and Evolution Lab. Lesson 03 Statistical Model Selection Sistemática Teórica Hernán Dopazo Biomedical Genomics and Evolution Lab Lesson 03 Statistical Model Selection Facultad de Ciencias Exactas y Naturales Universidad de Buenos Aires Argentina 2013 Statistical

More information

Evolution of Tandemly Repeated Sequences

Evolution of Tandemly Repeated Sequences University of Canterbury Department of Mathematics and Statistics Evolution of Tandemly Repeated Sequences A thesis submitted in partial fulfilment of the requirements of the Degree for Master of Science

More information

Lesson 13 Molecular Evolution

Lesson 13 Molecular Evolution Sequence Analysis Spring 2000 Dr. Richard Friedman (212)305-6901 (76901) friedman@cuccfa.ccc.columbia.edu 130BB Lesson 13 Molecular Evolution In this class we learn how to draw molecular evolutionary trees

More information

Multiple Sequence Alignment Sum-of-Pairs and ClustalW. Ulf Leser

Multiple Sequence Alignment Sum-of-Pairs and ClustalW. Ulf Leser Multiple Sequence lignment Sum-of-Pairs and ClustalW Ulf Leser This Lecture Multiple Sequence lignment The problem Theoretical approach: Sum-of-Pairs scores Practical approach: ClustalW Ulf Leser: Bioinformatics,

More information

1 High-Performance Phylogeny Reconstruction Under Maximum Parsimony. David A. Bader, Bernard M.E. Moret, Tiffani L. Williams and Mi Yan

1 High-Performance Phylogeny Reconstruction Under Maximum Parsimony. David A. Bader, Bernard M.E. Moret, Tiffani L. Williams and Mi Yan Contents 1 High-Performance Phylogeny Reconstruction Under Maximum Parsimony 1 David A. Bader, Bernard M.E. Moret, Tiffani L. Williams and Mi Yan 1.1 Introduction 1 1.2 Maximum Parsimony 7 1.3 Exact MP:

More information

ML phylogenetic inference and GARLI. Derrick Zwickl. University of Arizona (and University of Kansas) Workshop on Molecular Evolution 2015

ML phylogenetic inference and GARLI. Derrick Zwickl. University of Arizona (and University of Kansas) Workshop on Molecular Evolution 2015 ML phylogenetic inference and GARLI Derrick Zwickl University of Arizona (and University of Kansas) Workshop on Molecular Evolution 2015 Outline Heuristics and tree searches ML phylogeny inference and

More information

Search Trees. Undirected graph Directed graph Tree Binary search tree

Search Trees. Undirected graph Directed graph Tree Binary search tree Search Trees Undirected graph Directed graph Tree Binary search tree 1 Binary Search Tree Binary search key property: Let x be a node in a binary search tree. If y is a node in the left subtree of x, then

More information

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment Scoring Dynamic Programming algorithms Heuristic algorithms CLUSTAL W Courtesy of jalview Motivations Collective (or aggregate) statistic

More information

Isometric gene tree reconciliation revisited

Isometric gene tree reconciliation revisited DOI 0.86/s05-07-008-x Algorithms for Molecular Biology RESEARCH Open Access Isometric gene tree reconciliation revisited Broňa Brejová *, Askar Gafurov, Dana Pardubská, Michal Sabo and Tomáš Vinař Abstract

More information

The History Bound and ILP

The History Bound and ILP The History Bound and ILP Julia Matsieva and Dan Gusfield UC Davis March 15, 2017 Bad News for Tree Huggers More Bad News Far more convincingly even than the (also highly convincing) fossil evidence, the

More information

Neighbour Joining. Algorithms in BioInformatics 2 Mandatory Project 1 Magnus Erik Hvass Pedersen (971055) November 2004, Daimi, University of Aarhus

Neighbour Joining. Algorithms in BioInformatics 2 Mandatory Project 1 Magnus Erik Hvass Pedersen (971055) November 2004, Daimi, University of Aarhus Neighbour Joining Algorithms in BioInformatics 2 Mandatory Project 1 Magnus Erik Hvass Pedersen (971055) November 2004, Daimi, University of Aarhus 1 Introduction The purpose of this report is to verify

More information

in interleaved format. The same data set in sequential format:

in interleaved format. The same data set in sequential format: PHYML user's guide Introduction PHYML is a software implementing a new method for building phylogenies from sequences using maximum likelihood. The executables can be downloaded at: http://www.lirmm.fr/~guindon/phyml.html.

More information

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic programming is a group of mathematical methods used to sequentially split a complicated problem into

More information

v V Question: How many edges are there in a graph with 10 vertices each of degree 6?

v V Question: How many edges are there in a graph with 10 vertices each of degree 6? ECS20 Handout Graphs and Trees March 4, 2015 (updated 3/9) Notion of a graph 1. A graph G = (V,E) consists of V, a nonempty set of vertices (or nodes) and E, a set of pairs of elements of V called edges.

More information

Sequence Alignment (chapter 6) p The biological problem p Global alignment p Local alignment p Multiple alignment

Sequence Alignment (chapter 6) p The biological problem p Global alignment p Local alignment p Multiple alignment Sequence lignment (chapter 6) p The biological problem p lobal alignment p Local alignment p Multiple alignment Local alignment: rationale p Otherwise dissimilar proteins may have local regions of similarity

More information

Lab 07: Maximum Likelihood Model Selection and RAxML Using CIPRES

Lab 07: Maximum Likelihood Model Selection and RAxML Using CIPRES Integrative Biology 200, Spring 2014 Principles of Phylogenetics: Systematics University of California, Berkeley Updated by Traci L. Grzymala Lab 07: Maximum Likelihood Model Selection and RAxML Using

More information

Throughout the chapter, we will assume that the reader is familiar with the basics of phylogenetic trees.

Throughout the chapter, we will assume that the reader is familiar with the basics of phylogenetic trees. Chapter 7 SUPERTREE ALGORITHMS FOR NESTED TAXA Philip Daniel and Charles Semple Abstract: Keywords: Most supertree algorithms combine collections of rooted phylogenetic trees with overlapping leaf sets

More information

A Fast Algorithm for Optimal Alignment between Similar Ordered Trees

A Fast Algorithm for Optimal Alignment between Similar Ordered Trees Fundamenta Informaticae 56 (2003) 105 120 105 IOS Press A Fast Algorithm for Optimal Alignment between Similar Ordered Trees Jesper Jansson Department of Computer Science Lund University, Box 118 SE-221

More information

9/29/2016. Chapter 4 Trees. Introduction. Terminology. Terminology. Terminology. Terminology

9/29/2016. Chapter 4 Trees. Introduction. Terminology. Terminology. Terminology. Terminology Introduction Chapter 4 Trees for large input, even linear access time may be prohibitive we need data structures that exhibit average running times closer to O(log N) binary search tree 2 Terminology recursive

More information
