EFFICIENT LARGE-SCALE PHYLOGENY RECONSTRUCTION


MIKLÓS CSŰRÖS AND MING-YANG KAO

Abstract. In this study we introduce two novel distance-based algorithms with provably high computational and statistical efficiency. Furthermore, we report the results of experiments simulating sequence evolution on large trees with 135, 500, and 1895 leaves, showing high success rates of our algorithms for large mutation probabilities, and high success rates of the popular Neighbor-Joining algorithm for small mutation probabilities. The efficiency of our HGT/FP and HGT/ME algorithms opens up the possibility of better speed and accuracy in large-scale phylogeny reconstruction than previously achieved.

1. Introduction. Large-scale phylogeny reconstruction is becoming a reality with the rapidly increasing availability of molecular sequences. Ambitious ventures such as the Green Plant Phylogeny Project [2] and the Ribosomal Database Project [17] aim at recovering evolutionary trees with hundreds and thousands of species.

When deriving a large tree, concerns about efficiency revolve equally around the computational speed and the statistical accuracy of the reconstruction. For example, maximum likelihood methods have considerable statistical appeal but are notoriously slow, and for this reason are rarely used on trees with more than a few tens of leaves. Distance-based methods have been favored for their computational speed, but even algorithms that build a tree with $n$ leaves in $O(n^4)$ time may not be fast enough if $n$ is in the order of thousands. On the other hand, fast algorithms that do not extract topology information efficiently enough from the input sequences may require amounts of data beyond our reach for successful reconstruction of large trees.
Recent theoretical results on the statistical efficiency of distance-based algorithms [1, 8] make them ideal candidates for large-scale phylogeny recovery, but experiments involving large trees are still rare. To our knowledge, the largest tree previously used in an experimental study of phylogeny reconstruction has 228 leaves [13]. Our goal in this paper is twofold. First, we introduce two novel distance-based algorithms with provably high computational and statistical efficiency. Second, we report the results of experiments simulating sequence evolution on trees with 135, 500, and 1895 leaves, showing high success rates of the popular Neighbor-Joining algorithm for small mutation probabilities, and high success rates of our algorithms for large mutation probabilities.

1.1. The Neyman-Jukes-Cantor model of sequence evolution. Let $A = \{1, 2, \ldots, r\}$ be a finite alphabet and let $A^+$ denote the set of all non-empty sequences over $A$. An evolutionary tree $T$ is defined by an underlying tree and a mutation model. The underlying tree is a rooted binary tree in which every internal node has exactly two children. The mutation model randomly associates a sequence of $A^+$ with each tree node. This paper employs the Neyman-Jukes-Cantor mutation model [15, 18], formulated as follows. An edge mutation probability $p_e$ is assigned to every edge $e$. Given an arbitrary root sequence $s_1 s_2 \cdots s_l \in A^l$ associated with the root, sequences are associated with the other nodes via $l$ random node labelings. The labelings are mutually independent. The $k$-th character of every sequence is generated in the $k$-th labeling of the nodes with characters of $A$, starting at the root and proceeding towards the leaves along the edges in the following manner. The root is labeled with $s_k$. On edge $e$, the child's label is picked randomly according to

$$\Pr\{\text{child's label is } i \mid \text{parent's label is } j\} = \begin{cases} 1 - p_e & \text{if } i = j; \\ \dfrac{p_e}{r-1} & \text{otherwise.} \end{cases}$$
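The labeling process can be illustrated with a short simulation. The following is a minimal sketch, not the simulator used in the experiments: the function names `mutate` and `evolve`, the `children` dictionary format, and the use of integer labels 1..r are all assumptions made for illustration.

```python
import random

def mutate(parent_label, p_e, r, rng):
    """Label a child across an edge with mutation probability p_e.

    Labels are the integers 1..r. The child keeps the parent's label
    with probability 1 - p_e; otherwise each of the other r - 1 labels
    is chosen with probability p_e / (r - 1).
    """
    if rng.random() >= p_e:
        return parent_label
    other = rng.randrange(1, r)  # uniform on 1..r-1
    # shift values at or above parent_label so that it is skipped
    return other if other < parent_label else other + 1

def evolve(children, root, root_seq, r, rng):
    """Return {node: sequence} for every node of a rooted tree.

    children maps a node to a list of (child, p_e) pairs; this input
    format is hypothetical. root_seq is the sequence at the root.
    """
    seqs = {root: list(root_seq)}
    stack = [root]
    while stack:
        node = stack.pop()
        for child, p_e in children.get(node, []):
            # every position is labeled independently of the others
            seqs[child] = [mutate(c, p_e, r, rng) for c in seqs[node]]
            stack.append(child)
    return seqs
```

Calling `evolve` with a seeded `random.Random` instance gives reproducible samples; the sequences stored at the leaves form the input to a reconstruction algorithm.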
Department of Computer Science, Yale University, New Haven, CT 06520; {csuros-miklos, kao-mingyang}@cs.yale.edu.

For the sake of simplicity, we assume that the characters of the root sequence are independent and identically distributed, drawn according to the root label distribution $\pi = (\pi_1, \ldots, \pi_r)$. Let the node set of the evolutionary tree be $V$. Define the random labels $(\xi(u) : u \in V)$ as random variables taking values in $A$ whose joint distribution is defined by the random node labeling process described above, with the root label picked randomly according to $\pi$.

1.2. Evolutionary tree reconstruction algorithms. For an evolutionary tree $T$, the topology $\Psi(T)$ is the unrooted binary tree obtained from the underlying rooted tree by removing the edge directions, and replacing the root and its incident edges with one single edge connecting the root's children. The problem of evolutionary tree reconstruction is that of finding $\Psi(T)$ from sequences associated with its leaf set $L$. A sample of length $l$ generated by $T$ is a set of length-$l$ sequences $(X(u) : u \in L)$, where the character vectors for the positions $k = 1, \ldots, l$ are independent and identically distributed as defined by the random labeling process. Consequently, for each position, the vector of characters in that position is distributed identically to $(\xi(u) : u \in L)$.

The output of an evolutionary tree reconstruction algorithm is an unrooted binary tree $\Psi^*$ with the same leaf set $L$ as $\Psi(T)$. The algorithm succeeds if $\Psi^* = \Psi(T)$. The success rate of the algorithm on samples of length $l$ is the probability of $T$ generating a set of length-$l$ sequences on which the algorithm succeeds. The statistical efficiency of the algorithm is measured by the minimum sample length required to achieve a given success rate $1 - \delta$, where $0 < \delta < 1$ is the error probability sought. Let $n$ denote the number of leaves in $\Psi(T)$. An algorithm is statistically efficient if the minimum sample length is polynomial in $n$ and $\log(1/\delta)$.
Evolutionary tree reconstruction algorithms are usually grouped into three categories [10, 26]: maximum likelihood, character-based, and distance-based algorithms. Maximum likelihood algorithms attempt to find a tree $\Psi^*$ that is the topology of the evolutionary tree giving the highest probability of generating the observed sequences. All known maximum likelihood algorithms besides exhaustive search use heuristic optimization with no accuracy guarantees. The most commonly used character-based algorithms are parsimony algorithms, which attempt to find a $\Psi^*$ that minimizes the number of character changes on the edges. This task is NP-hard [11], and widely used heuristics are without accuracy guarantees. Furthermore, it has been shown [9] that no sample length assures that parsimony methods recover the topology of certain evolutionary trees. In other words, parsimony is not statistically consistent for all trees. Distance-based algorithms utilize a matrix of pairwise distances between the observed sequences and build a tree from the matrix. Arguably, the most widely used distance-based algorithm is Neighbor-Joining of Saitou and Nei [24], which builds a tree with $n$ leaves in $O(n^3)$ running time [25].

1.3. Evolutionary distances. Let $T$ be an evolutionary tree with the Neyman-Jukes-Cantor mutation model. The distance between two tree nodes $u$ and $v$, using their labels $\xi(u)$, $\xi(v)$ in a random labeling, is defined as

$$D(u, v) = -\ln\left( \Pr\{\xi(u) = \xi(v)\} - \frac{\Pr\{\xi(u) \neq \xi(v)\}}{r-1} \right),$$

with the convention that $\ln x = -\infty$ whenever $x \le 0$. The length of an edge $e = uv$ is defined as $D(e) = D(u, v)$. Distance-based algorithms attempt to find $\Psi(T)$ based on the additive property of $D$, i.e., that for every three nodes $u$, $v$, and $w$ of $\Psi(T)$, if $v$ lies on the path between $u$ and $w$ in $\Psi(T)$, then $D(u, v) + D(v, w) = D(u, w)$. Distance-based algorithms work with distance estimates computed from the input sequences. Let $\chi$ denote the indicator function.
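If $u$ and $u'$ are leaves with associated sample sequences $s_1 s_2 \cdots s_l$ and $s'_1 s'_2 \cdots s'_l$, respectively, then the empirical distance between them is defined as

$$\hat{D}(u, u') = -\ln\left( \frac{1}{l} \sum_{k=1}^{l} \left( \chi\{s_k = s'_k\} - \frac{\chi\{s_k \neq s'_k\}}{r-1} \right) \right).$$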
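As an illustration, the empirical distance can be computed directly from two aligned sequences. The sketch below also converts between an edge mutation probability and the corresponding edge length, using $D(e) = -\ln(1 - r p_e/(r-1))$, which follows from the definition of $D$ with $\Pr\{\xi(u) \neq \xi(v)\} = p_e$ on a single edge. The function names are assumptions made for this example.

```python
import math

def empirical_distance(s, t, r):
    """Empirical distance between two equal-length sequences over an
    alphabet of size r; returns math.inf when the logarithm's argument
    is not positive, following the convention ln x = -inf for x <= 0."""
    if len(s) != len(t) or len(s) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    l = len(s)
    matches = sum(1 for a, b in zip(s, t) if a == b)
    x = (matches - (l - matches) / (r - 1)) / l
    return -math.log(x) if x > 0 else math.inf

def edge_length(p_e, r):
    """Length D(e) of an edge with mutation probability p_e."""
    x = 1.0 - p_e * r / (r - 1)
    return -math.log(x) if x > 0 else math.inf

def mutation_probability(length, r):
    """Inverse of edge_length: the p_e that yields a given edge length."""
    return (r - 1) / r * (1.0 - math.exp(-length))
```

For DNA ($r = 4$), `mutation_probability(0.1, 4)` and `mutation_probability(1.0, 4)` evaluate to roughly 0.07 and 0.47, respectively.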

A1    Initialize $\Psi^*$ as the star formed by a triplet and its center.
A2    For each remaining leaf $q$,
A2.1    Select a triplet $quv$ with $u, v \in \Psi^*$ and $q \notin \Psi^*$ defining a new inner node $c$ on an edge $e \in \Psi^*$.
A2.2    Add $c$ on edge $e$ and connect $q$ to it.

Fig. 1. A possible general outline for a distance-based algorithm.

2. The algorithms.

2.1. The Harmonic Greedy Triplets Principle. A triplet $uvw$ consists of three distinct leaves $u$, $v$, and $w$ of $T$. Every triplet defines an internal node at which the three pairwise paths between the leaves intersect, with the four nodes forming a star-shaped topology. This internal node is the center of the triplet. By additivity of $D$, the distance between the center $c$ and the leaf $u$ can be obtained by $D(u, c) = \bigl( D(u, v) + D(u, w) - D(v, w) \bigr)/2$, which is estimated by

$$D'(u, c) = \mathrm{Ctr}(\hat{D}, uvw) = \frac{\hat{D}(u, v) + \hat{D}(u, w) - \hat{D}(v, w)}{2} \qquad (1)$$

if all three estimated distances are finite. This equation is the basic formula for estimating edge lengths in many distance-based algorithms. The formula can be used in conjunction with distance-based algorithms of the outline shown in Figure 1. Whether a triplet defines a new inner node in Step A2.1 of the general outline can be judged by employing Equation (1) in conjunction with the estimated distances $\hat{D}$. In particular, if $c$ is the center of a triplet $uvw$ and $c'$ is the center of a triplet $uv'w'$, then by calculating

$$D'(c, c') = \mathrm{Ctr}(\hat{D}, uvw) - \mathrm{Ctr}(\hat{D}, uv'w'), \qquad (2)$$

$D'(c, c')$ can serve as an estimate of $D(c, c')$. Two triplets to which Equation (2) applies, i.e., which share at least one leaf, are called matching triplets.

The Harmonic Greedy Triplets (HGT) principle provides a guideline for the triplet selection mechanism in Steps A1 and A2.1, regardless of how separate triplet centers are recognized. The selection is based on the analysis of the error in Equation (1). Let $uvw$ be a triplet and define $H(u, v, w) = H\bigl(e^{-D(u,v)}, e^{-D(u,w)}, e^{-D(v,w)}\bigr)$, where $H$ denotes the harmonic mean.
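Equations (1) and (2) translate into a few lines of code. The following is a minimal sketch under the assumption that the distance matrix is stored as a nested dictionary `D[u][v]`; the names `ctr` and `center_gap` are hypothetical.

```python
import math

def ctr(D, u, v, w):
    """Equation (1): estimated distance from leaf u to the center of
    the triplet uvw, or None if any of the three distances is infinite."""
    duv, duw, dvw = D[u][v], D[u][w], D[v][w]
    if math.isinf(duv) or math.isinf(duw) or math.isinf(dvw):
        return None
    return (duv + duw - dvw) / 2.0

def center_gap(D, u, v, w, v2, w2):
    """Equation (2): signed estimate of the distance between the centers
    of the matching triplets u,v,w and u,v2,w2 (both contain u)."""
    a, b = ctr(D, u, v, w), ctr(D, u, v2, w2)
    if a is None or b is None:
        return None
    return a - b
```

On an additive matrix the two functions return exact values. For instance, on a four-leaf tree with edge lengths 0.1, 0.2, 0.4, 0.5 for the leaves a, b, c, d and an inner edge of length 0.3 separating {a, b} from {c, d}, `ctr(D, "a", "b", "c")` equals 0.1 and the magnitude of `center_gap(D, "a", "b", "c", "c", "d")` equals the inner edge length 0.3.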
We showed [7] that the error in Equation (1) depends on $H(u, v, w)$ in the sense that for every $0 < \epsilon < 1$,

$$\Pr\left\{ \bigl| \mathrm{Ctr}(\hat{D}, uvw) - \mathrm{Ctr}(D, uvw) \bigr| \ge -\frac{\ln(1 - \epsilon)}{2} \right\} \le 3 \exp\bigl( -\beta l \epsilon^2 H^2(u, v, w) \bigr),$$

where $\beta > 1/9$ is a constant depending on the alphabet size. The HGT principle is that the selection in Steps A1 and A2.1 should be a greedy selection of the triplet $uvw$ with the largest average similarity defined by

$$\hat{H}(u, v, w) = \frac{3}{e^{\hat{D}(u,v)} + e^{\hat{D}(u,w)} + e^{\hat{D}(v,w)}}.$$

If any of the empirical distances equals $\infty$, then $\hat{H}(u, v, w) = 0$.

2.2. HGT and the Four-Point Condition. An evolutionary tree with four leaves has three possible topologies (see Figure 2), denoted by $uv|wz$, $uw|vz$, and $uz|vw$. Based on distance additivity, Buneman's four-point condition [4] states that the topology with four leaves is $uv|wz$ if and only if

$$D(u, v) + D(w, z) < D(u, w) + D(v, z) = D(u, z) + D(v, w).$$

If $D(u, v)$ is positive and finite for every leaf pair $(u, v)$, then $\Psi(T)$ is uniquely determined and can be obtained by Buneman's algorithm, deriving the topology from the leaf quartet topologies obtained through repeated uses of the four-point condition. Since the equality of the two larger sums in the condition is not likely when using estimated distances, equality on the right-hand side should not be checked in that case. The relaxed four-point condition for topology $uv|wz$ is defined as

$$\hat{D}(u, v) + \hat{D}(w, z) < \hat{D}(u, w) + \hat{D}(v, z); \qquad \hat{D}(u, v) + \hat{D}(w, z) < \hat{D}(u, z) + \hat{D}(v, w). \qquad (3)$$

Fig. 2. Three possible topologies on four leaves. The quartet topologies are denoted by $uv|wz$, $uw|vz$, and $uz|vw$, respectively.

Figure 3 describes our algorithm HGT/FP, which is based on the HGT principle and employs the relaxed four-point condition to recognize separate triplet centers. The HGT/FP algorithm follows the outline of Figure 1. The algorithm uses Equation (2) to estimate edge lengths. In order to ensure that only matching triplets are compared, the algorithm maintains a set $\mathrm{def}(z)$ for each node $z$, so that $\mathrm{def}(z) = \{z\}$ if $z$ is a leaf, and $\mathrm{def}(z) = \{q, u, v\}$ if $z$ is added to $\Psi^*$ using the triplet $quv$. Notice that there is a mapping from nodes of $\Psi^*$ to nodes of $T$ defined by $\mathrm{def}(\cdot)$, so that if $z \in \Psi^*$ is a leaf, then it corresponds to the same leaf in $T$; otherwise it corresponds to the center of $\mathrm{def}(z)$ in $T$.

The algorithm utilizes a greedy mechanism in the triplet selection, favoring triplets with large $\hat{H}$. In the initialization step, the greedy selection is from the $O(n^2)$ triplets that contain the same arbitrarily fixed leaf. In iteration steps, the algorithm only considers edge-triplet pairs $\langle z_1 z_2, quv \rangle$ where $u \in \mathrm{def}(z_1)$, $v \in \mathrm{def}(z_2)$, and $z_1 z_2$ is an edge on the path between $u$ and $v$ in $\Psi^*$. Such pairs are called relevant pairs. By definition, there are $O(n)$ relevant pairs for every edge.
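The three ingredients described so far, namely the average similarity $\hat{H}$, the relaxed four-point condition of Equation (3), and the greedy initial selection among the $O(n^2)$ triplets containing a fixed leaf, can be sketched as follows. This is an illustration only, assuming a nested-dictionary distance matrix `D[u][v]`; the function names are hypothetical and the bookkeeping of the full HGT/FP algorithm (the sets $\mathrm{def}(z)$ and relevant pairs) is not shown.

```python
import math
from itertools import combinations

def avg_similarity(D, u, v, w):
    """Average similarity H-hat of the triplet uvw; 0 if any of the
    three empirical distances is infinite."""
    ds = (D[u][v], D[u][w], D[v][w])
    if any(math.isinf(d) for d in ds):
        return 0.0
    return 3.0 / sum(math.exp(d) for d in ds)

def relaxed_four_point(D, u, v, w, z):
    """Equation (3): True if the estimated distances support the
    quartet topology uv|wz."""
    short = D[u][v] + D[w][z]
    return short < D[u][w] + D[v][z] and short < D[u][z] + D[v][w]

def initial_triplet(D, leaves):
    """Greedy initialization: fix the first leaf and scan all pairs of
    the remaining leaves for the triplet with the largest H-hat.
    Returns (None, 0.0) when every candidate has H-hat equal to 0."""
    u = leaves[0]
    best, best_h = None, 0.0
    for v, w in combinations(leaves[1:], 2):
        h = avg_similarity(D, u, v, w)
        if h > best_h:
            best, best_h = (u, v, w), h
    return best, best_h
```

The scan in `initial_triplet` evaluates one triplet per pair of remaining leaves, which matches the quadratic cost of the initialization step.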
The topology is recovered successfully if HGT/FP selects a relevant pair $\langle z_1 z_2, quv \rangle$ in each iteration step for which the center of the triplet $quv$ in $T$ falls properly onto the path between $z_1$ and $z_2$. The HGT/FP algorithm tests whether the center of a triplet $quv$ falls onto an edge $z_1 z_2$ by employing the relaxed four-point condition of Equation (3) in the following manner. Let $z_1$ be an internal node in $\Psi^*$, let $\mathrm{def}(z_1) = \{w, w', w''\}$, and assume that $z_2$ lies on the path between $z_1$ and $w$ in $\Psi^*$, without loss of generality. HGT/FP tests whether the relaxed four-point condition holds for $wq|w'w''$. If the condition holds, then for the center $c$ of $quv$, either $c$ lies on the path between $z_1$ and $z_2$, or $z_2$ lies on the path between $z_1$ and $c$. The relaxed four-point condition is checked similarly for $z_2$ if it is an internal node. If $z_i$ is a leaf, the condition for $z_i$ is not tested. If the one or two tested conditions hold for the pair, then it is a good relevant pair.

The HGT/FP algorithm maintains a set $M$ of good relevant pairs throughout the iterations. In each step, the set $M$ has one entry for each leaf $q \notin \Psi^*$, so that $M[q]$ is either null, or it is a good relevant pair $\langle z_1 z_2, quv \rangle$ such that $\hat{H}(q, u, v)$ is maximal. The set is updated every time a new triplet is added by testing the $O(n)$ relevant pairs with respect to the newly created edges in $\Psi^*$. This mechanism is implemented in the Update-M subroutine detailed in Figure 4.

Algorithm Harmonic Greedy Triplets with Four Point Condition
Input: $n \times n$ distance matrix containing the values $\hat{D}(u, v)$ for all leaf pairs $(u, v)$.
Output: $\Psi^*$.
F1    Select an arbitrary leaf $u$ and find a triplet $uvw$ with the maximum $\hat{H}(u, v, w)$.
F2    if $\hat{H}(u, v, w) = 0$ then let $\Psi^*$ be the empty tree, fail, and stop.
F3    Let $\Psi^*$ be the star with three edges formed by $uvw$ and its center $c$.
F4    Set $\mathrm{def}(c) \leftarrow \{u, v, w\}$.
F5    First set all $M[x]$ to null; then for edges $zc$ with $z \in \{u, v, w\}$, Update-M($zc$).
F6    repeat
F7      if $M[z] = $ null for all $z \in L$ then fail and stop.
F8      Find $M[q] = \langle z_1 z_2, quv \rangle$ with the maximum $\hat{H}(q, u, v)$.
F9      Split $z_1 z_2$ into two edges $z_1 c$ and $z_2 c$ in $\Psi^*$.
F10     Add to $\Psi^*$ the leaf $q$ and an edge $qc$.
F11     Set $\mathrm{def}(c) \leftarrow \{q, u, v\}$.
F12     For every $z$ with $M[z]$ containing the edge $z_1 z_2$, set $M[z] \leftarrow$ null.
F13     For each $zc \in \{z_1 c, z_2 c, qc\}$, Update-M($zc$).
F14   until all leaves are inserted to $\Psi^*$; i.e., this loop has iterated $n - 3$ times.
F15   Output $\Psi^*$.

Fig. 3. The HGT/FP algorithm. The Update-M subroutine is detailed in Figure 4. Calculations pertaining to edge length estimation are not shown here for brevity.

Theorem 1. The running time of the HGT/FP algorithm on a tree with $n$ leaves is $O(n^2)$. The algorithm uses $O(n)$ work space.

Proof. (Sketch.) The initialization step takes $O(n^2)$ time. Each iteration step takes $O(n)$ time, and there are $(n - 3)$ steps. The work space needed in addition to $O(1)$ local variables comprises the storage of $\Psi^*$ in $O(n)$ space, and the space occupied by $M$, also of $O(n)$.

The statistical efficiency of the HGT/FP algorithm is stated by the following theorem (proof omitted).

Theorem 2. Let $T$ be an arbitrary evolutionary tree with $n$ leaves, such that there exist $0 < f \le g < 1 - 1/r$ bounding the edge mutation probabilities, i.e., for each edge $e$, $f \le p_e \le g$. For every error probability $0 < \delta < 1$, there exists

$$l = O\left( \frac{\log\frac{1}{\delta} + \log n}{\left(1 - \frac{r}{r-1}\, g\right)^{\gamma} f^2} \right), \qquad (4)$$

with $\gamma = O(\log n)$, such that the success rate of HGT/FP is at least $(1 - \delta)$ on samples of length $l$. Moreover, for almost every tree topology under the uniform or Yule-Harding distributions, $\gamma = O(\log \log n)$.

Remark. Erdős et al. [8] closely studied certain evolutionary tree topology properties under different topology distributions, including the Yule-Harding distribution [12], and presented the first distance-based algorithms with provable statistical efficiency.

2.3. HGT and the minimum evolution heuristic. The input to distance-based algorithms is the $n \times n$ matrix $\hat{D} = [\hat{D}(u, v) : u, v \in L]$. In view of distance additivity, the matrix of true distances $D = [D(u, v) : u, v \in L]$ describes the unrooted tree $\Psi(T)$ with edges weighted by their length, and each entry of $D$ is the sum of edge weights on the path between the leaves corresponding to the entry. Distance-based algorithms output an unrooted tree $\Psi^*$ with estimated edge weights $d^*$. The minimum evolution heuristic [22] attempts to minimize $\|\Psi^*\| = \sum_{uv \in \Psi^*} d^*(uv)$ given certain constraints on $\Psi^*$ built from the empirical distances $\hat{D}$.

Algorithm Update-M
Input: an edge $z_1 z_2 \in \Psi^*$
U1    for each triplet $quv$ with $q \notin \Psi^*$, $u \in \mathrm{def}(z_1)$, $v \in \mathrm{def}(z_2)$,
U2      if $\langle z_1 z_2, quv \rangle$ is a good relevant pair
U3      then assign $\langle z_1 z_2, quv \rangle$ to $M[q]$ if $\hat{H}(q, u, v)$ is greater than that of $M[q]$.

Fig. 4. The Update-M subroutine. The subroutine tests every relevant pair for the edge $z_1 z_2$ via the relaxed four-point condition in order to determine which triplets can be used to add a node on $z_1 z_2$.

The Fast Harmonic Greedy Triplets (Fast-HGT) algorithm [6] is a distance-based algorithm, also based on the HGT principle, which uses a minimum distance parameter $D_{\min}$ to recognize separate triplet centers. The Fast-HGT algorithm follows the general outline of Figure 1, and uses only relevant pairs in Step A2.1. For a relevant pair $\langle z_1 z_2, quv \rangle$, Equation (2) is employed to estimate the distance between the centers of the triplets $\mathrm{def}(z_1)$ and $quv$, or $\mathrm{def}(z_2)$ and $quv$. If one of the computed values $D'(z_i, c)$ falls between $-D_{\min}$ and $D_{\min}$, then $quv$ is judged to define the node $z_i$ and is thus not used for adding new nodes.

Changing the parameter $D_{\min}$ results in different trees $\Psi^*(D_{\min})$ output by Fast-HGT. The edge length estimates delivered by Fast-HGT can be used in applying the minimum evolution heuristic for selecting the minimum distance parameter $D_{\min}$. The resulting algorithm, HGT/ME, aims at minimizing $\|\Psi^*(D_{\min})\|$ as a function of $D_{\min}$. We have found that $\|\Psi^*(D_{\min})\|$ is a nearly unimodal function of $D_{\min}$, thus Golden Section Search [20] gives satisfying results for the minimization. Since the number of $D_{\min}$ values giving different results is determined by the granularity of $\hat{D}$, only $O(\log l)$ iterations are needed and all use the same distance matrix. The HGT/ME algorithm is described in Figure 5.

Theorem 3. The running time of the HGT/ME algorithm building a tree with $n$ leaves is $O(n^2 \log l)$. The algorithm uses $O(n)$ work space.

Proof. By setting the minimum bracketing value $\epsilon = l^{-1/2}$, Fast-HGT is executed at most $\log_2 l$ times in Step M1, and at most $2 + \log_\beta(2l)$ times in Step M2 with $\beta = (1 + \sqrt{5})/2$. Since the running time of Fast-HGT is $O(n^2)$, the running time of HGT/ME is as stated by the theorem. Fast-HGT needs $O(n)$ space, and HGT/ME needs to store at most four topologies at a time to carry out the minimization.

3. Experimental results.

3.1. Robinson-Foulds distance. Simulated experiments are often used to assess the statistical efficiency of evolutionary tree reconstruction algorithms. Simulation consists of generating sample sequences by $T$ given the mutation model. The output $\Psi^*$ of the algorithm is compared to $\Psi(T)$ using distance measures between unrooted binary trees. We use the Robinson-Foulds distance [21] for this purpose, defined as follows. Let $\Psi$ be an unrooted binary tree with leaf set $L$. A split generated by an edge $e$ is the unordered pair $(L_1, L_2)$, where $L_1$ and $L_2$ are the leaf sets of the two subtrees obtained by removing $e$ from $\Psi$. The split set $S(\Psi)$ is the set of all splits generated by edges of $\Psi$. Let $\Psi_1$, $\Psi_2$ be two unrooted trees with the same leaf set $L$ and let $n = |L|$. The normalized Robinson-Foulds distance between $\Psi_1$ and $\Psi_2$ is defined as

$$\mathrm{RF\%}(\Psi_1, \Psi_2) = \frac{|S(\Psi_1)| + |S(\Psi_2)| - 2\,|S(\Psi_1) \cap S(\Psi_2)|}{2(n - 3)},$$

which is always between 0 and 100%. We say that an evolutionary tree reconstruction algorithm has $\delta$ Robinson-Foulds error on a given sample generated by $T$ if $\mathrm{RF\%}(\Psi^*, \Psi(T)) = \delta$ for the algorithm's output $\Psi^*$.
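The normalized Robinson-Foulds distance can be computed directly from the definition. The sketch below assumes each unrooted binary tree is given as a list of undirected edges with comparable, hashable node names and that leaves are the nodes of degree one; the function names are hypothetical. It removes each edge in turn, so it takes quadratic time and is meant as an illustration rather than an efficient implementation.

```python
def split_set(edges):
    """Return (set of splits, number of leaves) for an unrooted tree.

    Each split is stored as the frozenset of leaves on the side that
    does not contain a fixed reference leaf, which makes the unordered
    pair (L1, L2) canonical.
    """
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    leaves = frozenset(v for v, nb in adj.items() if len(nb) == 1)
    ref = min(leaves)
    splits = set()
    for a, b in edges:
        # collect the leaves reachable from b without crossing edge ab
        side, seen, stack = set(), {a, b}, [b]
        while stack:
            v = stack.pop()
            if v in leaves:
                side.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        side = frozenset(side)
        splits.add(leaves - side if ref in side else side)
    return splits, len(leaves)

def rf_percent(edges1, edges2):
    """Normalized Robinson-Foulds distance, in percent, between two
    unrooted binary trees on the same leaf set (n >= 4 leaves)."""
    s1, n = split_set(edges1)
    s2, _ = split_set(edges2)
    return 100.0 * (len(s1) + len(s2) - 2 * len(s1 & s2)) / (2 * (n - 3))
```

Splits generated by leaf edges are shared by any two trees on the same leaf set, so they cancel in the numerator and only the $n - 3$ internal edges of each tree can contribute to the distance.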

Algorithm Fast Harmonic Greedy Triplets with Minimum Evolution Heuristic
Input: $n \times n$ empirical distance matrix $\hat{D}$ and a small positive bracketing value $\epsilon$, say, $\epsilon = l^{-1/2}$.
Output: $\Psi^*$.
M1    Find $a = 2^k \epsilon$ with the smallest $k = 1, 2, \ldots, -\log_2 \epsilon$ such that Fast-HGT builds a full tree with $D_{\min} = 2^k \epsilon$.
M2    Minimize $\|\Psi^*(D_{\min})\|$ on the interval $D_{\min} \in [0, a]$ with Golden Section Search using $\epsilon$ as the minimum bracketing size, and output $\Psi^*$ built by Fast-HGT at that value.

Fig. 5. The HGT/ME algorithm.

3.2. Experimental procedure. We simulated DNA sequence evolution along three large trees in the Neyman-Jukes-Cantor model. The topologies returned by HGT/FP and HGT/ME were compared to the ones returned by Neighbor-Joining, which is the most popular distance-based method and has reportedly (e.g., [23, 14]) achieved high experimental success rates on difficult trees with respect to other evolutionary tree reconstruction algorithms. In the experiments we used qclust [3], which implements the $O(n^3)$ Neighbor-Joining algorithm [25]. The topologies were compared to $\Psi(T)$ using the normalized Robinson-Foulds distance RF%.

3.3. 135-leaf tree. The tree is based on a phylogeny derived from mitochondrial DNA sequences in the course of debating the African origin of humans [16]. We scaled the edge lengths linearly from the originally calculated number of character changes per edge, so that all edge lengths fall into the interval [0.125, 1.0], corresponding to mutation probabilities between 0.09 and 0.47. This same scaled tree was also used by [14] in similar experiments. The performance of HGT/ME and HGT/FP on the 135-leaf tree is compared to that of Neighbor-Joining in Figure 6. HGT/ME and HGT/FP perform slightly better starting at sequence lengths of 500 and converge faster to recover the topology than Neighbor-Joining.

Fig. 6. Experimental results on the 135-leaf tree (high mutation probabilities) with edge mutation probabilities between 0.09 and 0.47. The plot shows the Robinson-Foulds error (RF%) of Neighbor-Joining, HGT/ME, and HGT/FP against sample length, observed on ten separate samples for each length. The graphs go through the median values.

3.4. 500-leaf tree. In a set of experiments we used a 500-leaf tree with the topology of a seed plant phylogeny based on rbcL gene sequences from [5]. We scaled the edge lengths from the original number of character changes per edge so that all edge lengths fall into the range [0.1, 1.0], corresponding to mutation probabilities between 0.07 and 0.47. Our algorithms outperform Neighbor-Joining from around sample length $l = 200$, and miss only 3% of the edges at $l = $ […]. It is worth pointing out that deriving the original tree took several months of computer time employing parsimony methods, while HGT/FP, HGT/ME, and Neighbor-Joining produce their output in a few seconds on a desktop computer.

3.5. 1895-leaf tree. In another set of experiments, we used a 1895-leaf tree derived from the evolutionary tree of Eukaryotes based on 2S sequences in the Ribosomal Database Project [17]. After removing a subtree containing distantly related taxa from the original tree of 2055 leaves, we scaled the edge lengths linearly so that they fell into the interval [0.1, 1.0]. HGT/FP and HGT/ME converge quickly, so that they miss only one edge on the majority of samples of length 2000, and steadily recover the topology from samples of length […]. Neighbor-Joining's performance improves only three-fold between sample lengths of 200 and 10000, and it still misplaces edges on sequences of length 5000.

We also conducted experiments on a differently scaled version of the 1895-leaf tree, in which the edge lengths were linearly mapped onto the [0.01, 1.0] interval. Most of the edges in this tree are short, with around 60% of them having the shortest edge length, corresponding to mutation probability […]. The results of the experiments are shown in Figure 8. In accordance with previous findings in simulations with different smaller trees (e.g., [23]), Neighbor-Joining performs well, achieving high success rates at relatively short sample sequences. HGT/FP is more sensitive to short edge lengths due to the greedy selection of triplets at its core. At large sample sizes, however, it does converge more quickly than Neighbor-Joining on the highly divergent tree and misplaces very few edges from $l = 5000$ on.

4. Concluding remarks. We have presented two novel efficient algorithms for building large divergent evolutionary trees. The HGT/FP algorithm builds the topology in $O(n^2)$ time from an $n \times n$ distance matrix, while HGT/ME runs in $O(n^2 \log l)$ time when the input matrix is derived from a sample of length $l$. Our algorithms achieve high success rates on large trees with 135, 500, and 1895 leaves with large mutation probabilities.

When working with trees with over one thousand leaves, the running time of the algorithms becomes crucial. Existing $O(n^4)$-time evolutionary tree building algorithms may take days to finish on today's desktop computers, and slower algorithms are virtually unusable without considerable insight into biological features of the data set at hand. Space requirements also need to be considered, since the number of pairwise distances between 1895 leaves is already around 1.8 million. For example, algorithms working with quartets such as Buneman's [4] may have to deal with about half a trillion quartets in such a large tree.
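The two counts quoted above are binomial coefficients and are easy to check. The snippet below is a small sanity check, taking $n = 1895$ as the number of leaves, which is the value consistent with the quoted figure of about 1.8 million pairwise distances; the function name is made up for this example.

```python
import math

def pair_and_quartet_counts(n):
    """Number of unordered leaf pairs and of leaf quartets among n leaves."""
    return math.comb(n, 2), math.comb(n, 4)

pairs, quartets = pair_and_quartet_counts(1895)
print(pairs)     # number of pairwise distances
print(quartets)  # number of leaf quartets
```

For $n = 1895$ this gives 1,794,565 pairs and a quartet count between $5.3 \times 10^{11}$ and $5.4 \times 10^{11}$, in line with "around 1.8 million" and "about half a trillion".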
In addition to computational issues, statistical characteristics of algorithms also become more stressed as one builds larger trees. Neighbor-Joining and most other algorithms have not been proven to require asymptotically polynomial sample sizes in order to correctly recover the topology, while HGT/FP and the Fast-HGT algorithm used by HGT/ME are statistically efficient. Atteson [1] proved for Neighbor-Joining the existence of bounds similar to those of Theorem 2, but with $\gamma = O(n)$ in the exponent. The exponential sample length bound is essentially due to the fact that Neighbor-Joining calculates edge lengths from an average that involves distances between arbitrarily remote nodes in $T$, for which the estimation error may be large. However, when the mutation probabilities are small, Neighbor-Joining's approach may be justified, as illustrated by the example of the 1895-leaf tree with small mutation probabilities. Neighbor-Joining averages many estimated distances. When the edge mutation probabilities are small, the distances are small and do not differ by much, so the average provides more accurate information about the topology than a greedy approach does. On the other hand, the error committed while calculating the average is governed by the error in the estimation of the largest distance in the expression, which may be significant when the mutation probabilities are large. Consequently, the statistical performance of Neighbor-Joining is less stable, and a greedy algorithm may provide better efficiency in the case of large mutation probabilities.

Many possible applications of evolutionary tree building algorithms may need to build large trees. Examples include large projects in evolutionary biology such as the ones cited [16, 17] and in epidemiology [19]. It will be of practical importance to determine which of the existing algorithms are the most suitable for the ranges of mutation probabilities and tree topologies defined by the application at hand.

Fig. 7. Experimental results on the 500-leaf tree (high mutation probabilities) with edge mutation probabilities ranging between 0.07 and 0.47. The plot shows RF% against sample length for Neighbor-Joining, HGT/ME, and HGT/FP.

Fig. 8. Experimental results on the 1895-leaf tree with edge mutation probabilities ranging between 0.07 and 0.47 on the left (high mutation probabilities), and between […] and 0.47 on the right (low mutation probabilities). Both plots show RF% against sample length for Neighbor-Joining, HGT/ME, and HGT/FP.

REFERENCES

[1] K. Atteson, The performance of neighbor-joining algorithms of phylogeny reconstruction, in COCOON '97, vol. 1276 of Lecture Notes in Computer Science, Springer-Verlag, 1997.
[2] K. S. Brown, Deep Green rewrites evolutionary history of plants, Science, 285 (1999).
[3] J. Brzustowski, qclust V0.2.
[4] P. Buneman, The recovery of trees from dissimilarity matrices, in Mathematics in the Archaeological and Historical Sciences, F. R. Hodson, D. G. Kendall, and P. Tautu, eds., Edinburgh University Press, 1971.
[5] M. W. Chase, D. E. Soltis, R. G. Olmstead, D. Morgan, D. H. Les, B. D. Mishler, M. R. Duvall, R. A. Price, H. G. Hills, Y.-L. Qiu, K. A. Kron, J. H. Rettig, E. Conti, J. D. Palmer, J. R. Manhart, K. J. Sytsma, H. J. Michaels, W. J. Kress, K. G. Karol, W. D. Clark, M. Hedrén, B. S. Gaut, R. K. Jansen, K.-J. Kim, C. F. Wimpee, J. F. Smith, G. R. Furnier, S. H. Strauss, Q.-Y. Xiang, G. M. Plunkett, P. M. Soltis, S. M. Swensen, S. E. Williams, P. A. Gadek, C. J. Quinn, L. E. Eguiarte, E. Golenberg, G. H. Learn, Jr., S. W. Graham, S. C. H. Barrett, S. Dayanandan, and V. A. Albert, Phylogenetics of seed plants: An analysis of nucleotide sequences from the plastid gene rbcL, Annals of the Missouri Botanical Garden, 80 (1993).
[6] M. Csűrös and M.-Y. Kao, Fast recovery of evolutionary trees through Harmonic Greedy Triplets, submitted for journal publication.
[7] M. Csűrös and M.-Y. Kao, Recovering evolutionary trees through Harmonic Greedy Triplets, in SODA '99, ACM/SIAM, 1999.
[8] P. L. Erdős, M. A. Steel, L. A. Székely, and T. J. Warnow, A few logs suffice to build (almost) all trees (I), Random Structures and Algorithms, 14 (1999). Preliminary version as DIMACS TR97-71.
[9] J. Felsenstein, Cases in which parsimony or compatibility methods will be positively misleading, Systematic Zoology, 27 (1978).
[10] J. Felsenstein, Phylogenies from molecular sequences: inference and reliability, Annual Review of Genetics, 22 (1988).
[11] R. L. Graham and L. R. Foulds, Unlikelihood that minimal phylogenies for a realistic biological study can be constructed in reasonable computational time, Mathematical Biosciences, 60 (1982).
[12] E. F. Harding, The probabilities of rooted tree shapes generated by random bifurcation, Advances in Applied Probability, 3 (1971).
[13] D. M. Hillis, Inferring complex phylogenies, Nature, 383 (1996).
[14] D. H. Huson, S. Nettles, and T. J. Warnow, Obtaining highly accurate topology estimates of evolutionary trees from very short sequences, in RECOMB '99, ACM Press, 1999.
[15] T. H. Jukes and C. R. Cantor, Evolution of protein molecules, in Mammalian Protein Metabolism, H. N. Munro, ed., vol. III, Academic Press, New York, 1969, ch. 24.
[16] D. R. Maddison, M. Ruvolo, and D. L. Swofford, Geographic origins of human mitochondrial DNA: phylogenetic evidence from control region sequences, Systematic Biology, 41 (1992).
[17] B. L. Maidak, J. R. Cole, T. G. Lilburn, C. T. Parker, Jr., P. R. Saxman, J. M. Stredwick, G. M. Garrity, B. Li, G. J. Olsen, S. Pramanik, T. M. Schmidt, and J. M. Tiedje, The RDP (Ribosomal Database Project) continues, Nucleic Acids Research, 28 (2000).
[18] J. Neyman, Molecular studies of evolution: a source of novel statistical problems, in Statistical Decision Theory and Related Topics, S. S. Gupta and J. Yackel, eds., Academic Press, New York, 1971.
[19] C.-Y. Ou, C. A. Ciesielski, G. Myers, C. I. Bandea, C.-C. Luo, B. T. M. Korber, J. I. Mullins, G. Schochetman, R. L. Berkelman, A. N. Economou, J. J. Witte, L. J. Furman, G. A. Satten, K. A. MacInnes, J. W. Curran, and H. W. Jaffe, Molecular epidemiology of HIV transmission in a dental practice, Science, 256 (1992).
[20] W. H. Press, S. A. Teukolsky, W. V. Vetterling, and B. P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, Cambridge, UK, 2nd ed., 1992.
[21] D. F. Robinson and L. R. Foulds, Comparison of phylogenetic trees, Mathematical Biosciences, 53 (1981).
[22] A. Rzhetsky and M. Nei, Theoretical foundation of the minimum evolution method for phylogenetic inference, Molecular Biology and Evolution, 10 (1993).
[23] N. Saitou and T. Imanishi, Relative efficiencies of the Fitch-Margoliash, maximum parsimony, maximum likelihood, minimum evolution, and neighbor-joining methods of phylogenetic tree reconstruction in obtaining the correct tree, Molecular Biology and Evolution, 6 (1989).
[24] N. Saitou and M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees, Molecular Biology and Evolution, 4 (1987).
[25] J. A. Studier and K. J. Keppler, A note on the neighbor-joining method of Saitou and Nei, Molecular Biology and Evolution, 5 (1988).
[26] D. L. Swofford, G. J. Olsen, P. J. Waddell, and D. M. Hillis, Phylogenetic inference, in Molecular Systematics, D. M. Hillis, C. Moritz, and B. K. Mable, eds., Sinauer Associates, Sunderland, Mass., 2nd ed., 1996, ch. 11.


More information

Genetic Programming. Charles Chilaka. Department of Computational Science Memorial University of Newfoundland

Genetic Programming. Charles Chilaka. Department of Computational Science Memorial University of Newfoundland Genetic Programming Charles Chilaka Department of Computational Science Memorial University of Newfoundland Class Project for Bio 4241 March 27, 2014 Charles Chilaka (MUN) Genetic algorithms and programming

More information

Min-Cost Multicast Networks in Euclidean Space

Min-Cost Multicast Networks in Euclidean Space Min-Cost Multicast Networks in Euclidean Space Xunrui Yin, Yan Wang, Xin Wang, Xiangyang Xue School of Computer Science Fudan University {09110240030,11110240029,xinw,xyxue}@fudan.edu.cn Zongpeng Li Dept.

More information

REDUCING GRAPH COLORING TO CLIQUE SEARCH

REDUCING GRAPH COLORING TO CLIQUE SEARCH Asia Pacific Journal of Mathematics, Vol. 3, No. 1 (2016), 64-85 ISSN 2357-2205 REDUCING GRAPH COLORING TO CLIQUE SEARCH SÁNDOR SZABÓ AND BOGDÁN ZAVÁLNIJ Institute of Mathematics and Informatics, University

More information

Evaluating the Effect of Perturbations in Reconstructing Network Topologies

Evaluating the Effect of Perturbations in Reconstructing Network Topologies DSC 2 Working Papers (Draft Versions) http://www.ci.tuwien.ac.at/conferences/dsc-2/ Evaluating the Effect of Perturbations in Reconstructing Network Topologies Florian Markowetz and Rainer Spang Max-Planck-Institute

More information