EFFICIENT LARGE-SCALE PHYLOGENY RECONSTRUCTION

MIKLÓS CSŰRÖS AND MING-YANG KAO

Abstract. In this study we introduce two novel distance-based algorithms with provably high computational and statistical efficiency. Furthermore, we report the results of experiments simulating sequence evolution on large trees with 135, 500, and 1895 leaves, showing high success rates of our algorithms for large mutation probabilities, and high success rates of the popular Neighbor-Joining algorithm for small mutation probabilities. The efficiency of our HGT/FP and HGT/ME algorithms opens up the possibility of obtaining better speed and accuracy in large-scale phylogeny reconstruction than previously achieved.

1. Introduction. Large-scale phylogeny reconstruction is becoming a reality with the rapidly increasing availability of molecular sequences. Ambitious ventures such as the Green Plant Phylogeny Project [2] and the Ribosomal Database Project [17] aim at recovering evolutionary trees with hundreds and thousands of species.

When deriving a large tree, concerns about efficiency revolve evenly around the computational speed and the statistical accuracy of the reconstruction. For example, maximum likelihood methods have considerable statistical appeal but are notoriously slow, and for this reason are rarely used on trees with more than a few tens of leaves. Distance-based methods have been favored for their computational speed, but even algorithms that build a tree with $n$ leaves in $O(n^4)$ time may not be fast enough if $n$ is on the order of thousands. On the other hand, fast algorithms that do not extract topology information efficiently enough from the input sequences may require amounts of data beyond our reach for successful reconstruction of large trees.
Recent theoretical results on the statistical efficiency of distance-based algorithms [1, 8] make the latter ideal candidates for large-scale phylogeny recovery, but experiments involving large trees are still rare. To our knowledge, the largest tree previously used in an experimental study of phylogeny reconstruction has 228 leaves [13]. Our goal with this paper is twofold. First, we introduce two novel distance-based algorithms with provably high computational and statistical efficiency. Second, we report the results of experiments simulating sequence evolution on trees with 135, 500, and 1895 leaves, showing high success rates of the popular Neighbor-Joining algorithm for small mutation probabilities, and high success rates of our algorithms for large mutation probabilities.

1.1. The Neyman-Jukes-Cantor model of sequence evolution. Let $A = \{1, 2, \dots, r\}$ be a finite alphabet and let $A^+$ denote the set of all non-empty sequences over $A$. An evolutionary tree $T$ is defined by an underlying tree and a mutation model. The underlying tree is a rooted binary tree in which every internal node has exactly two children. The mutation model randomly associates a sequence of $A^+$ with each tree node. This paper employs the Neyman-Jukes-Cantor mutation model [15, 18], formulated as follows. An edge mutation probability $p_e$ is assigned to every edge $e$. Given an arbitrary root sequence $s_1 s_2 \cdots s_l \in A^l$ associated with the root, sequences are associated with the other nodes via $l$ random node labelings. The labelings are mutually independent. The $k$-th character of every sequence is generated in the $k$-th labeling of the nodes with characters of $A$, starting at the root and proceeding towards the leaves along the edges in the following manner. The root is labeled with $s_k$. On edge $e$, the child's label is picked randomly according to
\[
\Pr\{\text{child's label is } i \mid \text{parent's label is } j\} =
\begin{cases}
1 - p_e & \text{if } i = j; \\
\dfrac{p_e}{r-1} & \text{otherwise.}
\end{cases}
\]
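The labeling process above can be sketched in a few lines of Python. This is only an illustrative sketch, not the simulation code used for the experiments: the function names `mutate` and `simulate`, the dictionary-based tree encoding, and the fixed random seed are all assumptions made for the example. Labels are the integers 1 to r, and each edge is identified by its child node.

```python
import random

def mutate(label, p_e, r, rng):
    """Draw a child's label from the parent's label on an edge with mutation probability p_e."""
    if rng.random() >= p_e:
        return label                     # label is kept with probability 1 - p_e
    other = rng.randrange(1, r)          # uniform over 1 .. r-1
    # map 1 .. r-1 onto the r-1 labels different from `label`
    return other if other < label else other + 1

def simulate(children, p, root, root_seq, r, seed=0):
    """Label every node of a rooted tree, position by position.

    children: dict mapping an internal node to its (hypothetical) pair of children
    p:        dict mapping a child node to the mutation probability of its parent edge
    """
    rng = random.Random(seed)
    seqs = {root: list(root_seq)}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):   # leaves have no entry
            seqs[child] = [mutate(c, p[child], r, rng) for c in seqs[node]]
            stack.append(child)
    return seqs
```

Because the positions are mutated independently, generating a whole sequence per edge is equivalent to running l separate node labelings.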
Department of Computer Science, Yale University, New Haven, CT 06520; {csuros-miklos, kao-mingyang}@cs.yale.edu.

For the sake of simplicity, we assume that the characters of the root sequence are independent and identically distributed, drawn according to the root label distribution $\pi = (\pi_1, \dots, \pi_r)$. Let the node set of the evolutionary tree be $V$. Define the random labels $\{\xi(u) : u \in V\}$ as random variables taking values in $A$ whose joint distribution is defined by the random node labeling process described above, with the root label picked randomly according to $\pi$.

1.2. Evolutionary tree reconstruction algorithms. For an evolutionary tree $T$, the topology $\Psi(T)$ is the unrooted binary tree obtained from the underlying rooted tree by removing the edge directions, and replacing the root and its incident edges with one single edge connecting the root's children. The problem of evolutionary tree reconstruction is that of finding $\Psi(T)$ from sequences associated with its leaf set $L$. A sample of length $l$ generated by $T$ is a set of length-$l$ sequences $\{X(u) : u \in L\}$, where the character vectors for the positions $k = 1, \dots, l$ are independent and identically distributed as defined by the random labeling process. Consequently, for each position, the vector of characters in that position is distributed identically to $\{\xi(u) : u \in L\}$.

The output of an evolutionary tree reconstruction algorithm is an unrooted binary tree $\Psi^*$ with the same leaf set $L$ as $\Psi(T)$. The algorithm succeeds if $\Psi^* = \Psi(T)$. The success rate of the algorithm on samples of length $l$ is the probability of $T$ generating a set of length-$l$ sequences on which the algorithm succeeds. The statistical efficiency of the algorithm is measured by the minimum sample length required to achieve a given success rate $1 - \delta$, where $0 < \delta < 1$ is the error probability sought. Let $n$ denote the number of leaves in $\Psi(T)$. An algorithm is statistically efficient if the minimum sample length is polynomial in $n$ and $\log(1/\delta)$.
Evolutionary tree reconstruction algorithms are usually grouped into three categories [10, 26]: maximum likelihood, character-based, and distance-based algorithms. Maximum likelihood algorithms attempt to find a tree $\Psi^*$ that is the topology of the evolutionary tree with the highest probability of generating the observed sequences. All known maximum likelihood algorithms besides exhaustive search use heuristic optimization with no accuracy guarantees. The most commonly used character-based algorithms are parsimony algorithms, which attempt to find a $\Psi^*$ that minimizes the number of character changes on the edges. This task is NP-hard [11], and the widely used heuristics come without accuracy guarantees. Furthermore, it has been shown [9] that no sample length assures that parsimony methods recover the topology of certain evolutionary trees. In other words, parsimony is not statistically consistent for all trees. Distance-based algorithms utilize a matrix of pairwise distances between the observed sequences and build a tree from the matrix. Arguably, the most widely used distance-based algorithm is Neighbor-Joining of Saitou and Nei [24], which builds a tree with $n$ leaves in $O(n^3)$ running time [25].

1.3. Evolutionary distances. Let $T$ be an evolutionary tree with the Neyman-Jukes-Cantor mutation model. The distance between two tree nodes $u$ and $v$, using their labels $\xi(u), \xi(v)$ in a random labeling, is defined as
\[
D(u, v) = -\ln\Bigl( \Pr\{\xi(u) = \xi(v)\} - \frac{\Pr\{\xi(u) \ne \xi(v)\}}{r-1} \Bigr),
\]
with the convention that $\ln x = -\infty$ whenever $x \le 0$. The length of an edge $e = uv$ is defined as $D(e) = D(u, v)$. Distance-based algorithms attempt to find $\Psi(T)$ based on the additive property of $D$, i.e., that for every three nodes $u$, $v$, and $w$ of $\Psi(T)$, if $v$ lies on the path between $u$ and $w$ in $\Psi(T)$, then $D(u, v) + D(v, w) = D(u, w)$.

Distance-based algorithms work with distance estimates computed from the input sequences. Let $\chi$ denote the indicator function. If $u$ and $u'$ are leaves with associated sample sequences $s_1 s_2 \cdots s_l$ and $s'_1 s'_2 \cdots s'_l$, respectively, then the empirical distance between them is defined as
\[
\hat D(u, u') = -\ln\Bigl( \frac{1}{l} \sum_{k=1}^{l} \Bigl( \chi\{s_k = s'_k\} - \frac{\chi\{s_k \ne s'_k\}}{r-1} \Bigr) \Bigr).
\]
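As a rough illustration of the empirical distance, the following Python sketch computes it for two sequences given as strings or lists. The function names are hypothetical. The helper `mutation_probability` inverts the edge-length definition for a single edge, using the fact that the two endpoints of an edge differ with probability $p_e$, so that $D(e) = -\ln(1 - r p_e/(r-1))$; this identity is derived here from the definitions above rather than quoted from the paper.

```python
import math

def empirical_distance(s, t, r):
    """Empirical Neyman-Jukes-Cantor distance between two equal-length sequences."""
    if len(s) != len(t) or len(s) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(a != b for a, b in zip(s, t))
    q = mismatches / len(s)              # observed fraction of differing positions
    x = (1 - q) - q / (r - 1)            # argument of the logarithm
    return math.inf if x <= 0 else -math.log(x)

def mutation_probability(edge_length, r):
    """Edge mutation probability p_e corresponding to an edge length D(e)."""
    return (r - 1) / r * (1 - math.exp(-edge_length))
```

For DNA (r = 4), `mutation_probability(0.1, 4)` and `mutation_probability(1.0, 4)` round to 0.07 and 0.47, which agrees with the edge-length and mutation-probability ranges quoted for the experiments.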

A1 Initialize $\Psi^*$ as the star formed by a triplet and its center.
A2 For each remaining leaf $q$,
  A2.1 Select a triplet $quv$ with $u, v \in \Psi^*$ and $q \notin \Psi^*$ defining a new inner node $c$ on an edge $e \in \Psi^*$.
  A2.2 Add $c$ on edge $e$ and connect $q$ to it.
Fig. 1. A possible general outline for a distance-based algorithm.

2. The algorithms.

2.1. The Harmonic Greedy Triplets Principle. A triplet $uvw$ consists of three distinct leaves $u$, $v$, and $w$ of $T$. Every triplet defines an internal node at which the three pairwise paths between the leaves intersect, with the four nodes forming a star-shaped topology. This internal node is the center of the triplet. By additivity of $D$, the distance between the center $c$ and the leaf $u$ can be obtained by $D(u, c) = \bigl( D(u, v) + D(u, w) - D(v, w) \bigr)/2$, which is estimated by
\[
\hat D^*(u, c) = \mathrm{Ctr}(\hat D, uvw) = \frac{\hat D(u, v) + \hat D(u, w) - \hat D(v, w)}{2} \qquad (1)
\]
if all three estimated distances are finite. This equation is the basic formula for estimating edge lengths in many distance-based algorithms. The formula can be used in conjunction with distance-based algorithms of the outline shown in Figure 1. Whether a triplet defines a new inner node in Step A2.1 of the general outline can be judged by employing Equation (1) in conjunction with the estimated distances $\hat D$. In particular, if $c$ is the center of a triplet $uvw$ and $c'$ is the center of a triplet $uv'w'$, then the value
\[
D^*(c, c') = \mathrm{Ctr}(\hat D, uvw) - \mathrm{Ctr}(\hat D, uv'w') \qquad (2)
\]
can serve as an estimate of $D(c, c')$. Two triplets to which Equation (2) applies, i.e., which share at least one leaf, are called matching triplets.

The Harmonic Greedy Triplets (HGT) principle provides a guideline for the triplet selection mechanism in Steps A1 and A2.1, regardless of how separate triplet centers are recognized. The selection is based on the analysis of the error in Equation (1). Let $uvw$ be a triplet and define
\[
H(u, v, w) = \mathcal{H}\bigl( e^{-D(u,v)}, e^{-D(u,w)}, e^{-D(v,w)} \bigr) = \frac{3}{e^{D(u,v)} + e^{D(u,w)} + e^{D(v,w)}},
\]
where $\mathcal{H}$ denotes harmonic mean.
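The quantities of Equations (1) and (2) and the harmonic-mean similarity translate directly into code. The sketch below is a minimal illustration under stated assumptions: the distance matrix is a nested dictionary `dist[a][b]`, the function names are invented for the example, and Equation (2) is taken as a signed difference. The similarity is set to zero when a distance is infinite, which is the limit of the harmonic mean as one of its terms tends to zero.

```python
import math

def ctr(dist, u, v, w):
    """Estimated distance from leaf u to the center of the triplet uvw, as in Equation (1)."""
    duv, duw, dvw = dist[u][v], dist[u][w], dist[v][w]
    if math.inf in (duv, duw, dvw):
        raise ValueError("all three estimated distances must be finite")
    return (duv + duw - dvw) / 2

def center_gap(dist, u, v, w, v2, w2):
    """Signed difference of Equation (2) for the matching triplets uvw and u v2 w2."""
    return ctr(dist, u, v, w) - ctr(dist, u, v2, w2)

def harmonic_similarity(dist, u, v, w):
    """Harmonic mean of exp(-d) over the three pairwise distances of the triplet."""
    ds = (dist[u][v], dist[u][w], dist[v][w])
    if math.inf in ds:
        return 0.0
    return 3 / sum(math.exp(d) for d in ds)
```

On an exactly additive star with pendant edge lengths 1, 2, and 3, `ctr` recovers the length of the edge incident to the first leaf.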
We showed [7] that the error in Equation (1) depends on $H(u, v, w)$ in the sense that for every $0 < \epsilon < 1$,
\[
\Pr\Bigl\{ \bigl| \mathrm{Ctr}(\hat D, uvw) - \mathrm{Ctr}(D, uvw) \bigr| \ge \frac{-\ln(1 - \epsilon)}{2} \Bigr\} \le 3 \exp\bigl( -\beta l \epsilon^2 H^2(u, v, w) \bigr),
\]
where $\beta > 1/9$ is a constant depending on the alphabet size. The HGT principle is that the selection in Steps A1 and A2.1 should be a greedy selection of the triplet $uvw$ with the largest average similarity defined by
\[
\hat H(u, v, w) = \frac{3}{e^{\hat D(u,v)} + e^{\hat D(u,w)} + e^{\hat D(v,w)}}.
\]
If any of the empirical distances equals $\infty$, then $\hat H(u, v, w) = 0$.

2.2. HGT and the Four-Point Condition. An evolutionary tree with four leaves has three possible topologies (see Figure 2), denoted by $uv|wz$, $uw|vz$, and $uz|vw$. Based on distance additivity, Buneman's four-point condition [4] states that the topology with four leaves is $uv|wz$ if and only if
\[
D(u, v) + D(w, z) < D(u, w) + D(v, z) = D(u, z) + D(v, w).
\]
If $D(u, v)$ is positive and finite for every leaf pair $(u, v)$, then $\Psi(T)$ is uniquely determined and can be obtained by Buneman's algorithm, deriving the topology from the leaf quartet topologies obtained through repeated uses of the four-point condition. Since the equality of the two larger sums in the condition is not likely when using estimated distances, equality on the right-hand side should not be checked in that case. The relaxed four-point condition for topology $uv|wz$ is defined as
\[
\hat D(u, v) + \hat D(w, z) < \hat D(u, w) + \hat D(v, z); \qquad \hat D(u, v) + \hat D(w, z) < \hat D(u, z) + \hat D(v, w). \qquad (3)
\]

Fig. 2. Three possible topologies on four leaves. The quartet topologies are denoted by $uv|wz$, $uw|vz$, and $uz|vw$, respectively.

Figure 3 describes our algorithm HGT/FP, which is based on the HGT principle and employs the relaxed four-point condition to recognize separate triplet centers. The HGT/FP algorithm follows the outline of Figure 1. The algorithm uses Equation (2) to estimate edge lengths. In order to ensure that only matching triplets are compared, the algorithm maintains a set $\mathrm{def}(z)$ for each node $z$, so that $\mathrm{def}(z) = \{z\}$ if $z$ is a leaf, and $\mathrm{def}(z) = \{q, u, v\}$ if $z$ is added to $\Psi^*$ using the triplet $quv$. Notice that $\mathrm{def}(\cdot)$ defines a mapping from nodes of $\Psi^*$ to nodes of $T$: if $z \in \Psi^*$ is a leaf, then it corresponds to the same leaf in $T$; otherwise it corresponds to the center of $\mathrm{def}(z)$ in $T$.

The algorithm utilizes a greedy mechanism in the triplet selection, favoring triplets with large $\hat H$. In the initialization step, the greedy selection is from the $O(n^2)$ triplets that contain the same arbitrarily fixed leaf. In iteration steps, the algorithm only considers edge-triplet pairs $\langle z_1 z_2, quv \rangle$ where $u \in \mathrm{def}(z_1)$, $v \in \mathrm{def}(z_2)$, and $z_1 z_2$ is an edge on the path between $u$ and $v$ in $\Psi^*$. Such pairs are called relevant pairs. By definition, there are $O(n)$ relevant pairs for every edge.
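The relaxed four-point condition of Equation (3) amounts to two comparisons of distance sums. The following Python sketch shows the test and a small helper that tries all three pairings of a quartet; it assumes a symmetric nested dictionary `dist[a][b]` of estimated distances, and both function names are hypothetical.

```python
def relaxed_four_point(dist, u, v, w, z):
    """True if the estimated distances support the quartet topology uv|wz (Equation (3))."""
    inner = dist[u][v] + dist[w][z]
    return inner < dist[u][w] + dist[v][z] and inner < dist[u][z] + dist[v][w]

def quartet_topology(dist, u, v, w, z):
    """Return the supported pairing ((a, b), (c, d)), or None if no pairing passes the test."""
    for a, b, c, d in ((u, v, w, z), (u, w, v, z), (u, z, v, w)):
        if relaxed_four_point(dist, a, b, c, d):
            return ((a, b), (c, d))
    return None
```

At most one of the three pairings can satisfy both strict inequalities, so the order in which they are tried does not affect the result.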
The topology is recovered successfully if HGT/FP selects a relevant pair $\langle z_1 z_2, quv \rangle$ in each iteration step for which the center of the triplet $quv$ in $T$ falls properly onto the path between $z_1$ and $z_2$. The HGT/FP algorithm tests whether the center of a triplet $quv$ falls onto an edge $z_1 z_2$ by employing the relaxed four-point condition of Equation (3) in the following manner. Let $z_1$ be an internal node in $\Psi^*$, let $\mathrm{def}(z_1) = \{w, w', w''\}$, and assume without loss of generality that $z_2$ lies on the path between $z_1$ and $w$ in $\Psi^*$. HGT/FP tests whether the relaxed four-point condition holds for $wq|w'w''$. If the condition holds, then for the center $c$ of $quv$, either $c$ lies on the path between $z_1$ and $z_2$, or $z_2$ lies on the path between $z_1$ and $c$. The relaxed four-point condition is checked similarly for $z_2$ if it is an internal node. If $z_i$ is a leaf, the condition for $z_i$ is not tested. If the one or two tested conditions hold for the pair, then it is a good relevant pair.

The HGT/FP algorithm maintains a set $M$ of good relevant pairs throughout the iterations. In each step, the set $M$ has one entry for each leaf $q \notin \Psi^*$, so that $M[q]$ is either null, or it is a good relevant pair $\langle z_1 z_2, quv \rangle$ such that $\hat H(q, u, v)$ is maximal. The set is updated every time a new triplet is added by testing the $O(n)$ relevant pairs with respect to the newly created edges in $\Psi^*$. This mechanism is implemented in the Update-M subroutine detailed in Figure 4.

Theorem 1. The running time of the HGT/FP algorithm on a tree with $n$ leaves is $O(n^2)$. The algorithm uses $O(n)$ work space.

Proof. (Sketch.) The initialization step takes $O(n^2)$ time. Each iteration step takes $O(n)$ time, and there are $(n - 3)$ steps. The work space needed in addition to $O(1)$ local variables comprises the storage of $\Psi^*$ in $O(n)$ space, and the space occupied by $M$, also $O(n)$.

Algorithm Harmonic Greedy Triplets with Four Point Condition
Input: $n \times n$ distance matrix containing the values $\hat D(u, v)$ for all leaf pairs $(u, v)$.
Output: $\Psi^*$.
F1  Select an arbitrary leaf $u$ and find a triplet $uvw$ with the maximum $\hat H(u, v, w)$.
F2  if $\hat H(u, v, w) = 0$ then let $\Psi^*$ be the empty tree, fail, and stop.
F3  Let $\Psi^*$ be the star with three edges formed by $uvw$ and its center $c$;
F4  Set $\mathrm{def}(c) \leftarrow \{u, v, w\}$.
F5  First set all $M[x]$ to null; then for edges $zc$ with $z \in \{u, v, w\}$, Update-M($zc$).
F6  repeat
F7    if $M[z] = $ null for all $z \in L$ then fail and stop.
F8    Find $M[q] = \langle z_1 z_2, quv \rangle$ with the maximum $\hat H(q, u, v)$.
F9    Split $z_1 z_2$ into two edges $z_1 c$ and $z_2 c$ in $\Psi^*$.
F10   Add to $\Psi^*$ the leaf $q$ and an edge $qc$;
F11   Set $\mathrm{def}(c) \leftarrow \{q, u, v\}$.
F12   For every $z$ with $M[z]$ containing the edge $z_1 z_2$, set $M[z] \leftarrow$ null.
F13   For each $zc \in \{z_1 c, z_2 c, qc\}$, Update-M($zc$).
F14 until all leaves are inserted to $\Psi^*$; i.e., this loop has iterated $n - 3$ times.
F15 Output $\Psi^*$.
Fig. 3. The HGT/FP algorithm. The Update-M subroutine is detailed in Figure 4. Calculations pertaining to edge length estimation are not shown here for brevity.

The statistical efficiency of the HGT/FP algorithm is stated by the following theorem (proof omitted).

Theorem 2. Let $T$ be an arbitrary evolutionary tree with $n$ leaves, such that there exist $0 < f \le g < (r-1)/r$ bounding the edge mutation probabilities, i.e., for each edge $e$, $f \le p_e \le g$. For every error probability $0 < \delta < 1$, there exists
\[
l = O\Biggl( \frac{\log\frac{1}{\delta} + \log n}{\bigl( 1 - \frac{r}{r-1}\, g \bigr)^{\gamma} f^2} \Biggr), \qquad (4)
\]
with $\gamma = O(\log n)$, such that the success rate of HGT/FP is at least $(1 - \delta)$ on samples of length $l$. Moreover, for almost every tree topology under the uniform or Yule-Harding distributions, $\gamma = O(\log\log n)$.

Remark. Erdős et al. [8] closely studied certain evolutionary tree topology properties under different topology distributions, including the Yule-Harding distribution [12], and presented the first distance-based algorithms with provable statistical efficiency.

2.3. HGT and the minimum evolution heuristic. The input to distance-based algorithms is the $n \times n$ matrix $\hat D = [\hat D(u, v) : u, v \in L]$. In view of distance additivity, the matrix of true distances $D = [D(u, v) : u, v \in L]$ describes the unrooted tree $\Psi(T)$ with edges weighted by their length, and each entry of $D$ is the sum of edge weights on the path between the leaves corresponding to the entry. Distance-based algorithms output an unrooted tree $\Psi^*$ with estimated edge weights $d^*$. The minimum evolution heuristic [22] attempts to minimize the total tree length $\|\Psi^*\| = \sum_{uv \in \Psi^*} d^*(uv)$ given certain constraints on $\Psi^*$ built from the empirical distances $\hat D$.

Algorithm Update-M
Input: an edge $z_1 z_2 \in \Psi^*$.
U1 for each triplet $quv$ with $q \notin \Psi^*$, $u \in \mathrm{def}(z_1)$, $v \in \mathrm{def}(z_2)$,
U2   if $\langle z_1 z_2, quv \rangle$ is a good relevant pair
U3   then assign $\langle z_1 z_2, quv \rangle$ to $M[q]$ if $\hat H(q, u, v)$ is greater than that of $M[q]$.
Fig. 4. The Update-M subroutine. The subroutine tests every relevant pair for the edge $z_1 z_2$ via the relaxed four-point condition in order to determine which triplets can be used to add a node on $z_1 z_2$.

The Fast Harmonic Greedy Triplets (Fast-HGT) algorithm [6] is a distance-based algorithm, also based on the HGT principle, which uses a minimum distance parameter $D_{\min}$ to recognize separate triplet centers. The Fast-HGT algorithm follows the general outline of Figure 1, and uses only relevant pairs in Step A2.1. For a relevant pair $\langle z_1 z_2, quv \rangle$, Equation (2) is employed to estimate the distance between the centers of the triplets $\mathrm{def}(z_1)$ and $quv$, or $\mathrm{def}(z_2)$ and $quv$. If one of the computed values $D^*(z_i, c)$ falls between $-D_{\min}$ and $D_{\min}$, then $quv$ is judged to define the node $z_i$ and is thus not used for adding new nodes. Changing the parameter $D_{\min}$ results in different trees $\Psi^*(D_{\min})$ output by Fast-HGT.

The edge length estimates delivered by Fast-HGT can be used in applying the minimum evolution heuristic for selecting the minimum distance parameter $D_{\min}$. The resulting algorithm, HGT/ME, aims at minimizing $\|\Psi^*(D_{\min})\|$ as a function of $D_{\min}$. We have found that $\|\Psi^*(D_{\min})\|$ is close to a unimodal function of $D_{\min}$, so Golden Section Search [20] gives satisfactory results for the minimization. Since the number of $D_{\min}$ values giving different results is determined by the granularity of $\hat D$, only $O(\log l)$ iterations are needed, and all of them use the same distance matrix. The HGT/ME algorithm is described in Figure 5.

Theorem 3. The running time of the HGT/ME algorithm building a tree with $n$ leaves is $O(n^2 \log l)$. The algorithm uses $O(n)$ work space.

Proof.
By setting the minimum bracketing value $\epsilon = l^{-1/2}$, Fast-HGT is executed at most $\log_2 l$ times in Step M1, and at most $2 + \log_\beta(2l)$ times in Step M2 with $\beta = (1 + \sqrt{5})/2$. Since the running time of Fast-HGT is $O(n^2)$, the running time of HGT/ME is as stated by the theorem. Fast-HGT needs $O(n)$ space, and HGT/ME needs to store at most four topologies at a time to carry out the minimization.

3. Experimental results.

3.1. Robinson-Foulds distance. Simulated experiments are often used to assess the statistical efficiency of evolutionary tree reconstruction algorithms. Simulation consists of generating sample sequences by $T$ given the mutation model. The output $\Psi^*$ of the algorithm is compared to $\Psi(T)$ using distance measures between unrooted binary trees. We use the Robinson-Foulds distance [21] for this purpose, defined as follows. Let $\Psi$ be an unrooted binary tree with leaf set $L$. A split generated by an edge $e$ is the unordered pair $(L_1, L_2)$, where $L_1$ and $L_2$ are the leaf sets of the two subtrees obtained by removing $e$ from $\Psi$. The split set $S(\Psi)$ is the set of all splits generated by edges of $\Psi$. Let $\Psi_1, \Psi_2$ be two unrooted trees with the same leaf set $L$ and let $n = |L|$. The normalized Robinson-Foulds distance between $\Psi_1$ and $\Psi_2$ is defined as
\[
\mathrm{RF\%}(\Psi_1, \Psi_2) = \frac{|S(\Psi_1)| + |S(\Psi_2)| - 2\,|S(\Psi_1) \cap S(\Psi_2)|}{2(n - 3)},
\]
which is always between 0 and 100%. We say that an evolutionary tree reconstruction algorithm has $\delta$ Robinson-Foulds error on a given sample generated by $T$ if, for the algorithm's output $\Psi^*$, $\mathrm{RF\%}(\Psi^*, \Psi(T)) = \delta$.

Algorithm Fast Harmonic Greedy Triplets with Minimum Evolution Heuristic
Input: $n \times n$ empirical distance matrix $\hat D$ and a small positive bracketing value $\epsilon$, say, $\epsilon = l^{-1/2}$.
Output: $\Psi^*$.
M1 Find $a = 2^k \epsilon$ with the smallest $k = 1, 2, \dots, \log_2(1/\epsilon)$ such that Fast-HGT builds a full tree with $D_{\min} = 2^k \epsilon$.
M2 Minimize $\|\Psi^*(D_{\min})\|$ on the interval $D_{\min} \in [0, a]$ with Golden Section Search using $\epsilon$ as the minimum bracketing size, and output $\Psi^*$ built by Fast-HGT at that value.
Fig. 5. The HGT/ME algorithm.

3.2. Experimental procedure. We simulated DNA sequence evolution along three large trees in the Neyman-Jukes-Cantor model. The topologies returned by HGT/FP and HGT/ME were compared to the ones returned by Neighbor-Joining, which is the most popular distance-based method and has reportedly (e.g., [23, 14]) achieved high experimental success rates on difficult trees with respect to other evolutionary tree reconstruction algorithms. In the experiments we used qclust [3], which implements the $O(n^3)$ Neighbor-Joining algorithm [25]. The topologies were compared using the Robinson-Foulds distance.

3.3. 135-leaf tree. The tree is based on a phylogeny derived from mitochondrial DNA sequences in the course of debating the African origin of humans [16]. We scaled the edge lengths linearly from the originally calculated number of character changes per edge, so that all edge lengths fall into the interval [0.125, 1.0], corresponding to mutation probabilities between 0.09 and 0.47. This same scaled tree was also used by Huson et al. [14] in similar experiments. The performance of HGT/ME and HGT/FP on the 135-leaf tree is compared to that of Neighbor-Joining in Figure 6. HGT/ME and HGT/FP perform slightly better starting at sequence lengths of 500, and converge faster to recover the topology than Neighbor-Joining.

3.4. 500-leaf tree. In a set of experiments we used a 500-leaf tree with the topology of a seed plant phylogeny based on rbcL gene sequences from [5].
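For the comparisons used throughout these experiments, the normalized Robinson-Foulds distance of Section 3.1 can be computed directly from split sets. The sketch below is a plain illustration and not the code used for the reported results: trees are assumed to be given as lists of undirected edges between hashable node names, the function names are invented, and each split is found by a separate traversal, which is simple but quadratic in the tree size.

```python
def splits(edges, leaves):
    """Split set of an unrooted tree given as a list of undirected edges (a, b)."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    leaves = frozenset(leaves)
    result = set()
    for a, b in edges:
        # collect everything reachable from a without crossing the edge (a, b)
        seen, stack = {a}, [a]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen and not (x == a and y == b):
                    seen.add(y)
                    stack.append(y)
        side = frozenset(seen & leaves)
        result.add(frozenset((side, leaves - side)))   # unordered pair (L1, L2)
    return result

def rf_percent(edges1, edges2, leaves):
    """Normalized Robinson-Foulds distance in percent; requires more than three leaves."""
    s1, s2 = splits(edges1, leaves), splits(edges2, leaves)
    n = len(leaves)
    return 100 * (len(s1) + len(s2) - 2 * len(s1 & s2)) / (2 * (n - 3))
```

The splits generated by leaf edges are shared by any two trees on the same leaf set, so only the $n - 3$ internal edges of each binary tree contribute to the numerator.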
We scaled the edge lengths from the original number of character changes per edge so that all edge lengths fall into the range [0.1, 1.0], corresponding to mutation probabilities between 0.07 and 0.47. Our algorithms outperform Neighbor-Joining from around sample length $l = 200$, and miss only 3% of the edges at $l = $ …. It is worth pointing out that deriving the original tree took several months of computer time employing parsimony methods, while HGT/FP, HGT/ME, and Neighbor-Joining produce their output in a few seconds on a desktop computer.

3.5. 1895-leaf tree. In another set of experiments, we used an 1895-leaf tree derived from the evolutionary tree of Eukaryotes based on 2S sequences in the Ribosomal Database Project [17]. After removing a subtree containing distantly related taxa from the original tree of 2055 leaves, we scaled the edge lengths linearly so that they fell into the interval [0.1, 1.0]. HGT/FP and HGT/ME converge quickly so that they miss only one edge on the majority of 2000 length samples, and steadily recover the topology from samples of length …. Neighbor-Joining's performance improves only three-fold between sample lengths of 200 and 10000 and it still misplaces edges at 5000 length sequences.

We also conducted experiments on a differently scaled version of the 1895-leaf tree, in which the edge lengths were linearly mapped onto the [0.01, 1.0] interval. Most of the edges in this tree are short, with around 60% of them having the shortest edge length and hence the smallest mutation probability. The results of the experiments are shown in Figure 8. In accordance with previous findings in simulations with different smaller trees (e.g., [23]), Neighbor-Joining performs well, achieving high success rates at relatively short sample sequences. HGT/FP is more sensitive to short edge lengths due to the greedy selection of triplets lying at its core. At large sample sizes, however, it does converge more quickly than Neighbor-Joining on the highly divergent tree and misplaces very few edges from $l = 5000$ on.

Fig. 6. Experimental results on the 135-leaf tree (high mutation probabilities) with edge mutation probabilities between 0.09 and 0.47. The plot shows the Robinson-Foulds error (RF%) of Neighbor-Joining, HGT/ME, and HGT/FP against sample length, observed on ten separate samples for each length. The graphs go through the median values.

4. Concluding remarks. We have presented two novel efficient algorithms for building large divergent evolutionary trees. The HGT/FP algorithm builds the topology in $O(n^2)$ time from an $n \times n$ distance matrix, while HGT/ME runs in $O(n^2 \log l)$ time when the input matrix is derived from a sample of length $l$. Our algorithms achieve high success rates on large trees with 135, 500, and 1895 leaves with large mutation probabilities.

When working with trees with over one thousand leaves, the running time of the algorithms becomes crucial. Existing $O(n^4)$-time evolutionary tree building algorithms may take days to finish on today's desktop computers, and slower algorithms are virtually unusable without considerable insight into the biological features of the data set at hand. Space requirements also need to be considered, since the number of pairwise distances between 1895 leaves is already around 1.8 million. For example, algorithms working with quartets, such as Buneman's [4], may have to deal with about half a trillion quartets in such a large tree.
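The two counts quoted above are binomial coefficients and are easy to check. The short Python sketch below does so for a tree with 1895 leaves; the function name is made up for the example, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def pair_and_quartet_counts(n):
    """Number of unordered leaf pairs and of leaf quartets among n leaves."""
    return comb(n, 2), comb(n, 4)

pairs, quartets = pair_and_quartet_counts(1895)
print(f"{pairs:,} pairs, {quartets:.3e} quartets")
```

For $n = 1895$ the number of pairs is 1,794,565, i.e., about 1.8 million, and the number of quartets is a little over $5 \times 10^{11}$, i.e., about half a trillion.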
In addition to computational issues, the statistical characteristics of algorithms also matter more as one builds larger trees. Neighbor-Joining and most other algorithms have not been proven to require asymptotically polynomial sample sizes in order to correctly recover the topology, while HGT/FP and the Fast-HGT algorithm used by HGT/ME are statistically efficient. Atteson [1] proved for Neighbor-Joining the existence of bounds similar to those of Theorem 2, but with $\gamma = O(n)$ in the exponent. The exponential sample length bound is essentially due to the fact that Neighbor-Joining calculates edge lengths from an average that involves distances between arbitrarily remote nodes in $T$, for which the estimation error may be large. However, when the mutation probabilities are small, Neighbor-Joining's approach may be justified, as illustrated by the example of the 1895-leaf tree with small mutation probabilities. Neighbor-Joining averages many estimated distances. When the edge mutation probabilities are small, the distances are small and do not differ by much, so the average provides more accurate information about the topology than a greedy approach does. On the other hand, the error committed while calculating the average is governed by the error in the estimation of the largest distance in the expression, which may be significant when the mutation probabilities are large. Consequently, the statistical performance of Neighbor-Joining is less stable, and a greedy algorithm may provide better efficiency in the case of large mutation probabilities.

Many possible applications of evolutionary tree building algorithms may need to build large trees. Examples include large projects in evolutionary biology such as the ones cited [16, 17] and in epidemiology [19]. It will be of practical importance to determine which of the existing algorithms are the most suitable for the ranges of mutation probabilities and tree topologies defined by the application at hand.

Fig. 7. Experimental results on the 500-leaf tree (high mutation probabilities) with edge mutation probabilities ranging between 0.07 and 0.47. The plot shows RF% against sample length for Neighbor-Joining, HGT/ME, and HGT/FP.

REFERENCES

[1] K. Atteson, The performance of neighbor-joining algorithms of phylogeny reconstruction, in COCOON '97, vol. 1276 of Lecture Notes in Computer Science, Springer-Verlag, 1997.
[2] K. S. Brown, Deep Green rewrites evolutionary history of plants, Science, 285 (1999).
[3] J. Brzustowski, qclust V0.2.
[4] P. Buneman, The recovery of trees from dissimilarity matrices, in Mathematics in the Archaeological and Historical Sciences, F. R. Hodson, D. G. Kendall, and P. Tautu, eds., Edinburgh University Press, 1971.
[5] M. W. Chase, D. E. Soltis, R. G. Olmstead, D. Morgan, D. H. Les, B. D. Mishler, M. R. Duvall, R. A. Price, H. G. Hills, Y.-L. Qiu, K. A. Kron, J. H. Rettig, E. Conti, J. D. Palmer, J. R. Manhart, K. J. Sytsma, H. J. Michaels, W. J. Kress, K. G. Karol, W. D. Clark, M. Hedrén, B. S. Gaut, R. K. Jansen, K.-J. Kim, C. F. Wimpee, J. F.
Smith, G. R. Furnier, S. H. Strauss, Q.-Y. Xiang, G. M. Plunkett, P. M. Soltis, S. M. Swensen, S. E. Williams, P. A. Gadek, C. J. Quinn, L. E. Eguiarte, E. Golenberg, G. H. Learn, Jr., S. W. Graham, S. C. H. Barrett, S. Dayanandan, and V. A. Albert, Phylogenetics of seed plants: An analysis of nucleotide sequences from the plastid gene rbcL, Annals of the Missouri Botanical Garden, 80 (1993).
[6] M. Csűrös and M.-Y. Kao, Fast recovery of evolutionary trees through Harmonic Greedy Triplets, submitted for journal publication.
[7] M. Csűrös and M.-Y. Kao, Recovering evolutionary trees through Harmonic Greedy Triplets, in SODA '99, ACM/SIAM, 1999.
[8] P. L. Erdős, M. A. Steel, L. A. Székely, and T. J. Warnow, A few logs suffice to build (almost) all trees (I), Random Structures and Algorithms, 14 (1999). Preliminary version as DIMACS TR97-71.
[9] J. Felsenstein, Cases in which parsimony or compatibility methods will be positively misleading, Systematic Zoology, 27 (1978).
[10] J. Felsenstein, Phylogenies from molecular sequences: inference and reliability, Annual Review of Genetics, 22 (1988).
[11] R. L. Graham and L. R. Foulds, Unlikelihood that minimal phylogenies for a realistic biological study can be constructed in reasonable computational time, Mathematical Biosciences, 60 (1982).
[12] E. F. Harding, The probabilities of rooted tree shapes generated by random bifurcation, Advances in Applied Probability, 3 (1971).
[13] D. M. Hillis, Inferring complex phylogenies, Nature, 383 (1996).
[14] D. H. Huson, S. Nettles, and T. J. Warnow, Obtaining highly accurate topology estimates of evolutionary trees from very short sequences, in RECOMB '99, ACM Press, 1999.
[15] T. H. Jukes and C. R. Cantor, Evolution of protein molecules, in Mammalian Protein Metabolism, H. N. Munro, ed., vol. III, Academic Press, New York, 1969, ch. 24.
[16] D. R. Maddison, M. Ruvolo, and D. L. Swofford, Geographic origins of human mitochondrial DNA: phylogenetic evidence from control region sequences, Systematic Biology, 41 (1992), pp. 111-124.
[17] B. L. Maidak, J. R. Cole, T. G. Lilburn, C. T. Parker, Jr., P. R. Saxman, J. M. Stredwick, G. M. Garrity, B. Li, G. J. Olsen, S. Pramanik, T. M. Schmidt, and J. M. Tiedje, The RDP (Ribosomal Database Project) continues, Nucleic Acids Research, 28 (2000).
[18] J. Neyman, Molecular studies of evolution: a source of novel statistical problems, in Statistical Decision Theory and Related Topics, S. S. Gupta and J. Yackel, eds., Academic Press, New York, 1971, pp. 1-27.
[19] C.-Y. Ou, C. A. Ciesielski, G. Myers, C. I. Bandea, C.-C. Luo, B. T. M. Korber, J. I. Mullins, G. Schochetman, R. L. Berkelman, A. N. Economou, J. J. Witte, L. J. Furman, G. A. Satten, K. A. MacInnes, J. W. Curran, and H. W. Jaffe, Molecular epidemiology of HIV transmission in a dental practice, Science, 256 (1992).
[20] W. H. Press, S. A. Teukolsky, W. V. Vetterling, and B. P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, Cambridge, UK, 2nd ed., 1992.
[21] D. F. Robinson and L. R. Foulds, Comparison of phylogenetic trees, Mathematical Biosciences, 53 (1981).
[22] A. Rzhetsky and M. Nei, Theoretical foundation of the minimum evolution method of phylogenetic inference, Molecular Biology and Evolution, 10 (1993).
[23] N. Saitou and T. Imanishi, Relative efficiencies of the Fitch-Margoliash, maximum parsimony, maximum likelihood, minimum evolution, and neighbor-joining methods of phylogenetic tree reconstruction in obtaining the correct tree, Molecular Biology and Evolution, 6 (1989).
[24] N. Saitou and M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees, Molecular Biology and Evolution, 4 (1987).
[25] J. A. Studier and K. J. Keppler, A note on the neighbor-joining method of Saitou and Nei, Molecular Biology and Evolution, 5 (1988).
[26] D. L. Swofford, G. J. Olsen, P. J. Waddell, and D. M. Hillis, Phylogenetic inference, in Molecular Systematics, D. M. Hillis, C. Moritz, and B. K. Mable, eds., Sinauer Associates, Sunderland, Mass., 2nd ed., 1996, ch. 11.

Fig. 8. Experimental results on the 1895-leaf tree, with RF% plotted against sample length for Neighbor-Joining, HGT/ME, and HGT/FP. Left panel (high mutation probabilities): edge mutation probabilities ranging between 0.07 and 0.47. Right panel (low mutation probabilities): edge mutation probabilities ranging from a smaller lower bound up to 0.47.
