Study of a Simple Pruning Strategy with Day's Algorithm


Thomas G. Kristensen

Abstract

We wish to calculate all pairwise Robinson-Foulds distances in a set of trees. Traditional algorithms for doing this seldom take into account the fact that they are to be used on a set of very similar trees. Some randomized algorithms [2] are actually slower the more similar the trees are. In this paper we present an augmentation of Day's algorithm that runs faster the more similar the trees are. Our studies indicate that (1) a careful implementation of Day's algorithm is faster than recent randomized approaches, (2) our pruning strategy improves the running time of Day's algorithm, and (3) augmenting Day's algorithm with more sophisticated pruning strategies that remove larger topologies is most likely futile.

1 Introduction

When researching the phylogenetic relationship between a set of species, different methods will often disagree on the result. The need therefore arises for a measure of similarity between every pair of suggestions in a set of results. If the suggestions are represented as phylogenetic trees, we can use the Robinson-Foulds distance.

Several algorithms exist that, given a set of $t$ phylogenetic trees over the same set of $n$ species, calculate the Robinson-Foulds distance between every pair of them. Day's algorithm is a deterministic algorithm that computes the distance between two trees in time $O(n)$ [1], which is optimal when studying exactly two trees. The algorithm can be used to calculate all pairwise distances in a set of trees in time $O(t^2 n)$, which is suboptimal as the size of the input is $O(tn)$ and the size of the output is $O(t^2)$. Several randomized algorithms exist, but they are only fast when the trees are very dissimilar, which is not the case in most real-life applications [2]. A part of our study is therefore focused on comparing a randomized approach to the classical algorithm by Day.
In this paper we investigate an extension of Day's algorithm that removes small topologies shared among the trees, using a pruning process that does not alter the asymptotic upper bound of the original algorithm. A goal of our study is to investigate whether the extra time used for detecting and removing these

topologies pays off, and whether more complicated extensions of Day's algorithm are advisable.

This paper is organized as follows: we first present the background of this study, including phylogenetic trees and the Robinson-Foulds distance. Next we present Day's algorithm and our pruning strategy. We then describe our experimental setup, including implementation details and choice of data set. We finish with an analysis and discussion of our results and a short conclusion.

2 Background

2.1 Phylogenetic Trees

A phylogenetic tree is a way of representing the evolutionary relationship between a set of species. Each leaf in a phylogenetic tree represents a species, and each inner node represents a speciation event. The tree can be rooted or unrooted. Given $n$ species (or taxa), different methods exist to infer a phylogenetic tree with $n$ leaves. These methods are based on different background data and philosophies, and will therefore often disagree on the topology of the underlying tree. Sometimes one method might even produce several suggestions based on the same data. It is therefore useful to have a measure of agreement between $t$ trees over the same taxa.

Figure 1: A simple example of a phylogenetic tree.

2.2 Robinson-Foulds Distance

If we remove an edge from a tree, we split the set of leaves into two disjoint sets, called a bipartition. In Figure 1, the dashed edge separates A and B from C, D and E, giving rise to the bipartition AB|CDE. The edges that connect the leaves A, B, C, D and E to the rest of the tree are trivial, as the bipartitions they define are present in every phylogenetic tree over the taxa. We will therefore ignore them and focus on the set of nontrivial bipartitions. Let $B(T)$ denote the set of nontrivial bipartitions in a tree $T$. In Figure 2, $B(T_1)$ and $B(T_2)$ each contain two nontrivial bipartitions. Given a

pair of trees, $T_1$ and $T_2$, we can count the number of bipartitions in $B(T_1)$ not found in $B(T_2)$ as $|B(T_1) \setminus B(T_2)|$. In our example, $B(T_1) \setminus B(T_2)$ contains a single bipartition, as the two trees share one split. The Robinson-Foulds distance $d_{RF}(T_1, T_2)$ is

$$d_{RF}(T_1, T_2) = \frac{1}{2}\left(|B(T_1) \setminus B(T_2)| + |B(T_2) \setminus B(T_1)|\right)$$

Figure 2: Two phylogenetic trees, $T_1$ and $T_2$, that share exactly one bipartition.

Given $t$ trees $T_1, \ldots, T_t$, we wish to calculate the Robinson-Foulds distance between every pair of these. As the distance is symmetric, we only need to compare $T_i$ to $T_{i+1}, \ldots, T_t$.

3 Day's Algorithm

3.1 Original Algorithm

Day's algorithm was first presented in [1]. The main idea is to represent the bipartitions by intervals instead of numbers. First, we root the two trees $T_1$ and $T_2$ in a taxon, e.g. A (see Figure 3). Next, we perform a depth-first traversal of $T_1$, remembering the order in which the leaves are visited. This order defines a map from taxa to numbers. We apply this map to the leaves of $T_2$. Each split in $T_1$ now has a well-defined interval associated with it, namely the interval from its leftmost leaf to its rightmost. Similarly, some of the nontrivial splits in $T_2$ have a well-defined interval associated with them, based on the numbers on the leaves in their subtree, even though these might not be sorted (see Figure 3). The intervals can be collected in a depth-first search by comparing the number of leaves in a subtree to its smallest and largest leaf numbers: the subtree defines an interval exactly when the largest number minus the smallest, plus one, equals the number of leaves.

If an interval is shared between the two trees, this corresponds to a shared bipartition. We can examine which intervals are shared by treating them as lists of tuples. The lists can be sorted using radix sort in $O(n)$ and compared for duplicates in $O(n)$. Instead of sorting, Day uses an $O(n)$ table where lookups are performed using a bijective map from the inner nodes of $T_1$ to $\{2, \ldots, n\}$.
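The interval idea can be made concrete with a small sketch. It is not the implementation studied in this paper: the `Node` type, the helper names and the convention that each tree is given already rooted at a common taxon (with that taxon left out, so every non-root inner node is a nontrivial split) are assumptions made for illustration. For brevity, intervals are compared through a `std::set`, which costs $O(n \log n)$ rather than the $O(n)$ of radix sort or Day's table.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Rooted tree node (hypothetical type). Leaves carry a taxon id in
// [0, taxa); inner nodes use -1.
struct Node {
    int taxon = -1;
    std::vector<Node> children;
};

Node leaf(int taxon) { Node v; v.taxon = taxon; return v; }
Node inner(std::vector<Node> kids) { Node v; v.children = std::move(kids); return v; }

// Smallest and largest leaf number below a node, and the number of leaves.
struct Span { int lo; int hi; int count; };

using IntervalSet = std::set<std::pair<int, int>>;

// Number the leaves 1, 2, ... in the order a depth-first traversal visits them.
void numberLeaves(const Node& v, std::vector<int>& number, int& next) {
    if (v.children.empty()) { number[v.taxon] = next++; return; }
    for (const Node& c : v.children) numberLeaves(c, number, next);
}

// Record [lo, hi] for every non-root inner node whose leaves form a
// contiguous range, and count all non-root inner nodes (nontrivial splits).
Span collect(const Node& v, const std::vector<int>& number, bool isRoot,
             IntervalSet& intervals, int& innerNodes) {
    if (v.children.empty()) return {number[v.taxon], number[v.taxon], 1};
    Span s = collect(v.children[0], number, false, intervals, innerNodes);
    for (std::size_t k = 1; k < v.children.size(); ++k) {
        Span c = collect(v.children[k], number, false, intervals, innerNodes);
        s.lo = std::min(s.lo, c.lo);
        s.hi = std::max(s.hi, c.hi);
        s.count += c.count;
    }
    if (!isRoot) {
        ++innerNodes;
        // Contiguous range: largest - smallest + 1 equals the leaf count.
        if (s.hi - s.lo + 1 == s.count) intervals.insert({s.lo, s.hi});
    }
    return s;
}

// Robinson-Foulds distance between two trees over the same taxa.
double rfDistance(const Node& t1, const Node& t2, int taxa) {
    std::vector<int> number(taxa, 0);
    int next = 1;
    numberLeaves(t1, number, next);
    assert(next == taxa + 1);  // t1 must have exactly `taxa` leaves
    IntervalSet i1, i2;
    int m1 = 0, m2 = 0;
    collect(t1, number, true, i1, m1);
    collect(t2, number, true, i2, m2);
    int shared = 0;
    for (const auto& iv : i2) shared += static_cast<int>(i1.count(iv));
    return 0.5 * ((m1 - shared) + (m2 - shared));
}
```

With taxa B, C, D, E encoded as 0 to 3 and both trees rooted at A, `rfDistance(inner({leaf(0), inner({leaf(1), inner({leaf(2), leaf(3)})})}), inner({leaf(1), inner({leaf(0), inner({leaf(2), leaf(3)})})}), 4)` yields 1: the first tree has the intervals [2, 4] and [3, 4], and only [3, 4] also occurs in the second.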

In our study we use a bijective map into $O(n^2)$, which speeds up the computation by removing some of the bookkeeping in Day's algorithm.

Figure 3: Day's algorithm illustrated on $T_1$ and $T_2$, with inner nodes labelled by intervals such as [2, 4] and [3, 4].

We can use the leaf map from $T_1$ on the trees $T_2, \ldots, T_t$, calculating the Robinson-Foulds distance between $T_1$ and all of these; repeating this for every tree yields $O(t^2 n)$ in total. Again, notice that we only have to compare the tree $T_i$ with the trees $T_j$ where $j > i$. This does not reduce the asymptotic running time of the algorithm, but it does improve the running time in practice.

3.2 Pruning Strategy

If we can identify a topology that is shared among all the trees, we can replace it with a leaf without altering the result, but with an improvement in execution time. When we reach the tree $T_i$ in Day's algorithm, the topologies need only be shared among the remaining $t - i$ trees $T_{i+1}, \ldots, T_t$ to be replaced. The problem is, of course, that we must be able to identify and remove the topologies quickly. In our experiments, we accomplish this by only considering shared topologies with exactly two leaves, called cherries; that is, topologies identified as intervals of size two in Day's algorithm, such as the interval [3, 4] in the previous example. Such an interval can easily be replaced by one of its leaves, reducing the size of the trees without altering the distance between them. Identifying the intervals that are shared among a set of $O(t)$ trees can be done by keeping a table that tracks how many times each of the $O(n)$ cherries has been seen in the depth-first traversal of the trees. Updating the table does not alter the asymptotic running time of Day's algorithm.
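A minimal sketch of such a cherry table is given here, under assumptions of our own: the `Node` type and function names are hypothetical, leaves already carry their number under the depth-first order of the current source tree, and the list passed to `sharedCherries` contains the source tree together with the trees it is compared against. A cherry of the source tree is then always an interval [k, k+1], so a table indexed by k suffices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <utility>
#include <vector>

// Rooted tree node (hypothetical type). `number` is the leaf's position in
// the source tree's depth-first order (1-based); inner nodes use 0.
struct Node {
    int number = 0;
    std::vector<Node> children;
};

Node leaf(int number) { Node v; v.number = number; return v; }
Node inner(std::vector<Node> kids) { Node v; v.children = std::move(kids); return v; }

// Visit the tree depth first and bump seen[k] for every cherry whose two
// leaves are numbered k and k+1, i.e. every interval of size two.
void countCherries(const Node& v, std::vector<int>& seen) {
    if (v.children.size() == 2 && v.children[0].children.empty() &&
        v.children[1].children.empty()) {
        int a = v.children[0].number;
        int b = v.children[1].number;
        assert(std::max(a, b) < static_cast<int>(seen.size()));
        if (std::abs(a - b) == 1) ++seen[std::min(a, b)];
        return;
    }
    for (const Node& c : v.children) countCherries(c, seen);
}

// Left endpoints k of all cherries [k, k+1] that occur in every tree.
std::vector<int> sharedCherries(const std::vector<Node>& trees, int leaves) {
    std::vector<int> seen(leaves + 1, 0);  // one counter per possible cherry
    for (const Node& t : trees) countCherries(t, seen);
    std::vector<int> shared;
    for (int k = 1; k < leaves; ++k)
        if (seen[k] == static_cast<int>(trees.size())) shared.push_back(k);
    return shared;
}
```

Each tree contributes at most one increment per counter, so a counter equal to the number of trees means the cherry is present in all of them. The table has $O(n)$ entries and each traversal touches every node once, which is consistent with the claim that the bookkeeping stays within the cost of the traversals already performed.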

If we can map from an interval to the node position in each tree in constant time, we can also remove cherries in constant time. Such a map can be maintained during the depth-first traversal of the tree without any further asymptotic cost, and as we remove at most $tn$ cherries in our algorithm, this step does not alter the asymptotic execution time.

If the trees are very similar, we expect the trees to be pruned very fast, resulting in very small trees. However, if the trees are very dissimilar, we expect the alterations to slow down the program significantly. Therefore, comparing the two versions of Day's algorithm is one of our main goals.

4 Experimental Setup

We have implemented the two algorithms in C++. The implementations are available on... They share the same code for parsing the files and printing the output, along with most of the code for traversing the trees. We have tested our implementations on two sets of realistic binary trees from the article [2]. The trees are generated by the Recursive-Iterative DCM3 (Rec-I-DCM3) algorithm on (1) a set of 500 aligned rbcL DNA sequences and (2) a set of 1,17 aligned large subunit ribosomal RNA sequences. In the rest of this article, these will be referred to as DNA and RNA. Both sets consist of 1000 trees. Experiments have been performed on different numbers of trees by running the implementations on the first $t$ trees in the DNA data set. We also test varying tree sizes by removing leaves from this data set before performing our experiments; we believe that the resulting trees are still realistic. All timing tests were performed on a MacBook with a 2.16 GHz processor and 2 GB RAM; for each experiment we performed five runs and present the average.

5 Results

5.1 Rec-I-DCM3 Trees

We want to compare our implementations to randomized approaches, and have therefore run our programs on the data sets DNA and RNA from the article [2].

Table 1: Execution time of our algorithms on trees from [2]. Columns: Algorithm, DNA, RNA. Rows: Hash-RF, Day's algorithm, Pruning.
Each algorithm was run five times and the average is presented in Table 1, along with the running time of the Hash-RF program from [2]. As can be

seen from Table 1, our algorithms outperform the randomized approach. This is particularly encouraging as the observations of Hash-RF were made on a 3 GHz processor, as opposed to the 2.16 GHz machine used for Day's algorithm and the pruning strategy. It is less encouraging to see that the pruning strategy seems to be slower on the RNA data set. To examine what the best obtainable time is, we ran the two programs on 1,000 copies of the first tree in the DNA and RNA data sets. Day's ordinary algorithm used the same time on the two trees as it did on the 1,000 different trees. The pruning strategy used 3.0 seconds on the DNA tree and 6.6 seconds on the RNA tree, which is less than a third of the time used by the original algorithm.

5.2 Running Time Experiments

The results of running our implementations on trees of varying size are illustrated in Figure 4, where it can be seen that the running time is indeed linear in the number of taxa $n$. Despite the extra work associated with detecting and removing the shared topologies, our algorithm is faster than Day's algorithm, even for as few as 100 taxa.

Figure 4: Running time in seconds $s$ as a function of the number of taxa $n$, for Day's algorithm and the pruning strategy. The number of trees is kept at 1,000.

We have also tested our implementations on different numbers of trees. The results are presented in Figure 5, where it can be seen that both algorithms are

quadratic in the number of trees $t$. The pruning strategy is, however, a bit faster once the number of trees is large enough.

Figure 5: Running time in seconds $s$ as a function of the number of trees $t$, for Day's algorithm and the pruning strategy. The number of taxa is kept at 500.

5.3 Close Examination of Pruning Strategy

As previously described, our pruning strategy only removes shared topologies of size two. Removing entire topologies in binary trees can be done by repeatedly removing topologies of size two until there are none left. We have examined how much we can prune the trees in terms of how many taxa we can remove. In Figure 6 we have plotted the total number of removed taxa as a function of how many trees we have used as source $T_i$. As can be seen, our strategy identifies and removes shared topologies within very few iterations, and by far the most iterations are performed on trees that share no common topologies. This process is particularly active in the beginning of the algorithm. In Figure 6(a) we focus on the first ten iterations, where within eight iterations the remaining trees share no topologies. As a shared topology (a subtree rooted at an inner node) will always contain at least one topology of size two, the number of taxa will be reduced by at least one in each iteration of our algorithm (given that the trees share a common topology). An example of this can be seen in Figure 6(b), where a typical part of Figure 6 is magnified. The algorithm removes one taxon in each iteration, quickly reaching the optimal number of removed taxa.
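The claim that whole shared topologies disappear through repeated removal of cherries can be illustrated with a sketch. It is a simplification rather than the procedure measured above: cherries are identified by their pair of taxon ids in a `std::map` instead of by intervals in a table, each removal searches the tree instead of using a constant-time position map, and a "round" is run on its own rather than inside an iteration of Day's algorithm. All type and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Rooted tree node (hypothetical type): taxon id for leaves, -1 otherwise.
struct Node {
    int taxon = -1;
    std::vector<Node> children;
};

Node leaf(int taxon) { Node v; v.taxon = taxon; return v; }
Node inner(std::vector<Node> kids) { Node v; v.children = std::move(kids); return v; }

using Cherry = std::pair<int, int>;  // the two taxa of a cherry, smaller first

bool isCherry(const Node& v) {
    return v.children.size() == 2 && v.children[0].children.empty() &&
           v.children[1].children.empty();
}

Cherry cherryOf(const Node& v) {
    assert(isCherry(v));
    int a = v.children[0].taxon;
    int b = v.children[1].taxon;
    return {std::min(a, b), std::max(a, b)};
}

void countCherries(const Node& v, std::map<Cherry, int>& seen) {
    if (isCherry(v)) { ++seen[cherryOf(v)]; return; }
    for (const Node& c : v.children) countCherries(c, seen);
}

// Replace the cherry `target` by a single leaf carrying its smaller taxon.
bool collapse(Node& v, const Cherry& target) {
    if (isCherry(v) && cherryOf(v) == target) {
        v.taxon = target.first;
        v.children.clear();
        return true;
    }
    for (Node& c : v.children)
        if (collapse(c, target)) return true;
    return false;
}

// One round: collapse every cherry present in all trees.
// Returns the number of taxa removed from each tree.
int pruneRound(std::vector<Node>& trees) {
    std::map<Cherry, int> seen;
    for (const Node& t : trees) countCherries(t, seen);
    int removed = 0;
    for (const auto& entry : seen) {
        if (entry.second != static_cast<int>(trees.size())) continue;
        for (Node& t : trees) collapse(t, entry.first);
        ++removed;
    }
    return removed;
}

int countLeaves(const Node& v) {
    if (v.children.empty()) return 1;
    int total = 0;
    for (const Node& c : v.children) total += countLeaves(c);
    return total;
}
```

For two trees over taxa 0 to 4 that share the subtree (1, (2, 3)) but differ elsewhere, for example ((0, 4), (1, (2, 3))) and (0, (4, (1, (2, 3)))), the first round collapses the cherry (2, 3), the second round collapses the newly exposed cherry (1, 2), and the third round finds nothing left to remove. One taxon is removed per round until the shared topology is gone, leaving three leaves in each tree.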

Figure 6: Top: Number of removable taxa $n$ as a function of iterations $i$. The number of taxa removed by our pruning strategy is shown in solid lines; the number of removable taxa is shown in grey. Below: Magnification of the first 10 iterations (a) and of a typical place on the graph (b), where we can see an entire topology being removed.

In a run of Day's algorithm on $t$ trees, the number of DFS traversals performed is $\frac{1}{2} t(t + 1) - 1$. On the DNA data set from Figure 6 this is 500,499. Close inspection of the observed data reveals that 482,566 of the DFS traversals performed in our pruning strategy are performed on trees of optimal size, in the sense that no topology is shared among all the trees. This means that only 17,933 (≈ 3.58%) of our DFS traversals are performed on trees of suboptimal size. The same pattern is seen on the other tested data sets.

6 Conclusions and Future Work

Our studies show that using the pruning strategy pays off when the trees are realistic in the sense that they are generated by a phylogenetic inference program. It even outperforms randomized approaches, but our studies indicate that so does the now more than 20-year-old algorithm by Day. Our studies also indicate that the possible improvement obtained by moving from removing small shared topologies of size two to removing entire topologies is negligible, if not nonexistent. If, however, we remove the entire set of shared topologies before running Day's algorithm, we might gain a small improvement, but this has not been the focus of this study.

In our pruning strategy, we remove topologies of size two, which correspond to small bipartitions. We have not investigated how much, if anything, could be gained by removing shared bipartitions with more than two taxa on both sides. We would not be able to reduce the number of taxa, but the number of internal nodes could be lowered, which might result in a better algorithm.

References

[1] William Day. Optimal algorithms for comparing trees with labeled leaves. Journal of Classification, 2(1):7-28, December 1985.

[2] Seung-Jin Sul and Tiffani L. Williams. A randomized algorithm for comparing sets of phylogenetic trees. In APBC.
