Computing the Quartet Distance Between Trees of Arbitrary Degrees


January 22, 2006
University of Aarhus, Department of Computer Science
Chris Christiansen & Martin Randers
Thesis supervisor: Christian Nørgaard Storm Pedersen

Abstract

Comparing trees with regard to their topology is in itself an interesting theoretical problem in computer science, and furthermore researchers working in the interdisciplinary field of computational biology need tools to compare phylogenetic trees, i.e. trees that describe the relation of species according to evolutionary history. Different methods and different information can result in different phylogenetic trees, and consequently there is a need to be able to compare such trees. Comparison of trees can be done by calculating the distance between them, and among the distance measures usable on trees is the quartet distance. A quartet is a set of four leaves in a tree, and the edges in the tree connecting the leaves imply the topology of the quartet. The quartet distance between two trees containing the same leaves is the number of quartets containing the same four leaves that have different topology in the two trees.

Previous algorithms focus on calculating the quartet distance between binary trees. We explore different approaches for calculating the quartet distance between trees of arbitrary degrees. Each approach gives rise to one or two algorithms with varying running times and space consumptions. The running times are verified experimentally, and a possibility for reducing the space consumption of the fastest algorithm is discussed. We have implemented the fastest algorithm in a tool, which is also presented, along with the feature of visualizing the similarity of trees using the quartet distance.

Acknowledgments

This thesis is the outcome of a project that began more than one year ago. Back then we followed the course Algorithms in Bioinformatics - Trees and Structures taught by Christian N. S. Pedersen and René Thomsen at the Department of Computer Science, University of Aarhus. The final project of the course was about implementing an algorithm for computing the quartet distance between binary trees. Furthermore we were asked to give suggestions about how to compute the quartet distance between trees of arbitrary degrees. In collaboration with Steffen Wang Fischer we came up with such an algorithm, which later became the basis of our thesis. Thanks to Steffen for allowing us to use our collaborative work, and thanks to Christian for drawing our attention to the possibility of using the project as a basis for our thesis.

We would also like to thank Thomas Mailund and Christian for contributing their own ideas regarding algorithms for calculating the quartet distance. Our joint work has among other things resulted in three papers, [7-9], the first of which was published and later presented at the 5th Workshop on Algorithms in Bioinformatics - WABI. The second and third papers have not yet been submitted, but will be in the near future.

During the last five months we have shared our office with Martin Stig Stissing, who is also writing a master's thesis on the quartet distance. When Stissing (since there are two Martins in our office, we call the second one Stissing) heard about our algorithms, he participated actively in the development of an even faster algorithm, which is presented in [9].

Christian and Thomas have both been very helpful and forthcoming during the months we have been working on our thesis. They have given constructive feedback on all parts of our work, for which we are very grateful. We also want to thank Stissing for the work we have done together and for the good times we have had in our office.

Chris Christiansen & Martin Randers
Århus, January 22

Contents

Abstract
Acknowledgments
1 Introduction: Phylogenetic Trees; Tree Distance; Thesis Outline

I Algorithms
2 Terminology
3 Previous Algorithms for Calculating the Quartet Distance: Binary Trees (Tsang's Algorithm; Brodal et al.'s Algorithm); Trees of Arbitrary Degrees (Tsang's Algorithm; Brodal et al.'s Algorithm); Summary
4 Calculating Leaf Set Sizes: Subtree Leaf Set Sizes; Shared Leaf Set Sizes; Using Expansion; Using Rooting; Non-leaf Subtrees
5 Algorithms for Computing the Quartet Distance: Center Based Approaches; Without Shared Leaf Set Sizes; With Shared Leaf Set Sizes; Edge Claiming Approaches; Counting Quartets Using Edges; Butterfly Quartets in a Single Tree; Expansion: Shared and Nonshared Butterfly Quartets and Butterfly Quartets in a Single Tree; Without Expansion: Butterfly Quartets in a Single Tree; Without Expansion: Shared and Nonshared Butterfly Quartets; Node Claiming Approach; Counting Quartets Using Nodes; Butterfly Quartets in a Single Tree; Shared and Nonshared Butterfly Quartets; Summary
6 Experiments: Expected Performance (Center Based Algorithms; Edge Claiming Algorithms; Node Claiming Algorithm; Shared Leaf Set Size Algorithms); Performance in Practice (Center Based Algorithms; Edge Claiming Algorithms; Node Claiming Algorithm; Shared Leaf Set Size Algorithms); Summary

II Related Subjects
7 Quartet Distance and Related Measures: Normalized Measures; Quartet Fit Similarity
8 Input Trees that do not Fit the Assumptions: Leaves with Arbitrary Labels; Nodes of Degree Two; Trees with Different Leaf Sets; Pruning Trees; Other Measures
9 Reducing Space Consumption: Coloring and Rooting
10 Visualization: Visualization Using Inducing Edges and Center Nodes; Visualization Using all Edges
11 Tool: Visualization
12 Comparison of Split Distance and Quartet Distance
13 Conclusions

III Papers
14 Computing the Quartet Distance Between Trees of Arbitrary Degrees: Introduction; Algorithms; General Quartet Distance in Time $O(n^4)$ and Space $O(n)$; General Quartet Distance in Time $O(n^3)$ and Space $O(n^2)$; General Quartet Distance in Time $O(n^2 d^2)$ and Space $O(n^2)$; Experiments; Conclusions
15 Tools for Calculating the Split- and Quartet-Distance for Sets of Trees of Arbitrary Degrees: Background; Split Distance; Quartet Distance; Implementation; SplitDist; QuartetDist; Results and Discussion; Conclusions; Availability and Requirements
16 Fast Calculation of the Quartet Distance Between Trees of Arbitrary Degrees: Introduction; Terminology; Algorithm; Counting Shared Butterfly Quartets; Counting Nonshared Butterfly Quartets; Counting Butterfly Quartets in a Single Tree; Time Analysis for Different Types of Trees; Verification of Theoretical Running Times; Calculating the Shared Leaf Set Sizes; Calculating the Sizes of all Subtree Leaf Sets; Conclusion

A Table of Symbols

1 Introduction

In computer science a tree is a data structure consisting of nodes (or vertices) connected by edges (or branches) in such a way that there is exactly one path of edges between any pair of nodes. The degree of a node is the number of edges connected to it, and nodes of degree one are called leaves. Trees where all nodes, except leaves, have degree three are called binary trees. A tree is rooted if one of the nodes has been designated the root of the tree; otherwise it is said to be unrooted. Rooted trees can be used to represent a hierarchy, while unrooted trees can be used to describe less strict relationships. The configuration of nodes and edges in a tree is called the topology of the tree.

The tree data structure is well suited to describe relationships between elements represented by the leaves of the tree. These relationships are encoded in the topology of the tree. Since trees with the same set of leaves can have different topologies, it is interesting to measure this difference, both as a pure computer science problem and for researchers who work with trees in other contexts.

1.1 Phylogenetic Trees

Since Charles Darwin proposed the theory of evolution through natural selection, there has been a lot of interest in reconstructing the evolutionary history of all life forms. A phylogenetic tree is a tree showing the evolutionary interrelationships among various species that are believed to have a common ancestor. The leaves of a phylogenetic tree represent the species and the rest of the nodes represent their ancestors. A speciation event is the direct evolution of a species into two or more species, and thus the evolution of one ancestral species into modern day species consists of a number of speciation events.

The true interrelationships between species are usually unknown, but different methods have been used to construct trees modeling them. Among these methods are datings of fossils and the comparison of DNA (or RNA) from different organisms. There are several DNA based methods, see e.g. [13] or [16] for an overview. Some methods infer rooted trees, where the most recent common ancestor of the set of species is represented by a root, and the direction of evolution is from the root to the leaves. Other methods infer unrooted trees, which can be rooted by using an outgroup, i.e. a species that is known to be more distant to all of the species in the unrooted tree than they are to each other.

In evolutionary biology it is considered highly unlikely that a single speciation event results in three or more species. Therefore all speciation events are assumed to result in exactly two species. This means that if for example a species evolves into three species, it is assumed that two speciation events have occurred. When inferring evolutionary trees the goal is to have them contain every speciation event, which means that the trees must be binary or fully resolved. In some cases, the data available is not sufficient to infer these fully resolved trees. Different methods for creating phylogenetic trees handle this problem differently. Most methods create fully resolved trees and do this by choosing the most likely order of speciation events according to some model. Other methods, e.g. the Buneman [2, 6] and refined Buneman [4, 15] methods, construct fully resolved binary trees only if this is well supported by the input data. This means that trees inferred by these methods may be non-binary.

Using the same data, different methods can infer different trees for a set of species. Also, the same method can infer different trees for a set of species when applied to different information about the species, e.g. different genes. One way of studying such differences is to define a distance measure on trees.

1.2 Tree Distance

When defining distance measures on trees one must decide which properties of the trees to consider. When the trees have the same leaves (species in phylogenetic trees), the only differences between them can be the branch lengths and the topology of the trees. Several distance measures based on the topology of the input trees have been proposed, e.g. the symmetric difference metric [20], the nearest-neighbor interchange metric [27], the subtree transfer distance [1], the Robinson and Foulds distance [21], and the quartet distance [12]. As mentioned in [13], the symmetric distance and the Robinson and Foulds distance can also easily take branch lengths into account.

Our thesis is concerned with calculating the quartet distance between unrooted trees. Each set of four species in an evolutionary tree defines a quartet, which can have four possible quartet topologies, shown in Fig. 1.1. We will denote quartets with topologies of the types in Fig. 1.1(a)-1.1(c) as butterfly quartets or quartets with butterfly topology, and quartets with topologies of the type in Fig. 1.1(d) as star quartets or quartets with star topology. Note that the three butterfly quartets have different topologies, even though we will say that they all have butterfly topology. Given two trees on the same set of species, the quartet distance between them is the number of quartets for which the quartet topologies differ in the two trees.

Figure 1.1: The four possible quartet topologies of species a, b, c, and d in an unrooted tree. Quartets with topology (a) are written $ab|cd$ and are called butterfly quartets, or quartets with butterfly topology, similarly for (b) and (c). Quartets with topology (d) are written ${}^{a}_{b}{\times}{}^{c}_{d}$ and are called star quartets or quartets with star topology.
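The definition above can be turned directly into a brute-force computation. The following is a minimal sketch, not one of the algorithms developed in this thesis: it assumes trees are given as adjacency maps, and it classifies each quartet with the standard four-point condition on path lengths (the pairing with the strictly smallest distance sum is the butterfly topology; if all three sums are equal the quartet is a star). The function names and the two five-leaf example trees are illustrative assumptions.

```python
from collections import deque
from itertools import combinations


def make_tree(edges):
    """Build an adjacency map for an unrooted tree from a list of edges."""
    tree = {}
    for u, v in edges:
        tree.setdefault(u, set()).add(v)
        tree.setdefault(v, set()).add(u)
    return tree


def distances_from(tree, source):
    """Breadth-first search: number of edges from source to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in tree[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def quartet_topology(dist, a, b, c, d):
    """Return the leaf pairing of a butterfly quartet, or "star"."""
    candidates = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    sums = [dist[p][q] + dist[r][s] for (p, q), (r, s) in candidates]
    smallest = min(sums)
    if sums.count(smallest) > 1:
        return "star"  # no internal edge separates the four leaves into pairs
    (p, q), (r, s) = candidates[sums.index(smallest)]
    return frozenset({frozenset({p, q}), frozenset({r, s})})


def quartet_distance(tree1, tree2, leaves):
    """Count the quartets whose topology differs, by checking all of them."""
    dist1 = {x: distances_from(tree1, x) for x in leaves}
    dist2 = {x: distances_from(tree2, x) for x in leaves}
    return sum(
        quartet_topology(dist1, *q) != quartet_topology(dist2, *q)
        for q in combinations(sorted(leaves), 4)
    )


# Two binary trees on the leaves a..e; u, v, w are internal nodes.
first = make_tree([("a", "u"), ("b", "u"), ("u", "v"), ("c", "v"),
                   ("v", "w"), ("d", "w"), ("e", "w")])
second = make_tree([("a", "u"), ("b", "u"), ("u", "v"), ("e", "v"),
                    ("v", "w"), ("c", "w"), ("d", "w")])
leaves = ["a", "b", "c", "d", "e"]

print(quartet_distance(first, second, leaves))  # prints 2
```

Because every one of the $\binom{n}{4}$ quartets is inspected individually, this sketch is only practical for small trees, which is the motivation for the faster algorithms discussed in the rest of the thesis.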

Subsets of one, two or three leaves have a fixed topology in an unrooted tree, regardless of the topology of the tree, see Fig. 1.2, so at least four leaves (i.e. quartets) are needed to give information about the topology of the tree. The set of quartets for a tree is unique and, as shown in [6], a tree can be reconstructed in polynomial time from its quartets.

Figure 1.2: The possible topologies of sets of one, two and three leaves from an unrooted tree. The topologies are fixed, regardless of the topology of the tree.

Previous algorithms for computing the quartet distance, e.g. the ones created by Bryant et al. [5] and Brodal et al. [3], focus on comparing binary trees and therefore do not consider star quartets. In this thesis we present a set of algorithms that compute the quartet distance between two trees of arbitrary degrees.

1.3 Thesis Outline

The rest of this thesis is divided into three parts.

Part I presents our algorithms and some experimental results validating the running time of the algorithms. In Chap. 2 we define some terminology used in the rest of the thesis and list some basic observations. In Chap. 3 we present two of the fastest algorithms for computing the quartet distance between binary trees and discuss the possibility of extending these to work on trees of arbitrary degrees. In Chap. 4 we explain how to do the preprocessing needed for our algorithms, which are presented in Chap. 5. In Chap. 6, we present experiments that validate the theoretical running time of the algorithms.

Part II contains various subjects related to the quartet distance. In Chap. 7 we discuss measures related to the quartet distance. In Chap. 8 we show how to deal with input trees that do not fit the assumptions used in Part I and how this affects the measures in Chap. 7. In Chap. 9 we show a way of reducing the space consumption of the fastest algorithm in Part I. In Chap. 10 we present ideas for visualizing the similarity of trees using the quartet distance and in Chap. 11 we present our tool for calculating the quartet distance and related measures. In Chap. 12 we present the result of a comparison between the split distance and the quartet distance and finally in Chap. 13 we summarize and conclude our work.

Part III contains reprints of three papers which we have written in cooperation with Christian N. S. Pedersen, Thomas Mailund and in one case Martin Stig Stissing. The papers are self-contained and use slightly different terminology than the rest of the thesis. Chap. 14 is a reprint of the paper Computing the quartet distance between trees of arbitrary degree ([7]), which presents two of our algorithms for computing the quartet distance for trees of arbitrary degree. Chap. 15 is a reprint of Tools for Calculating the Split- and Quartet-Distance for Sets of Trees of Arbitrary Degrees ([8]), which presents our tool for computing the quartet distance, along with Thomas Mailund's tool for calculating the split distance. We also present a comparison of the split distance and the quartet distance in the paper. Chap. 16 is a reprint of Fast Calculation of the Quartet Distance Between Trees of Arbitrary Degrees ([9]), in which we present the asymptotically fastest of the algorithms described in this thesis.

The symbols we use throughout the thesis can be found in Appendix A along with a short explanation. It should not be used as a stand-alone dictionary of the symbols, but as a way to refresh the meaning of the symbols for the reader.

The first parts of this thesis to be written were the papers we present in Part III. The papers have been a big part of our work, and therefore we have chosen not to make them an integral part of Part I and II. Instead we present our work in a continuous manner in Part I and II, and fill in details not covered by our papers due to page limits. Some details are well covered in the papers, and therefore we will refer to these to avoid too much redundancy. These referrals will be to the chapters and sections in this thesis presenting the papers, which will make them easily distinguishable from referrals to papers by other authors. By not integrating the papers fully into the first two parts, we hope to give the reader an impression of the working process behind this thesis.

Part I Algorithms

2 Terminology

The algorithms we describe in this thesis all work under the same assumptions: There are two unrooted input trees of arbitrary degrees, $T$ and $T'$, and these have the same $n$ leaves, numbered $1, \ldots, n$. In this thesis, we assume that all trees are unrooted trees of arbitrary degrees, unless explicitly stated otherwise. Furthermore we will assume that no nodes in the trees have degree two.

We let $V$ denote the nodes of $T$ that are not leaves and similarly with $V'$ and $T'$. The term internal nodes refers to elements of $V$ and $V'$, i.e. internal nodes are not leaves. The set of leaves in $T$ is denoted $L$ and the set of edges is denoted $E$, similarly with $L'$ and $E'$ in $T'$. Notice that in any tree $T$ we have $|V| < |L| = n$. We also define the internal edges, $IE$, to be the set of edges that connect two internal nodes, i.e. an internal edge is not attached to a leaf. Note that

$$|E| = |V| + |L| - 1 = |V| + n - 1$$

and

$$|IE| = |V| - 1.$$

The degree, $d_v$, of an internal node $v \in V$ is the number of edges attached to it. Similarly we define the internal degree, $id_v$, of an internal node $v \in V$ to be the number of internal edges attached to it. Notice that degree and internal degree can also be defined for leaves. In this case the degree is always one and the internal degree is always zero. We will let

$$d = \max_{v \in V}(d_v) \quad \text{and} \quad id = \max_{v \in V}(id_v).$$

Similarly for $d'$ and $id'$ in $T'$. Summing the degree of all internal nodes and leaves in a tree corresponds to counting all edges twice, i.e.

$$\sum_{v \in V \cup L} d_v = 2|E|,$$

and similarly summing the internal degree of all internal nodes in a tree corresponds to counting all internal edges twice, i.e.

$$\sum_{v \in V} id_v = 2|IE| = 2(|V| - 1).$$

Note also that $|V| \cdot d \geq n$.

We will say that every edge $e$ in a tree $T$ defines two rooted subtrees $F$ and $\overline{F}$; instead of viewing $e$ as a single undirected edge, we can view it as two directed edges $e_1$ and $e_2$. In the setting shown in Fig. 2.1 we say that $F$ is in front of $e_1$ and $\overline{F}$ is behind $e_1$, while $\overline{F}$ is in front of $e_2$ and $F$ is behind $e_2$. For directed edges, we also say that they represent the subtrees in front of them, i.e. in Fig. 2.1 $e_1$ represents $F$ and $e_2$ represents $\overline{F}$. In general, when speaking of a rooted subtree $F$, the subtree that is behind the edge representing $F$ is denoted $\overline{F}$. Note that $\overline{\overline{F}} = F$.

Figure 2.1: Edge $e$ defines rooted trees $F$ and $\overline{F}$ and directed edges $e_1$ and $e_2$.

A node $v$ of degree $d_v$ has $d_v$ directed edges pointing away from it. As shown in Fig. 2.2, each of these edges represents a rooted subtree and all of these $d_v$ subtrees are the subtrees of $v$.

Figure 2.2: The internal node $v$ and the subtrees, $F_1, \ldots, F_d$, of $v$. Here $d = 6$.

Each rooted subtree contains a set of leaves, and when given two subtrees $F$ and $G$ of $T$ and $T'$ respectively, we define their shared leaf set, $F \cap G$, as the set of leaves present in both subtrees. Thus, the term $|F \cap G|$ is the number of leaves present in both $F$ and $G$. We will also let the leaf set size, $|G|$, of a single tree $G$ denote the number of leaves in it. Note that $|F| + |\overline{F}| = n$, because the subtrees represented by a directed edge and its oppositely pointing directed edge are complementary with regard to the leaves of the tree they are contained in.

When referring to an element, i.e. a node, a subtree or an edge, in a tree and another similar element in another tree, we will often refer to this as a pair of elements, e.g. a pair of nodes from two trees. In other words we make it implicit that a pair of elements has one element in one tree and the other in another tree.

Two unrooted trees $T$ and $T'$ contain a number of rooted subtrees. We denote the shared leaf set sizes of all pairs of these rooted subtrees as the shared leaf set sizes between $T$ and $T'$.

As mentioned in Chap. 1, we divide quartets into butterfly quartets and star quartets.
Since butterfly quartets can have three different topologies (see Fig. 1.1), a quartet can have butterfly topology in both input trees even though the topologies are not the same. We call these quartets nonshared butterfly quartets. Butterfly quartets with the same topology in both input trees are called shared butterfly quartets. For example, in Fig. 2.3 the topologies of quartets in the first tree are $ab|cd$, $ab|ce$, $ab|de$, $ac|de$ and $bc|de$, while the topologies of quartets in the second tree are $ab|cd$, $ab|ce$, $ab|de$, $ae|cd$ and $be|cd$. This means that the three quartets containing a and b are shared butterfly quartets, while the remaining two quartets are nonshared butterfly quartets. In general we say that a quartet is shared if it has the same topology in both trees. This means that quartets that have star topology in both trees are automatically shared.

Figure 2.3: Two trees on the leaves a, b, c, d and e. The three quartets containing a and b are shared butterfly quartets, while the remaining two quartets are nonshared butterfly quartets.

Each choice of four leaves in a tree defines a quartet, thus there are $\binom{n}{4}$ quartets in a tree with $n$ leaves. We use the term $\mathrm{qdist}(T, T')$ to denote the quartet distance between trees $T$ and $T'$. The quartet similarity, which is the number of shared quartets, is likewise denoted by $\mathrm{qsim}(T, T')$. Summing these gives the total number of quartets, i.e.

$$\mathrm{qdist}(T, T') + \mathrm{qsim}(T, T') = \binom{n}{4}.$$

When using big O notation we write $f(x) = O(g(x))$ when $f(x)$ is big O of $g(x)$. This is a slight abuse of the equality sign, since the property of being big O of a function is not symmetric, and for this reason some authors prefer using the $\in$-sign instead. However, we think that the equality sign represents the word "is" better.
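As a small check of the identity $\mathrm{qdist}(T, T') + \mathrm{qsim}(T, T') = \binom{n}{4}$, the sketch below uses the quartet topologies listed for the two trees of Fig. 2.3. Encoding each topology as a string such as "ab|cd" is an assumption made for illustration; it only works here because both lists write each topology in the same canonical order and every quartet is a butterfly quartet in both trees.

```python
from math import comb

# Quartet topologies of the two trees in Fig. 2.3, as listed in the text.
first = {"ab|cd", "ab|ce", "ab|de", "ac|de", "bc|de"}
second = {"ab|cd", "ab|ce", "ab|de", "ae|cd", "be|cd"}

n = 5                        # leaves a, b, c, d, e
total = comb(n, 4)           # number of quartets in a tree with n leaves
qsim = len(first & second)   # quartets with the same topology in both trees
qdist = total - qsim         # quartets whose topologies differ

print(total, qsim, qdist)    # prints 5 3 2
```

The design point is that counting shared quartets is sufficient: the distance follows by subtraction from the total.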

3 Previous Algorithms for Calculating the Quartet Distance

Most existing algorithms for computing the quartet distance focus on binary trees. The only exception we know of is the tool presented by Piaggio-Talice et al. in [19], which can compute the quartet distance in time $O(n^4)$ for a pair of trees. Since the algorithm implemented in the tool is not described in [19], and since we present an $O(n^4)$ time algorithm of our own in Chap. 5, we will not consider this algorithm. Instead we briefly present two of the fastest algorithms designed to work on binary trees and discuss the possibilities of extending these to work on trees of arbitrary degrees.

3.1 Binary Trees

The purpose of this section is not to fully describe the algorithms, but instead to give the reader an impression of the parts of the algorithms that are important when trying to generalize them to work on trees of arbitrary degrees. Binary trees only contain butterfly quartets, and each internal node has degree three. This is used by the $O(n^2)$ time algorithm in [5], described in detail by Tsang in his master's thesis [26], and the $O(n \log n)$ algorithm in [3] by Brodal et al.

3.1.1 Tsang's Algorithm

Tsang's algorithm is based on the observation that every quartet is associated with at least one internal edge that splits the leaves into two pairs. Such an edge is said to induce the quartet's topology. The algorithm works by building a set of quartet topology sets, qt-sets, for each tree, which represent sets of topologies of quartets. The qt-sets are created by processing the internal edges of each tree in an ordered way: The processing starts in some arbitrary internal edge, which claims all quartet topologies induced by it, and encodes these in a qt-set. Then the rest of the edges are processed in an order such that they have at least one adjacent edge that has already been processed. To make sure that each quartet topology is only claimed by one qt-set, each edge claims all induced topologies not already claimed by adjacent edges, and encodes these in a constant number of qt-sets. Since there are $O(n)$ internal edges in each tree, and each of these adds a constant number of qt-sets, all induced quartet topologies in a binary tree can be encoded in $O(n)$ qt-sets.

The number of butterfly quartets with the same topology encoded by two qt-sets can be computed in constant time. Thus counting all butterfly quartets with the same topology in a pair of trees with $n$ leaves can be done by comparing all pairs of qt-sets and takes time $O(n^2)$. Subtracting this number from the total number of quartets, $\binom{n}{4}$, results in the quartet distance. This gives an $O(n^2)$ time algorithm for determining the quartet distance between a pair of binary trees.

3.1.2 Brodal et al.'s Algorithm

The algorithm in [3] is based on doing a hierarchical decomposition of one of the input trees and coloring the leaves in the other (see the paper for details). The decomposition tree is rooted and, under the assumption that the input tree is binary, has logarithmic height. Each internal node in the decomposition tree contains a polynomial with a constant number of terms, based on the assumption that both of the input trees are binary. These polynomials are used when computing the quartet distance.

The other tree is colored according to each node, which means giving the leaves of each subtree of the node a unique color. Given such a coloring, the value of the polynomial in the root of the decomposition tree can be computed, and it determines the number of butterfly quartets in the decomposed tree that agree with the coloring. Each time a leaf receives a new color, all the polynomials on the path from the leaf to the root of the decomposition tree must be updated. There is a logarithmic number of such polynomials, each of which can be updated in constant time. The leaves are colored according to each node in such a way that when all coloring has been done, each leaf has been colored a logarithmic number of times. This enables Brodal et al. to create an algorithm running in time $O(n \log^2 n)$, which can be improved to run in time $O(n \log n)$.

3.2 Trees of Arbitrary Degrees

Since the two presented algorithms work on binary trees, they do not consider star quartets. However, we show later in the thesis that the quartet distance can be found by only considering butterfly quartets.
Therefore we will ignore this problem and investigate the possibilities of extending the algorithms to be able to handle trees of arbitrary degrees.

3.2.1 Tsang's Algorithm

Assume that Tsang's algorithm is given two trees with internal nodes of arbitrary degrees. When processing an internal edge, the algorithm needs to consider the adjacent edges that may or may not have already been processed. In Fig. 3.1, the edge $e$ has a processed adjacent edge $e_c$ and both $e$ and $e_c$ are connected to a node $v$ of degree $d_v$, so $e$ must at least claim all quartets where two leaves are in different subtrees of $v$ and two leaves are in the subtree at the other end of $e$. (This subtree is actually also a subtree of $v$, but we do not consider it so in this setting.) Since $v$ has degree $d_v$ and all combinations of two subtrees must be encoded as qt-sets, there are at least $\binom{d_v - 1}{2} = O(d_v^2)$ qt-sets created when processing $e$.

Figure 3.1: The edge $e_c$ is processed, so all quartets containing two leaves from $F_3$ have been claimed. Therefore processing $e$ involves creating qt-sets representing topologies of quartets where two leaves are in $F$ and two leaves are in different subtrees, $F_1$, $F_2$ and $F_3$. This gives $\binom{3}{2} = 3$ qt-sets.

All internal edges in a tree must be processed, so given some internal node $v$ there are $id_v$ internal edges connected to it, which will be processed in some order. By the arguments above, all of these edges, except the first, will need at least $\binom{d_v - 1}{2}$ qt-sets to encode claimed quartet topologies. This means that there will be at least

$$\sum_{v \in T} (id_v - 1)\binom{d_v - 1}{2} = O\Big(\sum_{v \in T} id_v \, d_v^2\Big) = O\Big(d^2 \sum_{v \in T} id_v\Big) = O(|V| d^2)$$

qt-sets in a tree $T$. Comparing all pairs of qt-sets in two trees $T$ and $T'$ thus takes time $O(|V| |V'| d^2 d'^2)$. An example of a worst case input tree is shown in Fig. 3.2. If such a tree contains $n$ leaves, it contains $O(n)$ internal nodes, and $O(n)$ internal edges connected to a node of degree $O(n)$. This gives the algorithm a worst case running time of $O(n^6)$, even though there are only $\binom{n}{4} = O(n^4)$ quartets to compare.

Figure 3.2: A worst-case input tree. If such a tree contains $n$ leaves, it contains $O(n)$ internal nodes and $O(n)$ internal edges connected to an internal node of degree $O(n)$.

3.2.2 Brodal et al.'s Algorithm

Brodal et al.'s algorithm as described in [3] relies on the input trees being binary, but at the time of writing, in [24], Stissing et al. are generalizing the algorithm presented in [3] to work on trees where all nodes have a degree less than some constant. However, if the input trees have a node of degree $O(n)$, like for example the tree shown in Fig. 3.2, the current status is that the decomposition tree will have linear height and the polynomial in the root will contain $O(n^6)$ terms. In conclusion, the current status is that on worst case trees, a generalized version of the algorithm will have a running time of at least $O(n^6)$. Again, this exceeds the $O(n^4)$ quartets that have to be compared.

3.3 Summary

Above we presented two of the fastest algorithms for calculating the quartet distance on binary trees and discussed the possibilities for extending them to work on non-binary trees. Tsang's algorithm can easily be generalized, but only at the expense of running time. The worst case running time for the generalized version of Tsang's algorithm is $O(n^6)$, even though a pair of trees with $n$ leaves only contains $O(n^4)$ different quartets. Brodal et al.'s algorithm is currently being generalized to trees of constant maximal degree, without changing the asymptotic time complexity. The current status is that for trees with degree $O(n)$, the running time will be $O(n^6)$ for this generalized algorithm.

To our knowledge, before we began our work on this thesis, the only algorithm capable of calculating the quartet distance between trees of arbitrary degrees was the one implemented in the tool presented in [19]. Our contribution to this niche is the analysis and implementation of five algorithms capable of computing the quartet distance between trees of arbitrary degrees. Four of our algorithms are asymptotically faster than the one used in [19], and all are presented in Chap. 5. Preprocessing of the input trees is necessary for some of these algorithms; this preprocessing is presented in the following chapter.
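The growth of the qt-set count for the generalized version of Tsang's algorithm can be illustrated numerically. The sketch below evaluates the lower bound of $(id_v - 1)\binom{d_v - 1}{2}$ qt-sets per internal node on one tree family that fits the description of a worst-case input. The family itself (a central hub joined to $n/2$ internal nodes that each carry two leaves) and the function names are assumptions for illustration; the exact tree drawn in Fig. 3.2 may differ.

```python
from math import comb


def qt_set_lower_bound(nodes):
    """Sum (id_v - 1) * C(d_v - 1, 2) over internal nodes given as (d_v, id_v)."""
    return sum(max(id_v - 1, 0) * comb(d_v - 1, 2) for d_v, id_v in nodes)


def hub_with_cherries(n):
    """(d_v, id_v) pairs for an assumed worst-case shape with n leaves (n even).

    A central hub is joined to n/2 internal nodes, each carrying two leaves.
    """
    k = n // 2
    return [(k, k)] + [(3, 1)] * k


for n in (8, 16, 32, 64):
    per_tree = qt_set_lower_bound(hub_with_cherries(n))
    # Columns: leaves, qt-sets per tree, pairs of qt-sets, number of quartets.
    print(n, per_tree, per_tree ** 2, comb(n, 4))
```

For this family the number of qt-sets per tree grows cubically in $n$, so the number of pairs to compare grows as $n^6$ and quickly overtakes the $\binom{n}{4}$ quartets that a direct comparison would inspect.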

24 14 CHAPTER 4 4 Calculating Leaf Set Sizes In this chapter, we investigate how to compute sizes of leaf sets in a single tree, and sizes of shared leaf sets in two input trees. These sizes are needed for the algorithms we present in the next chapter. When considering binary trees, the sizes of subtrees in a single tree can be found in time O(n), while the sizes of shared leaf sets can be found in time O(n 2 ). At first glance, it might seem that handling trees containing high-degree nodes makes the problem simpler, since the number of subtrees decreases as the degrees increase. However, this is not the case and care must be taken to compute the values in time that is not asymptotically worse than O(n) and O(n 2 ) respectively. 4.1 Subtree Leaf Set Sizes Assume we are given a rooted subtree F r, rooted in the node r, and have to compute the number of leaves in it. The solution can be easily defined recursively: If r is a leaf, the size is one. If r is not a leaf it is an internal node of degree d r in the unrooted tree. Let F r have the rooted subtrees F r1,..., F rdr 1 as children. The number of leaves in F r can be computed as the sum of leaves in these children. If the input tree is binary, this means that the size of a rooted subtree is either one, or the sum of precisely two other sizes, which can be computed in O(1) time. There are three rooted subtrees of each of the n 2 internal nodes and the size of each takes O(1) time to compute. Thus computing the sizes of all rooted subtrees in an unrooted binary tree takes time O 3 1 = O(n). v V If the input tree is not binary, high-degree nodes can make this running time asymptotically worse. Each time the size of a subtree rooted in an internal node v of degree d v has to be computed, it requires summing d v 1 terms as opposed to the two terms in the binary case. Since there are d v such subtrees, the sizes of the subtrees of v takes O(d 2 v ) time to compute. 
Doing this for all nodes takes time

$$O\Big(\sum_{v \in V} d_v^2\Big) = O\Big(d \sum_{v \in V} d_v\Big) = O(dn),$$

where $d$ denotes the maximal degree of a node in the tree. Assuming the input tree has a topology like the one shown in Fig. 3.2, the running time will be $O(n^2)$. The problem with this approach is that subtrees rooted in a node of high degree take a long time to handle.
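The recursive definition above can be sketched directly in code. The following Python snippet is only a minimal illustration and not the thesis implementation: the adjacency-dictionary representation, the example tree `TREE` and the function name `leaf_count` are assumptions made for this example. One value is memoized per directed edge, and each value rooted in a node of degree $d_v$ sums $d_v - 1$ terms, which is the behaviour analysed above.

```python
from functools import lru_cache

# Hypothetical example tree (not taken from the thesis): an unrooted tree
# stored as an adjacency dictionary. Nodes with one neighbour are leaves.
TREE = {
    "u": ["a", "b", "v"],
    "v": ["u", "c", "w"],
    "w": ["v", "d", "e"],
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["w"], "e": ["w"],
}


@lru_cache(maxsize=None)
def leaf_count(parent, node):
    """Number of leaves in the subtree rooted in `node`, seen from `parent`.

    The pair (parent, node) plays the role of the directed edge
    parent -> node, so there is one cached value per directed edge.
    """
    if len(TREE[node]) == 1:  # a leaf has size one
        return 1
    # An internal node of degree d sums the sizes of its d - 1 children.
    return sum(leaf_count(node, child)
               for child in TREE[node] if child != parent)
```

For instance, `leaf_count("u", "v")` is the size of the subtree that contains the leaves c, d and e.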

Rooting $T$ in an arbitrary internal node $r$ gives rise to the rooted tree $T_r$, as seen in Fig. 4.1. We define the subtrees of $T_r$ as the subtrees pointing down (away from the root). By a single depth-first traversal of $T_r$ the sizes of all subtrees of $T_r$ can be computed, and since there is only one subtree of $T_r$ rooted in each node, this takes time

$$O\Big(\sum_{v \in T_r} (d_v - 1)\Big) = O(n).$$

Figure 4.1: An arbitrary node, $r$, is chosen as root, turning $T$ into $T_r$. (a) An unrooted tree $T$. (b) A rooted tree $T_r$.

This only computes the sizes of all subtrees of $T$ that do not contain $r$, since only sizes of trees that point away from $r$ are computed. Consequently we do not know the sizes of the subtrees that do contain $r$. For each subtree $F$ that does not contain $r$, the complementary subtree $\overline{F}$ does contain $r$, and using this observation, we can compute the remaining subtree sizes since

$$|F| + |\overline{F}| = |L| = n \iff |\overline{F}| = n - |F|. \tag{4.1}$$

Computing all sizes of subtrees in $T_r$, and using (4.1), allows all sizes of subtrees of $T$ to be found in time $O(n)$.

4.2 Shared Leaf Set Sizes

When considering binary trees, calculating and storing shared leaf set sizes of all pairs of subtrees for a pair of trees with $n$ leaves can be done in time and space $O(n^2)$, see e.g. [26]. Assume we are given a pair of rooted subtrees $F$ and $G$ rooted in $v$ and $v'$ respectively. With respect to this rooting, let the children of $v$ be $F_1$ and $F_2$, and let the children of $v'$ be $G_1$ and $G_2$, see Fig. 4.2. The following equality gives two recursive ways of calculating the shared leaf set size of $F$ and $G$:

$$|F \cap G| = |F_1 \cap G| + |F_2 \cap G| = |F \cap G_1| + |F \cap G_2|. \tag{4.2}$$

There are $O(n)$ rooted subtrees in each tree, and by using dynamic programming, all of the $O(n^2)$ sizes of shared leaf sets can be found in time and space $O(n^2)$. Note that it takes constant time to compute the shared leaf set size of a

pair of leaves, since leaves are numbered. If the numbers are equal, the shared leaf set size is one, otherwise it is zero.

Figure 4.2: (a) The subtrees of $v$ are $F_1$ and $F_2$. (b) The subtrees of $v'$ are $G_1$ and $G_2$.

Assume that the trees have arbitrary degrees, and that $v$ and $v'$ have children $F_1, \ldots, F_{d_v-1}$ and $G_1, \ldots, G_{d_{v'}-1}$ respectively. We get the following generalization of (4.2):

$$|F \cap G| = \sum_{i=1}^{d_v-1} |F_i \cap G| = \sum_{j=1}^{d_{v'}-1} |F \cap G_j|. \tag{4.3}$$

Using dynamic programming, computing (4.3) takes time $O(\min\{d, d'\})$ for each pair of rooted subtrees $F$ and $G$. Each directed edge in a tree corresponds to a rooted subtree, and for each pair of nodes $v$ and $v'$ (including leaves) in the trees, the expression has to be computed for each pair of directed edges leading to the nodes, which takes time

$$\sum_{v \in T} \sum_{v' \in T'} d_v \, d_{v'} \min\{d - 1, d' - 1\} = O(n^2 \min\{d, d'\}). \tag{4.4}$$

Computing the shared leaf set sizes between two trees of the type shown in Fig. 3.2 will take time $O(n^3)$. This can be done faster, and below we present two ways of calculating the sizes of shared leaf sets in time and space $O(n^2)$. The first of these is based on expansion of the input trees, while the second uses rooting of the input trees. We also present an algorithm for computing the sizes of shared leaf sets of all non-leaf rooted subtrees of trees of arbitrary degrees in time $O(n + |V||V'|)$ and space $O(|V||V'|)$.

Using Expansion

In this section we describe an alternative approach using expansion of the input trees. This way of finding shared leaf set sizes is quite slow in practice, but its results are necessary to use the algorithm presented in Sec. Expanding a tree of arbitrary degrees is done by expanding every internal node with degree higher than three to a number of binary nodes; Fig. 4.3 shows an expansion of a node of degree five. Thus, expansion of a tree is done by adding a number of

edges to it, and the edges in the expanded trees can be divided into the original edges and the introduced edges. Trees that are expanded to binary trees retain an important property of the original ones: any edge in the unexpanded tree will represent a subtree with the same set of leaves in the expanded tree. The topology of the subtrees might not be the same, due to introduced edges, but the sets of leaves are, and this is used when calculating the shared leaf set sizes.

Figure 4.3: A tree with a high-degree node (a), and an expanded tree for this tree (b), in which newly introduced edges are dashed. Note that the subtrees that are induced by the solid edges have the same leaf sets in both trees.

Expanding an arbitrary tree with $n$ leaves to a binary tree with $n$ leaves can be done in time $O(n)$. As shown above, the shared leaf set sizes between two binary trees with $n$ leaves can be computed in time $O(n^2)$, hence calculating the shared leaf set sizes for two expanded trees can also be done in time $O(n^2)$. By keeping track of the original edges in the expanded trees, it is possible to extract the shared leaf set sizes between the unexpanded trees from the shared leaf set sizes of the expanded trees.

Using Rooting

In this section we will assume that all sizes of leaf sets in the input trees have been computed, cf. Sec. 4.1, which takes time and space $O(n)$. Let $F$ and $G$ be a pair of subtrees from $T$ and $T'$. The following observations are also used in Sec. 16.7 and are easily verified using elementary properties of sets. They enable us to compute the shared leaf set sizes in a way similar to computing sizes of subtrees in Sec. 4.1:

$$\begin{aligned}
|\overline{F} \cap G| &= |G| - |F \cap G|, \\
|F \cap \overline{G}| &= |F| - |F \cap G|, \\
|\overline{F} \cap \overline{G}| &= n - \big(|G| + |F| - |F \cap G|\big).
\end{aligned} \tag{4.5}$$

Given a pair of undirected edges, there are four pairs of rooted subtrees associated with these.
If the shared leaf set size of any one of these four pairs has been computed, we can compute the remaining three in constant time by using (4.5). Fig. 4.1 shows the rooting of $T$ in some arbitrary node $r$, and a similar rooting of $T'$ in some node $r'$ can also be done. Computing the shared leaf set sizes between subtrees represented by the directed edges in the rooted trees (i.e. the ones pointing away from the roots) is done using (4.3), and since there is only

one of these directed edges pointing to each internal node and each leaf, it takes time

$$\sum_{v \in V} \sum_{v' \in V'} \min\{(d_v - 1), (d_{v'} - 1)\} + \sum_{v \in L} \sum_{v' \in L'} 1 = O(n^2). \tag{4.6}$$

This is consequently the time needed to compute all shared leaf set sizes between subtrees of $T$ and $T'$, since (4.5) can be used to compute the remaining shared leaf set sizes. This approach requires that the individual leaf set sizes in the trees have been computed, but since this only takes time $O(n)$, the asymptotic time complexity is not changed. The space used is $O(n^2)$, since all shared leaf set sizes must be stored.

This algorithm and the algorithm in the following section are also presented in Sec. 16.7, with slightly different terminology. The level of detail is the same, but in Sec. algorithms written in pseudocode are also presented (Alg. 3 and Alg. 4).

Non-leaf Subtrees

Two of the algorithms presented in the following chapter only need shared leaf set sizes of rooted subtrees that are not leaves. In this section we describe a way to compute only these. We will assume that the sizes of leaf sets of subtrees in the individual input trees have been computed in time and space $O(n)$. For each pair of internal nodes $v, v'$, let $\mathrm{Leaf}(v, v')$ be the number of matching leaves connected directly to both $v$ and $v'$. Fig. 4.4 gives an example of two trees and a table containing the values of $\mathrm{Leaf}$.

Assume, like in Sec. 4.2, that we are given two rooted non-leaf subtrees $F$ and $G$, one from each tree, rooted in $v$ and $v'$ respectively. Let the non-leaf children of $v$ be $F_1, \ldots, F_{id_v-1}$, and let the non-leaf children of $v'$ be $G_1, \ldots, G_{id_{v'}-1}$. Then

$$|F \cap G| = \mathrm{Leaf}(v, v') + \sum_{i=1}^{id_v-1} |F_i \cap G| + \sum_{j=1}^{id_{v'}-1} |F \cap G_j| - \sum_{i=1}^{id_v-1} \sum_{j=1}^{id_{v'}-1} |F_i \cap G_j|. \tag{4.7}$$

The first term counts all leaves in $F \cap G$ connected directly to both $v$ and $v'$. The second term counts all leaves in $F \cap G$ that are not connected directly to $v$. The third term counts all leaves in $F \cap G$ that are not connected directly to $v'$.
This way, all leaves in $F \cap G$ are counted at least once, but those connected directly to neither $v$ nor $v'$ have been counted twice. The fourth term subtracts precisely these leaves.

Since leaves are numbered, they can be found in constant time, and each pair of matching leaves, and the pair of internal nodes connected to these, can be found in constant time. Thus the $\mathrm{Leaf}$ table can be filled in time $O(n + |V||V'|)$ and space $O(|V||V'|)$. Using (4.7), the $\mathrm{Leaf}$ table and the approach described in Sec. makes it possible to find all intersections of non-leaf subtrees in time

$$\sum_{v \in T} \sum_{v' \in T'} id_v \, id_{v'} \in O(|V||V'|). \tag{4.8}$$
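Filling the Leaf table can be illustrated with a short Python sketch. It assumes that both trees are given as adjacency dictionaries whose leaves carry the same names in both trees, and it stores only the non-zero entries in a `Counter`, which is a simplification of the full table of size $|V||V'|$. The trees `T1` and `T2` and the name `leaf_table` are hypothetical and chosen for this example only.

```python
from collections import Counter

# Hypothetical example trees sharing the leaves a..e.
T1 = {
    "u": ["a", "b", "v"],
    "v": ["u", "c", "w"],
    "w": ["v", "d", "e"],
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["w"], "e": ["w"],
}
T2 = {
    "p": ["a", "b", "q"],
    "q": ["p", "c", "d", "e"],
    "a": ["p"], "b": ["p"], "c": ["q"], "d": ["q"], "e": ["q"],
}


def leaf_table(adj1, adj2):
    """Count, for each pair of internal nodes, the leaves attached to both."""
    table = Counter()
    for node, neighbours in adj1.items():
        if len(neighbours) == 1:          # node is a leaf
            v = neighbours[0]             # internal node it hangs on in tree 1
            v_prime = adj2[node][0]       # internal node it hangs on in tree 2
            table[v, v_prime] += 1
    return table


LEAF = leaf_table(T1, T2)
```

Each leaf is visited once and both of its internal neighbours are looked up directly, which mirrors the constant-time lookup per leaf described above.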

Figure 4.4: Two trees and their shared leaf table.

Leaf    a    b
a       0    1
b       2    0
c       0    3

In total the resulting algorithm uses time and space $O(n + |V||V'|)$, since computing the individual leaf set sizes for each tree uses time and space $O(n)$. In the worst case, both $|V|$ and $|V'|$ are $O(n)$, and thus both time and space consumption is $O(n^2)$, which is no better than the previous algorithm. However, trees exist where both $|V|$ and $|V'|$ are $O(1)$, so in the best case both time and space consumption is $O(n)$.
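The rooting approach of this chapter, i.e. applying (4.3) to the subtrees pointing away from the roots and then using (4.5) for the remaining pairs, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the thesis implementation: trees are adjacency dictionaries with identically named leaves, the roots are internal nodes, and all function and variable names are hypothetical.

```python
def root_tree(adj, root):
    """Root `adj` in `root`; return (children, order), children before parents."""
    children, preorder, stack = {}, [], [(root, None)]
    while stack:
        node, parent = stack.pop()
        preorder.append(node)
        children[node] = [w for w in adj[node] if w != parent]
        stack.extend((w, node) for w in children[node])
    return children, preorder[::-1]


def shared_down(adj1, root1, adj2, root2):
    """shared[v, w] = |F ∩ G| for the downward subtrees F below v and G below w."""
    ch1, order1 = root_tree(adj1, root1)
    ch2, order2 = root_tree(adj2, root2)
    shared = {}
    for v in order1:
        for w in order2:
            if not ch1[v] and not ch2[w]:          # two leaves: compare names
                shared[v, w] = int(v == w)
            elif ch1[v] and (not ch2[w] or len(ch1[v]) <= len(ch2[w])):
                # (4.3), summing over the node with fewer children
                shared[v, w] = sum(shared[x, w] for x in ch1[v])
            else:
                shared[v, w] = sum(shared[v, y] for y in ch2[w])
    return shared


def complements(n, size_f, size_g, fg):
    """The three remaining shared sizes of (4.5), given |F|, |G| and |F ∩ G|."""
    return {
        "notF_G": size_g - fg,
        "F_notG": size_f - fg,
        "notF_notG": n - (size_f + size_g - fg),
    }


# Hypothetical example trees sharing the leaves a..e.
T1 = {
    "u": ["a", "b", "v"], "v": ["u", "c", "w"], "w": ["v", "d", "e"],
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["w"], "e": ["w"],
}
T2 = {
    "p": ["a", "b", "q"], "q": ["p", "c", "d", "e"],
    "a": ["p"], "b": ["p"], "c": ["q"], "d": ["q"], "e": ["q"],
}
SHARED = shared_down(T1, "v", T2, "q")
```

Choosing the node with fewer children in each step is what gives the $\min\{(d_v - 1), (d_{v'} - 1)\}$ term in (4.6); the table itself still needs quadratic space.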

5 Algorithms for Computing the Quartet Distance

Given two input trees, the topologies of quartets in each of the two trees can be grouped into five categories, which we will use to compute the quartet distance in the following chapters. Originally we used only four categories (see Sec. ), but have since split the last of the four categories into two. Given two input trees $T$ and $T'$, the five categories are:

$Q_{SS}(T, T')$: The number of quartets that have a star topology in both $T$ and $T'$.

$Q_{SB}(T, T')$: The number of quartets that have a star topology in $T$, and a butterfly topology in $T'$.

$Q_{BS}(T, T')$: The number of quartets that have a butterfly topology in $T$, and a star topology in $T'$.

$Q_{B=B}(T, T')$: The number of quartets that have the same butterfly topology in $T$ and $T'$. These are also referred to as shared butterfly quartets.

$Q_{B \neq B}(T, T')$: The number of quartets that have a different butterfly topology in $T$ and $T'$. These are also referred to as nonshared butterfly quartets.

The total number of quartets in the trees is the sum of the terms above, i.e.

$$Q_{SS}(T, T') + Q_{SB}(T, T') + Q_{BS}(T, T') + Q_{B=B}(T, T') + Q_{B \neq B}(T, T') = \binom{n}{4}.$$

The quartet distance is the number of quartets that have different topology in the two input trees, thus it can be expressed as

$$\mathrm{qdist}(T, T') = Q_{SB}(T, T') + Q_{BS}(T, T') + Q_{B \neq B}(T, T').$$

It is also possible to compute the quartet distance by calculating $\mathrm{qsim}(T, T') = Q_{SS}(T, T') + Q_{B=B}(T, T')$ and subtracting the result from the total number of quartets $\binom{n}{4}$:

$$\mathrm{qdist}(T, T') = \binom{n}{4} - \mathrm{qsim}(T, T') = \binom{n}{4} - \big(Q_{SS}(T, T') + Q_{B=B}(T, T')\big).$$
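The five categories can be made concrete with a brute-force count over all quartets. The Python sketch below is an illustration only and is not one of the algorithms of this thesis. It assumes trees given as adjacency dictionaries with unit edge lengths and decides each quartet topology with the four-point condition on path lengths, a standard property of tree metrics that is swapped in here for simplicity. All names and the two example trees are hypothetical.

```python
from collections import deque
from itertools import combinations
from math import comb

T1 = {
    "u": ["a", "b", "v"], "v": ["u", "c", "w"], "w": ["v", "d", "e"],
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["w"], "e": ["w"],
}
T2 = {
    "p": ["a", "b", "q"], "q": ["p", "c", "d", "e"],
    "a": ["p"], "b": ["p"], "c": ["q"], "d": ["q"], "e": ["q"],
}
LEAVES = ["a", "b", "c", "d", "e"]


def leaf_distances(adj, leaves):
    """BFS from every leaf; dist[s][t] is the number of edges between s and t."""
    dist = {}
    for s in leaves:
        seen, queue = {s: 0}, deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen[w] = seen[u] + 1
                    queue.append(w)
        dist[s] = seen
    return dist


def topology(dist, a, b, c, d):
    """Return 'star', or the butterfly split as a frozenset of two pairs."""
    sums = {
        frozenset([frozenset([a, b]), frozenset([c, d])]): dist[a][b] + dist[c][d],
        frozenset([frozenset([a, c]), frozenset([b, d])]): dist[a][c] + dist[b][d],
        frozenset([frozenset([a, d]), frozenset([b, c])]): dist[a][d] + dist[b][c],
    }
    smallest = min(sums.values())
    if all(s == smallest for s in sums.values()):
        return "star"                      # all three pairings tie
    return min(sums, key=sums.get)         # the strictly smallest pairing


def categories(adj1, adj2, leaves):
    d1, d2 = leaf_distances(adj1, leaves), leaf_distances(adj2, leaves)
    q = {"SS": 0, "SB": 0, "BS": 0, "B=B": 0, "B!=B": 0}
    for quad in combinations(sorted(leaves), 4):
        t1, t2 = topology(d1, *quad), topology(d2, *quad)
        if t1 == "star" and t2 == "star":
            q["SS"] += 1
        elif t1 == "star":
            q["SB"] += 1
        elif t2 == "star":
            q["BS"] += 1
        elif t1 == t2:
            q["B=B"] += 1
        else:
            q["B!=B"] += 1
    return q


Q = categories(T1, T2, LEAVES)
QDIST = Q["SB"] + Q["BS"] + Q["B!=B"]
QSIM = Q["SS"] + Q["B=B"]
```

For these two example trees the direct and the indirect computation agree, i.e. `QDIST == comb(5, 4) - QSIM`.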

To give an overview of these and later results we tabulate them in a summary table (Tab. 5.1). Each row in the table represents a term and how to express that term using the five categories, for example:

$$\mathrm{qdist}(T, T') = 0 \cdot Q_{SS}(T, T') + 1 \cdot Q_{SB}(T, T') + 1 \cdot Q_{BS}(T, T') + 0 \cdot Q_{B=B}(T, T') + 1 \cdot Q_{B \neq B}(T, T').$$

                 Q_SS   Q_SB   Q_BS   Q_B=B   Q_B≠B
qdist(T, T')      0      1      1      0       1
binom(n, 4)       1      1      1      1       1
qsim(T, T')       1      0      0      1       0

Table 5.1: Summary table, showing $\mathrm{qdist}(T, T')$, $\mathrm{qsim}(T, T')$ and $\binom{n}{4}$ expressed in terms of $Q_{SS}(T, T')$, $Q_{SB}(T, T')$, $Q_{BS}(T, T')$, $Q_{B=B}(T, T')$ and $Q_{B \neq B}(T, T')$.

The table makes it obvious that e.g. $\mathrm{qdist}(T, T') = \binom{n}{4} - \mathrm{qsim}(T, T')$. So it is possible to compute the quartet distance either directly, by counting all quartets with different topology, or indirectly, by counting all quartets with the same topology.

5.1 Center Based Approaches

Given three leaves $a$, $b$ and $c$ in an input tree $T$, there is a unique internal node, $C$ in $T$, in which the paths from $a$ to $b$, $a$ to $c$ and $b$ to $c$ are joined, see Fig. 5.1. We will call this node the center of $a$, $b$ and $c$. Let the subtree of $C$ containing $a$ be denoted $T_a$, and similarly for leaves $b$ and $c$. Any remaining subtrees of $C$ are collectively denoted $T_{rest}$, see Fig. 5.2. For each leaf $x$ different from $a$, $b$ and $c$, a quartet is defined, and its topology in $T$ can easily be determined from the center: if $x \in T_a$ then the topology is $ax|bc$, if $x \in T_b$ the topology is $bx|ac$, and if $x \in T_c$ the topology is $ab|cx$. If $x \in T_{rest}$, then the quartet of $a$, $b$, $c$ and $x$ has the star topology. Using these observations, some simple algorithms can be created.

Figure 5.1: The center, $C$, of leaves $a$, $b$ and $c$.

Without Shared Leaf Set Sizes

Assume we are given a center $C$ of leaves $a$, $b$ and $c$ in an input tree. A single traversal of the tree can determine for each other leaf whether it is located in $T_a$, $T_b$, $T_c$ or $T_{rest}$. Such a traversal uses time $O(n)$, and the information about the location of all leaves can be stored in space $O(n)$.
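Finding a center and locating all leaves relative to it can be sketched as follows. This Python sketch is one possible way to do it under stated assumptions and not necessarily how the thesis implements it: trees are adjacency dictionaries, the center is found as the node on the path between the first two leaves that is closest to the third, and the names `center` and `locate_leaves` are hypothetical.

```python
from collections import deque

T1 = {
    "u": ["a", "b", "v"], "v": ["u", "c", "w"], "w": ["v", "d", "e"],
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["w"], "e": ["w"],
}
T2 = {
    "p": ["a", "b", "q"], "q": ["p", "c", "d", "e"],
    "a": ["p"], "b": ["p"], "c": ["q"], "d": ["q"], "e": ["q"],
}


def bfs(adj, source):
    """Return (dist, parent) dictionaries of a breadth-first search."""
    dist, parent, queue = {source: 0}, {source: None}, deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w], parent[w] = dist[u] + 1, u
                queue.append(w)
    return dist, parent


def center(adj, a, b, c):
    """The node where the paths between a, b and c meet."""
    _, parent = bfs(adj, a)
    path = [b]                              # walk from b back to a
    while path[-1] != a:
        path.append(parent[path[-1]])
    dist_c, _ = bfs(adj, c)
    return min(path, key=dist_c.get)        # path node closest to c


def locate_leaves(adj, a, b, c):
    """Map every leaf to a, b, c (its subtree of the center) or 'rest'."""
    mid = center(adj, a, b, c)
    where = {}
    for neighbour in adj[mid]:
        component, stack = [], [(neighbour, mid)]
        while stack:                        # collect the subtree behind neighbour
            u, p = stack.pop()
            component.append(u)
            stack.extend((w, u) for w in adj[u] if w != p)
        label = next((x for x in (a, b, c) if x in component), "rest")
        for u in component:
            if len(adj[u]) == 1:            # only leaves are labelled
                where[u] = label
    return mid, where


C1, WHERE1 = locate_leaves(T1, "a", "c", "d")
C2, WHERE2 = locate_leaves(T2, "a", "c", "d")
```

In the first example tree the leaf e ends up in the subtree of d, so the quartet of a, c, d and e is the butterfly ac|de, while in the second tree e lies in the remaining subtrees and the quartet is a star.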
Doing this for both input trees allows the topologies of all quartets in the trees containing $a$, $b$ and $c$ to be compared in time $O(n)$. Since there are $O(n^3)$ ways to choose $a$, $b$ and $c$, and

the center of these can be found in time $O(n)$, a simple $O(n^4)$ time algorithm can be created. The algorithm compares quartets directly, so it can either count $\mathrm{qdist}(T, T')$ directly or compute $\mathrm{qsim}(T, T')$ and then subtract this from $\binom{n}{4}$ to get the quartet distance. The algorithm's direct comparison of quartets makes it easily extendable to count the number of shared and/or different quartets in $k$ trees in time $O(kn^4)$ and space $O(kn)$. For a more efficient way of calculating the quartet distance for a large set of binary trees, see [25].

Figure 5.2: The three subtrees containing the leaves $a$, $b$ and $c$ are called $T_a$, $T_b$ and $T_c$, respectively. Any remaining trees (shown in lighter grey) are collectively called $T_{rest}$.

With Shared Leaf Set Sizes

In this section we will assume that the shared leaf set sizes between the input trees have been computed, cf. Sec. Let $C$ be the center of leaves $a$, $b$ and $c$ in an input tree $T$, and $C'$ be the center of leaves $a$, $b$ and $c$ in an input tree $T'$. For each leaf $x$ different from $a$ that is both in $T_a$ and $T'_a$, the trees have a shared butterfly quartet of the form $ax|bc$. The number of leaves with this property can be computed as $|T_a \cap T'_a| - 1$, and similarly for leaves in $T_b \cap T'_b$ and $T_c \cap T'_c$. For each leaf in both $T_{rest}$ and $T'_{rest}$ the trees have a shared star quartet, and the number of leaves with this property can be computed as $|T_{rest} \cap T'_{rest}|$. Note that since neither $a$, $b$ nor $c$ are in $T_{rest}$ or $T'_{rest}$, we do not subtract one. Combining these expressions enables us to compute the number of shared quartets containing $a$, $b$ and $c$ as

$$|T_a \cap T'_a| + |T_b \cap T'_b| + |T_c \cap T'_c| - 3 + |T_{rest} \cap T'_{rest}|.$$

The first three terms are shared leaf set sizes, and are therefore available in constant time. In Sec. we show that $|T_{rest} \cap T'_{rest}|$ can also be computed in constant time using leaf set sizes and shared leaf set sizes. This means that all shared quartets containing $a$, $b$ and $c$ can be counted in constant time, which

implies that $\mathrm{qdist}(T, T')$ or $\mathrm{qsim}(T, T')$ can be computed in time $O(n^3)$, if the centers of each triplet of leaves can be found in constant time. Finding the centers in constant time is done by finding a linear number of centers in linear time. For each pair of leaves, $a$ and $b$, there are $n - 2$ triples of leaves containing these. The centers of these triples can be found and stored in time and space $O(n)$, by a single traversal of the tree. Since there are $O(n^2)$ pairs of leaves, this can be done for all pairs in time $O(n^3)$ and space $O(n)$. Combining this with the time and space needed to compute the shared leaf sets results in an algorithm using time $O(n^3)$ and space $O(n^2)$.

5.2 Edge Claiming Approaches

One of our main contributions in Sec. is that the quartet distance can be found without considering star quartets in the input trees. The quartet distance can be computed by the expression

$$\mathrm{qdist}(T, T') = Q_{B=B}(T, T) + Q_{B=B}(T', T') - 2\,Q_{B=B}(T, T') - Q_{B \neq B}(T, T'). \tag{5.1}$$

We focus on how to compute the right-hand terms in this section. Note that $Q_{B=B}(T, T)$ is actually the number of butterfly quartets in $T$, and similarly for $Q_{B=B}(T', T')$ and $T'$. We add these along with $Q_{B=B}(T, T')$ and $Q_{B \neq B}(T, T')$ to the summary table.

                   Q_SS   Q_SB   Q_BS   Q_B=B   Q_B≠B
qdist(T, T')        0      1      1      0       1
binom(n, 4)         1      1      1      1       1
qsim(T, T')         1      0      0      1       0
Q_B=B(T, T)         0      0      1      1       1
Q_B=B(T', T')       0      1      0      1       1
Q_B=B(T, T')        0      0      0      1       0
Q_B≠B(T, T')        0      0      0      0       1

Table 5.2: Summary table, showing that $\mathrm{qdist}(T, T')$ can also be expressed in terms of $Q_{B=B}(T, T)$, $Q_{B=B}(T', T')$, $Q_{B=B}(T, T')$ and $Q_{B \neq B}(T, T')$.

Butterfly quartets can be defined as sets of four leaves $a$, $b$, $c$ and $d$ that are partitioned in two pairs, $\{a, b\}$ and $\{c, d\}$, by at least one edge. Every edge with this property is said to induce the quartet $ab|cd$. Since multiple edges can induce a butterfly quartet, the total number of butterfly quartets cannot be computed by counting the number of quartets all edges induce.
Instead we define which edges have the right to claim which quartets. In [26], Tsang does this by allowing edges to claim all induced butterfly quartets not already claimed by adjacent edges; we use a different approach.

Counting Quartets Using Edges

For an edge to induce a butterfly quartet, it must have at least two leaves in each of the two subtrees connected to it. Consequently edges connected to

leaves can never induce butterfly quartets, and we therefore focus only on internal edges. Several internal edges may induce the same butterfly quartet, e.g. in Fig. 5.3 the quartet $ab|cd$ is induced by both $e_1$ and $e_2$, while the quartet $ab|ce$ is only induced by $e_1$.

Figure 5.3: The quartet $ab|cd$ is induced by both $e_1$ and $e_2$, while the quartet $ab|ce$ is only induced by $e_1$.

To associate quartets with a fixed number of edges, it is convenient to look at directed edges. A butterfly quartet $ab|cd$ defines two directed butterfly quartets: $ab \to cd$ and $ab \leftarrow cd$. For the undirected internal edge $e$, inducing quartet $ab|cd$, the corresponding directed internal edges induce the directed quartets $ab \to cd$ and $ab \leftarrow cd$. To each directed quartet, $ab \to cd$, we can uniquely associate a directed internal edge, $e_1$, such that $a$ and $b$ are leaves in the tree $F$ behind $e_1$, and such that $c$ and $d$ are leaves in different subtrees $F_1$ and $F_2$ of the root of the tree in front of $e_1$, see Fig. 5.4. We call such a tree substructure a directed edge claim, written $F \xrightarrow{e_1} (F_1, F_2)$, and say that $e_1$ claims the directed quartet $ab \to cd$. We also say that $e_1$ claims an undirected quartet $ab|cd$ if it claims one of its directed quartets, so each butterfly quartet is claimed by exactly two directed internal edges.

Figure 5.4: The directed edge $e_1$ claims all directed quartets $ab \to cd$ where $a, b \in F$, $c \in F_i$ and $d \in F_j$. In this case we show only two trees in front of the edge; if there were more, all combinations of pairs of these would be used by $e_1$ to claim directed quartets.

Butterfly Quartets in a Single Tree

Given two input trees, $T$ and $T'$, it is necessary to know the number of butterfly quartets in each tree, $Q_{B=B}(T, T)$ and $Q_{B=B}(T', T')$ respectively, in order to use (5.1) to compute the quartet distance. Here we describe how to compute these values using directed edge claims.
We will assume that the sizes of the leaf sets in the individual input trees have been computed. A directed edge claim $F \xrightarrow{e} (F_i, F_j)$ represents a number of directed quartets claimed by $e$. This number can be computed in constant time using the expression

$$\binom{|F|}{2} \, |F_i| \, |F_j|.$$

For a tree $T$ of arbitrary degrees, a directed internal edge $e$ pointing to an internal node $v$ of degree $d_v$ is part of $\binom{d_v - 1}{2} \in O(d_v^2)$ directed edge claims. For each node $v$, there are $id_v$ internal directed edges pointing to $v$, and therefore the total number of directed edge claims in $T$ is

$$O\Big(\sum_{v \in V} id_v \, d_v^2\Big) = O(|V| d^2).$$

Since it is possible to compute the number of butterfly quartets represented by each directed edge claim in constant time, the total time needed to compute the number of butterfly quartets in a tree $T$, $Q_{B=B}(T, T)$, is $O(|V| d^2)$. The precomputation of the individual leaf set sizes takes time and space $O(n)$, so the total time consumption is $O(|V| d^2)$ and the space consumption is $O(n)$.

Expansion - Shared and Nonshared Butterfly Quartets and Butterfly Quartets in a Single Tree

Above we described how to compute $Q_{B=B}(T, T)$ and $Q_{B=B}(T', T')$, but $Q_{B=B}(T, T')$ and $Q_{B \neq B}(T, T')$ are also needed to compute the quartet distance using (5.1). The algorithm described in this section is the same algorithm as the one running in time $O(n^2 d^2)$ described in Sec. There we do not distinguish $O(n^2)$ from $O(|V||V'|)$ since $|V|$ and $|V'|$ each is $O(n)$, but for some trees the difference is significant, see Sec.

Counting $Q_{B=B}(T, T')$ and $Q_{B \neq B}(T, T')$ for two input trees $T$ and $T'$ can be done by comparing all pairs of directed edge claims. Using the shared leaf set sizes of the subtrees in a pair of directed edge claims, $F \xrightarrow{e} (F_i, F_j)$ and $G \xrightarrow{e'} (G_k, G_l)$, the number of both shared and nonshared quartets for that pair of claims can be computed in constant time. The shared quartets can be computed as

$$\binom{|F \cap G|}{2} \big( |F_i \cap G_k| \, |F_j \cap G_l| + |F_i \cap G_l| \, |F_j \cap G_k| \big),$$

as shown in Sec. , where it is also shown that a similar but longer expression can be used to compute the number of nonshared quartets.
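The two counting expressions for directed edge claims can be checked on small explicit leaf sets. The Python sketch below is an illustration only: the subtrees of a claim are represented as plain Python sets of leaf labels, which is an assumption of this example (the algorithms above work on precomputed sizes, not on sets), and all names and example sets are hypothetical. The closed-form count for a pair of claims is compared against a direct enumeration.

```python
from itertools import combinations
from math import comb


def claimed_in_single_tree(F, Fi, Fj):
    """Directed quartets represented by the claim F -> (Fi, Fj)."""
    return comb(len(F), 2) * len(Fi) * len(Fj)


def shared_by_formula(F, Fi, Fj, G, Gk, Gl):
    """Shared directed quartets of the claims F -> (Fi, Fj) and G -> (Gk, Gl)."""
    behind = comb(len(F & G), 2)
    in_front = len(Fi & Gk) * len(Fj & Gl) + len(Fi & Gl) * len(Fj & Gk)
    return behind * in_front


def shared_by_enumeration(F, Fi, Fj, G, Gk, Gl):
    """The same number, found by listing every candidate quartet."""
    count = 0
    for _a, _b in combinations(sorted(F & G), 2):
        for c in Fi:
            for d in Fj:
                if (c in Gk and d in Gl) or (c in Gl and d in Gk):
                    count += 1
    return count


# Hypothetical claims in two trees on the leaves 1..8.
F, Fi, Fj = {1, 2, 3}, {4, 5}, {6, 7}
G, Gk, Gl = {2, 3, 8}, {4, 6}, {1, 5, 7}
```

With these sets, the leaves 2 and 3 lie behind both edges, and only the pairs (4, 7) and (5, 6) are separated in front of both edges, so both functions count two shared directed quartets.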
In $T$ and $T'$ there are $O(|V| d^2)$ and $O(|V'| d'^2)$ directed edge claims respectively, thus $Q_{B=B}(T, T')$ and $Q_{B \neq B}(T, T')$ can be computed in time $O(|V||V'| d^2 d'^2)$. Since calculating the shared leaf set sizes uses time and space $O(n^2)$, the total time needed to compute (5.1) is

$$O(|V| d^2 + |V'| d'^2 + |V||V'| d^2 d'^2 + n^2) = O(|V||V'| d^2 d'^2),$$

and the total space needed is $O(n^2)$.

In trees of arbitrary degrees, there are $O(|V| d^2)$ directed edge claims, but in binary trees, there are only $O(|V|) = O(n)$ directed edge claims. Using this, the

time consumption can be reduced to $O(|V||V'| d d')$ by transforming the input trees into binary trees annotated with information about the original trees. The transformation is an expansion from arbitrary trees $T$ and $T'$ to binary trees $T_b$ and $T'_b$, done as described in Sec.

An expanded tree induces all the butterfly quartets in the original nonexpanded input tree, but it also induces additional butterfly quartets due to the newly added internal nodes and edges. Hence directed edge claims in the expanded tree can contain butterfly quartets not present in the original tree. Therefore, we define a new structure on the expanded trees, the extended directed edge claim. In short, the extended edge claims can be used to count the number of butterfly quartets in the expanded trees that were also butterfly quartets in the unexpanded tree. The details can be found in Sec. For every pair of extended directed edge claims, the number of shared and nonshared butterfly quartets can be computed in constant time.

After the expansion, an original directed internal edge that was pointing to an internal node $v$ of degree $d_v$ in the original tree, $T$, is part of $d_v$ extended directed edge claims, and similarly for an edge pointing to an internal node $v'$ of degree $d_{v'}$ in $T'$. Therefore the total time needed to compute the number of butterfly quartets in a single tree, $T$, is $O(|V| d)$, and the time needed to compute the number of shared and nonshared butterfly quartets, $Q_{B=B}(T, T')$ and $Q_{B \neq B}(T, T')$, is $O(|V||V'| d d')$. The sizes of the leaf sets in the individual trees, and the sizes of shared leaf sets of the trees, are needed to obtain this complexity. Consequently, the total time consumption for calculating the quartet distance using (5.1) is

$$O(n^2 + |V| d + |V'| d' + |V||V'| d d') = O(|V||V'| d d'),$$

since $n \le |V| d$ and $n \le |V'| d'$.
Furthermore, the algorithm uses $O(n^2)$ space to hold the shared leaf set sizes.

Without Expansion - Butterfly Quartets in a Single Tree

Calculating the number of butterfly quartets, $Q_{B=B}(T, T)$ and $Q_{B=B}(T', T')$, in each tree can also be done without expansion. For each directed internal edge we count all directed butterfly quartets that it induces and subtract all the ones it does not claim. We will assume that the sizes of leaf sets in the individual trees have been computed. The algorithm we present here is the foundation of the algorithm we present in Sec.

Let $e$ be an internal directed edge pointing to an internal node $v$, let the subtree in front of $e$ be $\overline{F}$, and the subtree behind $e$ be $F$. One of the $d_v$ subtrees of $v$ is $F$, and we let the rest be named $F_1, \ldots, F_{d_v-1}$. Note that the union of $F_1, \ldots, F_{d_v-1}$ is $\overline{F}$. The number of directed butterfly quartets induced by $e$ is

$$\binom{|F|}{2} \binom{|\overline{F}|}{2},$$

and

$$\binom{|F|}{2} \sum_{i=1}^{d_v-1} \binom{|F_i|}{2}$$

is the number of directed butterfly quartets induced, but not claimed, by $e$. Note that when calculating the sum, there is no need to consider $F_i$'s that are leaves, since at least two leaves are needed to produce a non-zero term. This observation will be used several times in this thesis, and when summing over the subtrees, $F_1, \ldots, F_{id_v}$, of a node, $v$, it is implicit that these are all the non-leaf subtrees of $v$. Consequently the number of directed butterfly quartets claimed by $e$ is

$$\binom{|F|}{2} \left( \binom{|\overline{F}|}{2} - \sum_{i=1}^{id_v-1} \binom{|F_i|}{2} \right), \tag{5.2}$$

which can be computed in time $O(id_v)$. Calculating (5.2) for each of the $id_v$ internal directed edges pointing to $v$ for all nodes $v$ takes time

$$O\Big(\sum_{v \in V} id_v^2\Big) = O(|V| \, id).$$

Computing the number of butterfly quartets in a tree can thus be done by first computing the sizes of leaf sets in the tree, and then (5.2) for all internal directed edges. This uses time $O(n + |V| \, id)$ and space $O(n)$.

Without Expansion - Shared and Nonshared Butterfly Quartets

Given a pair of directed internal edges $e$ and $e'$ pointing to internal nodes $v$ and $v'$ in trees $T$ and $T'$, we will use an approach similar to the one above to count the number of shared butterfly quartets claimed by both edges. We assume that all shared leaf set sizes of non-leaf subtrees have been computed.

Let the non-leaf subtrees of $v$ be $F_1, \ldots, F_{id_v}$ and the non-leaf subtrees of $v'$ be $G_1, \ldots, G_{id_{v'}}$. Furthermore let the tree in front of $e$ be $\overline{F}$, and the tree behind be $F$; similarly let the tree in front of $e'$ be $\overline{G}$ and the tree behind be $G$. This means that there exist some $x$ and $y$, where $1 \le x \le id_v$ and $1 \le y \le id_{v'}$, such that $F_x = F$ and $G_y = G$. The number of shared butterfly quartets induced by both edges is

$$\binom{|F \cap G|}{2} \binom{|\overline{F} \cap \overline{G}|}{2}. \tag{5.3}$$

To compute the number of shared butterfly quartets claimed by both edges, we need to subtract those that are induced but not claimed by $e$ and $e'$. The non-leaf subtrees in front of $e$ are $F_1, \ldots, F_{id_v}$, except $F_x$, and the non-leaf subtrees in front of $e'$ are $G_1, \ldots, G_{id_{v'}}$, except $G_y$. Therefore the number of these butterfly quartets is

$$\sum_{\substack{i=1 \\ i \neq x}}^{id_v} \binom{|F \cap G|}{2} \binom{|F_i \cap \overline{G}|}{2} + \sum_{\substack{j=1 \\ j \neq y}}^{id_{v'}} \binom{|F \cap G|}{2} \binom{|\overline{F} \cap G_j|}{2} - \sum_{\substack{i=1 \\ i \neq x}}^{id_v} \sum_{\substack{j=1 \\ j \neq y}}^{id_{v'}} \binom{|F \cap G|}{2} \binom{|F_i \cap G_j|}{2}. \tag{5.4}$$

The first sum counts the number of shared butterfly quartets induced by both $e$ and $e'$, but not claimed by $e$. Symmetrically, the second sum counts the number of shared butterfly quartets induced by both $e$ and $e'$, but not claimed by $e'$. Both of these sums count the number of shared butterfly quartets induced by both edges, but claimed by neither. The final double sum subtracts these quartets once, since they have been added twice.

(5.4) is computable in time $O(id_v \, id_{v'})$, and the total number of directed quartets claimed by both $e$ and $e'$ can be computed by subtracting (5.4) from (5.3) in time $O(id_v \, id_{v'})$. Summing these numbers for all pairs of directed internal edges in two trees $T$ and $T'$ yields $2 \, Q_{B=B}(T, T')$. For each pair of internal nodes $v$ and $v'$ there are $id_v$ internal directed edges pointing to $v$ and $id_{v'}$ internal directed edges pointing to $v'$. This means that there are $id_v \, id_{v'}$ pairs of edges for this pair of nodes, and consequently $Q_{B=B}(T, T')$ can be computed in time

$$O\Big(\sum_{v \in V} \sum_{v' \in V'} id_v^2 \, id_{v'}^2\Big) \subseteq O\Big(\sum_{v \in V} \sum_{v' \in V'} id_v \, id_{v'} \, id \, id'\Big) = O(|V||V'| \, id \, id').$$

By changing the expressions inside the sums in (5.3) and (5.4) in a similar way as it is done in Sec. , it is also possible to express $Q_{B \neq B}(T, T')$ without changing the running time. Combining this with calculating the shared leaf set sizes for non-leaf subtrees, the leaf set sizes of both input trees and the number of butterfly quartets in each tree, we get an algorithm for computing the quartet distance which uses time

$$O\big((n + |V||V'|) + n + |V| \, id + |V'| \, id' + |V||V'| \, id \, id'\big) = O(n + |V||V'| \, id \, id'),$$

and space $O(n + |V||V'|)$.

5.3 Node Claiming Approach

The algorithm described in Sec. works by processing each pair of directed internal edges in the input trees. Here we show how to improve the running time by processing these pairs of edges in a specific order.
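As a concrete check of (5.2), the number of butterfly quartets in a single tree can be computed by summing the claimed directed quartets over all directed edges and halving the result. The Python sketch below is an illustration under stated assumptions, not the thesis implementation: trees are adjacency dictionaries, the subtree sizes are recomputed by a traversal per edge instead of being precomputed in linear time, and the names `leaves_behind` and `butterfly_quartets` are hypothetical.

```python
from math import comb

T1 = {
    "u": ["a", "b", "v"], "v": ["u", "c", "w"], "w": ["v", "d", "e"],
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["w"], "e": ["w"],
}
T2 = {
    "p": ["a", "b", "q"], "q": ["p", "c", "d", "e"],
    "a": ["p"], "b": ["p"], "c": ["q"], "d": ["q"], "e": ["q"],
}


def leaves_behind(adj, start, blocked):
    """Number of leaves reachable from `start` without passing `blocked`."""
    count, stack = 0, [(start, blocked)]
    while stack:
        node, parent = stack.pop()
        if len(adj[node]) == 1:
            count += 1
        stack.extend((w, node) for w in adj[node] if w != parent)
    return count


def butterfly_quartets(adj):
    """Butterfly quartets of one tree via (5.2), summed over directed edges."""
    n = sum(1 for node in adj if len(adj[node]) == 1)
    directed_total = 0
    for v, neighbours in adj.items():
        if len(neighbours) == 1:
            continue                       # edges never point to a leaf here
        sizes = [leaves_behind(adj, u, v) for u in neighbours]
        for x, behind in enumerate(sizes):
            # Quartets induced but not claimed: both front leaves share a subtree.
            not_claimed = sum(comb(s, 2) for i, s in enumerate(sizes) if i != x)
            directed_total += comb(behind, 2) * (comb(n - behind, 2) - not_claimed)
    # Every butterfly quartet is claimed by exactly two directed edges.
    return directed_total // 2
```

Edges coming from a leaf contribute nothing because $\binom{1}{2} = 0$, so the loop does not need to single out internal edges. For the two example trees the function gives 5 and 3 butterfly quartets respectively.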

Counting Quartets Using Nodes

Instead of letting the edges claim quartets, we let the internal nodes claim them. An internal node claims all directed quartets claimed by any directed internal edge pointing to the node. More formally, if the internal node $v$ has the directed internal edges $e_1, \ldots, e_{id_v}$ pointing to it, $v$ claims all directed quartets claimed by $e_1, \ldots, e_{id_v}$ with the definitions from Sec. From these definitions, it follows that the two directed edges claiming an undirected quartet point to two different internal nodes. Consequently each undirected quartet is claimed by precisely two internal nodes (see also Sec. ).

Butterfly Quartets in a Single Tree

As we describe in Sec. 16.4, we can count the number of butterfly quartets in a tree $T$ by summing the number of undirected quartets claimed by all nodes in $T$. We assume that the sizes of all subtrees of $T$ have been computed. Given non-leaf subtrees $F_1, \ldots, F_{id_v}$ of an internal node $v$ in $T$, we let

$$S_v = \sum_{i=1}^{id_v} \binom{|F_i|}{2}.$$

Let $e$ be an internal edge pointing to $v$ from some $F_x$. Then $e$ represents the subtree $F_x$. From (5.2) we know that we can count the number of directed butterfly quartets claimed by $e$ using the expression

$$\binom{|F_x|}{2} \left( \binom{|\overline{F_x}|}{2} - \sum_{\substack{i=1 \\ i \neq x}}^{id_v} \binom{|F_i|}{2} \right) = \binom{|F_x|}{2} \left( \binom{|\overline{F_x}|}{2} - S_v + \binom{|F_x|}{2} \right).$$

$S_v$ can be computed in time $O(id_v)$, and using it, each of the $id_v$ internal edges pointing to $v$ can be processed in constant time. The number of butterfly quartets claimed by $v$ can thus be computed in time $O(id_v)$. The total number of butterfly quartets in the tree is found by summing over all internal nodes, and dividing by two. So $Q_{B=B}(T, T)$ and $Q_{B=B}(T', T')$ can be computed in time $O(|V| + n) = O(n)$, where the $O(n)$ term is due to the sizes of leaf sets being needed. The space consumption is also $O(n)$.

Shared and Nonshared Butterfly Quartets

Let $v$ be an internal node in $T$ and $v'$ be an internal node in $T'$. Let the non-leaf subtrees of $v$ be $F_1, \ldots, F_{id_v}$, and the non-leaf subtrees of $v'$ be $G_1, \ldots, G_{id_{v'}}$.
We focus on how to compute the number of shared butterfly quartets claimed by both $v$ and $v'$ and assume that all shared leaf set sizes of non-leaf subtrees have been computed.

First, the nodes are preprocessed, by computing a number of sums in time $O(id_v \, id_{v'})$. Using these sums it is possible to compute (5.4) in constant time, for any pair of directed internal edges $e$ and $e'$ pointing to $v$ and $v'$ respectively. This

enables the computation of the number of shared butterfly quartets claimed by $v$ and $v'$ to be done in time $O(id_v \, id_{v'})$.

For $i = 1, \ldots, id_v$ and $j = 1, \ldots, id_{v'}$, the following sums are computed in the preprocessing step:

$$\begin{aligned}
S_j &= \sum_{i=1}^{id_v} \binom{|F_i \cap G_j|}{2}, &
S_i &= \sum_{j=1}^{id_{v'}} \binom{|F_i \cap G_j|}{2}, \\
\overline{S}_j &= \sum_{i=1}^{id_v} \binom{|F_i \cap \overline{G_j}|}{2}, &
\overline{S}_i &= \sum_{j=1}^{id_{v'}} \binom{|\overline{F_i} \cap G_j|}{2}, \\
S &= \sum_{i=1}^{id_v} \sum_{j=1}^{id_{v'}} \binom{|F_i \cap G_j|}{2}.
\end{aligned}$$

There are $2 \, id_v$ sums, the $S_i$'s and the $\overline{S}_i$'s, that can each be computed in time $O(id_{v'})$, $2 \, id_{v'}$ sums, the $S_j$'s and the $\overline{S}_j$'s, that can each be computed in time $O(id_v)$, and one double sum, $S$, that can be computed in time $O(id_v \, id_{v'})$. Consequently, all of the sums can be computed in time $O(id_v \, id_{v'})$ and stored in space $O(id_v + id_{v'})$.

Let $e$ and $e'$ be a pair of directed internal edges pointing to $v$ and $v'$ with trees $F_x$ and $G_y$ behind them, respectively. To compute the number of shared butterfly quartets claimed by both edges we can use (5.3) and (5.4) from Sec. The first of these is computable in constant time, while the second is computable in time $O(id_v \, id_{v'})$. As also shown in Sec. , (5.4) can be computed in constant time by using the sums above, since reformulating the expression to the current setting yields

$$\begin{aligned}
&\binom{|F_x \cap G_y|}{2} \left( \sum_{\substack{i=1 \\ i \neq x}}^{id_v} \binom{|F_i \cap \overline{G_y}|}{2} + \sum_{\substack{j=1 \\ j \neq y}}^{id_{v'}} \binom{|\overline{F_x} \cap G_j|}{2} - \sum_{\substack{i=1 \\ i \neq x}}^{id_v} \sum_{\substack{j=1 \\ j \neq y}}^{id_{v'}} \binom{|F_i \cap G_j|}{2} \right) \\
&\quad = \binom{|F_x \cap G_y|}{2} \left( \overline{S}_y - \binom{|F_x \cap \overline{G_y}|}{2} + \overline{S}_x - \binom{|\overline{F_x} \cap G_y|}{2} - \left( S - S_x - S_y + \binom{|F_x \cap G_y|}{2} \right) \right).
\end{aligned}$$

Since there are $id_v \, id_{v'}$ pairs of internal edges pointing to $v$ and $v'$, it takes time $O(id_v \, id_{v'})$ to compute the number of shared directed butterfly quartets claimed

by the nodes. Summing these numbers for all pairs of internal nodes gives the total number of shared directed butterfly quartets, which is $2\, Q_{B=B}(T, T')$, in time

\[ O\Big( \sum_{v \in V} \sum_{v' \in V'} id_v\, id_{v'} \Big) = O(|V||V'|). \]

Since each pair of nodes can be processed independently, at most $O(id + id')$ space is needed.

When processing a pair of internal nodes, $O(id_v\, id_{v'} \min\{id_v, id_{v'}\})$ time is needed to compute the number of nonshared quartets claimed by the nodes, see Sec. Summing these numbers for all pairs of internal nodes gives $2\, Q_{B \neq B}(T, T')$ and takes time

\[ O\Big( \sum_{v \in V} \sum_{v' \in V'} id_v\, id_{v'} \min\{id_v, id_{v'}\} \Big) = O(|V||V'| \min\{id, id'\}). \]

To achieve this, it is necessary to precompute $O(\min\{id_v^2, id_{v'}^2\})$ sums when processing a pair of nodes. Again each pair of nodes can be processed independently and thus the space needed is $O(\min\{id^2, id'^2\})$.

The total time needed to compute the quartet distance is the time needed to compute shared leaf sets of non-leaf subtrees, $Q_{B=B}(T, T)$ and $Q_{B=B}(T', T')$, $Q_{B=B}(T, T')$ and $Q_{B \neq B}(T, T')$. Combined, this is time

\[ O(n + |V||V'| + (|V| + n) + (|V'| + n) + |V||V'| + |V||V'| \min\{id, id'\}) = O(n + |V||V'| \min\{id, id'\}) \]

and space

\[ O(n + |V||V'| + n + n + (id + id') + \min\{id^2, id'^2\}) = O(n + |V||V'|). \]

5.4 Summary

In this chapter we have presented our five different algorithms for computing the quartet distance. Two are based on centers of triples of leaves, two on edges that claim quartets and the last on letting nodes claim quartets. The algorithms have different time and space consumptions, which are summarized in Tab. 5.3. The algorithm with the best asymptotic running time is the node-claiming algorithm, which uses time $O(n + |V||V'| \min\{id, id'\})$ to compute the quartet distance. As far as we know, the fastest algorithm published before this uses time $O(n^4)$ to compute the distance.

To our knowledge, no lower bounds in time or space consumption for computing the quartet distance between trees of arbitrary degrees have been proved.
Proving such bounds remains an open problem, as does how to compute the

quartet distance even faster than the algorithms we presented above do. Possibly, the quartet distance can be computed faster, using other combinations of $Q_{SS}(T, T')$, $Q_{SB}(T, T')$, $Q_{BS}(T, T')$, $Q_{B=B}(T, T')$ and $Q_{B \neq B}(T, T')$ than we have used. Defining new categories, or dividing the existing ones into smaller parts, might also lead to faster algorithms.

Finding shared star quartets, which have only the center node and the leaves as common denominators, seems harder than finding shared butterfly quartets. Finding nonshared butterfly quartets also seems harder than finding shared butterfly quartets, but equivalent to finding shared star quartets, since our edge and node claiming algorithms use nonshared butterfly quartets instead of star quartets to compute the quartet distance. Whether finding shared star quartets and nonshared butterfly quartets are equivalent problems, and whether these problems are harder than the problem of finding shared butterfly quartets, remains to be investigated.

Algorithm                                     Time                               Space
Center based without shared leaf set sizes    $O(n^4)$                           $O(n)$
Center based using shared leaf set sizes      $O(n^3)$                           $O(n^2)$
Edge claiming using expansion                 $O(|V||V'|dd')$                    $O(n^2)$
Edge claiming without expansion               $O(n + |V||V'|\, id\, id')$        $O(n + |V||V'|)$
Node claiming                                 $O(n + |V||V'| \min\{id, id'\})$   $O(n + |V||V'|)$

Table 5.3: The time and space consumption for calculating the quartet distance using the five algorithms described in this chapter.
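The constant-time evaluation of (5.4) used by the node claiming approach can be illustrated with a small Python sketch. It is not the thesis implementation (which is in Java); the function names are hypothetical, the subtrees of a node are represented as plain Python sets of leaves, and leaf subtrees are included for simplicity, which is harmless because a set with fewer than two leaves contributes zero to every binomial. The fast function precomputes the sums for one pair of nodes and then handles each pair of subtrees in constant time; the naive function enumerates pairs of leaf pairs directly and serves only as a cross-check.

```python
from itertools import combinations
from math import comb


def shared_claims_fast(F, G):
    """Shared directed butterfly quartets claimed by one pair of nodes.

    F and G are lists of disjoint leaf sets (the subtrees around each node).
    Both lists are assumed to partition the same leaf set.
    """
    n = sum(len(f) for f in F)
    inter = [[len(f & g) for g in G] for f in F]            # |F_i ∩ G_j|
    c2 = [[comb(x, 2) for x in row] for row in inter]
    rows, cols = range(len(F)), range(len(G))
    S_i = [sum(c2[i]) for i in rows]                        # sum over j
    S_j = [sum(c2[i][j] for i in rows) for j in cols]       # sum over i
    S = sum(S_i)
    # Sums over intersections with a complement.
    Sbar_j = [sum(comb(len(F[i]) - inter[i][j], 2) for i in rows) for j in cols]
    Sbar_i = [sum(comb(len(G[j]) - inter[i][j], 2) for j in cols) for i in rows]

    total = 0
    for x in rows:
        for y in cols:
            outside = n - len(F[x]) - len(G[y]) + inter[x][y]
            # Inclusion-exclusion over pairs {c, d} outside F_x and G_y.
            pairs_cd = (comb(outside, 2)
                        - (Sbar_j[y] - comb(len(F[x]) - inter[x][y], 2))
                        - (Sbar_i[x] - comb(len(G[y]) - inter[x][y], 2))
                        + (S - S_i[x] - S_j[y] + c2[x][y]))
            total += c2[x][y] * pairs_cd
    return total


def shared_claims_naive(F, G):
    """Same count by brute force over all pairs of leaf pairs."""
    leaves = sorted(set().union(*F))
    pf = {a: i for i, f in enumerate(F) for a in f}
    pg = {a: j for j, g in enumerate(G) for a in g}
    total = 0
    for a, b in combinations(leaves, 2):
        if pf[a] != pf[b] or pg[a] != pg[b]:
            continue                      # a, b must lie behind both edges
        x, y = pf[a], pg[a]
        for c, d in combinations(leaves, 2):
            if x in (pf[c], pf[d]) or y in (pg[c], pg[d]):
                continue                  # c, d must lie outside F_x and G_y
            if pf[c] != pf[d] and pg[c] != pg[d]:
                total += 1
    return total
```

Calling both functions on the same pair of partitions should give the same number; summing the result over all pairs of internal nodes would then give twice the number of shared butterfly quartets, as described in the text.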

6 Experiments

The algorithms presented have all been implemented in Java, and the correctness of the implementations has been validated by experiments. We have also done experiments regarding the running times of the algorithms on different classes of trees.

To verify that the algorithms compute the correct result, we have implemented a simple $O(n^6)$ time algorithm for computing the quartet distance. After implementing the $O(n^4)$ time algorithm, the results of the two algorithms were compared on a large number of random trees with varying sizes. Having two fundamentally different algorithms giving the same results is a good indication of correctness. By continuing this process for the other implemented algorithms, we have a strong indication that all the algorithms compute the quartet distance correctly, since they do not all rely on the same basic ideas. The correctness of the shared leaf set size algorithms has been implicitly validated during the validation of the quartet distance algorithms, since neither the $O(n^6)$ nor the $O(n^4)$ time algorithm uses these sizes. In addition to validating the correctness of all of the algorithms, we have investigated the expected running times for different classes of trees and tested these in practice.

6.1 Expected Performance

In this section we investigate how the topology of the input trees is expected to affect the running times of the algorithms. We consider four different classes of trees.

A worst case tree is a tree that contains $O(n)$ internal nodes, and where at least one node has an internal degree that is $O(n)$. Trees of this type are the worst input for the fastest of our algorithms, hence the name. An example of such a tree can be seen in Fig. 3.2.

A d-ary tree is a tree where all internal nodes have degree $d$. If a tree contains $|V|$ nodes of degree precisely $d$, it contains precisely $d + (|V| - 1)(d - 2)$ leaves, which means that a d-ary tree with $n$ leaves contains $O(n/d)$ internal nodes, i.e. $|V| = O(n/d)$.
The internal degrees of nodes in d-ary trees depend on the topology of the tree, and can vary from 0 to $d$. If the tree is a star, i.e. a node connected directly to all leaves, the internal degree is zero for all nodes, and thus $id = 0$. If the tree is a chain of internal nodes, the internal degree of all internal nodes except the ends of the chain is two, and thus $id = 2$. If the tree is like the worst case tree, but with the outer internal nodes having degree $d$, $id = d$.

A random tree is a tree created using randomization in the construction, which we have chosen to do in the following way: We start with a tree containing two

leaves connected to each other, to which the remaining leaves are added one at a time. A leaf is added either by connecting it to an existing internal node, or by splitting an edge by adding a new internal node, and connecting the leaf to this node. Each edge and node has uniform probability of being chosen as each leaf is added.

An r8s based tree is a tree created by the program r8s (see [22]), and then modified, since all trees created are binary. The modification consists of contracting edges, to obtain trees of arbitrary degrees (contracting an edge $e$ connecting nodes $u$ and $v$ means removing $u$ and $e$ and attaching the rest of $u$'s edges to $v$). Each edge is contracted with a probability that is inversely proportional to its length, which is determined by r8s, i.e. a short edge has a higher probability of being contracted than a long edge.

The distribution of the number of internal nodes, and the internal degrees of these, in random and r8s based trees is beyond the scope of this thesis to analyze. Running a number of tests on such trees can however give an impression of how the algorithms perform on real life data.

Center Based Algorithms

The two center based algorithms traverse the entire trees several times when calculating the quartet distance: once for every triplet of leaves in the $O(n^4)$ time algorithm and once for every pair of leaves in the $O(n^3)$ time algorithm. Traversing the entire trees means visiting all internal nodes, leaves and edges in the trees, and since the number of internal nodes and internal edges can vary for trees with the same number of leaves, this is expected to have an impact on the performance. In a binary tree, $|V| = n - 2$ and $|IE| = n - 3$, whereas in a star tree (a tree where all leaves are connected to a single internal node) $|V| = 1$ and $|IE| = 0$.
We expect better performance from the algorithms when used on trees with few internal nodes and edges than when used on trees with many internal nodes and edges, but since each traversal visits all $n$ leaves, the algorithms will not run asymptotically faster than $O(n^4)$ and $O(n^3)$ on any trees.

Edge Claiming Algorithms

The running times of the edge claiming algorithms are $O(n^2 + |V||V'|dd')$ and $O(n + |V||V'|\, id\, id')$, so these are not expressed solely in terms of the number of leaves $n$. Even though the number of internal nodes $|V|$ in a tree is $O(n)$, it is important not to substitute $|V|$ with $n$ in the analysis. Worst case trees contain $O(n)$ internal nodes, at least one of which has an internal degree that is $O(n)$, and thus the worst case running time of both edge claiming algorithms is expected to be $O(n^4)$. However, d-ary trees are another matter: such trees have $O(n/d)$ internal nodes and therefore the running time for the first of the algorithms will be $O(n^2)$ on such trees. For d-ary trees with a fixed number of internal nodes the performance of the second algorithm may vary from $O(n + n^2/d^2)$ to $O(n^2)$, depending on the topology of the trees. This is caused by $id$ being partially independent of $d$ as mentioned above.
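The random tree construction described earlier in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the generator used for the experiments: the function name `random_tree`, the adjacency-dictionary representation, and the numbering of leaves from 0 to n - 1 (with internal nodes numbered from n upwards) are all choices made for the example.

```python
import random


def random_tree(n, seed=None):
    """Return an adjacency dict for a random tree with leaves 0..n-1 (n >= 2)."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}          # two leaves joined by a single edge
    internal = []                   # internal nodes created so far
    edges = [(0, 1)]                # every edge, stored once
    next_id = n                     # internal nodes are numbered n, n+1, ...
    for leaf in range(2, n):
        # Each internal node and each edge is equally likely to be picked.
        pick = rng.randrange(len(internal) + len(edges))
        if pick < len(internal):
            node = internal[pick]   # attach the leaf to an existing node
        else:
            # Split the chosen edge with a new internal node.
            u, w = edges.pop(pick - len(internal))
            node, next_id = next_id, next_id + 1
            adj[u].discard(w)
            adj[w].discard(u)
            adj[node] = {u, w}
            adj[u].add(node)
            adj[w].add(node)
            edges += [(node, u), (node, w)]
            internal.append(node)
        adj[leaf] = {node}
        adj[node].add(leaf)
        edges.append((node, leaf))
    return adj
```

Because a new internal node always starts with degree three and degrees never decrease, trees built this way contain no nodes of degree two, which matches the assumptions the algorithms place on their input.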

Node Claiming Algorithm

The node claiming algorithm is similar in many ways to the fastest of the edge claiming algorithms, but should be asymptotically faster. The class of input trees where they are expected to differ the most is worst case trees. Where the edge claiming algorithm is expected to use time $O(n^4)$, the node claiming algorithm is expected to use only $O(n^3)$.

In the previous section we described how comparison of two d-ary trees causes the time consumption of the edge claiming algorithms to be reduced to $O(n^2)$ or faster. The node claiming algorithm does not need both of the input trees to be d-ary to achieve this. More precisely, assume two input trees are given, and let $d_{min}$ be the minimal degree of all internal nodes in both trees. Then there can be at most $O(n/d_{min})$ nodes in each tree, and therefore if $\min\{id, id'\}$ is $O(d_{min}^2)$ the running time of the node claiming algorithm is expected to be $O(n^2)$.

Shared Leaf Set Size Algorithms

The naive algorithm for calculating the shared leaf set sizes has a theoretical running time of $O(n^2 \min\{d, d'\})$, and thus it is expected to run in time $O(n^3)$ on worst case trees and time $O(n^2)$ on d-ary trees.

The algorithm using expansion is expected to run in time $O(n^2)$ for all classes of input trees. However the overhead of expanding the trees may in some cases make it slower than the naive algorithm, especially on small trees that are close to being binary. This is because all trees that are not binary must be expanded by the algorithm, which then uses the naive approach on these expanded trees. Thus we can expect approximately the same running times for all non-binary input trees with the same number of leaves.

The rooting approach is also expected to run in time $O(n^2)$ for all classes of input trees. The running time of $O(n^2)$ is caused by the need to compare all pairs of leaves in the input trees (see (4.6)), but also all pairs of non-leaf subtrees must be compared.
The number of these is dependent on the internal structure of the trees, i.e. the number of internal nodes. The algorithm will run faster on trees with a small internal structure than on trees that have a large internal structure. For example we can expect the running time to decrease as $d$ increases on d-ary input trees with the same number of leaves.

The algorithm for calculating only the shared leaf set sizes of non-leaf subtrees has a theoretical running time of $O(n + |V||V'|)$. The expected running time is thus $O(n^2)$ for worst case trees and $O(n + n^2/d^2)$ on d-ary trees, so the running time should decrease for increasing $d$.

6.2 Performance in Practice

To test the time consumption of each of the implemented algorithms, we have done a series of tests, using the four different types of trees, worst case, d-ary, random and r8s based, described above. We used six different types of

d-ary trees: binary, 6-ary, 15-ary, 30-ary, 60-ary and 90-ary. Each algorithm was run on all types of trees with 100, 120, ..., 500 leaves, and additionally the node claiming algorithm and every algorithm for calculating the shared leaf set sizes were run on trees with up to 1500 leaves. For each type and size of trees, the algorithms were run on seven different pairs of trees. The maximal and minimal running times were discarded, and the average of the remaining running times was used to plot the graphs in this section.

The data is plotted in a double logarithmic coordinate system. This makes it easier to determine the exponent of the polynomial functions represented by the plotted running times, since the slope is the exponent. We have also plotted three known polynomials of the form $c \cdot n^x$ in each graph, to help approximate the slopes. We say that the best fit polynomial for a set of running times is the polynomial with the slope that is the best approximation of the slope of the plotted running times.

Center Based Algorithms

Fig. 6.1 shows the running times of the $O(n^4)$ time algorithm for the classes of trees mentioned above. In all cases the best fit polynomials are the ones of the form $c_x \cdot n^4$. From this we conclude that our expectations about the algorithm running in time $O(n^4)$ for all trees are correct. Furthermore the second graph in the figure shows that the algorithm is faster for trees that have a smaller internal structure, which was also expected.

The results for the $O(n^3)$ time algorithm, shown in Fig. 6.2, are similar to the results discussed above. This time the $c_x \cdot n^3$ polynomials are the best fit in all cases. The internal structure of the input trees affects the running times in the same way as for the $O(n^4)$ time algorithm, though less pronounced.
All of this is as expected.

Edge Claiming Algorithms

The running times of the edge claiming algorithm that uses expansion are plotted in the graphs in Fig. 6.3. On worst case trees, the best approximation is the $c_x \cdot n^4$ polynomial, supporting the expectation on the running times for this class of trees. On d-ary trees the running times are $O(n^2)$, as also expected. The graph for d-ary trees also shows that the running time increases for increasing $d$. This is not caused by the need to expand more nodes in trees with a high degree, since the second graph in Fig. 6.7 shows no difference in running times for calculating shared leaf set sizes using expansion. We have not been able to find the cause, but it has been confined to the implementation of the algorithm for computing the shared and nonshared quartets using extended edge claims. The last graph in the figure shows that on random and r8s based trees, the algorithm also runs in time $O(n^2)$. This indicates that even though the worst case running time is $O(n^4)$, this algorithm is a better choice than the $O(n^3)$ time algorithm in practice.

The expected running times of $O(n^4)$ for worst case trees and $O(n^2)$ on d-ary trees, for the $O(n + |V||V'|\, id\, id')$ time algorithm, are supported by the graphs in

Figure 6.1: The running time of the $O(n^4)$ algorithm for worst case trees, d-ary trees (d = 3, 6, 15, 30, 60, 90), random and r8s based trees. Each of the three panels plots time in milliseconds against the number of leaves. The lines plot polynomials $c_i \cdot n^{y_i}$, where $c_i$ and $y_i$ are constants.

Figure 6.2: The running time of the $O(n^3)$ algorithm for worst case trees, d-ary trees (d = 3, 6, 15, 30, 60, 90), random and r8s based trees. Each of the three panels plots time in milliseconds against the number of leaves. The lines plot polynomials $c_i \cdot n^{y_i}$, where $c_i$ and $y_i$ are constants.

Figure 6.3: The running time of the $O(n^2 + |V||V'|dd')$ algorithm for worst case trees, d-ary trees (d = 3, 6, 15, 30, 60, 90), random and r8s based trees. Each of the three panels plots time in milliseconds against the number of leaves. The lines plot polynomials $c_i \cdot n^{y_i}$, where $c_i$ and $y_i$ are constants.

Figure 6.4: The running time of the $O(n + |V||V'|\, id\, id')$ algorithm for worst case trees, d-ary trees (d = 3, 6, 15, 30, 60, 90), random and r8s based trees. Each of the three panels plots time in milliseconds against the number of leaves. The lines plot polynomials $c_i \cdot n^{y_i}$, where $c_i$ and $y_i$ are constants.

Figure 6.5: The running time of the $O(n + |V||V'| \min\{id, id'\})$ algorithm for worst case trees, d-ary trees (d = 3, 6, 15, 30, 60, 90), random and r8s based trees. Each of the three panels plots time in milliseconds against the number of leaves. The lines plot polynomials $c_i \cdot n^{y_i}$, where $c_i$ and $y_i$ are constants.

Fig. 6.4. Notice that this algorithm is between 10 and 100 times faster than the other edge claiming algorithm. Contrary to the other edge claiming algorithm, this algorithm also runs faster for higher degrees on d-ary trees, which was also expected.

Node Claiming Algorithm

The first two graphs in Fig. 6.5 show that the running times of the node claiming algorithm are $O(n^3)$ and $O(n^2)$ as expected. Decreasing running times for increasing degree in d-ary trees are also as expected. Note that the second edge claiming algorithm is faster on d-ary, random and r8s based trees. This is probably caused by the overhead of precomputing the sums used by the node claiming algorithm, especially for nodes with small internal degree. On the other hand, the performance of the node claiming algorithm on the worst case trees is significantly better than that of all of the other algorithms.

Shared Leaf Set Size Algorithms

As can be seen in Fig. 6.6, the naive algorithm performs as expected, using $O(n^2)$ time on d-ary trees, and $O(n^3)$ time on worst case trees. These running times demonstrate the importance of the choice of algorithm for computing the shared leaf set sizes. On d-ary trees, this naive algorithm for computing the shared leaf set sizes is actually ten times slower than the node claiming algorithm for computing the quartet distance, and this includes the calculation of the shared leaf set sizes needed by that algorithm.

The performance of the expansion based algorithm can be seen in Fig. 6.7. It is not sensitive to the topology of the input trees, and runs in time $O(n^2)$ on all types of trees. Given different input trees of a given size, the running time is approximately the same. It is a bit slower than the naive algorithm on small trees with a small degree, e.g. binary and 6-ary trees, but when the sizes of the trees increase, the expansion algorithm is the faster of the two. All of these results are as expected.
The rooting approach is expected to use $O(n^2)$ time on all classes of trees, and less time on trees with smaller internal structure. The results of the tests can be seen in Fig. 6.8 and support the expectations.

The graphs in Fig. 6.9 show the performance of the algorithm that only computes shared leaf set sizes for non-leaf subtrees. It runs in time $O(n^2)$ on worst case and d-ary trees, with decreasing running times for increasing $d$, as expected.

6.3 Summary

When comparing the running times for all of the algorithms on random and r8s based trees, a number of facts can be observed. Algorithms that are expected to be faster on trees with a small internal structure, e.g. the center based algorithms, are faster on r8s based than on random trees. Algorithms that

Figure 6.6: The running time of the naive algorithm for computing shared leaf set sizes for worst case trees, d-ary trees (d = 3, 6, 15, 30, 60, 90), random and r8s based trees. Each of the three panels plots time in milliseconds against the number of leaves. The lines plot polynomials $c_i \cdot n^{y_i}$, where $c_i$ and $y_i$ are constants.

Figure 6.7: The running time of the expansion based algorithm for computing shared leaf set sizes for worst case trees, d-ary trees (d = 3, 6, 15, 30, 60, 90), random and r8s based trees. Each of the three panels plots time in milliseconds against the number of leaves. The lines plot polynomials $c_i \cdot n^{y_i}$, where $c_i$ and $y_i$ are constants.

Figure 6.8: The running time of the rooting based algorithm for computing shared leaf set sizes for worst case trees, d-ary trees (d = 3, 6, 15, 30, 60, 90), random and r8s based trees. Each of the three panels plots time in milliseconds against the number of leaves. The lines plot polynomials $c_i \cdot n^{y_i}$, where $c_i$ and $y_i$ are constants.

Figure 6.9: The running time of the algorithm for computing shared leaf set sizes of only non-leaf subtrees for worst case trees, d-ary trees (d = 3, 6, 15, 30, 60, 90), random and r8s based trees. Each of the three panels plots time in milliseconds against the number of leaves. The lines plot polynomials $c_i \cdot n^{y_i}$, where $c_i$ and $y_i$ are constants.

depend on degrees of nodes, e.g. the naive algorithm for computing shared leaf set sizes, perform better on random trees than on r8s based trees. Finally, algorithms that depend on the internal degree of the nodes, e.g. the node claiming algorithm, are faster on r8s based trees than on random trees. These differences are probably caused by r8s based trees having a smaller internal structure, consisting of nodes with a higher degree, than random trees.

The results in Sec. 6.2 show that the second edge claiming approach and the node claiming approach are the fastest algorithms for computing the quartet distance in general, i.e. for each class of trees, one of these algorithms is the fastest. In Fig. 6.10 the running times of the edge claiming and node claiming algorithms can be directly compared: On random trees, the second edge claiming algorithm is at most twice as fast as the node claiming algorithm, while the node claiming algorithm is slightly faster on r8s based trees. These results are based on examination of the data behind the graphs, since the logarithmic scale makes the differences difficult to see in the graphs themselves. On worst case trees, the node claiming approach is both asymptotically faster and also faster in practice. For trees with 100 leaves it is approximately 40 times faster, and on trees with 500 leaves this factor has increased to 250. For completeness we have also added a comparison of the algorithms on binary and 90-ary trees. In these cases the second edge claiming algorithm is only slightly faster than the node claiming algorithm. We have also added data for the first edge claiming algorithm, and as can easily be verified by looking at Fig. 6.10, it is the slowest of the three algorithms in all cases.

Since the edge claiming algorithm is only faster than the node claiming algorithm by a factor of at most two, the latter algorithm is a better choice when working with trees of unknown topology.
It would also be possible to implement a variant of the node claiming algorithm that compares the internal degree of each pair of internal nodes before deciding whether to compute the sums. This would effectively be a hybrid of the two fastest algorithms.

In contrast to the algorithms for computing the quartet distance, there is no reason to create a hybrid of the algorithms for computing the shared leaf set sizes. If all the sizes are needed, the rooting approach is clearly faster than the first two approaches. If only the non-leaf subtree sizes are needed, the last approach is even faster than the rooting approach.

In this chapter, we have verified that the practical running times of the implemented algorithms agree with both the theoretical and expected running times. This suggests that both the implementations and the theoretical time analysis of the algorithms are correct.
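The measurement procedure from Sec. 6.2 can be illustrated with a short Python sketch. The first function averages a set of running times after discarding the largest and the smallest, as done for the seven runs per tree size. The second estimates the exponent of a polynomial running time as the slope in a double logarithmic coordinate system. Note that the thesis reads the slope off by comparing with plotted reference polynomials; the least-squares fit used here is a substitute chosen for the example, and both function names are hypothetical.

```python
import math


def trimmed_mean(times):
    """Average the running times after dropping the maximum and the minimum."""
    if len(times) < 3:
        raise ValueError("need at least three measurements")
    kept = sorted(times)[1:-1]
    return sum(kept) / len(kept)


def loglog_slope(sizes, times):
    """Least-squares slope of log(time) against log(size).

    For times of the form c * n**x the slope equals the exponent x.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

Applied to synthetic times that grow exactly as a cubic in the number of leaves, `loglog_slope` returns a value very close to 3; on measured data the estimate is only as good as the assumption that a single polynomial term dominates over the measured range.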

Figure 6.10: The running times of the edge claiming and node claiming algorithms (node claiming, edge claiming, and edge claiming using expansion) on worst case trees, d-ary trees (d = 3 and d = 90), and random topology and r8s based trees. Each of the three panels plots time in milliseconds against the number of leaves. To avoid clutter only the running times on two types of d-ary trees are shown.

Part II Related Subjects

7 Quartet Distance and Related Measures

The quartet distance is a number that quantifies the distance between a pair of trees $T$ and $T'$. We have already mentioned one other measure that is closely related to the quartet distance, namely the quartet similarity, which quantifies the similarity of the trees. The sum of these two measures is the total number of quartets in each input tree, i.e.

\[ qdist(T, T') + qsim(T, T') = \binom{n}{4}, \]

where $n$ is the number of leaves in the input trees. Remember that we are working under the assumption that the input trees both have the same $n$ leaves.

7.1 Normalized Measures

When comparing the quartet distance or quartet similarity of a pair of trees, the absolute values $qdist(T, T')$ and $qsim(T, T')$ are not always good measures for comparison. A pair of closely related large trees can easily have a larger quartet distance than two smaller trees that are only remotely related. Therefore we define the normalized quartet distance and the normalized quartet similarity. Since there are $\binom{n}{4}$ quartets in any tree with $n$ leaves, the normalized quartet distance between two trees $T$ and $T'$ is

\[ \frac{qdist(T, T')}{\binom{n}{4}}, \]

and the normalized quartet similarity is

\[ \frac{qsim(T, T')}{\binom{n}{4}}. \]

These measures do not suffer from the problems described above, since they provide a percentage of the maximal distance or similarity. Note also that these two normalized measures sum to one.

7.2 Quartet Fit Similarity

In [19] Piaggio-Talice et al. define the Quartet Fit Similarity between a tree of arbitrary degrees $T_S$ and a binary tree $T_M$ as

\[ f(T_S, T_M) = 1 - \frac{q_{b \neq b} + q_{sb}}{q_{b \neq b} + q_{b=b} + q_{sb}}, \tag{7.1} \]

where $q_{b=b} = Q_{B=B}(T_S, T_M)$, $q_{b \neq b} = Q_{B \neq B}(T_S, T_M)$ and $q_{sb} = Q_{SB}(T_S, T_M)$. Note that neither $q_{ss} = Q_{SS}(T_S, T_M)$ nor $q_{bs} = Q_{BS}(T_S, T_M)$ is used in this definition, since $T_M$ is binary. To allow $T_M$ to be non-binary, we must consider the cases where quartets in $T_M$ have star topology. We extend the definition of the quartet fit similarity with these:

\[ f(T_S, T_M) = 1 - \frac{q_{b \neq b} + q_{sb} + q_{bs}}{q_{b \neq b} + q_{b=b} + q_{sb} + q_{ss} + q_{bs}}. \tag{7.2} \]

In the setting where $T_M$ is binary and $T_S$ is arbitrary, $q_{ss} = q_{bs} = 0$ and therefore this general quartet fit similarity is compatible with the one from [19]. The denominator in (7.2) is the total number of quartets in the input trees, i.e.

\[ q_{b \neq b} + q_{b=b} + q_{sb} + q_{ss} + q_{bs} = \binom{n}{4}, \]

and the number of quartets that do not have the same topology in both trees is $q_{b \neq b} + q_{sb} + q_{bs}$, and thus the normalized quartet distance is

\[ \frac{q_{b \neq b} + q_{sb} + q_{bs}}{q_{b \neq b} + q_{b=b} + q_{sb} + q_{ss} + q_{bs}}. \]

It follows directly that (7.2) is equal to the normalized quartet similarity.
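As a small worked illustration of (7.2), the following Python sketch computes the generalized quartet fit similarity from the five quartet counts and checks that it coincides with the normalized quartet similarity. The function names and the example counts are hypothetical, and exact fractions are used so that the comparison does not depend on floating point rounding.

```python
from fractions import Fraction
from math import comb


def quartet_fit_similarity(q_b_neq_b, q_b_eq_b, q_sb, q_ss=0, q_bs=0):
    """Generalized quartet fit similarity, following (7.2).

    With q_ss = q_bs = 0 (binary second tree) this reduces to (7.1).
    """
    total = q_b_neq_b + q_b_eq_b + q_sb + q_ss + q_bs
    differing = q_b_neq_b + q_sb + q_bs
    return 1 - Fraction(differing, total)


def normalized_quartet_similarity(qsim, n):
    """Quartet similarity divided by the number of quartets on n leaves."""
    return Fraction(qsim, comb(n, 4))
```

For instance, with n = 6 leaves there are 15 quartets; if the counts were 3 nonshared butterflies, 7 shared butterflies, 2 star/butterfly, 1 shared star and 2 butterfly/star quartets, then 7 quartets differ and 8 agree, and both functions give 8/15.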

8 Input Trees that do not Fit the Assumptions

Input trees that contain nodes with degree two, do not contain precisely the same leaves, or do not have their leaves numbered 1 to $n$, will cause different problems in the analysis of running times, and in the execution of the algorithms. Instead of handling the problems for each of the algorithms in isolation, we describe a general approach that works for any algorithm that computes the quartet distance.

8.1 Leaves with Arbitrary Labels

Since leaf labels can be arbitrary, we need to create a mapping between numbers and these names, such that the assumption about leaf numbering can be fulfilled. A mapping is needed because the same labels must be mapped to the same number in both input trees. Assuming that the trees $T$ and $T'$ have $|L_T|$ and $|L_{T'}|$ leaves respectively, every leaf label will be mapped to a number from the interval $[1; |L_T| + |L_{T'}|]$. This can be done for a pair of trees using a hash map in expected constant time per leaf label, i.e. expected $O(|L_T| + |L_{T'}|)$ time.

8.2 Nodes of Degree Two

If we allow the input trees to have nodes of degree two, the running time of the center based algorithms cannot be guaranteed. As mentioned in Sec. these algorithms traverse the trees, and with nodes of degree two, the internal structure of the trees can be arbitrarily large without changing the topology of quartets or the number of leaves. The running time of the other algorithms will also be affected, since each node of degree two adds to the size of $V$ or $V'$.

The solution is to remove all nodes of degree two before applying the algorithms to the trees. Removing a node $v$ of degree two with edges $e_1$ and $e_2$ attached is done in constant time by removing $v$ and $e_1$ and attaching the other end point of $e_1$ to $e_2$. By doing a single traversal of the internal nodes of the tree, all nodes of degree two can be identified and removed.
If V is the number of internal nodes with degree larger than two and V 2 is the number of internal nodes with degree equal to two, the total time consumption needed to remove the nodes of degree two is O( V + V 2 ). Note that V 2 can exceed the number of leaves, so it is not necessarily O(n).
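The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes an unrooted tree stored as a dictionary from node to a set of neighbours, and the function names `number_leaves` and `suppress_degree_two` are hypothetical.

```python
def number_leaves(labels_t, labels_t2):
    """Map every leaf label of both trees to a number 1, 2, ...

    A label that occurs in both trees gets the same number in both.
    Dictionary lookups give expected constant time per label.
    """
    numbering = {}
    for label in list(labels_t) + list(labels_t2):
        if label not in numbering:
            numbering[label] = len(numbering) + 1
    return numbering


def suppress_degree_two(adj):
    """Remove all nodes of degree two from the tree `adj` in place.

    `adj` maps each node to the set of its neighbours.  A node of
    degree two is deleted and its two neighbours are joined directly,
    which takes expected constant time per removed node.
    """
    for v in list(adj):
        if len(adj[v]) == 2:
            a, b = adj.pop(v)          # the two neighbours of v
            adj[a].discard(v)
            adj[b].discard(v)
            adj[a].add(b)              # join the neighbours directly
            adj[b].add(a)
    return adj
```

Joining the two neighbours never changes their degrees, so a single pass over the nodes suffices even when several nodes of degree two form a chain.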

Trees with Different Leaf Sets

Another potential problem is running the algorithms on trees that do not have the same leaf set. The algorithms are all based on the assumption that the leaf sets of the trees are the same, and even though they might be adapted to work on trees with different leaf sets, it is a lot easier to eliminate the differing leaves from the trees before running the algorithms. Note that after the elimination of excess leaves the remaining leaves must be relabeled as described in Sec

Assume that the input trees $T$ and $T'$ have the sets of leaves $L_T$ and $L_{T'}$, respectively, and that $L_T \neq L_{T'}$. Let $L = L_T \cap L_{T'}$, let $\hat{T}$ be the tree with the same topology as $T$ but restricted to the leaf set $L$, and let $\hat{T}'$ be the tree with the same topology as $T'$ but restricted to the leaf set $L$. Any quartet present in $\hat{T}$ is also present in $T$, and has the same topology. The same applies for $\hat{T}'$ and $T'$. Since the leaves that are only present in one of the input trees add to the difference of the trees, it is natural that quartets containing at least one of these leaves add to the quartet distance. Thus we define the quartet distance between the trees $T$ and $T'$ as:

\[
\mathrm{qdist}(T, T') = \binom{|L_T|}{4} + \binom{|L_{T'}|}{4} - 2\,\mathrm{qsim}(\hat{T}, \hat{T}') - \mathrm{qdist}(\hat{T}, \hat{T}'). \tag{8.1}
\]

The first two terms count all quartets in the trees. Quartets only present in one tree are counted once, while quartets present in both trees are counted twice. Some of the quartets present in both trees have the same topology, some have different topologies. The third term subtracts all quartets with the same topology twice, since they do not add to the quartet distance and have been counted twice. The last term subtracts all quartets with different topologies once, since they do add to the distance, but have been counted twice. Note that if the trees contain the same leaves, this definition is fully compatible with the original definition of the quartet distance. The next section describes how to create $\hat{T}$ from $T$ in time $O(|L_T| + |L_{T'}|)$, so this is added to the time needed for computing the quartet distance.

Pruning Trees

We will assume that the leaves in a tree are numbered from the interval $[1; |L_T| + |L_{T'}|]$ and that there are no internal nodes of degree two, since these problems can be eliminated as described above. Given two trees, a traversal of the sets of leaves can identify the leaves that are only in one of the trees in time $O(|L_T| + |L_{T'}|)$. Removing a single leaf attached to an internal node of degree larger than three can be done without further action and takes constant time. Removing a leaf from an internal node of degree three creates an internal node of degree two, which can be removed in constant time as described in Sec. 8.2. Therefore removing $k$ leaves takes time $O(k)$, and since $k \leq |L_T| + |L_{T'}|$, the total time needed to prune the trees is $O(|L_T| + |L_{T'}|)$.
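The pruning step can be sketched as below. The sketch assumes the same representation as before (a dictionary from node to a set of neighbours), that the tree has no nodes of degree two on entry, and that the two trees share at least one leaf. The function name `prune_to_common_leaves` is hypothetical.

```python
def prune_to_common_leaves(adj, leaves, keep):
    """Remove every leaf in `leaves` that is not in `keep`, in place.

    Each removal takes expected constant time: the leaf is detached,
    and if its neighbour is left with degree two, that neighbour is
    removed as well by joining its two remaining neighbours.
    """
    for leaf in leaves - keep:
        (parent,) = adj.pop(leaf)           # a leaf has exactly one neighbour
        adj[parent].discard(leaf)
        if len(adj[parent]) == 2:           # degree three became degree two
            x, y = adj.pop(parent)
            adj[x].discard(parent)
            adj[y].discard(parent)
            adj[x].add(y)
            adj[y].add(x)
    return adj
```

Calling the function on both input trees with `keep` set to the intersection of the two leaf sets yields the restricted trees used in (8.1).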

Other Measures

We have already extended the definition of the quartet distance to include trees with different leaf sets. Since the differing leaves do not add to the similarity of the trees, we define the quartet similarity of two trees $T$ and $T'$ as

\[
\mathrm{qsim}(T, T') = \mathrm{qsim}(\hat{T}, \hat{T}'),
\]

where $\hat{T}$ and $\hat{T}'$ are defined as above. Since we have more than $\binom{n}{4}$ quartets in total, where $n$ is the size of the leaf set that $T$ and $T'$ have in common, the normalized versions of the measures have to be extended too. We redefine the normalized quartet distance as

\[
\frac{\mathrm{qdist}(T, T')}{\mathrm{qdist}(T, T') + \mathrm{qsim}(T, T')},
\]

and the normalized quartet similarity as

\[
\frac{\mathrm{qsim}(T, T')}{\mathrm{qdist}(T, T') + \mathrm{qsim}(T, T')}.
\]

Note that for trees with equal leaf sets, these definitions are fully compatible with the original ones.
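The extended measures can be evaluated directly once the quartet distance and similarity of the restricted trees are known. The sketch below assumes those two numbers are supplied by some quartet distance algorithm; the function name `extended_measures` and the example values are illustrative assumptions, and the function expects at least one quartet in total.

```python
from fractions import Fraction
from math import comb


def extended_measures(n_t, n_t2, qsim_common, qdist_common):
    """Apply (8.1) and the extended normalized measures.

    n_t, n_t2    : number of leaves in T and T'
    qsim_common  : quartet similarity of the two restricted trees
    qdist_common : quartet distance of the two restricted trees
    Returns (qdist, qsim, normalized qdist, normalized qsim).
    """
    qdist = comb(n_t, 4) + comb(n_t2, 4) - 2 * qsim_common - qdist_common
    qsim = qsim_common
    total = qdist + qsim
    return qdist, qsim, Fraction(qdist, total), Fraction(qsim, total)


# Hypothetical case: 6 and 5 leaves, 4 of them shared, and the single
# common quartet has the same topology in both trees.
print(extended_measures(6, 5, 1, 0))
```

In this example the 14 quartets only present in the first tree and the 4 quartets only present in the second tree account for the distance of 18. When both trees have the same leaves, the first return value reduces to `qdist_common`.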

9 Reducing Space Consumption

When creating algorithms and data structures to solve a problem, it is important that they use as little time and space as possible. The focus of this thesis has so far been on optimizing the running times of the algorithms. The first algorithm uses $O(n)$ space, the second and third use $O(n^2)$, and the last two use $O(|V||V'|)$. The improvement from $O(n^2)$ to $O(|V||V'|)$ was actually just a byproduct of optimizing the running time. In some applications of an algorithm it is the space, and not the time, that is the limiting factor. In this chapter we investigate the possibilities of reducing the space consumption of the $O(|V||V'| \min\{id, id'\})$ time algorithm. Note, however, that this space optimization has not been implemented.

The space consumption of the $O(|V||V'| \min\{id, id'\})$ time algorithm is caused by the need to store $O(|V||V'|)$ shared leaf set sizes. Since the algorithm only counts the shared or nonshared quartets of two nodes, $v$ and $v'$, at a time, only $id_v \, id_{v'}$ shared leaf set sizes are needed when processing $v$ and $v'$. On the other hand, these sizes might be computed from other sizes, or they might be needed themselves to compute sizes dependent on them, so it is not obvious how to avoid using $O(|V||V'|)$ space.

9.1 Coloring and Rooting

A possible way to reduce the space consumption is to use coloring in one of the input trees, $T$, and rooting of the other tree, $T'$. Our proposal reduces the space consumption to $O(|V'| \, id)$, but increases the running time to $O(|V| n + |V||V'| \min\{id, id'\})$.

Given a node $v$ in a tree $T$, a coloring of $T$ according to $v$ means associating the leaves of each non-leaf subtree of $v$ with the colors $1, \ldots, id_v$, one for each subtree. All subtrees that are leaves (i.e. leaves directly connected to $v$) are colored with the color 0. Such a coloring requires a single traversal of $T$ and thus takes time $O(n)$.

$T'$ must be rooted in an arbitrary internal node as described in Sec. 4.1; this gives rise to the rooted tree $T'_r$. We say that the direction from the root to the leaves is down. Each internal node is annotated with a vector of size $id_v + 1$, i.e. an entry for each of the colors related to the subtrees of $v$, including the color 0. All entries of the vectors are initialized to zero. Fig. 9.1 shows a small example of what is done with the rooted tree given a coloring of the other input tree: Each internal node in the rooted tree that is directly connected to leaves has its vector incremented at the entries corresponding to the colors of those leaves, one increment per leaf. Then the internal nodes are updated by a depth first traversal of $T'_r$. After the traversal, each node's vector contains the sum of the vectors in the nodes directly below it. For a fixed node $v$ in $T$, this process takes time

\[
O\Big(n + \sum_{v' \in V'} id_v \, id_{v'}\Big) = O(n + |V'| \, id_v). \tag{9.1}
\]

Figure 9.1: A rooted tree annotated with vectors for containing the shared leaf set sizes of its own subtrees and subtrees from another tree. The other tree (not shown) has a node that has three subtrees colored with the colors 1 to 3 and a single leaf attached, colored with the color 0. (a) An example of an initialized rooted tree. (b) The same tree after the depth first traversal.

Each node in $T'_r$ is the root of a unique subtree of $T'_r$. The vector of a node $v'$ in $T'_r$ contains at entry $i$ the shared leaf set size of the subtree of $v$ with color $i$ and the subtree of $T'_r$ rooted in $v'$. Using these values and (4.5) from Sec the shared leaf set sizes of all subtrees of $v$ and all rooted subtrees of $T'$ can be computed (by using $|F \cap \overline{G}| = |F| - |F \cap G|$). After these have been computed, the shared leaf set sizes of all subtrees represented by edges pointing to $v$ and all rooted subtrees of $T'$ can be computed (by using $|\overline{F} \cap G| = |G| - |F \cap G|$). Each of the values takes constant time to compute, and therefore the time consumption is still $O(n + |V'| \, id_v)$.

Since we have a vector of size $id_v + 1$ in each internal node in $T'_r$, the space consumption is

\[
\sum_{v' \in V'} (id_v + 1) = O(|V'| \, id_v)
\]

for each node $v$ in $V$. Since the nodes in $V$ can be processed one at a time, there is at most need for $O(|V'| \, id)$ space to compute the shared leaf set sizes. Using these, the number of shared and nonshared butterfly quartets can be computed for $v$ and every node $v'$ in $T'$, in space $O(\min\{id_v^2, id_{v'}^2\})$, as shown in Sec Thus, there is need for at most $O(\min\{id^2, id'^2\})$ additional space to compare pairs of nodes. Since $O(\min\{id^2, id'^2\}) = O(|V'| \, id)$, the total space consumption of the algorithm is $O(|V'| \, id)$.

The space consumption of the algorithm in Sec. 5.3 is $O(|V||V'|)$, and since $id < |V|$, there is an asymptotic improvement in the space consumption. The time consumption, however, is not unaffected. For each node $v \in V$ we have to do the $O(n)$ time coloring and process $T'_r$ in time $O(n + |V'| \, id_v)$ as shown in (9.1). This takes the total time of

\[
O\Big(\sum_{v \in V} (n + n + |V'| \, id_v)\Big) = O(|V| n + |V||V'|).
\]

As shown in Sec. 5.3, calculating the number of shared and nonshared butterfly quartets can be done in time $O(|V||V'| \min\{id, id'\})$, provided that the shared leaf set sizes are available. Above we have shown how to make these available, but not all at the same time. This is not a problem, however, since the algorithm compares pairs of nodes, and the shared leaf set sizes available correspond to a number of pairs of nodes. The total time consumption for calculating the quartet distance using the space optimization is

\[
O(|V| n + |V||V'| + |V||V'| \min\{id, id'\}) = O(|V| n + |V||V'| \min\{id, id'\}),
\]

and the space consumption is $O(|V'| \, id)$. Note that the roles of the trees can be switched, such that the time consumption is $O(|V'| n + |V||V'| \min\{id, id'\})$ and the space consumption is $O(|V| \, id')$.

If $id$ is close to $|V|$, that is if $T$ is very star-like, the algorithm will not use much less space than the original algorithm. On the other hand, if the algorithms are run on two binary trees, the time and space consumption of the original algorithm is $O(n^2)$. For the space optimized algorithm presented here, the time consumption is also $O(n^2)$, but the space consumption is only $O(n)$. Whether the space consumption can be reduced further from $O(|V'| \, id)$, and whether it can be reduced without changing the time consumption, is left as open problems.
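The coloring and rooting steps of Sec. 9.1 can be sketched as follows. Since the thesis states that the space optimization has not been implemented, this is only an illustration under assumptions: trees are dictionaries from node to a list of neighbours, the chosen root is an internal node, and the names `color_leaves` and `shared_leaf_set_sizes` are hypothetical. The sketch covers only the computation of the vectors, not the counting of butterfly quartets.

```python
def color_leaves(adj, v):
    """Color the leaves of T according to the internal node v.

    Leaves attached directly to v get color 0; the leaves of each
    non-leaf subtree of v get one of the colors 1..id_v.
    Returns the coloring and id_v.
    """
    colors, next_color = {}, 1
    for u in adj[v]:
        if len(adj[u]) == 1:                 # leaf directly attached to v
            colors[u] = 0
            continue
        stack, seen = [u], {v, u}
        while stack:                         # traverse the subtree behind u
            x = stack.pop()
            if len(adj[x]) == 1:
                colors[x] = next_color
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        next_color += 1
    return colors, next_color - 1


def shared_leaf_set_sizes(adj2, root, colors, id_v):
    """Annotate each internal node of the rooted tree with a vector.

    Entry i of the vector of a node v' is the number of leaves below
    v' that have color i, i.e. a shared leaf set size.
    """
    parent, order, stack = {root: None}, [], [root]
    while stack:                             # preorder traversal from the root
        x = stack.pop()
        order.append(x)
        for y in adj2[x]:
            if y != parent[x]:
                parent[y] = x
                stack.append(y)
    vec = {x: [0] * (id_v + 1) for x in adj2 if len(adj2[x]) > 1}
    for x in reversed(order):                # children before parents
        if len(adj2[x]) == 1:                # leaf: one increment in its parent
            vec[parent[x]][colors[x]] += 1
        elif parent[x] is not None:          # internal node: add vector upwards
            up = vec[parent[x]]
            for i, count in enumerate(vec[x]):
                up[i] += count
    return vec
```

Only one coloring and one set of vectors is alive at a time, which is where the reduction from storing all shared leaf set sizes comes from.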

10 Visualization

The quartet distance and quartet similarity, along with their normalized versions, are ways of quantifying the distance between, or similarity of, two trees. These numbers give no clues about which quartets have different topologies and which do not. To mend this, one could make algorithms that output quartets with different or same topologies. Another possibility is to annotate the input trees with information about the quartets.

A quartet consists of four leaves, connected by a number of edges and nodes in a tree. To a quartet we associate all edges and nodes in a tree that connect the four leaves of the quartet. In other words, all edges and nodes on the path between any pair of the four leaves in the quartet are associated to the quartet. Annotating trees with information about edges and nodes associated to shared quartets can be used when visualizing the similarities of two input trees. Below we present two approaches to doing this, which might inspire other people to investigate the problem in more detail.

Visualization Using Inducing Edges and Center Nodes

In every input tree, each star quartet has a unique center node as described in Sec. 5.1, and all butterfly quartets are claimed by exactly two oriented edges as described in Sec An obvious way of annotating the input trees would be to annotate each claiming edge with the number of quartets it claims that have the same topology in the other input tree, and similarly for the center nodes. This approach is simple and can easily be implemented by combining a variant of the $O(n^3)$ time center based algorithm with a variant of any of the edge claiming algorithms. However, it only gives little information about the nodes and edges associated to shared quartets: At least five edges and two internal nodes are associated to butterfly quartets, but only two edges and zero nodes are annotated. Star quartets have at least four edges and one node associated, but only the center node is annotated.

An approach similar to the one above also annotates the edges that induce, but do not necessarily claim, shared butterfly quartets. This approach is not better when considering star quartets, but for butterfly quartets, all edges that separate the two pairs of leaves are annotated. The implementation is still simple, and can be done by combining a variant of the $O(n^3)$ time center based algorithm for annotating the center nodes with a variant of one of the edge claiming algorithms for annotating the edges. The variant of the edge claiming algorithm must be altered to annotate inducing edges instead of only claiming edges, but since every pair of edges is processed, this is a trivial change.
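An edge induces a butterfly quartet when it separates the two pairs of leaves. Under that reading, an edge with $a$ leaves on one side and $n - a$ on the other induces $\binom{a}{2}\binom{n-a}{2}$ butterfly quartets within its own tree. The sketch below computes this count for every edge; it is an illustration under assumptions (dictionary of neighbour lists, internal root, hypothetical function name), not the annotation code of the tool.

```python
from math import comb


def induced_butterfly_counts(adj, root):
    """Map each edge (parent, child) to the number of butterfly
    quartets of the tree that the edge induces."""
    n = sum(1 for x in adj if len(adj[x]) == 1)      # number of leaves
    parent, order, stack = {root: None}, [], [root]
    while stack:                                     # preorder traversal
        x = stack.pop()
        order.append(x)
        for y in adj[x]:
            if y != parent[x]:
                parent[y] = x
                stack.append(y)
    below = {x: 0 for x in adj}                      # leaves below each node
    for x in reversed(order):                        # children before parents
        if len(adj[x]) == 1:
            below[x] = 1
        if parent[x] is not None:
            below[parent[x]] += below[x]
    return {
        (parent[x], x): comb(below[x], 2) * comb(n - below[x], 2)
        for x in order
        if parent[x] is not None
    }
```

Edges connected to leaves always get the count 0, since one side of such an edge holds a single leaf and cannot supply a pair.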

Figure 10.1: Two different phylogenies constructed by altering the phylogeny of the group Oryzomyini found on [18]. The two trees are in total agreement on the phylogeny of Nectomys, Amphinectomys, Megalomys, Melanomys, Holochilus, Lundomys and Pseudoryzomys, which is shown with thick black edges and black nodes, along with an annotation of 100%. Differences result in thinner edges and a lower percentage annotation on both nodes and edges.

We have implemented an algorithm that does this annotation while calculating the quartet distance, and an example of annotated trees can be seen in Fig. Each node in a tree has a total number of star quartets in which it is the center node. Likewise, each edge in a tree has a total number of butterfly quartets it induces. Computing these numbers enables the annotation to reflect the percentage of the total number of quartets that are shared quartets for each center node and claiming edge. The annotation in Fig shows this percentage in three ways: as numbers, as thickness of edges or size of nodes, and as the hue of the edges and nodes. Bright colors, thin edges and small nodes mean a low percentage, while dark colors, thick edges and large nodes mean a high percentage. Nodes of degree less than four and edges connected to leaves do not follow these rules, since they cannot be the center of or induce any quartets.

The reason for using percentages is that some edges induce a higher total number of quartets than others. Edges that are in the middle of the tree, i.e. edges that split the leaves in two parts of approximately the same size, induce more quartets than edges that are at the rim of the tree. By using percentages, the visualization is not biased towards annotating rim edges as being less important than other edges. For example, if the number of shared quartets a rim edge induces is small compared to other edges, but still close to the total number of quartets it induces, it will be annotated with a dark color and be thick. The same argument applies for the center nodes.

The total number of quartets an edge induces can be calculated in a way similar to the way the number of butterfly quartets in a single tree can be computed. To calculate the total number of star quartets a node is the center of, a variant of the $O(n^3)$ time center based algorithm can be used.
This variant uses leaf set sizes of one input tree instead of the shared leaf set sizes.

This visualization approach lacks the ability to annotate edges that are associated to star quartets and edges that are associated to, but not inducing, shared butterfly quartets, for example edges connected to leaves. Furthermore, only center nodes of star quartets are annotated. In the following section we look into a way of overcoming these problems.

Visualization Using all Edges

When annotating the trees with information about the shared quartets, we believe that the annotation should give an overview of which parts of the trees are similar and which are not. The major drawbacks of the visualization approaches presented above are the missing annotation of edges associated to, but not inducing, shared butterfly quartets, and the missing annotation of both edges and nodes (except the center node) associated to star quartets.

As described, each quartet is associated to a number of edges and a number of nodes. When annotating a tree, one can choose to annotate either edges, nodes or both. Each node that is associated to a shared quartet may have several edges attached that are not associated to the same shared quartet. In contrast, both nodes connected to an edge that is associated to a shared quartet are also associated to that shared quartet. Therefore our next visualization approach annotates edges for both star quartets and butterfly quartets, but it does not annotate any nodes.

Conceptually the final visualization approach is simple: Annotate each edge with the number of shared quartets it is associated to. Since this includes edges that are not necessarily inducing quartets, all edges are covered, and nothing is missed since we consider all quartets regardless of their topology. As we see it, there is no easy way to implement this approach by altering any of the fast algorithms, the main reason being that none of the algorithms that use less time than $O(n^4)$ traverse all four edges connected to the leaves of a quartet. For simplicity we have implemented an algorithm for this visualization approach using time $O(n^5)$ and space $O(n)$. The implementation of the algorithm is very similar to the implementation of our $O(n^4)$ time algorithm for computing the quartet distance, but uses more time in order to do the annotation correctly. Whether this form of annotation can be done faster remains an open problem.

Fig shows an example of this new visualization. Here the thickness of the edges represents a percentage as described in the previous section. Computing the total number of quartets that an edge is associated to is done by using the $O(n^5)$ time algorithm on two instances of the same tree. It remains an open problem how to do this in a faster way, when one has to include all edges that are associated to any type of quartet. To enable accurate readings, the percentage is also given as a fraction. Contrary to the last visualization approach, the hue of an edge describes the absolute number of quartets the edge is associated to. This gives an extra visual dimension to the annotation, since it is possible to get an overview of both the percentage and the absolute number of shared quartets associated to the edges without looking at actual numbers in the annotation.

Note that some of the edges that had 100% thickness in Fig do not have 100% thickness in Fig. This is caused by the edges associated to star quartets and the edges associated to butterfly quartets that do not induce the butterfly quartets.
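The annotation described in this section can be illustrated with a brute-force sketch that visits every set of four leaves, compares its topology in the two trees, and credits all edges on the paths between the four leaves. It is not the implementation from the tool: the representation (dictionary of neighbour lists), the use of path lengths to decide the topology, and all function names are assumptions made for the example. Each quartet costs time linear in the tree size, which gives $O(n^5)$ in total for trees without nodes of degree two.

```python
from collections import Counter
from itertools import combinations


def root_tree(adj, root):
    """Return parent pointers and depths for the tree rooted at `root`."""
    parent, depth, stack = {root: None}, {root: 0}, [root]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y != parent[x]:
                parent[y], depth[y] = x, depth[x] + 1
                stack.append(y)
    return parent, depth


def path_edges(parent, depth, a, b):
    """Set of edges on the path between a and b."""
    edges = set()
    while a != b:
        if depth[a] < depth[b]:
            a, b = b, a                    # always climb from the deeper node
        edges.add(frozenset((a, parent[a])))
        a = parent[a]
    return edges


def quartet_info(parent, depth, quartet):
    """Topology code and associated edges of one quartet.

    Code 0 means star; codes 1 to 3 name the pairing of the sorted
    leaves (ab|cd, ac|bd, ad|bc) with the smallest path length sum.
    """
    a, b, c, d = quartet
    p = lambda x, y: path_edges(parent, depth, x, y)
    pairings = [(p(a, b), p(c, d)), (p(a, c), p(b, d)), (p(a, d), p(b, c))]
    sums = [len(x) + len(y) for x, y in pairings]
    topology = 0 if len(set(sums)) == 1 else 1 + sums.index(min(sums))
    edges = set().union(*(x | y for x, y in pairings))
    return topology, edges


def annotate_edges(adj1, adj2, leaves):
    """Count, for every edge of both trees, the shared quartets it is
    associated to."""
    t1 = root_tree(adj1, next(iter(adj1)))
    t2 = root_tree(adj2, next(iter(adj2)))
    shared1, shared2 = Counter(), Counter()
    for quartet in combinations(sorted(leaves), 4):
        topo1, edges1 = quartet_info(*t1, quartet)
        topo2, edges2 = quartet_info(*t2, quartet)
        if topo1 == topo2:
            shared1.update(edges1)
            shared2.update(edges2)
    return shared1, shared2
```

Calling `annotate_edges(tree, tree, leaves)` compares a tree with itself and therefore yields the total number of quartets each edge is associated to, mirroring how the totals are obtained in the text.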

Figure 10.2: Two different phylogenies constructed by altering the phylogeny of the group Oryzomyini found on [18].

11 Tool

The node claiming algorithm presented in Part 2 has been embedded in a tool for calculating the quartet distance and the related measures. In Chap. 15 we describe this tool along with the SplitDist tool created by Thomas Mailund. Our tool takes as input trees in Newick format and outputs the quartet distance between the trees. If there are more than two input trees, the distance between each pair of trees is computed and output. The tool can also print the normalized quartet distance, the quartet similarity and the normalized quartet similarity. Furthermore, if the trees do not have the same leaf sets, the number of quartets present in one tree but not in the other, and vice versa, can be printed. Fig shows an example of the output of the tool.

Figure 11.1: Running the tool on the two trees shown in Fig. The output consists of a quartet distance matrix, a normalized quartet distance matrix and a quartet similarity matrix for Ory1.tree and Ory2.tree.

To make the tool easier to use, we have created a graphical user interface for it. It provides a menu where the features of the tool can be enabled and disabled by a few mouse clicks, and it can be opened without using command line parameters. When running the tool via the graphical user interface, it prints the command necessary to execute the tool with the given options from a command line, and can thus also serve as a means to learn how to use the tool directly from the command line. Both the tool and the graphical user interface are available for download at chrisc/qdist


Figure 11.2: The graphical user interface for the tool.

11.1 Visualization

After writing Chap. 15 we augmented the tool with the feature of visualizing the quartet distance as described in Chapter 10. The visualization can be output to files in either the Graphviz dot format, [11], or in the udraw(graph) format, [28]. The udraw(graph) tool also supports reading an input tree and then updating it little by little. Our tool can use this feature by letting udraw(graph) draw the input trees, and then updating the color and thickness of the edges in the trees while the algorithm is running. The number of shared quartets and the total number of quartets that each edge induces are shown as a tool tip when hovering the mouse over each edge. Fig 11.3 shows an example of this online visualization, but without any tool tip showing. Since udraw(graph) does not support drawing of unrooted trees, an arbitrary node has been chosen to be the root in the input trees for the algorithm.

As described, the visualization feature is implemented using an algorithm running in time $O(n^5)$ that compares the topologies of each pair of quartets explicitly. Therefore it should not be used for large scale computations of the quartet distance and the related measures. We hope that making the feature available will inspire other people to look into the problems of visualization of tree similarity, both in terms of usefulness and time complexity.
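To give an idea of what an annotated tree in the Graphviz dot format might look like, the sketch below writes one undirected edge per line with a label, a pen width and a gray level derived from the shared fraction. The output layout of the actual tool is not described in the text, so the function name `to_dot`, the scaling constants and the choice of gray tones are assumptions; only the standard Graphviz edge attributes `label`, `penwidth` and `color` are used.

```python
def to_dot(edges, name="T"):
    """Render annotated edges as an undirected Graphviz graph.

    `edges` is an iterable of (u, v, shared, total) tuples.  Thick,
    dark edges stand for a high shared percentage; thin, light edges
    stand for a low one.
    """
    lines = [f"graph {name} {{"]
    for u, v, shared, total in edges:
        fraction = shared / total if total else 0.0
        width = 1 + 4 * fraction               # pen width between 1 and 5
        gray = round(80 * (1 - fraction))      # gray0 is black, gray80 is light
        lines.append(
            f'  "{u}" -- "{v}" [label="{shared} of {total}", '
            f'penwidth={width:.1f}, color="gray{gray}"];'
        )
    lines.append("}")
    return "\n".join(lines)


print(to_dot([("r", "w", 5, 5), ("r", "a", 0, 4)]))
```

The resulting text can be saved to a file and rendered with one of the Graphviz layout programs.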


Figure 11.3: The online visualization updates the drawings of the two input trees while the visualization algorithm is running. (a) The visualization after a short period of time. (b) After the algorithm has run for a little while. (c) When the algorithm has finished.
