Evolution Module. 6.1 Phylogenetic Trees. Bob Gardner and Lev Yampolski. Integrated Biology and Discrete Math (IBMS 1300), Fall 2008

INDUCTION

Note. The natural numbers $\mathbb{N}$ is the familiar set $\mathbb{N} = \{1, 2, 3, \ldots\}$. From the formal definition of $\mathbb{N}$, we have the following:

The Principle of Mathematical Induction. Let $S \subseteq \mathbb{N}$ have the properties:
(i) $1 \in S$.
(ii) For all $n \in \mathbb{N}$, if $n \in S$ then $n + 1 \in S$.
Then $S = \mathbb{N}$.

Note. We will use mathematical induction to prove the validity of certain formulae. The proof technique is to confirm the formula for $N = 1$ (step (i)) and then show that assuming the validity of the formula for $N = n$ implies its validity for $N = n + 1$ (step (ii)). You might visualize this like knocking down dominoes. If you know that (i) you have knocked down the first domino and that (ii) whenever a domino falls, it knocks down the next domino, then you can conclude that all of the dominoes fall.

Example. Prove that for all $N \in \mathbb{N}$, we have
$$1 + 3 + 5 + \cdots + (2N - 1) = \sum_{i=1}^{N} (2i - 1) = N^2.$$

Solution. First, for $N = 1$,
$$\sum_{i=1}^{1} (2i - 1) = 2(1) - 1 = 1 = 1^2$$
and the formula holds. Second, assume the formula holds for $N = n$:
$$\sum_{i=1}^{n} (2i - 1) = n^2.$$
We now show that the formula holds for $N = n + 1$:
$$\sum_{i=1}^{n+1} (2i - 1) = \sum_{i=1}^{n} (2i - 1) + (2(n + 1) - 1)$$
$$= n^2 + (2n + 1) \quad \text{by the induction hypothesis}$$
$$= n^2 + 2n + 1 = (n + 1)^2.$$
Hence, by the Principle of Mathematical Induction, the formula
$$\sum_{i=1}^{N} (2i - 1) = N^2$$
holds for all $N \in \mathbb{N}$.
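Note. A numerical check is not a proof, but it is a useful sanity test of a formula before attempting an induction argument. The following Python sketch (the function name `odd_sum` is ours, not part of the original notes) adds the first $N$ odd numbers term by term and compares the result with $N^2$.

```python
def odd_sum(N: int) -> int:
    """Return 1 + 3 + 5 + ... + (2N - 1), summed term by term."""
    return sum(2 * i - 1 for i in range(1, N + 1))


# Compare the term-by-term sum with the closed form N^2 for small N.
for N in range(1, 21):
    assert odd_sum(N) == N ** 2

print([odd_sum(N) for N in range(1, 6)])  # [1, 4, 9, 16, 25]
```

The loop only confirms finitely many cases; the induction proof above is what covers every $N$.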

Exercise 6.1.1. In the computation of definite integrals using the definition and regular partitions, you encountered the formulae:
$$\sum_{k=1}^{n} k = \frac{n(n+1)}{2}, \qquad \sum_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6}, \qquad \sum_{k=1}^{n} k^3 = \left( \frac{n(n+1)}{2} \right)^2.$$
Use the Principle of Mathematical Induction to prove each of these.

Exercise 6.1.2. In computing the derivative of $f(x) = x^n$, $n \in \mathbb{N}$, by definition, you used the Binomial Theorem which states that
$$(a + b)^N = \sum_{i=0}^{N} \binom{N}{i} a^{N-i} b^i$$
for all $N \in \mathbb{N}$, where
$$\binom{N}{i} = \frac{N!}{(N - i)!\, i!}.$$
Use the Principle of Mathematical Induction to prove the Binomial Theorem.
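Note. Before proving the formulae in Exercises 6.1.1 and 6.1.2, it can help to confirm them numerically. The sketch below is only such a check (it does not replace the induction proofs the exercises ask for); the helper names `power_sum` and `binomial_expansion` are our own.

```python
from math import comb


def power_sum(n: int, p: int) -> int:
    """Return 1^p + 2^p + ... + n^p, summed term by term."""
    return sum(k ** p for k in range(1, n + 1))


def binomial_expansion(a: int, b: int, N: int) -> int:
    """Return the right-hand side of the Binomial Theorem for integers a, b."""
    return sum(comb(N, i) * a ** (N - i) * b ** i for i in range(N + 1))


for n in range(1, 21):
    # Denominators are cleared so that only integer arithmetic is used.
    assert 2 * power_sum(n, 1) == n * (n + 1)
    assert 6 * power_sum(n, 2) == n * (n + 1) * (2 * n + 1)
    assert 4 * power_sum(n, 3) == (n * (n + 1)) ** 2
    assert binomial_expansion(3, -2, n) == (3 + (-2)) ** n
```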

Note. An alternate form of the Principle of Mathematical Induction is the following:

The Strong Principle of Mathematical Induction. Let $S \subseteq \mathbb{N}$ have the properties:
(i) $1 \in S$.
(ii) For all $n \in \mathbb{N}$, if $\{1, 2, \ldots, n\} \subseteq S$ then $n + 1 \in S$.
Then $S = \mathbb{N}$.

Note. Going back to the domino analogy, we would say that if (i) you have knocked down the first domino and (ii) whenever the first $n$ dominoes fall, the $(n + 1)$st domino falls, then you can conclude that all of the dominoes fall.

TREES

Note. In section 6.5 of Symbiosis 1 (Inbreeding), we introduced the idea of a graph and used a method of counting paths to define and calculate an inbreeding coefficient. In this section, we turn to the application of a graph theoretic object called a tree to the study of relationships between species or other biological units.

Note. Recall that a tree is an acyclic connected graph. That is, a connected graph which contains no cycle.

Note. If we delete an edge from any tree, then the resulting graph consists of two disjoint pieces (i.e., two connected components). This observation, combined with the Principle of Mathematical Induction, allows us to prove an important theorem concerning trees.

Theorem 6.1.1. A tree with $n$ vertices has $n - 1$ edges.

Proof. We prove the result holds for all $n \in \mathbb{N}$ by using induction. First, a tree with $N = 1$ vertex contains $N - 1 = 0$ edges and a tree with $N = 2$ vertices has $N - 1 = 1$ edge. Suppose the result holds for $N = 1, N = 2, \ldots, N = n$. Consider a tree $T$ with $N = n + 1$ vertices. Let $(x, y)$ be an edge of $T$. If we delete edge $(x, y)$ from $T$, then we get two disjoint connected graphs $T_1$ and $T_2$. Since $T$ is an acyclic graph, so are $T_1$ and $T_2$. That is, $T_1$ and $T_2$ are trees. Let $N_1$ be the number of vertices of $T_1$ and let $N_2$ be the number of vertices of $T_2$. Notice that $N_1 < n + 1$, $N_2 < n + 1$, and $N_1 + N_2 = N = n + 1$. So by the induction hypothesis, trees $T_1$ and $T_2$ have $N_1 - 1$ and $N_2 - 1$ edges, respectively. Now the edge set of $T$ is the edge set of $T_1$ combined with the edge set of $T_2$ and the edge $(x, y)$. So $T$ has
$$(N_1 - 1) + (N_2 - 1) + 1 = N_1 + N_2 - 1 = N - 1$$
edges. Therefore the result holds for $N = n + 1$ and by the Strong Principle of Mathematical Induction, the result holds for all $N \in \mathbb{N}$.
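Note. Theorem 6.1.1 can be illustrated computationally. The sketch below assumes a graph is given as a vertex count `n` (vertices numbered 0 to n - 1) and a list of edges; this encoding and the function name `is_tree` are our assumptions. The function tests the definition directly (connected and acyclic) with a union-find structure, and the edge count is then compared with $n - 1$.

```python
def is_tree(n, edges):
    """True if the graph on vertices 0..n-1 is connected and acyclic."""
    parent = list(range(n))  # union-find forest, one component per vertex

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False  # u and v already joined: this edge closes a cycle
        parent[ru] = rv
        components -= 1
    return components == 1  # connected exactly when one component remains


path5 = [(0, 1), (1, 2), (2, 3), (3, 4)]
star5 = [(0, 1), (0, 2), (0, 3), (0, 4)]
cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]

for edges in (path5, star5):
    assert is_tree(5, edges) and len(edges) == 5 - 1
assert not is_tree(4, cycle4)            # contains a cycle
assert not is_tree(4, [(0, 1), (2, 3)])  # not connected
```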

Theorem 6.1.2. In a tree, any two vertices are connected by a unique path.

Proof. Since trees are by definition connected graphs, we know that there exists at least one path between any two vertices. We show uniqueness through proof by contradiction, in which we assume that the theorem is false and derive a contradiction, leading us to conclude that the theorem is, in fact, true. So we start by assuming that there is more than one path between some two vertices. Say there are two paths between vertices $u$ and $v$. Let the paths be
$$P_1 = u e_1 v_1 e_2 v_2 \cdots e_{n-1} v_{n-1} e_n v \quad \text{and} \quad P_2 = u e'_1 v'_1 e'_2 v'_2 \cdots e'_{m-1} v'_{m-1} e'_m v.$$
Since the two paths are different, there is some edge $e = (x, y)$ of $P_1$ that is not an edge of $P_2$. Now the graph $(P_1 \cup P_2) - e$ is connected (for any vertex in this graph, there is either a path to vertex $u$ or a path to vertex $v$, and since $P_2$ is a path between $u$ and $v$, there is a path between any two vertices of this graph). So in particular, there is a path in $(P_1 \cup P_2) - e$ from $x$ to $y$, say path $P$. But then $P$ together with $e$ contains a cycle, and this cycle is present in the original graph. However, since the original graph is a tree, it cannot contain a cycle. Therefore, under the assumption of the existence of two paths between some pair of vertices, we have derived a contradiction. Hence there must be only one path between any two vertices of a tree.
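Note. The uniqueness of paths in a tree can be observed on small examples. The following sketch counts simple paths (paths with no repeated vertex) between two vertices by depth-first search. The example graphs and the helper names `adjacency` and `count_simple_paths` are our own illustrative choices.

```python
def adjacency(n, edges):
    """Build an adjacency dictionary for the graph on vertices 0..n-1."""
    adj = {i: set() for i in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def count_simple_paths(adj, u, v):
    """Count the simple paths from u to v by exhaustive depth-first search."""
    def dfs(x, visited):
        if x == v:
            return 1
        return sum(dfs(y, visited | {y}) for y in adj[x] if y not in visited)

    return dfs(u, {u})


tree = adjacency(6, [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5)])
cycle = adjacency(4, [(0, 1), (1, 2), (2, 3), (3, 0)])

# In the tree, every pair of distinct vertices is joined by exactly one path.
assert all(count_simple_paths(tree, a, b) == 1
           for a in range(6) for b in range(6) if a != b)
# In the 4-cycle, opposite vertices are joined by two paths.
assert count_simple_paths(cycle, 0, 2) == 2
```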

Note. In fact, the converse of Theorem 6.1.1 is also true (assuming a connected graph):

Theorem 6.1.3. A connected graph with $n$ vertices and $n - 1$ edges is a tree.

Exercise 6.1.3. Use the Strong Principle of Mathematical Induction and Theorem 6.1.2 to prove Theorem 6.1.3. HINT: If $T$ is a tree on $N = n + 1$ vertices and you delete one vertex of $T$ (and hence you delete all the edges incident to that vertex), then what is left?

Note. Combining Theorems 6.1.1 and 6.1.3, we get a classification of graphs which are trees:

Theorem 6.1.4. A connected graph with $v$ vertices and $e$ edges is a tree if and only if $v = e + 1$.

Definition. The degree of a vertex of a graph is the number of edges incident to it. That is, the number of edges of which it is an element.

Theorem 6.1.5. The sum of the degrees of the vertices of any graph is twice the number of edges.

Proof. Every edge of a graph has two vertices as elements. So if we add up the degrees of the vertices of a graph, then we count each edge twice and the sum is twice the number of edges:
$$\sum_{v \text{ is a vertex of } G} \text{degree}(v) = 2 \times (\text{the number of edges of } G).$$

Theorem 6.1.6. A tree on two or more vertices has at least two vertices of degree one.

Proof. Let $T$ be a tree with $n \geq 2$ vertices. Then $T$ has $n - 1$ edges by Theorem 6.1.1 and the sum of the degrees of the vertices of the tree by Theorem 6.1.5 is
$$\sum_{v \in T} \text{degree}(v) = 2(n - 1) = 2n - 2.$$
Since $T$ is connected and has at least two vertices, every vertex has degree at least one. If the degree of each vertex is two or more, then the sum of the degrees of the vertices is at least $2n$ (since there are $n$ vertices). If exactly one vertex has degree one, then the sum is at least $1 + 2(n - 1) = 2n - 1$. Both bounds exceed $2n - 2$, so at least two of the vertices must be of degree one.

Note. In biological applications of trees, we are particularly interested in vertices of degree one (these will often represent the biological units of interest).
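Note. Theorems 6.1.5 and 6.1.6 are easy to check on a concrete tree. In the sketch below the example tree and the function name `degrees` are our own assumptions; the code tallies the degree of every vertex from an edge list, then compares the degree sum with twice the number of edges and counts the vertices of degree one.

```python
def degrees(n, edges):
    """Return a list whose entry i is the degree of vertex i."""
    deg = [0] * n
    for a, b in edges:
        deg[a] += 1  # each edge contributes one to each of its two ends
        deg[b] += 1
    return deg


tree_edges = [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5)]
deg = degrees(6, tree_edges)

assert sum(deg) == 2 * len(tree_edges)   # Theorem 6.1.5
leaves = [v for v in range(6) if deg[v] == 1]
assert len(leaves) >= 2                  # Theorem 6.1.6
print(deg, leaves)  # [1, 3, 1, 3, 1, 1] [0, 2, 4, 5]
```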

HYDROCARBON MOLECULES

Recall. A hydrocarbon molecule consists of only carbon and hydrogen atoms. A carbon atom has four chemical bonds and a hydrogen atom has one chemical bond. A hydrocarbon is said to be saturated if it has the maximum number of hydrogen atoms for the given number of carbon atoms (hence there are no carbon double bonds and no cycles of carbon atoms). These hydrocarbons are called alkanes. So a saturated hydrocarbon can be represented by a graph in which the carbon atoms are degree four vertices, the hydrogen atoms are degree one vertices, and the edges represent chemical bonds. Several of these ideas can be found in Bogart [2].

Example. Here are three examples of saturated hydrocarbons:

[Figure: three examples of saturated hydrocarbons.]

Example. In a saturated hydrocarbon with $n$ carbon atoms, there are $2n + 2$ hydrogen atoms.

Proof. The hydrocarbon can be represented by a tree. Let $h$ be the number of hydrogen atoms. Then the tree has $n + h$ vertices and by Theorem 6.1.1, it has $n + h - 1$ edges. We know the degree of each vertex which represents a carbon atom is four and the degree of each vertex which represents a hydrogen atom is one. So the total sum of the degrees is $4n + h$. Also, by Theorem 6.1.5 the sum of the degrees is twice the number of edges. Therefore
$$4n + h = 2(n + h - 1).$$
From this equation, we get $h = 2n + 2$.
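Note. The count $h = 2n + 2$ can be seen by building an alkane explicitly. The sketch below constructs only the unbranched (straight-chain) alkane on $n$ carbon atoms, which is one special case chosen for illustration; the function name `straight_chain_alkane` and the vertex encoding are our assumptions.

```python
def straight_chain_alkane(n):
    """Build the graph of an unbranched alkane with n carbon atoms.

    Returns (labels, edges): vertices 0..n-1 are carbons, later ones hydrogens.
    """
    labels = ["C"] * n
    edges = [(i, i + 1) for i in range(n - 1)]  # the carbon backbone
    degree = [0] * n
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    for c in range(n):
        # Fill the remaining bonds of each carbon (degree four) with hydrogens.
        for _ in range(4 - degree[c]):
            labels.append("H")
            edges.append((c, len(labels) - 1))
    return labels, edges


for n in range(1, 9):
    labels, edges = straight_chain_alkane(n)
    h = labels.count("H")
    assert h == 2 * n + 2                # hydrogen count from the Example
    assert len(edges) == (n + h) - 1     # Theorem 6.1.1: a tree on n + h vertices
```

For $n = 1$ this produces methane, $\text{CH}_4$, and for $n = 4$ the unbranched isomer of $\text{C}_4\text{H}_{10}$.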

Note. Two different saturated hydrocarbons can have the same chemical formula, but different physical structures (that is, the trees representing them can be different). Such molecules are called isomers of each other. For example, $\text{C}_4\text{H}_{10}$ yields:

[Figure: the isomers of $\text{C}_4\text{H}_{10}$.]

Exercise 6.1.4. Draw the trees which represent all possible saturated hydrocarbons $\text{C}_5\text{H}_{12}$.

Note. Since two carbon atoms can form a double bond, it is of interest to consider graphs which have repeated edges. Such a graph is called a multigraph (and graphs without repeated edges are often called simple graphs).

Example. Ethylene is $\text{C}_2\text{H}_4$ and can be represented as:

[Figure: multigraph representation of ethylene.]

Exercise 6.1.5. A hydrocarbon with at least one carbon double bond and no cycle is called an alkene. Show that an alkene with exactly one double carbon bond and $n$ carbon atoms has $2n$ hydrogen atoms.

Exercise 6.1.6. A hydrocarbon with at least one carbon triple bond and no cycle is called an alkyne. Show that an alkyne with exactly one triple carbon bond and $n$ carbon atoms has $2n - 2$ hydrogen atoms.

Exercise 6.1.7. A hydrocarbon containing one or more cycles of carbon atoms (called a carbon ring) is called a cycloalkane. Show that a saturated cycloalkane with one carbon ring and $n$ carbon atoms has $2n$ hydrogen atoms.

Exercise 6.1.8. Draw a graphical representation of a benzene molecule, $\text{C}_6\text{H}_6$. Is this a saturated hydrocarbon?
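Note. A multigraph can be stored as a list of edges in which a repeated pair stands for a repeated edge. The sketch below encodes ethylene this way (the vertex names such as `"C1"` and the function name `multigraph_degrees` are our own labels) and checks that each carbon has degree four and each hydrogen degree one, with the double bond counted twice.

```python
# Ethylene as a multigraph: the pair ("C1", "C2") appears twice (double bond).
ethylene = [("C1", "C2"), ("C1", "C2"),
            ("C1", "H1"), ("C1", "H2"),
            ("C2", "H3"), ("C2", "H4")]


def multigraph_degrees(edges):
    """Return a dict of vertex degrees, counting repeated edges separately."""
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg


deg = multigraph_degrees(ethylene)
assert deg["C1"] == 4 and deg["C2"] == 4
assert all(deg[h] == 1 for h in ("H1", "H2", "H3", "H4"))
assert sum(deg.values()) == 2 * len(ethylene)  # Theorem 6.1.5 still applies
```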

COUNTING TREES

Note. We now consider trees which have biological applications. We are interested in representing evolutionary trees graphically. Here is an evolutionary tree and the corresponding graph:

[Figure: an evolutionary tree on five species and the corresponding graph, with root labeled R.]

Notice that each of the vertices of degree one plays a special role. Five of them represent species and one of them (labeled R) is the root of the tree. Each of the other vertices represents an episode of speciation and is hence a vertex of degree three (think of it as one species in, two species out). Since we expect speciation to occur in this bifurcating way, we restrict our study to trees with these properties.

Definition. A tree which consists of only degree one and degree three vertices is a bifurcating tree. If each of the degree one vertices is given a distinct label, then the tree is labeled. If one of the degree one vertices is declared the root, then the tree is rooted.

Note. For evolutionary trees, we are primarily interested in rooted, bifurcating, labeled trees. With a nod to the botanical sciences, the degree one vertices (except for the root) are called leaves. A rooted bifurcating tree with $n$ leaves is called an n-species tree. We wish to count the number of different n-species trees.

Theorem 6.1.7. There are
$$\frac{(2n - 3)!}{2^{n-2} (n - 2)!}$$
rooted, bifurcating, labeled n-species trees where $n \geq 2$. Each such tree has $2n - 1$ edges.

Note. First, for $n = 2$ there is exactly one such tree:

[Figure: the 2-species tree.]

and the theorem holds. Now suppose we wanted to add a third species $S_3$. Since the root and the species vertices must remain degree one and the internal vertices must remain degree three, the only way to add a new species is to subdivide an existing edge by introducing a new vertex and then adding a new edge incident to the new vertex with the new species as the other end of the new edge. Since the 2-species tree has three edges, this can be done in three ways:

[Figure: the three 3-species trees.]

So there are three 3-species trees and
$$\frac{(2(3) - 3)!}{2^{(3)-2} ((3) - 2)!} = 3.$$
Similarly, to add a fourth species, there are five ways to do it for each of the 3-species trees (since each of the 3-species trees has 5 edges). Hence there are $(1)(3)(5) = 15$ 4-species trees and
$$\frac{(2(4) - 3)!}{2^{(4)-2} ((4) - 2)!} = 15.$$

Exercise 6.1.9. Show algebraically that
$$1 \cdot 3 \cdot 5 \cdot 7 \cdots (2n - 3) = \frac{(2n - 3)!}{2^{n-2} (n - 2)!}.$$
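Note. The identity in Exercise 6.1.9 can be checked numerically before it is proved algebraically. In the sketch below the function names `odd_product` and `closed_form` are our own; one computes the product of odd numbers directly and the other evaluates the factorial expression.

```python
from math import factorial, prod


def odd_product(n):
    """Return 1 * 3 * 5 * ... * (2n - 3) for n >= 2."""
    return prod(range(1, 2 * n - 2, 2))


def closed_form(n):
    """Return (2n - 3)! / (2^(n-2) * (n-2)!) for n >= 2."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for n in range(2, 30):
    assert odd_product(n) == closed_form(n)

print([closed_form(n) for n in range(2, 7)])  # [1, 3, 15, 105, 945]
```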

Lemma 6.1.8. Let $T$ be an n-species tree. Let $T_1$ and $T_2$ be (n + 1)-species trees, each created from $T$ by subdividing an edge and adding a new edge and a labeled vertex (say the new vertex is labeled $S_{n+1}$ in both $T_1$ and $T_2$). If the edge of $T$ which is subdivided to create $T_1$ is different from the edge of $T$ used to create $T_2$, then $T_1$ and $T_2$ are different.

Proof. Let $e_1 = (u_1, v_1)$ be the edge of $T$ which is subdivided to create $T_1$ and let $e_2 = (u_2, v_2)$ be the edge of $T$ which is subdivided to create $T_2$. Consider the path $P_{u_1}$ from $u_1$ to the root and the path $P_{v_1}$ from $v_1$ to the root. One of these paths is shorter than the other by exactly one edge (either $P_{u_1} = P_{v_1} e_1 u_1$ or $P_{v_1} = P_{u_1} e_1 v_1$). Without loss of generality suppose $P_{u_1}$ is shorter. (This illustrates the fact that a rooted tree with labeled leaves has a natural orientation: toward the root and away from the root.) In a similar way, suppose the path from $u_2$ to the root is shorter than the path from $v_2$ to the root. Since $T$ is a tree, there is a unique path from $v_1$ to $v_2$ and the path must include edges $e_1$ and $e_2$ (since there is a unique path from $u_1$ to $u_2$). Now we show that trees $T_1$ and $T_2$ are different by finding two vertices for which the length of the path between them in $T_1$ differs from the length of the path between the same two vertices in $T_2$. In $T_1$, the path from $S_{n+1}$ to $v_1$ consists only of the vertices $S_{n+1}$, $v$, and $v_1$, where $v$ is the new vertex which subdivides edge $e_1$. But in $T_2$, the path from $S_{n+1}$ to $v_1$ must include vertices $S_{n+1}$, $v$, $u_2$, and $v_1$, where $v$ is now the new vertex which subdivides edge $e_2$ (it must also include $u_1$ but we do not know that $u_1$ and $u_2$ are different). So in $T_2$ the path from $S_{n+1}$ to $v_1$ is longer than the path in $T_1$ from $S_{n+1}$ to $v_1$. Therefore $T_1$ and $T_2$ are different.

Note. The lemma shows that whenever we create an (n + 1)-species tree from an n-species tree, we get different trees when we subdivide different edges. We will use induction to count the number of n-species trees. But how do we know that starting with an n-species tree and creating two (n + k)-species trees by adding k new species in different places produces different (n + k)-species trees (the lemma only guarantees a difference when $k = 1$)? This is where the labeling plays an important role. If we assume the two (n + k)-species trees are the same, then we can derive a contradiction.

Lemma 6.1.9. Suppose $T_1$ and $T_2$ are two different (n + 1)-species trees generated from n-species tree $T$, as described in Lemma 6.1.8. Then any (n + k)-species tree created from $T_1$ is different from any (n + k)-species tree created from $T_2$.

Proof. Suppose not. Suppose $T'_1$ is an (n + k)-species tree created from $T_1$, $T'_2$ is an (n + k)-species tree created from $T_2$, and $T'_1$ and $T'_2$ are the same. Consider the (n + 1)-species trees induced by the first (n + 1) labeled vertices and the root in $T'_1$ and $T'_2$. (These induced trees can be formed by taking the union of all of the paths from the labeled vertices to the root.) These (n + 1)-species trees are exactly $T_1$ and $T_2$. However, $T_1$ and $T_2$ are different by Lemma 6.1.8. Since these subtrees of $T'_1$ and $T'_2$ are different, $T'_1$ and $T'_2$ are different, a contradiction.
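Note. For small $n$ the content of Lemmas 6.1.8 and 6.1.9 can be checked by brute force. The sketch below rests on an encoding that is our assumption, not a construction from the text: a leaf is its label (a string) and an internal vertex is a `frozenset` of its two subtrees, so that two trees are equal exactly when they are the same labeled rooted tree. Each node of this nested structure stands for the edge directly above it, and the edge above the top node is the edge to the root. A new species is added by subdividing each edge in turn, and a Python `set` discards any duplicates.

```python
def subtrees(t):
    """Yield every node of t; each node stands for the edge directly above it."""
    yield t
    if isinstance(t, frozenset):
        for child in t:
            yield from subtrees(child)


def insert_leaf(t, target, leaf):
    """Subdivide the edge above `target` in t and attach `leaf` there."""
    if t == target:
        return frozenset({t, leaf})
    if isinstance(t, frozenset):
        return frozenset(insert_leaf(c, target, leaf) for c in t)
    return t


def all_trees(n):
    """Return the set of all n-species trees (n >= 2) built by edge subdivision."""
    trees = {frozenset({"S1", "S2"})}
    for k in range(3, n + 1):
        trees = {insert_leaf(t, node, f"S{k}")
                 for t in trees for node in subtrees(t)}
    return trees


for n, expected in [(2, 1), (3, 3), (4, 15), (5, 105)]:
    trees = all_trees(n)
    assert len(trees) == expected  # no two constructions coincide
    assert all(sum(1 for _ in subtrees(t)) == 2 * n - 1 for t in trees)
```

Because the set holds exactly $1 \cdot 3 \cdot 5 \cdots (2n - 3)$ trees, no two different insertion sequences produced the same tree for these values of $n$.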

Note. This concept of same and different trees is better dealt with mathematically by introducing an isomorphism ("same shape"). An isomorphism between two graphs is a one-to-one and onto mapping between the vertex sets of the graphs which preserves the edge sets of the graphs. For example, the following two graphs are isomorphic:

[Figure: two drawings of a 5-cycle, one on vertices 0, 1, 2, 3, 4 and one on vertices a, b, c, d, e.]

The isomorphism $\pi$ is the function such that $\pi(0) = a$, $\pi(1) = b$, $\pi(2) = c$, $\pi(3) = d$, and $\pi(4) = e$. Of course, both graphs are a 5-cycle. Two isomorphic graphs have the same properties. If two trees are isomorphic, say, then the lengths of paths between corresponding vertices must be the same. (The correspondence is created by the mapping $\pi$.)
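Note. Whether a given mapping is an isomorphism can be checked mechanically. The sketch below assumes, purely for illustration, that the two 5-cycles have edges 0-1-2-3-4-0 and a-b-c-d-e-a (the original figure may draw them differently); the function name `is_isomorphism` is also our own. It tests that $\pi$ is one-to-one and carries the edge set of the first graph exactly onto the edge set of the second.

```python
def is_isomorphism(pi, edges_g, edges_h):
    """True if the one-to-one map pi sends the edges of G exactly onto those of H."""
    one_to_one = len(set(pi.values())) == len(pi)
    image = {frozenset((pi[u], pi[v])) for u, v in edges_g}
    target = {frozenset(e) for e in edges_h}
    return one_to_one and image == target


# Assumed edge lists for the two 5-cycles (hypothetical; the figure is not available).
G = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
H = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "a")]

pi = {0: "a", 1: "b", 2: "c", 3: "d", 4: "e"}
assert is_isomorphism(pi, G, H)

# Swapping the images of 1 and 2 breaks edge preservation.
not_pi = {0: "a", 1: "c", 2: "b", 3: "d", 4: "e"}
assert not is_isomorphism(not_pi, G, H)
```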

Lemma 6.1.10. Any n-species tree can be generated from a 2-species tree by a unique sequence of $(n - 2)$ steps of adding new labeled vertices (species) by subdividing edges.

Proof. We consider what happens by reversing the process of adding labeled vertices. Consider an n-species tree ($n > 2$) with species as labeled vertices $S_1, S_2, \ldots, S_n$. Remove vertex $S_n$ and the edge containing it (since $S_n$ is degree one, there is only one edge incident to it), say edge $(v, S_n)$. Since $v$ is of degree three in the n-species tree, there are two other vertices adjacent to $v$, say $u$ and $v'$. Replace edges $(u, v)$ and $(v, v')$ with edge $(u, v')$. (This process is called pruning a tree! It has resulted in the removal of a leaf.) We then have created an (n - 1)-species tree with species $S_1, S_2, \ldots, S_{n-1}$. Continue this process of pruning species until only species $S_1$ and $S_2$ are left. The reversal of the steps in this process then creates the original n-species tree from the 2-species tree. Also, by Lemma 6.1.9, there is only one sequence of steps which produces the original n-species tree from the 2-species tree.
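Note. The pruning step in the proof of Lemma 6.1.10 can be written as a short recursive function. As an assumption made for this sketch, a leaf is encoded as its label and an internal vertex as a `frozenset` of its two subtrees; the function name `prune` and the example tree are ours. Removing a leaf replaces its parent vertex by the leaf's sibling, which is the same as merging the two edges left at the parent.

```python
def prune(t, leaf):
    """Remove `leaf` from the nested-set tree t and merge the edges at its parent."""
    if not isinstance(t, frozenset):
        return t                      # a different leaf: nothing to do
    if leaf in t:
        (sibling,) = t - {leaf}       # the parent has exactly one other child
        return sibling
    return frozenset(prune(child, leaf) for child in t)


# A 4-species tree in which S1, S3 and S2, S4 form the two pairs.
tree4 = frozenset({frozenset({"S1", "S3"}), frozenset({"S2", "S4"})})

tree3 = prune(tree4, "S4")
assert tree3 == frozenset({frozenset({"S1", "S3"}), "S2"})

tree2 = prune(tree3, "S3")
assert tree2 == frozenset({"S1", "S2"})  # the 2-species tree
```

Reading the two pruning steps in reverse order gives the sequence of edge subdivisions that rebuilds `tree4` from the 2-species tree.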

Note. With the help of Lemmas 6.1.8, 6.1.9, and 6.1.10, we are now ready to count the number of n-species trees.

Theorem 6.1.7. There are
$$\frac{(2n - 3)!}{2^{n-2} (n - 2)!}$$
rooted, bifurcating, labeled n-species trees where $n \geq 2$. Each such tree has $2n - 1$ edges.

Proof. We have already seen that these formulae hold for $N = 2$ and $N = 3$. Now suppose they hold for $N = n$: an n-species tree has $2n - 1$ edges and there are
$$\frac{(2n - 3)!}{2^{n-2} (n - 2)!}$$
n-species trees. Consider $N = n + 1$. We create all (n + 1)-species trees by adding a new species to an n-species tree. Since there are $2n - 1$ edges in an n-species tree, there are $2n - 1$ different ways to produce an (n + 1)-species tree from a given n-species tree. By the induction hypothesis (and Exercise 6.1.9), there are $1 \cdot 3 \cdot 5 \cdots (2n - 3)$ n-species trees, and so we can create
$$1 \cdot 3 \cdot 5 \cdots (2n - 3) \cdot (2n - 1) = \frac{(2n - 1)!}{2^{n-1} (n - 1)!} = \frac{(2(n + 1) - 3)!}{2^{(n+1)-2} ((n + 1) - 2)!} = \frac{(2N - 3)!}{2^{N-2} (N - 2)!}$$
(n + 1)-species trees. By Lemma 6.1.9 the trees are all different and by Lemma 6.1.10 we have all such trees. The n-species tree has $2n - 1$ edges and we have subdivided some edge into two edges and added a new edge (for a net gain of two edges).

So the (n + 1)-species tree has $(2n - 1) + 2 = 2n + 1 = 2(n + 1) - 1 = 2N - 1$ edges. Therefore by the Principle of Mathematical Induction, the theorem follows.

Note. Some of the first interest in graphs concerned counting trees and was addressed by Arthur Cayley in the mid-1800s. A common approach to counting involves generating functions in which the number of objects of a given size is computed in terms of the number of objects of smaller sizes. The Fibonacci sequence $f_n$ is an elementary example of such an idea: $f_1 = 1$, $f_2 = 1$, and $f_n = f_{n-1} + f_{n-2}$ for $n \geq 3$. The technique of proof presented above is due to Cavalli-Sforza and Edwards [3].

Note. The presence of the factorial in the number of n-species trees means that there is a tremendous number of such trees, even when $n$ is fairly small. Consider the following table (from Felsenstein, page 24 [4]).

Table 6.1.1. The number of rooted n-species trees for various n. From Felsenstein [4].

# of Species n    # of Trees
1                 1
2                 1
3                 3
4                 15
5                 105
6                 945
7                 10,395
8                 135,135
9                 2,027,025
10                34,459,425
20                $8.20 \times 10^{21}$
30                $4.95 \times 10^{38}$
40                $1.01 \times 10^{57}$
50                $2.75 \times 10^{76}$

We see from this table that if we want to construct a phylogenetic tree for just a few species (say 10) then it is impractical to try to search through all possible trees (for 10 species there are over 34 million trees) to find the one which best fits given data. Hence some type of algorithm is needed to help simplify the problem. This will be explored shortly.
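Note. The entries of Table 6.1.1 can be regenerated from the counting argument itself: each n-species tree has $2n - 1$ edges, so the number of (n + 1)-species trees is $(2n - 1)$ times the number of n-species trees. The sketch below applies this recurrence starting from the single 2-species tree; the function name `rooted_tree_counts` is our own, and the value 1 for $n = 1$ is taken from the table rather than from the formula (which requires $n \geq 2$).

```python
def rooted_tree_counts(max_n):
    """Return {n: number of rooted n-species trees} for n = 1..max_n."""
    counts = {1: 1, 2: 1}  # n = 1 as listed in the table; n = 2 is the base case
    for n in range(2, max_n):
        counts[n + 1] = (2 * n - 1) * counts[n]  # one new tree per edge
    return counts


counts = rooted_tree_counts(50)
assert [counts[n] for n in range(1, 11)] == [
    1, 1, 3, 15, 105, 945, 10395, 135135, 2027025, 34459425]

# Large entries printed in scientific notation, as in the table.
for n in (20, 30, 40, 50):
    print(n, f"{counts[n]:.2e}")
```

Python integers have arbitrary precision, so the large entries are computed exactly and only rounded when printed.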

OTHER TREES

Note. Most methods of inferring phylogenies infer unrooted trees (Felsenstein page 24 [4]). That is, it is desired to find a tree which best describes evolutionary relationships, but without reference to a common ancestor. With much of the phylogenetic work, interest lies in extant species and the revelation of relationships based on molecular data, not so much in family trees which relate extant and extinct species.

Definition. An unrooted, bifurcating, labeled tree is a tree in which every vertex is of either degree three or degree one. The degree one vertices are given distinct labels. If there are $n$ degree one vertices, the tree is called an unrooted n-species tree.

Note. There is no significant difference between an unrooted n-species tree and a rooted (n - 1)-species tree. We can simply think of the root of a rooted (n - 1)-species tree as the $n$th species. Conversely, any unrooted n-species tree can be related to a rooted (n - 1)-species tree by declaring one of the species (say the $n$th) as the root. Since there are
$$1 \cdot 3 \cdot 5 \cdots (2(n - 1) - 3) = \frac{(2(n - 1) - 3)!}{2^{(n-1)-2} ((n - 1) - 2)!}$$
or
$$1 \cdot 3 \cdot 5 \cdots (2n - 5) = \frac{(2n - 5)!}{2^{n-3} (n - 3)!}$$
rooted (n - 1)-species trees, this is the number of unrooted n-species trees.

Exercise 6.1.10. Give a direct proof based on mathematical induction for the number of unrooted n-species trees. You may assume that Lemmas 6.1.8, 6.1.9, and 6.1.10 hold for unrooted trees.

Note. There is one subtle conceptual difference between rooted and unrooted trees. We mentioned above the idea of directions in a rooted tree. Namely, we can think of a path that goes from a vertex towards the root or a path that goes from a vertex away from the root (and hence towards a leaf). Since unrooted trees do not have a root, these ideas are meaningless in the unrooted tree setting.

Note. An additional (and very difficult) question concerns the number of tree shapes. That is, we are interested in the number of different (nonisomorphic) unlabeled trees. The unlabeled rooted 2-species tree and 3-species tree are unique. There are two unlabeled rooted 4-species trees:

[Figure: the two unlabeled rooted 4-species trees.]

Exercise 6.1.11. Give a path length argument as to why the above two graphs are different.

Exercise 6.1.12. There are three unlabeled rooted 5-species trees. What are they? There are six such 6-species trees. What are they?

Note. No closed formula is currently known for the number of unlabeled n-species trees (see Felsenstein, page 30 [4]). Several values have been calculated and are presented in the following table.

Table 6.1.2. The number of different unlabeled rooted n-species trees for various n. From Felsenstein [4].

# of Species n    # of Trees
1                 1
2                 1
3                 1
4                 2
5                 3
6                 6
7                 11
8                 23
9                 46
10                98
20                293,547
30                $1.41 \times 10^{9}$
40                $8.10 \times 10^{12}$
50                $5.15 \times 10^{16}$
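Note. Although no closed formula is given, the small entries of Table 6.1.2 can be reproduced by a recurrence. The sketch below is based on an observation that is ours rather than the text's: in an unlabeled rooted n-species tree with $n \geq 2$, the internal vertex adjacent to the root splits the leaves into two unlabeled subtrees with $i$ and $n - i$ leaves, and the order of the two subtrees does not matter. The function name `rooted_shapes` is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def rooted_shapes(n):
    """Number of unlabeled rooted bifurcating tree shapes with n leaves."""
    if n == 1:
        return 1
    # Unordered pairs of subtree shapes with different leaf counts i < n - i.
    total = sum(rooted_shapes(i) * rooted_shapes(n - i)
                for i in range(1, (n + 1) // 2))
    if n % 2 == 0:
        # Both subtrees have n/2 leaves: count unordered pairs, repeats allowed.
        half = rooted_shapes(n // 2)
        total += half * (half + 1) // 2
    return total


assert [rooted_shapes(n) for n in range(1, 11)] == [1, 1, 1, 2, 3, 6, 11, 23, 46, 98]
assert rooted_shapes(20) == 293547
```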

Note. We can also consider unlabeled unrooted n-species trees. There are unique examples for $n \in \{2, 3, 4, 5\}$:

[Figure: the unlabeled unrooted n-species trees for n = 2, 3, 4, 5.]

Note. Some of the numbers of unlabeled unrooted n-species trees are given in the following table.

Table 6.1.3. The number of different unlabeled unrooted n-species trees for various n. From Felsenstein [4].

# of Species n    # of Trees
1                 1
2                 1
3                 1
4                 1
5                 1
6                 2
7                 2
8                 4
9                 6
10                11
20                12,444
30                $3.61 \times 10^{7}$
40                $1.49 \times 10^{11}$

Exercise 6.1.13. Find all unlabeled unrooted n-species trees for $n \in \{6, 7, 8\}$.

Note. As a final comment on counting trees, we observe that no formula is known for the number of distinct trees on $n$ vertices (making no assumptions of degrees or labelings). However, the number of labeled trees on $n$ vertices (that is, all of the vertices are labeled) is $n^{n-2}$ [1].
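Note. The count $n^{n-2}$ of labeled trees can be confirmed by brute force for small $n$. The sketch below (function names `is_acyclic` and `count_labeled_trees` are our own) examines every set of $n - 1$ edges on $n$ labeled vertices and keeps those that are acyclic; an acyclic graph with $n$ vertices and $n - 1$ edges is connected and hence a tree. The search grows very quickly, so it is only practical for a handful of vertices.

```python
from itertools import combinations


def is_acyclic(n, edges):
    """True if the graph on vertices 0..n-1 with these edges has no cycle."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False  # u and v already connected: the edge closes a cycle
        parent[ru] = rv
    return True


def count_labeled_trees(n):
    """Count the trees on n labeled vertices by checking all (n-1)-edge subsets."""
    possible_edges = list(combinations(range(n), 2))
    return sum(1 for edges in combinations(possible_edges, n - 1)
               if is_acyclic(n, edges))


for n in range(2, 7):
    assert count_labeled_trees(n) == n ** (n - 2)
```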

REFERENCES

1. J. A. Bondy and U. S. R. Murty, Graph Theory with Applications, New York: North-Holland, 1979.
2. K. P. Bogart, Introductory Combinatorics, Boston: Pitman, 1983.
3. L. L. Cavalli-Sforza and A. W. F. Edwards, Analysis of Human Evolution, in Genetics Today, Proceedings of the XI International Congress of Genetics, The Hague, The Netherlands, September 1963, Vol. 3, ed. S. J. Geerts, Oxford: Pergamon, 1965.
4. J. Felsenstein, Inferring Phylogenies, Sunderland, MA: Sinauer Associates, 2004.