Phylogenetic Trees Lecture 12. Section 7.4, in Durbin et al., 6.5 in Setubal et al. Shlomo Moran, Ilan Gronau



1 Phylogenetic Trees Lecture 12 Section 7.4, in Durbin et al., 6.5 in Setubal et al. Shlomo Moran, Ilan Gronau.

2 Maximum Parsimony
Last week we presented the Fitch algorithm for (unweighted) Maximum Parsimony:
Input: a rooted binary tree with characters at the leaves.
Output: an assignment of characters to the internal vertices that minimizes the number of mutations.
Some mutations may be more probable than others. Hence a natural generalization of the Maximum Parsimony problem is Weighted Parsimony, defined next.

3 Weighted Parsimony (Sankoff's algorithm)
Weighted Parsimony score:
Input: a tree with characters at the leaves, and a weight function on the mutations: $c(a,b)$ is the weight of the mutation $a \to b$.
Output: an assignment of characters to the internal vertices that minimizes the total weight of the mutations.
The weighted parsimony score reduces to the (unweighted) parsimony score when $c(a,a)=0$ and $c(a,b)=1$ for all $b \ne a$.

4 Weighted Parsimony on a Given Tree
Each position is independent and is computed by itself. Use dynamic programming: if $i$ is a node with children $j$ and $k$, then
$S(i,a) = \min_{b}\,(S(j,b)+c(a,b)) + \min_{b'}\,(S(k,b')+c(a,b'))$,
where $S(j,b)$ is the optimal score of the subtree rooted at $j$ when $j$ has the character $b$.
[Figure: node $i$ with score $S(i,a)$ and its children $j$, $k$ with scores $S(j,b)$, $S(k,b')$]

5 Evaluating Parsimony Scores
Dynamic programming on a given tree.
Initialization: for each leaf $i$, set $S(i,a) = 0$ if $i$ is labeled by $a$, and $S(i,a) = \infty$ otherwise.
Iteration: for each node $i$ with children $j$ and $k$: $S(i,a) = \min_{x}\,(S(j,x)+c(a,x)) + \min_{y}\,(S(k,y)+c(a,y))$.
Termination: the cost of the tree is $\min_{x} S(r,x)$, where $r$ is the root.
Comment: to reconstruct an optimal assignment, we need to keep, in each node $i$ and for each character $a$, two characters $x, y$ that minimize the cost when $i$ has character $a$.
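The three steps above (initialization, iteration, termination) can be sketched in Python for a single character. This is a minimal illustration and not the lecture's code: the tree encoding (a `children` dictionary), the function name `sankoff`, and the toy tree are all assumptions made for the example.

```python
import math


def sankoff(children, leaf_char, alphabet, cost, root):
    """Return (optimal weighted parsimony score, table S) for one character.

    children:  dict mapping each internal node to its pair of children (j, k)
    leaf_char: dict mapping each leaf to its observed character
    cost:      function cost(a, b) giving the weight of the mutation a -> b
    (all names are hypothetical, chosen for this sketch)
    """
    S = {}

    def fill(i):
        if i not in children:  # leaf: initialization step
            S[i] = {a: 0.0 if leaf_char[i] == a else math.inf for a in alphabet}
            return
        j, k = children[i]
        fill(j)
        fill(k)
        # iteration step: best child characters x and y for each character a at i
        S[i] = {
            a: min(S[j][x] + cost(a, x) for x in alphabet)
            + min(S[k][y] + cost(a, y) for y in alphabet)
            for a in alphabet
        }

    fill(root)
    return min(S[root].values()), S  # termination step


def unit_cost(a, b):
    # c(a,a) = 0 and c(a,b) = 1: reduces to the unweighted parsimony score
    return 0 if a == b else 1


# Toy tree (an assumption): root r with children u, v; four leaves L1..L4.
children = {"r": ("u", "v"), "u": ("L1", "L2"), "v": ("L3", "L4")}
leaf_char = {"L1": "A", "L2": "C", "L3": "A", "L4": "A"}
score, table = sankoff(children, leaf_char, "ACGT", unit_cost, "r")
print(score)
```

The table `S` holds $k$ values per node and each value takes $O(k)$ work, which matches the $O(nk^2)$ bound. Recovering an optimal assignment would additionally require storing the minimizing $x$ and $y$, as the comment on the slide notes.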

6 Cost of Evaluating Parsimony for binary trees
For a tree with $n$ nodes and a single character with $k$ values, the complexity is $O(nk^2)$. When there are $m$ such characters, it is $O(nmk^2)$.

7 1st problem with the Maximum Parsimony Approach: Inconsistency
Maximum Parsimony and Perfect Phylogeny are reasonable assumptions for the evolution of significant characters (i.e., characters whose states are unlikely to be created twice during the evolution process). They are less reasonable when the characters are DNA (or protein) residues, as depicted next.

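8 A possible DNA Sequence Evolution
[Figure: a tree of DNA sequences along a time axis (-3 mil yrs, -2 mil yrs, -1 mil yrs, today). Sequences shown, in extraction order: AAGGCCT, AAGACTT, TGGACTT, AGGGCAT, TAGCCCT, AGCACTT, AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT. Source: Tandy Warnow]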

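9 Reversal and convergence may happen
[Figure: a tree of DNA sequences along a time axis (-3 mil yrs, -2 mil yrs, -1 mil yrs, today). Sequences shown, in extraction order: AAGGCCT, AAGACTT, TGGACTT, AGGGCAT, TAGCCCT, AGCACTT, AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT. Source: Tandy Warnow]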

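10 The reconstruction task
Given sequences (listed in the order of the labels U, V, W, X, Y): AGGGCAT, TAGCCCA, TAGACTT, TGCACAA, TGCGCTT.
[Figure: the tree to be reconstructed, with leaves in the order U, X, Y, V, W. Source: Tandy Warnow]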

11 2nd problem with Maximum Parsimony (and other character-based algorithms): Efficiency
No efficient algorithms are known for solving the big problem (finding an optimal tree, as opposed to scoring a given tree) for maximum parsimony or perfect phylogeny; both are known to be NP-hard. Mainly for this reason, the most widely used approaches for solving the big problem are distance-based methods.

12 Distance-based Methods for Constructing Phylogenies
This approach attempts to overcome the two weaknesses of maximum parsimony:
1. The distances are derived from a well-defined statistical model of evolution (which allows reversals and convergences).
2. It provides efficient algorithms for the big problem.
Basic idea: the differences between species (usually represented by sequences of characters) are transformed into numerical distances, and a tree realizing these distances is constructed.

13 Distance-Based Reconstruction
Compute distances between all taxon pairs.
Find an edge-weighted tree that best describes the distances.
[Figure: a distance matrix D; entries not recoverable]

14 Data $\to$ Distances $\to$ Trees
1. Modeling question: given the data (e.g., DNA sequences of the taxa), how do we define distances between taxa?
2. Algorithmic question: decide if the distances define a tree (ultrametric or additive, to be defined later), and if so, construct that tree.
3. In reality, the computed distances are noisy, so we need the algorithm to return a tree which approximates the distances of the input data.
In the following we shall study items 2 and 1, and briefly discuss item 3.

15 Ultrametric and Tree Metric
A distance metric on a set $M$ of $L$ objects is a function $d: M \times M \to \mathbb{R}^+$ (represented by a symmetric matrix) satisfying:
$d(i,i) = 0$, and for $i \ne j$, $d(i,j) > 0$.
$d(i,j) = d(j,i)$.
For all $i,j,k$ it holds that $d(i,k) \le d(i,j) + d(j,k)$.
A metric is ultrametric if it corresponds to distances between leaves of a tree which admits a molecular clock. It is a tree metric, or additive, if it corresponds to distances between nodes in a weighted tree.
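The three metric axioms above can be checked directly on a matrix. The sketch below is only an illustration: the function name `is_metric`, the list-of-lists matrix representation, and the tolerance parameter are assumptions, not part of the lecture.

```python
def is_metric(d, tol=1e-12):
    """Check the metric axioms for a square matrix d given as a list of lists."""
    n = len(d)
    for i in range(n):
        if d[i][i] != 0:  # d(i,i) = 0
            return False
        for j in range(i + 1, n):
            # positivity off the diagonal, and symmetry
            if d[i][j] <= 0 or d[i][j] != d[j][i]:
                return False
    # triangle inequality for all ordered triples (with a small tolerance)
    return all(
        d[i][k] <= d[i][j] + d[j][k] + tol
        for i in range(n)
        for j in range(n)
        for k in range(n)
    )


print(is_metric([[0, 1, 2], [1, 0, 1], [2, 1, 0]]))
print(is_metric([[0, 1, 3], [1, 0, 1], [3, 1, 0]]))
```

The second matrix fails because $d(0,2) = 3 > d(0,1) + d(1,2) = 2$.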

16 1st model: Molecular Clock and Ultrametric Trees
The molecular clock assumes a constant rate of evolution. Namely, the distance from a speciation event to the formation of the current species is proportional to time, and hence is identical for all paths (in reality this assumption does not hold). A directed tree satisfying this property is called ultrametric.

17 Ultrametric trees
Definition: an ultrametric tree is a rooted weighted tree all of whose leaves are at the same (weighted) depth.
Basic property: define the height of the leaves to be 0. Then the edge weights can be represented by the heights of the internal vertices.
[Figure: an ultrametric tree on the leaves A, E, D, B, C (height 0), annotated with edge weights and internal-vertex heights; most values not recoverable]

18 Least Common Ancestor and distances in Ultrametric Tree
Let LCA(i,j) denote the least common ancestor of leaves i and j. Let height(LCA(i,j)) be its distance from the leaves, and dist(i,j) be the distance from i to j.
Observation: for any pair of leaves i, j in an ultrametric tree: height(LCA(i,j)) = ½ dist(i,j).
[Figure: an ultrametric tree on the leaves A, E, D, B, C and its matrix of LCA heights; most entries not recoverable]

19 Ultrametric Matrices
Definition: a distance matrix* $U$ of dimension $L \times L$ is ultrametric iff for every three indices $i, j, k$: $U(i,j) \le \max\{U(i,k), U(j,k)\}$.
Theorem: the following conditions are equivalent for an $L \times L$ distance matrix $U$:
1. $U$ is an ultrametric matrix.
2. There is an ultrametric tree with $L$ leaves such that for each pair of leaves $i,j$: $U(i,j) = \mathrm{height}(\mathrm{LCA}(i,j)) = \tfrac{1}{2}\,\mathrm{dist}(i,j)$.
* Recall: a distance matrix is a symmetric matrix with positive off-diagonal entries and zero diagonal entries that satisfies the triangle inequality.
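Applying the definition to all three orderings of a triple shows that the largest of the three pairwise values must be attained at least twice. A minimal Python check of this three-point condition follows; the name `is_ultrametric`, the tolerance, and the example matrices are assumptions made for illustration, and the input is assumed to already be a distance matrix.

```python
from itertools import combinations


def is_ultrametric(U, tol=1e-9):
    """Three-point condition: in every triple the two largest values are equal."""
    for i, j, k in combinations(range(len(U)), 3):
        smallest, middle, largest = sorted((U[i][j], U[i][k], U[j][k]))
        if largest - middle > tol:
            return False
    return True


print(is_ultrametric([[0, 2, 6], [2, 0, 6], [6, 6, 0]]))
print(is_ultrametric([[0, 2, 5], [2, 0, 6], [5, 6, 0]]))
```

The check examines every triple once, so it runs in $O(L^3)$ time.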

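20 Ultrametric tree $\Rightarrow$ Ultrametric matrix
Suppose there is an ultrametric tree such that $U(i,j) = \tfrac{1}{2}\,\mathrm{dist}(i,j)$. Then $U$ is an ultrametric matrix: by the properties of least common ancestors in trees, any three leaves can be labeled $i, j, k$ so that $U(k,i) = U(j,i) \ge U(k,j)$.
[Figure: three leaves $k$, $j$, $i$, where LCA(k,j) lies below the common ancestor shared with $i$]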

21 Ultrametric matrix $\Rightarrow$ Ultrametric tree
We start with a definition and two observations.
Definition: let $U$ be an $L \times L$ matrix, and let $S \subseteq \{1,\dots,L\}$. $U[S]$ is the submatrix of $U$ consisting of the rows and columns with indices from $S$.
Observation 1: $U$ is ultrametric iff for every $S \subseteq \{1,\dots,L\}$, $U[S]$ is ultrametric.
Observation 2: if $U$ is ultrametric and $\max_{i,j} U(i,j) = M$, then $M$ appears in every row of $U$. Indeed, if $U(i,j) = M$, then for every row $k$ we have $M = U(i,j) \le \max\{U(k,i), U(k,j)\} \le M$, so one of $U(k,i)$, $U(k,j)$ must be $M$.

22 Ultrametric matrix $\Rightarrow$ Ultrametric tree: proof by induction
Claim: if $U$ is an ultrametric matrix, then $U$ has an ultrametric tree. The proof is by induction on $L$, the size of $U$.
Basis: for $L = 1$, $T$ is a single leaf $i$. For $L = 2$, $T$ is a tree with two leaves $i$ and $j$, whose root is labeled $U(i,j)$.

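23 Induction step
Induction step: $L > 2$. Use the 1st row to split the set $\{1,\dots,L\}$ into two subsets:
$S_1 = \{i : U(1,i) = M\}$, $S_2 = \{1,\dots,L\} \setminus S_1$ (note: $0 < |S_i| < L$).
Example (the matrix figure is not recoverable): $S_1 = \{2,4\}$, $S_2 = \{1,3,5\}$.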

24 Induction step
By Observation 1, $U[S_1]$ and $U[S_2]$ are ultrametric. By induction, there is a tree $T_1$ for $S_1$ with a root labeled $M_1 \le M$, and a tree $T_2$ for $S_2$ with a root labeled $M_2 < M$ ($M_2$ is the 2nd largest element in row 1; if $M_2 = 0$ then $T_2$ is a leaf).
Join $T_1$ and $T_2$ into a tree $T$ with a root labeled $M$.
[Figure: the construction when $M_1 = M$: $T_2$, with root label $M_2 < M$, hangs from the root of $T_1$ by an edge of weight $M - M_2$]

25 Proof (end)
Need to prove: $T$ is an ultrametric tree for $U$, i.e., $U(i,j)$ is the label of the LCA of $i$ and $j$ in $T$.
If $i$ and $j$ are in the same subtree, this holds by induction.
Otherwise the LCA of $i$ and $j$ is the root, labeled $M$ (since they are in different subtrees). Also, [$U(1,i) = M$ and $U(1,j) < M$] $\Rightarrow$ $U(i,j) = M$.
[Figure: leaves $i$ in $T_1$ and $j$ in $T_2$, with LCA labeled $M$]
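The inductive proof is constructive, and it can be turned into a short recursive procedure. The following Python sketch is one possible reading of it, under several assumptions that are not in the slides: a tree is a pair `(height, children)`, a leaf is `(0, index)`, the first index of the current set plays the role of row 1, and matrix entries are compared with exact equality (so the example uses integers).

```python
def build(U, S):
    """Build an ultrametric tree for the indices in list S (hypothetical encoding)."""
    if len(S) == 1:
        return (0, S[0])  # a leaf at height 0
    first = S[0]
    M = max(U[first][i] for i in S)              # M appears in every row (Observation 2)
    S1 = [i for i in S if U[first][i] == M]
    S2 = [i for i in S if U[first][i] != M]      # contains `first`, so it is non-empty
    T1, T2 = build(U, S1), build(U, S2)
    if T1[0] == M:                               # M_1 = M: hang T2 from the root of T1
        return (M, [T2] + T1[1])
    return (M, [T2, T1])                         # M_1 < M: new root labeled M


def leaves(t):
    height, rest = t
    return [rest] if height == 0 else [x for child in rest for x in leaves(child)]


def lca_height(t, i, j):
    """Label (height) of the least common ancestor of leaves i and j in t."""
    height, rest = t
    if height == 0:
        return 0
    for child in rest:
        below = leaves(child)
        if i in below and j in below:
            return lca_height(child, i, j)
    return height


# Example matrix of LCA heights on five objects (an assumption for illustration).
U = [
    [0, 8, 8, 5, 3],
    [8, 0, 3, 8, 8],
    [8, 3, 0, 8, 8],
    [5, 8, 8, 0, 5],
    [3, 8, 8, 5, 0],
]
tree = build(U, list(range(5)))
print(tree)
```

Checking `lca_height(tree, i, j) == U[i][j]` for all pairs confirms, on this example, the statement proved on the slide. The sketch favors clarity over speed and is not meant as an efficient algorithm.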

26 Efficient Algorithms for Constructing Ultrametric Trees
Input: a distance matrix over a set S.
Output: an ultrametric tree on the objects in S. (Note: we want our algorithm to be defined for all input metrics.)
Requirements:
Consistency: if the input matrix is ultrametric, then the algorithm should return the corresponding tree (there is only one).
Robustness: if the input matrix is not ultrametric, the algorithm should return an ultrametric tree close to it.
In this course we'll concentrate on the 1st requirement.

27 Reconstructing Ultrametrics: UPGMA Clustering
UPGMA: Unweighted Pair Group Method using Averages.
Input: a distance matrix over a set of species S.
Output: an ultrametric phylogenetic tree on S.
Outline:
Initialization: each object is a cluster. Place all clusters at height zero.
At each iteration, combine the two closest clusters into a new one, update the distances to the new cluster, and continue.
This clustering algorithm is used in many other applications, such as data mining.

28 UPGMA
While (#clusters > 1) do:
  Choose a cluster pair $i, j$ as neighbors, s.t. $D(i,j) = \min_{i' \ne j'} D(i',j')$.
  Connect $i, j$ to a new cluster $v$.
  Replace in $D$ the pair $i, j$ by $v$, and reduce $D$: for $k \ne v$, $D(v,k) = \alpha D(i,k) + (1-\alpha) D(j,k)$, where $\alpha = \frac{|C_i|}{|C_i| + |C_j|}$.
Note: this reduction formula guarantees that the distance between clusters $C_i$ and $C_j$ is the average of the distances between the elements of the two clusters:
$d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p,q)$.
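The loop above can be written out as a naive $O(n^3)$ Python sketch. The encoding is an assumption made for illustration: clusters are integer ids, a tree is a nested tuple `(left, right, height)`, and the height of a new node is taken to be half the distance between the joined clusters, following the convention that leaf distances are twice the LCA height.

```python
def upgma(D, names):
    """Naive UPGMA; returns a nested tuple (left, right, height). A sketch only."""
    n = len(D)
    size = {i: 1 for i in range(n)}
    tree = {i: names[i] for i in range(n)}
    d = {i: {j: D[i][j] for j in range(n) if j != i} for i in range(n)}
    next_id = n
    while len(d) > 1:
        # choose a closest pair of clusters
        i, j = min(
            ((a, b) for a in d for b in d[a] if a < b),
            key=lambda p: d[p[0]][p[1]],
        )
        alpha = size[i] / (size[i] + size[j])
        v, next_id = next_id, next_id + 1
        # reduction: D(v,k) = alpha*D(i,k) + (1-alpha)*D(j,k)
        new_row = {
            k: alpha * d[i][k] + (1 - alpha) * d[j][k] for k in d if k not in (i, j)
        }
        tree[v] = (tree[i], tree[j], d[i][j] / 2)
        size[v] = size[i] + size[j]
        del d[i], d[j]
        for k, value in new_row.items():
            del d[k][i], d[k][j]
            d[k][v] = value
        d[v] = new_row
    (root,) = d
    return tree[root]


# Example: distances equal to twice the LCA heights of an ultrametric tree.
D = [
    [0, 16, 16, 10, 6],
    [16, 0, 6, 16, 16],
    [16, 6, 0, 16, 16],
    [10, 16, 16, 0, 10],
    [6, 16, 16, 10, 0],
]
result = upgma(D, "ABCDE")
print(result)
```

On this input the sketch joins A with E and B with C at height 3, then D with {A, E} at height 5, and finally the two remaining clusters at height 8. Ties are broken by iteration order, which is an arbitrary choice of this sketch.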

29 Reduction Formulas in Closest Pair Clustering Algorithms
The reduction formula (computing distances from new clusters) of UPGMA has several variants, for instance, for $k \ne v$:
$D(v,k) = \tfrac{1}{2}(D(i,k) + D(j,k))$ (WPGMA), or
$D(v,k) = \min\{D(i,k), D(j,k)\}$ (single linkage).
Both of these reductions preserve the consistency of the algorithm. It is known (from simulations) that the chosen reduction formula may have a significant effect on the robustness of the algorithm to noise.

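30 Example
UPGMA construction on five objects. The length of an edge = its (vertical) height. $d(i,j)$ is the distance between the leaves of $C_i$ and $C_j$.
[Figure: distance matrix on A, B, C, D, E and the resulting UPGMA tree with internal clusters F, G, H, I; the reduction step shown expresses $d(H,G)$ as a weighted average of $d(F,G)$ and $d(D,G)$. Numeric values not recoverable]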

31 Consistency of UPGMA
Proposition: if the input distances are ultrametric, then UPGMA will reconstruct the corresponding ultrametric tree T.
Proof sketch: by induction on the number of iterations, show that the distance between two clusters is twice the height of the LCA of the corresponding subtrees.

32 Complexity of UPGMA
Naïve implementation: $n$ iterations, $O(n^2)$ time for each iteration (to find a closest pair): $O(n^3)$ total.
Constructing heaps for each row and updating them in each iteration: $O(n^2 \log n)$ total.
Optimal implementation: $O(n^2)$ total time. One such implementation, using mutual nearest neighbors, is presented next.

33 The Nearest Neighbor Algorithm
Let $d$ be a distance metric.
$j$ is a nearest neighbor (NN) of $i$ if $j \ne i$ and $d(i,j) = \min\{d(i,k) : k \ne i\}$.
$(i,j)$ are mutual nearest neighbors if $i$ is an NN of $j$ and $j$ is an NN of $i$. In other words, if $d(i,j) \le \min\{d(i,k), d(j,k) : k \ne i,j\}$.

34 Ultrametric Reconstruction by the Nearest Neighbor Chains algorithm
$n-1$ iterations, each joining a pair of neighbors:
While (#clusters > 1) do:
  Choose a cluster pair $i, j$ that are mutual nearest neighbors.
  Connect $i, j$ to a new cluster $v$.
  Replace $i, j$ with the cluster $v$, and reduce the distance matrix $D$: for $k \ne v$, $D(v,k) = \alpha D(i,k) + (1-\alpha) D(j,k)$.
$\alpha = \frac{|C_i|}{|C_i|+|C_j|}$ in UPGMA, but the algorithm is consistent for any $0 \le \alpha \le 1$ (i.e., if the reduction is convex).

35 $\Theta(n^2)$ implementation of NN chains
Finding mutual nearest neighbors in $O(n^2)$ total time:
Complete NN chain: $i_0, i_1, \dots, i_l$, where $i_{r+1}$ is a nearest neighbor of $i_r$ and the final pair $(i_{l-1}, i_l)$ are mutual nearest neighbors.
To extend the chain: find a minimal entry $(i_r, i_{r+1})$ in row $i_r$ of $D$; then $i_{r+1}$ is a nearest neighbor of $i_r$. Stop if $(i_r, i_{r+1})$ is also minimal in row $i_{r+1}$ (i.e., $i_r$ and $i_{r+1}$ are mutual nearest neighbors). Otherwise, continue.
[Figure: the matrix $D$ and a chain $i_0, i_1, i_2, \dots$ ending in a mutual NN pair]

36 $\Theta(n^2)$ implementation of NN chains (cont.)
A $\Theta(n^2)$ implementation using Nearest Neighbor Chains:
- Extend a chain until it is complete.
- Select the final pair for joining, and remove them from the chain.
Note: if the reduction formula is convex, then for each $v$, if before the reduction $NN(v) \notin \{i,j\}$, then after the reduction $NN(v) \ne k$, where $k$ denotes the new cluster. Hence the remaining chain is still an NN chain.

37 $O(n^2)$ implementation of NN chains (cont.)
Complexity analysis: count the number of row-minimum calculations (each taking $O(n)$ time):
- $n-1$ terminations throughout the execution.
- $2(n-1)$ edge deletions, hence $2(n-1)$ extensions.
- Total for NN chain operations: $O(n^2)$.
- Updates: $O(n)$ in each iteration, $O(n^2)$ in total.
- Altogether $O(n^2)$.
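The chain-based procedure can be sketched in Python as follows. This is an illustrative reading and not the lecture's implementation: the dictionary-of-rows matrix, the integer cluster ids, the returned list of merges, and the tie-breaking rule (prefer the previous chain element, so that the chain is guaranteed to stop) are all assumptions. The UPGMA value of $\alpha$ is used for the reduction.

```python
def nn_chain(D):
    """Nearest-neighbor-chain clustering; returns a list of (i, j, height) merges."""
    n = len(D)
    d = {i: {j: float(D[i][j]) for j in range(n) if j != i} for i in range(n)}
    size = {i: 1 for i in range(n)}
    chain, merges, next_id = [], [], n
    while len(d) > 1:
        if not chain:
            chain.append(next(iter(d)))  # start a new chain at any cluster
        while True:
            top = chain[-1]
            prev = chain[-2] if len(chain) > 1 else None
            # row minimum of `top`; on ties prefer `prev` so the chain stops
            nn = min(d[top], key=lambda k: (d[top][k], k != prev))
            if nn == prev:
                break  # the last two chain elements are mutual nearest neighbors
            chain.append(nn)
        j, i = chain.pop(), chain.pop()
        alpha = size[i] / (size[i] + size[j])
        v, next_id = next_id, next_id + 1
        merges.append((i, j, d[i][j] / 2))
        # convex reduction: D(v,k) = alpha*D(i,k) + (1-alpha)*D(j,k)
        new_row = {
            k: alpha * d[i][k] + (1 - alpha) * d[j][k] for k in d if k not in (i, j)
        }
        del d[i], d[j]
        for k, value in new_row.items():
            del d[k][i], d[k][j]
            d[k][v] = value
        d[v] = new_row
        size[v] = size[i] + size[j]
    return merges


D = [
    [0, 16, 16, 10, 6],
    [16, 0, 6, 16, 16],
    [16, 6, 0, 16, 16],
    [10, 16, 16, 0, 10],
    [6, 16, 16, 10, 0],
]
merges = nn_chain(D)
print(merges)
```

Each pass through the inner loop is one row-minimum computation costing $O(n)$, and each merge updates one row and one column, which is where the $O(n^2)$ total in the analysis above comes from. New clusters receive the ids 5, 6, 7, ... in order of creation.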

38 Consistency of NN Chains
Proposition: if the input distances are ultrametric, then NN chains will reconstruct the corresponding ultrametric tree T.
Proof sketch: similar to that of UPGMA.
Note: it can be shown that on all inputs, NN chains will produce the same output as UPGMA.

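39 Ultrametric vs. general trees
NN chain (and UPGMA) construct ultrametric trees even if the distances are defined by non-ultrametric trees.
[Figure: a non-ultrametric tree and the tree produced from its distances by NN chain; edge weights not recoverable]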

40 Tree Metric (aka Additive Distances)
A distance metric on a set $M$ of $L$ objects is a function $d: M \times M \to \mathbb{R}^+$ (represented by a symmetric matrix) satisfying:
$d(i,i) = 0$, and for $i \ne j$, $d(i,j) > 0$.
$d(i,j) = d(j,i)$.
For all $i,j,k$ it holds that $d(i,k) \le d(i,j) + d(j,k)$.
If there is a weighted tree which realizes these distances, then the distances form a tree metric.

41 Additive Distances (cont.)
Definition: a distance metric on a set $M$ with $L$ objects is additive if there is a tree $T$ with positive weights on the edges, $L$ of whose nodes correspond to the $L$ objects, such that for all $i,j$, $d(i,j) = d_T(i,j)$, the length of the path from $i$ to $j$ in $T$.
Note: sometimes the tree is required to be binary, and then the edge weights are only required to be non-negative.

42 Distances on three objects are additive
For $L = 3$: there is always a (unique) tree with one internal node $m$, joined to $i$, $j$, $k$ by edges of weights $a$, $b$, $c$:
      i     j     k
i     0    a+b   a+c
j           0    b+c
k                 0
$d(i,j) = a + b$, $d(i,k) = a + c$, $d(j,k) = b + c$.
For instance, $c = d(k,m) = \tfrac{1}{2}[d(i,k) + d(j,k) - d(i,j)]$.
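Solving the three equations for $a$, $b$ and $c$ gives one formula per edge, each of the same shape as the one for $c$. A small Python sketch (the function name `star_tree` and the sample distances are assumptions for illustration):

```python
def star_tree(d_ij, d_ik, d_jk):
    """Edge weights (a, b, c) of the tree with one internal node m on i, j, k."""
    a = (d_ij + d_ik - d_jk) / 2  # weight of the edge (i, m)
    b = (d_ij + d_jk - d_ik) / 2  # weight of the edge (j, m)
    c = (d_ik + d_jk - d_ij) / 2  # weight of the edge (k, m)
    return a, b, c


a, b, c = star_tree(5, 6, 7)
print(a, b, c)
```

The triangle inequality guarantees that all three weights are non-negative; a weight of zero means the corresponding object coincides with $m$.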

43 How about four objects?
$L = 4$: not all distance metrics on 4 objects are additive; e.g., there is no tree which realizes the distances below.
[Distance matrix on $i, j, k, l$; most entries not recoverable]

44 The Four Points Condition
A necessary condition for distances on four objects to be additive: the objects can be labeled $i,j,k,l$ so that
$d(i,k) + d(j,l) = d(i,l) + d(k,j) \ge d(i,j) + d(k,l)$.
In this case $\{\{i,j\},\{k,l\}\}$ is a split of $\{i,j,k,l\}$.
Proof: by the figure.
[Figure: a tree on the leaves $i, j, k, l$ with the split $\{\{i,j\},\{k,l\}\}$]

45 The Four Points Condition
Definition: a distance metric satisfies the four points condition iff any subset of four objects can be labeled $i,j,k,l$ so that
$d(i,k) + d(j,l) = d(i,l) + d(k,j) \ge d(i,j) + d(k,l)$.
[Figure: a tree on the leaves $i, j, k, l$ with the split $\{\{i,j\},\{k,l\}\}$]

46 The Four Points Condition
Theorem: the following 3 conditions are equivalent for a distance matrix $D$ on a set $M$ of $L$ objects:
1. $D$ is additive.
2. $D$ satisfies the four points condition for all quartets in $M$.
3. There is an object $r$ in $M$ s.t. $D$ satisfies the four points condition for all quartets that include $r$.
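For a quartet, the four points condition says that among the three pairwise sums, the two largest are equal. The Python sketch below tests condition 2 (all quartets) or condition 3 (only quartets containing a chosen $r$). The function names, the tolerance, and the two example matrices are assumptions made for illustration, and the input is assumed to already be a distance matrix.

```python
from itertools import combinations


def four_points_ok(d, i, j, k, l, tol=1e-9):
    """True if the two largest of the three pairwise sums are equal."""
    sums = sorted((d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]))
    return sums[2] - sums[1] <= tol


def is_additive(d, r=None):
    """Check all quartets (r=None) or only the quartets that include object r."""
    n = len(d)
    if r is None:
        quartets = combinations(range(n), 4)
    else:
        others = [x for x in range(n) if x != r]
        quartets = ((r,) + triple for triple in combinations(others, 3))
    return all(four_points_ok(d, *q) for q in quartets)


# Distances of a tree with leaf edges 1, 2, 3, 4 and an internal edge 5 (split 01|23).
tree_like = [[0, 3, 9, 10], [3, 0, 10, 11], [9, 10, 0, 7], [10, 11, 7, 0]]
# Shortest-path distances on a 4-cycle: a metric that is not additive.
cycle = [[0, 1, 2, 1], [1, 0, 1, 2], [2, 1, 0, 1], [1, 2, 1, 0]]
print(is_additive(tree_like), is_additive(tree_like, r=0), is_additive(cycle))
```

Checking only the quartets through $r$ costs $O(L^3)$ instead of $O(L^4)$, which is the practical content of condition 3.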

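47 The Four Points Condition
Proof: we'll show that 1 $\Rightarrow$ 2 $\Rightarrow$ 3 $\Rightarrow$ 1.
1 $\Rightarrow$ 2 (additivity $\Rightarrow$ the 4P condition is satisfied by all quartets): by the figure.
2 $\Rightarrow$ 3: trivial.
[Figure: a tree on the leaves $i, j, k, l$ with the split $\{\{i,j\},\{k,l\}\}$]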

48 Proof that 3 $\Rightarrow$ 1
4PC on all quartets which include $r$ $\Rightarrow$ additivity. Induction on the number of objects, $L$.
For $L \le 3$ the condition is trivially true and a tree exists.
For $L = 4$: consider 4 points which satisfy $d(i,k) + d(j,l) = d(i,l) + d(j,k) \ge d(i,j) + d(k,l)$. We will construct a tree $T$ with 4 leaves, s.t. $d_T(x,y) = d(x,y)$ for each pair $x,y$ in $\{i,j,k,l\}$.
[Figure: a tree with leaves $i, j, k, l$ and internal vertices $m, n$; edge labels only partly recoverable]

49 Tree construction for L=4
Assume the split $\{\{i,j\},\{k,l\}\}$: $d(i,j) + d(k,l) \le d(j,k) + d(i,l)$.
1. Construct a tree for $\{i,j,k\}$, with internal vertex $m$.
2. Construct a tree for $\{i,k,l\}$, by adding the vertex $n$ and the edge $(n,l)$.
The construction guarantees that $d_T(x,y) = d(x,y)$ for all $(x,y)$ except $(j,l)$.
[Figure: the tree with leaves $i, j, k, l$ and internal vertices $m, n$]

50 Tree construction for L=4
$d_T(x,y) = d(x,y)$ for all $(x,y)$ except $(j,l)$. Thus, since $d_T(i,j) + d_T(k,l) \le d_T(j,k) + d_T(i,l)$, $\{\{i,j\},\{k,l\}\}$ is a split of the tree $T$.
By the proof that 1 $\Rightarrow$ 2, the tree $T$ satisfies $d_T(j,l) = d_T(i,l) + d_T(j,k) - d_T(i,k)$. Hence
$d(j,l) = d(i,l) + d(j,k) - d(i,k) = d_T(i,l) + d_T(j,k) - d_T(i,k) = d_T(j,l)$,
and so $d_T(x,y) = d(x,y)$ for all $x,y$.

51 Corollary from the construction
Corollary F: if $d(i,k) + d(j,l) = d(i,l) + d(j,k) \ge d(i,j) + d(k,l)$, then there is a unique tree which realizes all the distances except $d(j,l)$, and this tree also realizes the distance $d(j,l)$.*
* $(j,l)$ can be replaced by any pair in $\{i,j\} \times \{k,l\}$.

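52 Induction step for L>4
For each pair of labeled nodes $(i,j)$ in $T$, let $c_{ij}$ be the distance from $r$ to $m_{ij}$, the vertex where the paths from $r$ to $i$ and from $r$ to $j$ diverge:
$c_{ij} = \tfrac{1}{2}[d(i,r) + d(j,r) - d(i,j)]$.
Pick $i$ and $j$ that maximize $c_{ij}$.
[Figure: the path from $r$ to $m_{ij}$, of length $c_{ij}$, branching to $i$ and $j$]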
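Choosing the pair that maximizes $c_{ij}$ is a direct computation on the distance matrix. A minimal Python sketch follows; the function names `c` and `pick_pair` and the example matrix are assumptions for illustration, and ties are broken by enumeration order.

```python
from itertools import combinations


def c(d, r, i, j):
    """c_ij = (d(i,r) + d(j,r) - d(i,j)) / 2: distance from r to the fork m_ij."""
    return (d[i][r] + d[j][r] - d[i][j]) / 2


def pick_pair(d, r):
    """Return a pair (i, j), both different from r, that maximizes c_ij."""
    others = [x for x in range(len(d)) if x != r]
    return max(combinations(others, 2), key=lambda pair: c(d, r, *pair))


# Tree distances: leaf edges 1, 2, 3, 4 and an internal edge 5 (split 01|23).
d = [[0, 3, 9, 10], [3, 0, 10, 11], [9, 10, 0, 7], [10, 11, 7, 0]]
i, j = pick_pair(d, r=0)
print(i, j, c(d, 0, i, j))
```

With $r = 0$ the chosen pair is the two leaves on the far side of the internal edge, whose fork lies at distance 1 + 5 = 6 from $r$.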

53 Induction step
Construct (by induction) $T$ on $M \setminus \{i\}$. Add $i$ (and possibly $m_{ij}$) to $T$, as in the figure. Then $d(i,r) = d_T(i,r)$ and $d(j,r) = d_T(j,r)$.
Remains to prove: for each $k \notin \{r,j\}$ it holds that $d(i,k) = d_T(i,k)$.
[Figure: the tree $T$ with $r$, the fork $m_{ij}$ at distance $c_{ij}$ from $r$, and the leaves $i$ and $j$]

54 Induction step (cont.)
Let $k \ne i, r$ be an arbitrary node in $T$. The maximality of $c_{ij}$ means that $\{\{r,k\},\{i,j\}\}$ is a split of $\{i,j,k,r\}$. Thus, by Corollary F, since $d(x,y) = d_T(x,y)$ for each $x,y$ in $\{i,j,k,r\}$ except possibly $d(k,i)$, we also have $d(k,i) = d_T(k,i)$.
[Figure: the tree $T$ with $r$, $k$, the fork $m_{ij}$, and the leaves $i$ and $j$]
