Distance based tree reconstruction. Hierarchical clustering (UPGMA) Neighbor-Joining (NJ)

Size: px

Start display at page:

Download "Distance based tree reconstruction. Hierarchical clustering (UPGMA) Neighbor-Joining (NJ)"

Ambrose Jennings
5 years ago
Views:

1 Distance based tree reconstruction Hierarchical clustering (UPGMA) Neighbor-Joining (NJ)

2 All organisms have evolved from a common ancestor. Infer the evolutionary tree (tree topology and edge lengths) from molecular data. A gene tree inferred from a homologous gene from different species might be different from the species tree.

3 Distance based reconstruction Goal: Reconstruct an evolutionary tree from a distance matrix Input: n x n distance matrix d Output: weighted tree T with n leaves fitting d If d is additive, this problem has an efficient solution (which we will come back to), otherwise we should aim at finding a tree which is as good as possible

4 Distance based reconstruction The ideal case The tree reflects the distance matrix, i.e. the distances induced by the tree are equal to the distances in the matrix If this is possible, then the distance matrix is called additive

5 A reconstruction method Least square methods Given a tree topology T, select edge lengths such that n n Q T = w ij D ij d ij 2 i=1 j=1 is minimized, where dij is given by the matrix, Dij is the distance induced by the tree and the selected edge lengths, and wij is the confidence we have in dij, e.g. wij=1. Optimal edge lengths can be found by standard least square methods in time O(n3). The large version of the problem when the tree is unknown is NP-complete...

6 Counting rooted binary trees

7 Counting rooted binary trees

8 Counting rooted binary trees

9 Hierarchical clustering and UPGMA

10 Hierarchical clustering Input: A set of n species (1,2,...,n) and a distance matrix d giving the pairwise distances. Output: A rooted binary tree T with edge lengths and leaves labelled 1, 2,..., n such that similar species (according to the distance matrix) are grouped in the same subtree, and the tree distances correspond to the distance matrix.

11 Example

12 Example Group the closet two data points.

13 Example In each iteration, we join the closet two cluster.

14 Example

15 Example

16 Hierarchical clustering Input: A set of n species (1,2,...,n) and a distance matrix d giving the pairwise distances. Output: A rooted binary tree T with edge lengths and leaves labelled 1, 2,..., n such that similar species (according to the distance matrix) are grouped in the same subtree, and the tree distances correspond to the distance matrix. Algorithm: Step 1: Let the n species 1,2,...,n be n clusters C1,..,Cn of size 1. Let each cluster correspond to a leaf in a tree T. The distance D(Ci,Cj) between cluster Ci and Cj is d(i,j). Step 2: Pick the pair of cluster Ci and Cj where D(Ci,Cj) is minimized and form a new cluster Ck by merging Ci and Cj, i.e. Ck = Ci U Cj. Make a new internal node Ck in tree T with children corresponding to Ci and Cj. Place internal node Ck at height D(Ci,Cj) / 2. Step 3: Update distance matrix D such that distance between the new cluster Ck and all remaining clusters are computed. Step 4: Repeat Step 2 and 3 until only one cluster remains.

17 Hierarchical clustering Input: A set of n species (1,2,...,n) and a distance matrix d giving the pairwise distances. Output: A rooted binary tree T with edge lengths and leaves labelled 1, 2,..., n such that similar species (according to the distance matrix) are grouped in the same subtree, and the tree distances correspond to the distance matrix. Algorithm: Step 1: Let the n species 1,2,...,n be n clusters C1,..,Cn of size 1. Let each cluster correspond to a leaf in a tree T. The distance D(Ci,Cj) between cluster Ci and Cj is d(i,j). Step 2: Pick the pair of cluster Ci and Cj where D(Ci,Cj) is minimized and form a new cluster Ck by merging Ci and Cj, i.e. Ck = Ci U Cj. Make a new internal node Ck in tree T with children corresponding to Ci and Cj. Place internal node Ck at height D(Ci,Cj) / 2. Step 3: Update distance matrix D such that distance between the new cluster Ck and all remaining clusters are computed. Step 4: Repeat Step 2 and 3 until only one cluster remains. How to define and compute the distances between clusters?

18 Distance measures between clusters

19 Distance measures between clusters

20 Distance measures between clusters

21 Distance measures between clusters

22 Hierarchical clustering (UPGMA) Input: A set of n species (1,2,...,n) and a distance matrix d giving the pairwise distances. Output: A rooted binary tree T with edge lengths and leaves labelled 1, 2,..., n such that similar species (according to the distance matrix) are grouped in the same subtree, and the tree distances correspond to the distance matrix. Algorithm: Step 1: Let the n species 1,2,...,n be n clusters C1,..,Cn of size 1. Let each cluster correspond to a leaf in a tree T. The distance D(Ci,Cj) between cluster Ci and Cj is d(i,j). Step 2: Pick the pair of cluster Ci and Cj where D(Ci,Cj) is minimized and form a new cluster Ck by merging Ci and Cj, i.e. Ck = Ci U Cj. Make a new internal node Ck in tree T with children corresponding to Ci and Cj. Place internal node Ck at height D(Ci,Cj) / 2. Step 3: Update distance matrix D such that distance between the new cluster Ck and all remaining clusters Cl are computed as: Step 4: Repeat Step 2 and 3 until only one cluster remains. Running time? Space consumption?

23 Hierarchical clustering (UPGMA) Input: A set of n species (1,2,...,n) and a distance matrix d giving the pairwise distances. Output: A rooted binary tree T with edge lengths and leaves labelled 1, 2,..., n such that similar species (according to the distance matrix) are grouped in the same subtree, and the tree distances correspond to the distance matrix. Algorithm: Step 1: Let the n species 1,2,...,n be n clusters C1,..,Cn of size 1. Let each cluster correspond to a leaf in a tree T. The distance D(Ci,Cj) between cluster Ci and Cj is d(i,j). Step 2: Pick the pair of cluster Ci and Cj where D(Ci,Cj) is minimized and form a new cluster Ck by merging Ci and Cj, i.e. Ck = Ci U Cj. Make a new internal node Ck in tree T with children corresponding to Ci and Cj. Place internal node Ck at height D(Ci,Cj) / 2. Step 3: Update distance matrix D such that distance between the new cluster Ck and all remaining clusters Cl are computed as: Step 4: Repeat Step 2 and 3 until only one cluster remains. Running time O(n3). Space consumption O(n2)

24 An implementation of UPGMA in Python

25 Proporties of an UPGMA tree

26 Trees reflecting a molecular clock A clocklike tree is a rooted tree, in which the total edge length from the root to any leaf is equal, i.e there is a molecular clock that ticks in a constant pace (the mutation rate is identical for all species), and all the observed species are at an equal number of ticks from the root... A clocklike tree induces an ultrametric distance matrix...

27 Ultrametric matrices and trees Formally: A matrix dij is ultrametric iff it is metric and any three species x, y, z satisfy that d(x,y) max{d(x,z), d(y,z)}, i.e there is a tie for the maximum of d(x,y), d(x,z), and d(y,z) Theorem (Gusfield): A symmetric matrix D has an ultra-metric tree iff D is an ultrametrix matrix Theorem (Gusfield): If D is ultrametric then the unique ultra-metric tree T for D can be constructed in time O(n2)

28 Two examples Ultrametric doesn t imply molecular clock a b 3 a a b c b c c 0 Molecular clock implies ultrametric 1 1 a 1 b 2 c a b c a b c

29 Improving the running time of UPGMA Input: A set of n species (1,2,...,n) and a distance matrix d giving the pairwise distances. Output: A rooted binary tree T with edge lengths and leaves labelled 1, 2,..., n such that similar species (according to the distance matrix) are grouped in the same subtree, and the tree distances correspond to the distance matrix. Algorithm: Step 1: Let the n species 1,2,...,n be n clusters C1,..,Cn of size 1. Let each cluster correspond to a leaf in a tree T. The distance D(Ci,Cj) between cluster Ci and Cj is d(i,j). Step 2: Pick the pair of cluster Ci and Cj where D(Ci,Cj) is minimized and form a new cluster Ck by merging Ci and Cj, i.e. Ck = Ci U Cj. Make a new internal node Ck in tree T with children corresponding to Ci and Cj. Place internal node Ck at height D(Ci,Cj) / 2. Step 3: Update distance matrix D such that distance between the new cluster Ck and all remaining clusters Cl are computed as: Step 4: Repeat Step 2 and 3 until only one cluster remains. Running time O(n3). Space consumption O(n2)

30 Improving the running time of UPGMA Input: A set of n species (1,2,...,n) and a distance matrix d giving the pairwise distances. Output: A rooted binary tree T with edge lengths and leaves labelled 1, 2,..., n such that similar species (according to the distance matrix) are grouped in the same subtree, and the tree distances correspond to the distance matrix. Algorithm: Step 1: Let the n species 1,2,...,n be n clusters C1,..,Cn of size 1. Let each cluster correspond to a leaf in a tree T. The distance D(Ci,Cj) between cluster Ci and Cj is d(i,j). Step 2: Pick the pair of cluster Ci and Cj where D(Ci,Cj) is minimized and form a new cluster Ck by merging Ci and Cj, i.e. Ck = Ci U Cj. Make a new internal node Ck in tree T with children corresponding to Ci and Cj. Place internal node Ck at height D(Ci,Cj) / 2. Step 3: Update distance matrix D such that distance between the new cluster Ck and all remaining clusters Cl are computed as: Step 4: Repeat Step 2 and 3 until only one cluster remains. Running time O(n3). Space consumption O(n2)

31 Improving the running time of UPGMA Input: A set of n species (1,2,...,n) and a distance matrix d giving the pairwise distances. Output: A rooted binary tree T with edge lengths and leaves labelled 1, 2,..., n such that similar species (according to the distance matrix) are grouped in the same subtree, and the tree distances correspond to the distance matrix. Algorithm: Step 1: Let the n species 1,2,...,n be n clusters C1,..,Cn of size 1. Let each cluster 2 correspond to a leaf in a tree T. The distance D(CTakes,C ) between cluster and Cj is d(i,j). O(n ) time. CanCbe improved to O(n). i j i Yields a total running time of O(n2). Step 2: Pick the pair of cluster Ci and Cj where D(Ci,Cj) is minimized and form a new cluster Ck by merging Ci and Cj, i.e. Ck = Ci U Cj. Make a new internal node Ck in tree T with children corresponding to Ci and Cj. Place internal node Ck at height D(Ci,Cj) / 2. Step 3: Update distance matrix D such that distance between the new cluster Ck and all remaining clusters Cl are computed as: Step 4: Repeat Step 2 and 3 until only one cluster remains. Running time O(n3). Space consumption O(n2)

32 Storing matrix D in a quad tree Building the quad tree for a nxn matrix takes time O(n2), when built we can pick the minimum in constant time O(1). Not surprising.

33 Storing matrix D in a quad tree Building the quad tree for a nxn matrix takes time O(n2), when built we can pick the minimum in constant time O(1). Not surprising. Question: How fast can we rebuilt the quad-tree to reflect the modified distance matrix we built in Step 3?

34 Updating the quad-tree Question: How fast can we rebuilt the quad-tree to reflect the modified distance matrix we built in Step 3? In Step 3, we only change 2 row/columns in D.

35 Updating the quad-tree Question: How fast can we rebuilt the quad-tree to reflect the modified distance matrix we built in Step 3? In Step 3, we only change 2 row/columns in D.

36 Updating the quad-tree Question: How fast can be rebuilt the quad-tree to reflect the modified distance matrix we built in Step 3? In Step 3, we only change 2 row/columns in D.

37 An implementation of a Quad Tree in Python

38 An implementation of a Quad Tree in Python

39 Improving the running time of UPGMA Input: A set of n species (1,2,...,n) and a distance matrix d giving the pairwise distances. Output: A rooted binary tree T with edge lengths and leaves labelled 1, 2,..., n such that similar species (according to the distance matrix) are grouped in the same subtree, and the tree distances correspond to the distance matrix. Algorithm: Step 1: Let the n species 1,2,...,n be n clusters C1,..,Cn of size 1. Let each cluster correspond to a leaf in a tree T. The distance D(Ci,Cj) between cluster Ci and Cj is d(i,j). Step 2: Pick the pair of cluster Ci and Cj where D(Ci,Cj) is minimized and form a new cluster Ck by merging Ci and Cj, i.e. Ck = Ci U Cj. Make a new internal node Ck in tree T with children corresponding to Ci and Cj. Place internal node Ck at height D(Ci,Cj) / 2. Step 3: Update distance matrix D such that distance between the new cluster Ck and all remaining clusters Cl are computed as: Step 4: Repeat Step 2 and 3 until only one cluster remains. Running time O(n2) using a quad-tree to store D. Space consumption O(n2)

40 Improving the running time of UPGMA Input: A set of n species (1,2,...,n) and a distance matrix d giving the pairwise distances. Output: A rooted binary tree T with edge lengths and leaves labelled 1, 2,..., n such that similar species (according to the distance matrix) are grouped in the same subtree, and the tree distances correspond to the distance matrix. Algorithm: O(n) Step 1: Let the n species 1,2,...,n be n clusters C1,..,Cn of size 1. Let each cluster correspond to a leaf in a tree T. The distance D(Ci,Cj) between cluster Ci and Cj is d(i,j). O(1) using quad-tree, O(n2) without quad-tree Step 2: Pick the pair of cluster Ci and Cj where D(Ci,Cj) is minimized and form a new cluster Ck by merging Ci and Cj, i.e. Ck = Ci U Cj. Make a new internal node Ck in tree T with children corresponding to Ci and Cj. Place internal node Ck at height D(Ci,Cj) / 2. O(n) for computing new distance and rebuilding quad-tree Step 3: Update distance matrix D such that distance between the new cluster Ck and all remaining clusters Cl are computed as: Step 4: Repeat Step 2 and 3 until only one cluster remains. Running time O(n2) using a quad-tree to store D. Space consumption O(n2)

41 Does it matter in practice? Tbasic(n) / TQuad(n) TBasic(n) TQuad(n) Tquad(n) / n2 Tbasic(n) / n3

42 Neighbor Joining

Neighbor Joining Input: A set of n species (1,2,...,n) and a distance matrix d giving the pairwise distances. Output: An unrooted binary tree T with edge lengths and leaves labelled 1, 2,.

43 Neighbor Joining Input: A set of n species (1,2,...,n) and a distance matrix d giving the pairwise distances. Output: An unrooted binary tree T with edge lengths and leaves labelled 1, 2,..., n such that similar species (according to the distance matrix) are grouped in the same subtree, and the tree distances correspond to the distance matrix. Popular distance based phylogenetc inference method with good accuracy. Introduced by Saitou and Nei in Reformulated by Studier and Keppler in Critcized by many but stll widely used. Often used as a benchmark for other methods.

44 Neighbor Joining Input: A set of n species (1,2,...,n) and a distance matrix d giving the pairwise distances. Output: An unrooted binary tree T with edge lengths and leaves labelled 1, 2,..., n such that similar species (according to the distance matrix) are grouped in the same subtree, and the tree distances correspond to the distance matrix. Let us assume that the distance matrix is additive, i.e. corresponds to tree distances from some tree with unknown topology and edge length, then this is the tree we want to infer If we from the distance matrix can infer which leaves (and sub trees) are neighbors in the unknown (true) tree, we can reconstruct the tree iteratively...

45 Neighbor Joining

46 Neighbor Joining Not easy to infer neighboring leaves!!

47 Neighbor Joining Here D(i,j) = D(k,l) = D(j,l) =13, and D(j,k) =12, so the leaves with minimum distance in the distance matrix are not necessarily neighbors in the tree, i.e. UPGMA will not reconstruct the tree correctly. Not easy to infer neighboring leaves!!

48 Neighbor Joining

49 Neighbor Joining

50 Neighbor Joining Input: A set of n species (1,2,...,n) and a distance matrix d giving the pairwise distances. Output: An unrooted binary tree T with edge lengths and leaves labelled 1, 2,..., n such that similar species (according to the distance matrix) are grouped in the same subtree, and the tree distances correspond to the distance matrix. Algorithm:

51 Neighbor Joining Input: A set of n species (1,2,...,n) and a distance matrix d giving the pairwise distances. Output: An unrooted binary tree T with edge lengths and leaves labelled 1, 2,..., n such that similar species (according to the distance matrix) are grouped in the same subtree, and the tree distances correspond to the distance matrix. Algorithm:

52 Neighbor Joining Input: A set of n species (1,2,...,n) and a distance matrix d giving the pairwise distances. Output: An unrooted binary tree T with edge lengths and leaves labelled 1, 2,..., n such that similar species (according to the distance matrix) are grouped in the same subtree, and the tree distances correspond to the distance matrix. Algorithm:

53 Neighbor Joining Input: A set of n species (1,2,...,n) and a distance matrix d giving the pairwise distances. Output: An unrooted binary tree T with edge lengths and leaves labelled 1, 2,..., n such that similar species (according to the distance matrix) are grouped in the same subtree, and the tree distances correspond to the distance matrix. Algorithm:

54 Running time and space consumption

55 Running time and space consumption Can we improve the running time of Step 1 by using a quad-tree? Why? Why not?

56 Properties of Neighbor Joining

57 How 'deterministic' is NJ? An experiment To see how much this matters in practice, I made a small experiment. I took 37 distance matrices based on Pfam sequences, with between 100 and 200 taxa. [ ] For each matrix I constructed ten equivalent distance matrices by randomly permuting the order of taxa. Then I used these ten matrices to construct ten neighbourjoining trees and fnally computed the Robinson-Foulds between all pairs of these trees. [...] As you can see, you can get quite diferent trees out of equivalent distance matrices and on realistic data too. I didn t check if the diferences between the trees are correlated with how much support the splits have in the data. Looking at bootstrap values for the trees might tell us something about that. Anyway, that is not the take home message I want to give here. That is that it doesn t necessarily make sense to talk the neighbourjoining tree for a distance matrix. Neighbour-joining is less deterministic than you might think!

58 How 'deterministic' is NJ?

59 RapidNJ RapidNJ is an implementation of NJ which uses heuristics to avoid searching large parts of the D matrix. No guarantees are given on how much of the matrix is avoided. RapidNJ was compared with three fast Neighbour-Joining implementatons: QuickTree A fast implementaton of the canonical Neighbour-Joining method. QuickJoin Uses an advanced data structure to reduce the search space. Clearcut Relaxed Neighbour-Joining. R8s was used to simulate phylogenetc trees from which distance matrices were constructed. Alignments from Pfam were used to infer distance matrices using QuickTree. Also work on how to use external memory to allow construction oflarge trees (> leaves).

60 Results on Simulated Data

61 Results on small Pfam Data

62 Results on large Pfam Data

63 Running times on Pfam data

Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony

Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony Basic Bioinformatics Workshop, ILRI Addis Ababa, 12 December 2017 Learning Objectives understand