Recent Research Results. Evolutionary Trees Distance Methods

Indo-European Languages After Tandy Warnow

What is the purpose? Understand evolutionary history (relationships between species). Understand how various functions evolved. Understand forces and constraints on evolution. Do multiple alignment.

Multiple Alignment → Evolutionary Tree
Input sequences:
Cat:  GCGAGTAGAGCGTA
Dog:  CCGGTCGACGAA
Frog: CAAGTCTCACGGAT
Wolf: CCTGCCGACGA
Bird: AGTCGCACGGTT
Aligned sequences:
Cat:  GCGAGTAG-AGCGTA-
Dog:  -CCGGTCG-A-CGAA-
Frog: -CAAGTCTCA-CGGAT
Wolf: -CCTGCCG-A-CG-A-
Bird: ---AGTCGCA-CGGTT
Steps: align, compute distances, construct tree.

Evolutionary Tree → Multiple Alignment
Align along the edges of the tree using pairwise alignment.

Solution Methods

Some Complications Convergence or parallel evolution, e.g. presence of wings in birds and bats. Assume that there is no convergence. This can often be achieved by excluding the characters causing problems. Reversals, e.g. loss of legs in snakes. Assume that there are no reversals. This can often be achieved by excluding the characters causing problems.

General Methods Distance-based methods. Maximum parsimony method. Maximum likelihood methods. Consensus methods.

Distance-Based Methods Distances between all pairs of species are determined. If species are represented by DNA or protein sequences, the distances may depend on the number of substitutions, insertions and deletions needed to get one sequence from another. A tree is computed from the resulting distance matrix such that the distances in the tree fit the distances in the matrix as well as possible.
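One simple way to obtain such a distance matrix is sketched below. It assumes that the distance between two species is the number of alignment columns in which their rows differ (a base against a different base, or a base against a gap). The slides only say that distances "may depend on" substitutions, insertions and deletions, so this particular count is an illustrative assumption. The aligned rows are the ones from the multiple alignment slide.

```python
from itertools import combinations

# Aligned rows copied from the multiple alignment slide.
ALIGNED = {
    "Cat":  "GCGAGTAG-AGCGTA-",
    "Dog":  "-CCGGTCG-A-CGAA-",
    "Frog": "-CAAGTCTCA-CGGAT",
    "Wolf": "-CCTGCCG-A-CG-A-",
    "Bird": "---AGTCGCA-CGGTT",
}


def mismatch_distance(row_a, row_b):
    """Count the columns in which two aligned rows differ (assumed distance)."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must come from the same alignment")
    return sum(1 for a, b in zip(row_a, row_b) if a != b)


def distance_matrix(aligned):
    """Return a symmetric nested dict: dist[a][b] is the mismatch count."""
    dist = {name: {name: 0} for name in aligned}
    for a, b in combinations(aligned, 2):
        dist[a][b] = dist[b][a] = mismatch_distance(aligned[a], aligned[b])
    return dist
```

Calling `distance_matrix(ALIGNED)` yields the input for any of the tree construction methods discussed later (UPGMA, Neighbor Joining).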

Maximum Parsimony Methods Principle of Parsimony: When given the choice between two explanations, one simple and one complex, choose the simple one. Character-based methods; characters are, for example, "number of legs", eukaryote vs. prokaryote organisms, bases in aligned DNA or amino acids in protein sequences. The problem is to find a tree that requires the minimum number of changes.
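For a fixed rooted binary tree and a single character, the minimum number of changes can be counted with Fitch's algorithm, which the slides do not name; it is brought in here only as a well-known way to evaluate one candidate tree. The sketch below assumes trees are written as nested 2-tuples of leaf names. The two trees and the character states are hypothetical.

```python
def fitch(tree, states):
    """Return (candidate states at this node, minimum number of changes below it).

    tree   -- a leaf name (str) or a 2-tuple (left subtree, right subtree)
    states -- dict mapping each leaf name to its character state
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change is needed here.
        return common, left_cost + right_cost
    # The children disagree: one change is unavoidable at this node.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical character (one alignment column) and two hypothetical trees.
STATES = {"Cat": "A", "Dog": "C", "Wolf": "C", "Frog": "A", "Bird": "A"}
TREE_1 = ((("Dog", "Wolf"), "Cat"), ("Frog", "Bird"))
TREE_2 = (("Dog", "Frog"), ("Wolf", ("Cat", "Bird")))
```

Under the parsimony principle, the tree with the smaller count is preferred for this character; for several characters the counts are added.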

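Maximum Likelihood Methods Substitution matrix (row: original base, column: new base):
$$
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & 1 - u(a p_c + b p_g + c p_t) & u a p_c & u b p_g & u c p_t \\
C & u d p_a & 1 - u(d p_a + e p_g + f p_t) & u e p_g & u f p_t \\
G & u g p_a & u h p_c & 1 - u(g p_a + h p_c + i p_t) & u i p_t \\
T & u j p_a & u k p_c & u l p_g & 1 - u(j p_a + k p_c + l p_g)
\end{array}
$$
Base frequencies: $p_a, p_c, p_g, p_t$. Mutation rate: $u$. Frequencies of change of any base to any other: $a, b, c, d, e, f, g, h, i, j, k, l$.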

Maximum Likelihood Methods n species: consider a topology with n leaves. Consider each base position for a given topology. Consider each possible assignment of bases to inner nodes. Compute the product of transition probabilities along all edges. Add up the products over all possible assignments; this is the likelihood of the base position. Combine the base positions by multiplying their likelihoods (equivalently, adding up their logarithms). This is the likelihood of the particular topology. Repeat for every topology and select the one with maximum likelihood.
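The enumeration described above can be sketched directly for a tiny tree. The sketch makes several assumptions that are not in the slides: a uniform prior on the root base, and a simplified model in which each edge has a single change probability `p_change` split evenly over the three other bases, instead of the twelve-parameter matrix. The tree `EDGES` and its probabilities are hypothetical. The enumeration is exponential in the number of inner nodes, so it is only meant to make the definition concrete.

```python
from itertools import product

BASES = "ACGT"

# Hypothetical rooted tree: (parent, child, probability of a change on the edge).
EDGES = [("root", "v", 0.1), ("root", "Frog", 0.2), ("v", "Cat", 0.1), ("v", "Dog", 0.1)]
INNER_NODES = ["root", "v"]


def edge_probability(parent_base, child_base, p_change):
    """Assumed model: stay with prob. 1 - p_change, else pick another base evenly."""
    return 1.0 - p_change if parent_base == child_base else p_change / 3.0


def site_likelihood(edges, inner_nodes, leaf_bases):
    """Sum, over all base assignments to inner nodes, of the product along edges."""
    total = 0.0
    for assignment in product(BASES, repeat=len(inner_nodes)):
        bases = dict(leaf_bases)
        bases.update(zip(inner_nodes, assignment))
        term = 0.25  # assumed uniform prior on the base at the root
        for parent, child, p_change in edges:
            term *= edge_probability(bases[parent], bases[child], p_change)
        total += term
    return total


def alignment_likelihood(edges, inner_nodes, columns):
    """Multiply the per-position likelihoods; columns is a list of leaf_bases dicts."""
    result = 1.0
    for leaf_bases in columns:
        result *= site_likelihood(edges, inner_nodes, leaf_bases)
    return result
```

For real data sets the per-position sum is normally computed by dynamic programming over the tree rather than by explicit enumeration, and logarithms are added to avoid numerical underflow.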

Consensus Methods A set of leaf-labeled trees (possibly weighted) is used to generate a tree. In some situations the trees have the same leaves but different topologies. For example, it frequently happens that choosing different DNA or protein sequences for the same family of species results in different evolutionary trees. In other situations, each tree spans only a (small) subset of the species. The objective is then to find a tree spanning all species and agreeing in some way with the small trees.

Introduction to Distance Methods Distance methods reconstruct trees (rooted or unrooted) from a set of pairwise distances between the sequences. Introduced by Cavalli-Sforza and Edwards [1967] and by Fitch and Margoliash [1967]. Influenced by the clustering algorithms of Sokal and Sneath [1963].

Metric Spaces $M$ is a set. $d: M \times M \to \mathbb{R}$ is a function. $d$ is a distance function iff $d(u,v) > 0$ for all $u,v \in M$, $u \neq v$; $d(u,u) = 0$ for all $u \in M$; $d(u,v) = d(v,u)$ for all $u,v \in M$; $d(u,v) \le d(u,w)+d(w,v)$ for all $u,v,w \in M$. A set $M$ with a distance function $d$ is called a metric space.
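The four axioms can be checked mechanically on a finite distance matrix. The following is a minimal sketch, assuming the distances are given as a square list of lists indexed by species number; the function name and the tolerance parameter are illustrative choices.

```python
from itertools import permutations


def is_distance_function(d, tol=1e-12):
    """Check positivity, zero diagonal, symmetry and the triangle inequality."""
    n = len(d)
    for u in range(n):
        if abs(d[u][u]) > tol:
            return False  # d(u,u) must be 0
        for v in range(n):
            if u != v and d[u][v] <= 0:
                return False  # d(u,v) must be positive for u != v
            if abs(d[u][v] - d[v][u]) > tol:
                return False  # d must be symmetric
    for u, v, w in permutations(range(n), 3):
        if d[u][v] > d[u][w] + d[w][v] + tol:
            return False  # triangle inequality violated
    return True
```

For example, `[[0, 1, 2], [1, 0, 1], [2, 1, 0]]` passes, while replacing the 2 by 3 breaks the triangle inequality.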

Tree Metrics $T$ is a tree with edge weights and with elements of the set $M$ as its leaves. Define $d_T(u,v)$ for all $u,v$ in $M$ as the length of the unique path from $u$ to $v$ in $T$. It can be shown that $(M,d_T)$ is a metric space provided that edge weights are strictly positive.

Additive Distance Functions Given a distance function d on a set M. Does there exist a tree T with elements of M as leaves realizing d? If it is the case, d is said to be additive.

Four Points Condition $(M,d)$ is a metric space. $d$ is additive iff for every set of four different elements $i, j, k, l \in M$, two of the sums $d_{ij}+d_{kl}$, $d_{ik}+d_{jl}$, $d_{il}+d_{jk}$ are the same and greater than or equal to the third sum. This condition is called the four points condition. Necessity: by inspection of the tree spanning the four leaves. Sufficiency: constructive proof.
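The condition translates into a short test: for every four elements, sort the three sums and require the two largest to coincide. The sketch below assumes a square list-of-lists matrix; the two example matrices are hypothetical. `ADDITIVE` comes from a tree with pendant edges of lengths 1, 2, 3, 4 and an inner edge of length 5.

```python
from itertools import combinations


def satisfies_four_point(d, tol=1e-9):
    """Return True iff every quadruple has its two largest pair sums equal."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if sums[2] - sums[1] > tol:
            return False
    return True


# Hypothetical examples.
ADDITIVE = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]
NOT_ADDITIVE = [
    [0, 2, 3, 2],
    [2, 0, 2, 3],
    [3, 2, 0, 2],
    [2, 3, 2, 0],
]
```

The check looks at all quadruples, so it takes time proportional to the fourth power of the number of species.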

Sufficiency Given a metric space $(M,d)$. Is there a weighted tree $T$ with elements of $M$ as its leaves such that $d_T(u,v) = d(u,v)$ for all $u, v$ in $M$? Obvious for $|M| = 2$. How about $|M| = 3$?

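Sufficiency for m=3 Let $x$, $y$, $z$ be the lengths of the edges joining the Steiner point to leaves #1, #2, #3.
$x + y = d_{12}$, $x + z = d_{13}$, $y + z = d_{23}$
$2x + y + z = d_{12} + d_{13} \implies x = (d_{12} + d_{13} - d_{23}) / 2$
$x + 2y + z = d_{12} + d_{23} \implies y = (d_{12} + d_{23} - d_{13}) / 2$
$x + y + 2z = d_{13} + d_{23} \implies z = (d_{13} + d_{23} - d_{12}) / 2$
Note that $x, y, z \ge 0$ by the triangle inequality. Can be 0! The solution is unique.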
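The three formulas can be evaluated directly. A minimal sketch (the function name is an illustrative choice):

```python
def three_leaf_tree(d12, d13, d23):
    """Return the edge lengths (x, y, z) from the Steiner point to leaves 1, 2, 3."""
    x = (d12 + d13 - d23) / 2
    y = (d12 + d23 - d13) / 2
    z = (d13 + d23 - d12) / 2
    if min(x, y, z) < 0:
        raise ValueError("the distances violate the triangle inequality")
    return x, y, z
```

With $d_{12} = 8$, $d_{13} = 3$, $d_{23} = 9$ this gives $x = 1$, $y = 7$, $z = 2$. With $d_{12} = 2$, $d_{13} = d_{23} = 1$ it gives $z = 0$: leaf #3 then sits on the path between the other two leaves.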

Sufficiency for m = 4 Let $s$ be the Steiner point of the tree for #1, #2, #3. Solve for #1, #2, #4. The Steiner point $s_4$ is somewhere on the path from #1 to #2. If $s_4 \neq s$, add an edge from $s_4$ to #4. If $s_4$ coincides with $s$, solve for $s$, #3 and #4. Uniqueness: assume the tree is not unique. Then there are two trees with $s_4$ placed in different places. This implies that there are 2 different trees for #1, #3 and #4, a contradiction.

Sufficiency for m > 4 Assume that we have a unique tree for $k$ species, $k \ge 4$. The process is similar to what was done for 4 species.

Uniqueness Is the solution unique? Topology is unique. Assume that there are 2 distinct topologies. There must exist 3 leaves x, y, z such that the partitions induced by them in these two topologies are different. Assume that x and a fourth leaf w are in the same partition subset in one topology while they are in different partition subsets in the second topology. This implies that there are 2 different trees realizing the distances for the four species x, y, z, w. A contradiction.

Uniqueness Is the solution unique? Edge weights are unique. Assume that there are two solutions with the same topology where the edge incident with a leaf v has different lengths. Let s be the Steiner point incident to this edge. This s defines a 3-set partition (one set consisting of v alone). Taking any x and y from the other two sets gives two distinct solutions, a contradiction. An interior edge in a fixed topology defines a 4-set partition. Assume that there are two solutions with the same topology but with different lengths of the selected interior edge. Taking any set of 4 leaves one from each partition set must give a unique solution, a contradiction.

Ultrametric Distance Functions A tree T in a metric space (M,d) where d is ultrametric has the following property: there is a way to place a root on T so that all elements of M (the leaves) are at the same distance from the root. Such T is referred to as a molecular clock tree. d is ultrametric ==> d is additive. (M,d) is ultrametric iff for every set of three elements $i,j,k \in M$, two of the distances $d_{ij}$, $d_{ik}$, $d_{jk}$ coincide and are greater than or equal to the third one. (M,d) is ultrametric iff in the corresponding complete weighted graph G, the largest-weight edge in any cycle is not unique.
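The three-element characterization can be tested in the same style as the four points condition. The sketch below assumes a square list-of-lists matrix; the function name and tolerance are illustrative.

```python
from itertools import combinations


def is_ultrametric(d, tol=1e-9):
    """Return True iff in every triple the two largest distances coincide."""
    for i, j, k in combinations(range(len(d)), 3):
        a, b, c = sorted([d[i][j], d[i][k], d[j][k]])
        if c - b > tol:
            return False
    return True
```

For instance, `[[0, 2, 6], [2, 0, 6], [6, 6, 0]]` is ultrametric, whereas the triple with distances 8, 3, 9 is not, because its largest distance is unique.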

Sandwich Problem Given two distance functions $d_l$ and $d_u$ on a set $M$, $d_l(a,b) \le d_u(a,b)$ for all species $a, b$. Does there exist an ultrametric tree $T$ with elements of $M$ as leaves such that $d_l(a,b) \le d_T(a,b) \le d_u(a,b)$ for all species $a, b$? Lower and upper bounds can be given by two weighted graphs $G_l$ and $G_u$. Edges not present in $G_l$ have weight 0. Edges not present in $G_u$ have weight $\infty$. If such an ultrametric $T$ exists, it can be found in polynomial time. M. Farach, S. Kannan and T. Warnow, A robust model for finding evolutionary trees, Algorithmica 13 (1995) 155-179.

Unweighted Pair Group Method Using Averages - UPGMA Find species i and j with the smallest distance M(i, j). Create a new node (i, j) and connect it to i and j by branches of length M(i, j) / 2. Compute the distance between the new group ij and all other groups k (except i and j) by using
$$M_{ij,k} = \frac{n_i}{n_i + n_j} M_{i,k} + \frac{n_j}{n_i + n_j} M_{j,k}$$
where $n_i$ and $n_j$ are the numbers of species in groups i and j. Delete the columns and rows of the data matrix that correspond to groups i and j, and add a column and row for group ij. If there is only one item in the data matrix, stop. Otherwise repeat.
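The loop above can be sketched compactly. This is a minimal illustration under several assumptions: groups are represented as frozensets of leaf labels (so the group size is simply the size of the set), labels are distinct, and the value recorded for each merge is M(i, j) / 2, read as the height of the new node above the leaves. The function name and the return format are illustrative choices.

```python
def upgma(labels, matrix):
    """Return the merges as a list of (leaves in the new group, height of its node)."""
    groups = [frozenset([label]) for label in labels]
    # Distances between current groups, keyed by the unordered pair of groups.
    dist = {
        frozenset((groups[a], groups[b])): float(matrix[a][b])
        for a in range(len(groups))
        for b in range(a + 1, len(groups))
    }
    active = set(groups)
    merges = []
    while len(active) > 1:
        pair = min(dist, key=dist.get)  # closest pair of groups
        i, j = pair
        pair_distance = dist.pop(pair)
        n_i, n_j = len(i), len(j)
        merged = i | j
        active -= {i, j}
        for k in active:
            d_ik = dist.pop(frozenset((i, k)))
            d_jk = dist.pop(frozenset((j, k)))
            # Size-weighted average, as in the update formula.
            dist[frozenset((merged, k))] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        active.add(merged)
        merges.append((merged, pair_distance / 2))
    return merges
```

On an ultrametric input such as `[[0, 2, 8, 8], [2, 0, 8, 8], [8, 8, 0, 4], [8, 8, 4, 0]]` with labels A to D, the merges are {A, B}, then {C, D}, then all four leaves, and the tree reproduces the input distances.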

UPGMA - Example

UPGMA Another Example

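Neighbor Joining Distance matrix $d$ (upper triangle, species 1 to 6):
$$
\begin{array}{c|ccccc}
  & 2 & 3 & 4 & 5 & 6 \\ \hline
1 & 8 & 3 & 14 & 10 & 12 \\
2 &   & 9 & 10 & 6 & 8 \\
3 &   &   & 15 & 11 & 13 \\
4 &   &   &    & 10 & 8 \\
5 &   &   &    &    & 8
\end{array}
$$
$r_i = \frac{1}{M-2} \sum_k d_{ik}$, where $M = 6$ is the number of species: $r_1 = 11\tfrac{3}{4}$, $r_2 = 10\tfrac{1}{4}$, $r_3 = 12\tfrac{3}{4}$, $r_4 = 14\tfrac{1}{4}$, $r_5 = 11\tfrac{1}{4}$, $r_6 = 12\tfrac{1}{4}$.
$D_{ij} = d_{ij} - r_i - r_j$:
$$
\begin{array}{c|ccccc}
  & 2 & 3 & 4 & 5 & 6 \\ \hline
1 & -14 & -21.5 & -12 & -13 & -12 \\
2 &     & -14 & -14.5 & -15.5 & -14.5 \\
3 &     &     & -12 & -13 & -12 \\
4 &     &     &     & -15.5 & -18.5 \\
5 &     &     &     &       & -15.5
\end{array}
$$
The smallest entry is $D_{13} = -21.5$, so species 1 and 3 are joined.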

Neighbor Joining Species 1 and 3 are replaced by a new species 7 at distances:
$d_{17} = \tfrac{1}{2}(d_{13} + r_1 - r_3) = 1$
$d_{37} = \tfrac{1}{2}(d_{13} + r_3 - r_1) = 2$
$d_{27} = \tfrac{1}{2}(d_{12} + d_{23} - d_{13}) = 7$
$d_{47} = \tfrac{1}{2}(d_{14} + d_{34} - d_{13}) = 13$
$d_{57} = \tfrac{1}{2}(d_{15} + d_{35} - d_{13}) = 9$
$d_{67} = \tfrac{1}{2}(d_{16} + d_{36} - d_{13}) = 11$
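One Neighbor Joining step on the six-species example can be reproduced with a few lines. The sketch below uses 0-based indices (species 1 is index 0), stores the full symmetric matrix, and returns the quantities used on the slides; the function name and return format are illustrative choices.

```python
# Symmetric distance matrix of the six-species example (species 1..6 -> index 0..5).
EXAMPLE = [
    [0, 8, 3, 14, 10, 12],
    [8, 0, 9, 10, 6, 8],
    [3, 9, 0, 15, 11, 13],
    [14, 10, 15, 0, 10, 8],
    [10, 6, 11, 10, 0, 8],
    [12, 8, 13, 8, 8, 0],
]


def nj_step(d):
    """Perform one joining step; return (r, joined pair, two branch lengths, new distances)."""
    m = len(d)
    r = [sum(row) / (m - 2) for row in d]
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    # Join the pair minimizing D_ij = d_ij - r_i - r_j.
    i, j = min(pairs, key=lambda p: d[p[0]][p[1]] - r[p[0]] - r[p[1]])
    branch_i = 0.5 * (d[i][j] + r[i] - r[j])
    branch_j = 0.5 * (d[i][j] + r[j] - r[i])
    # Distances from every remaining species k to the new node.
    to_new = {k: 0.5 * (d[i][k] + d[j][k] - d[i][j]) for k in range(m) if k not in (i, j)}
    return r, (i, j), branch_i, branch_j, to_new
```

With the distances above, `nj_step(EXAMPLE)` selects the pair (0, 2), that is species 1 and 3, with branch lengths 1 and 2. It also evaluates $r_4$ to 14.25, the value that reproduces $D_{14} = -12$ and $D_{46} = -18.5$.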

Neighbor Joining

Multidimensional Scaling (MDS) Method of representing a given collection of dissimilarities between pairs of objects as distances between points in a multidimensional metric space. Obvious application: visual representation of objects in 2- or 3-dimensional Euclidean space such that distances between points in the space match the original dissimilarities between objects as closely as possible. Our application: species represented as points in higher-dimensional Euclidean space so that dissimilarities are captured as closely as possible by distances. Low-cost Steiner trees could be good evolutionary trees.

Classical MDS - Overview $\Delta$: matrix of dissimilarities which actually are distances in some higher-dimensional Euclidean space. $\Delta^{(2)}$: matrix of squared dissimilarities. $H = I_n - n^{-1} \mathbf{1}\mathbf{1}^T$: centering matrix. $B = -\tfrac{1}{2} H \Delta^{(2)} H$: inner product matrix. Spectral decomposition $B = V \Lambda V^T$, where $\Lambda$ is a diagonal matrix of eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$ and $V$ is the corresponding matrix of normalized eigenvectors. $d$: number of non-zero eigenvalues. $\Lambda_d$: diagonal matrix of the first $d$ eigenvalues. Let $V_d$ denote the first $d$ columns of $V$. Then $X = V_d \Lambda_d^{1/2}$ is the coordinate matrix for objects.
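As a hand-worked sanity check of the construction (an illustration that is not part of the original slides), take $n = 2$ objects at dissimilarity $\delta$ and use the standard double centering $B = -\tfrac{1}{2} H \Delta^{(2)} H$:

```latex
\begin{align*}
\Delta^{(2)} &= \begin{pmatrix} 0 & \delta^2 \\ \delta^2 & 0 \end{pmatrix}, \qquad
H = I_2 - \tfrac{1}{2}\mathbf{1}\mathbf{1}^T
  = \tfrac{1}{2}\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix} \\
B &= -\tfrac{1}{2} H \Delta^{(2)} H
  = \tfrac{\delta^2}{4}\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix} \\
\lambda_1 &= \tfrac{\delta^2}{2}, \quad
v_1 = \tfrac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \end{pmatrix}, \quad
\lambda_2 = 0 \\
X &= v_1 \lambda_1^{1/2}
  = \begin{pmatrix} \delta/2 \\ -\delta/2 \end{pmatrix}
\end{align*}
```

There is one non-zero eigenvalue, so $d = 1$: the two objects are placed on a line at coordinates $\delta/2$ and $-\delta/2$, exactly $\delta$ apart.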

Δ: metric matrix of dissimilarities (not necessarily distances in some higher-dimensional Euclidean space). Classical MDS can be used to compute X (discarding negative eigenvalues). If negative eigenvalues are small (close to zero) then X is fairly accurate. In order to reduce the dimension d, small positive eigenvalues can be discarded.

Steiner Tree Problem Given n points in d-dimensional space, find a shortest network spanning them.

Heuristic Construct an MST. Local improvements: applied to pairs of edges meeting at angles of less than 120 degrees (introducing a Steiner point can shorten such a pair). Shortcutting: pick a pair of nonadjacent vertices. Consider the longest edge on the path between them in the current tree. If the distance between the nonadjacent vertices is less than the length of the longest edge, shortcut and apply local improvements. Postprocessing: terminals of degree greater than 1 and Steiner points of degree greater than 3 have their degrees reduced by introducing appropriate Steiner points.
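The first step of the heuristic, constructing a minimum spanning tree of the points, can be sketched with Prim's algorithm. The slides do not say which MST algorithm is used, so Prim's algorithm is an assumption here, and the later steps (local improvements, shortcutting, postprocessing) are not covered. Points are assumed to be coordinate tuples of equal dimension.

```python
import math


def euclidean_mst(points):
    """Return the MST edges as (parent index, child index, length), using Prim's algorithm."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known connection to the current tree
    best_parent = [-1] * n
    if n:
        best_dist[0] = 0.0       # start the tree at point 0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=best_dist.__getitem__)
        in_tree[u] = True
        if best_parent[u] >= 0:
            edges.append((best_parent[u], u, best_dist[u]))
        for v in range(n):
            if not in_tree[v]:
                w = math.dist(points[u], points[v])
                if w < best_dist[v]:
                    best_dist[v] = w
                    best_parent[v] = u
    return edges
```

This version takes quadratic time in the number of points, which is adequate for a complete Euclidean graph. The total MST length is the sum of the third components of the returned edges.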

MDS-Steiner Heuristic

Validation Methods Reconstruction of known phylogenies: these are rare, small and/or easy. Computational simulation: given a tree and an initial root sequence, use an appropriate probabilistic model of evolution to generate leaf sequences. Apply any of the distance methods and compare the result with the tree used in the simulation. J. Stoye, D. Evers and F. Meyer, ROSE: Generating sequence families, Bioinformatics 14 (1998) 157-163.

Comparison of Trees Partition metric: removing an edge in the simulation tree or in the solution tree gives a cut (a partition of the species into two sets). The partition metric is the number of cuts induced by one tree but not by the other tree. This can be at most 2n-6. Can be determined in linear time. Sensitive when applied to very similar trees.
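The partition metric can be sketched by collecting the cuts of each tree as sets of leaves and counting those that occur in only one of the two. The sketch assumes trees are written as nested tuples of leaf names (any rooting of the unrooted tree works) and ignores the trivial cuts that separate a single leaf. It is a simple set-based version, not the linear-time algorithm mentioned above. The three example trees are hypothetical.

```python
def cuts(tree):
    """Return the non-trivial cuts of a tree, each stored as the side without the anchor leaf."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        sides.append(clade)
        return clade

    sides = []
    all_leaves = visit(tree)
    anchor = min(all_leaves)  # fixed reference leaf used to normalize each cut
    for clade in sides:
        if 2 <= len(clade) <= len(all_leaves) - 2:
            found.add(all_leaves - clade if anchor in clade else clade)
    return found


def partition_metric(tree_a, tree_b):
    """Number of cuts induced by exactly one of the two trees."""
    return len(cuts(tree_a) ^ cuts(tree_b))


# Hypothetical five-leaf trees.
T1 = ((("a", "b"), "c"), ("d", "e"))
T2 = ((("a", "c"), "b"), ("d", "e"))
T3 = ((("a", "d"), "b"), ("c", "e"))
```

With five leaves the maximum value is 2n-6 = 4, which is reached by `T1` and `T3`; `T1` and `T2` share the cut separating d and e from the rest and differ in one cut each.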

Computational Experience How well does MDS retain the information needed to obtain good phylogenetic trees? Are short Euclidean Steiner trees good phylogenetic trees? How do phylogenetic trees obtained by the Steiner method compare to those obtained by other methods?

Quality of MDS (see Fig. 1a) Apply Neighbor Joining to the original distance matrices and to the MDS matrices (for an increasing number of dimensions). For a sufficient number of dimensions, the same phylogenetic trees are obtained. The number of dimensions needed to obtain good phylogenetic trees seems to be bounded, e.g. 20 dimensions are typically needed for problem instances of size 50. ROSE's relatedness parameter seems not to affect the number of dimensions. Conclusion: The information needed to obtain good phylogenetic trees is preserved when using MDS.

Steiner Trees as Phylogenetic Trees (Fig. 2a) As the lengths of Steiner trees decrease to 85% of the lengths of MSTs, their partition metric distances to the correct trees start to decrease. This behavior is observed for all dimensions higher than ½ of the maximal dimension.

Direct Comparison of NJ and Steiner (Fig. 1b) These two methods seem comparable. It is possible that the way instances are generated by ROSE favors NJ.

Homework Critical assessment of the paper. Method: MDS, Steiner tree heuristic. Validation: use of ROSE. Evaluation: problem instances, interpretation of results.