EVOLUTIONARY DISTANCES INFERRING PHYLOGENIES

Similar documents
Evolutionary tree reconstruction (Chapter 10)

What is a phylogenetic tree? Algorithms for Computational Biology. Phylogenetics Summary. Di erent types of phylogenetic trees

Codon models. In reality we use codon model Amino acid substitution rates meet nucleotide models Codon(nucleotide triplet)

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst

Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony

Algorithms for Bioinformatics

Distance based tree reconstruction. Hierarchical clustering (UPGMA) Neighbor-Joining (NJ)

11/17/2009 Comp 590/Comp Fall

Recent Research Results. Evolutionary Trees Distance Methods

Parsimony-Based Approaches to Inferring Phylogenetic Trees

Outline of the talk. Local search meta-heuristics for combinatorial problems. Constraint Satisfaction Problems. The n-queens problem

On the Optimality of the Neighbor Joining Algorithm

Parsimony Least squares Minimum evolution Balanced minimum evolution Maximum likelihood (later in the course)

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering

Terminology. A phylogeny is the evolutionary history of an organism

CLUSTERING IN BIOINFORMATICS

CSE 549: Computational Biology

Stat 547 Assignment 3

human chimp mouse rat

Lecture 20: Clustering and Evolution

Genetics/MBT 541 Spring, 2002 Lecture 1 Joe Felsenstein Department of Genome Sciences Phylogeny methods, part 1 (Parsimony and such)

Lecture 20: Clustering and Evolution

Phylogenetic Trees Lecture 12. Section 7.4, in Durbin et al., 6.5 in Setubal et al. Shlomo Moran, Ilan Gronau

of the Balanced Minimum Evolution Polytope Ruriko Yoshida

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment

ML phylogenetic inference and GARLI. Derrick Zwickl. University of Arizona (and University of Kansas) Workshop on Molecular Evolution 2015

4/4/16 Comp 555 Spring

Clustering of Proteins

Distance-based Phylogenetic Methods Near a Polytomy

Parsimony methods. Chapter 1

Lecture: Bioinformatics

DIMACS Tutorial on Phylogenetic Trees and Rapidly Evolving Pathogens. Katherine St. John City University of New York 1

The worst case complexity of Maximum Parsimony

Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters

Introduction to Trees

DISTANCE BASED METHODS IN PHYLOGENTIC TREE CONSTRUCTION

A Clustering Approach to the Bounded Diameter Minimum Spanning Tree Problem Using Ants. Outline. Tyler Derr. Thesis Adviser: Dr. Thang N.

Confidence Regions and Averaging for Trees Examples

Phylogenetics on CUDA (Parallel) Architectures Bradly Alicea

CS 581. Tandy Warnow

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

Improvement of Distance-Based Phylogenetic Methods by a Local Maximum Likelihood Approach Using Triplets

Main Reference. Marc A. Suchard: Stochastic Models for Horizontal Gene Transfer: Taking a Random Walk through Tree Space Genetics 2005

Distance Methods. "PRINCIPLES OF PHYLOGENETICS" Spring 2006

ACO and other (meta)heuristics for CO

Confidence Regions and Averaging for Trees

Scaling species tree estimation methods to large datasets using NJMerge

Conservation genetics

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment

Chapter 6: Cluster Analysis

Evolutionary Trees. Fredrik Ronquist. August 29, 2005

Evolution Module. 6.1 Phylogenetic Trees. Bob Gardner and Lev Yampolski. Integrated Biology and Discrete Math (IBMS 1300)

Alignment ABC. Most slides are modified from Serafim s lectures

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

Unsupervised Learning. Unsupervised Learning. What is Clustering? Unsupervised Learning I Clustering 9/7/2017. Clustering

Consider the character matrix shown in table 1. We assume that the investigator has conducted primary homology analysis such that:

Artificial Intelligence

A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony

Generation of distancebased phylogenetic trees

46 Grundlagen der Bioinformatik, SoSe 11, D. Huson, May 16, 2011

1. Evolutionary Tree Reconstruction 2. Two Hypotheses for Human Evolution 3. Did we evolve from Neanderthals? 4. Distance-Based Phylogeny 5.

Rapid Neighbour-Joining

Fast Hierarchical Clustering via Dynamic Closest Pairs

3. Cluster analysis Overview

Pre-requisite Material for Course Heuristics and Approximation Algorithms

BMI/CS 576 Fall 2015 Midterm Exam

"PRINCIPLES OF PHYLOGENETICS" Spring Lab 1: Introduction to PHYLIP

CISC 636 Computational Biology & Bioinformatics (Fall 2016) Phylogenetic Trees (I)

A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony

A space where trees lie?

Lesson 13 Molecular Evolution

3. Cluster analysis Overview

Olivier Gascuel Arbres formels et Arbre de la Vie Conférence ENS Cachan, septembre Arbres formels et Arbre de la Vie.

Machine Learning (BSMC-GA 4439) Wenke Liu

Hardware-Software Codesign

Midterm Examination CS 540-2: Introduction to Artificial Intelligence

A Co-Clustering approach for Sum-Product Network Structure Learning

Evolution of Tandemly Repeated Sequences

Construction of a distance tree using clustering with the Unweighted Pair Group Method with Arithmatic Mean (UPGMA).

Midterm Examination CS540-2: Introduction to Artificial Intelligence

Machine Learning for Software Engineering

Phylogenetic Trees - Parsimony Tutorial #11

Artificial Intelligence

Machine Learning. Unsupervised Learning. Manfred Huber

Machine learning - HT Clustering

Gene expression & Clustering (Chapter 10)

An Edge-Swap Heuristic for Finding Dense Spanning Trees

Generalized Neighbor-Joining: More Reliable Phylogenetic Tree Reconstruction

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Seeing the wood for the trees: Analysing multiple alternative phylogenies

Computer Vision 2 Lecture 8

ARTIFICIAL INTELLIGENCE (CSCU9YE ) LECTURE 5: EVOLUTIONARY ALGORITHMS

CLC Phylogeny Module User manual

Prior Distributions on Phylogenetic Trees

Introduction to Optimization Using Metaheuristics. The Lecturer: Thomas Stidsen. Outline. Name: Thomas Stidsen: Nationality: Danish.

FastJoin, an improved neighbor-joining algorithm

Semi-Supervised Learning with Trees

Missing Data Estimation in Microarrays Using Multi-Organism Approach

Quiz section 10. June 1, 2018

Combinatorial Optimization - Lecture 14 - TSP EPFL

Transcription:

EVOLUTIONARY DISTANCES INFERRING PHYLOGENIES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 28 th November 2007

OUTLINE 1 INFERRING PHYLOGENIES 2 OPTIMIZATION PROBLEMS ON TREES 3 LEAST SQUARE METHODS 4 CLUSTERING METHODS

OUTLINE 1 INFERRING PHYLOGENIES 2 OPTIMIZATION PROBLEMS ON TREES 3 LEAST SQUARE METHODS 4 CLUSTERING METHODS

RECONSTRUCTING HISTORY OF LIFE WHAT MEANS PHYLOGENETIC INFERENCE? All species on Earth come from a common ancestor. If we have data from a pool of species, we wish to reconstruct the history of speciation events that lead to their emergence: We want to find the phylogenetic tree giving this information! This is an hard task, because data is often incomplete (we lack information about most of the ancestor species) and noisy.

METHODS TO INFER PHYLOGENY APPROACHES TO PHYLOGENY Distance-based methods Parsimony methods Likelihood methods Bayesian inference methods DISTANCE-BASED METHODS Given a matrix of pairwise distances, find the tree that explains it better. Several algorithms: UPGMA (clustering methods) Neighbor Joining Fitch-Margolias (sum of squares methods)

AN EXAMPLE: PRIMATES DNA FROM PRIMATES Tarsius Lemur Homo Sapiens Chimp Gorilla Pongo Hylobates Macaco Fuscata AAGTTTCATTGGAGCCACCACTCTTATAATTGCCCATGGCCTCACCTCCT... AAGCTTCATAGGAGCAACCATTCTAATAATCGCACATGGCCTTACATCAT... AAGCTTCACCGGCGCAGTCATTCTCATAATCGCCCACGGGCTTACATCCT... AAGCTTCACCGGCGCAATTATCCTCATAATCGCCCACGGACTTACATCCT... AAGCTTCACCGGCGCAGTTGTTCTTATAATTGCCCACGGACTTACATCAT... AAGCTTCACCGGCGCAACCACCCTCATGATTGCCCATGGACTCACATCCT... AAGCTTTACAGGTGCAACCGTCCTCATAATCGCCCACGGACTAACCTCTT... AAGCTTTTCCGGCGCAACCATCCTTATGATCGCTCACGGACTCACCTCTT... DISTANCE MATRIX 0.00 0.29 0.40 0.39 0.38 0.34 0.38 0.37 0.29 0.00 0.37 0.38 0.35 0.33 0.36 0.34 0.40 0.37 0.00 0.10 0.11 0.15 0.21 0.24 0.39 0.38 0.10 0.00 0.12 0.17 0.21 0.24 0.38 0.35 0.11 0.12 0.00 0.16 0.21 0.26 0.34 0.33 0.15 0.17 0.16 0.00 0.22 0.24 0.38 0.36 0.21 0.21 0.21 0.22 0.00 0.26 0.37 0.34 0.24 0.24 0.26 0.24 0.26 0.00 Tarsius Lemur Homo Sapiens Chimp Gorilla Pongo Hylobates Macaco Fuscata

OUTLINE 1 INFERRING PHYLOGENIES 2 OPTIMIZATION PROBLEMS ON TREES 3 LEAST SQUARE METHODS 4 CLUSTERING METHODS

WHAT IS AN OPTIMIZATION PROBLEM? TWO INGREDIENTS 1 A search space S (possibly constrained) 2 A function f : S R to optimize: find x such that f ( x) = max x S f (x). If S is discrete (integers, graphs, trees), then we talk of combinatorial optimization. If S is continuous (R), then we talk of continuous optimization. BAD NEWS... (OR GOOD ONES?) Interesting combinatorial optimization problems are usually N P-hard.

THE SPACE OF TREES The search space is the space of trees, usually with branch lengths. Branch lengths can be optimized for a given tree topology in an easy way. The bottleneck is the identification of the correct tree topology! TREE TOPOLOGIES Rooted vz Unrooted Labeled vz Unlabeled (leaves) Bifurcating vz Multifurcating

HOW MANY TOPOLOGIES ARE THERE? We count the rooted labeled bifurcating trees. KEY PROPERTY Each tree with n tips (leaves) can be obtained in a unique way from a tree with k leaves adding in sequence the remaining n k leaves. In fact, reversing the sequence and removing the n k leaves, one obtains a unique tree with k leaves.

HOW MANY TOPOLOGIES ARE THERE? We count the rooted labeled bifurcating trees. TREES WITH n LEAVES NT r,l,b (n) = 3 5 (2n 3) = (2n 3)! 2 n 2 (n 2)!

HOW MANY TOPOLOGIES ARE THERE? We count the rooted labeled bifurcating trees. Species Number of Trees 3 3 4 15 5 105 6 945 7 10.395 8 135.135 9 2.027.025 10 34.459.425 15 213.458.046.676.875 20 8.200.794.532.637.891.559.375 30 4, 9518 10 38 40 1, 00985 10 57 50 2.75292 10 76 TREES WITH n LEAVES NT r,l,b (n) = (2n 3)! 2 n 2 (n 2)!

SEARCHING THE TREE SPACE: HEURISTIC METHODS The N P-hardness of optimization problems on the tree space calls for heuristic solutions. Heuristic optimization algorithms usually search the space by following a trajectory that hopefully will hit good solutions, even the optimal one (trajectory-based heuristics). The next point of the trajectory is chosen in the neighbor of the current one. NEIGHBORHOODS FOR THE TREE SPACE Nearest-neighbor interchanges Subtree pruning and regrafting

NEIGHBORHOODS FOR THE TREE SPACE Subtree pruning and regrafting Nearest-neighbor interchanges

NEIGHBORHOODS FOR THE TREE SPACE Nearest-neighbor interchanges

HEURISTIC ALGORITHMS TRAJECTORY METHODS Local Optimization Greedy Search Simulated Annealing GRASP Taboo Search

HEURISTIC ALGORITHMS TRAJECTORY METHODS Local Optimization Greedy Search Simulated Annealing GRASP Taboo Search POPULATION METHODS Genetic Algorithms Ant Colony Optimization Particle swarm optimization

BRANCH AND BOUND SEARCH Branch and Bound is a technique that explores a portion of the search space that is guaranteed to contain the optimal solution. It is an intelligent exhaustive search. Trees are constructed incrementally, adding one species at time. If the current tree is known to lead to a worse solution to the best one found so far, the algorithm backtracks and reconsider previous choices. The algorithm needs to estimate upper and lower bounds on the function to be optimized from partial solutions. It worst complexity is exponential, but it works well in practice.

BRANCH AND BOUND - SEARCH TREE Branch and Bound can be seen as a visit of a search tree, pruning some subtrees (according to bounds) and backtracking before reaching leaves.

OUTLINE 1 INFERRING PHYLOGENIES 2 OPTIMIZATION PROBLEMS ON TREES 3 LEAST SQUARE METHODS 4 CLUSTERING METHODS

LEAST SQUARE METHOD We have our observed distance matrix D ij and a tree T with branch lengths predicting an additive distance matrix d ij. TARGET Find the tree T minimizing the error between d ij and D ij, i.e. the tree minimizing the weighted least square sum S(T ) = i,j w ij (D ij d ij ) 2 Given a tree topology, the best branch lengths for S can be computed by solving a linear system. A least square algorithms needs to search the tree space for the best tree T : this is an N P-hard problem. The search for the best tree can use branch and bound methods or heuristic state space explorations. This method gives the best explanation of the data

FITCH-MARGOLIAS ALGORITHM Letting w ij = 1 in S(T ) = Dij 2 i,j w ij(d ij d ij ) 2, we obtain the method of Fitch-Margoliash. 1 The choice of has statistical reasons: it takes into Dij 2 account the variance from the expected additive distances. Lemur M fuscata Hylobates Pongo Gorilla Pan Homo sap Tarsius M fuscata Hylobates Lemur Pongo Gorilla Pan Homo sap Tarsius

OUTLINE 1 INFERRING PHYLOGENIES 2 OPTIMIZATION PROBLEMS ON TREES 3 LEAST SQUARE METHODS 4 CLUSTERING METHODS

HEURISTIC METHODS: UPGMA HIERARCHICAL CLUSTERING Hierarchical clustering works by iteratively merging the two closest clusters (sets of elements) in the current collection of clusters. It requires a matrix of distances among singletons. Different ways of computing intercluster distances give rise to different HC-algorithms. UPGMA UPGMA (Unweighted Pair Group Method with Arithmetic mean) computes the distance between two clusters as d(a, B) = 1 A B i A,j B d ij. When two clusters A and B are merged, their union is represented by their ancestor node in the tree. The distance between A and B is evenly split between the two branches entering in A and B

UPGMA - AN EXAMPLE A B C D E B 2 C 4 4 D 6 6 6 E 6 6 6 4 F 8 8 8 8 8 d([a, B], C) = 1 (d(a, C) + d(b, C)) = 4 2 A, B C D E C 4 D 6 6 E 6 6 4 F 8 8 8 8

UPGMA - AN EXAMPLE A, B C D, E C 4 D, E 6 6 F 8 8 8 A, B, C D, E D, E 6 F 8 8

UPGMA - AN EXAMPLE

UPGMA - CONSIDERATIONS Tarsius Lemur Homo sap HYPOTHESIS UPGMA reconstructs correctly the tree if the input distance is an ultrametric (molecular clock). Pan Gorilla Pongo Hylobates M fuscata

HEURISTIC METHODS: NEIGHBOR-JOINING NEIGHBOR-JOINING Neighbor-Joining works similarly to UPGMA, but it merges together the two clusters minimizing D ij = d ij r i r j, where r i = 1 C 2 k d ik is the average distance of i from all other nodes. When i and j are merged, their new ancestor x has distances from another node k equal to d xk = 1 2 (d ik + d jk d ij ) The branch lengths are d ix = 1 2 (d ij + r i r j ) and d jx = 1 2 (d ij + r j r i ). NJ reconstructs the correct tree if the input distance is additive. Lemur M fuscata Hylobates Hylobates Pongo Gorilla Homo sap Pongo M fuscata Pan Gorilla Homo sap Pan Tarsius Lemur Tarsius

NOT ONLY DNA EVOLVE...

THE END Thanks for the attention!