ML phylogenetic inference and GARLI. Derrick Zwickl. University of Arizona (and University of Kansas) Workshop on Molecular Evolution 2015


Outline
- Heuristics and tree searches
- ML phylogeny inference and GARLI
- Using GARLI in practice
- Computer exercises

Heuristics
- Aim to optimize a function or solve some problem
- Finding a good solution is not guaranteed
- Deterministic and/or stochastic
- Many types (greedy hill climbing, GA, simulated annealing, etc.)
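The simplest of these, greedy hill climbing, can be sketched in a few lines. This is a generic illustration, not GARLI's code; the `score` and `propose` callables are hypothetical stand-ins for a real objective function and proposal mechanism:

```python
import random

def hill_climb(score, propose, start, steps=10000):
    """Greedy stochastic hill climbing: propose a random neighbor and
    keep it only if it scores better. Finding the global optimum is
    not guaranteed -- the search can stall on a local peak."""
    current, best = start, score(start)
    for _ in range(steps):
        candidate = propose(current)
        s = score(candidate)
        if s > best:              # greedy: accept only improvements
            current, best = candidate, s
    return current, best

# Toy usage: maximize a 1-D function peaked at x = 3
random.seed(1)
x, fx = hill_climb(score=lambda v: -(v - 3.0) ** 2,
                   propose=lambda v: v + random.uniform(-0.5, 0.5),
                   start=10.0)
```

A greedy climber like this never accepts a worse value, which is exactly why it can get trapped on a local peak of a bumpier surface.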

A likelihood surface: likelihood plotted against parameter values. Viewed from above, the surface has peaks (local optima, one of them the global optimum) and valleys.

Heuristic features:
- Where does it start?
- How are new values proposed?
- How does it move between proposed values?
(figures: picking a starting point on the surface, proposing new values, and choosing between better and worse proposed values)

Heuristics: there are few restrictions on how a heuristic can work, and the best choice is likely problem-specific.

ML phylogenetic heuristics
Goal: find the tree with the highest likelihood
Difficulties:
- Enormous number of trees to consider
- Significant computation needed to score each tree (parameter optimization)
- Branch-length parameters aren't equivalent on different trees
- Optimal parameter values are strongly correlated
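The first difficulty can be made concrete: the number of distinct unrooted binary topologies for n taxa is the double factorial (2n-5)!!, which grows explosively. A quick sketch:

```python
def num_unrooted_trees(n):
    """Number of distinct unrooted binary tree topologies for n taxa:
    (2n-5)!! = 3 * 5 * 7 * ... * (2n-5)."""
    count = 1
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count

print(num_unrooted_trees(10))   # 2027025
print(num_unrooted_trees(50))   # ~2.8e74 -- far too many trees to ever score
```

Exhaustive evaluation is hopeless beyond a handful of taxa, which is why heuristic search is unavoidable.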

Phylogenetic searches: think about moving through an abstract treespace. Nearby points in this treespace are connected by NNI (nearest-neighbor interchange) branch swaps.

Moving through treespace: NNI branch swaps (break an internal branch, then reassemble).

The Schoenberg graph: edges connect NNI neighbors among the 15 unrooted trees for the five taxa A-E. (figure courtesy of Joe Felsenstein)

Treespace, with nearby trees connected by NNI moves.
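In code, an NNI move around one internal branch can be sketched in terms of the four subtrees the branch separates (a hypothetical representation for illustration, not GARLI's data structure):

```python
def nni_neighbors(a, b, c, d):
    """The four subtrees around an internal branch are grouped (a,b)|(c,d).
    Breaking the branch and reassembling gives exactly two new groupings."""
    return [((a, c), (b, d)), ((a, d), (b, c))]

# An unrooted binary tree with n taxa has n-3 internal branches,
# so each tree has 2(n-3) NNI neighbors in treespace.
print(nni_neighbors("A", "B", "C", "D"))
```

With only 2(n-3) neighbors per tree, a pure NNI search explores treespace very locally, motivating the longer-range SPR and TBR moves below.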

Subtree Pruning-Regrafting (SPR) and Tree Bisection-Reconnection (TBR): SPR maintains the subtree rooting, while TBR tries all possible rootings when reattaching. (figure courtesy of Paul Lewis)

SPR/TBR moves in NNI treespace: a single SPR or TBR move can jump to trees that are many NNI steps away.

GARLI: Genetic Algorithm for Rapid Likelihood Inference
- Descendant of GAML (Lewis, 1998)
- Development goal: make ML phylogenetic searches feasible for large datasets

GARLI
- Stochastic, genetic algorithm-like approach instead of deterministic hill climbing
- Gradually optimizes tree topology, branch lengths, and model parameters
- Accurate ML tree inference on large datasets (hundreds of sequences) in hours

The Genetic Algorithm
Computational analogue of evolution by natural selection. A few simple requirements:
- Measure of fitness
- Method of selection
- Mutation operators
- Recombination operators

GA terminology
- Individual (topology + model parameter values + branch-length values)
- Population
- Fitness (log-likelihood)
- Selection function (fitness proportional)
- Generation

One generation
1. Create an initial population of individuals (each holding a topology T, model parameters p1-p4, and branch lengths)
2. Apply stochastic mutations to individuals and/or recombine
3. Partially optimize and score (lnL) the mutated individuals
4. Use the selection function to choose parents for the next generation
Repeat many, many times.
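The generation loop can be sketched generically. The `fitness` and `mutate` callables here are hypothetical stand-ins; in GARLI, fitness would be the partially optimized log-likelihood of a tree plus its parameters:

```python
import random

def evolve(population, fitness, mutate, generations=500):
    """Skeleton of a GA-style search: mutate, score, select, repeat."""
    for _ in range(generations):
        # Apply stochastic mutations to copies of the individuals
        candidates = population + [mutate(ind) for ind in population]
        # Score everyone; selection is fitness-proportional, so shift
        # scores to be positive (log-likelihoods are negative)
        scores = [fitness(ind) for ind in candidates]
        low = min(scores)
        weights = [s - low + 1e-9 for s in scores]
        # Choose parents for the next generation
        population = random.choices(candidates, weights=weights,
                                    k=len(population))
    return max(population, key=fitness)

# Toy usage: evolve a number toward the peak of -(x - 5)^2
random.seed(2)
best = evolve([0.0] * 10,
              fitness=lambda x: -(x - 5.0) ** 2,
              mutate=lambda x: x + random.gauss(0, 0.5))
```

Including the parents among the candidates gives a mild elitism, so good solutions are not lost between generations.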

Maximized likelihood: pros By definition, our goal is to find the topology with the highest maximized likelihood. If we calculate it for every tree we examine, we'll always know which tree is the best.

Maximized likelihood: cons Fully optimizing a single branch length can require significant computation. When one parameter changes, the optimal values of all others also change.

Maximized likelihood: cons As optimization is applied (and applied, and applied, ...), the maximized likelihood is only approached asymptotically. This optimization is the majority of the computation required in ML inference, and it grows quickly with the number of sequences.

Heuristic runtimes
Inference time = (# of topologies to evaluate) x (time to evaluate each)
Both factors are strongly a function of the # of sequences when calculating the maximized likelihood.
(search cycle: starting topology -> modify topology -> evaluate new topology -> compare topology scores -> select topology)

Avoiding the maximized likelihood We want to accurately judge the merits of topologies as if we had the maximized likelihood, but without actually calculating it. We'll explore the idea of an approximate likelihood score for topologies.

How accurate does a tree likelihood estimate need to be?
(figure: maximized likelihoods L of three trees A, B, and C, with the acceptable range of an estimate marked; when the tree scores are more similar, the acceptable range narrows)

How important are branch-length values? (three example branches in a specific 64-taxon tree) A branch length 4x greater than its optimum can make the entire topology 50 lnL worse!

Branch length importance If even one branch length is far from optimal, the estimated likelihood will not be useful. How can we get around optimizing every branch length on every tree?

Using topological similarity Successive trees are created by slightly modifying an existing tree. We can capitalize on this when dealing with branch-length parameters.

Searching with approximate likelihoods Branch lengths are optimized on a starting topology

Altering the tree: subtree pruning-regrafting (SPR). (figures: a subtree is pruned from the tree and regrafted at a new location)

Scoring and optimizing the new topology: at the regrafting point a branch is split, and at the pruning point branches are fused. Do optimal branch lengths change elsewhere?

Where do optimal branch lengths change? They change appreciably only near the altered branches, and this radius is not strongly dependent on tree size!

GARLI's post-swap optimization
Optimization rules:
1. Optimize the 3 proximal branches until near their optimal values
2. Propagate optimization outward to other branches
3. If a branch length is far from its optimum, continue to propagate outward
4. After propagation, return to changed branches for another optimization pass
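These rules amount to an outward propagation that stops where improvements become negligible. A minimal sketch, in which the branch adjacency map and the `improve` callable are hypothetical stand-ins for the real tree structure and single-branch optimization:

```python
from collections import deque

def localized_optimization(neighbors, improve, start_branches, threshold=0.01):
    """Sketch of localized post-swap branch-length optimization.
    `neighbors` maps a branch to its adjacent branches; `improve(branch)`
    re-optimizes one branch length and returns its lnL gain.
    Optimization propagates outward only while gains remain large."""
    queue = deque(start_branches)            # branches touched by the swap
    visited = set()
    while queue:
        branch = queue.popleft()
        if branch in visited:
            continue
        visited.add(branch)
        gain = improve(branch)               # one optimization pass
        if gain > threshold:                 # still far from its optimum:
            queue.extend(neighbors[branch])  # keep propagating outward
    return visited                           # branches that were re-optimized

# Toy path-shaped "tree" of branches 0-1-2-3-4; only branches near the
# swap point (0 and 1) have anything to gain, so propagation stops early.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
gains = {0: 0.5, 1: 0.2, 2: 0.0, 3: 0.0, 4: 0.0}
touched = localized_optimization(adj, lambda b: gains[b], [0])
print(sorted(touched))  # [0, 1, 2]
```

Because the frontier stalls as soon as gains fall below the threshold, the work done per swap depends on the local radius of disturbance rather than on the total number of branches in the tree.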

Topology evaluation times (normalized with respect to # of site patterns)

Conclusions GARLI's localized method makes branch-length optimization largely independent of the number of sequences. Several other fast ML heuristics (PHYML, RAxML) also owe much of their speed to localized optimization.

Using GARLI in practice
- Performance comparisons (brief)
- Allowed models
- Search strategies

Performance comparisons against other software are more subjective than one would like:
- What constitutes comparable analyses?
- What criteria should be used to compare methods?
- Models and likelihood values are often not exactly comparable
- Most software can be tuned to perform better on any particular dataset
- Simulated datasets are far too easy to analyze

Performance comparison: 228-taxon x 4811-nucleotide dataset. (figure: log-likelihood, roughly -122025 to -122060, vs. runtime in hours for GARLI-GAMMA, RAxML-GAMMA, and RAxML-CAT; several GARLI runs scored 100s of lnL units worse)

ML tree inference software
Some of the most used (alphabetically): GARLI, PAUP*, PHYML, RAxML
- For small datasets (< 50 taxa), all of the ML tree inference programs perform well
- For large datasets (hundreds of sequences): PAUP* is very rigorous but slowest; RAxML is generally the fastest; GARLI often has a slight edge over RAxML in optimality (although often more variability); RAxML is very efficient for huge datasets (1000+ sequences)

ML tree inference software NOTE: There can be substantial differences in which program performs best depending on the specific dataset!

Search strategies in GARLI Multiple search replicates must ALWAYS be done. If variable results are seen across search replicates, make changes to improve the search and/or do more search replicates.

Search repeatability and multiple replicates

Search difficulty On average: more sequences = worse; more characters (signal) = better. The # of parsimony-informative sites is a better indicator of signal than the total # of sites.

Tuning search intensity There is a tradeoff between search intensity and runtimes, and not always a direct relationship between search intensity and solution optimality. Given a certain amount of time, how can we best use it?

Balancing search intensity and runtimes: given H hours, 3 thorough searches are each more likely to find the global optimum per run, but 6 fast searches may be more likely to find the global optimum within the H hours.

Practical search recommendations
- Search repeatability is an indicator of how analyses are going (much like convergence of independent MCMC runs)
- Saturating the search space (lots of searches) may be better than very long searches
- On some large datasets, you are unlikely to find the same tree twice

How else can I speed up/improve searches?
- Eliminate identical sequences!
- Constrained tree searches won't help (in GARLI)
- Starting tree: providing a decent (potentially unresolved) starting tree can help on large datasets

What about bootstrapping?
- GARLI can run multiple search replicates per bootstrap reweighting, with the best-scoring tree saved
- More intense searches add up quickly when bootstrapping
- Find the fastest settings that give consistent results on the full data, and use those for bootstrap searches

Evolutionary models
GARLI is geared toward model flexibility and rigorous parameter estimation. Model types:
- Any GTR submodel for nucleotides
- Various common amino acid models
- Simple codon models
- Non-sequence data (Mk and Mkv)
- Partitioned models
- Indel models (unreleased)

How/when to partition
- PartitionFinder may prove to be a great approach to partitioned model selection
- Smaller subsets increase sampling error and lead to parameter-estimation difficulties and model breakdown
- Over-partitioning may have serious consequences in ML inference, fewer in Bayesian inference

Non-bifurcating trees GARLI returns trees with polytomies when branches have an optimal length of zero, but some programs do not. This can become very important in low-divergence phylogenomic studies.

Assorted GARLI features
- A single data file may be analyzed at the nucleotide, amino acid, and codon levels without making changes to it
- Multithreaded version for multiple CPU cores
- MPI version simplifies bootstrapping on clusters
- Full checkpointing
- Topological constraints (positive, negative, backbone)

Other assorted GARLI features
- Specification and fixation of model parameter values
- Site-likelihood output for all models, including partitioned, for input into CONSEL, etc.
- Ancestral state reconstruction for all models
- Eventually: Beagle GPU version

Computer exercises