Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters

Similar documents
Scaling species tree estimation methods to large datasets using NJMerge

Dynamic Programming for Phylogenetic Estimation

FastJoin, an improved neighbor-joining algorithm

Distance-based Phylogenetic Methods Near a Polytomy

Supplementary Material, corresponding to the manuscript Accumulated Coalescence Rank and Excess Gene count for Species Tree Inference

CS 581. Tandy Warnow

Introduction to Triangulated Graphs. Tandy Warnow

Evolutionary tree reconstruction (Chapter 10)

Supplementary Online Material PASTA: ultra-large multiple sequence alignment

PASTA: Ultra-Large Multiple Sequence Alignment

Efficient Quartet Representations of Trees and Applications to Supertree and Summary Methods

Genetics/MBT 541 Spring, 2002 Lecture 1 Joe Felsenstein Department of Genome Sciences Phylogeny methods, part 1 (Parsimony and such)

Sequence length requirements. Tandy Warnow Department of Computer Science The University of Texas at Austin

PLOS Currents Tree of Life. Performance on the six smallest datasets

Algorithms for Bioinformatics

DIMACS Tutorial on Phylogenetic Trees and Rapidly Evolving Pathogens. Katherine St. John City University of New York 1

Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony

Phylogenetics on CUDA (Parallel) Architectures Bradly Alicea

ML phylogenetic inference and GARLI. Derrick Zwickl. University of Arizona (and University of Kansas) Workshop on Molecular Evolution 2015

Lecture 20: Clustering and Evolution

Parallel Implementation of a Quartet-Based Algorithm for Phylogenetic Analysis

Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods

Lecture 20: Clustering and Evolution

EVOLUTIONARY DISTANCES INFERRING PHYLOGENIES

Parallelizing SuperFine

Distance based tree reconstruction. Hierarchical clustering (UPGMA) Neighbor-Joining (NJ)

MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 forbiggerdatasets

A fast neighbor joining method

4/4/16 Comp 555 Spring

Recent Research Results. Evolutionary Trees Distance Methods

11/17/2009 Comp 590/Comp Fall

Rapid Neighbour-Joining

A Lookahead Branch-and-Bound Algorithm for the Maximum Quartet Consistency Problem

Rapid Neighbour-Joining

Algorithms for Ultra-large Multiple Sequence Alignment and Phylogeny Estimation

BIR pipeline steps and subsequent output files description STEP 1: BLAST search

Olivier Gascuel Arbres formels et Arbre de la Vie Conférence ENS Cachan, septembre Arbres formels et Arbre de la Vie.

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering

On the Optimality of the Neighbor Joining Algorithm

Codon models. In reality we use codon model Amino acid substitution rates meet nucleotide models Codon(nucleotide triplet)

Applied Mathematics Letters. Graph triangulations and the compatibility of unrooted phylogenetic trees

MultiPhyl: a high-throughput phylogenomics webserver using distributed computing

ABOUT THE LARGEST SUBTREE COMMON TO SEVERAL PHYLOGENETIC TREES Alain Guénoche 1, Henri Garreta 2 and Laurent Tichit 3

of the Balanced Minimum Evolution Polytope Ruriko Yoshida

Basics of Multiple Sequence Alignment

Seeing the wood for the trees: Analysing multiple alternative phylogenies

A Statistical Test for Clades in Phylogenies

INFERRING OPTIMAL SPECIES TREES UNDER GENE DUPLICATION AND LOSS

Computational Genomics and Molecular Biology, Fall

CSE 549: Computational Biology

SPR-BASED TREE RECONCILIATION: NON-BINARY TREES AND MULTIPLE SOLUTIONS

Parsimony-Based Approaches to Inferring Phylogenetic Trees

Hybrid Parallelization of the MrBayes & RAxML Phylogenetics Codes

Introduction to Computational Phylogenetics

A practical O(n log 2 n) time algorithm for computing the triplet distance on binary trees

Special course in Computer Science: Advanced Text Algorithms

Comparison of commonly used methods for combining multiple phylogenetic data sets

The Performance of Phylogenetic Methods on Trees of Bounded Diameter

Reconstruction of Large Phylogenetic Trees: A Parallel Approach

A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony

A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony

Answer Set Programming or Hypercleaning: Where does the Magic Lie in Solving Maximum Quartet Consistency?

8/19/13. Computational problems. Introduction to Algorithm

CLC Phylogeny Module User manual

in interleaved format. The same data set in sequential format:

BIOINFORMATICS. Time and memory efficient likelihood-based tree searches on phylogenomic alignments with missing data

Improvement of Distance-Based Phylogenetic Methods by a Local Maximum Likelihood Approach Using Triplets

Computing the All-Pairs Quartet Distance on a set of Evolutionary Trees

Chordal Graphs and Evolutionary Trees. Tandy Warnow

What is a phylogenetic tree? Algorithms for Computational Biology. Phylogenetics Summary. Di erent types of phylogenetic trees

Lecture 5: Multiple sequence alignment

Approaches to Efficient Multiple Sequence Alignment and Protein Search

Fast Algorithms for Large-Scale Phylogenetic Reconstruction

Constrained Minimum Spanning Tree Algorithms

The CIPRES Science Gateway: Enabling High-Impact Science for Phylogenetics Researchers with Limited Resources

CLUSTERING IN BIOINFORMATICS

A Randomized Algorithm for Comparing Sets of Phylogenetic Trees

A high-throughput bioinformatics distributed computing platform

Alignment of Trees and Directed Acyclic Graphs

Multiple sequence alignment based on combining genetic algorithm with chaotic sequences

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst

Computational Molecular Biology

Fast Parallel Maximum Likelihood-based Protein Phylogeny

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment

Neighbor Joining Plus - algorithm for phylogenetic tree reconstruction with proper nodes assignment.

Generalized Neighbor-Joining: More Reliable Phylogenetic Tree Reconstruction

Tracy Heath Workshop on Molecular Evolution, Woods Hole USA

A New Approach For Tree Alignment Based on Local Re-Optimization

Distance Methods. "PRINCIPLES OF PHYLOGENETICS" Spring 2006

Workshop Practical on concatenation and model testing

Parallel Multiple Sequence Alignment with Local Phylogeny Search by Simulated Annealing

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment

A multiple alignment tool in 3D

World Academy of Science, Engineering and Technology International Journal of Bioengineering and Life Sciences Vol:11, No:6, 2017

INFERENCE OF PARSIMONIOUS SPECIES TREES FROM MULTI-LOCUS DATA BY MINIMIZING DEEP COALESCENCES CUONG THAN AND LUAY NAKHLEH

DISTANCE BASED METHODS IN PHYLOGENTIC TREE CONSTRUCTION

Prospects for inferring very large phylogenies by using the neighbor-joining method. Methods

A RANDOMIZED ALGORITHM FOR COMPARING SETS OF PHYLOGENETIC TREES

In this section we describe how to extend the match refinement to the multiple case and then use T-Coffee to heuristically compute a multiple trace.

NSGA-II for Biological Graph Compression

Transcription:

Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters Erin Molloy University of Illinois at Urbana Champaign General Allocation (PI: Tandy Warnow) Exploratory Allocation (PI: Bill Gropp) NCSA Blue Waters Symposium June 5, 2018 Molloy (PI: Warnow; PI: Gropp) TERADACTAL 1/56

Phylogeny Molloy (PI: Warnow; PI: Gropp) TERADACTAL 2/56

Tree of Life and Downstream Applications Nothing in biology makes sense except in the light of evolution Dobhzansky (1973) Protein structure and function prediction Population genetics Human migrations Metagenomics Infectious Disease Biodiversity Molloy (PI: Warnow; PI: Gropp) TERADACTAL 3/56

Outline Phylogeny estimation pipeline Some standard approaches and their limitations Our new approach to ultra-large phylogeny estimation on Blue Waters Comparison of our approach to existing methods Future work Molloy (PI: Warnow; PI: Gropp) TERADACTAL 4/56

DNA Sequence Evolution Molloy (PI: Warnow; PI: Gropp) TERADACTAL 5/56

Phylogeny Estimation Pipelines Molloy (PI: Warnow; PI: Gropp) TERADACTAL 6/56

Phylogeny Estimation Pipelines Molloy (PI: Warnow; PI: Gropp) TERADACTAL 7/56

Phylogeny Estimation Pipelines Molloy (PI: Warnow; PI: Gropp) TERADACTAL 8/56

Phylogeny Estimation Pipelines Molloy (PI: Warnow; PI: Gropp) TERADACTAL 9/56

Phylogeny Estimation Pipelines Molloy (PI: Warnow; PI: Gropp) TERADACTAL 10/56

Phylogeny Estimation Pipelines Molloy (PI: Warnow; PI: Gropp) TERADACTAL 11/56

Distance Methods Input: A matrix D such that D[i, j] indicates the distance between sequence i and sequence j Use multiple sequence alignment to compute pairwise distances Use alignment-free method to compute pairwise distances in an embarrassingly parallel fashion Output: A tree with branch lengths Distance methods use polynomial time. Molloy (PI: Warnow; PI: Gropp) TERADACTAL 12/56

Maximum Likelihood (ML) Tree Estimation Input: A multiple sequence alignment Output: A model tree (topology and other numerical parameters) that maximizes likelihood, that is, the probability of observing the multiple sequence alignment given a model tree The ML tree estimation problem is NP-hard. [Felsenstein,1981; Roch, 2006] ML Heuristic are typically more accurate than distance methods, especially under some challenging model conditions. Molloy (PI: Warnow; PI: Gropp) TERADACTAL 13/56

Maximum Likelihood (ML) Tree Estimation Input: A multiple sequence alignment Output: A model tree (topology and other numerical parameters) that maximizes likelihood or probability of observing the multiple sequence alignment given a model tree The ML tree estimation problem is NP-hard. [Felsenstein,1981; Roch, 2006] ML Heuristic are typically more accurate than distance methods, especially under some challenging model conditions. Molloy (PI: Warnow; PI: Gropp) TERADACTAL 14/56

ML Tree Estimation: N versus L A multiple sequence alignment is an N L matrix. Number of sites or alignment length, L Thousands of sites for single gene analysis Millions of sites for multi-gene analysis Many analyses can be parallelized across sites Because likelihood is computed on each site independently Number of species or sequences, N Many datasets have thousands of species The Tree of Life will have millions Number of tree topologies on N leaves is (2N 5)!! Parallelism is more complicated Molloy (PI: Warnow; PI: Gropp) TERADACTAL 15/56

ML Tree Estimation: N versus L [Stamatakis, 2006; Price et al., 2010; Kozlov et al., 2015] Molloy (PI: Warnow; PI: Gropp) TERADACTAL 16/56

ML Tree Estimation: N versus L [Stamatakis, 2006; Price et al., 2010; Kozlov et al., 2015] Molloy (PI: Warnow; PI: Gropp) TERADACTAL 17/56

ML Tree Estimation: N versus L [Stamatakis, 2006; Price et al., 2010; Kozlov et al., 2015] Molloy (PI: Warnow; PI: Gropp) TERADACTAL 18/56

ML Tree Estimation: N versus L ExaML [Kozlov et al., 2015] can be used for analyzing datasets with 10-20 genes and up to 55,000 taxa, but scalability will be limited to at most 100 cores. Molloy (PI: Warnow; PI: Gropp) TERADACTAL 19/56

ML Tree Estimation: N versus L Molloy (PI: Warnow; PI: Gropp) TERADACTAL 20/56

ML Tree Estimation: N versus L Molloy (PI: Warnow; PI: Gropp) TERADACTAL 21/56

Divide-and-Conquer with DACTAL [Nelesen et al., 2012] [Steel, 1992; Jiang et al., 2001; Bansal et al., 2010] Molloy (PI: Warnow; PI: Gropp) TERADACTAL 22/56

Divide-and-Conquer with DACTAL [Nelesen et al., 2012] [Steel, 1992; Jiang et al., 2001; Bansal et al., 2010] Molloy (PI: Warnow; PI: Gropp) TERADACTAL 23/56

Divide-and-Conquer with DACTAL [Nelesen et al., 2012] [Steel, 1992; Jiang et al., 2001; Bansal et al., 2010] Molloy (PI: Warnow; PI: Gropp) TERADACTAL 24/56

Divide-and-Conquer with DACTAL [Nelesen et al., 2012] [Steel, 1992; Jiang et al., 2001; Bansal et al., 2010] Molloy (PI: Warnow; PI: Gropp) TERADACTAL 25/56

Divide-and-Conquer with DACTAL [Nelesen et al., 2012] [Steel, 1992; Jiang et al., 2001; Bansal et al., 2010] Molloy (PI: Warnow; PI: Gropp) TERADACTAL 26/56

Divide-and-Conquer with DACTAL [Nelesen et al., 2012] [Steel, 1992; Jiang et al., 2001; Bansal et al., 2010] Molloy (PI: Warnow; PI: Gropp) TERADACTAL 27/56

Our Design Goals If we want to build the Tree of Life (millions of sequences!) using Blue Waters, then we need to design an algorithm that can utilize a very large numbers of (not high memory) processors. Based on our observations, it would be good to avoid estimating 1 Multiple sequence alignment on the full set of sequences 2 ML tree on the full multiple sequence alignment 3 Supertree Molloy (PI: Warnow; PI: Gropp) TERADACTAL 28/56

Divide-and-Conquer with DACTAL [Nelesen et al., 2012] [Steel, 1992; Jiang et al., 2001; Bansal et al., 2010] Molloy (PI: Warnow; PI: Gropp) TERADACTAL 29/56

TERADACTAL Algorithm Molloy (PI: Warnow; PI: Gropp) TERADACTAL 30/56

TreeMerge: Step 1 Create a minimum spanning tree on the disjoint subsets. [Kruskal, 1956] Molloy (PI: Warnow; PI: Gropp) TERADACTAL 31/56

TreeMerge: Step 1 Create a minimum spanning tree on the disjoint subsets. [Kruskal, 1956] Molloy (PI: Warnow; PI: Gropp) TERADACTAL 32/56

TreeMerge: Step 2 Merge trees T i and T j into tree T ij such that T ij Li = T i and T ij Lj = T j (multiple solutions). 1 Merge two alignments A i and A j into A ij using an existing technique (e.g., OPAL [Wheeler and Kececioglu, 2007]). 2 Compute distance matrix D ij from merged alignment A ij. 3 Run NJMerge our variant of Neighbor-Joining [Saitou and Nei, 1987] that takes both a distance matrix and a set of constraint trees as input. Molloy (PI: Warnow; PI: Gropp) TERADACTAL 33/56

TreeMerge: Step 2 Merge trees T i and T j into tree T ij such that T ij Li = T i and T ij Lj = T j (multiple solutions). 1 Merge two alignments A i and A j into A ij using an existing technique (e.g., OPAL [Wheeler and Kececioglu, 2007]). 2 Compute distance matrix D ij from merged alignment A ij. 3 Run NJMerge our variant of Neighbor-Joining [Saitou and Nei, 1987] that takes both a distance matrix and a set of constraint trees as input. Molloy (PI: Warnow; PI: Gropp) TERADACTAL 34/56

NJMerge: Constrained Neighbor-Joining Molloy (PI: Warnow; PI: Gropp) TERADACTAL 35/56

NJMerge: Constrained Neighbor-Joining Molloy (PI: Warnow; PI: Gropp) TERADACTAL 36/56

TreeMerge: Step 3 Combine pairs of merged trees using branch lengths, e.g., trees T ij and T jk are combined through T j (blue). NOTE: Unlabeled branches have length one for simplicity. Molloy (PI: Warnow; PI: Gropp) TERADACTAL 37/56

TreeMerge: Step 3 Combine pairs of merged trees using branch lengths, e.g., trees T ij and T jk are combined through T j (blue). NOTE: Unlabeled branches have length one for simplicity. Molloy (PI: Warnow; PI: Gropp) TERADACTAL 38/56

Advantages TERADACTAL No multiple sequence alignment estimation on the full dataset No Maximum Likelihood tree estimation on the full dataset No supertree estimation TreeMerge Polynomial Time Parallel (Step 2) Pairs of alignments and trees can be merged in an embarrassingly parallel fashion. (Step 3) Merged tree pairs can be combined in parallel, as long as they do not share edges in the minimum spanning tree. Molloy (PI: Warnow; PI: Gropp) TERADACTAL 39/56

Simulation Studies Molloy (PI: Warnow; PI: Gropp) TERADACTAL 40/56

Quantifying Tree Error Molloy (PI: Warnow; PI: Gropp) TERADACTAL 41/56

Simulation Study Compare TERADACTAL to 2 alignment-free methods 3 multiple sequence alignment methods (2 shown) 2 distance methods (1 shown) 2 Maximum Likelihood methods (1 Shown) on simulated datasets from Mirarab et al., 2015. 10,000 sequences 4 model conditions each with 10 replicate datasets (1 shown) Molloy (PI: Warnow; PI: Gropp) TERADACTAL 42/56

TERADACTAL Iterations 0.8.71 INDELible M2 0.6 Error Rate 0.4 0.2.14.13.12.12.11 0.0 Iteration 0 1 2 3 4 5 Molloy (PI: Warnow; PI: Gropp) TERADACTAL 43/56

TERADACTAL versus Alignment-Free Methods 0.6.61 INDELible M2 Error Rate 0.4 0.2.41.11 0.0 Method RapidNJ+JD2Stat RapidNJ+KMACS TERADACTAL [Simonsen et al., 2008; Chan et al., 2014; Leimeister and Morgenstern, 2014] Molloy (PI: Warnow; PI: Gropp) TERADACTAL 44/56

TERADACTAL versus Two-Phase Methods (NJ) 0.8.78 INDELible M2 Error Rate 0.6 0.4 0.2.14.11 0.0 Method RapidNJ+MAFFT RapidNJ+PASTA TERADACTAL [Simonsen et al., 2008; Katoh and Standley, 2013; Mirarab et al., 2015] Molloy (PI: Warnow; PI: Gropp) TERADACTAL 45/56

TERADACTAL versus Two-Phase Methods (ML).65 INDELible M2 0.6 Error Rate 0.4 0.2.08.11 0.0 Method RAxML+MAFFT RAxML+PASTA TERADACTAL [Stamatakis, 2006; Katoh and Standley, 2013; Mirarab et al., 2015] Molloy (PI: Warnow; PI: Gropp) TERADACTAL 46/56

Conclusions We designed, prototyped, and tested a method that achieves similar error rates to the leading two-phase phylogeny estimation methods but is highly parallel and avoids 1 Multiple sequence alignment estimation on the full dataset 2 Maximum likelihood tree estimation on the full dataset 3 Supertree estimation Molloy (PI: Warnow; PI: Gropp) TERADACTAL 47/56

Future Work Scale out to one million sequences. Molloy (PI: Warnow; PI: Gropp) TERADACTAL 48/56

Why Blue Waters Blue Waters was used to Demonstrate that codes (e.g., FastTree-2, PASTA, RAxML) could not run on Blue Waters on datasets with 1 million sequences (Run out of memory!) Simulation study completed in < 1 month but would have required > 1 year using our 4 campus cluster nodes Molloy (PI: Warnow; PI: Gropp) TERADACTAL 49/56

Research Products Paper under review at the 17th European Conference on Computational Biology (ECCB 2018). Github: github.com/ekmolloy/teradactal-prototype Molloy (PI: Warnow; PI: Gropp) TERADACTAL 50/56

Other Research Products from General Allocation Nute, Saleh & Warnow Benchmarking statistical multiple sequence alignment. Under review at Systematic Biology. Nute, Chou, Molloy & Warnow (2018). The Performance of Coalescent-Based Species Tree Estimation Methods under Models of Missing Data. BMC Genomics 19(Suppl 5):286. Vachaspati & Warnow (2018). SVDquest: Improving SVDquartets species tree estimation using exact optimization within a constrained search space. Mol Phyl Evol 124:122-136. Github: github.com/pranjalv123/svdquest Vachaspati & Warnow (2018). SIESTA: Enhancing searches for optimal supertrees and species trees. BMC Genomics 19(Suppl 5):252. Github: github.com/pranjalv123/siesta Molloy (PI: Warnow; PI: Gropp) TERADACTAL 51/56

Acknowledgements This work was supported by the National Science Foundation Blue Waters Sustained-Petascale Computing Project (OCI-0725070 and ACI-1238993) General Allocation (PI: Warnow) Exploratory Allocation (PI: Gropp) Graduate Research Fellowship Program (DGE-1144245) Graph-Theoretic Algorithms to Improve Phylogenomic Analyses (CCF-1535977) and the Ira & Debra Cohen Graduate Fellowship in Computer Science. Molloy (PI: Warnow; PI: Gropp) TERADACTAL 52/56

References Felsenstein, J. (1981). Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution 17(6):368 376. Roch, S. (2006). A short proof that phylogenetic tree reconstruction by maximum likelihood is hard. IEEE/ACM Transactions on Computational Biology and Bioinformatics 3(1):92 94. Stamatakis, A. (2006). RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22(21):2688 2690. Price, M.N., P. S. Dehal, and A. P. Arkin. (2010). FastTree 2 Approximately Maximum Likelihood Trees for Large Alignments. PLoS ONE 5(3):1 10. Molloy (PI: Warnow; PI: Gropp) TERADACTAL 53/56

References Kozlov, A.M., A.J. Aberer, and A. Stamatakis. (2015). ExaML version 3: a tool for phylogenomic analyses on supercomputers. Bioinformatics 31. Nelesen, S., et al., (2012). DACTAL: Divide-And-Conquer Trees (almost) without Alignments. Bioinformatics 28(12):i274 i282. Steel, M. (1992). The complexity of reconstructing trees from qualitative characters and subtrees. Journal of Classification 9:91-116. Jiang, T., P. Kearney, and M. Li. (2001). A polynomial time approximation scheme for inferring evolutationay trees from quartet topologies and its application. SIAM Journal on Computing 30(6):1942 1961. Molloy (PI: Warnow; PI: Gropp) TERADACTAL 54/56

References Bansal, M.S., et al., (2010). Robinson-Foulds Supertrees. Algorithms for Molecular Biology 5(1). Kruskal, J. B. (1956). On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society 7:48 50. Saitou, N. and M. Nei. (1987). The neighbor-joining method: a new method for reconstruction of phylogenetic trees. Molecular Biology and Evolution 4: 406 425. Mirarab, S., et al., (2015). PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences. Journal of Computational Biology 22(5):377 386. Molloy (PI: Warnow; PI: Gropp) TERADACTAL 55/56

References Wheeler, T.J. and J. D. Kececioglu. (2007). Multiple alignment by aligning alignments. Bioinformatics 23(13):i559. Chan, C.X. et al., (2014). Inferring phylogenies of evolving sequences without multiple sequence alignment. Scientific Reports 4:6504. Leimeister, C.-A. and B. Morgenstern. (2014). kmacs: the k-mismatch Average Common Substring Approach to alignment-free sequence comparison. Bioinformatics 30(14):2000 2008. Simonsen, M., T. Mailund, and C.N.S. Pedersen. (2008) Rapid Neighbour-Joining. Algorithms in Bioinformatics 5251:113 122. Katoh, K. and D.M. Standley. (2013). MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Molecular Biology and Evolution 30(4):772. Molloy (PI: Warnow; PI: Gropp) TERADACTAL 56/56