Dynamic Programming for Phylogenetic Estimation

1 / 45 Dynamic Programming for Phylogenetic Estimation CS598AGB Pranjal Vachaspati University of Illinois at Urbana-Champaign

2 / 45 Coalescent-based Species Tree Estimation
Find the evolutionary tree for the species, which may differ from the evolutionary trees of individual genes due to the multi-species coalescent
[Figure: species tree with branches labeled p1 and p2]
Two separate lineages coalesce in a branch with probability p
p1 = p2 = 0.99: What if evolution is very slow (or populations very small)?
p1 = p2 = 0.01: What if evolution is very quick (or populations very large)?

3 / 45 How to compute an unrooted species tree from unrooted gene trees?
Theorem (Allman et al., 2011, and others): For every four leaves {a,b,c,d}, the most probable unrooted quartet tree on {a,b,c,d} is the true species tree restricted to {a,b,c,d}.
Use the All Quartets Method to construct the species tree, based on the most frequent gene tree topology for each set of four species:
1. Estimate a species tree for every 4 species
2. Combine the unrooted 4-taxon trees
[Figure: gene trees and the combined species tree, drawn rooted; pretend they are unrooted]

4 / 45 How do we go from gene trees to quartets?
1. Take the most common quartet - fine if there are lots of accurate trees
2. Weight quartets by the number of times they appear in the gene trees
3. SVDQuartets (Chifman and Kubatko, 2014): estimate the best quartets from sequence data, avoiding the effects of gene tree estimation error!
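Option 2 above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: it assumes gene trees are given as rooted nested tuples of leaf names, and the function names (`clusters`, `quartet_topology`, `quartet_weights`) are hypothetical. It relies on the fact that a tree induces the unrooted quartet ab|cd exactly when some cluster (the leaf set below a node) contains exactly two of the four leaves.

```python
from collections import Counter
from itertools import combinations


def clusters(tree):
    """Return (leaf set, clusters of internal nodes) for a nested-tuple tree."""
    if not isinstance(tree, tuple):          # a leaf name
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clusters(child)
        leaves |= child_leaves
        found += child_found
    found.append(leaves)
    return leaves, found


def quartet_topology(tree_clusters, four):
    """Return the induced split, e.g. {{a,b},{c,d}}, or None if unresolved."""
    four = frozenset(four)
    for c in tree_clusters:
        side = c & four
        if len(side) == 2:                   # an edge separates two from two
            return frozenset([side, four - side])
    return None


def quartet_weights(gene_trees):
    """Count how many gene trees induce each quartet topology."""
    counts = Counter()
    for tree in gene_trees:
        leaves, tree_clusters = clusters(tree)
        for four in combinations(sorted(leaves), 4):
            topology = quartet_topology(tree_clusters, four)
            if topology is not None:
                counts[topology] += 1
    return counts
```

Enumerating all four-leaf subsets of every tree is only practical for small trees; it is meant to show what the weights are, not how to compute them efficiently.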

5 / 45 Unfortunately, SVDQuartets is frequently less accurate than ASTRAL-2
[Figure: 100 taxa, low and medium levels of ILS]
PAUP* is a software package containing a reference implementation of SVDQuartets

6 / 45 All Quartets Method doesn't really work
Really slow
Fails if the quartets are not all compatible
Doesn't work for anything other than picking the most common quartet
Finding the tree that maximizes support over quartets is NP-hard in general
So we need some tricks.

7 / 45 Supertree Estimation - another important phylogenetic problem
Combine sparse scaffold + dense trees on subsets
Like coalescent-based species tree estimation, many methods - some more accurate than others
MRL: most accurate method in the literature; finds the maximum likelihood tree under a symmetric two-state model, using the MRP character matrix
MulRF: less accurate method that finds a tree minimizing the RF distance to the input trees

8 / 45 MulRF is frequently less accurate than MRL

9 / 45 What makes a method accurate?
Optimization criterion: we want a tree that is as close to reality as possible, but since we don't know the true tree, we have to optimize some other function
Optimization algorithm: optimizing these criteria is usually NP-hard, so we can't solve them exactly
We can't tell the difference between a poor criterion and a good criterion with a bad optimization algorithm!
We have developed a general-purpose optimization algorithm that does an exact constrained search for a variety of optimization problems

10 / 45 Advantages of our approach
1. Often finds better solutions than existing heuristic algorithms
2. Allows for an unbiased comparison between optimization criteria
3. Can be easily adapted to new optimization criteria

11 / 45 Exact Constrained Search with Dynamic Programming
Theory introduced by Bryant and Steel (2001) for weighted quartet support
  Input: quartet weights, constraint set X
  Output: species tree constrained by X maximizing induced quartet weight
Used for minimizing deep coalescences (Than and Nakhleh, 2009)
  Input: cluster compatibility graph from gene trees, constraint set X
  Output: species tree constrained by X minimizing deep coalescences over gene trees
Used by ASTRAL (Mirarab et al., 2014) and ASTRAL-II (Mirarab and Warnow, 2015) for quartet support
  Input: gene trees, constraint set X
  Output: species tree constrained by X maximizing support over gene trees
  Currently the most accurate coalescent-based species tree estimation method
Our algorithm is a generic version of the above techniques that can solve a wide variety of problems

12 / 45 Weighted Quartet Support
Input: A mapping of quartets (four-leaf trees) to real-valued weights
Output: A tree with optimum weight (the sum of the weights of its quartets)
Observation: In a rooted tree that induces quartet ((a, b), (c, d)), there is a unique LCA for at least three of a, b, c, d
[Figure: two rooted trees on leaves a, b, c, d]
Definition: The weight of a node is the sum of the weights of the corresponding quartets

13 / 45 Exact optimization scheme: Weighted Quartet Support
Observation: The weight of the tree is the sum of the weights of its nodes!
Calculate the weight of a tripartition corresponding to a node without knowing the structure of the rest of the tree
[Figure: tree on taxa A-J with a node splitting ABCD from EFG, and HIJ outside]
The weight of the best subtree on ABCDEFG that has the ABCD vs EFG split at the root is the sum of
1. The weight of the tripartition ABCD, EFG, HIJ
2. The weight of the best subtree on ABCD
3. The weight of the best subtree on EFG
If we have the best subtrees on all subsets, we can try every combination to get the best subtree on ABCDEFG
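The three-part sum above can be written as a recurrence. The notation is an assumption introduced here for illustration, not taken from the slides: S is the full taxon set, w is the weight of a tripartition, and score(C) is the weight of the best subtree on the cluster C.

```latex
\[
\mathrm{score}(C) \;=\;
\max_{\substack{C_1 \cup C_2 = C \\ C_1 \cap C_2 = \emptyset \\ C_1, C_2 \neq \emptyset}}
\Big[\, w(C_1 \mid C_2 \mid S \setminus C) + \mathrm{score}(C_1) + \mathrm{score}(C_2) \,\Big],
\qquad \mathrm{score}(\{x\}) = 0 .
\]
```

In the example, C = ABCDEFG, C_1 = ABCD, C_2 = EFG and S \ C = HIJ, so one candidate value for score(ABCDEFG) is w(ABCD | EFG | HIJ) + score(ABCD) + score(EFG).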

14 / 45 Constrain the search space to avoid exponential time
$2^n$ ways to divide a set of size n into two subsets
Instead of exploring the entire search space, only allow trees to contain bipartitions from an input set X
In practice, we use ASTRAL-2's set X, which is based on the set of bipartitions in the input gene trees. This can run into problems when the input trees are missing taxa or are poorly resolved
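A rough sketch of how a constraint set could be collected from gene trees follows. It does not reproduce ASTRAL-2's actual procedure; it only gathers the clusters seen in the input trees plus their complements. The nested-tuple tree format and the names `tree_clusters` and `constraint_set` are assumptions made for this example.

```python
def tree_clusters(tree, out):
    """Add every cluster (leaf set below a node) of a nested-tuple tree to `out`."""
    if not isinstance(tree, tuple):          # a leaf name
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(tree_clusters(child, out) for child in tree))
    out.add(leaves)
    return leaves


def constraint_set(gene_trees, taxa):
    """Collect the clusters of all gene trees, plus their complements, as X."""
    S = frozenset(taxa)
    X = set()
    for tree in gene_trees:
        tree_clusters(tree, X)
    # A bipartition has two halves; keep both so either side can act as a cluster.
    X |= {S - c for c in X if c != S}
    X.add(S)
    X.discard(frozenset())
    return X
```

Note that the complement is taken with respect to the full taxon set, so a gene tree that is missing taxa contributes halves that may not correspond to any edge of a complete tree. That is one way the problem mentioned on the slide can show up.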

15 / 45 The Constrained DP Algorithm for quartet support
Input: Constraint set X of bipartitions, list of quartet weights, over taxon set S
Output: Score of the species tree constrained by X maximizing the weight of induced quartets
For i in {1, ..., n} (n = number of taxa):
  For each cluster c (half of a bipartition) of size i in X:
    Set score(c) = 0
    For each cluster c' of size less than i:
      If c contains c', and c \ c' is in X:
        score(c) = max(score(c), score(c') + score(c \ c') + weight((c', c \ c', S \ c)))
$O(|X|^2)$ pairs of clusters
$O(n^4)$ time to count the quartets matching a tripartition
Running time: $O(|X|^2 n^4)$, can be reduced to $O(|X|^2 n^2 + n^4)$ with some precomputation (Bryant and Steel, 2001)
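The loop above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: clusters are `frozenset`s, the tripartition weight is passed in as a function `weight(c1, c2, rest)`, and the name `constrained_dp` is hypothetical. One deliberate difference from the pseudocode: a cluster that cannot be split into two clusters from X gets no score at all instead of a score of 0, so it can never be used in a solution.

```python
def constrained_dp(taxa, clusters, weight):
    """Return (best score, clusters of one best tree) over trees drawn from X.

    taxa:     iterable of taxon names (the set S)
    clusters: iterable of frozensets, the constraint set X given as clusters
    weight:   function (c1, c2, rest) -> number, the tripartition score
    """
    S = frozenset(taxa)
    X = set(clusters) | {frozenset([t]) for t in S} | {S}
    score, split = {}, {}
    for c in sorted(X, key=len):             # smaller clusters first
        if len(c) == 1:
            score[c] = 0.0                   # a leaf has no internal node
            continue
        for c1 in X:
            c2 = c - c1
            # c1 must be a proper subset, and both halves must be solvable
            if c1 < c and c1 in score and c2 in score:
                candidate = score[c1] + score[c2] + weight(c1, c2, S - c)
                if c not in score or candidate > score[c]:
                    score[c], split[c] = candidate, (c1, c2)
    if S not in score:
        raise ValueError("X does not contain a complete binary tree")
    # Follow the stored splits to list the clusters of one optimal tree
    chosen, stack = set(), [S]
    while stack:
        c = stack.pop()
        if c in split:
            chosen.add(c)
            stack.extend(split[c])
    return score[S], chosen
```

Because the weight function is a parameter, the same loop covers any criterion that decomposes into a sum of tripartition scores. With a toy weight that favors grouping c with d, `constrained_dp("abcd", [frozenset("ab"), frozenset("cd"), frozenset("abc")], w)` picks the tree ((a,b),(c,d)) over (((a,b),c),d).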

16 / 45 RF Supertrees
Combine sparse scaffold + dense trees on subsets
The Robinson-Foulds (RF) distance between a binary supertree T and a binary source tree t on a taxon subset s is
$RF(T, t) = |\mathrm{bipartitions}(T|_s) \setminus \mathrm{bipartitions}(t)|$
where $T|_s$ is T restricted to the taxa in s
[Figure: example supertree and source tree]
The RF supertree is the tree that minimizes the total RF distance to a set of source trees
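The RF distance defined above can be computed directly for small trees. The sketch below assumes rooted nested-tuple trees and counts only nontrivial bipartitions; the function names are hypothetical, and it is meant to illustrate the definition rather than to be efficient.

```python
def bipartitions(tree, keep):
    """Nontrivial bipartitions of a nested-tuple tree restricted to taxa in `keep`.

    Each bipartition is stored as the side that does not contain a fixed
    reference taxon, so its two halves are never counted separately.
    """
    keep = frozenset(keep)
    ref = min(keep)
    found = set()

    def walk(node):
        if not isinstance(node, tuple):      # a leaf name
            return frozenset([node]) & keep
        leaves = frozenset().union(*(walk(child) for child in node))
        side = keep - leaves if ref in leaves else leaves
        if 2 <= len(side) <= len(keep) - 2:  # skip trivial and empty splits
            found.add(side)
        return leaves

    walk(tree)
    return found


def rf_distance(supertree, source_tree, source_taxa):
    """Bipartitions of the restricted supertree that are missing from the source tree."""
    return len(bipartitions(supertree, source_taxa) - bipartitions(source_tree, source_taxa))
```

For example, with supertree (((A,B),F),((C,D),E)) and source tree (((A,B),C),(D,E)) on s = {A,B,C,D,E}, the restricted supertree has the bipartitions AB|CDE and CD|ABE, the source tree has AB|CDE and DE|ABC, and the distance is 1.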

17 / 45 RF Supertrees
Earlier, we mapped quartets to tripartitions. If we map bipartitions to tripartitions, we can use the same approach as our quartet support algorithm!
We map a bipartition {A|B} on a taxon subset to a tripartition if
1. it has every member of A (and no members of B) below one child
2. it has at least one member of B below the other child
[Figure: tripartition with parts J, K, L and bipartition sides A, B]
k input trees over n taxa: $O(kn)$ input bipartitions
Checking if a tripartition matches a bipartition takes $O(n)$ time
$O(|X|^2)$ pairs of clusters
Running time: $O(|X|^2 n^2 k)$
A similar approach is possible for several other problems
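The two matching conditions translate into a few set operations. The sketch below is one possible reading of the rule, with assumptions stated plainly: a tripartition is a triple (child1, child2, rest) of `frozenset`s, a bipartition is a pair (A, B), either side of the bipartition may play the role of A, and the score of a tripartition is the number of source bipartitions it matches. The names `matches` and `tripartition_score` are hypothetical.

```python
def matches(bipartition, tripartition):
    """True if one side lies wholly below one child and the other side
    has at least one member below the other child."""
    A, B = bipartition
    child1, child2, _rest = tripartition
    for a_side, b_side in ((A, B), (B, A)):
        for below, other in ((child1, child2), (child2, child1)):
            if a_side <= below and not (b_side & below) and (b_side & other):
                return True
    return False


def tripartition_score(tripartition, source_bipartitions):
    """Number of source-tree bipartitions mapped to this tripartition."""
    return sum(1 for bp in source_bipartitions if matches(bp, tripartition))
```

Each call to `matches` does a constant number of subset and intersection tests on sets of at most n taxa, which is where the O(n) cost per check comes from.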

18 / 45 Experiment: SVDQuartets
We can use our weighted quartet support algorithm for the SVDQuartets problem
Questions:
1. Can the DP algorithm find better solutions to the SVDQuartets problem than the reference implementation in PAUP*?
2. Can the SVDQuartets criterion with the DP algorithm find topologically more accurate trees than ASTRAL?
3. What is the effect of expanding the constraint space?

19 / 45 Unfortunately, SVDQuartets is frequently less accurate than ASTRAL-2
[Figure: 100 taxa, low and medium levels of ILS]
PAUP* is a software package containing a reference implementation of SVDQuartets

20 / 45 Datasets: SVDQuartets
ASTRAL-2 simulated data (based on data from Mirarab and Warnow, 2015)
Datasets with 10, 50, 100 taxa
10, 100, 1000 genes
Sequence length random from 300-1500 bp
Also restricted to 25, 100 base pairs per gene
Three ILS model conditions for the 100-taxon dataset: 21%, 33%, 69% average distance (AD) between true gene trees and species tree
10- and 50-taxon datasets have 34% AD
10 replicates per model condition

21 / 45 10-taxon data: all methods are capable of finding good SVDQuartets criterion scores

Genes       10                       100                      1000
Seq. len.   25 bp   100 bp  Full     25 bp   100 bp  Full     25 bp   100 bp  Full
DP-constr.  0.300   0.294   0.124    0.224   0.127   0.049    0.087   0.049   0.020
PAUP*       0.302   0.295   0.124    0.224   0.127   0.051    0.087   0.049   0.020
DP-exact    0.300   0.294   0.124    0.224   0.127   0.049    0.087   0.049   0.020

Smaller criterion scores are better

22 / 45 50-taxon data: DP consistently outperforms PAUP* in SVDQuartets criterion score

Genes       10                       100                      1000
Seq. len.   25 bp   100 bp  Full     25 bp   100 bp  Full     25 bp   100 bp  Full
DP          518.5   426.0   210.3    318.8   174.6   86.0     117.1   65.3    33.0
PAUP*       517.3   427.0   210.5    320.6   176.5   88.3     118.5   67.7    35.1

Smaller criterion scores are better

23 / 45 100-taxon data, low/mid ILS: DP almost always finds better SVDQuartets criterion scores than PAUP*

Genes  ILS   Method  25 bp     100 bp   Full
10     Low   DP      12245.5   8230.0   3485.5
10     Low   PAUP*   12268.3   8253.4   3510.0
10     Mid   DP      2992.9    5502.5   2768.1
10     Mid   PAUP*   2937.0    5528.8   2776.4
100    Low   DP      6055.3    3279.0   1352.5
100    Low   PAUP*   6082.2    3357.0   1462.4
100    Mid   DP      4062.9    2269.9   1063.8
100    Mid   PAUP*   4065.1    2292.3   1091.3
1000   Low   DP      2182.8    1180.2   496.1
1000   Low   PAUP*   2200.7    1233.0   544.8
1000   Mid   DP      1553.5    873.3    414.9
1000   Mid   PAUP*   1566.4    882.9    439.5

Smaller criterion scores are better

24 / 45 100-taxon data, high ILS: PAUP* finds better scores with short sequences, few genes

Genes  ILS   Method  25 bp    100 bp   Full
10     High  DP      34.4     241.2    1317.9
10     High  PAUP*   23.5     216.7    1305.2
100    High  DP      1531.6   1436.2   864.4
100    High  PAUP*   1418.0   1413.0   867.7
1000   High  DP      1149.4   670.6    351.7
1000   High  PAUP*   1039.6   677.5    355.2

25 / 45 Observations
Under low and medium ILS conditions, the DP algorithm finds trees with scores as good as or better than PAUP*
Under high ILS conditions, especially with short sequences and few genes, PAUP* gets better scores than the DP algorithm
This is due to the constraint set X not containing good enough bipartitions

26 / 45 100-taxon data, low ILS: Improving criterion scores leads to reduced topological error
Lower topological error is better

27 / 45 100-taxon data, low ILS: Improving criterion scores with DP makes SVDQuartets competitive with ASTRAL
Lower topological error is better

28 / 45 100-taxon data, medium ILS: Improving criterion scores leads to reduced topological error
Lower topological error is better

29 / 45 100-taxon data, medium ILS: Improving criterion scores with DP makes SVDQuartets competitive with ASTRAL
Lower topological error is better

30 / 45 100-taxon data, high ILS: PAUP* is better than DP on short sequences
Lower topological error is better

31 / 45 100-taxon data, high ILS: PAUP* is better than DP and ASTRAL on short sequences, but DP can help on longer sequences
Lower topological error is better

32 / 45 SVDQuartets: Lessons learned
Under low and medium ILS conditions, DP finds better criterion scores and more accurate trees than PAUP*, and is competitive with ASTRAL with respect to tree accuracy.
Under high ILS conditions, especially when a limited amount of data is available, PAUP* finds more accurate trees with better criterion scores than the DP algorithm, and more accurate trees than ASTRAL.
  High ILS conditions make gene trees very different from the species tree (69% difference between true gene trees and species tree)
  Hard to find a good set X

33 / 45 Supertree Estimation - another important phylogenetic problem

34 / 45 Experiment: RF Supertrees
Questions:
1. Can the DP algorithm find better solutions to the RF supertrees problem than a reference implementation (MulRF)?
2. Can the RF criterion with the DP algorithm find more accurate trees than other leading supertree methods (MRL)?

35 / 45 Datasets: RF Supertrees
SMIDgen datasets (Swenson et al., 2010)
  100, 500, 1000 species (taxa)
  Scaffold tree with 20, 50, 75, or 100% of the species
  5, 15, or 25 trees with clade-based subsets of species
Biological datasets

Dataset            # source trees  # taxa  Largest tree
THPL               19              558     140 (25%)
CPL                39              2228    1648 (73%)
Seabirds           7               121     90 (74%)
Marsupials         158             267     267 (100%)
Placental Mammals  726             116     116 (100%)

36 / 45 The DP algorithm almost always finds a better RF criterion score than MulRF Lower criterion scores are better

37 / 45 The DP algorithm always finds a better RF criterion score than MulRF Lower criterion scores are better

38 / 45 MulRF is not able to run on the 1000-taxon data set Lower criterion scores are better

39 / 45 This translates to much better supertrees, competitive with MRL
Lower topological error is better

40 / 45 The DP algorithm can run with good accuracy on 1000-taxon datasets

41 / 45 Much faster than competing methods

Taxa  Scaffold (%)  MulRF    DP     MRL
100   20            1028.3   6.6    30.8
100   50            1144.2   6.7    35.4
100   75            1033.0   6.8    38.1
100   100           1151.7   7.2    40.2
500   20            20951    61     1837.3
500   50            25460    71     2158.5
500   75            28472    75     2049.0
500   100           32527    73     -
1000  20            -        222.9  -
1000  50            -        259.5  -
1000  75            -        289.9  -
1000  100           -        284.8  -

Table: Running times (seconds) for RF Supertree methods.

42 / 45 Good performance on most biological datasets

Dataset            MulRF  DP
CPL                -      2294
THPL               617    475
Marsupials         1533   1425
Placental Mammals  5387   5449
Seabirds           99     79

Table: RF Criterion scores

Dataset            MulRF  DP
THPL               486    3.3
CPL                -      32
Seabirds           5      0.005
Marsupials         672    1.2
Placental Mammals  691    8.8

Table: Running times (min)

43 / 45 Discussion of biological datasets
Biological datasets are somewhat different from the simulated datasets
  More source trees
  More poorly resolved nodes (nodes with high degree)
DP finds trees with better RF criterion scores than MulRF for all datasets except placental mammals
  The dataset has a 100% scaffold, which is a good condition for MulRF
  The dataset has several very poorly resolved nodes, which makes it challenging to find a good set of bipartitions
  This results in increased runtime and a worse criterion score
Next step: Add trees from the MulRF search to the DP constraint set X

44 / 45 Ongoing work
Choice of bipartition set
  Add bipartitions from the best tree to the DP bipartition set, especially for biological analyses
New criteria
  MulRF can optimize over multi-copy gene trees
  MRP is another supertree method that uses parsimony
Better understanding of conditions where the DP algorithm performs well or poorly
  Missing taxa for SVDQuartets
  Poorly resolved source trees
Finish running time analysis for SVDQuartets, taking into account the size of the constraint set X
In progress for ECCB - due March 29

45 / 45 The DP algorithm is fast, accurate and versatile
The DP algorithm optimizes any criterion that can be expressed as a sum of tripartition scores
  Finds exact solutions over a constrained search space
  Often faster, more accurate than heuristic searches
Need to be careful with the choice of constraint space
  DP algorithm performs poorly when very little data are available
  Adding bipartitions from other methods substantially improves this
Enables several exciting new research projects, including Bayesian quartet weighting (with Mike Nute), gene tree correction (with Ashu Gupta), divide-and-conquer species tree estimation with DACTAL (with Tandy Warnow), and quintet-based species tree estimation and rooting (with Ruth Davidson)