Analyzing Evolutionary Trees

Size: px
Start display at page:

Download "Analyzing Evolutionary Trees"

Transcription

1 Analyzing Evolutionary Trees Katherine St. John Lehman College and the Graduate Center City University of New York Katherine St. John City University of New York 1

2 Overview Talk Outline

3 Talk Outline Overview Constructing Trees

4 Talk Outline Overview Constructing Trees Comparing Reconstruction Methods Case Study: Quartet-Based Methods

5 Talk Outline Overview Constructing Trees Comparing Reconstruction Methods Case Study: Quartet-Based Methods Evaluating the Results Comparing, Clustering, and Visualizing Katherine St. John City University of New York 2

6 Talk Outline Overview Constructing Trees Comparing Reconstruction Methods Case Study: Quartet-Based Methods Evaluating the Results Comparing, Clustering, and Visualizing Katherine St. John City University of New York 3

7 Goal: Reconstruct the Evolutionary History (

8 Goal: Reconstruct the Evolutionary History ( The evolutionary process not only determines relationships among taxa, but allows prediction of structural, physiological, and biochemical properties. Katherine St. John City University of New York 4

9 Process for Reconstruction: Input Data Start with information about the taxa. For example: Morphological Characters

10 Process for Reconstruction: Input Data Start with information about the taxa. For example: Morphological Characters Biomolecular Sequences A GTTAGAAGGCGGCCAGCGAC... B CATTTGTCCTAACTTGACGG... C CAAGAGGCCACTGCAGAATC... D CCGACTTCCAACCTCATGCG... E ATGGGGCACGATGGATATCG... F TACAAATACGCGCAAGTTCG... Katherine St. John City University of New York 5

11 Process for Reconstruction

12 Process for Reconstruction Input Data A GTTAGAAGGC... B CATTTGTCCT... C CAAGAGGCCA... D CCGACTTCCA... E ATGGGGCACG... F TACAAATACG...

13 Process for Reconstruction Input Data A GTTAGAAGGC... B CATTTGTCCT... C CAAGAGGCCA... D CCGACTTCCA... E ATGGGGCACG... F TACAAATACG... Reconstruction Algorithms Maximum Parsimony Maximum Likelihood Distance Methods: NJ, Quartet-Based, Fast Convering,.

14 Process for Reconstruction Input Data Reconstruction Algorithms Output Tree A GTTAGAAGGC... B CATTTGTCCT... C CAAGAGGCCA... D CCGACTTCCA... E ATGGGGCACG... F TACAAATACG... Maximum Parsimony Maximum Likelihood Distance Methods: NJ, Quartet-Based, Fast Convering,. Katherine St. John City University of New York 6

15 Applications In addition to finding the evolutionary history of species, phylogeny is also used for:

16 Applications In addition to finding the evolutionary history of species, phylogeny is also used for: drug discovery: used to determine structural and biochemical properties of potential drugs

17 Applications In addition to finding the evolutionary history of species, phylogeny is also used for: drug discovery: used to determine structural and biochemical properties of potential drugs determining origins of HIV infection

18 Applications In addition to finding the evolutionary history of species, phylogeny is also used for: drug discovery: used to determine structural and biochemical properties of potential drugs determining origins of HIV infection origin of other virus and bacteria strains Katherine St. John City University of New York 7

19 Talk Outline Overview Constructing Trees Comparing Reconstruction Methods Case Study: Quartet-Based Methods Evaluating the Results Comparing, Clustering, and Visualizing Katherine St. John City University of New York 8

20 Process for Reconstruction Input Data A GTTAGAAGGC... B CATTTGTCCT... C CAAGAGGCCA... D CCGACTTCCA... E ATGGGGCACG... F TACAAATACG... Reconstruction Algorithms Maximum Parsimony Maximum Likelihood Distance Methods: NJ, Quartet-Based, Fast Convering,. Output Tree Katherine St. John City University of New York 9

21 Algorithms for Reconstruction Most optimization criteria are hard:

22 Algorithms for Reconstruction Most optimization criteria are hard: Maximum Parsimony: (NP-hard: Foulds & Graham 82) find the tree that can explain the observed sequences with a minimal number of substitutions.

23 Algorithms for Reconstruction Most optimization criteria are hard: Maximum Parsimony: (NP-hard: Foulds & Graham 82) find the tree that can explain the observed sequences with a minimal number of substitutions. Maximum Likelihood Estimation: (NP-hard: Roch 06) find the tree with the maximum likelihood: P(data tree).

24 Algorithms for Reconstruction Most optimization criteria are hard: Maximum Parsimony: (NP-hard: Foulds & Graham 82) find the tree that can explain the observed sequences with a minimal number of substitutions. Maximum Likelihood Estimation: (NP-hard: Roch 06) find the tree with the maximum likelihood: P(data tree). For both, the best known algorithms require an exponential number of trees to be checked.

25 Algorithms for Reconstruction Most optimization criteria are hard: Maximum Parsimony: (NP-hard: Foulds & Graham 82) find the tree that can explain the observed sequences with a minimal number of substitutions. Maximum Likelihood Estimation: (NP-hard: Roch 06) find the tree with the maximum likelihood: P(data tree). For both, the best known algorithms require an exponential number of trees to be checked. This is not feasible for more than 20 taxa. Katherine St. John City University of New York 10

26 Approximating Trees Exact answers are often wanted, but hard to find.

27 Approximating Trees Exact answers are often wanted, but hard to find. But approximate is often good enough:

28 Approximating Trees Exact answers are often wanted, but hard to find. But approximate is often good enough: drug design: predicting function via similarity

29 Approximating Trees Exact answers are often wanted, but hard to find. But approximate is often good enough: drug design: predicting function via similarity sequence alignment: guide trees for alignment

30 Approximating Trees Exact answers are often wanted, but hard to find. But approximate is often good enough: drug design: predicting function via similarity sequence alignment: guide trees for alignment use as priors or starting points for expensive searches Katherine St. John City University of New York 11

31 Approximation Algorithms Since calculating the exact answer is hard, algorithms that estimate the answer have been developed.

32 Approximation Algorithms Since calculating the exact answer is hard, algorithms that estimate the answer have been developed. Heuristics for maximum parsimony and maximum likelihood estimation (use clever ways to limit the number of trees checked, while still sampling much of tree-space )

33 Approximation Algorithms Since calculating the exact answer is hard, algorithms that estimate the answer have been developed. Heuristics for maximum parsimony and maximum likelihood estimation (use clever ways to limit the number of trees checked, while still sampling much of tree-space ) Polynomial-time methods, based on the distance between taxa Katherine St. John City University of New York 12

34 Distance-Based Methods These methods calculate the distance between taxa: B D A C F E B D A C F E and then determine the tree using the distance matrix.

35 Distance-Based Methods These methods calculate the distance between taxa: B D A C F E B D A C F E and then determine the tree using the distance matrix. One way to calculate distance is to take differences divided by the length (the normalized Hamming distance). Katherine St. John City University of New York 13

36 Distance-Based Methods Popular distance based methods include

37 Distance-Based Methods Popular distance based methods include Neighbor Joining (Saitou & Nei 87) which repeatedly joins the nearest neighbors to build a tree, and

38 Distance-Based Methods Popular distance based methods include Neighbor Joining (Saitou & Nei 87) which repeatedly joins the nearest neighbors to build a tree, and Quartet-based methods that decide the topology for every 4 taxa and then assemble them to form a tree (Berry et al. 1999, 2000, 2001).

39 Distance-Based Methods Popular distance based methods include Neighbor Joining (Saitou & Nei 87) which repeatedly joins the nearest neighbors to build a tree, and Quartet-based methods that decide the topology for every 4 taxa and then assemble them to form a tree (Berry et al. 1999, 2000, 2001). Many of these methods have good performance empirically, and some can be proven to have nice accuracy properties. Katherine St. John City University of New York 14

40 Testing Methods Empirically How accurate are the methods at reconstructing trees?

41 Testing Methods Empirically How accurate are the methods at reconstructing trees? In biological applications, the true, historical tree is almost never known, which makes assessing the quality of phylogenetic reconstruction methods problematic.

42 Testing Methods Empirically How accurate are the methods at reconstructing trees? In biological applications, the true, historical tree is almost never known, which makes assessing the quality of phylogenetic reconstruction methods problematic. (an exception: Hillis 92 created an evolutionary tree in the laboratory)

43 Testing Methods Empirically How accurate are the methods at reconstructing trees? In biological applications, the true, historical tree is almost never known, which makes assessing the quality of phylogenetic reconstruction methods problematic. (an exception: Hillis 92 created an evolutionary tree in the laboratory) Simulation is used instead to evaluate methods, given a model of evolution. Katherine St. John City University of New York 15

44 Simulation Studies 1. Construct a model tree.

45 Simulation Studies 1. Construct a model tree. 2. Evolve sequences down the tree. A GTTAGAAGGCGGCCA... B CATTTGTCCTAACTT... C CAAGAGGCCACTGCA... D CCGACTTCCAACCTC... E ATGGGGCACGATGGA... F TACAAATACGCGCAA...

46 Simulation Studies 1. Construct a model tree. 2. Evolve sequences down the tree. 3. Reconstruct the tree using method. A GTTAGAAGGCGGCCA... B CATTTGTCCTAACTT... C CAAGAGGCCACTGCA... D CCGACTTCCAACCTC... E ATGGGGCACGATGGA... F TACAAATACGCGCAA...

47 Simulation Studies 1. Construct a model tree. 2. Evolve sequences down the tree. 3. Reconstruct the tree using method. A GTTAGAAGGCGGCCA... B CATTTGTCCTAACTT... C CAAGAGGCCACTGCA... D CCGACTTCCAACCTC... E ATGGGGCACGATGGA... F TACAAATACGCGCAA Evaluate the accuracy of the constructed tree. Katherine St. John City University of New York 16

48 Simulation Studies 1. Construct a model tree. 2. Evolve sequences down the tree. 3. Reconstruct the tree using method. A GTTAGAAGGCGGCCA... B CATTTGTCCTAACTT... C CAAGAGGCCACTGCA... D CCGACTTCCAACCTC... E ATGGGGCACGATGGA... F TACAAATACGCGCAA Evaluate the accuracy of the constructed tree. Katherine St. John City University of New York 17

49 Simulating Data: Choosing Trees Usually chosen from a random distribution on trees: Uniform, or Yule-Harding (birth-death trees)

50 Simulating Data: Choosing Trees Usually chosen from a random distribution on trees: Uniform, or Yule-Harding (birth-death trees) Can view this as two different random processes:

51 Simulating Data: Choosing Trees Usually chosen from a random distribution on trees: Uniform, or Yule-Harding (birth-death trees) Can view this as two different random processes: generate the tree shape, and then

52 Simulating Data: Choosing Trees Usually chosen from a random distribution on trees: Uniform, or Yule-Harding (birth-death trees) Can view this as two different random processes: generate the tree shape, and then assign weights or branch lengths to the shape. Katherine St. John City University of New York 18

53 Simulation Studies 1. Construct a model tree. 2. Evolve sequences down the tree. 3. Reconstruct the tree using method. A GTTAGAAGGCGGCCA... B CATTTGTCCTAACTT... C CAAGAGGCCACTGCA... D CCGACTTCCAACCTC... E ATGGGGCACGATGGA... F TACAAATACGCGCAA Evaluate the accuracy of the constructed tree. Katherine St. John City University of New York 19

54 Simulating Data: Evolving Sequences The Jukes-Cantor (JC) model is the simplest Markov model of biomolecular sequence evolution.

55 Simulating Data: Evolving Sequences The Jukes-Cantor (JC) model is the simplest Markov model of biomolecular sequence evolution. A DNA sequence (a string over {A, C, T, G}) at the root evolves down a rooted binary tree T. AACGT Katherine St. John City University of New York 20

56 Simulating Data: Evolving Sequences The Jukes-Cantor (JC) model is the simplest Markov model of biomolecular sequence evolution. A DNA sequence (a string over {A, C, T, G}) at the root evolves down a rooted binary tree T. AACGT 0 1 AACGA AACGT Katherine St. John City University of New York 21

57 Simulating Data: Evolving Sequences The Jukes-Cantor (JC) model is the simplest Markov model of biomolecular sequence evolution. A DNA sequence (a string over {A, C, T, G}) at the root evolves down a rooted binary tree T. AACGT 0 1 AACGA AACGT ACCCT GACGT AACGA GGCGT Katherine St. John City University of New York 22

58 Simulating Data: Evolving Sequences The Jukes-Cantor (JC) model is the simplest Markov model of biomolecular sequence evolution. A DNA sequence (a string over {A, C, T, G}) at the root evolves down a rooted binary tree T. AACGT 0 1 AACGA AACGT ACCCT GACGT AACGA GGCGT GACGT AACGT GACGT GGCGA Katherine St. John City University of New York 23

59 Simulating Data: Evolving Sequences The Jukes-Cantor (JC) model is the simplest Markov model of biomolecular sequence evolution. A DNA sequence (a string over {A, C, T, G}) at the root evolves down a rooted binary tree T. AACGT 0 1 AACGA AACGT ACCCT GACGT AACGA GGCGT GACGT AACGT GACGT GGCGA Katherine St. John City University of New York 24

60 Simulating Data: Evolving Sequences The Jukes-Cantor (JC) model is the simplest Markov model of biomolecular sequence evolution. A DNA sequence (a string over {A, C, T, G}) at the root evolves down a rooted binary tree T. {ACCCT, GACGT, AACGT, GACGT, GGCGA} Katherine St. John City University of New York 25

61 Simulating Data: Evolving Sequences The Jukes-Cantor (JC) model is the simplest Markov model of biomolecular sequence evolution. A DNA sequence (a string over {A, C, T, G}) at the root evolves down a rooted binary tree T. The assumptions of the model are: 1. the sites (i.e., the positions within the sequences) evolve independently and identically 2. if a site changes state it changes with equal probability to each of the remaining states, and 3. the number of changes of each site on an edge e is a Poisson random variable with expectation λ(e) (this is also called the length of the edge e). Katherine St. John City University of New York 26

62 Simulation Studies 1. Construct a model tree. 2. Evolve sequences down the tree. 3. Reconstruct the tree using method. A GTTAGAAGGCGGCCA... B CATTTGTCCTAACTT... C CAAGAGGCCACTGCA... D CCGACTTCCAACCTC... E ATGGGGCACGATGGA... F TACAAATACGCGCAA Evaluate the accuracy of the constructed tree. Katherine St. John City University of New York 27

63 Simulation Studies 1. Construct a model tree. 2. Evolve sequences down the tree. 3. Reconstruct the tree using method. A GTTAGAAGGCGGCCA... B CATTTGTCCTAACTT... C CAAGAGGCCACTGCA... D CCGACTTCCAACCTC... E ATGGGGCACGATGGA... F TACAAATACGCGCAA Evaluate the accuracy of the constructed tree. Katherine St. John City University of New York 28

64 Evaluating Accuracy To compare reconstructed tree to model tree, the Robinson-Foulds Score is often used: False Positives + False Negatives total edges b a c d e f... b c d a f e

65 Evaluating Accuracy To compare reconstructed tree to model tree, the Robinson-Foulds Score is often used: False Positives + False Negatives total edges b a c d e f... b c d a f e If there are many possible answers, choose the one with the best parsimony score: the sum of the number of site changes acrosss the edges in the tree. Katherine St. John City University of New York 29

66 Talk Outline Overview Constructing Trees Comparing Reconstruction Methods Case Study: Quartet-Based Methods Evaluating the Results Comparing, Clustering, and Visualizing Katherine St. John City University of New York 30

67 Case Study: Quartet Methods A quartet is an unrooted binary tree on four taxa: a b c d {ab cd} a c b d {ac bd} a d b c {ad bc}

68 Case Study: Quartet Methods A quartet is an unrooted binary tree on four taxa: b a d c c a d b d a c b {ab cd} {ac bd} {ad bc} Let Q(T ) = all quartets that agree with T. [Erdős et al. 1997]: T can be reconstructed from Q(T ) in polynomial time. Katherine St. John City University of New York 31

69 Case Study: Quartet Methods Quartet-based methods operate in two phases:

70 Case Study: Quartet Methods Quartet-based methods operate in two phases: Construct quartets on all four taxa sets.

71 Case Study: Quartet Methods Quartet-based methods operate in two phases: Construct quartets on all four taxa sets. Combine these quartets into a tree.

72 Case Study: Quartet Methods Quartet-based methods operate in two phases: Construct quartets on all four taxa sets. Combine these quartets into a tree. Running time: For most optimizations, determining a quartet is fast.

73 Case Study: Quartet Methods Quartet-based methods operate in two phases: Construct quartets on all four taxa sets. Combine these quartets into a tree. Running time: For most optimizations, determining a quartet is fast. There are Θ(n 4 ) quartets, giving Ω(n 4 ) running time.

74 Case Study: Quartet Methods Quartet-based methods operate in two phases: Construct quartets on all four taxa sets. Combine these quartets into a tree. Running time: For most optimizations, determining a quartet is fast. There are Θ(n 4 ) quartets, giving Ω(n 4 ) running time. In practice, the input quality is insufficient to ensure that all quartets are accurately inferred.

75 Case Study: Quartet Methods Quartet-based methods operate in two phases: Construct quartets on all four taxa sets. Combine these quartets into a tree. Running time: For most optimizations, determining a quartet is fast. There are Θ(n 4 ) quartets, giving Ω(n 4 ) running time. In practice, the input quality is insufficient to ensure that all quartets are accurately inferred. Quartet methods have to handle incorrect quartets. Katherine St. John City University of New York 32

76 Popular Quartet Methods Q or Buneman Method [Berry & Gascuel 97, Buneman 71]: Only add edges that agree with all input quartets. Doesn t tolerate errors outputs conservative, but unresolved tree.

77 Popular Quartet Methods Q or Buneman Method [Berry & Gascuel 97, Buneman 71]: Only add edges that agree with all input quartets. Doesn t tolerate errors outputs conservative, but unresolved tree. Quartet Cleaning (QC) [Berry et al. 1999]: Add edges with a small number of errors proportional to q e. Many variants: all handle a small number of errors.

78 Popular Quartet Methods Q or Buneman Method [Berry & Gascuel 97, Buneman 71]: Only add edges that agree with all input quartets. Doesn t tolerate errors outputs conservative, but unresolved tree. Quartet Cleaning (QC) [Berry et al. 1999]: Add edges with a small number of errors proportional to q e. Many variants: all handle a small number of errors. Quartet Puzzling [Strimmer & von Haeseler 1996]: Order taxa randomly, greedily add edges, repeat 1000 times. Output majority tree. Most popular with biologists. Katherine St. John City University of New York 33

79 Standard Method: Neighbor Joining (NJ) [Saitou & Nei 1987]: very popular and fast: O(n 3 ). Based on the distance between nodes, join neighboring leaves, replace them by their parent, calculate distances to this node, and repeat. This process eventually returns a binary (fully resolved) tree. Joining the leaves with the minimal distance does not suffice, so subtract the averaged distances to compensate for long edges. Experimental work shows that NJ trees are reasonably accurate, given a rate of evolution is neither too low nor too high. Katherine St. John City University of New York 34

80 Our Study [St. John, Moret, Warnow, & Vawter: SODA01; J Algorithms 03] A detailed, large-scale experimental study of quartet methods and NJ under the Jukes-Cantor model of evolution:

81 Our Study [St. John, Moret, Warnow, & Vawter: SODA01; J Algorithms 03] A detailed, large-scale experimental study of quartet methods and NJ under the Jukes-Cantor model of evolution: Our results indicate that NJ always outperforms the quartet-based methods we examined, in terms of both accuracy and speed.

82 Our Study [St. John, Moret, Warnow, & Vawter: SODA01; J Algorithms 03] A detailed, large-scale experimental study of quartet methods and NJ under the Jukes-Cantor model of evolution: Our results indicate that NJ always outperforms the quartet-based methods we examined, in terms of both accuracy and speed. Give new theory about convergence rates of quartet-based methods which helps explain our observations. Katherine St. John City University of New York 35

83 Experimental Design: Parameter Space In all, our study used 16,000 datasets and required many months of computation on the two clusters. Taxa: 5, 10, 20, expected evolutionary rates: from to per tree edge. For each, we generated 100 tree shapes, grouped into 10 runs of 10 trials. Sequence lengths: 500, 2,000, 8,000, and 32,000. Katherine St. John City University of New York 36

84 Measuring Accuracy: Quartets and Edges Topological accuracy is a more demanding criterion than quartet accuracy. % tree edges * % quartets (a) seq. length 500 % tree edges ** % quartets (b) seq. length 2000 λ e * (Percent of true tree edges recovered by Quartet Puzzling for 40 taxa and two sequence lengths) Katherine St. John City University of New York 37

85 Sensitivity to Input Quality Methods that estimate quartets and then combine them into a single tree can be greatly affected by the quality of the input quartets. % edges % quartets Q NJ % edges % quartets QCNJ Katherine St. John City University of New York 38

86 Running Times NJ was clearly the fastest method tested.

87 Running Times NJ was clearly the fastest method tested. QCML and QP were by far the slowest of the methods tested, slow enough that running them on more than a fifty taxa appears infeasible at present.

88 Running Times NJ was clearly the fastest method tested. QCML and QP were by far the slowest of the methods tested, slow enough that running them on more than a fifty taxa appears infeasible at present. With default settings, QP takes more than 200 days of computation to analyze ten runs of ten trials each for a single set of parameters on 80 taxa with a sequence length of 500. (30 minutes for NJ). Katherine St. John City University of New York 39

89 Quartet Study Results Quartet quality impacts performance significantly. Maximizing the number of quartets correct doesn t yield better trees.

90 Quartet Study Results Quartet quality impacts performance significantly. Maximizing the number of quartets correct doesn t yield better trees. Clear linear order of accuracy for the methods (except under very low rates of evolution): NJ, QP, QC and Q.

91 Quartet Study Results Quartet quality impacts performance significantly. Maximizing the number of quartets correct doesn t yield better trees. Clear linear order of accuracy for the methods (except under very low rates of evolution): NJ, QP, QC and Q. Study led to the development of new methods that weight quartets and have better accuracy on shorter sequences. Katherine St. John City University of New York 40

92 Talk Outline Overview Constructing Trees Comparing Reconstruction Methods Case Study: Quartet-Based Methods Evaluating the Results Comparing, Clustering, and Visualizing Katherine St. John City University of New York 41

93 Analyzing & Visualizing Sets of Trees Amenta & Klingner, InfoVis 02 Hillis, Heath, & St. John, Sys. Biol. 05 Katherine St. John City University of New York 42

94 Evaluating the Results Often, a search will result in many (often thousands) of trees with the same score.

95 Evaluating the Results Often, a search will result in many (often thousands) of trees with the same score. Input Data A GTTAGAAGGC... B CATTTGTCCT... C CAAGAGGCCA... D CCGACTTCCA... E ATGGGGCACG... F TACAAATACG... Reconstruction Algorithms Maximum Parsimony Maximum Likelihood Distance Methods: NJ, Quartet-Based, Fast Convering,. Output Tree Katherine St. John City University of New York 43

96 Evaluating the Results Often, a search, will result in many (often thousands) of trees with the same score. Input Data A GTTAGAAGGC... B CATTTGTCCT... C CAAGAGGCCA... D CCGACTTCCA... E ATGGGGCACG... F TACAAATACG... Reconstruction Algorithms Maximum Parsimony Maximum Likelihood Output Trees Katherine St. John City University of New York 44

97 Summarizing Trees Input Trees Consensus Method Output Trees Strict Consensus Majority-rule Katherine St. John City University of New York 45

98 Visualizing Sets of Trees Efficiency is important for real-time visualization. Katherine St. John City University of New York 46

99 Strict Consensus Tree Input trees Strict Consensus s 0 s 1 s 2 s 3 s 4 s 0 s 1 s 2 s 3 s 4 s 0 s 1 s 2 s 4 s 3 s 0 s 1 s 2 s 3 s 4 s 1 s 2 s 0 s 3 s 4 s 2 s 3 s 0 s 1 s 4 s 2 s 4 s 0 s 1 s 3 s 1 s 2 s 3 s 0 s 4 s 1 s 2 s 3 s 0 s 4 s 2 s 3 s 4 s 0 s 1 Katherine St. John City University of New York 47

100 Majority-rule Tree Input trees Majority-rule Tree s 0 s 1 s 2 s 3 s 4 s 0 s 1 s 2 s 3 s 4 s 0 s 1 s 2 s 4 s 3 s 0 s 1 s 2 s 3 s 4 Includes splits found in a majority of trees Can be 2/3 majority, etc. Katherine St. John City University of New York 48

101 Running Time Majority-rule Consensus: O((n/w)(nt + lg x)): Margush & McMorris, (x = total number of splits, nt) Implemented in PHYLIP, Felsenstein et al. O(n 2 + nt 2 ): Wareham, O(nt): Amenta, Clarke, & S, WABI 03. Strict Consensus: O(nt): Day, 1985 (assuming w = O(lg n)) Katherine St. John City University of New York 49

102 Our Algorithm Randomized algorithm for majority-rule consensus tree with expected running time O(nt) (n = number of leaves, t = number of trees). Optimal, since reading t trees on n taxa requires Ω(nt) time. Implementation in tree-set visualization software module for Mesquite (W&D Maddison). Katherine St. John City University of New York 50

103 Basic Idea Traverse all input trees in post-order, counting occurrences of splits in a hash table. Build majority-rule tree on splits which appear more than t 2 times. Key issue: Representation of splits Katherine St. John City University of New York 51

104 Distances Between Sets of Trees Katherine St. John City University of New York 52

105 Distances Between Trees Robinson-Foulds distance: # of edges that occur in only one tree. Calculate in O(n) time using Day s Algorithm (1985). Extends naturally to weighted trees. Katherine St. John City University of New York 53

106 Other Natural Metrics Tree-bisection-reconnect (TBR): A C D B E G F A C D B E G F A B A B C D E G F C D G E F TBR is NP-hard. (Allen & Steel 01) Many attempts, but no approximations with provable bounds. Katherine St. John City University of New York 54

107 Other Natural Metrics Subtree-prune-regraft (SPR): A B A B C C G D F D E E G F C D A E B G F NP-hard for rooted trees (Bordewich & Semple 05). 5-approximation for rooted trees (Bonet, Amenta, Mahindru, & S. 06). Katherine St. John City University of New York 55

108 Our Contribution Nicotiana Campanula Scaevola Stokesia Dimorphotheca Senecio Gerbera Gazania Echinops Felicia Tagetes Chromolaena Blennosperma Coreopsis Vernonia Cacosmia Cichorium Achillea Carthamnus Flaveria Piptocarpa Helianthus Tragopogon Chrysanthemum Eupatorium Lactuca Barnadesia Dasyphyllum Nicotiana Campanula Scaevola Stokesia Dimorphotheca Senecio Gazania Gerbera Echinops Felicia Tagetes Chromolaena Blennosperma Coreopsis Vernonia Cacosmia Cichorium Achillea Carthamnus Flaveria Piptocarpa Helianthus Tragopogon Chrysanthemum Eupatorium Lactuca Barnadesia Dasyphyllum Tree-Bisection-Reconnection (TBR): counter-example to past approximation algorithms

109 Our Contribution Nicotiana Campanula Scaevola Stokesia Dimorphotheca Senecio Gerbera Gazania Echinops Felicia Tagetes Chromolaena Blennosperma Coreopsis Vernonia Cacosmia Cichorium Achillea Carthamnus Flaveria Piptocarpa Helianthus Tragopogon Chrysanthemum Eupatorium Lactuca Barnadesia Dasyphyllum Nicotiana Campanula Scaevola Stokesia Dimorphotheca Senecio Gazania Gerbera Echinops Felicia Tagetes Chromolaena Blennosperma Coreopsis Vernonia Cacosmia Cichorium Achillea Carthamnus Flaveria Piptocarpa Helianthus Tragopogon Chrysanthemum Eupatorium Lactuca Barnadesia Dasyphyllum Tree-Bisection-Reconnection (TBR): counter-example to past approximation algorithms Subtree-Prune-Regraft (SPR): First approximation algorithm for rooted SPR Katherine St. John City University of New York 56

110 Approximation Algorithm Outline For each sibling pair (a,b) in the first tree, find in the second and do: Case 1: a b 1 subtree between a & b in T 2 : Cut the subtrees in T 2. N := N + 1 Case 2: a b 2 subtrees between a & b in T 2 : Cut a & b in both. N := N + 2. Case 3: a & b are in separate components for T 2 : Cut a & b in both. N := N + 2. a b Case 4: b is a singleton in T 2 : Cut b in T 1. Note that N doesn t change. a b Case 5: a b The sibling pair occurs in both: Replace by a new node in both. Note that N doesn t change. (Hein et al. version of the algorithm; Rodrigues et al. varies slightly in Cases 1 & 2.) Katherine St. John City University of New York 57

111 Accounting: how good is your estimate? (All based on having a maximum agreement forest for the distance metric.)

112 Accounting: how good is your estimate? (All based on having a maximum agreement forest for the distance metric.) Look at each edge that should be cut (called a link) and see how many extra edges were cut for it. Charge that link the number of edges cut for it.

113 Accounting: how good is your estimate? (All based on having a maximum agreement forest for the distance metric.) Look at each edge that should be cut (called a link) and see how many extra edges were cut for it. Charge that link the number of edges cut for it. Tricky to count! Past attempts for TBR make a subtle mistake that charges non-links. Katherine St. John City University of New York 58

114 Counterexample to Rodrigues et al. 3-Approximation T 1 T a b c d e f g h i j Enlargement of the subtrees between 9 &10: a b c d e f g h i j k l m n o Best Case: Cut the red subtrees; gives TBR distance of 27. Algorithm: Cuts each sibling pair, then subtrees; gives distance of Katherine St. John City University of New York 59

115 5-Approximation With the work of Bordewich & Semple 05, it s now possible to do analysis for rooted SPR.

116 5-Approximation With the work of Bordewich & Semple 05, it s now possible to do analysis for rooted SPR. Slightly modifying the algorithms for TBR, we have a 5-approximation for rooted SPR (very detailed proof).

117 5-Approximation With the work of Bordewich & Semple 05, it s now possible to do analysis for rooted SPR. Slightly modifying the algorithms for TBR, we have a 5-approximation for rooted SPR (very detailed proof). Have an example to show that it can t be better than a 3-approximation, but likely much better in practice.

118 5-Approximation With the work of Bordewich & Semple 05, it s now possible to do analysis for rooted SPR. Slightly modifying the algorithms for TBR, we have a 5-approximation for rooted SPR (very detailed proof). Have an example to show that it can t be better than a 3-approximation, but likely much better in practice. Algorithm runs in linear time. Katherine St. John City University of New York 60

119 Small vs. Large Sets of Trees TreeJuxtaposer compares visually small sets of trees, and Treecomp focuses on comparing, analyzing, and visualizing large sets of phylogenetic trees. Katherine St. John City University of New York 61

120 Comparing Two Trees Will Fisher, UT Katherine St. John City University of New York TreeJuxtaposer, Munzner et al. 62

121 TreeJuxtaposer (Munzner, Guimbretiere, Tasiran, Zhang, & Zhou, SIGGRAPH 2003) Katherine St. John City University of New York 63

122 TreeJuxtaposer Can compare trees of several hundred thousand nodes. Guaranteed visibility: highlighted areas remain visually apparent at all times. Focus+Context navigation technique ( rubber sheet ). Katherine St. John City University of New York 64

123 SequenceJuxtaposer (Slack, Munzner, Hildebrand, & S, 2004) Katherine St. John City University of New York 65

124 SequenceJuxtaposer Accordion drawing guarantees context, visibility, and frame rate. Interactively view a million basepairs. Regular expression searches. Annotation files displayed and navigatable. Katherine St. John City University of New York 66

125 Tree Set Visualization Katherine St. John City University of New York 67

126 Visualizing chains of trees: Color indicates maximum likelihood score Animate feature Applications (Rana dataset from Hillis Lab 500 trees from MrBayes search) Katherine St. John City University of New York 68

127 Summary Constructing Trees: Efficient approximation algorithms for large datasets Comparing Reconstruction Methods: Case Study: Quartet-Based Methods Evaluating the Results: Comparing, Clustering, and Visualizing Katherine St. John City University of New York 69

128 Future Work Efficient heuristics for biological tree differences. Analysis of tree distributions. Dimensionality reduction techniques applied to sequences. Linking TreeJuxtaposer to SequenceJuxtaposer Visualizing & Analyzing Morphological Data Katherine St. John City University of New York 70

129 Acknowledgments National Science Foundation David Hillis Laboratory, U. Texas Wildebeest & Research Clusters at CUNY Mesquite by David and Wayne Maddison treecomp module, many of our students: Left to Right: Back Row: Isaac Boamah (CUNY/UMd), Jeff Klingner (UT/Stanford), Katherine St. John (CUNY), Tamara Munzner (UBC), Silvio Neris (CUNY) Front Row: Denise Edwards (CUNY), Ruchi Kalra (CUNY), Nina Amenta (UT/UCD), Ingrid Montealegre (CUNY), Fred Clarke (CUNY/UCD) Katherine St. John City University of New York 71

DIMACS Tutorial on Phylogenetic Trees and Rapidly Evolving Pathogens. Katherine St. John City University of New York 1

DIMACS Tutorial on Phylogenetic Trees and Rapidly Evolving Pathogens. Katherine St. John City University of New York 1 DIMACS Tutorial on Phylogenetic Trees and Rapidly Evolving Pathogens Katherine St. John City University of New York 1 Thanks to the DIMACS Staff Linda Casals Walter Morris Nicole Clark Katherine St. John

More information

Approximating Subtree Distances Between Phylogenies. MARIA LUISA BONET, 1 KATHERINE ST. JOHN, 2,3 RUCHI MAHINDRU, 2,4 and NINA AMENTA 5 ABSTRACT

Approximating Subtree Distances Between Phylogenies. MARIA LUISA BONET, 1 KATHERINE ST. JOHN, 2,3 RUCHI MAHINDRU, 2,4 and NINA AMENTA 5 ABSTRACT JOURNAL OF COMPUTATIONAL BIOLOGY Volume 13, Number 8, 2006 Mary Ann Liebert, Inc. Pp. 1419 1434 Approximating Subtree Distances Between Phylogenies AU1 AU2 MARIA LUISA BONET, 1 KATHERINE ST. JOHN, 2,3

More information

CS 581. Tandy Warnow

CS 581. Tandy Warnow CS 581 Tandy Warnow This week Maximum parsimony: solving it on small datasets Maximum Likelihood optimization problem Felsenstein s pruning algorithm Bayesian MCMC methods Research opportunities Maximum

More information

Sequence length requirements. Tandy Warnow Department of Computer Science The University of Texas at Austin

Sequence length requirements. Tandy Warnow Department of Computer Science The University of Texas at Austin Sequence length requirements Tandy Warnow Department of Computer Science The University of Texas at Austin Part 1: Absolute Fast Convergence DNA Sequence Evolution AAGGCCT AAGACTT TGGACTT -3 mil yrs -2

More information

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst Phylogenetics Introduction to Bioinformatics Dortmund, 16.-20.07.2007 Lectures: Sven Rahmann Exercises: Udo Feldkamp, Michael Wurst 1 Phylogenetics phylum = tree phylogenetics: reconstruction of evolutionary

More information

The Performance of Phylogenetic Methods on Trees of Bounded Diameter

The Performance of Phylogenetic Methods on Trees of Bounded Diameter The Performance of Phylogenetic Methods on Trees of Bounded Diameter Luay Nakhleh 1, Usman Roshan 1, Katherine St. John 1 2, Jerry Sun 1, and Tandy Warnow 1 3 1 Department of Computer Sciences, University

More information

Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony

Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony Basic Bioinformatics Workshop, ILRI Addis Ababa, 12 December 2017 Learning Objectives understand

More information

An Experimental Analysis of Robinson-Foulds Distance Matrix Algorithms

An Experimental Analysis of Robinson-Foulds Distance Matrix Algorithms An Experimental Analysis of Robinson-Foulds Distance Matrix Algorithms Seung-Jin Sul and Tiffani L. Williams Department of Computer Science Texas A&M University College Station, TX 77843-3 {sulsj,tlw}@cs.tamu.edu

More information

Phylogenetic Trees and Their Analysis

Phylogenetic Trees and Their Analysis City University of New York (CUNY) CUNY Academic Works Dissertations, Theses, and Capstone Projects Graduate Center 2-2014 Phylogenetic Trees and Their Analysis Eric Ford Graduate Center, City University

More information

Parsimony-Based Approaches to Inferring Phylogenetic Trees

Parsimony-Based Approaches to Inferring Phylogenetic Trees Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 www.biostat.wisc.edu/bmi576.html Mark Craven craven@biostat.wisc.edu Fall 0 Phylogenetic tree approaches! three general types! distance:

More information

Introduction to Triangulated Graphs. Tandy Warnow

Introduction to Triangulated Graphs. Tandy Warnow Introduction to Triangulated Graphs Tandy Warnow Topics for today Triangulated graphs: theorems and algorithms (Chapters 11.3 and 11.9) Examples of triangulated graphs in phylogeny estimation (Chapters

More information

A RANDOMIZED ALGORITHM FOR COMPARING SETS OF PHYLOGENETIC TREES

A RANDOMIZED ALGORITHM FOR COMPARING SETS OF PHYLOGENETIC TREES A RANDOMIZED ALGORITHM FOR COMPARING SETS OF PHYLOGENETIC TREES SEUNG-JIN SUL AND TIFFANI L. WILLIAMS Department of Computer Science Texas A&M University College Station, TX 77843-3112 USA E-mail: {sulsj,tlw}@cs.tamu.edu

More information

Rotation Distance is Fixed-Parameter Tractable

Rotation Distance is Fixed-Parameter Tractable Rotation Distance is Fixed-Parameter Tractable Sean Cleary Katherine St. John September 25, 2018 arxiv:0903.0197v1 [cs.ds] 2 Mar 2009 Abstract Rotation distance between trees measures the number of simple

More information

A Randomized Algorithm for Comparing Sets of Phylogenetic Trees

A Randomized Algorithm for Comparing Sets of Phylogenetic Trees A Randomized Algorithm for Comparing Sets of Phylogenetic Trees Seung-Jin Sul and Tiffani L. Williams Department of Computer Science Texas A&M University E-mail: {sulsj,tlw}@cs.tamu.edu Technical Report

More information

Scaling species tree estimation methods to large datasets using NJMerge

Scaling species tree estimation methods to large datasets using NJMerge Scaling species tree estimation methods to large datasets using NJMerge Erin Molloy and Tandy Warnow {emolloy2, warnow}@illinois.edu University of Illinois at Urbana Champaign 2018 Phylogenomics Software

More information

Dynamic Programming for Phylogenetic Estimation

Dynamic Programming for Phylogenetic Estimation 1 / 45 Dynamic Programming for Phylogenetic Estimation CS598AGB Pranjal Vachaspati University of Illinois at Urbana-Champaign 2 / 45 Coalescent-based Species Tree Estimation Find evolutionary tree for

More information

ML phylogenetic inference and GARLI. Derrick Zwickl. University of Arizona (and University of Kansas) Workshop on Molecular Evolution 2015

ML phylogenetic inference and GARLI. Derrick Zwickl. University of Arizona (and University of Kansas) Workshop on Molecular Evolution 2015 ML phylogenetic inference and GARLI Derrick Zwickl University of Arizona (and University of Kansas) Workshop on Molecular Evolution 2015 Outline Heuristics and tree searches ML phylogeny inference and

More information

of the Balanced Minimum Evolution Polytope Ruriko Yoshida

of the Balanced Minimum Evolution Polytope Ruriko Yoshida Optimality of the Neighbor Joining Algorithm and Faces of the Balanced Minimum Evolution Polytope Ruriko Yoshida Figure 19.1 Genomes 3 ( Garland Science 2007) Origins of Species Tree (or web) of life eukarya

More information

Evolutionary tree reconstruction (Chapter 10)

Evolutionary tree reconstruction (Chapter 10) Evolutionary tree reconstruction (Chapter 10) Early Evolutionary Studies Anatomical features were the dominant criteria used to derive evolutionary relationships between species since Darwin till early

More information

Parsimony Least squares Minimum evolution Balanced minimum evolution Maximum likelihood (later in the course)

Parsimony Least squares Minimum evolution Balanced minimum evolution Maximum likelihood (later in the course) Tree Searching We ve discussed how we rank trees Parsimony Least squares Minimum evolution alanced minimum evolution Maximum likelihood (later in the course) So we have ways of deciding what a good tree

More information

Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters

Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters Erin Molloy University of Illinois at Urbana Champaign General Allocation (PI: Tandy Warnow) Exploratory Allocation

More information

Olivier Gascuel Arbres formels et Arbre de la Vie Conférence ENS Cachan, septembre Arbres formels et Arbre de la Vie.

Olivier Gascuel Arbres formels et Arbre de la Vie Conférence ENS Cachan, septembre Arbres formels et Arbre de la Vie. Arbres formels et Arbre de la Vie Olivier Gascuel Centre National de la Recherche Scientifique LIRMM, Montpellier, France www.lirmm.fr/gascuel 10 permanent researchers 2 technical staff 3 postdocs, 10

More information

Codon models. In reality we use codon model Amino acid substitution rates meet nucleotide models Codon(nucleotide triplet)

Codon models. In reality we use codon model Amino acid substitution rates meet nucleotide models Codon(nucleotide triplet) Phylogeny Codon models Last lecture: poor man s way of calculating dn/ds (Ka/Ks) Tabulate synonymous/non- synonymous substitutions Normalize by the possibilities Transform to genetic distance K JC or K

More information

On the Optimality of the Neighbor Joining Algorithm

On the Optimality of the Neighbor Joining Algorithm On the Optimality of the Neighbor Joining Algorithm Ruriko Yoshida Dept. of Statistics University of Kentucky Joint work with K. Eickmeyer, P. Huggins, and L. Pachter www.ms.uky.edu/ ruriko Louisville

More information

SPR-BASED TREE RECONCILIATION: NON-BINARY TREES AND MULTIPLE SOLUTIONS

SPR-BASED TREE RECONCILIATION: NON-BINARY TREES AND MULTIPLE SOLUTIONS 1 SPR-BASED TREE RECONCILIATION: NON-BINARY TREES AND MULTIPLE SOLUTIONS C. THAN and L. NAKHLEH Department of Computer Science Rice University 6100 Main Street, MS 132 Houston, TX 77005, USA Email: {cvthan,nakhleh}@cs.rice.edu

More information

Introduction to Computational Phylogenetics

Introduction to Computational Phylogenetics Introduction to Computational Phylogenetics Tandy Warnow The University of Texas at Austin No Institute Given This textbook is a draft, and should not be distributed. Much of what is in this textbook appeared

More information

What is a phylogenetic tree? Algorithms for Computational Biology. Phylogenetics Summary. Di erent types of phylogenetic trees

What is a phylogenetic tree? Algorithms for Computational Biology. Phylogenetics Summary. Di erent types of phylogenetic trees What is a phylogenetic tree? Algorithms for Computational Biology Zsuzsanna Lipták speciation events Masters in Molecular and Medical Biotechnology a.a. 25/6, fall term Phylogenetics Summary wolf cat lion

More information

ABOUT THE LARGEST SUBTREE COMMON TO SEVERAL PHYLOGENETIC TREES Alain Guénoche 1, Henri Garreta 2 and Laurent Tichit 3

ABOUT THE LARGEST SUBTREE COMMON TO SEVERAL PHYLOGENETIC TREES Alain Guénoche 1, Henri Garreta 2 and Laurent Tichit 3 The XIII International Conference Applied Stochastic Models and Data Analysis (ASMDA-2009) June 30-July 3, 2009, Vilnius, LITHUANIA ISBN 978-9955-28-463-5 L. Sakalauskas, C. Skiadas and E. K. Zavadskas

More information

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering Sequence clustering Introduction Data clustering is one of the key tools used in various incarnations of data-mining - trying to make sense of large datasets. It is, thus, natural to ask whether clustering

More information

Distance Methods. "PRINCIPLES OF PHYLOGENETICS" Spring 2006

Distance Methods. PRINCIPLES OF PHYLOGENETICS Spring 2006 Integrative Biology 200A University of California, Berkeley "PRINCIPLES OF PHYLOGENETICS" Spring 2006 Distance Methods Due at the end of class: - Distance matrices and trees for two different distance

More information

Lecture: Bioinformatics

Lecture: Bioinformatics Lecture: Bioinformatics ENS Sacley, 2018 Some slides graciously provided by Daniel Huson & Celine Scornavacca Phylogenetic Trees - Motivation 2 / 31 2 / 31 Phylogenetic Trees - Motivation Motivation -

More information

Computing the All-Pairs Quartet Distance on a set of Evolutionary Trees

Computing the All-Pairs Quartet Distance on a set of Evolutionary Trees Journal of Bioinformatics and Computational Biology c Imperial College Press Computing the All-Pairs Quartet Distance on a set of Evolutionary Trees M. Stissing, T. Mailund, C. N. S. Pedersen and G. S.

More information

Distances Between Phylogenetic Trees: A Survey

Distances Between Phylogenetic Trees: A Survey TSINGHUA SCIENCE AND TECHNOLOGY ISSNll1007-0214ll07/11llpp490-499 Volume 18, Number 5, October 2013 Distances Between Phylogenetic Trees: A Survey Feng Shi, Qilong Feng, Jianer Chen, Lusheng Wang, and

More information

CSE 549: Computational Biology

CSE 549: Computational Biology CSE 549: Computational Biology Phylogenomics 1 slides marked with * by Carl Kingsford Tree of Life 2 * H5N1 Influenza Strains Salzberg, Kingsford, et al., 2007 3 * H5N1 Influenza Strains The 2007 outbreak

More information

The worst case complexity of Maximum Parsimony

The worst case complexity of Maximum Parsimony he worst case complexity of Maximum Parsimony mir armel Noa Musa-Lempel Dekel sur Michal Ziv-Ukelson Ben-urion University June 2, 20 / 2 What s a phylogeny Phylogenies: raph-like structures whose topology

More information

Fast Hashing Algorithms to Summarize Large. Collections of Evolutionary Trees

Fast Hashing Algorithms to Summarize Large. Collections of Evolutionary Trees Texas A&M CS Technical Report 2008-6- June 27, 2008 Fast Hashing Algorithms to Summarize Large Collections of Evolutionary Trees by Seung-Jin Sul and Tiffani L. Williams Department of Computer Science

More information

Outline. Guaranteed Visibility. Accordion Drawing. Guaranteed Visibility Challenges. Guaranteed Visibility Challenges

Outline. Guaranteed Visibility. Accordion Drawing. Guaranteed Visibility Challenges. Guaranteed Visibility Challenges Outline Scalable Visual Comparison of Biological Trees and Sequences Tamara Munzner University of British Columbia Department of Computer Science ccordion Drawing information visualization technique TreeJuxtaposer

More information

1 High-Performance Phylogeny Reconstruction Under Maximum Parsimony. David A. Bader, Bernard M.E. Moret, Tiffani L. Williams and Mi Yan

1 High-Performance Phylogeny Reconstruction Under Maximum Parsimony. David A. Bader, Bernard M.E. Moret, Tiffani L. Williams and Mi Yan Contents 1 High-Performance Phylogeny Reconstruction Under Maximum Parsimony 1 David A. Bader, Bernard M.E. Moret, Tiffani L. Williams and Mi Yan 1.1 Introduction 1 1.2 Maximum Parsimony 7 1.3 Exact MP:

More information

Answer Set Programming or Hypercleaning: Where does the Magic Lie in Solving Maximum Quartet Consistency?

Answer Set Programming or Hypercleaning: Where does the Magic Lie in Solving Maximum Quartet Consistency? Answer Set Programming or Hypercleaning: Where does the Magic Lie in Solving Maximum Quartet Consistency? Fathiyeh Faghih and Daniel G. Brown David R. Cheriton School of Computer Science, University of

More information

Distance-based Phylogenetic Methods Near a Polytomy

Distance-based Phylogenetic Methods Near a Polytomy Distance-based Phylogenetic Methods Near a Polytomy Ruth Davidson and Seth Sullivant NCSU UIUC May 21, 2014 2 Phylogenetic trees model the common evolutionary history of a group of species Leaves = extant

More information

Recent Research Results. Evolutionary Trees Distance Methods

Recent Research Results. Evolutionary Trees Distance Methods Recent Research Results Evolutionary Trees Distance Methods Indo-European Languages After Tandy Warnow What is the purpose? Understand evolutionary history (relationship between species). Uderstand how

More information

A Lookahead Branch-and-Bound Algorithm for the Maximum Quartet Consistency Problem

A Lookahead Branch-and-Bound Algorithm for the Maximum Quartet Consistency Problem A Lookahead Branch-and-Bound Algorithm for the Maximum Quartet Consistency Problem Gang Wu Jia-Huai You Guohui Lin January 17, 2005 Abstract A lookahead branch-and-bound algorithm is proposed for solving

More information

TreeCmp 2.0: comparison of trees in polynomial time manual

TreeCmp 2.0: comparison of trees in polynomial time manual TreeCmp 2.0: comparison of trees in polynomial time manual 1. Introduction A phylogenetic tree represents historical evolutionary relationship between different species or organisms. There are various

More information

EVOLUTIONARY DISTANCES INFERRING PHYLOGENIES

EVOLUTIONARY DISTANCES INFERRING PHYLOGENIES EVOLUTIONARY DISTANCES INFERRING PHYLOGENIES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 28 th November 2007 OUTLINE 1 INFERRING

More information

Distance based tree reconstruction. Hierarchical clustering (UPGMA) Neighbor-Joining (NJ)

Distance based tree reconstruction. Hierarchical clustering (UPGMA) Neighbor-Joining (NJ) Distance based tree reconstruction Hierarchical clustering (UPGMA) Neighbor-Joining (NJ) All organisms have evolved from a common ancestor. Infer the evolutionary tree (tree topology and edge lengths)

More information

Introduction to Trees

Introduction to Trees Introduction to Trees Tandy Warnow December 28, 2016 Introduction to Trees Tandy Warnow Clades of a rooted tree Every node v in a leaf-labelled rooted tree defines a subset of the leafset that is below

More information

Computing the Quartet Distance Between Trees of Arbitrary Degrees

Computing the Quartet Distance Between Trees of Arbitrary Degrees January 22, 2006 University of Aarhus Department of Computer Science Computing the Quartet Distance Between Trees of Arbitrary Degrees Chris Christiansen & Martin Randers Thesis supervisor: Christian Nørgaard

More information

Seeing the wood for the trees: Analysing multiple alternative phylogenies

Seeing the wood for the trees: Analysing multiple alternative phylogenies Seeing the wood for the trees: Analysing multiple alternative phylogenies Tom M. W. Nye, Newcastle University tom.nye@ncl.ac.uk Isaac Newton Institute, 17 December 2007 Multiple alternative phylogenies

More information

Reconstructing Reticulate Evolution in Species Theory and Practice

Reconstructing Reticulate Evolution in Species Theory and Practice Reconstructing Reticulate Evolution in Species Theory and Practice Luay Nakhleh Department of Computer Science Rice University Houston, Texas 77005 nakhleh@cs.rice.edu Tandy Warnow Department of Computer

More information

Comparison of commonly used methods for combining multiple phylogenetic data sets

Comparison of commonly used methods for combining multiple phylogenetic data sets Comparison of commonly used methods for combining multiple phylogenetic data sets Anne Kupczok, Heiko A. Schmidt and Arndt von Haeseler Center for Integrative Bioinformatics Vienna Max F. Perutz Laboratories

More information

Chordal Graphs and Evolutionary Trees. Tandy Warnow

Chordal Graphs and Evolutionary Trees. Tandy Warnow Chordal Graphs and Evolutionary Trees Tandy Warnow Possible Indo-European tree (Ringe, Warnow and Taylor 2000) Anatolian Vedic Iranian Greek Italic Celtic Tocharian Armenian Germanic Baltic Slavic Albanian

More information

Efficiently Calculating Evolutionary Tree Measures Using SAT

Efficiently Calculating Evolutionary Tree Measures Using SAT Efficiently Calculating Evolutionary Tree Measures Using SAT Maria Luisa Bonet 1 and Katherine St. John 2 1 Lenguajes y Sistemas Informáticos, Universidad Politécnica de Cataluña, Spain 2 Math & Computer

More information

Hybrid Parallelization of the MrBayes & RAxML Phylogenetics Codes

Hybrid Parallelization of the MrBayes & RAxML Phylogenetics Codes Hybrid Parallelization of the MrBayes & RAxML Phylogenetics Codes Wayne Pfeiffer (SDSC/UCSD) & Alexandros Stamatakis (TUM) February 25, 2010 What was done? Why is it important? Who cares? Hybrid MPI/OpenMP

More information

Genetics/MBT 541 Spring, 2002 Lecture 1 Joe Felsenstein Department of Genome Sciences Phylogeny methods, part 1 (Parsimony and such)

Genetics/MBT 541 Spring, 2002 Lecture 1 Joe Felsenstein Department of Genome Sciences Phylogeny methods, part 1 (Parsimony and such) Genetics/MBT 541 Spring, 2002 Lecture 1 Joe Felsenstein Department of Genome Sciences joe@gs Phylogeny methods, part 1 (Parsimony and such) Methods of reconstructing phylogenies (evolutionary trees) Parsimony

More information

Parallel Implementation of a Quartet-Based Algorithm for Phylogenetic Analysis

Parallel Implementation of a Quartet-Based Algorithm for Phylogenetic Analysis Parallel Implementation of a Quartet-Based Algorithm for Phylogenetic Analysis B. B. Zhou 1, D. Chu 1, M. Tarawneh 1, P. Wang 1, C. Wang 1, A. Y. Zomaya 1, and R. P. Brent 2 1 School of Information Technologies

More information

Lecture 20: Clustering and Evolution

Lecture 20: Clustering and Evolution Lecture 20: Clustering and Evolution Study Chapter 10.4 10.8 11/11/2014 Comp 555 Bioalgorithms (Fall 2014) 1 Clique Graphs A clique is a graph where every vertex is connected via an edge to every other

More information

Algorithms for Bioinformatics

Algorithms for Bioinformatics Adapted from slides by Leena Salmena and Veli Mäkinen, which are partly from http: //bix.ucsd.edu/bioalgorithms/slides.php. 582670 Algorithms for Bioinformatics Lecture 6: Distance based clustering and

More information

Improvement of Distance-Based Phylogenetic Methods by a Local Maximum Likelihood Approach Using Triplets

Improvement of Distance-Based Phylogenetic Methods by a Local Maximum Likelihood Approach Using Triplets Improvement of Distance-Based Phylogenetic Methods by a Local Maximum Likelihood Approach Using Triplets Vincent Ranwez and Olivier Gascuel Département Informatique Fondamentale et Applications, LIRMM,

More information

Phylogenetics on CUDA (Parallel) Architectures Bradly Alicea

Phylogenetics on CUDA (Parallel) Architectures Bradly Alicea Descent w/modification Descent w/modification Descent w/modification Descent w/modification CPU Descent w/modification Descent w/modification Phylogenetics on CUDA (Parallel) Architectures Bradly Alicea

More information

Main Reference. Marc A. Suchard: Stochastic Models for Horizontal Gene Transfer: Taking a Random Walk through Tree Space Genetics 2005

Main Reference. Marc A. Suchard: Stochastic Models for Horizontal Gene Transfer: Taking a Random Walk through Tree Space Genetics 2005 Stochastic Models for Horizontal Gene Transfer Dajiang Liu Department of Statistics Main Reference Marc A. Suchard: Stochastic Models for Horizontal Gene Transfer: Taing a Random Wal through Tree Space

More information

Lecture 20: Clustering and Evolution

Lecture 20: Clustering and Evolution Lecture 20: Clustering and Evolution Study Chapter 10.4 10.8 11/12/2013 Comp 465 Fall 2013 1 Clique Graphs A clique is a graph where every vertex is connected via an edge to every other vertex A clique

More information

The SNPR neighbourhood of tree-child networks

The SNPR neighbourhood of tree-child networks Journal of Graph Algorithms and Applications http://jgaa.info/ vol. 22, no. 2, pp. 29 55 (2018) DOI: 10.7155/jgaa.00472 The SNPR neighbourhood of tree-child networks Jonathan Klawitter Department of Computer

More information

Generation of distancebased phylogenetic trees

Generation of distancebased phylogenetic trees primer for practical phylogenetic data gathering. Uconn EEB3899-007. Spring 2015 Session 12 Generation of distancebased phylogenetic trees Rafael Medina (rafael.medina.bry@gmail.com) Yang Liu (yang.liu@uconn.edu)

More information

LARGE-SCALE ANALYSIS OF PHYLOGENETIC SEARCH BEHAVIOR. A Thesis HYUN JUNG PARK

LARGE-SCALE ANALYSIS OF PHYLOGENETIC SEARCH BEHAVIOR. A Thesis HYUN JUNG PARK LARGE-SCALE ANALYSIS OF PHYLOGENETIC SEARCH BEHAVIOR A Thesis by HYUN JUNG PARK Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements for the degree

More information

in interleaved format. The same data set in sequential format:

in interleaved format. The same data set in sequential format: PHYML user's guide Introduction PHYML is a software implementing a new method for building phylogenies from sequences using maximum likelihood. The executables can be downloaded at: http://www.lirmm.fr/~guindon/phyml.html.

More information

A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony

A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony Jean-Michel Richer 1 and Adrien Goëffon 2 and Jin-Kao Hao 1 1 University of Angers, LERIA, 2 Bd Lavoisier, 49045 Anger Cedex 01,

More information

4/4/16 Comp 555 Spring

4/4/16 Comp 555 Spring 4/4/16 Comp 555 Spring 2016 1 A clique is a graph where every vertex is connected via an edge to every other vertex A clique graph is a graph where each connected component is a clique The concept of clustering

More information

Fast Parallel Maximum Likelihood-based Protein Phylogeny

Fast Parallel Maximum Likelihood-based Protein Phylogeny Fast Parallel Maximum Likelihood-based Protein Phylogeny C. Blouin 1,2,3, D. Butt 1, G. Hickey 1, A. Rau-Chaplin 1. 1 Faculty of Computer Science, Dalhousie University, Halifax, Canada, B3H 5W1 2 Dept.

More information

Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods

Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods Khaddouja Boujenfa, Nadia Essoussi, and Mohamed Limam International Science Index, Computer and Information Engineering waset.org/publication/482

More information

A Statistical Test for Clades in Phylogenies

A Statistical Test for Clades in Phylogenies A STATISTICAL TEST FOR CLADES A Statistical Test for Clades in Phylogenies Thurston H. Y. Dang 1, and Elchanan Mossel 2 1 Department of Electrical Engineering and Computer Sciences, University of California,

More information

Topology Inference from Co-Occurrence Observations. Laura Balzano and Rob Nowak with Michael Rabbat and Matthew Roughan

Topology Inference from Co-Occurrence Observations. Laura Balzano and Rob Nowak with Michael Rabbat and Matthew Roughan Topology Inference from Co-Occurrence Observations Laura Balzano and Rob Nowak with Michael Rabbat and Matthew Roughan Co-Occurrence observations Co-Occurrence observations Network Inference for Co- Occurrences

More information

MOLECULAR phylogenetic methods reconstruct evolutionary

MOLECULAR phylogenetic methods reconstruct evolutionary Calculating the Unrooted Subtree Prune-and-Regraft Distance Chris Whidden and Frederick A. Matsen IV arxiv:.09v [cs.ds] Nov 0 Abstract The subtree prune-and-regraft (SPR) distance metric is a fundamental

More information

Reconsidering the Performance of Cooperative Rec-I-DCM3

Reconsidering the Performance of Cooperative Rec-I-DCM3 Reconsidering the Performance of Cooperative Rec-I-DCM3 Tiffani L. Williams Department of Computer Science Texas A&M University College Station, TX 77843 Email: tlw@cs.tamu.edu Marc L. Smith Computer Science

More information

Fast Algorithms for Large-Scale Phylogenetic Reconstruction

Fast Algorithms for Large-Scale Phylogenetic Reconstruction Fast Algorithms for Large-Scale Phylogenetic Reconstruction by Jakub Truszkowski A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy

More information

Heterotachy models in BayesPhylogenies

Heterotachy models in BayesPhylogenies Heterotachy models in is a general software package for inferring phylogenetic trees using Bayesian Markov Chain Monte Carlo (MCMC) methods. The program allows a range of models of gene sequence evolution,

More information

Lab 07: Maximum Likelihood Model Selection and RAxML Using CIPRES

Lab 07: Maximum Likelihood Model Selection and RAxML Using CIPRES Integrative Biology 200, Spring 2014 Principles of Phylogenetics: Systematics University of California, Berkeley Updated by Traci L. Grzymala Lab 07: Maximum Likelihood Model Selection and RAxML Using

More information

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota Marina Sirota MOTIVATION: PROTEIN MULTIPLE ALIGNMENT To study evolution on the genetic level across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein

More information

Alignment of Trees and Directed Acyclic Graphs

Alignment of Trees and Directed Acyclic Graphs Alignment of Trees and Directed Acyclic Graphs Gabriel Valiente Algorithms, Bioinformatics, Complexity and Formal Methods Research Group Technical University of Catalonia Computational Biology and Bioinformatics

More information

Evolution of Tandemly Repeated Sequences

Evolution of Tandemly Repeated Sequences University of Canterbury Department of Mathematics and Statistics Evolution of Tandemly Repeated Sequences A thesis submitted in partial fulfilment of the requirements of the Degree for Master of Science

More information

Applied Mathematics Letters. Graph triangulations and the compatibility of unrooted phylogenetic trees

Applied Mathematics Letters. Graph triangulations and the compatibility of unrooted phylogenetic trees Applied Mathematics Letters 24 (2011) 719 723 Contents lists available at ScienceDirect Applied Mathematics Letters journal homepage: www.elsevier.com/locate/aml Graph triangulations and the compatibility

More information

Efficiently Inferring Pairwise Subtree Prune-and-Regraft Adjacencies between Phylogenetic Trees

Efficiently Inferring Pairwise Subtree Prune-and-Regraft Adjacencies between Phylogenetic Trees Efficiently Inferring Pairwise Subtree Prune-and-Regraft Adjacencies between Phylogenetic Trees Downloaded 01/19/18 to 140.107.151.5. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php

More information

Study of a Simple Pruning Strategy with Days Algorithm

Study of a Simple Pruning Strategy with Days Algorithm Study of a Simple Pruning Strategy with ays Algorithm Thomas G. Kristensen Abstract We wish to calculate all pairwise Robinson Foulds distances in a set of trees. Traditional algorithms for doing this

More information

Multiple Sequence Alignment. Mark Whitsitt - NCSA

Multiple Sequence Alignment. Mark Whitsitt - NCSA Multiple Sequence Alignment Mark Whitsitt - NCSA What is a Multiple Sequence Alignment (MA)? GMHGTVYANYAVDSSDLLLAFGVRFDDRVTGKLEAFASRAKIVHIDIDSAEIGKNKQPHV GMHGTVYANYAVEHSDLLLAFGVRFDDRVTGKLEAFASRAKIVHIDIDSAEIGKNKTPHV

More information

A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony

A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony A Memetic Algorithm for Phylogenetic Reconstruction with Maximum Parsimony Jean-Michel Richer 1,AdrienGoëffon 2, and Jin-Kao Hao 1 1 University of Angers, LERIA, 2 Bd Lavoisier, 49045 Anger Cedex 01, France

More information

MODERN phylogenetic analyses are rapidly increasing

MODERN phylogenetic analyses are rapidly increasing IEEE CONFERENCE ON BIOINFORMATICS & BIOMEDICINE 1 Accurate Simulation of Large Collections of Phylogenetic Trees Suzanne J. Matthews, Member, IEEE, ACM Abstract Phylogenetic analyses are growing at a rapid

More information

The History Bound and ILP

The History Bound and ILP The History Bound and ILP Julia Matsieva and Dan Gusfield UC Davis March 15, 2017 Bad News for Tree Huggers More Bad News Far more convincingly even than the (also highly convincing) fossil evidence, the

More information

Systematics - Bio 615

Systematics - Bio 615 The 10 Willis (The commandments as given by Willi Hennig after coming down from the Mountain) 1. Thou shalt not paraphyle. 2. Thou shalt not weight. 3. Thou shalt not publish unresolved nodes. 4. Thou

More information

Markovian Models of Genetic Inheritance

Markovian Models of Genetic Inheritance Markovian Models of Genetic Inheritance Elchanan Mossel, U.C. Berkeley mossel@stat.berkeley.edu, http://www.cs.berkeley.edu/~mossel/ 6/18/12 1 General plan Define a number of Markovian Inheritance Models

More information

human chimp mouse rat

human chimp mouse rat Michael rudno These notes are based on earlier notes by Tomas abak Phylogenetic Trees Phylogenetic Trees demonstrate the amoun of evolution, and order of divergence for several genomes. Phylogenetic trees

More information

Consistent and Efficient Reconstruction of Latent Tree Models

Consistent and Efficient Reconstruction of Latent Tree Models Stochastic Systems Group Consistent and Efficient Reconstruction of Latent Tree Models Myung Jin Choi Joint work with Vincent Tan, Anima Anandkumar, and Alan S. Willsky Laboratory for Information and Decision

More information

Terminology. A phylogeny is the evolutionary history of an organism

Terminology. A phylogeny is the evolutionary history of an organism Phylogeny Terminology A phylogeny is the evolutionary history of an organism A taxon (plural: taxa) is a group of (one or more) organisms, which a taxonomist adjudges to be a unit. A definition? from Wikipedia

More information

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment Courtesy of jalview 1 Motivations Collective statistic Protein families Identification and representation of conserved sequence features

More information

Outline. Phylogenetic/Evolutionary Tree. Common Dataset Size Today. Paper Comparison: Multiple Trees

Outline. Phylogenetic/Evolutionary Tree. Common Dataset Size Today. Paper Comparison: Multiple Trees Outline TreeJuxtaposer Scalable Visual Comparison of Biological Trees and Sequences tree comparison Accordion Drawing information visualization technique SequenceJuxtaposer Tamara Munzner University of

More information

A New Algorithm for the Reconstruction of Near-Perfect Binary Phylogenetic Trees

A New Algorithm for the Reconstruction of Near-Perfect Binary Phylogenetic Trees A New Algorithm for the Reconstruction of Near-Perfect Binary Phylogenetic Trees Kedar Dhamdhere, Srinath Sridhar, Guy E. Blelloch, Eran Halperin R. Ravi and Russell Schwartz March 17, 2005 CMU-CS-05-119

More information

Notes 4 : Approximating Maximum Parsimony

Notes 4 : Approximating Maximum Parsimony Notes 4 : Approximating Maximum Parsimony MATH 833 - Fall 2012 Lecturer: Sebastien Roch References: [SS03, Chapters 2, 5], [DPV06, Chapters 5, 9] 1 Coping with NP-completeness Local search heuristics.

More information

Phylogenetic Trees Lecture 12. Section 7.4, in Durbin et al., 6.5 in Setubal et al. Shlomo Moran, Ilan Gronau

Phylogenetic Trees Lecture 12. Section 7.4, in Durbin et al., 6.5 in Setubal et al. Shlomo Moran, Ilan Gronau Phylogenetic Trees Lecture 12 Section 7.4, in Durbin et al., 6.5 in Setubal et al. Shlomo Moran, Ilan Gronau. Maximum Parsimony. Last week we presented Fitch algorithm for (unweighted) Maximum Parsimony:

More information

High-Performance Algorithm Engineering for Computational Phylogenetics

High-Performance Algorithm Engineering for Computational Phylogenetics High-Performance Algorithm Engineering for Computational Phylogenetics Bernard M.E. Moret moret@cs.unm.edu Department of Computer Science University of New Mexico Albuquerque, NM 87131 High-Performance

More information

Cooperative Rec-I-DCM3: A Population-Based Approach for Reconstructing Phylogenies

Cooperative Rec-I-DCM3: A Population-Based Approach for Reconstructing Phylogenies Cooperative Rec-I-DCM3: A Population-Based Approach for Reconstructing Phylogenies Tiffani L. Williams Department of Computer Science Texas A&M University tlw@cs.tamu.edu Marc L. Smith Department of Computer

More information

Algorithms for Computing Cluster Dissimilarity between Rooted Phylogenetic

Algorithms for Computing Cluster Dissimilarity between Rooted Phylogenetic Send Orders for Reprints to reprints@benthamscience.ae 8 The Open Cybernetics & Systemics Journal, 05, 9, 8-3 Open Access Algorithms for Computing Cluster Dissimilarity between Rooted Phylogenetic Trees

More information

ON HEURISTIC METHODS IN NEXT-GENERATION SEQUENCING DATA ANALYSIS

ON HEURISTIC METHODS IN NEXT-GENERATION SEQUENCING DATA ANALYSIS ON HEURISTIC METHODS IN NEXT-GENERATION SEQUENCING DATA ANALYSIS Ivan Vogel Doctoral Degree Programme (1), FIT BUT E-mail: xvogel01@stud.fit.vutbr.cz Supervised by: Jaroslav Zendulka E-mail: zendulka@fit.vutbr.cz

More information