Supplementary Online Material PASTA: ultra-large multiple sequence alignment

Size: px
Start display at page:

Download "Supplementary Online Material PASTA: ultra-large multiple sequence alignment"

Transcription

1 Supplementary Online Material PASTA: ultra-large multiple sequence alignment Siavash Mirarab, Nam Nguyen, and Tandy Warnow University of Texas at Austin - Department of Computer Science {smirarab,bayzid,tandy}@cs.utexas.edu Abstract We introduce PASTA, a new method for multiple sequence alignment of datasets with up to 200,000 sequences in [3]. Here we provide supplementary information not provided in the main paper. We give exact commands used for running the experiments, we provide extra results that did not fit in the main paper, and we provide some supplementary discussion of the results. 1

2 Contents A Appendix - Method version numbers and commands 3 B Appendix - Proofs 6 C Appendix - Further Results 10 C.1 Performance on 1000-taxon datasets C.2 Performance on biological datasets C.3 Impact of Starting Tree C.4 Choice of Opal vs. Muscle for merging alignments C.5 Impact of alignment subset size C.6 Muscle running time D Appendix - Additional Discussion 18 D.1 Alignment Accuracy Measures D.2 Comparison between SATé and PASTA

3 A Appendix - Method version numbers and commands ClustalW version 2.1: clustalw2 -quicktree -align -infile=[input sequences] -outfile=[output alignment] -output=fasta Muscle version : muscle -in [input sequences] -out [output alignment] HMMBUILD version 3.0: hmmbuild symfrac 0.0 dna [output profile] [backbone alignment] HMMALIGN version 3.0: hmmalign dna [output profile] [query file] > [output alignment] Mafft-profile version 6.956b: mafft add [query file] [backbone alignment] > [output alignment] FastTree version SSE3: fasttree -nt -gtr [input fasta] > [output tree] SATé version 2.2.7: python run sate.py config.sate2.txt (The config.sate2.txt file is defined as follows:) [commandline] two_phase = False datatype = <dna or rna> untrusted = False multilocus = False input = <input_fasta> treefile = <starting_tree> aligned = False raxml_search_after = False auto = False [fasttree] model = -gtr args = options = -nosupport -fastest 3

4 [sate] time_limit = -1.0 iter_without_imp_limit = -1 time_without_imp_limit = -1.0 break_strategy = centroid start_tree_search_from_current = True blind_after_iter_without_imp = -1 max_mem_mb = 4024 blind_mode_is_final = True blind_after_time_without_imp = -1.0 max_subproblem_size = 200 merger = muscle num_cpus = 12 after_blind_time_without_imp_limit = -1.0 max_subproblem_frac = 0.0 blind_after_total_time = -1.0 after_blind_time_term_limit = -1.0 aligner = mafft iter_limit = 3 blind_after_total_iter = -1 tree_estimator = fasttree after_blind_iter_term_limit = -1 return_final_tree_and_alignment = False move_to_blind_on_worse_score = True after_blind_iter_without_imp_limit = -1 PASTA version 1.1.0: python run pasta.py config.txt (The config.txt file is defined as follows:) [commandline] two_phase = False datatype = <dna or rna> untrusted = False multilocus = False input = <input_fasta> treefile = <starting_tre> aligned = False raxml_search_after = False 4

5 [sate] time_limit = -1.0 iter_without_imp_limit = -1 time_without_imp_limit = -1.0 break_strategy = centroid start_tree_search_from_current = True blind_after_iter_without_imp = -1 max_mem_mb = 1024 blind_mode_is_final = True blind_after_time_without_imp = -1.0 max_subproblem_size = 200 merger = opal num_cpus = 12 after_blind_time_without_imp_limit = -1.0 max_subproblem_frac = 0.0 blind_after_total_time = -1.0 after_blind_time_term_limit = -1.0 aligner = mafft iter_limit = 3 blind_after_total_iter = -1 tree_estimator = fasttree after_blind_iter_term_limit = -1 return_final_tree_and_alignment = True move_to_blind_on_worse_score = True after_blind_iter_without_imp_limit = -1 mask_gappy_sites = <0.1% of dataset size] 5

6 B Appendix - Proofs Lemma 1: Let X, Y, and Z be disjoint sequence datasets, and alignments A and A be alignments on X Z and Y Z, respectively, that induce identical alignments on Z. Let K be the length of the longest sequence in X, Y, and Z, and L be the total number of sites in A and A. Then we can merge alignments A and A using transitivity in O(L + ( X + Y + Z )K). Proof: To represent an alignment, we use a data-structure with two elements: 1) the unaligned sequence and 2) a list of integers giving the position of each letter in the aligned sequences. Assume A has k columns, and A has k columns. We start by finding the sequences that belong to Z. For each shared sequence in Z, we find the columns that are non-gap in at least one shared sequence in A, and do the same thing for A (we call these shared columns). Calculating shared columns can be done in O(K Z ), because our data-structure for representing alignments has the list of column positions for each character of each sequence of Z. Let k s denote the number of shared columns. After computing shared columns we know that the final alignment will have k + k k s columns. We simultaneously iterate through the k columns in A and k columns in A, and map these numbers to position numbers in the output alignment. We start at the leftmost position of both alignments, and keep a position in A (denoted by p), a position in A (dented by p ), and a position in the output alignment (denoted by q). If both p and p correspond to a shared column, we map both to q and increment all three. Otherwise, w.l.o.g. assume p is not a shared column; we map p to q and increment only p and q. At the end of this process, we have a mapping from columns of both input alignments to the columns of the output alignment, and this procedure takes O(k + k k s ) = O(L) time. Finally, we build the output alignment by adding sequences from the original alignments, and by mapping their column indices using the mapping computed above. This step takes O(k( X + Y + Z )). Thus, the final running time is O(L + k( X + Y + Z )). Note that for any single gene alignment, L << k( X + Y + Z ), and therefore can be omitted from the analysis. Theorem 1: Let {A 1, A 2,..., A k } be a set of Type 1 sub-alignments, and let {X 1, X 2,..., X l } be a set of Type 2 sub-alignments defined by any spanning tree on this set. Then, (a) any two orderings of pairwise transitivity mergers, using the procedure described above, produce the same final multiple sequence alignment, and (b) if no A i has any false positives homologies (with respect to the true alignment) then the final multiple sequence alignment will also not have any false positive homologies. Proof: An alignment is entirely defined by the set of homologies it contains. 6

7 Each set of homologies in each column of an alignment creates an equivalence class, and thus the alignment constitutes an equivalence relation. Thus, the transitive closure of a set A of alignments, denoted tc(a), is defined to be the equivalence relation defined by the transitive closure of the equivalence relations defining each alignment in A. It is easy to see that this is equivalent to saying that letters x and y are in the same column within tc(a) if and only if there is a sequence of alignments A 1, A 2,..., A k with each A i A, and letters x 1, x 2,..., x k, such that x and x 1 are in the same column in A 1, x i and x i+1 are in the same column within A i+1 (i = 1, 2,..., k 1), and x k and y are in the same column in A k. However, by the constraints on the Type 2 sub-alignments (they cannot modify the Type 1 sub-alignments), the sequence of alignments that implies that x and y are in the same column cannot repeat any sub-alignment. Hence, when we apply the transitivity merge of two sub-alignments and infer a new homology between letters x and y, it is because there is some z such that x and z are in the same column in one sub-alignment, and z and y are in the same column in the other subalignment. This will be helpful to us in proving that the simple algorithm we have suffices to define the transitivity merge. We prove that the transitivity merge algorithm computes the transitive closure tc(a) and so is independent of the particular ordering on the edge contractions. Note that at any point in the transitive merge algorithm, some number of Type 3 sub-alignments may have been computed. We can therefore trace the creation of any Type 3 sub-alignment from the beginning set of Type 1 and Type 2 sub-alignments, and note the sequence and number of the pairwise transitivity mergers involved. We can also note the initial set of Type 2 sub-alignments involved in creating the Type 3 sub-alignment. Note that the number n of such additional operations is at least 1 for every Type 3 sub-alignment, but never more than k 1, where k is the number of Type 1 sub-alignments. We will prove by induction on n that every Type 3 sub-alignment that is created by n pairwise mergers applied to a set A of Type 2 sub-alignments is identical to tc(a). The base case is n = 1. Let A and A be Type 2 sub-alignments on X Z and Y Z, respectively, with X Y =. The definition of the pairwise merge of these two Type 2 sub-alignments matches the definition of tc({a, A }), and so the base case is proven. Now assume by induction that there is some n 1 such that any Type 3 sub-alignment formed by at most n pairwise transitivity mergers applied to a set A of Type 2 sub-alignments is identical to tc(a). Let A 1 be some Type 3 sub-alignment created by n + 1 pairwise mergers, and let A 2 and A 3 be the pair of sub-alignments (each is either Type 2 or Type 3) that are merged 7

8 together through the transitivity merger to form A 1. Hence, A 2 is on X Y, A 3 is on Y Z, and X Z =. Note that A 2 and A 3 were each created using fewer than n pairwise transitivity mergers, and so are identical to the transitive closure applied to a set of Type 2 sub-alignments. The result of merging these two sub-alignments therefore adds in remaining homologies that can only be inferred through transitivity. Hence, the result of applying the transitive closure algorithm is an alignment that has only those pairwise homologies that can be inferred through transitivity. As a result, if each Type 2 sub-alignment has no false positives, then neither can the result of the transitive closure algorithm. The only thing we now need to show is that all pairwise homologies that can be inferred through transitivity are in the final alignment. So suppose letter x in X and letter z in Z should be in the same column in the true transitive closure of all Type 2 sub-alignments in A. Then the path of alignments linking x to z must go directly from x to some y Y via some Type 2 sub-alignment and then from y directly to z via some Type 2 sub-alignment. However, the Type 2 sub-alignments in A are obtained using the edges of the spanning tree, and hence the path linking x to z must be obtained using sub-alignments A and A. Hence the pairwise transitivity merge of A and A would correctly detect that x and z are in the same equivalence class, and the result is proved. Theorem 2: Given m Type 1 alignments and m 1 Type 2 alignments, the algorithm to compute the transitivity merge of these alignments uses O(Km log m + ml) time, where K is the maximum length of any sequence (not counting gaps) in any Type 1 alignment and L is length of output alignment. Proof: Let our dataset consist of N sequences, with each sequence of length at most K, and for the sake of simplicity, assume that our decomposition produces m subsets, all with equal sizes (note that centroid decomposition produces balanced subsets, so this assumption is justified). As described before, in Step 5, we chose an edge e = (v, w) from the spanning tree, contract that edge, and perform two transitivity merges: one between S(v) and Label(e), and another between the result of the first merger and S(w). Based on Lemma 1, the first transitivity merge will have a running time of O(K( S(v) +2) +L), and the second merge will have a cost of O(K( S(v) + S(w) ) + L), and thus the cost of each edge contraction is O(K(2 S(v) + S(w) ) + L). Now, imagine the case where the spanning tree is a path. If we start merging from one end to the other end, we get the total running time of O(K( m)) = O(Km 2 ); however, we can improve on that. The important observation is that the spanning tree should be traversed such that transitivity mergers are between alignments with balanced number of sequences on each side. 8

9 The order in which edges are processed in PASTA is obtained by a recursive approach. Given the spanning tree, we divide it into two halves on the centroid edge, and thus obtain two roughly equal size subtrees. We process each half recursively using the same strategy, and thus get two single leaves at the endpoints of the centroid edge. Each leaf would represent the merger of all alignments in each half, and by construction they would have roughly equal size. We then contract the centroid edge, merge the two sides, and obtain the full alignment. If each half has roughly x sequences, the cost of the final edge contraction is O(K(2x + x) + L) = O(3Kx + L) (as shown before). If f(x) denotes the cost of applying our transitivity merger on a spanning tree with x nodes, we have f(2x) = 2f(x) + 3kx + L which has a O(x log(x) + xl) solution. Therefore, our particular order of traversing the spanning tree results in a total cost of O(Km log(m) + ml). 9

10 C Appendix - Further Results This section includes: Results on 1000-taxon simulated datasets Tree error using reference trees based on different bootstrap support thresholds for the Gutell datasets Comparison of PASTA results based on different starting trees Comparison of PASTA results based on using Opal or Muscle to produce Type 2 sub-alignments Comparison of PASTA results for different alignment subset sizes Running time of Muscle as a function of dataset size 10

11 C.1 Performance on 1000-taxon datasets 40% Missing Branch Rate (FN) 30% 20% 10% 1000M3 1000S3 1000M2 1000L2 1000S2 1000L3 1000S1 1000L1 1000M1 True Alignment Mafft linsi OPAL MUSCLE SATé II PASTA Figure 1: Tree errors on 1000-taxon datasets. We report the missing branch rate of trees estimated by FastTree-2 on the true alignment, and on alignments estimated using PASTA, SATé-II, and other alignment methods, on challenging 1000-taxon datasets from [1]. In Figure 1, we show FN error rates of maximum likelihood trees estimated using FastTree-2 on different alignments, using datasets studied in [1, 2]. The model conditions are labelled by the gap length distribution (M for medium length, S for short, and L for long), and increase in difficulty (higher rates of indels and substitutions) from left to right. Note that the most accurate results are obtained by ML on the true alignment, but that PASTA and SATé-II have almost indistinguishable accuracy on these data. 11

12 C.2 Performance on biological datasets S.3 16S.B.ALL 16S.T Tree Error (FN rate) Bootstrap Support Threshold Muscle SATe2 PASTA Figure 2: Tree error (FN) rates on biological datasets as a function of bootstrap threshold chosen to define the reference tree. Biological datasets present a challenge for benchmarking because the true phylogeny cannot be known with certainty. We used the reference alignment (estimated based on secondary structures) produced by the Gutell laboratory, and then estimated a maximum likelihood tree on each dataset using RAxML with bootstrapping. By contracting all branches with low support we obtain a reference tree. In the text we reported results based on a bootstrap support threshold of 75%; here we present results based on other thresholds. Note that for high thresholds, more branches will be collapsed, whereas for low thresholds, fewer branches will be collapsed. Figure 2 reports results only for Muscle, SATé-II, and PASTA, the three methods with the best performance on these biological datasets, but exploring the impact of changing the threshold for contracting branches. Note that the difference in performance is small in most cases, and that relative performance generally does not change much as a function of threshold. Typically PASTA has slightly better tree accuracy than both SATé and Muscle, but there are a few cases where the relative performance changes. However, there are a few thresholds and datasets where the relative ordering between SATé, PASTA, and Muscle changes, so that Muscle is tied for best, or PASTA is less accurate than SATé. With respect to the remaining methods, the PASTA starting tree had the least accurate results, with especially poor accuracy on the 16S.T dataset, where the starting alignment did not include one of the taxa, and so that taxon was added randomly 12

13 into the tree computed without its sequence. MAFFT-profile and Clustalquicktree had generally good performance, but clearly less accurate than Muscle, PASTA, and SATé. 13

14 C.3 Impact of Starting Tree We also looked at the impact of using different starting trees. We tested starting trees obtained by using FastTree-2 on three different alignments: the Mafft-Profile alignment, the Mafft-PartTree alignment, and our new method for computing the starting alignment (HMMER-based). Table 1 shows the PASTA alignment and tree accuracy starting from these starting trees, after running PASTA for three iterations. There does not seem to be any impact of the starting tree on the TC score or the final tree accuracy. However, alignment SP-score and Modeler score decrease with the increase in error for the starting tree, but the differences are very small. Thus, PASTA is highly robust to starting tree, when run for three iterations. Alignment Accuracy Tree Accuracy SP-score Modeler score TC 1-FN Starting Tree (FN = 12.4% ) 88.5% 89.1% % Mafft-Profile (FN = 19.1%) 87.2% 87.8% % Mafft-PartTree (FN = 28.7%) 83.2% 84.6% % Table 1: Effects of the starting tree in PASTA algorithm, based on three iterations of PASTA on one replicate of the 10k RNASim dataset. We show PASTA using three different starting trees: our default HMMER-based technique, the tree computed using FastTree-2 on the Mafft-Profile alignment, and the tree based on FastTree-2 on Mafft-PartTree alignment. Boldface indicates the best performance on this data. 14

15 C.4 Choice of Opal vs. Muscle for merging alignments We also explored the difference between PASTA using Opal (the default merger) or Muscle for merging Type 1 alignments into Type 2 alignments; see Table 2. This comparison showed that OPAL can result in better final alignments and trees compared to Muscle. For example, on the 10,000 RNASim dataset, PASTA with OPAL and with MUSCLE had tree errors of 10.7% and 11.2%, a slight improvement, but that alignment accuracy changed substantially, especially when considering the average of the SP and modeler scores. Parameters Tree Error SP-Score Modeler Score TC score Running Time (sec) Opal Type 2 merger 10.7% 88.5% 89.1% ,478 Muscle Type 2 merger 11.2% 67.2% 80.6% ,884 Table 2: Impact of using Muscle versus Opal as the alignment merger technique. We report results on one replicate of the 10K RNASim dataset, using three iteration of PASTA using all default settings for other algorithmic parameters. We report the missing branch rate for the tree error, and two accuracy measures for alignments: the Total Column (TC) score, and the average of the SP-score and modeler score. Boldface indicates the best performance on this data. 15

16 C.5 Impact of alignment subset size We also compared results for PASTA where we changed the alignment subset size from 200 (default) to smaller values; see Table 3. We show results for one replicate of the 10K RNASim dataset, and also for the 16S.T dataset. Note that the tree error changes only slightly, with slight differences between the two datasets. As the alignment subset size is reduced from 200 to 100, there is a decrease in error for the 16S.T dataset, but the tree error for the 10K dataset first decreases and then returns to the original value. The sumof-pairs alignment accuracy measures do not seem to be consistently affected by the change in alignment subset size between the two datasets either, (initial decrease in accuracy then increase for 10K, and consistent decrease in accuracy for 16S.T), but here also the changes are small. On the other hand, there is a consistent change in the TC score, where decreasing the alignment subset size improves the TC score for both datasets, and substantially for the 10K dataset. Finally, the running time went down for both datasets as a result of reducing the subset size from 200 to 100. Thus, although changes in alignment subset size do seem to impact the alignment and tree accuracy, the differences are generally small and may not be statistically significant. Furthermore, the impact may depend on the dataset. However, the impact on the running time is substantial, so that reducing from datasets of size 200 to datasets of size 50 reduced the running time by 55% for the 10K dataset and by 37% for the 16S.T dataset. Dataset Parameters Tree Alignment TC Running Time Error Accuracy score (secs) 10K Subset size % ,478 10K Subset size % ,235 10K Subset size % ,015 16S.T Subset size % ,120 16S.T Subset size % ,086 16S.T Subset size % ,780 Table 3: Impact of alignment subset size. We report results on one replicate of the 10K RNASim dataset and also on the 16S.T dataset, using three iteration of PASTA in which we explore the impact of changing the subset size from 200 (the default) to 100 and 50; all other algorithmic parameters use default values. We report the missing branch rate for the tree error, two accuracy measures for alignments (the Total Column (TC) score, and the average of the SP-score and modeler score), and the running time in seconds. Boldface indicates the best performance on this data. 16

17 C.6 Muscle running time Figure 3 shows that the running time of MUSCLE mergers scale linearly with the alignment length, but grows more rapidly with the number of sequences Running Time (seconds) Running Time (minutes) Alignment Length Number of Sequences (a) (a) (b) (b) Figure 3: Running time for Muscle mergers as a function of (a) the alignment length with fixed number of sequences (10,000) and (b) the number of sequences sequences with fixed sequence length (2000bp). 17

18 D Appendix - Additional Discussion D.1 Alignment Accuracy Measures One of the interesting observations in this study is that the average of SPscore and modeler score is not always predictive of tree accuracy. For example, on the Gutell datasets, MAFFT-profile has better sum-of-pairs alignment accuracy than Muscle but produces less accurate trees. Similarly, the starting alignments used by PASTA (computed using the HMM-based technique) had close to the best sum-of-pairs alignment accuracy but produced less accurate trees than PASTA, and SATé had much lower alignment accuracy scores than MAFFT-profile on the Gutell and the 10K RNASim datasets, but produced more accurate trees. This is generally not what has been expected, and perhaps runs counter to earlier studies that have explored the impact of alignment estimation on tree estimation. However, this inconsistency between relative performance in terms of tree accuracy and standard alignment accuracy criteria has been observed before on large phylogenetic datasets [2], where Opal was observed to produce alignments with very good sum-of-pairs scores but where trees on Opal alignments were not particularly accurate. Thus, the disconnect between standard alignment criteria and phylogenetic accuracy may be a general issue for large datasets. The explanation may be that not all homologies have the same evolutionary signal, and that failing to recover some homologies may not have much impact on tree estimation, while other homologies may be essential for phylogenetic accuracy. Furthermore, alignment methods that aim to recover the conserved regions may be able to have high alignment accuracy scores but fail to produce good trees - because conserved regions may not be as useful for phylogeny estimation as regions that change. Thus, the sites and even specific homologies that are most informative of the phylogenetic branching process may not be the homologies that many alignment methods are trained to recover. More generally, then, this disconnect suggests a real challenge in using alignment metrics to predict the utility of an alignment, especially if the purpose of the alignment is phylogeny estimation. D.2 Comparison between SATé and PASTA The technique to re-align sequences on a guide tree used within PASTA was designed to address the running time challenges in using Opal or Muscle to repeatedly merge alignments until the entire alignment is created. As we 18

19 saw, the last pairwise merge is the most expensive, and becomes prohibitively expensive on large datasets. The new design, which computes overlapping compatible Type 2 sub-alignments and then merges these using transitivity, addresses that computational challenge. However, as we saw, it also provides accuracy improvements on large datasets. Here we make a direct comparison between the two methods, to clarify where PASTA has an advantage taxon datasets from [1]: PASTA and SATé have very similar tree accuracy, and improve relative to the other methods tested. The running time differences are small on these datasets, because they are not that big. Gutell datasets, 16S.3, 16S.T, and 16S.B.ALL. These datasets have between 6K and 28K sequences. PASTA is slightly more accurate than SATé with respect to tree accuracy, but has much higher alignment accuracy. In addition, PASTA uses less time. The running time advantage depends on the number of sequences, however; for example, on 16S.3, the smallest of these three datasets, PASTA uses about 80% of the time used by SATé, but on 16S.B.ALL, the largest of these datasets, PASTA finishes in 6 hours while SATé finishes in 24 hours. RNASim datasts, ranging from 10K to 200K sequences. When restricted to the 24 hour limit, SATé can only complete analyses for the smallest of these datasets. The comparison between SATé and PASTA on this model condition (10 replicates) show that PASTA is much faster, finishing in less than 4 hours while SATé uses more than 6 hours. PASTA and SATé are close in terms of tree error, with a small advantage to PASTA, but PASTA has much better alignment accuracy. Running SATé on a machine where runs are not limited to 24 hours produces a tree after roughly 70 hours of running time per iteration for the RNASim dataset with 50K sequences. Thus SATé is much slower than PASTA, which finishes each iteration in about 5 hours on this dataset. Moreover, the final tree and alignment of SATé are not nearly as accurate as PASTA; the tree error is 50% higher (12.6% for SATé versus 8.2% for PASTA), and the alignment is much less accurate (e.g. SATé recovers only 30 columns entirely correctly, but PASTA recovers 311 columns). In general, therefore, PASTA and SATé have very similar accuracy on small datasets, but PASTA has a large advantage over SATé on large datasets with respect to alignment accuracy measures, and a smaller advantage with 19

20 respect to tree accuracy. However, PASTA is much faster than SATé, and the running time advantage increases with the size of the dataset. Thus, PASTA dominates SATé for large-scale phylogeny and alignment estimation. Acknowledgments This research was supported in part by NSF grant DBI to TW, by an International Predoctoral Fellowship to SM from the HHMI, and by a subgrant from the University of Alberta to TW, made possible through a donation from Musea Ventures, which is held by Professor Gane Ka-Shu Wong. The authors wish to thank the anonymous referees for their helpful comments. References [1] K. Liu, S. Raghavan, S. Nelesen, C. R. Linder, and T. Warnow. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science, 324(5934): , [2] K. Liu, T.J. Warnow, M.T. Holder, S. Nelesen, J. Yu, A. Stamatakis, and C.R. Linder. SATé-II: Very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Syst Biol, 61(1):90 106, [3] S. Mirarab, N. Nguyen, and T. Warnow. PASTA: ultra-large multiple sequence alignment. In Proc. ISMB 2014,

PASTA: Ultra-Large Multiple Sequence Alignment

PASTA: Ultra-Large Multiple Sequence Alignment PASTA: Ultra-Large Multiple Sequence Alignment Siavash Mirarab, Nam Nguyen, and Tandy Warnow University of Texas at Austin - Department of Computer Science 2317 Speedway, Stop D9500 Austin, TX 78712 {smirarab,namphuon,tandy}@cs.utexas.edu

More information

Basics of Multiple Sequence Alignment

Basics of Multiple Sequence Alignment Basics of Multiple Sequence Alignment Tandy Warnow February 10, 2018 Basics of Multiple Sequence Alignment Tandy Warnow Basic issues What is a multiple sequence alignment? Evolutionary processes operating

More information

PLOS Currents Tree of Life. Performance on the six smallest datasets

PLOS Currents Tree of Life. Performance on the six smallest datasets Multiple sequence alignment: a major challenge to large-scale phylogenetics November 18, 2010 Tree of Life Kevin Liu, C. Randal Linder, Tandy Warnow Liu K, Linder CR, Warnow T. Multiple sequence alignment:

More information

Algorithms for Ultra-large Multiple Sequence Alignment and Phylogeny Estimation

Algorithms for Ultra-large Multiple Sequence Alignment and Phylogeny Estimation Algorithms for Ultra-large Multiple Sequence Alignment and Phylogeny Estimation Tandy Warnow Department of Computer Science The University of Texas at Austin Phylogeny (evolutionary tree) Orangutan Gorilla

More information

Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters

Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters Designing parallel algorithms for constructing large phylogenetic trees on Blue Waters Erin Molloy University of Illinois at Urbana Champaign General Allocation (PI: Tandy Warnow) Exploratory Allocation

More information

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota Marina Sirota MOTIVATION: PROTEIN MULTIPLE ALIGNMENT To study evolution on the genetic level across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein

More information

Profiles and Multiple Alignments. COMP 571 Luay Nakhleh, Rice University

Profiles and Multiple Alignments. COMP 571 Luay Nakhleh, Rice University Profiles and Multiple Alignments COMP 571 Luay Nakhleh, Rice University Outline Profiles and sequence logos Profile hidden Markov models Aligning profiles Multiple sequence alignment by gradual sequence

More information

Sequence length requirements. Tandy Warnow Department of Computer Science The University of Texas at Austin

Sequence length requirements. Tandy Warnow Department of Computer Science The University of Texas at Austin Sequence length requirements Tandy Warnow Department of Computer Science The University of Texas at Austin Part 1: Absolute Fast Convergence DNA Sequence Evolution AAGGCCT AAGACTT TGGACTT -3 mil yrs -2

More information

Large-scale Multiple Sequence Alignment and Phylogenetic Estimation. Tandy Warnow Department of Computer Science The University of Texas at Austin

Large-scale Multiple Sequence Alignment and Phylogenetic Estimation. Tandy Warnow Department of Computer Science The University of Texas at Austin Large-scale Multiple Sequence Alignment and Phylogenetic Estimation Tandy Warnow Department of Computer Science The University of Texas at Austin Phylogeny (evolutionary tree) Orangutan Gorilla Chimpanzee

More information

Scaling species tree estimation methods to large datasets using NJMerge

Scaling species tree estimation methods to large datasets using NJMerge Scaling species tree estimation methods to large datasets using NJMerge Erin Molloy and Tandy Warnow {emolloy2, warnow}@illinois.edu University of Illinois at Urbana Champaign 2018 Phylogenomics Software

More information

DO NOT RE-DISTRIBUTE THIS SOLUTION FILE

DO NOT RE-DISTRIBUTE THIS SOLUTION FILE Professor Kindred Math 104, Graph Theory Homework 3 Solutions February 14, 2013 Introduction to Graph Theory, West Section 2.1: 37, 62 Section 2.2: 6, 7, 15 Section 2.3: 7, 10, 14 DO NOT RE-DISTRIBUTE

More information

Multiple sequence alignment. November 20, 2018

Multiple sequence alignment. November 20, 2018 Multiple sequence alignment November 20, 2018 Why do multiple alignment? Gain insight into evolutionary history Can assess time of divergence by looking at the number of mutations needed to change one

More information

Richard Feynman, Lectures on Computation

Richard Feynman, Lectures on Computation Chapter 8 Sorting and Sequencing If you keep proving stuff that others have done, getting confidence, increasing the complexities of your solutions for the fun of it then one day you ll turn around and

More information

Dynamic Programming for Phylogenetic Estimation

Dynamic Programming for Phylogenetic Estimation 1 / 45 Dynamic Programming for Phylogenetic Estimation CS598AGB Pranjal Vachaspati University of Illinois at Urbana-Champaign 2 / 45 Coalescent-based Species Tree Estimation Find evolutionary tree for

More information

Lecture 5: Multiple sequence alignment

Lecture 5: Multiple sequence alignment Lecture 5: Multiple sequence alignment Introduction to Computational Biology Teresa Przytycka, PhD (with some additions by Martin Vingron) Why do we need multiple sequence alignment Pairwise sequence alignment

More information

CS 581. Tandy Warnow

CS 581. Tandy Warnow CS 581 Tandy Warnow This week Maximum parsimony: solving it on small datasets Maximum Likelihood optimization problem Felsenstein s pruning algorithm Bayesian MCMC methods Research opportunities Maximum

More information

Treewidth and graph minors

Treewidth and graph minors Treewidth and graph minors Lectures 9 and 10, December 29, 2011, January 5, 2012 We shall touch upon the theory of Graph Minors by Robertson and Seymour. This theory gives a very general condition under

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Introduction to Computational Phylogenetics

Introduction to Computational Phylogenetics Introduction to Computational Phylogenetics Tandy Warnow The University of Texas at Austin No Institute Given This textbook is a draft, and should not be distributed. Much of what is in this textbook appeared

More information

Hybrid Parallelization of the MrBayes & RAxML Phylogenetics Codes

Hybrid Parallelization of the MrBayes & RAxML Phylogenetics Codes Hybrid Parallelization of the MrBayes & RAxML Phylogenetics Codes Wayne Pfeiffer (SDSC/UCSD) & Alexandros Stamatakis (TUM) February 25, 2010 What was done? Why is it important? Who cares? Hybrid MPI/OpenMP

More information

ML phylogenetic inference and GARLI. Derrick Zwickl. University of Arizona (and University of Kansas) Workshop on Molecular Evolution 2015

ML phylogenetic inference and GARLI. Derrick Zwickl. University of Arizona (and University of Kansas) Workshop on Molecular Evolution 2015 ML phylogenetic inference and GARLI Derrick Zwickl University of Arizona (and University of Kansas) Workshop on Molecular Evolution 2015 Outline Heuristics and tree searches ML phylogeny inference and

More information

Multiple Alignment. From [4] D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press, New York, 1997.

Multiple Alignment. From [4] D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press, New York, 1997. Multiple Alignment From [4] D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press, New York, 1997. 1 Multiple Sequence Alignment Introduction Very active research area Very

More information

Problem Set 2 Solutions

Problem Set 2 Solutions Problem Set 2 Solutions Graph Theory 2016 EPFL Frank de Zeeuw & Claudiu Valculescu 1. Prove that the following statements about a graph G are equivalent. - G is a tree; - G is minimally connected (it is

More information

1 (15 points) LexicoSort

1 (15 points) LexicoSort CS161 Homework 2 Due: 22 April 2016, 12 noon Submit on Gradescope Handed out: 15 April 2016 Instructions: Please answer the following questions to the best of your ability. If you are asked to show your

More information

3 Fractional Ramsey Numbers

3 Fractional Ramsey Numbers 27 3 Fractional Ramsey Numbers Since the definition of Ramsey numbers makes use of the clique number of graphs, we may define fractional Ramsey numbers simply by substituting fractional clique number into

More information

Computer Science 236 Fall Nov. 11, 2010

Computer Science 236 Fall Nov. 11, 2010 Computer Science 26 Fall Nov 11, 2010 St George Campus University of Toronto Assignment Due Date: 2nd December, 2010 1 (10 marks) Assume that you are given a file of arbitrary length that contains student

More information

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanford.edu) January 11, 2018 Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 In this lecture

More information

Supplementary Material, corresponding to the manuscript Accumulated Coalescence Rank and Excess Gene count for Species Tree Inference

Supplementary Material, corresponding to the manuscript Accumulated Coalescence Rank and Excess Gene count for Species Tree Inference Supplementary Material, corresponding to the manuscript Accumulated Coalescence Rank and Excess Gene count for Species Tree Inference Sourya Bhattacharyya and Jayanta Mukherjee Department of Computer Science

More information

V10 Metabolic networks - Graph connectivity

V10 Metabolic networks - Graph connectivity V10 Metabolic networks - Graph connectivity Graph connectivity is related to analyzing biological networks for - finding cliques - edge betweenness - modular decomposition that have been or will be covered

More information

Lab 4: Multiple Sequence Alignment (MSA)

Lab 4: Multiple Sequence Alignment (MSA) Lab 4: Multiple Sequence Alignment (MSA) The objective of this lab is to become familiar with the features of several multiple alignment and visualization tools, including the data input and output, basic

More information

Definition For vertices u, v V (G), the distance from u to v, denoted d(u, v), in G is the length of a shortest u, v-path. 1

Definition For vertices u, v V (G), the distance from u to v, denoted d(u, v), in G is the length of a shortest u, v-path. 1 Graph fundamentals Bipartite graph characterization Lemma. If a graph contains an odd closed walk, then it contains an odd cycle. Proof strategy: Consider a shortest closed odd walk W. If W is not a cycle,

More information

INFERRING OPTIMAL SPECIES TREES UNDER GENE DUPLICATION AND LOSS

INFERRING OPTIMAL SPECIES TREES UNDER GENE DUPLICATION AND LOSS INFERRING OPTIMAL SPECIES TREES UNDER GENE DUPLICATION AND LOSS M. S. BAYZID, S. MIRARAB and T. WARNOW Department of Computer Science, The University of Texas at Austin, Austin, Texas 78712, USA E-mail:

More information

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010 Principles of Bioinformatics BIO540/STA569/CSI660 Fall 2010 Lecture 11 Multiple Sequence Alignment I Administrivia Administrivia The midterm examination will be Monday, October 18 th, in class. Closed

More information

In this lecture, we ll look at applications of duality to three problems:

In this lecture, we ll look at applications of duality to three problems: Lecture 7 Duality Applications (Part II) In this lecture, we ll look at applications of duality to three problems: 1. Finding maximum spanning trees (MST). We know that Kruskal s algorithm finds this,

More information

MC 302 GRAPH THEORY 10/1/13 Solutions to HW #2 50 points + 6 XC points

MC 302 GRAPH THEORY 10/1/13 Solutions to HW #2 50 points + 6 XC points MC 0 GRAPH THEORY 0// Solutions to HW # 0 points + XC points ) [CH] p.,..7. This problem introduces an important class of graphs called the hypercubes or k-cubes, Q, Q, Q, etc. I suggest that before you

More information

Evolutionary tree reconstruction (Chapter 10)

Evolutionary tree reconstruction (Chapter 10) Evolutionary tree reconstruction (Chapter 10) Early Evolutionary Studies Anatomical features were the dominant criteria used to derive evolutionary relationships between species since Darwin till early

More information

DO NOT RE-DISTRIBUTE THIS SOLUTION FILE

DO NOT RE-DISTRIBUTE THIS SOLUTION FILE Professor Kindred Math 104, Graph Theory Homework 2 Solutions February 7, 2013 Introduction to Graph Theory, West Section 1.2: 26, 38, 42 Section 1.3: 14, 18 Section 2.1: 26, 29, 30 DO NOT RE-DISTRIBUTE

More information

arxiv: v5 [cs.dm] 9 May 2016

arxiv: v5 [cs.dm] 9 May 2016 Tree spanners of bounded degree graphs Ioannis Papoutsakis Kastelli Pediados, Heraklion, Crete, reece, 700 06 October 21, 2018 arxiv:1503.06822v5 [cs.dm] 9 May 2016 Abstract A tree t-spanner of a graph

More information

Theorem 2.9: nearest addition algorithm

Theorem 2.9: nearest addition algorithm There are severe limits on our ability to compute near-optimal tours It is NP-complete to decide whether a given undirected =(,)has a Hamiltonian cycle An approximation algorithm for the TSP can be used

More information

arxiv:cs/ v1 [cs.ds] 20 Feb 2003

arxiv:cs/ v1 [cs.ds] 20 Feb 2003 The Traveling Salesman Problem for Cubic Graphs David Eppstein School of Information & Computer Science University of California, Irvine Irvine, CA 92697-3425, USA eppstein@ics.uci.edu arxiv:cs/0302030v1

More information

6.001 Notes: Section 4.1

6.001 Notes: Section 4.1 6.001 Notes: Section 4.1 Slide 4.1.1 In this lecture, we are going to take a careful look at the kinds of procedures we can build. We will first go back to look very carefully at the substitution model,

More information

The Structure of Bull-Free Perfect Graphs

The Structure of Bull-Free Perfect Graphs The Structure of Bull-Free Perfect Graphs Maria Chudnovsky and Irena Penev Columbia University, New York, NY 10027 USA May 18, 2012 Abstract The bull is a graph consisting of a triangle and two vertex-disjoint

More information

Genome 559: Introduction to Statistical and Computational Genomics. Lecture15a Multiple Sequence Alignment Larry Ruzzo

Genome 559: Introduction to Statistical and Computational Genomics. Lecture15a Multiple Sequence Alignment Larry Ruzzo Genome 559: Introduction to Statistical and Computational Genomics Lecture15a Multiple Sequence Alignment Larry Ruzzo 1 Multiple Alignment: Motivations Common structure, function, or origin may be only

More information

3.4 Multiple sequence alignment

3.4 Multiple sequence alignment 3.4 Multiple sequence alignment Why produce a multiple sequence alignment? Using more than two sequences results in a more convincing alignment by revealing conserved regions in ALL of the sequences Aligned

More information

A Randomized Algorithm for Comparing Sets of Phylogenetic Trees

A Randomized Algorithm for Comparing Sets of Phylogenetic Trees A Randomized Algorithm for Comparing Sets of Phylogenetic Trees Seung-Jin Sul and Tiffani L. Williams Department of Computer Science Texas A&M University E-mail: {sulsj,tlw}@cs.tamu.edu Technical Report

More information

On total domination and support vertices of a tree

On total domination and support vertices of a tree On total domination and support vertices of a tree Ermelinda DeLaViña, Craig E. Larson, Ryan Pepper and Bill Waller University of Houston-Downtown, Houston, Texas 7700 delavinae@uhd.edu, pepperr@uhd.edu,

More information

Paths, Flowers and Vertex Cover

Paths, Flowers and Vertex Cover Paths, Flowers and Vertex Cover Venkatesh Raman, M.S. Ramanujan, and Saket Saurabh Presenting: Hen Sender 1 Introduction 2 Abstract. It is well known that in a bipartite (and more generally in a Konig)

More information

Solution for Homework set 3

Solution for Homework set 3 TTIC 300 and CMSC 37000 Algorithms Winter 07 Solution for Homework set 3 Question (0 points) We are given a directed graph G = (V, E), with two special vertices s and t, and non-negative integral capacities

More information

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst

Phylogenetics. Introduction to Bioinformatics Dortmund, Lectures: Sven Rahmann. Exercises: Udo Feldkamp, Michael Wurst Phylogenetics Introduction to Bioinformatics Dortmund, 16.-20.07.2007 Lectures: Sven Rahmann Exercises: Udo Feldkamp, Michael Wurst 1 Phylogenetics phylum = tree phylogenetics: reconstruction of evolutionary

More information

A Reduction of Conway s Thrackle Conjecture

A Reduction of Conway s Thrackle Conjecture A Reduction of Conway s Thrackle Conjecture Wei Li, Karen Daniels, and Konstantin Rybnikov Department of Computer Science and Department of Mathematical Sciences University of Massachusetts, Lowell 01854

More information

Multiple sequence alignment. November 2, 2017

Multiple sequence alignment. November 2, 2017 Multiple sequence alignment November 2, 2017 Why do multiple alignment? Gain insight into evolutionary history Can assess time of divergence by looking at the number of mutations needed to change one sequence

More information

Parsimony-Based Approaches to Inferring Phylogenetic Trees

Parsimony-Based Approaches to Inferring Phylogenetic Trees Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 www.biostat.wisc.edu/bmi576.html Mark Craven craven@biostat.wisc.edu Fall 0 Phylogenetic tree approaches! three general types! distance:

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Chordal deletion is fixed-parameter tractable

Chordal deletion is fixed-parameter tractable Chordal deletion is fixed-parameter tractable Dániel Marx Institut für Informatik, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany. dmarx@informatik.hu-berlin.de Abstract. It

More information

Exact Algorithms Lecture 7: FPT Hardness and the ETH

Exact Algorithms Lecture 7: FPT Hardness and the ETH Exact Algorithms Lecture 7: FPT Hardness and the ETH February 12, 2016 Lecturer: Michael Lampis 1 Reminder: FPT algorithms Definition 1. A parameterized problem is a function from (χ, k) {0, 1} N to {0,

More information

Special course in Computer Science: Advanced Text Algorithms

Special course in Computer Science: Advanced Text Algorithms Special course in Computer Science: Advanced Text Algorithms Lecture 8: Multiple alignments Elena Czeizler and Ion Petre Department of IT, Abo Akademi Computational Biomodelling Laboratory http://www.users.abo.fi/ipetre/textalg

More information

These are not polished as solutions, but ought to give a correct idea of solutions that work. Note that most problems have multiple good solutions.

These are not polished as solutions, but ought to give a correct idea of solutions that work. Note that most problems have multiple good solutions. CSE 591 HW Sketch Sample Solutions These are not polished as solutions, but ought to give a correct idea of solutions that work. Note that most problems have multiple good solutions. Problem 1 (a) Any

More information

Greedy Algorithms. Previous Examples: Huffman coding, Minimum Spanning Tree Algorithms

Greedy Algorithms. Previous Examples: Huffman coding, Minimum Spanning Tree Algorithms Greedy Algorithms A greedy algorithm is one where you take the step that seems the best at the time while executing the algorithm. Previous Examples: Huffman coding, Minimum Spanning Tree Algorithms Coin

More information

A RANDOMIZED ALGORITHM FOR COMPARING SETS OF PHYLOGENETIC TREES

A RANDOMIZED ALGORITHM FOR COMPARING SETS OF PHYLOGENETIC TREES A RANDOMIZED ALGORITHM FOR COMPARING SETS OF PHYLOGENETIC TREES SEUNG-JIN SUL AND TIFFANI L. WILLIAMS Department of Computer Science Texas A&M University College Station, TX 77843-3112 USA E-mail: {sulsj,tlw}@cs.tamu.edu

More information

Figure 4.1: The evolution of a rooted tree.

Figure 4.1: The evolution of a rooted tree. 106 CHAPTER 4. INDUCTION, RECURSION AND RECURRENCES 4.6 Rooted Trees 4.6.1 The idea of a rooted tree We talked about how a tree diagram helps us visualize merge sort or other divide and conquer algorithms.

More information

Applied Mathematics Letters. Graph triangulations and the compatibility of unrooted phylogenetic trees

Applied Mathematics Letters. Graph triangulations and the compatibility of unrooted phylogenetic trees Applied Mathematics Letters 24 (2011) 719 723 Contents lists available at ScienceDirect Applied Mathematics Letters journal homepage: www.elsevier.com/locate/aml Graph triangulations and the compatibility

More information

A Fast Algorithm for Optimal Alignment between Similar Ordered Trees

A Fast Algorithm for Optimal Alignment between Similar Ordered Trees Fundamenta Informaticae 56 (2003) 105 120 105 IOS Press A Fast Algorithm for Optimal Alignment between Similar Ordered Trees Jesper Jansson Department of Computer Science Lund University, Box 118 SE-221

More information

CSE 549: Computational Biology

CSE 549: Computational Biology CSE 549: Computational Biology Phylogenomics 1 slides marked with * by Carl Kingsford Tree of Life 2 * H5N1 Influenza Strains Salzberg, Kingsford, et al., 2007 3 * H5N1 Influenza Strains The 2007 outbreak

More information

Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods

Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods Comparison of Phylogenetic Trees of Multiple Protein Sequence Alignment Methods Khaddouja Boujenfa, Nadia Essoussi, and Mohamed Limam International Science Index, Computer and Information Engineering waset.org/publication/482

More information

Introduction to Trees

Introduction to Trees Introduction to Trees Tandy Warnow December 28, 2016 Introduction to Trees Tandy Warnow Clades of a rooted tree Every node v in a leaf-labelled rooted tree defines a subset of the leafset that is below

More information

Discrete mathematics , Fall Instructor: prof. János Pach

Discrete mathematics , Fall Instructor: prof. János Pach Discrete mathematics 2016-2017, Fall Instructor: prof. János Pach - covered material - Lecture 1. Counting problems To read: [Lov]: 1.2. Sets, 1.3. Number of subsets, 1.5. Sequences, 1.6. Permutations,

More information

Multi-Cluster Interleaving on Paths and Cycles

Multi-Cluster Interleaving on Paths and Cycles Multi-Cluster Interleaving on Paths and Cycles Anxiao (Andrew) Jiang, Member, IEEE, Jehoshua Bruck, Fellow, IEEE Abstract Interleaving codewords is an important method not only for combatting burst-errors,

More information

Clustering Using Graph Connectivity

Clustering Using Graph Connectivity Clustering Using Graph Connectivity Patrick Williams June 3, 010 1 Introduction It is often desirable to group elements of a set into disjoint subsets, based on the similarity between the elements in the

More information

Distance Methods. "PRINCIPLES OF PHYLOGENETICS" Spring 2006

Distance Methods. PRINCIPLES OF PHYLOGENETICS Spring 2006 Integrative Biology 200A University of California, Berkeley "PRINCIPLES OF PHYLOGENETICS" Spring 2006 Distance Methods Due at the end of class: - Distance matrices and trees for two different distance

More information

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering Sequence clustering Introduction Data clustering is one of the key tools used in various incarnations of data-mining - trying to make sense of large datasets. It is, thus, natural to ask whether clustering

More information

Graph Connectivity G G G

Graph Connectivity G G G Graph Connectivity 1 Introduction We have seen that trees are minimally connected graphs, i.e., deleting any edge of the tree gives us a disconnected graph. What makes trees so susceptible to edge deletions?

More information

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 22.1 Introduction We spent the last two lectures proving that for certain problems, we can

More information

Lecture Notes: External Interval Tree. 1 External Interval Tree The Static Version

Lecture Notes: External Interval Tree. 1 External Interval Tree The Static Version Lecture Notes: External Interval Tree Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong taoyf@cse.cuhk.edu.hk This lecture discusses the stabbing problem. Let I be

More information

Optimization I : Brute force and Greedy strategy

Optimization I : Brute force and Greedy strategy Chapter 3 Optimization I : Brute force and Greedy strategy A generic definition of an optimization problem involves a set of constraints that defines a subset in some underlying space (like the Euclidean

More information

ABOUT THE LARGEST SUBTREE COMMON TO SEVERAL PHYLOGENETIC TREES Alain Guénoche 1, Henri Garreta 2 and Laurent Tichit 3

ABOUT THE LARGEST SUBTREE COMMON TO SEVERAL PHYLOGENETIC TREES Alain Guénoche 1, Henri Garreta 2 and Laurent Tichit 3 The XIII International Conference Applied Stochastic Models and Data Analysis (ASMDA-2009) June 30-July 3, 2009, Vilnius, LITHUANIA ISBN 978-9955-28-463-5 L. Sakalauskas, C. Skiadas and E. K. Zavadskas

More information

HORIZONTAL GENE TRANSFER DETECTION

HORIZONTAL GENE TRANSFER DETECTION HORIZONTAL GENE TRANSFER DETECTION Sequenzanalyse und Genomik (Modul 10-202-2207) Alejandro Nabor Lozada-Chávez Before start, the user must create a new folder or directory (WORKING DIRECTORY) for all

More information

11.9 Connectivity Connected Components. mcs 2015/5/18 1:43 page 419 #427

11.9 Connectivity Connected Components. mcs 2015/5/18 1:43 page 419 #427 mcs 2015/5/18 1:43 page 419 #427 11.9 Connectivity Definition 11.9.1. Two vertices are connected in a graph when there is a path that begins at one and ends at the other. By convention, every vertex is

More information

4 Fractional Dimension of Posets from Trees

4 Fractional Dimension of Posets from Trees 57 4 Fractional Dimension of Posets from Trees In this last chapter, we switch gears a little bit, and fractionalize the dimension of posets We start with a few simple definitions to develop the language

More information

A Vizing-like theorem for union vertex-distinguishing edge coloring

A Vizing-like theorem for union vertex-distinguishing edge coloring A Vizing-like theorem for union vertex-distinguishing edge coloring Nicolas Bousquet, Antoine Dailly, Éric Duchêne, Hamamache Kheddouci, Aline Parreau Abstract We introduce a variant of the vertex-distinguishing

More information

Distributed minimum spanning tree problem

Distributed minimum spanning tree problem Distributed minimum spanning tree problem Juho-Kustaa Kangas 24th November 2012 Abstract Given a connected weighted undirected graph, the minimum spanning tree problem asks for a spanning subtree with

More information

Throughout the chapter, we will assume that the reader is familiar with the basics of phylogenetic trees.

Throughout the chapter, we will assume that the reader is familiar with the basics of phylogenetic trees. Chapter 7 SUPERTREE ALGORITHMS FOR NESTED TAXA Philip Daniel and Charles Semple Abstract: Keywords: Most supertree algorithms combine collections of rooted phylogenetic trees with overlapping leaf sets

More information

We show that the composite function h, h(x) = g(f(x)) is a reduction h: A m C.

We show that the composite function h, h(x) = g(f(x)) is a reduction h: A m C. 219 Lemma J For all languages A, B, C the following hold i. A m A, (reflexive) ii. if A m B and B m C, then A m C, (transitive) iii. if A m B and B is Turing-recognizable, then so is A, and iv. if A m

More information

COMP 182: Algorithmic Thinking Prim and Dijkstra: Efficiency and Correctness

COMP 182: Algorithmic Thinking Prim and Dijkstra: Efficiency and Correctness Prim and Dijkstra: Efficiency and Correctness Luay Nakhleh 1 Prim s Algorithm In class we saw Prim s algorithm for computing a minimum spanning tree (MST) of a weighted, undirected graph g. The pseudo-code

More information

Name: Lirong TAN 1. (15 pts) (a) Define what is a shortest s-t path in a weighted, connected graph G.

Name: Lirong TAN 1. (15 pts) (a) Define what is a shortest s-t path in a weighted, connected graph G. 1. (15 pts) (a) Define what is a shortest s-t path in a weighted, connected graph G. A shortest s-t path is a path from vertex to vertex, whose sum of edge weights is minimized. (b) Give the pseudocode

More information

PCP and Hardness of Approximation

PCP and Hardness of Approximation PCP and Hardness of Approximation January 30, 2009 Our goal herein is to define and prove basic concepts regarding hardness of approximation. We will state but obviously not prove a PCP theorem as a starting

More information

AUTOMATED PLAUSIBILITY ANALYSIS OF LARGE PHYLOGENIES

AUTOMATED PLAUSIBILITY ANALYSIS OF LARGE PHYLOGENIES CHAPTER 1 AUTOMATED PLAUSIBILITY ANALYSIS OF LARGE PHYLOGENIES David Dao 1, Tomáš Flouri 2, Alexandros Stamatakis 1,2 1 KarlsruheInstituteofTechnology,InstituteforTheoreticalInformatics,Postfach 6980,

More information

FUTURE communication networks are expected to support

FUTURE communication networks are expected to support 1146 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL 13, NO 5, OCTOBER 2005 A Scalable Approach to the Partition of QoS Requirements in Unicast and Multicast Ariel Orda, Senior Member, IEEE, and Alexander Sprintson,

More information

arxiv: v2 [math.co] 13 Aug 2013

arxiv: v2 [math.co] 13 Aug 2013 Orthogonality and minimality in the homology of locally finite graphs Reinhard Diestel Julian Pott arxiv:1307.0728v2 [math.co] 13 Aug 2013 August 14, 2013 Abstract Given a finite set E, a subset D E (viewed

More information

Reconstructing Reticulate Evolution in Species Theory and Practice

Reconstructing Reticulate Evolution in Species Theory and Practice Reconstructing Reticulate Evolution in Species Theory and Practice Luay Nakhleh Department of Computer Science Rice University Houston, Texas 77005 nakhleh@cs.rice.edu Tandy Warnow Department of Computer

More information

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings On the Relationships between Zero Forcing Numbers and Certain Graph Coverings Fatemeh Alinaghipour Taklimi, Shaun Fallat 1,, Karen Meagher 2 Department of Mathematics and Statistics, University of Regina,

More information

Introduction to Analysis of Algorithms

Introduction to Analysis of Algorithms Introduction to Analysis of Algorithms Analysis of Algorithms To determine how efficient an algorithm is we compute the amount of time that the algorithm needs to solve a problem. Given two algorithms

More information

Math 485, Graph Theory: Homework #3

Math 485, Graph Theory: Homework #3 Math 485, Graph Theory: Homework #3 Stephen G Simpson Due Monday, October 26, 2009 The assignment consists of Exercises 2129, 2135, 2137, 2218, 238, 2310, 2313, 2314, 2315 in the West textbook, plus the

More information

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha Motivation Sequence homology to a known protein suggest function of newly sequenced protein Bioinformatics

More information

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST Alexander Chan 5075504 Biochemistry 218 Final Project An Analysis of Pairwise

More information

The strong chromatic number of a graph

The strong chromatic number of a graph The strong chromatic number of a graph Noga Alon Abstract It is shown that there is an absolute constant c with the following property: For any two graphs G 1 = (V, E 1 ) and G 2 = (V, E 2 ) on the same

More information

Fundamental Properties of Graphs

Fundamental Properties of Graphs Chapter three In many real-life situations we need to know how robust a graph that represents a certain network is, how edges or vertices can be removed without completely destroying the overall connectivity,

More information

THE LEAFAGE OF A CHORDAL GRAPH

THE LEAFAGE OF A CHORDAL GRAPH Discussiones Mathematicae Graph Theory 18 (1998 ) 23 48 THE LEAFAGE OF A CHORDAL GRAPH In-Jen Lin National Ocean University, Taipei, Taiwan Terry A. McKee 1 Wright State University, Dayton, OH 45435-0001,

More information

Connected Components of Underlying Graphs of Halving Lines

Connected Components of Underlying Graphs of Halving Lines arxiv:1304.5658v1 [math.co] 20 Apr 2013 Connected Components of Underlying Graphs of Halving Lines Tanya Khovanova MIT November 5, 2018 Abstract Dai Yang MIT In this paper we discuss the connected components

More information

Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony

Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony Molecular Evolution & Phylogenetics Complexity of the search space, distance matrix methods, maximum parsimony Basic Bioinformatics Workshop, ILRI Addis Ababa, 12 December 2017 Learning Objectives understand

More information

Faster parameterized algorithms for Minimum Fill-In

Faster parameterized algorithms for Minimum Fill-In Faster parameterized algorithms for Minimum Fill-In Hans L. Bodlaender Pinar Heggernes Yngve Villanger Technical Report UU-CS-2008-042 December 2008 Department of Information and Computing Sciences Utrecht

More information