Supplementary Online Material PASTA: ultra-large multiple sequence alignment

Size: px

Start display at page:

Download "Supplementary Online Material PASTA: ultra-large multiple sequence alignment"

Mae Griffin
5 years ago
Views:

1 Supplementary Online Material PASTA: ultra-large multiple sequence alignment Siavash Mirarab, Nam Nguyen, and Tandy Warnow University of Texas at Austin - Department of Computer Science {smirarab,bayzid,tandy}@cs.utexas.edu Abstract We introduce PASTA, a new method for multiple sequence alignment of datasets with up to 200,000 sequences in [3]. Here we provide supplementary information not provided in the main paper. We give exact commands used for running the experiments, we provide extra results that did not fit in the main paper, and we provide some supplementary discussion of the results. 1

2 Contents A Appendix - Method version numbers and commands 3 B Appendix - Proofs 6 C Appendix - Further Results 10 C.1 Performance on 1000-taxon datasets C.2 Performance on biological datasets C.3 Impact of Starting Tree C.4 Choice of Opal vs. Muscle for merging alignments C.5 Impact of alignment subset size C.6 Muscle running time D Appendix - Additional Discussion 18 D.1 Alignment Accuracy Measures D.2 Comparison between SATé and PASTA

3 A Appendix - Method version numbers and commands ClustalW version 2.1: clustalw2 -quicktree -align -infile=[input sequences] -outfile=[output alignment] -output=fasta Muscle version : muscle -in [input sequences] -out [output alignment] HMMBUILD version 3.0: hmmbuild symfrac 0.0 dna [output profile] [backbone alignment] HMMALIGN version 3.0: hmmalign dna [output profile] [query file] > [output alignment] Mafft-profile version 6.956b: mafft add [query file] [backbone alignment] > [output alignment] FastTree version SSE3: fasttree -nt -gtr [input fasta] > [output tree] SATé version 2.2.7: python run sate.py config.sate2.txt (The config.sate2.txt file is defined as follows:) [commandline] two_phase = False datatype = <dna or rna> untrusted = False multilocus = False input = <input_fasta> treefile = <starting_tree> aligned = False raxml_search_after = False auto = False [fasttree] model = -gtr args = options = -nosupport -fastest 3

4 [sate] time_limit = -1.0 iter_without_imp_limit = -1 time_without_imp_limit = -1.0 break_strategy = centroid start_tree_search_from_current = True blind_after_iter_without_imp = -1 max_mem_mb = 4024 blind_mode_is_final = True blind_after_time_without_imp = -1.0 max_subproblem_size = 200 merger = muscle num_cpus = 12 after_blind_time_without_imp_limit = -1.0 max_subproblem_frac = 0.0 blind_after_total_time = -1.0 after_blind_time_term_limit = -1.0 aligner = mafft iter_limit = 3 blind_after_total_iter = -1 tree_estimator = fasttree after_blind_iter_term_limit = -1 return_final_tree_and_alignment = False move_to_blind_on_worse_score = True after_blind_iter_without_imp_limit = -1 PASTA version 1.1.0: python run pasta.py config.txt (The config.txt file is defined as follows:) [commandline] two_phase = False datatype = <dna or rna> untrusted = False multilocus = False input = <input_fasta> treefile = <starting_tre> aligned = False raxml_search_after = False 4

5 [sate] time_limit = -1.0 iter_without_imp_limit = -1 time_without_imp_limit = -1.0 break_strategy = centroid start_tree_search_from_current = True blind_after_iter_without_imp = -1 max_mem_mb = 1024 blind_mode_is_final = True blind_after_time_without_imp = -1.0 max_subproblem_size = 200 merger = opal num_cpus = 12 after_blind_time_without_imp_limit = -1.0 max_subproblem_frac = 0.0 blind_after_total_time = -1.0 after_blind_time_term_limit = -1.0 aligner = mafft iter_limit = 3 blind_after_total_iter = -1 tree_estimator = fasttree after_blind_iter_term_limit = -1 return_final_tree_and_alignment = True move_to_blind_on_worse_score = True after_blind_iter_without_imp_limit = -1 mask_gappy_sites = <0.1% of dataset size] 5

6 B Appendix - Proofs Lemma 1: Let X, Y, and Z be disjoint sequence datasets, and alignments A and A be alignments on X Z and Y Z, respectively, that induce identical alignments on Z. Let K be the length of the longest sequence in X, Y, and Z, and L be the total number of sites in A and A. Then we can merge alignments A and A using transitivity in O(L + ( X + Y + Z )K). Proof: To represent an alignment, we use a data-structure with two elements: 1) the unaligned sequence and 2) a list of integers giving the position of each letter in the aligned sequences. Assume A has k columns, and A has k columns. We start by finding the sequences that belong to Z. For each shared sequence in Z, we find the columns that are non-gap in at least one shared sequence in A, and do the same thing for A (we call these shared columns). Calculating shared columns can be done in O(K Z ), because our data-structure for representing alignments has the list of column positions for each character of each sequence of Z. Let k s denote the number of shared columns. After computing shared columns we know that the final alignment will have k + k k s columns. We simultaneously iterate through the k columns in A and k columns in A, and map these numbers to position numbers in the output alignment. We start at the leftmost position of both alignments, and keep a position in A (denoted by p), a position in A (dented by p ), and a position in the output alignment (denoted by q). If both p and p correspond to a shared column, we map both to q and increment all three. Otherwise, w.l.o.g. assume p is not a shared column; we map p to q and increment only p and q. At the end of this process, we have a mapping from columns of both input alignments to the columns of the output alignment, and this procedure takes O(k + k k s ) = O(L) time. Finally, we build the output alignment by adding sequences from the original alignments, and by mapping their column indices using the mapping computed above. This step takes O(k( X + Y + Z )). Thus, the final running time is O(L + k( X + Y + Z )). Note that for any single gene alignment, L << k( X + Y + Z ), and therefore can be omitted from the analysis. Theorem 1: Let {A 1, A 2,..., A k } be a set of Type 1 sub-alignments, and let {X 1, X 2,..., X l } be a set of Type 2 sub-alignments defined by any spanning tree on this set. Then, (a) any two orderings of pairwise transitivity mergers, using the procedure described above, produce the same final multiple sequence alignment, and (b) if no A i has any false positives homologies (with respect to the true alignment) then the final multiple sequence alignment will also not have any false positive homologies. Proof: An alignment is entirely defined by the set of homologies it contains. 6

7 Each set of homologies in each column of an alignment creates an equivalence class, and thus the alignment constitutes an equivalence relation. Thus, the transitive closure of a set A of alignments, denoted tc(a), is defined to be the equivalence relation defined by the transitive closure of the equivalence relations defining each alignment in A. It is easy to see that this is equivalent to saying that letters x and y are in the same column within tc(a) if and only if there is a sequence of alignments A 1, A 2,..., A k with each A i A, and letters x 1, x 2,..., x k, such that x and x 1 are in the same column in A 1, x i and x i+1 are in the same column within A i+1 (i = 1, 2,..., k 1), and x k and y are in the same column in A k. However, by the constraints on the Type 2 sub-alignments (they cannot modify the Type 1 sub-alignments), the sequence of alignments that implies that x and y are in the same column cannot repeat any sub-alignment. Hence, when we apply the transitivity merge of two sub-alignments and infer a new homology between letters x and y, it is because there is some z such that x and z are in the same column in one sub-alignment, and z and y are in the same column in the other subalignment. This will be helpful to us in proving that the simple algorithm we have suffices to define the transitivity merge. We prove that the transitivity merge algorithm computes the transitive closure tc(a) and so is independent of the particular ordering on the edge contractions. Note that at any point in the transitive merge algorithm, some number of Type 3 sub-alignments may have been computed. We can therefore trace the creation of any Type 3 sub-alignment from the beginning set of Type 1 and Type 2 sub-alignments, and note the sequence and number of the pairwise transitivity mergers involved. We can also note the initial set of Type 2 sub-alignments involved in creating the Type 3 sub-alignment. Note that the number n of such additional operations is at least 1 for every Type 3 sub-alignment, but never more than k 1, where k is the number of Type 1 sub-alignments. We will prove by induction on n that every Type 3 sub-alignment that is created by n pairwise mergers applied to a set A of Type 2 sub-alignments is identical to tc(a). The base case is n = 1. Let A and A be Type 2 sub-alignments on X Z and Y Z, respectively, with X Y =. The definition of the pairwise merge of these two Type 2 sub-alignments matches the definition of tc({a, A }), and so the base case is proven. Now assume by induction that there is some n 1 such that any Type 3 sub-alignment formed by at most n pairwise transitivity mergers applied to a set A of Type 2 sub-alignments is identical to tc(a). Let A 1 be some Type 3 sub-alignment created by n + 1 pairwise mergers, and let A 2 and A 3 be the pair of sub-alignments (each is either Type 2 or Type 3) that are merged 7

8 together through the transitivity merger to form A 1. Hence, A 2 is on X Y, A 3 is on Y Z, and X Z =. Note that A 2 and A 3 were each created using fewer than n pairwise transitivity mergers, and so are identical to the transitive closure applied to a set of Type 2 sub-alignments. The result of merging these two sub-alignments therefore adds in remaining homologies that can only be inferred through transitivity. Hence, the result of applying the transitive closure algorithm is an alignment that has only those pairwise homologies that can be inferred through transitivity. As a result, if each Type 2 sub-alignment has no false positives, then neither can the result of the transitive closure algorithm. The only thing we now need to show is that all pairwise homologies that can be inferred through transitivity are in the final alignment. So suppose letter x in X and letter z in Z should be in the same column in the true transitive closure of all Type 2 sub-alignments in A. Then the path of alignments linking x to z must go directly from x to some y Y via some Type 2 sub-alignment and then from y directly to z via some Type 2 sub-alignment. However, the Type 2 sub-alignments in A are obtained using the edges of the spanning tree, and hence the path linking x to z must be obtained using sub-alignments A and A. Hence the pairwise transitivity merge of A and A would correctly detect that x and z are in the same equivalence class, and the result is proved. Theorem 2: Given m Type 1 alignments and m 1 Type 2 alignments, the algorithm to compute the transitivity merge of these alignments uses O(Km log m + ml) time, where K is the maximum length of any sequence (not counting gaps) in any Type 1 alignment and L is length of output alignment. Proof: Let our dataset consist of N sequences, with each sequence of length at most K, and for the sake of simplicity, assume that our decomposition produces m subsets, all with equal sizes (note that centroid decomposition produces balanced subsets, so this assumption is justified). As described before, in Step 5, we chose an edge e = (v, w) from the spanning tree, contract that edge, and perform two transitivity merges: one between S(v) and Label(e), and another between the result of the first merger and S(w). Based on Lemma 1, the first transitivity merge will have a running time of O(K( S(v) +2) +L), and the second merge will have a cost of O(K( S(v) + S(w) ) + L), and thus the cost of each edge contraction is O(K(2 S(v) + S(w) ) + L). Now, imagine the case where the spanning tree is a path. If we start merging from one end to the other end, we get the total running time of O(K( m)) = O(Km 2 ); however, we can improve on that. The important observation is that the spanning tree should be traversed such that transitivity mergers are between alignments with balanced number of sequences on each side. 8

9 The order in which edges are processed in PASTA is obtained by a recursive approach. Given the spanning tree, we divide it into two halves on the centroid edge, and thus obtain two roughly equal size subtrees. We process each half recursively using the same strategy, and thus get two single leaves at the endpoints of the centroid edge. Each leaf would represent the merger of all alignments in each half, and by construction they would have roughly equal size. We then contract the centroid edge, merge the two sides, and obtain the full alignment. If each half has roughly x sequences, the cost of the final edge contraction is O(K(2x + x) + L) = O(3Kx + L) (as shown before). If f(x) denotes the cost of applying our transitivity merger on a spanning tree with x nodes, we have f(2x) = 2f(x) + 3kx + L which has a O(x log(x) + xl) solution. Therefore, our particular order of traversing the spanning tree results in a total cost of O(Km log(m) + ml). 9

10 C Appendix - Further Results This section includes: Results on 1000-taxon simulated datasets Tree error using reference trees based on different bootstrap support thresholds for the Gutell datasets Comparison of PASTA results based on different starting trees Comparison of PASTA results based on using Opal or Muscle to produce Type 2 sub-alignments Comparison of PASTA results for different alignment subset sizes Running time of Muscle as a function of dataset size 10

11 C.1 Performance on 1000-taxon datasets 40% Missing Branch Rate (FN) 30% 20% 10% 1000M3 1000S3 1000M2 1000L2 1000S2 1000L3 1000S1 1000L1 1000M1 True Alignment Mafft linsi OPAL MUSCLE SATé II PASTA Figure 1: Tree errors on 1000-taxon datasets. We report the missing branch rate of trees estimated by FastTree-2 on the true alignment, and on alignments estimated using PASTA, SATé-II, and other alignment methods, on challenging 1000-taxon datasets from [1]. In Figure 1, we show FN error rates of maximum likelihood trees estimated using FastTree-2 on different alignments, using datasets studied in [1, 2]. The model conditions are labelled by the gap length distribution (M for medium length, S for short, and L for long), and increase in difficulty (higher rates of indels and substitutions) from left to right. Note that the most accurate results are obtained by ML on the true alignment, but that PASTA and SATé-II have almost indistinguishable accuracy on these data. 11

12 C.2 Performance on biological datasets S.3 16S.B.ALL 16S.T Tree Error (FN rate) Bootstrap Support Threshold Muscle SATe2 PASTA Figure 2: Tree error (FN) rates on biological datasets as a function of bootstrap threshold chosen to define the reference tree. Biological datasets present a challenge for benchmarking because the true phylogeny cannot be known with certainty. We used the reference alignment (estimated based on secondary structures) produced by the Gutell laboratory, and then estimated a maximum likelihood tree on each dataset using RAxML with bootstrapping. By contracting all branches with low support we obtain a reference tree. In the text we reported results based on a bootstrap support threshold of 75%; here we present results based on other thresholds. Note that for high thresholds, more branches will be collapsed, whereas for low thresholds, fewer branches will be collapsed. Figure 2 reports results only for Muscle, SATé-II, and PASTA, the three methods with the best performance on these biological datasets, but exploring the impact of changing the threshold for contracting branches. Note that the difference in performance is small in most cases, and that relative performance generally does not change much as a function of threshold. Typically PASTA has slightly better tree accuracy than both SATé and Muscle, but there are a few cases where the relative performance changes. However, there are a few thresholds and datasets where the relative ordering between SATé, PASTA, and Muscle changes, so that Muscle is tied for best, or PASTA is less accurate than SATé. With respect to the remaining methods, the PASTA starting tree had the least accurate results, with especially poor accuracy on the 16S.T dataset, where the starting alignment did not include one of the taxa, and so that taxon was added randomly 12

13 into the tree computed without its sequence. MAFFT-profile and Clustalquicktree had generally good performance, but clearly less accurate than Muscle, PASTA, and SATé. 13

14 C.3 Impact of Starting Tree We also looked at the impact of using different starting trees. We tested starting trees obtained by using FastTree-2 on three different alignments: the Mafft-Profile alignment, the Mafft-PartTree alignment, and our new method for computing the starting alignment (HMMER-based). Table 1 shows the PASTA alignment and tree accuracy starting from these starting trees, after running PASTA for three iterations. There does not seem to be any impact of the starting tree on the TC score or the final tree accuracy. However, alignment SP-score and Modeler score decrease with the increase in error for the starting tree, but the differences are very small. Thus, PASTA is highly robust to starting tree, when run for three iterations. Alignment Accuracy Tree Accuracy SP-score Modeler score TC 1-FN Starting Tree (FN = 12.4% ) 88.5% 89.1% % Mafft-Profile (FN = 19.1%) 87.2% 87.8% % Mafft-PartTree (FN = 28.7%) 83.2% 84.6% % Table 1: Effects of the starting tree in PASTA algorithm, based on three iterations of PASTA on one replicate of the 10k RNASim dataset. We show PASTA using three different starting trees: our default HMMER-based technique, the tree computed using FastTree-2 on the Mafft-Profile alignment, and the tree based on FastTree-2 on Mafft-PartTree alignment. Boldface indicates the best performance on this data. 14

15 C.4 Choice of Opal vs. Muscle for merging alignments We also explored the difference between PASTA using Opal (the default merger) or Muscle for merging Type 1 alignments into Type 2 alignments; see Table 2. This comparison showed that OPAL can result in better final alignments and trees compared to Muscle. For example, on the 10,000 RNASim dataset, PASTA with OPAL and with MUSCLE had tree errors of 10.7% and 11.2%, a slight improvement, but that alignment accuracy changed substantially, especially when considering the average of the SP and modeler scores. Parameters Tree Error SP-Score Modeler Score TC score Running Time (sec) Opal Type 2 merger 10.7% 88.5% 89.1% ,478 Muscle Type 2 merger 11.2% 67.2% 80.6% ,884 Table 2: Impact of using Muscle versus Opal as the alignment merger technique. We report results on one replicate of the 10K RNASim dataset, using three iteration of PASTA using all default settings for other algorithmic parameters. We report the missing branch rate for the tree error, and two accuracy measures for alignments: the Total Column (TC) score, and the average of the SP-score and modeler score. Boldface indicates the best performance on this data. 15

16 C.5 Impact of alignment subset size We also compared results for PASTA where we changed the alignment subset size from 200 (default) to smaller values; see Table 3. We show results for one replicate of the 10K RNASim dataset, and also for the 16S.T dataset. Note that the tree error changes only slightly, with slight differences between the two datasets. As the alignment subset size is reduced from 200 to 100, there is a decrease in error for the 16S.T dataset, but the tree error for the 10K dataset first decreases and then returns to the original value. The sumof-pairs alignment accuracy measures do not seem to be consistently affected by the change in alignment subset size between the two datasets either, (initial decrease in accuracy then increase for 10K, and consistent decrease in accuracy for 16S.T), but here also the changes are small. On the other hand, there is a consistent change in the TC score, where decreasing the alignment subset size improves the TC score for both datasets, and substantially for the 10K dataset. Finally, the running time went down for both datasets as a result of reducing the subset size from 200 to 100. Thus, although changes in alignment subset size do seem to impact the alignment and tree accuracy, the differences are generally small and may not be statistically significant. Furthermore, the impact may depend on the dataset. However, the impact on the running time is substantial, so that reducing from datasets of size 200 to datasets of size 50 reduced the running time by 55% for the 10K dataset and by 37% for the 16S.T dataset. Dataset Parameters Tree Alignment TC Running Time Error Accuracy score (secs) 10K Subset size % ,478 10K Subset size % ,235 10K Subset size % ,015 16S.T Subset size % ,120 16S.T Subset size % ,086 16S.T Subset size % ,780 Table 3: Impact of alignment subset size. We report results on one replicate of the 10K RNASim dataset and also on the 16S.T dataset, using three iteration of PASTA in which we explore the impact of changing the subset size from 200 (the default) to 100 and 50; all other algorithmic parameters use default values. We report the missing branch rate for the tree error, two accuracy measures for alignments (the Total Column (TC) score, and the average of the SP-score and modeler score), and the running time in seconds. Boldface indicates the best performance on this data. 16

17 C.6 Muscle running time Figure 3 shows that the running time of MUSCLE mergers scale linearly with the alignment length, but grows more rapidly with the number of sequences Running Time (seconds) Running Time (minutes) Alignment Length Number of Sequences (a) (a) (b) (b) Figure 3: Running time for Muscle mergers as a function of (a) the alignment length with fixed number of sequences (10,000) and (b) the number of sequences sequences with fixed sequence length (2000bp). 17

18 D Appendix - Additional Discussion D.1 Alignment Accuracy Measures One of the interesting observations in this study is that the average of SPscore and modeler score is not always predictive of tree accuracy. For example, on the Gutell datasets, MAFFT-profile has better sum-of-pairs alignment accuracy than Muscle but produces less accurate trees. Similarly, the starting alignments used by PASTA (computed using the HMM-based technique) had close to the best sum-of-pairs alignment accuracy but produced less accurate trees than PASTA, and SATé had much lower alignment accuracy scores than MAFFT-profile on the Gutell and the 10K RNASim datasets, but produced more accurate trees. This is generally not what has been expected, and perhaps runs counter to earlier studies that have explored the impact of alignment estimation on tree estimation. However, this inconsistency between relative performance in terms of tree accuracy and standard alignment accuracy criteria has been observed before on large phylogenetic datasets [2], where Opal was observed to produce alignments with very good sum-of-pairs scores but where trees on Opal alignments were not particularly accurate. Thus, the disconnect between standard alignment criteria and phylogenetic accuracy may be a general issue for large datasets. The explanation may be that not all homologies have the same evolutionary signal, and that failing to recover some homologies may not have much impact on tree estimation, while other homologies may be essential for phylogenetic accuracy. Furthermore, alignment methods that aim to recover the conserved regions may be able to have high alignment accuracy scores but fail to produce good trees - because conserved regions may not be as useful for phylogeny estimation as regions that change. Thus, the sites and even specific homologies that are most informative of the phylogenetic branching process may not be the homologies that many alignment methods are trained to recover. More generally, then, this disconnect suggests a real challenge in using alignment metrics to predict the utility of an alignment, especially if the purpose of the alignment is phylogeny estimation. D.2 Comparison between SATé and PASTA The technique to re-align sequences on a guide tree used within PASTA was designed to address the running time challenges in using Opal or Muscle to repeatedly merge alignments until the entire alignment is created. As we 18

19 saw, the last pairwise merge is the most expensive, and becomes prohibitively expensive on large datasets. The new design, which computes overlapping compatible Type 2 sub-alignments and then merges these using transitivity, addresses that computational challenge. However, as we saw, it also provides accuracy improvements on large datasets. Here we make a direct comparison between the two methods, to clarify where PASTA has an advantage taxon datasets from [1]: PASTA and SATé have very similar tree accuracy, and improve relative to the other methods tested. The running time differences are small on these datasets, because they are not that big. Gutell datasets, 16S.3, 16S.T, and 16S.B.ALL. These datasets have between 6K and 28K sequences. PASTA is slightly more accurate than SATé with respect to tree accuracy, but has much higher alignment accuracy. In addition, PASTA uses less time. The running time advantage depends on the number of sequences, however; for example, on 16S.3, the smallest of these three datasets, PASTA uses about 80% of the time used by SATé, but on 16S.B.ALL, the largest of these datasets, PASTA finishes in 6 hours while SATé finishes in 24 hours. RNASim datasts, ranging from 10K to 200K sequences. When restricted to the 24 hour limit, SATé can only complete analyses for the smallest of these datasets. The comparison between SATé and PASTA on this model condition (10 replicates) show that PASTA is much faster, finishing in less than 4 hours while SATé uses more than 6 hours. PASTA and SATé are close in terms of tree error, with a small advantage to PASTA, but PASTA has much better alignment accuracy. Running SATé on a machine where runs are not limited to 24 hours produces a tree after roughly 70 hours of running time per iteration for the RNASim dataset with 50K sequences. Thus SATé is much slower than PASTA, which finishes each iteration in about 5 hours on this dataset. Moreover, the final tree and alignment of SATé are not nearly as accurate as PASTA; the tree error is 50% higher (12.6% for SATé versus 8.2% for PASTA), and the alignment is much less accurate (e.g. SATé recovers only 30 columns entirely correctly, but PASTA recovers 311 columns). In general, therefore, PASTA and SATé have very similar accuracy on small datasets, but PASTA has a large advantage over SATé on large datasets with respect to alignment accuracy measures, and a smaller advantage with 19

20 respect to tree accuracy. However, PASTA is much faster than SATé, and the running time advantage increases with the size of the dataset. Thus, PASTA dominates SATé for large-scale phylogeny and alignment estimation. Acknowledgments This research was supported in part by NSF grant DBI to TW, by an International Predoctoral Fellowship to SM from the HHMI, and by a subgrant from the University of Alberta to TW, made possible through a donation from Musea Ventures, which is held by Professor Gane Ka-Shu Wong. The authors wish to thank the anonymous referees for their helpful comments. References [1] K. Liu, S. Raghavan, S. Nelesen, C. R. Linder, and T. Warnow. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science, 324(5934): , [2] K. Liu, T.J. Warnow, M.T. Holder, S. Nelesen, J. Yu, A. Stamatakis, and C.R. Linder. SATé-II: Very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Syst Biol, 61(1):90 106, [3] S. Mirarab, N. Nguyen, and T. Warnow. PASTA: ultra-large multiple sequence alignment. In Proc. ISMB 2014,

PASTA: Ultra-Large Multiple Sequence Alignment

PASTA: Ultra-Large Multiple Sequence Alignment Siavash Mirarab, Nam Nguyen, and Tandy Warnow University of Texas at Austin - Department of Computer Science 2317 Speedway, Stop D9500 Austin, TX 78712 {smirarab,namphuon,tandy}@cs.utexas.edu