USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT

IADIS International Conference Applied Computing 2006

Divya R. Singh
Software Engineer
Microsoft Corporation, Redmond, WA 98052, USA

Abdullah N. Arslan
Assistant Professor
Computer Science Department, University of Vermont, Burlington, VT 05405, USA

Xindong Wu
Professor, Chair
Computer Science Department, University of Vermont, Burlington, VT 05405, USA

ABSTRACT

An important problem in computational biology is aligning a given query sequence against the sequences in a database to find database sequences that are similar, locally or globally, to the query. Many heuristic algorithms for this problem are based on the idea of locating a fixed-length matching pair of substrings (called a seed) to start an alignment, and then extending this alignment using dynamic programming. We generalize this idea and take it one step further in a tool we developed, the Sequence Comparison Tool (SCT). SCT preprocesses the database to create a special generalized suffix tree from the sequences in the database. This tree extends the definition of a generalized suffix tree by storing additional information at the nodes: the length and the frequency (number of occurrences) of the corresponding substrings (patterns). A pattern is regarded as significant if it is sufficiently long and appears many times in the database. A significant pattern shared by two sequences indicates that the sequences are locally similar. SCT ranks the sequences by the number of significant patterns they share with the query sequence, and reduces the database by selecting a given number of top-ranked sequences. It then invokes an ordinary local alignment algorithm on this reduced database. We conducted experiments on real biological sequences and compared SCT's performance with that of BLAST, a popular alignment tool. In these tests we used the 6-fold cross validation technique of data mining.
The tests show that SCT effectively reduces the database and obtains results very similar to those of BLAST in approximately half the time taken by BLAST.

KEYWORDS

Sequence alignment, suffix tree, 6-fold cross-validation.

1. INTRODUCTION

Sequence alignment is an important problem for identifying and presenting the biologically important, yet hidden or widely dispersed, common characteristics of a set of sequences. These commonalities can reveal evolutionary histories, critical conserved motifs, or molecular structures that give clues about common biological functions. Such commonalities are also used to characterize families or super-families of proteins. These characterizations are then used in database searches to identify other potential members of a family. The sequence similarity query can be formally defined as the following problem: given a sequence Q and a database of sequences, determine one or more sequences from the database that are the closest to Q. Our objective is to improve the computation time of the search compared to existing methods while preserving the accuracy.

Local pairwise sequence alignment seeks similar segments in a given pair of sequences. A classical algorithm for this problem is the Smith-Waterman algorithm [9], which uses dynamic programming. This quadratic-time algorithm is too slow to be practical for long sequences. For the pairwise sequence alignment problem there are several heuristic algorithms such as FASTA [6] and BLAST [1, 2]. BLAST is approximately times faster than the Smith-Waterman algorithm. One important feature of BLAST is its ability to compare a query with a database of sequences. Given the rapid growth of database sizes, this problem demands ever-growing computational resources and remains a computational challenge.

The main idea in these heuristic algorithms is to first locate a fixed-length matching pair (called a seed) to start an alignment, and then use dynamic programming to extend the alignment in both directions. In this paper, we generalize this idea and take it one step further. Our research has been inspired by the use of frequency and by the application of a very efficient data structure, the suffix tree [4, 5]. The suffix tree is a powerful tool for determining common patterns in sequences. It represents the internal structure of a string in a comprehensive manner. A suffix tree can be constructed in linear time [4, 5, 7, 8, 10].

A suffix tree T for an m-character-long string S is a rooted directed tree such that for any leaf i, the concatenation of the edge labels on the path from the root to leaf i spells out exactly the suffix S[i..m] of S that starts at position i. Figure 2 Part (a) shows an example, the suffix tree for S_1 = TATAA. A generalized suffix tree (GST) [5] is a suffix tree that combines the suffixes of a set of strings {S_1, S_2, ..., S_n}. In a GST a node may be shared by suffixes of more than one string. This is indicated by including string identifiers in the leaves.
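The leaf property in this definition can be illustrated with a naive, uncompressed suffix trie. This is only a quadratic-time sketch for small inputs, not the linear-time constructions cited above; the function names `build_suffix_trie` and `spelled_suffixes`, and the use of a `"$"` key to hold string identifiers, are assumptions made for illustration.

```python
def build_suffix_trie(strings):
    """Insert every suffix of every string into a nested-dict trie.

    Each node is a dict mapping a character to a child node. The special
    key "$" (assumed not to occur in the sequences) holds the set of
    (string_id, start_position) pairs of the suffixes ending at that node,
    mirroring the string identifiers kept in the leaves of a GST.
    """
    root = {}
    for string_id, s in enumerate(strings, start=1):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node.setdefault(ch, {})
            # Positions are 1-based, as in S[i..m].
            node.setdefault("$", set()).add((string_id, start + 1))
    return root


def spelled_suffixes(node, prefix=""):
    """Yield (suffix, owners) by concatenating labels from the root."""
    for key, child in node.items():
        if key == "$":
            yield prefix, child
        else:
            yield from spelled_suffixes(child, prefix + key)


# The two database strings used in the paper's Figure 2.
trie = build_suffix_trie(["TATAA", "AACGA"])
suffixes = dict(spelled_suffixes(trie))
```

Here `suffixes["A"]` records both strings as owners, which is the situation in which a GST node is shared by suffixes of more than one string.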
In the GST in Figure 2 Part (b), each leaf of the tree represents a suffix of one of the two strings or of both. Delcher et al. [3] showed that suffix trees are efficient in whole-genome pairwise alignments. Their tool MUMmer finds maximal unique matches (MUMs) using a suffix tree and combines these matches into larger alignments by using indels (inserts/deletes).

In this paper we concentrate not on pairwise sequence alignment but on the sequence similarity query in a database. We aim to answer sequence similarity queries faster by first filtering the database to select the most promising sequences. The query sequence can then be compared with these sequences using a suitable sequence alignment tool such as BLAST or MUMmer.

We preprocess the database to create a generalized suffix tree from the sequences in the database. We extend the generalized suffix tree by including additional information at the nodes: at each node we compute and store the length and the frequency (the number of occurrences in the database) of the corresponding pattern. For each substring (pattern) p in the database there exists a node i on this tree such that we obtain p when we spell out (concatenate) the labels of the edges on the path from the root to node i. We consider several functions that, at each node, take into account the frequency and the length of the corresponding pattern and assign a score to the node. Our hypothesis is that a high frequency indicates that the pattern points to a conserved region, and that the length measures the information value of the pattern. A pattern is regarded as significant if it is sufficiently long and appears many times in the database. We determine significant patterns by checking whether their scores are larger than or equal to a given threshold. We expect that a significant pattern is contained in biologically-related sequences.
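As a rough illustration of how length and frequency can be combined into a significance test, the sketch below counts every substring occurrence directly instead of reading the counts from tree nodes. It assumes one candidate score, frequency times length divided by the database size, and it assumes that the database size is the total number of characters; the function names and the threshold values are hypothetical.

```python
from collections import Counter


def pattern_frequencies(database, min_length=1):
    """Count the occurrences of every substring of at least min_length."""
    freq = Counter()
    for seq in database:
        for start in range(len(seq)):
            for end in range(start + min_length, len(seq) + 1):
                freq[seq[start:end]] += 1
    return freq


def significant_patterns(database, length_threshold, score_threshold):
    """Return {pattern: score} for patterns passing both thresholds.

    Assumption: the database size is the total number of characters.
    """
    db_size = sum(len(seq) for seq in database)
    scores = {}
    for pattern, f in pattern_frequencies(database, length_threshold).items():
        score = f * len(pattern) / db_size  # frequency * length / size
        if score >= score_threshold:
            scores[pattern] = score
    return scores


database = ["TATAA", "AACGA"]
significant = significant_patterns(database, length_threshold=2, score_threshold=0.4)
```

With these toy thresholds, "TA" (two occurrences) is significant while "AT" (one occurrence) is not. Enumerating all substrings is quadratic per sequence, which is the cost that storing the counts in a suffix tree is meant to avoid.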
For each query we temporarily add the query sequence to the tree to determine the sequences that share high-scoring significant patterns with the query. We identify the sequences in the database that share these patterns. We then rank the sequences by the number of significant patterns they share with the query sequence. Next, we reduce the database by selecting only a given number of top-ranked sequences. We expect that these are the closest sequences to the query. Finally, we apply a local alignment algorithm to the reduced database.

We implemented this method in a new tool, SCT (Sequence Comparison Tool). We conducted experiments on real biological sequences. In these experiments we used the 6-fold cross validation technique of data mining. We compared SCT's performance with that of BLAST, a popular alignment tool. The tests show that our method works faster without losing accuracy.

We organize this paper as follows: We describe our method for answering sequence similarity queries in Section 2. We summarize the experimental results in Section 3. We provide concluding remarks and pointers for future work in Section 4.

2. DESIGN AND METHODOLOGY

We generate a generalized suffix tree (GST) from a given database of sequences, and then process this tree to add information that we use later to select sequences that are potentially similar to a given query sequence.
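The selection outlined in the introduction (find the database sequences that share significant patterns with the query, rank them by the number of shared patterns, and keep the top few) can be sketched as follows. Plain substring tests stand in for the suffix tree lookups here, and the name `rank_sequences`, its parameters, and the toy pattern list are hypothetical.

```python
def rank_sequences(database, query, patterns, top_n):
    """Return indices of the top_n sequences sharing most patterns with query.

    Only patterns that occur in the query are considered. A sequence's
    weight is the number of those patterns it contains; sequences with
    weight 0 are dropped. Ties are broken by database order.
    """
    shared = [p for p in patterns if p in query]
    weighted = []
    for index, seq in enumerate(database):
        weight = sum(1 for p in shared if p in seq)
        if weight > 0:
            weighted.append((weight, index))
    weighted.sort(key=lambda item: (-item[0], item[1]))
    return [index for _, index in weighted[:top_n]]


database = ["TATAA", "AACGA", "GGGGG"]
query = "AATGT"
patterns = ["AA", "TA", "AT", "CG"]  # stand-in for the significant patterns
reduced = rank_sequences(database, query, patterns, top_n=2)
```

The sequences at the returned indices form the reduced database that is then handed to the alignment tool.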

We invoke BLAST to perform pairwise sequence alignment between the query sequence and the chosen sequences in the database. Central to our approach is the processing of the tree and the selection of the sequences. We designed these steps based on the following observations:

- The substrings of a string S in the GST can be used as patterns to identify similarity between homologous sequences. This is because similar sequences contain conserved regions (or common substrings).
- We can quantify the significance of a pattern by computing a score. For a given pattern, a scoring function must take into account the length of the pattern, the number of occurrences (frequency) of the pattern in the database, and possibly the size of the database being searched, to normalize the frequency over the database size. Ideally, a pattern with a high score carries, with high likelihood, an important feature (information) that belongs to a family of sequences. For a pattern:
  - the length is an important parameter in measuring the biological information it carries;
  - the frequency is important because a pattern conserved in many biologically-related sequences has a high frequency. In an ideal case, the higher the frequency, the higher the chance that it points to a family of similar sequences.
- We classify a given pattern p as significant if it satisfies the following two constraints:
  - the length of p ≥ a given length-threshold: a significant pattern must be sufficiently long to carry important biological information, and
  - the score of p ≥ a given score-threshold: a significant pattern must have a sufficiently high score.
- Sequences in the database that are similar to the query sequence are expected to share many significant patterns with it.

We implemented our method in a tool, SCT. The flowchart in Figure 1 shows the steps of SCT:

Figure 1. Flowchart of SCT

1. Read all the sequences in the database into memory.
2. Construct a generalized suffix tree from the input sequences {S_1, S_2, ..., S_n} in the database. We use Ukkonen's algorithm [8].
3. While constructing the suffix tree, at each node i for the corresponding pattern p_i, store the length l(p_i) and the frequency f(p_i) (i.e. the number of occurrences of p_i in the database): we increment the frequency by one at each node visited during the addition of a new suffix.
4. For a given node i we use a function W to assign a score to the pattern p_i corresponding to node i. We define

   W(p_i) = f(p_i) * l(p_i) / DB

   where DB denotes the size of the database. We compute W(p_i) when we construct the GST. We use W(p_i) to measure the significance of the pattern p_i. As we explained earlier, our hypothesis is that biologically significant patterns are shared by many substrings of sequences, and that the information value increases with the length of the pattern. Therefore we incorporate both factors in W. We experimented with several different functions. We include DB to normalize the frequency over databases of different sizes. This is important only if we want to determine and use a score-threshold that is independent of the underlying database.
5. Prompt the user for the number of sequences n to select from the database.
6. Read in the query sequence.
7. Temporarily add the suffixes of the query sequence Q to the generalized suffix tree. This enables us to determine which suffixes of the query are shared by the sequences in the database.

The query sequence is added to the tree only temporarily so that SCT is not affected in future sequence searches. Initially the colors of all the nodes in the GST are 0. When we add the suffixes of the query sequence Q, we change the color of the visited nodes in the GST to 1. This expedites the search for common patterns within the GST because we only examine those paths in the tree that spell out substrings of the query sequence. Consider a query sequence Q = S_3 = AATGT and two sequences in the database, S_1 = TATAA and S_2 = AACGA. Figure 2 Part (c) shows the coloring of the nodes after temporarily adding the query sequence to the GST.

Figure 2. For S_1 = TATAA and S_2 = AACGA: (a) the suffix tree of S_1, (b) the generalized suffix tree of S_1 and S_2, and (c) the suffix tree after the query string Q = S_3 = AATGT is added to the GST of S_1 and S_2 in part (b). The vertices with color 1 are shown by filled circles. The color of the other vertices is 0.

8. Post-process the generated tree to extract the significant patterns shared by the query sequence: in this step we perform the actual traversal of the generalized suffix tree. Starting at the root, we visit all nodes whose color is 1 in a depth-first manner. If the current node has no child whose color is 1, we backtrack to its parent node. During this traversal we collect all significant patterns into a set G. The sequences may have other common patterns that are not significant. An optimal alignment between these two sequences in an ideal case contains all significant patterns.
9. Delete the query sequence from the GST once the significant patterns have been collected.
10. Pick the top 10 significant patterns from G and store them in a set P.
11. Extract into a set R the sequences that contain the significant patterns in P.
12. Do a reverse check to compute a weight for each sequence in R.
We define the weight of a sequence as the number of patterns from P that it contains. Our hypothesis is that the higher this number (weight), the greater the similarity of the corresponding sequence to the query.
13. Rank the sequences according to these weights.
14. Pick the top n (a user-specified number) sequences from R and write them to a new database.
15. Apply BLAST to the query and the new database of sequences.
16. Output the results of BLAST.

We use the BLAST2 implementation we obtained from NCBI's website. BLAST2 allows for the insertion of gaps in alignments. For very large databases the sequences in the database can be loaded into memory in parts, the suffix tree can be created in parts, and all significant patterns can be collected in G by repeating Steps 7-8 for each part.

3. EXPERIMENTS AND RESULTS

We used real DNA sequence databases from three species in our tests. Table 1 lists these data sets: (1) Escherichia coli; (2) Bacillus anthracis; (3) Plasmodium falciparum. The source for these databases is the website of the National Center for Biotechnology Information (NCBI), established in

Table 1. Data sets we obtained from NCBI's website and used in our experiments

Database          Sequence-Length   Number of Sequences
Ecoli.nt          800               1,000
Bacteria.dna      700               1,000
Plasmodium.dna    127               1,

In designing our tests, we used the 6-fold cross validation approach, which is based on the idea "train on 5 folds, test on 1 fold" as illustrated in Figure 3. The argument in support of combined approaches is that with a limited amount of training data, an individual classifier may not represent the true hypotheses, whereas a combined classifier may produce a good approximation of the true hypotheses. This data mining technique compares two learned models. For the implementation we randomly divided the data into two disjoint sets: a training set and a validation set. The two data sets consist of DNA sequences from NCBI's website. There are three steps in implementing the 6-fold cross validation: (a) use 5 folds for training and 1 fold for testing; (b) repeat until every fold has been used once for testing; (c) calculate the average of the results from the 6 runs. The goal was to verify that the results of these runs are consistent. The data set was therefore split, and iteratively 5/6 of the data were used for training and the remaining 1/6 for testing. The average of the six runs was computed for analysis. This established the consistency of the timing for database search using SCT.

Figure 3. 6-fold cross validation approach

Table 2 summarizes the results that we obtained from the experiments with SCT. In the table we also include, in bold, the results obtained by applying BLAST alone to the same queries. For each query the first row gives the results of SCT, and the second row gives those obtained by applying BLAST alone. Experiments were performed on each of the three data sets individually. Six searches were executed on each database, using different queries, to implement the 6-fold cross-validation. The idea is to test whether the results are consistent, in terms of computation time, for all the queries on a particular database.
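The three steps (a) to (c) can be sketched as below. This is a generic illustration of k-fold splitting with k = 6, not the authors' test harness; the names `six_fold_splits` and `average`, the fixed random seed, and the integer stand-ins for sequences are assumptions.

```python
import random


def six_fold_splits(items, k=6, seed=0):
    """Yield (train, test) pairs so that every fold is the test fold once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # random but reproducible split
    folds = [shuffled[i::k] for i in range(k)]
    for test_index in range(k):
        test = folds[test_index]
        train = [x for i, fold in enumerate(folds) if i != test_index for x in fold]
        yield train, test


def average(values):
    """Mean of the per-run results, as in step (c)."""
    values = list(values)
    return sum(values) / len(values)


# Twelve placeholder items standing in for sequences.
splits = list(six_fold_splits(range(12)))
mean_test_size = average(len(test) for _, test in splits)
```

In a timing experiment, the per-run measurement (for example, the query time in milliseconds) would be collected for each of the six splits and passed to `average`.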
The results show that the outcomes of our tests are consistent, that our tool performs sequence comparison with good accuracy, and that it achieves a practical time improvement over BLAST.

Table 2. Average query times (ms): for Ecoli.nt 971 (1,699), for Bacteria_dna 918 (1,714), for Plasmodium_dna 1,009 (1,828). The results obtained by BLAST alone are shown within parentheses here, and in boldface in the table. Numbers in the column named Chosen Sequences are the indices of the chosen sequences in the database (they do not include the query).

Data Set Query Chosen Sequences Time (ms) Data Set Query Chosen Sequences Time (ms) Data Set Query Chosen Sequences Time (ms)
Ecoli 1 2, 14, Bacteria. 1 25, 6, Pasmod. 1 24, 2, 40 1,078 2, 14, 872 1,693 25, 6, 706 1,713 24, 2, 120 1, , 20, , 30, , 12, , 20, 205 1,726 6, 30, 512 1,726 8, 12, 99 1, , , , , 28 1,596 32, 11 1,711 5, 106 1, , 41, , 51, , 84, , 41, 519 1,742 8, 51, 262 1,712 63, 84, 179 1, , , , 72, , 112 1,749 2, 95 1,712 11, 72, 155 1, , 95, , , 25, 28 1,006 31, 95, 154 1,688 5, 78 1,713 14, 25, 310 1,849

We conducted a set of controlled experiments to test the effect of each parameter alone. We compared the output of SCT with the results obtained when BLAST is used directly on the original database.

Scoring patterns: The scoring function W changes the significance of patterns. We tested with W(p_i) = f(p_i) * l(p_i), and then with W(p_i) = f(p_i) * l(p_i) for a pattern p_i corresponding to node i. The former performed better in picking the closest sequences. This suggests that in measuring the significance of a pattern the length carries more weight than the frequency.

Length-threshold: We set the threshold to 3, 4, and 5 separately. Even though the closest sequence remained the same in each case, the other similar sequences were affected. We observed that the accuracy was the best when the threshold was set to 3.

Number of patterns: We experimented with setting the number of patterns to be used to 7, 8, and 10 separately.
In each case, the closest sequence obtained was the same. Other sequences obtained were affected. The accuracy was the highest when the number of patterns was set to

Score-threshold: We did a set of tests with different values for the score-threshold. The best results were obtained when we set it to

It is easy to think of a worst-case and a best-case scenario for SCT. If a large number of substrings (patterns) are common to almost all sequences in the database, then SCT will not be able to use these patterns to distinguish the sequences close to the query sequence. In the best case the database contains a family of sequences that share very long patterns with Q, or a sequence P that is almost identical to the query sequence Q; SCT will then be able to identify the common patterns and return the closest sequence(s) in answer to the query very quickly.

4. CONCLUDING REMARKS AND FUTURE WORK

In this paper we have presented SCT (Sequence Comparison Tool) for answering alignment-based similarity queries against a database. SCT preprocesses the database to create a generalized suffix tree that we extend by adding frequency and length information for the patterns. The tree is resident in memory and is used for answering future queries. The tree can also be created, fetched into memory, and used in parts when answering a similarity query, by repeating certain steps for each part. SCT distinguishes patterns by computing significance scores. A pattern is regarded as significant if it is long enough and appears frequently enough in the database. The scoring function takes into account a pattern's length and frequency and, together with the given threshold values, determines whether a pattern is significant. Using these, for a given query sequence SCT reduces the database to only a few sequences that share the most significant patterns with the query. This reduction in database size speeds up the local alignment of the query sequence against the database. Experimental results have shown that SCT provides a speed-up over BLAST, which is currently the dominant search engine for database searches.
It is able to cut the time of a database search to nearly half the time originally taken by BLAST. Results from BLAST in our tests have shown that this method is experimentally effective, as we obtain accurate sequence alignment results from SCT. Combined with the extended suffix tree, SCT has the advantage of using BLAST to do the local sequence alignment. We applied the 6-fold cross validation technique of data mining to attain a greater accuracy in our results. The 6 runs of the cross validation help us establish that SCT performs consistently well for all the queries for a particular database included in our tests.

With this effort we have obtained very promising results. We selected a small domain of sequences from the universe, for which our experiments showed that our method works well. Our method can be further enhanced by covering databases of protein sequences to determine domain-specific parameters such as score and length thresholds, and scoring functions based on the natural frequency of amino acid patterns. We can allow for approximate matches in significant patterns. Suffix arrays can be used to improve the performance.

REFERENCES

[1] Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D., Basic local alignment search tool. Journal of Molecular Biology, 215:
[2] Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J., September Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):
[3] Delcher, A. L., Kasif, S., Fleischmann, R. D., Peterson, J., White, O., and Salzberg, S. L., Alignment of whole genomes. Nucleic Acids Research, 27(11):
[4] Giegerich, R. and Kurtz, S., From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction. Algorithmica, 19:
[5] Gusfield, D., Algorithms on Strings, Trees, and Sequences. Cambridge University Press.
[6] Lipman, D. J. and Pearson, W. R., Rapid and sensitive protein similarity searches. Science, 227,
[7] McCreight, E. M., A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2),
[8] Ukkonen, E., On-line construction of suffix-trees. Algorithmica, 14:
[9] Waterman, M. S., Introduction to Computational Biology. Chapman & Hall.
[10] Weiner, P., Linear pattern matching algorithms. Proceedings of the 14th IEEE Symposium on Switching and Automata Theory,
