New Implementation for the Multi-sequence All-Against-All Substring Matching Problem

Oana Sandu
Supervised by Ulrike Stege
In collaboration with Chris Upton, Alex Thomo, and Marina Barsky
University of Victoria, Department of Computer Science
September 19, 2007

Abstract

The threshold all-against-all problem in approximate string matching has important applications in bioinformatics. For two input sequences, an algorithm has been developed whose running time depends linearly on the size of each sequence. However, this algorithm does not extend efficiently to more than two sequences in a straightforward manner. After modifying the problem somewhat, we have developed a fast algorithm that reports patterns unusually conserved between any number of input sequences, and also returns all of their approximate matches.

Table of Contents

1 Introduction
2 Previous work and possible improvements
  2.1 Algorithm for the threshold all-against-all problem
    2.1.1 Description
    2.1.2 Possible improvements
  2.2 The multiple-sequence version of the all-against-all problem
    2.2.1 Description
    2.2.2 Possible improvements
3 A different implementation for multiple-sequence all-against-all matching
  3.1 The problem redefined
  3.2 Using pattern against text matching as a subroutine
  3.3 Returning candidates and their matches in multiple-sequence comparison
  3.4 Left to complete
4 Additional features desired in the program
5 Conclusion
References

1 Introduction

Marina Barsky, a Ph.D. candidate at the University of Victoria, developed an algorithm to solve the threshold all-against-all problem [1]. The problem is: given two strings of lengths M and N, find all maximal pairs of substrings of length at least S with at most K differences. She solves it using a method referred to as All Paths Below Threshold, or APBT. Her algorithm scales well with the sizes of the sequences and the number of differences.

This paper will present her implementation for solving the multiple-sequence version of the problem, which also uses APBT. Our main aim is to find a different approach to solving this problem, and to provide an implementation for it.

2 Previous work and possible improvements

2.1 Algorithm for the threshold all-against-all problem

2.1.1 Description

Barsky's algorithm addresses the threshold all-against-all problem stated in the Introduction. Previous approaches to solving this problem exactly were either based on dynamic programming or used suffix trees to avoid recomputing matches. The most efficient of these attempts is the one published by Baeza-Yates and Gonnet [4]. Barsky's algorithm, developed in conjunction with the University of Victoria Computer Science professors Ulrike Stege and Alex Thomo, as well as Prof. Chris Upton (Biochemistry), aimed to improve on their worst-case running time of $O(M^2 N^2)$ [1]. Her algorithm runs in $O(MNK^3)$.

The problem solved by M. Barsky, which she calls all error-bounded approximate matches, is: given two strings s and t, of lengths M and N, find all pairs of substrings (s[i,j], t[k,l]) such that the length of both substrings is at least S and the edit distance between them is at most K. Such pairs s[i,j], t[k,l] are referred to as K-approximate matches. The matches reported are maximal, meaning that if (s[i,j], t[k,l]) is a solution, then there is no valid solution (s', t') with s' being a proper superstring of s[i,j] and t' = t[k,l], or vice versa.
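To make the definition of a K-approximate match concrete, the following is a minimal brute-force sketch in Python. It is not APBT and not Barsky's implementation: the function names `edit_distance` and `k_approximate_pairs` are hypothetical, positions are 0-based half-open slices rather than the s[i,j] notation used in the text, and maximality is deliberately not enforced. It is only practical for very short strings.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (insertion, deletion, substitution)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def k_approximate_pairs(s: str, t: str, S: int, K: int) -> set:
    """Return all (i, j, k, l) such that s[i:j] and t[k:l] both have
    length >= S and edit distance <= K. Maximality is NOT enforced
    (assumption made to keep the sketch short)."""
    pairs = set()
    for i in range(len(s)):
        for j in range(i + S, len(s) + 1):
            for k in range(len(t)):
                for l in range(k + S, len(t) + 1):
                    if edit_distance(s[i:j], t[k:l]) <= K:
                        pairs.add((i, j, k, l))
    return pairs


if __name__ == "__main__":
    # The two strings differ by one substitution.
    print(k_approximate_pairs("ACGT", "ACCT", S=4, K=1))
```

Enumerating every pair of substrings in this way is far slower than the bounds quoted above; the sketch only pins down what counts as a solution before maximal pairs are selected.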

Barsky considers a matching matrix m which has s as one dimension and t as the other, with m[i,j] = 1 iff s[i] = t[j]. The matrix m is used to induce a digraph $G_m$, with each vertex $v_{ij}$ corresponding to a 1 entry in the matrix (m[i,j] = 1). $G_m$ contains all edges $(v_{ij}, v_{kl})$ such that i < k and j < l. The graph is searched for paths P from $v_{ij}$ to $v_{kl}$ that have a match length of at least S and an error number of at most K. The match length of P is determined by the coordinates of $v_{ij}$ and $v_{kl}$, and is equal to min(k-i+1, l-j+1). The error number of P is the sum of the costs of all the edges it contains. The cost of an edge from $v_{ab}$ to $v_{cd}$ is max(c-a, d-b) - 1, corresponding to the minimum number of edit operations required to transform one of the substrings, say s[a,c], into the other, say t[b,d]. Here, an edit operation can be an insertion or deletion of one character, or a substitution of one character for another. Hence, the error number of the path corresponds to the cost of an edit transcript between s[i,k] and t[j,l]. Barsky shows that the error number of the shortest path between any two vertices in $G_m$ corresponds to the actual edit distance between the two corresponding substrings. This means that the problem of finding matches between strings s and t can be reduced to finding all maximal paths of match length at least S with an error number of at most K.

2.1.2 Possible improvements

To solve the all-paths-below-threshold problem, Barsky doesn't build the entire matrix m or the graph $G_m$ at once, but computes values from them as needed. Also, information already computed about paths is used to determine whether to consider new paths. For more details on this, and for a proof that the algorithm runs in $O(MNK^3)$, see section 4 of Barsky's paper [1]. Barsky suggested to me that the implementation could be sped up if it were run in parallel on several processors.
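Returning briefly to the graph formulation of Section 2.1.1, the edge cost, match length and error number can be written down directly from their definitions. The sketch below is illustrative only: the function names are hypothetical, vertices are represented as 0-based (i, j) tuples, and no path search is performed, so it is not the APBT algorithm itself.

```python
def matching_vertices(s: str, t: str) -> list:
    """Vertices of G_m: all (i, j) with s[i] == t[j] (0-based)."""
    return [(i, j) for i, cs in enumerate(s)
            for j, ct in enumerate(t) if cs == ct]


def edge_cost(u: tuple, v: tuple) -> int:
    """Cost of the edge from v_ab to v_cd: max(c - a, d - b) - 1."""
    (a, b), (c, d) = u, v
    if not (a < c and b < d):
        raise ValueError("edges require a < c and b < d")
    return max(c - a, d - b) - 1


def match_length(path: list) -> int:
    """min(k - i + 1, l - j + 1) for a path from v_ij to v_kl."""
    (i, j), (k, l) = path[0], path[-1]
    return min(k - i + 1, l - j + 1)


def error_number(path: list) -> int:
    """Sum of the edge costs along the path."""
    return sum(edge_cost(u, v) for u, v in zip(path, path[1:]))


if __name__ == "__main__":
    s, t = "ACGT", "AGT"           # t is s with the C deleted
    path = matching_vertices(s, t)  # here the vertices happen to form one path
    print(path, match_length(path), error_number(path))
```

For s = ACGT and t = AGT the only vertices are (0,0), (2,1) and (3,2); the path through them has match length 3 and error number 1, which agrees with the edit distance of 1 between the two strings.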
Parallelization is possible because sets of adjacent rows in the matrix can be processed independently. Hence, given $s_1$, $s_2$ and a chunk size of, e.g., 1000 columns, a central processor could have auxiliary processors run APBT on $s_2$ and a chunk of $s_1$, and then combine the results from the different parts of the matrix into the complete solution for $s_1$ and $s_2$. According to Barsky, for highly similar sequences, APBT can take as much as 20 hours to return a result, and parallelizing the algorithm could reduce the running time by a significant factor [2].

2.2 The multiple-sequence version of the all-against-all problem

2.2.1 Description

Chris Upton's virology lab is interested in predicting unusually conserved short regions of DNA or protein, given large input sequences. Because of the nature of their research, they are more interested in regions that are unusually similar between a number of organisms, not just two; it is therefore very important to provide them with an efficient program that takes several sequences as input [3].

The multiple-sequence version of the all-against-all problem can be defined as: given strings $s_1, s_2, \ldots, s_m$, find all tuples of substrings, one from each of $s_1, s_2, \ldots, s_m$, such that the substrings are all of length at least S and the edit distance between any two of them is at most K.

M. Barsky tried to use the existing implementation of APBT to obtain a method that would solve the multiple-sequence version of the problem exactly. It involves a filtering step in which the all-paths-below-threshold (APBT) algorithm is essentially run on all pairs of the given strings to get a set of potential start positions of matches in each of the strings. For each of the start positions from a certain chosen sequence, Barsky builds all the matches which include this start position as branches in a tree rooted at the original start position [2]. The algorithm breaks down to:

• run APBT on $s_1$, $s_2$
  o mark all possible match start positions in $s_1$ and $s_2$
• starting only at the marked start positions:
  o run APBT on $s_1$, $s_3$ and eliminate the candidate start positions from $s_1$ with no matches
  o run APBT on $s_2$, $s_3$ and eliminate the candidate start positions from $s_2$ with no matches
  o continue doing this until all pairs of sequences have been compared
  o then repeat the three previous steps on all pairs of sequences, eliminating more candidate start positions, until a run of APBT yields a less than 10-fold reduction in the number of candidates
• recover the candidate patterns:
  o perform APBT only on the marked start positions in $s_1$, $s_2$
  o for each match reported by APBT from $s_1$, run APBT on it and $s_2$, then add the resulting matches as children of the match in a tree
    - run APBT on the match and $s_3$, and for each resulting match, add it as a child of one of the matches added from $s_2$ iff their edit distance is at most K
    - repeat this for each sequence up to $s_m$
  o for every tree, report all the distinct paths from root to leaf as solution tuples $(s_1, s_2, \ldots, s_m)$

2.2.2 Possible improvements

Barsky encountered a problem when running her algorithm on 4 or more biologically related sequences. Namely, the memory required to build all of the trees is such that the intermediate information, even for one tree, can be larger than the amount of memory available in the average lab. The program is heavily slowed down by writing intermediate information to disk and then retrieving it [2]. Another issue is that the filter is very time consuming, since it applies APBT on the order of $m^2$ times for m sequences. APBT itself can also be quite time consuming for long, relatively similar sequences.

My project focuses mostly on finding a different approach to solving the multiple-sequence version of the threshold all-against-all problem. We aim to reduce the memory requirements, since they slow down the program. Another goal is to reduce the number of times APBT is run, since it can be very slow for similar sequences. The multiple-sequence program can also likely be sped up by using faster ways of comparing candidates against sequences, as opposed to using APBT as a subroutine.
Much of the information stored in the trees described above is redundant: if, for instance, $s_{31}$ is present in both subtrees rooted at $s_{21}$ and $s_{22}$, and $s_{31}$ differs from both $s_{21}$ and $s_{22}$ at the same position, then the entire subtree rooted at $s_{31}$ could appear twice. Also, many of the candidates added to the tree at the sequence 3 level could be eliminated at a later point in the algorithm (this happens if, e.g., $s_{31}$ has no match in sequence m). Hence, to save memory we could store potential matches and their edges in a graph until all the edges are computed, and then use the graph to recover the actual solution tuples. To build this new graph G, proceed as in the pseudocode for the all-against-all problem, adding a substring $r_{ij}$ at the i-th level of the graph if and only if it has edges to every level from 1 to i-1. This essentially builds the trees above as a graph, but without storing the paths separately. The graph can then be processed to obtain all the valid solution tuples as induced complete subgraphs of G, i.e., sets of m vertices, one from each level of the graph, such that every vertex in the set has an edge to every other vertex in the set.

3 A different implementation for multiple-sequence all-against-all matching

3.1 The problem redefined

C. Upton told us that he would like results from the multiple-sequence version in useful time, even if the results reported are a superset of the actual solution to the multiple all-against-all problem. What he specifically wants is a set of patterns, say taken from $s_1$, which are unusually conserved in all of the sequences. He doesn't necessarily need any two substrings in the set of matches to each pattern to be within K differences of each other. Hence, as long as the number of patterns is low enough, he can use the positions of their matches in all of the other sequences to figure out which are significant. He especially insisted that the current implementation is too slow to be usable by his lab, so he would like a fast algorithm, as long as no potential solutions are left out [3].

The idea behind the implementation done in the current project is to use APBT to get a set of candidate patterns. Then, we aim to quickly determine which candidates have matches in all of the sequences, and what these matches are.
We examined how APBT performs with inputs that would be common in Upton's lab, so that we know the number of candidate patterns that will have to be matched against the other sequences. The test sequences were Mycobacteriophage D29 and Mycobacteriophage Bxb1, taken from Barsky's paper [1]: related viral genomes of around 50000 base pairs. Several values of S and K were tried, with K around 10% of S, as suggested by Upton.

Table 1: Tests of APBT on two viral sequences

S                        25    25    25    30    30    30    40    40    40
K                         1     2     3     1     2     3     2     3     4
Number of solutions     430  1609  4591   230   842  2338   210   638  1572
Processing time (min.)   11    17    25     4     8    15    16    25    25

The strategy for solving the multiple-sequence version is to use APBT on two sequences to get candidate patterns, and then compare the candidates to the rest of the sequences using fast pattern-text matching. This can be used to efficiently report a set of matches for each candidate, with at least one match from each sequence. Since every match in a set has an edit distance of at most K to the original pattern of the set, the triangle inequality guarantees that any two matches in the set have at most 2K differences between each other. After this implementation, if the users require it, we can further process each set to produce sets of tuples, with one substring from each sequence, such that any two substrings in a tuple have an edit distance of at most K.

3.2 Using pattern against text matching as a subroutine

We based the pattern-text matching on the algorithm first described by Baeza-Yates and Gonnet in A New Approach to Text Searching [5]. This algorithm takes advantage of the small size of the pattern to use computer words, bit shifts and logical bitwise operations to determine where matches to the pattern end in the text. For the sizes we're interested in, the pattern can be encoded as several bit masks: there is one bit mask for each letter in the pattern, and a 1 bit signifies that the letter appears at that position. Note that we used Wu and Manber's extension of this algorithm [6] to also match the pattern to the text with up to k single-letter insertions, deletions, or substitutions.

The algorithm is given a pattern P, a text T, and values for the parameters k and S, and then determines all the end positions in T of matches to P with up to k errors. First, we compute a set of bit masks U, one for each distinct character that can occur in the text. Then, we compute a set of matrices $R_1, \ldots, R_k$ (with $R_0$ being the exact-matching matrix of the original algorithm). Every column in $R_d$ represents a different position in the text; a bit $R_d(i, j)$ is 1 iff there is a match to the prefix P[1..i] ending at T(j) with at most d differences. Column j of $R_d$ depends on: column j-1 of $R_d$, the bit mask for character T(j), and some columns of $R_{d-1}$. The recurrence relation to compute column j of $R_d$ is:

\[
R_d(j) =
\begin{cases}
\underbrace{1 \cdots 1}_{d}\,0 \cdots 0 & \text{if } j = 0,\\[4pt]
\big[\mathrm{BitShift}(R_d(j-1)) \ \mathrm{AND}\ U(T(j))\big]
\ \mathrm{OR}\ \mathrm{BitShift}(R_{d-1}(j-1))
\ \mathrm{OR}\ \mathrm{BitShift}(R_{d-1}(j))
\ \mathrm{OR}\ R_{d-1}(j-1) & \text{if } j \ge 1.
\end{cases}
\]

Note that the final bit of any column $R_k(j)$ indicates whether or not there is a match to P ending at T(j) with up to k differences. Since computing a column $R_d(j)$ involves looking up $R_d(j-1)$, $R_{d-1}(j-1)$ and $R_{d-1}(j)$, and j could be of the order of tens of thousands, we don't store any entire $R_d$ table. Instead, we consider the $R_d(j)$ values to be cells in a table, with d as one dimension and j as the other. We compute the values in the table diagonal by diagonal, storing a position j as a solution when appropriate. In this way, we only need to store the two previous diagonals when computing a new one, each holding at most one value per error level d. We also represent the actual values of $R_d(j)$ as 64-bit longs; this limits the size of a pattern to at most 64 characters, which is long enough for C. Upton's purposes. The maximum length can easily be changed to 128 characters in the future.
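The recurrence above can be sketched in a few lines of Python. This is a hedged illustration rather than the project's code: it processes the text column by column instead of diagonal by diagonal, uses Python's unbounded integers instead of 64-bit longs, and the function name `approx_end_positions` is hypothetical. Bit i-1 of each value stands for the prefix P[1..i], so "BitShift" becomes a left shift that brings in a 1 at the low end, and the reported end positions are 0-based.

```python
def approx_end_positions(pattern: str, text: str, k: int) -> list:
    """Return the 0-based end positions in `text` of substrings that match
    `pattern` with at most k insertions, deletions or substitutions."""
    m = len(pattern)
    full = (1 << m) - 1          # keep only the m low bits
    last = 1 << (m - 1)          # bit for the complete pattern

    # U[c] has bit i set iff pattern[i] == c.
    U = {}
    for i, c in enumerate(pattern):
        U[c] = U.get(c, 0) | (1 << i)

    # R[d] at column 0: the d shortest prefixes match by d deletions.
    R = [(1 << d) - 1 for d in range(k + 1)]

    ends = []
    for j, c in enumerate(text):
        mask = U.get(c, 0)
        new = [((R[0] << 1) | 1) & mask]                    # exact extension
        for d in range(1, k + 1):
            value = (((R[d] << 1) | 1) & mask)              # match
            value |= (R[d - 1] << 1) | 1                    # substitution
            value |= (new[d - 1] << 1) | 1                  # deletion
            value |= R[d - 1]                               # insertion
            new.append(value & full)
        R = new
        if R[k] & last:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(approx_end_positions("CATC", "TTTTCATCTT", 0))
    print(approx_end_positions("CATC", "TTTTCATCTT", 1))
```

With no errors allowed, the pattern CATC is found ending at position 7 of TTTTCATCTT; with k = 1 the end positions 6, 7 and 8 are reported, corresponding to CAT, CATC and CATCT. The minimum length S is not applied in this sketch.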
For a more complete specification of the algorithm, see the program code and its comments.

We tested the implementation of this algorithm for speed. We chose to test patterns of different lengths S and with different K values against the Human coronavirus 229E sequence, which has a size of about 27000 base pairs. Even with S = 22 and an exaggerated value of K = 9 allowed differences, the running time was under 0.2 seconds; the slowest trial in Table 2 (S = 30, K = 15) took 328 ms. These results were promising because a program for all-against-all comparison could use the shift-and algorithm as a subroutine to quickly filter out candidate patterns which don't have matches in all the sequences. The pattern-against-text matching could also speed up the substring-against-substring comparisons done in the existing APBT multi-sequence implementation.

Table 2: Performance of pattern-against-text matching

Pattern length (chars)   Allowed differences   Number of solutions   Processing time (ms)
22                       9                     7243                  172
30                       15                    1902                  328
30                       5                     0                     125
15                       5                     2266                  94
15                       2                     18                    46
15                       1                     5                     46

3.3 Returning candidates and their matches in multiple-sequence comparison

The strategy for finding candidates and their matches is as follows:

• Use APBT to find the candidate matches between 2 preferred sequences ($s_1$, $s_2$);
• For each of these candidates, as taken from the first sequence $s_1$:
  o for each of the sequences $s_2$ to $s_m$, get the end positions of the matches to the candidate and store these in a map, indexed by the candidate;
  o if the candidate has 0 matches in any of the sequences, remove the candidate from consideration;
• Report each candidate along with its set of matches from each sequence, after post-processing to find the start positions as well as the end positions of the matches.

Pattern-text matching reports only the end positions of the matches. However, the exact coordinates of every match have to be reported in order for the users to visualize the results. After any set of end positions is returned by pattern matching, we can compute the starting positions of the matches and then store the matches as pairs of start and end positions. Since the start points have to lie between (end position - S - k) and (end position - S + k), we can just test every possible start point. There are then up to 2k + 1 candidate start points, and each could define a separate match, so the number of matches could grow by a factor on the order of k when doing this. We decided to only store the farthest start position for any given end position, so that the users know that the region between this start point and the end point contains at least one match for the given pattern.

Here is an example to illustrate what solutions are reported after this processing (note that the substrings are given in this example, not their positions): given k = 1, S = 4, the pattern CATC taken from t01, and sequences t01 = CATC, t02 = TTTTCATCTT, t03 = CGTCGGCATC. The solutions reported are: CATC from t01; two regions of TTTTCATCTT from t02; and two regions of CGTCGGCATC from t03. Note that GCATC is reported from t03 and not CATC, because they have the same end point and GCATC is longer. Also note that if S = 3, then CAT would also be reported as a match to CATC, because even though it is contained completely in the reported GCATC, it has a different end point.

In testing the program, we compared running times with and without finding the start points, to see whether that step reduces the benefits of using the fast pattern-text matching. We found that both the number of matches and the running time remain of the same order of magnitude as before finding the start points.

First, we used the same sequences as the ones used by Barsky [1] to test the 2-sequence APBT. These are five viral DNA sequences, D1 to D5, and three animal protein sequences, P1 to P3; for a description of the sequences, refer to [1]. The running time of the algorithm was split into the first step, using APBT on the first two input sequences to get candidate patterns, and the second step, processing the patterns to get their matches from all the sequences.

Table 3: Performance of the program

Test sequences    S    K    APBT time (s)   Processing time after APBT (s)   APBT # candidates   # results remaining
D1 D4 D2 D3 D5    17   2    214             1.4                              49                  0
P2 P3 P1          23   1    17              8.9                              326                 310
P2 P3 P1          25   2    24              33.4                             804                 791

3.4 Left to complete

The example above indicates that overlapping solutions can be reported by the program. One useful addition that would make the results more usable by Upton's lab would be to combine any two matches to one candidate that come from the same sequence and overlap.

Further, we have to store the matches in a fashion that is useful for the scientists using the program. According to C. Upton's team, they would like the output as one file per candidate pattern and its set of matches. This file should be in FASTA format and include, for each of the matches: a description line with the identifier of the sequence it is from and the start and end positions; and a data line with the actual sequence of this match.

4 Additional features desired in the program

Upton's lab has used both the APBT program and M. Barsky's multiple-sequence version. They noticed that when running the APBT program on very long sequences, such as entire viral genomes, with reasonable values of S, e.g. 20, the number of matches reported can be very large. This is unavoidable given that both the lengths and similarities of the input sequences can be varied by the user. What the virologists would like is to input several viral genomes, of length around 200 000 bp, and have the program predict values of S and K such that the number of hits is less than 1000. We decided that a sampling strategy could be used to try to pinpoint desired values of the parameters. This could be built on top of the multiple-sequence implementation of the program.
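Returning to the output handling outlined in Section 3.4, the two remaining steps (combining overlapping matches from the same sequence, and writing them as FASTA records) could look roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, positions are 0-based and inclusive, and the exact layout of the description line (`start=` and `end=` fields) is an assumption, since the lab's preferred header format is not specified here.

```python
def merge_overlapping(matches) -> list:
    """Combine (start, end) matches (inclusive positions in one sequence)
    that overlap, returning a sorted list of merged regions."""
    merged = []
    for start, end in sorted(matches):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def fasta_records(seq_id: str, sequence: str, matches) -> str:
    """Format the merged matches of one sequence as FASTA text: a
    description line followed by a data line for each region."""
    lines = []
    for start, end in merge_overlapping(matches):
        lines.append(f">{seq_id} start={start} end={end}")  # assumed header layout
        lines.append(sequence[start:end + 1])
    return "\n".join(lines)


if __name__ == "__main__":
    print(merge_overlapping([(4, 8), (5, 9), (0, 2)]))
    print(fasta_records("t02", "TTTTCATCTT", [(4, 8), (3, 7)]))
```

Writing one such block per sequence into a single file per candidate pattern would give the file layout requested by the lab; how the candidate itself is labelled in that file is left open.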

5 Conclusion

The greatest advantage of the current implementation is the speed with which it reports the patterns which satisfy the desired criteria. The implementation leaves open many possibilities for post-processing to enforce additional criteria on the output. It can also be extended to solve the multiple-sequence threshold all-against-all substring matching problem, by further processing of the sets of matches reported.

References

1. BARSKY M., STEGE U., THOMO A. and UPTON C. A graph approach to the threshold all-against-all substring matching problem, 2007. Submitted to The Journal of Experimental Algorithmics.
2. BARSKY M. Personal communication, 2007. University of Victoria.
3. UPTON C. Personal communication, 2007. University of Victoria.
4. BAEZA-YATES R. A. and GONNET G. H. All-against-all sequence matching, 1990. Report of the Department of Computer Science, Univ. de Chile.
5. BAEZA-YATES R. A. and GONNET G. H. A New Approach to Text Searching, 1992. Communications of the ACM, 35:74-82.
6. MANBER U. and WU S. Fast text searching with errors, 1991. Department of Computer Science, U. of Arizona.