An improved algorithm for the regular expression constrained multiple sequence alignment problem

Size: px

Start display at page:

Download "An improved algorithm for the regular expression constrained multiple sequence alignment problem"

Albert Park
5 years ago
Views:

1 An improved algorithm for the regular expression constrained multiple sequence alignment problem Abdullah N. Arslan and Dan He Department of Computer Science University of Vermont Burlington, VT 05405, USA Abstract Constrained sequence alignment has been proposed as a way for incorporating biologists knowledge about common structures or functions into the alignment process. For alignment of protein sequences, several studies have suggested taking into account the motifs (a restricted regular expression) from the PROSITE database to guide alignments. The regular expression constrained sequence alignment has been introduced for this purpose. An alignment satisfies the constraint if part of it matches a given regular expression in each dimension (i.e. in each sequence aligned). There is a method that rewards the alignments that include a region matching the given regular expression. This method does not always guarantee the satisfaction of the constraint. Another method constructs a weighted finite automaton from the given regular expression, and presents a dynamic programming solution that simulates copies of this automaton to find an alignment with maximum score satisfying the regular expression constraint. We propose a new algorithm for the regular expression constrained multiple sequence alignment problem. Our algorithm considers two layers each of which corresponds to part of the dynamic programming matrix for the alignment of the given sequences. We compute each layer differently using dynamic programming. We propose the following modification in the definition of the problem: the region satisfying the constraint does not contribute to the total score. This modification is not necessary for the correctness and the performance in certain cases such as the constraint involves only one motif, or motif-matching regions span short distance in each sequence but we believe that with this modification we achieve the same goal by doing less work in practice. Our algorithm is much more efficient than a previously proposed algorithm that uses weighted automata, and its performance in practice is comparable to (and under certain conditions even better than) that of the ordinary (unconstrained) multiple sequence alignment algorithm. Our experiments on real biological sequences, and regular expressions each composed of a sequence of motifs verify this. Keywords: Multiple sequence alignment, regular expression, PROSITE, constrained sequence alignment, dynamic programming. 1. Introduction Given two strings S 1, and S 2, the pairwise sequence alignment can be described as a writing scheme such that we use a two-row-matrix in which the first row is used for the symbols of S 1, and the second row is used for those of S 2, and each symbol of one string is aligned with (i.e. it appears on the same column with) a symbol of another string, or the blank symbol -. A matrix obtained this way is called an alignment matrix. In an alignment matrix, a given column with two identical symbols represents a match, similarly, a column including two different (non-blank) symbols indicates a mismatch, or when one of the symbols is a -, it corresponds to an indel (insert/delete) operation. Two blank symbols cannot be aligned together. Each of these operations has a weight. The score of an alignment is the total score of the columns in the corresponding alignment matrix. The Smith-Waterman algorithm [15] computes the maximum alignment score for a given pair of sequences. The definition of sequence alignment for a pair of sequences can be generalized to the multiple sequence alignment problem [7, 15]. Multiple sequence alignment of k sequences involves a k-row alignment matrix, and there are various scoring schemes (e.g. sum of pairwise distances) assigning a weight to each column. There are many algorithms for the multiple sequence alignment problem which is NP -hard in all meaningful definitions of similarity. Notredame [10] summarizes recent progress in multiple sequence alignment. Ordinarily an alignment algorithm simply computes the maximum alignment score without considering the compo-

2 sition of the alignment. However, it has been noted that biologists favor integrating their knowledge about common patterns, or structures into the alignment process to obtain biologically more meaningful similarities [11, 12, 6, 8]. For example, when computing the homology of two protein sequences it may be important to take into account a common specific or putative structure [12]. The constrained versions of the sequence alignment problems have been studied in the literature extensively [3, 4, 5, 6, 11, 12, 13]. The constraint for the alignments sought can be the inclusion of a given pattern string [11, 4, 13, 12, 5, 3]. In this case, alignments are required to contain a given pattern exactly, giving rise to the constrained multiple sequence alignment problem. Motivation for the problem comes from the alignment of RNase sequences. Such sequences are all known to contain three active residues His(H), Lyn(K), His(H) that are essential for RNA degrading. Therefore, it is natural to expect that in an alignment of RNA sequences, each of these residues should be aligned in the same column, i.e. alignments are constrained to contain sequence "HKH". Arslan and Eğecioğlu [3] propose a variation of the problem where the inclusion of the given pattern by the alignments sought is not exact but within a given tolerance. Considering common structures is especially important in aligning protein sequences. Family of similar protein sequences include a conserved region which can be described as a regular expression. Such conserved amino acid residues associated with a particular function is called a sequence motif. Typically, motifs span 10 to 30 amino acid residues. For many years, PROSITE ( mirrored in the US at has been a collection of sequence motifs, which were represented and stored in the PROSITE format ( that can be translated into simple regular expressions. For example, the motif in Figure 1 is the famous P-loop motif, first described in 1982 by John Walker and colleagues as Motif A and found later in many ATP- and GTP-binding proteins, corresponds to a flexible loop, sandwiched between a b-strand and an a-helix and interacting with b- and g-phosphates of ATP or GTP [14]. In PROSITE database it is represented as [GA]-X(4)-G-K-[ST] (ATP/GTPbinding site motif A (P-loop) (PS00017)) which means that the first position of the motif can be occupied by either Ala or Gly, the second, third, fourth, and fifth positions can be occupied by any amino acid residue, and the sixth and seventh positions have to be Gly and Lys, respectively, followed by either Ser or Thr. If the sequences aligned contain the same motif then it is biologically more meaningful to seek an alignment that satisfies the corresponding regular expression constraint. Figure 1 illustrates an example [2] in which two synthetic sequences S 1 = TGFPSVGKTKDDA, and S 2 = TFSVAKDDDGKSA are aligned in a way to maximize the number of matches (this is the longest common subsequence problem). An optimal alignment with 8 matches is shown in part (a). For the regular expression constrained sequence alignment problem with regular expression R = (G +A)ΣΣΣΣGK(S +T), where Σ denotes a fixed alphabet over which sequences are defined, the alignments sought change. The alignment in Part (a) does not satisfy the regular expression constraint. Part (b) shows an alignment with which the constraint is satisfied. The alignment includes a region (shown with a rectangle drawn in dashed lines in the figure) where the substring GFPSVGKT of S 1 is aligned with substring AKDDDGKS of S 2, and both substrings match R. In this case, optimal number of matches achievable with the constraint decreases to 4. T G F P S V G K T K D D A T - F - S V A - - K D D D G K S A (a) T G F P S V G K T T F S V A K D D D G K S A (b) K D D Figure 1: [2] For strings S 1 = TGFPSVGKTKDDA and S 2 = TFSVAKDDDGKSA: (a) An alignment with maximum number of matches, 8. (b) An alignment in which substring GFPSVGKT of S 1 is aligned with substring AKDDDGKS of S 2, and both match R = (G +A)ΣΣΣΣGK(S +T) where Σ is a fixed alphabet on which the sequences are defined. This alignment has 4 matches, and it satisfies the regular expression constraint. Comet and Henry [6] experimentally show that the Smith-Waterman algorithm does not always align common motifs in the two sequences aligned, proving that the Smith- Waterman algorithm may miss some important similarities. The main reasons for this are that the motifs are short, and sometimes they are not part of a large similar region with a high score. Comet and Henry [6] use a method that rewards alignments containing motifs. From the motif database their method first finds a known motif (or motifs) in each sequence separately to determine a common motif (or motifs). Next, it extends the dynamic programming sequence alignment solution by reconsidering each region where the motif appears in each sequence simultaneously. Arslan [2] formally introduces the regular expression constrained sequence alignment (RECSA) problem in which alignments are required to contain a given regular expression (motif). For the RECSA problem, Arslan [2] A

3 presents an algorithm whose time complexity is O(nmr) where n and m are the lengths of the sequences aligned, and r is the size of the finite automaton M created to accept alignments that satisfy the regular expression constraint, i.e. the alignments that contain the given regular expression. M is a weighted automaton such that the weights of the states in M correspond to optimum constrained alignment scores. The algorithm is based on a given dynamic programming formulation for sequence alignment. Instead of computing optimum scores, it uses the dynamic programming solution to compute weights for the states of automaton M. M has O(t 2 ) states if N has t states where N is an automaton that accepts the language described by the regular expression R. Arslan [1] generalizes this approach for multiple sequences, namely for the regular expression constrained multiple sequence alignment where the constraint is a sequence of regular expressions. Automaton created in this case has O(t n ) states. Simulating copies of this automaton increases the time requirement of multiple sequence alignment by a factor of O(t 2n ). We present a new dynamic programming algorithm for the regular expression constrained multiple sequence alignment problem. We consider two layers where each layer is part of the dynamic programming matrix of alignment of given sequences. We present a solution that uses dynamic programming on these two layers. Our solution avoids computations on unnecessary parts in ordinary dynamic programming matrix. We modify the definition of the problem such that the region satisfying the constraint does not contribute to the total score. For example, in our definition, the number of matches in Figure 1 in Part (b) is 2 instead of 4: the alignment sought is required to include a region that matches the regular expression but the score obtained in this region is not included in the total. Note that the alignment shown in Part (b) of the figure is optimal with respect to both definitions. The time and space requirements of our algorithm is asymptotically the same in practice as those of the ordinary multiple sequence alignment algorithm. The outline of this paper is as follows. In Section 2 we include the definition of the ordinary multiple sequence alignment problem, and a dynamic programming solution for it. In Section 3 we give the definition of the regular expression constrained multiple sequence alignment problem, describe our new algorithm for this problem, and discuss its performance. We summarize our test-results in Section 4. We include a summary of our contribution in this paper in Section Multiple Sequence Alignment Given n sequences S 1, S 2,...,S n with lengths respectively l 1, l 2,..., l n, the multiple sequence alignment is to find an alignment matrix with maximum score. A given function assigns a score to each column. The score of an alignment matrix (or simply the alignment) is the sum of the scores of all the columns. An alignment matrix contains in row i a sequence obtained from S i by inserting in it as many - s as necessary to maximize the total score such that that no column in the resulting matrix is entirely composed of - s. We denote by S[i..j] the substring of string S starting from position i and ending at position j. Let D(i 1, i 2,..., i n ) be the optimum multiple sequence alignment score of sequences S 1 [1..i 1 ],S 2 [1..i 2 ],...,S n [1..i n ]. Then this score can be computed by the following recurrence ([7]): D(i 1,...,i n ) = if i 1 = 0, or i 2 = 0, or..., or i n = 0. D({0} n ) = 0. For all i 1, i 2,..., i n, 0 < i 1 l 1, 0 < i 2 l 2,..., 0 < i n l n D(i 1, i 2,..., i n ) = max D(i 1 e 1, i 2 e 2,..., i n e n ) e {0,1} n + γ(e 1 S 1 [i 1 ],..., e n S n [i n ]) where e j = 0 or 1, e j S j [i j ] with e j = 0 represents the space character, and S j [i j ] when e j = 1, and γ(c 1, c 2,..., c n ) is a score for column (c 1, c 2,..., c n ) (note that we use a row vector to represent a column for convenience) that is not completely composed of - s. Computing the optimum multiple alignment score takes time O(2 n l 1 l 2... l k ) since each (i 1, i 2,..., i n ) has 2 n 1 neighbors. 3. New algorithm for the regular expression constrained multiple sequence alignment problem The regular expression constrained multiple sequence alignment [2] is a multiple sequence alignment problem such that alignments sought satisfy regular expression constraint for a given regular expression E. An alignment satisfies a regular expression constraint if there is an n-way regular expression match for E in part of the alignment, i.e. in part of the alignment substrings s 1, s 2,..., s n respectively of S 1, S 2,..., S n are aligned, and all of these substrings s 1, s 2,..., s n match the given regular expression E. We say that a string matches a regular expression E if it is described by E (i.e. if it is in the language described by E). As a main application of the regular expression constrained multiple sequence alignment problem, we consider the alignment of given multiple protein sequences under the guidance of a sequence of given motifs. In this case, let E 1, E 2,...,E r be respectively a sequence of equivalent regular expressions for motifs R 1, R 2,..., R r. We construct from the regular expressions E 1, E 2,..., E r, the regular expression E = E 1 Σ E 2 Σ... Σ E r. We find all matches of E in each sequence S i using a classical regular (1)

4 expression pattern matching (in text) algorithm (e.g. [9]). We collect all these matches in set X i for each sequence S i. Let X i,j be the set of start positions such that for each x X i,j, substring S i [(j x)..j] matches E. Let b i be the position where the first match for E starts, and let e i be the position where the last match for E ends. We compute the regular expression constrained sequence alignment using computations in two layers, Layer 0, and Layer 1, where each layer corresponds to a part of an n- dimensional dynamic programming matrix for multiple sequence alignment of sequences S 1, S 2,...,S n. Each layer is a hyper-rectangle that can be identified by its two extreme points. These points are (0, 0,...,0) and (e 1, e 2,...,e n ) for Layer 0, and (b 1, b 2,..., b n ), (l 1, l 2,...,l n ) for Layer 1. Figure 2 illustrates the case when there are two sequences (n = 2). Figure 2: Layers 0 and 1 for two sequences where b i is the start position of the first match for E, and e i is the last position of the last match for E. The region that belongs to Layer 0 is filled with horizontal lines while the region of Layer 1 is filled with vertical lines. The region shared by the two layers are filled with both types of lines. We compute two matrices D 0 () on Layer 0, and D 1 () on Layer 1. The optimum value of the regular expression constrained sequence alignment is D 1 (l 1, l 2,..., l n ). We only refer to the values in D 0 () when we compute new entries in D 0 () on Layer 0 shown in Figure 2, and it is computed the same way as D() in Equation (1) except that the indices run up to the extreme point (e 1, e 2,...,e n ). That is, 1 i 1 e 1, 1 i 2 e 2,..., 1 i n e n. We define D 1 (i 1, i 2,..., i n ) as the optimum multiple sequence alignment score of sequences S 1 [1..i 1 ],S 2 [1..i 2 ],...,S n [1..i n ] such that the alignment satisfies the regular expression constraint, i.e. the alignment has a region where each involved substring matches E. We compute D 1 () on Layer 1 shown in Figure 2. We use values in D 0 () and D 1 () in computing new values in D 1 (). D 1 (i 1,..., i n ) = if i 1 = b 1, or i 2 = b 2, or..., or i n = b n. D({0} n ) = 0, and for all i 1, i 2,...,i n, b 1 < i 1 l 1, b 2 < i 2 l 2,..., b n < i n l n D 1 (i 1, i 2,..., i n ) = max{0, max x 1 X 1,i1,...,x k X k,ik D 0 (i 1 x 1, i 2 x 2,..., i k x k )+ score(s 1 [(i 1 x 1 )..i 1 ],..., S n [(i n x n )..i n ]), β(i 1, i 2,...,i k ) ( D 1 (i 1 e 1, i 2 e 2,..., i k e n )+ γ(e 1 S 1 [i 1 ],...,e k S k [i n ]) ) } (2) where score(y 1, y 2,..., y n ) is the score of the multiple alignment of sequences y 1, y 2,..., y n, β(i 1, i 2,..., i k ) = 1 if D 1 (i 1 e 1, i 2 e 2,..., i k e k ) > 0; 0 otherwise, and e j = 0 or 1, e j S j [i j ] with e j = 0 represents the space character, and S j [i j ] when e j = 1 and γ(c 1, c 2,..., c k ) is a score for column (c 1, c 2,..., c k ). The first max-term in Equation (2) max x 1 X 1,i1,...,x n X n,in D 0 (i 1 x 1,..., i n x n )+ score(s 1 [(i 1 x 1 )..i 1 ],..., S k [(i k x k )..i k ]) is for computing the optimum score of an alignment over all alignments ending with an n-way match for E (a match in each of the n sequences). There exists such an alignment only if X 1,i1 X 2,i2... X n,in. If this cartesian product is not empty then every member (i 1, i 2,..., i n ) corresponds to an n-way match for E. The substrings involved in this n-way match are S 1 [(i 1 x 1 )..i 1 ], S 2 [(i 2 x 2 )..i 2 ],...,S k [(i k x k )..i k ]). An optimal alignment considered in this case satisfies the constraint the first time, therefore, we use the optimum score D 0 (i 1 x 1, i 2 x 2,..., i n x n ) obtained on Layer 0 at the start position (i 1 x 1, i 2 x 2,..., i n x n ) of the n-way match, and add to it the alignment score score(s 1 [(i 1 x 1 )..i 1 ], S 2 [(i 2 x 2 )..i 2 ],...,S k [(i k x k )..i k ]) of the substrings involved in this n-way match for E. If we want to compute the optimum constrained multiple sequence alignment as defined in [2], we need to calculate this part performing an ordinary multiple sequence alignment using the underlying scoring function. However, in this paper we propose a modification in the definition as follows: we consider score(s 1 [(i 1 x 1 )..i 1 ], S 2 [(i 2 x 2 )..i 2 ],..., S k [(i k x k )..i k ]) as zero, that is, the n-way match is required to satisfy the constraint, and it does not contribute to the total score. We believe that this modification achieves the same goal, and enables more efficient computation. The second max-term in Equation (2) β(i 1, i 2,..., i k ) ( D 1 (i 1 e 1, i 2 e 2,...,i k e k )+ γ(e 1 S 1 [i 1 ],..., e k S k [i k ]) ) }

5 is for computing the optimum score of alignments that already satisfied the regular expression constraint (i.e. that already had an n-way match for E in them). On Layer 1, only such alignments have positive scores, and they can be extended in the same way as is an unconstrained alignment but on Layer 1, i.e. from one of the 2 n 1 neighbors on Layer 1. The optimum score is computed accordingly. Next, we discuss the time complexity of our algorithm by considering the parameters in the domain of protein sequences and motifs. Define M = n li i=1 j=1 X i,j. That is, the total number M of all possible n-way matches for E in all n sequences is the product of total number of distinct matches of E in each sequence. Then we can see that the time spent on computing the first max-term in computing D 1 () is O(M) (we note that without our modification in the definition in this part we would need to perform O(M) alignments of matching substrings from sequences), and the total time requirement is O(2 n l 1 l 2...l n + M). We expect that M is small in practice. Let N(E) be the set of lengths of strings that match E, and let P k (E) be the probability that a given string of length k matches E. Since motif lengths run between 10 and 30, and the varying length gaps are small in number in any motif, N(E) is bounded from above by 20. When we consider the motifs in PROSITE database it is very reasonable to assume that P k 1/20 for all k, because motifs are defined over an alphabet of size 20. We can easily see that this inequality is true for all motifs in practice such as for a motif in which only one position has a fixed symbol, or for a motif in which there are two positions having 5 and 4 alternative symbols. With these parameters if we do simple expected value analysis, we see that the expected total number of matches for E in S i ending at position j is bounded from above by N(E) k=1 1/ /20 = 1. Then the expected total k=1 number of matches for E in S i is l i, and therefore the expected value for M is l 1 l 2...l n. This upper bound on M indicates that the running time of our algorithm in practice is asymptotically the same as that of computing an ordinary multiple sequence alignment, i.e. O(2 n l 1 l 2...l n ). Our analysis for M is very pessimistic: there assumed to be a match at every position in each sequence. However, motifs by design do not match many biological (sub)sequences. Therefore, we expect that M is much smaller in practice. 4. Experiments We have collected experimental evidence to verify our claims about the performance of our algorithm in the previous section. We compare the time requirement of our algorithm with that of the dynamic programming for ordinary (unconstrained) multiple sequence alignment algorithm. In our experiments, we choose three motifs from PROSITE MOTIF NAME PROSITE INDEX Regular expression CK2 PHOSPHO SITE PS00006 [ST]-x(2)-[DE] MYRISTYL PS00008 G-[EDRKHPFYM]-x(2)- [STAGCN]-p UPF0027 PS01288 Q-[LIVM]-x-N-x-A-x-[LIVM]-P-x- I-x(6)-[LIVM]-P-D-x- H-x-G-x-G-x(2)-[IV]-G Figure 3: Motifs we use in our tests. SEQUENCE NAME length RTCB ECOLI 408 Y2631 MYCTU 432 Y682 METJA 968 YT6J CAEEL 505 Figure 4: Sequences we use in our tests. database and align four sequences in Swissprot Release 35 with different constraints formed by the combination of the three motifs. These three motifs and four sequences are shown in figures 3, and 4 respectively. In our tests on these sequences and motifs, we note that to compare the time requirement of our algorithm with that of the ordinary multiple sequence alignment, we can simply compare the total number of points in the matrix that the algorithms perform computations, i.e. the total sizes of Layer 0, and Layer 1 versus l 1, l 2... l n. The reasons are: First, we observe that the total number of n-way matches M is insignificant compared to 2 n l 1 l 2... l n. Second, for each point in the matrix the ordinary multiple sequence alignment performs 2 n 1 additions and comparisons (examines 2 n 1 neighbors) in Formulation (1), and our algorithm in Formulation (2) performs the same number of additions and comparisons for the points on Layer 0, and on Layer 1. Our algorithm performs additional operations on Layer 1. The total time our algorithm spends on these operations is O(M) which is insignificant in practice. In Figure 5, we show the number of vertices at which our algorithm performs computations (i.e. the total of the sizes of Layer 0 and Layer 1). The size of the dynamic programming matrix for ordinary multiple sequence alignment is in all these cases. We also show in the figure the ratio of the size-sum of Layer 0 and Layer 1 to this number. These ratios indicate that the performance of our algorithm is comparable in practice with that of the unconstrained multiple sequence alignment algorithm. The ratios in the figure decrease as the length of the regular expression increases since with longer regular expressions there are fewer matches in the sequences, and as a result layers 0 and 1 shrink, and our algorithm performs faster, and uses lesser space compared to the ordinary multiple sequence

6 Pattern Sequence New New/MSA PS01288 PS00008 PS PS01288 PS00008 PS00006 PS00008 PS PS01288 PS00008 PS00008 PS00006 PS00008 PS00008 PS Figure 5: The numbers of vertices in the dynamic programming matrix for our new algorithm for given motifs, and their ratios to the size of the dynamic programming matrix of the ordinary multiple sequence alignment (which is in this case). alignment. Due to this effect of reduction in volume, the improvement in performance gain by our algorithm accelerates as the number n of sequences aligned increases. We remark that our modification in the definition in our formulation (2) is not necessary for improved performance if the motif-matching regions are short (e.g. a single motif). Without the modification, when the constraint involves many motifs and their matching regions are large the alignment of these regions for all n-way matches are costly. 5. Conclusion We present a new dynamic programming algorithm for the regular expression constrained multiple sequence alignment problem. An important application of this problem is alignment of protein sequences with the constraint that optimal alignment contains a given sequence of motifs. We propose a modification in the definition such that the region of the alignment that satisfies the constraint does not contribute to the total score of the alignment. Our algorithm performs computations in two layers each of which corresponds to a part of the dynamic programming matrix for ordinary multiple sequence alignment. The performance of our algorithm depends on the total size (volume) of these layers. Our algorithm is more efficient than previously existing solutions for the regular expression constrained multiple sequence alignment problem. By a proper choice of a sequence of motifs in the constraint, compared to the ordinary (unconstrained) multiple sequence alignment, our algorithm not only requires no additional time and space, but even outperforms it. We verified this result experimentally using real protein sequences, and motifs. References [1] A. N. Arslan. Multiple sequence alignment containing a sequence of regular expressions. Proc.: IEEE 5th Symposium on Intelligent Computations on Bioinformatics and Biotechnology (CIBCB), pp , San Diego, Nov , [2] A. N. Arslan. Regular expression constrained sequence alignment. Combinatorial Pattern Matching (CPM), Lecture Notes in Computer Science 3537, pp , Jeju Island, Korea, June 19-22, [3] A. N. Arslan and Ö. Eğecioğlu. Algorithms for the constrained common sequence problem. International Journal of Foundations of Computer Science, (16)6: , December [4] F. Y. L. Chin, N. L. Ho, T. W. Lam, P. W. H. Wong, M. Y. Chan. Efficient constrained multiple sequence alignment with performance guarantee. Proc. IEEE Computational Systems Bioinformatics (CSB 2003), pp , [5] F. Y.L. Chin, A. D. Santis, A. L. Ferrara, N. L. Ho, S. K. Kim. A simple algorithm for the constrained sequence problems. Information Processing Letters Vol. 90, pp , [6] J.-P. Comet and J. Henry. Pairwise sequence alignment using a PROSITE pattern-derived similarity score. Computers and Chemistry, 26, pp , [7] R. F. Doolittle. Similar amino acid sequences: chance or common ancestry. Science, 214: , [8] D. He and A. N. Arslan. A parallel algorithm for the constrained multiple sequence alignment problem. IEEE 5th Symposium on Bioinformatics and Bioengineering (BIBE), pp , Minneapolis, Minnesota, October 19-21, [9] R. Johnsonbaugh and M. Schaefer. Algorithms. Pearson Prentice Hall, [10] C. Notredame. Recent progresses in multiple sequence alignment: a survey. Pharmacogenomics, 3(1), pp. 1-14, [11] C. Y. Tang, C. L. Lu, M. D.-T. Chang, Y.-T. Tsai, Y.-J. Sun, K.-M. Chao, J.-M. Chang, Y.-H. Chiou, C.-M. Wu, H.-T. Chang, and W.-I. Chou. Constrained multiple sequence alignment tool development and its applications to rnase family alignment. Proceeding of the 1st IEEE Computer Society Bioinformatics Conference (CSB 2002), pp , [12] Y.-T. Tsai. The constrained common sequence problem. Information Processing Letters, 88: , [13] Y.-T. Tsai, C. L. Lu, C. T. Yu, and Y. P. Huang. MuSiC: A tool for multiple sequence alignment with constraint. Bioinformatics, 20(14): , [14] J. E. Walker, M. Saraste, M. J. Runswick, N. J. Gay. Distantly related sequences in the alpha- and beta-subunits of ATP synthase, myosin, kinases and other ATP-requiring enzymes and a common nucleotide binding fold. EMBO J., 1: , [15] M. S. Waterman. Introduction to computational biology. Chapman & Hall, 1995.

Sequence Alignment Guided By Common Motifs Described By Context Free Grammars

Sequence Alignment Guided By Common Motifs Described By Context Free Grammars Abdullah N. Arslan Department of Computer Science University of Vermont Burlington, VT 05405, USA Email: aarslan@cs.uvm.edu