Divya R. Singh. Faster Sequence Alignment using Suffix Tree and Data-Mining Techniques. February A Thesis Presented by


Faster Sequence Alignment using Suffix Tree and Data-Mining Techniques

A Thesis Presented by Divya R. Singh to The Faculty of the Graduate College of the University of Vermont

In Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science

February 2005

Abstract

This thesis presents an application of a generalized suffix tree, extended with pattern frequencies, that performs accurate biological sequence analysis faster than existing tools in this area. The application uses the frequency of prefixes shared by two or more sequences in a generalized suffix tree to identify, with good accuracy, the sequences in a database that are highly similar to a given query sequence. The speedup is achieved by reducing the database to the few sequences found closest to the query sequence, so that the subsequent alignment runs on far less data. The approach can also be viewed as an extension of exact pattern matching, in which the cumulative results of matched patterns indicate the closest sequences. The specific strategy is to pick matched patterns of the query sequence and identify the sequences in the database that share a large number of these matched patterns. Experiments conducted in this study demonstrate that this application outperforms BLAST in computation time while preserving the accuracy of alignments.

Acknowledgements

I would like to express my gratitude to my thesis advisors, Dr. Abdullah Arslan and Dr. Xindong Wu, for their guidance, support and encouragement. I am very thankful to Dr. Marc S. Greenblatt for his assistance with my final defense. I greatly appreciate the time devoted by Dr. Wu and Dr. Arslan to sharing their ideas, critiques and experience for my research. I appreciate the feedback given by Dr. Arslan on the experimental results and performance considerations. This study was supported by a DOE EPSCoR grant for research in computational biology. This thesis would not have been possible without the encouragement and support of my family, especially that of my husband Rajesh and my parents.

Table of Contents

Acknowledgements
List of Figures
List of Tables

1 Introduction
   Motivation
   Problem Definition
   Why do we need string comparison
   Problem Statement
   The Proposed Technique
   Scope and Contributions
   Organization

2 Background and Related Work
   Useful Terminology and Definitions
   String
   Substring
   Prefix
   Suffix
   S(i)
   Match
   Mismatch
   String Comparison
   Exact string matching
   Inexact string matching
   The Suffix Tree
   2.3.1 Definitions
   Longest common substring problem of two strings
   Applications
   Related Work
   Dynamic Programming
   Heuristics

3 Design & Methodology
   Preliminaries
   Importance of Data Structure
   Choice of Algorithm
   Contributing factors
   Patterns
   Function Formula
   Constraint on Length
   Fold Cross-validation
   Ukkonen's Algorithm
   Methodology
   Steps
   Implementation
   Analysis
   Output

4 Results and Discussion
   Experimental Setup
   Data Sets
   Query data
   Hardware and Software Platform
   Performance Metrics
   Experimental Results
   4.3 Controlled Tests
   Varying the function formula
   Varying the length limit on patterns
   Varying the number of patterns for analysis
   Varying the threshold value
   Motivation
   Comparison with BLAST
   Ecoli.nt Database
   Bacteria_dna Database
   Plasmodium_dna Database
   Scenarios

5 Conclusion
   Summary
   Future Work

Bibliography

List of Figures

Suffix Tree for string xabxac
Generalized Suffix Tree for strings S_1 and S_2
GST of strings S_1 and S_2 for the longest common substring problem
Edit Graph for transforming X into Y
Alignment of two sequences X and Y
Computation of Global alignment score GA*
Local alignment involving subsequences ATTGT and AGGACAT
Computation of local alignment score LA*
Extension of local alignment as long as the resulting score is positive
Steps in BLAST
Extension of a hit between query Q and string P of database DB
Two hits requirement
Local Alignment between two strings using BLAST
Steps in FASTA
Example sequence in FASTA format
Runs in 6-fold cross validation
Flowchart for SCT steps
SCT showing generalized suffix tree for query sequence traversals
Processing of the generalized suffix tree
Local alignment between the query and a sequence in the database
A Bacteria_dna sequence
A DNA query sequence
Varying the threshold Value
SCT vs. BLAST (Ecoli.nt)
SCT vs. BLAST (Bacteria_dna)
SCT vs. BLAST (Plasmodium_dna)

List of Tables

Common substrings of n strings
Dataset used
Search results using SCT
Search results using BLAST alone
Varying the function formula
Varying the length limit on patterns
Varying the number of patterns for analysis

CHAPTER 1: Introduction

1.1 Motivation

String similarity has wide application in areas such as biological sequence analysis, information retrieval, pattern recognition, image and signal processing, and optical character recognition. Biological sequence comparison is extremely important to molecular biologists and other scientists. When sequences are similar, they may have descended from a common evolutionary ancestor, or they may have evolved to perform a similar function. Scientists predict human genes either by studying human DNA sequences directly or by comparing them with known, similar, un-annotated genomic sequences of other species such as mouse. Unfortunately, the latter approach in its current state is not suitable for sequence comparison on a large genomic scale [3, 4, 6].

Biological sequence comparison has been studied comprehensively [1, 9, 32, 22, 28, 29]. Among the existing algorithms and heuristics that perform sequence comparison, the most popular are FASTA, BLAST, Smith-Waterman, and algorithms based on suffix arrays. These algorithms rely on the existence of local similarity between the query sequence and the sequences in a given database. Even though they provide accurate or near-optimal results, they have not proved efficient at finding local alignments between two or more sequences within a good computation time.

In sequence comparison, it is important to find locally similar segments. Two segments can be either highly conserved or poorly conserved fragments of two long genomic sequences. When a given sequence is aligned against a large database of sequences, finding the most similar sequence(s) in the pool is a challenge for biologists. The existing algorithms provide local alignments, but face serious constraints in the time required for the comparison procedure. In this situation, they are likely to miss important

alignments or even generate unrelated fragments, leading to problems in comparative gene prediction and in establishing sequence functions. Achieving accurate local alignment between segments of two or more genomic sequences within a reasonable time poses a big challenge to the biotech and pharmaceutical industry and is currently a serious limitation in the field of computational biology.

Our study has been inspired by the use of frequency and by the application of a very efficient data structure, the suffix tree [3, 4, 6], which exposes the internal structure of a sequence in a deep way. This research focuses on how information shared between a given query sequence and a database of sequences can help establish a strong relationship or similarity between sequences. The suffix tree proves to be a powerful tool for exposing this relatedness in the form of shared patterns or substrings. We use the data-mining approach of exploiting the frequency of a shared pattern [15, 38] to preserve accuracy and further strengthen our results.

1.2 Problem Definition

We now give a formal definition of the problem of string comparison. The aim of our research is to provide a means of finding frequent common substrings in a database of sequences that have the highest local similarity with a given query sequence, followed by local alignment.

Why do we need string comparison?

String comparison is an extremely important problem for identifying and presenting the biologically important, yet hidden or widely dispersed, common characteristics of a set of strings. These commonalities can expose evolutionary histories, critical conserved motifs or conserved characters in DNA or proteins, common two- and three-dimensional molecular structures, or clues about the common biological functions of the strings. Such commonalities are also used to

characterize families or superfamilies of proteins. These characterizations are then used in database searches to identify other potential members of a family.

Problem Statement

We define the sequence similarity query as the following problem: given a string Q of length m and a database of candidate sequences S, determine one or more sequences from the database which are similar to string Q. Our purpose is to improve the computation time of the search compared to the existing methods, while preserving the accuracy of alignments between Q and the chosen sequences. We represent the query string as Q[1..m], and the set of sequences in a database as S = {S_1, S_2, S_3, ..., S_k}.

1.3 The Proposed Technique

The proposed study aims to develop a new comparison tool for computational biology, SCT (Sequence Comparison Tool), that is fast and enables local alignment with a high degree of similarity between segments from a query and a database of genomic sequences. These segments are studied as patterns [30, 15] for search in this approach. Our purpose is to present a way to eliminate those sequences in the database which are not similar to the query sequence. This expedites the search while conserving all the related segments for alignment. In this way sequence comparison can be done faster, and without losing any important similarity information. This study uses sequential associations [39] to find related segments. Because of its excellent performance in exact pattern matching, we use the generalized suffix tree data structure for detecting the most similar sequences, while discarding the rest from the database. It is then combined with the efficiency of BLAST in determining local alignments between the query and candidate sequences to

obtain high similarity regions. The reduction in computation time is achieved because we preprocess the database to select a small number of sequences that are potentially similar to the query sequence, and use this smaller database for sequence alignment.

1.4 Scope and Contributions

The scope of the thesis is to develop a new space- and time-efficient technique for the sequence similarity query. We make the following contributions in this thesis:

• Present a new technique for answering the sequence similarity query. Ukkonen's suffix tree algorithm [37] is considered the most memory (space) efficient and fast pattern matching tool for generating a generalized suffix tree data structure. Therefore we use this algorithm in SCT. By taking advantage of the speed of comparison in the generalized suffix tree and the efficiency of BLAST, a faster computation of the sequence similarity query is facilitated.
• Demonstrate the speedup achieved with this new technique over using BLAST alone, through a series of experiments.

1.5 Organization

The remaining chapters are organized as follows. Chapter 2 provides some useful terminology and background on the concepts and methods of sequence comparison. Research efforts in this area of computational biology are also discussed in this chapter. Chapter 3 presents our approach and elaborates on the design. Chapter 4 describes the experiments conducted to compare the two techniques and explains the results. Chapter 5 summarizes the thesis and suggests future work.

CHAPTER 2: Background and Related Work

We begin by providing some important terminology and definitions, followed by a high-level view of the approaches, methods and data structures used in the area of string comparison. We also describe some tools and techniques that have been used so far for the purpose of sequence comparison.

2.1 Useful Terminology and Definitions

1. String
A string S is an ordered list of characters written contiguously from left to right.

2. Substring
For a string S, S[i..j] is the contiguous substring of S that starts at position i and ends at position j. S[i..j] is an empty substring if i > j.

3. Prefix
A prefix of string S is a substring S[1..i] of S that begins at position 1 and ends at position i, where i <= |S|, and |S| is the number of characters in S, i.e. the length of S.

4. Suffix
A suffix of string S is a substring S[i..|S|] of S that begins at position i and ends at the end of the string, where i <= |S|, and |S| is the number of characters in S.

5. S(i)
Given a string S, S(i) denotes the i-th character of S.

6. Match
In the comparison of two characters, if the characters are equal, they are said to match.

7. Mismatch
In the comparison of two characters, if the characters are not equal, they are said to mismatch.

2.2 String Comparison

String comparison can be categorized into two types: exact matching and inexact matching.

Exact String Matching

The exact string matching problem is, given a string P called the pattern and a longer string T called the text, to find all occurrences, if any, of pattern P in text T. For example, let P = xyz and T = rstxyzuvxyz; then P occurs twice in T, beginning at positions 4 and 9.

Approaches for Exact String Matching

In this section we briefly illustrate some techniques implemented for exact string matching. We categorize them into three groups as follows:

Fundamental Preprocessing
The algorithms following this approach ingeniously skip comparisons between string characters by first spending a modest amount of time studying the internal structure of either the pattern P or the text T. This part of the algorithm is called the preprocessing stage. Preprocessing is followed by a search stage, in

which information gathered during the preprocessing stage is used to reduce the amount of work done while searching for occurrences of P in T. An example of this group of algorithms is the Z algorithm [13].

Classical Comparison Based Methods
The algorithms following this approach first align the pattern P contiguously with the text T and check whether P matches the opposing characters of T. On completion of this check, the pattern P is shifted right relative to T. If P has length n and T has length m, this shift is governed by several clever rules that lead to methods that examine fewer than m + n characters and run in linear worst-case time. Some examples of this group of algorithms are the Boyer-Moore algorithm [16, 17, 33, 34, 36], the Knuth-Morris-Pratt algorithm [19], and the Aho-Corasick algorithm [20].

Seminumerical String Matching
The algorithms that follow this approach are based on bit operations or arithmetic instead of character comparisons, so they have an entirely different mechanism. Sometimes, however, character comparisons can be found hidden at the internal levels of these methods. Some examples of this group of algorithms are the Shift-And method [7], the Fast Fourier Transform [12, 11], and the Karp-Rabin fingerprint methods [18].

Inexact String Matching

Inexact string matching is also referred to as approximate string matching. As the name suggests, inexact implies that certain errors of different nature are accepted

while matching [31]. Sequence alignment is the primary approach used in sequence comparison of this type. Sequence alignment can be defined as a scheme of writing one sequence on top of another by lining up the characters of the sequences, allowing mismatches and gaps as well as matches to occur together [23]. This includes permitting the characters of one sequence to be positioned opposite spaces made in the second sequence. These correspond to insertions and deletions in both sequences. Due to the presence of errors in molecular data, and because of the active mutational processes that are modeled and revealed by sequence comparison methods, sequence alignment has become a very significant area in computational molecular biology. Broadly, there are two types of sequence alignment. The first is pairwise sequence alignment, which comprises global and local alignment. The second is multiple sequence alignment.

Global Sequence Alignment (GSA)

Global sequence alignment (GSA) is used to find the best match of both sequences in their entirety [8]. In GSA, two given sequences S_1 and S_2 are aligned completely. First, dashes are inserted anywhere in S_1 and/or S_2, and then the two resulting sequences are placed one above the other so that every character or dash in either sequence is opposite a unique character or a unique dash in the other sequence [20]. An alignment between S_1 = qacdbd and S_2 = qauxb is shown below.

S_1 = q  a  c  -    d    b  d
S_2 = q  a  u  x    -    b  -
      M  M  S  I/D  I/D  M  I/D

Here we have 3 matches (M), 1 substitution (S), and 3 insertions/deletions (I/D). A cost is associated with each GSA. This cost is a combination of the match cost (C_m), the substitution cost (C_s), and the insertion/deletion cost (C_i). Using these costs, a score is computed as a measure of the similarity of the sequences:

Score = f(M, S, I) = M · C_m + S · C_s + I · C_i

where M, S and I are the numbers of matches, substitutions and insertions/deletions in the alignment.

We can summarize the uses of global sequence alignment as follows:

• To deduce evolutionary history by examining sequence differences and similarities for proteins of the same family. Residues aligned at one position are deemed to have a common evolutionary origin, and a position is considered conserved in evolution if the same letter occurs in both sequences.
• To infer, in some cases, structure or function from sequence similarity.
• To determine similarities between sequences that are found in different species, for example the sequences of human α-globin and mouse α-globin.
• To determine similarities between sequences of the same species that differ because of a gene duplication event, for example the sequences of human α-globin and human β-globin.

Local Sequence Alignment (LSA)

Local sequence alignment finds the best approximate subsequence match [8]. LSA differs from GSA in that, instead of trying to align the two given sequences completely, it searches for and extracts a pair of regions, one from each of the two given strings, that exhibit high similarity. In LSA, given two strings S_1 and S_2, we find substrings α and β of S_1 and S_2, respectively, whose similarity (global alignment value) is maximum over all pairs of substrings from S_1 and S_2. Consider the following two strings S_1 and S_2:

S_1 = p q r a x a b c s t v q
S_2 = x y a x b a c s l l

As with GSA, we take a score as the measure of the similarity of these subsequences. Let us give each match (M) a value of 2, each substitution (S) a value of -2, and each space, i.e. insertion/deletion (I/D), a value of -1. Then we have two substrings α and β of S_1 and S_2 whose optimal global alignment is:

α = a  x  a    b  -    c  s
β = a  x  -    b  a    c  s
    M  M  I/D  M  I/D  M  M

Score = f(M, S, I) = (5 × 2) + (2 × -1) = 8

Over all the choices of pairs of substrings from the given two strings, the above two substrings have the maximum similarity [27]. So with this scoring scheme, the optimal local alignment of S_1 and S_2 has value 8 and is defined by the substrings α and β.

LSA is considered more meaningful than GSA in some applications, especially when comparing long sequences of DNA, because in DNA sequences only some internal sections of the strings may be related. We can highlight the applications of local sequence alignment as follows:

• Useful for comparing protein sequences that share a common motif (conserved pattern) or domain (independently folded unit) but differ elsewhere.
• Useful for comparing DNA sequences that share a similar motif but differ elsewhere.
• Useful for comparing protein sequences against genomic DNA sequences (long stretches of uncharacterized sequence).
• More sensitive when comparing highly diverged sequences.
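The scoring example above can be checked with a short dynamic-programming sketch. The Python code below is a minimal illustration, assuming the scores used in the example (match 2, substitution -2, space -1) and a linear cost per space. The function name `local_alignment_score` and its parameters are hypothetical and are not part of SCT. It computes only the optimal local alignment value, in the style of the Smith-Waterman recurrence, and does not recover the substrings α and β.

```python
def local_alignment_score(s1, s2, match=2, substitution=-2, space=-1):
    """Return the best local alignment value of s1 and s2 (Smith-Waterman style)."""
    # previous and current are consecutive rows of the DP table; entry j is the
    # best value of a local alignment ending exactly at the current character pair.
    previous = [0] * (len(s2) + 1)
    best = 0
    for i in range(1, len(s1) + 1):
        current = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            pair = match if s1[i - 1] == s2[j - 1] else substitution
            current[j] = max(
                0,                       # start a new local alignment here
                previous[j - 1] + pair,  # match or substitution
                previous[j] + space,     # character of s1 opposite a space
                current[j - 1] + space,  # character of s2 opposite a space
            )
            best = max(best, current[j])
        previous = current
    return best


print(local_alignment_score("pqraxabcstvq", "xyaxbacsll"))  # prints 8
```

Keeping only two rows of the table is a design choice that suits score-only queries; recovering the alignment itself would require the full table or a traceback structure.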

Multiple Sequence Alignment (MSA)

Multiple sequence alignment can be defined as the alignment of a set of sequences {S_1, S_2, ..., S_n}, where n > 2. Dashes are inserted in each of the n strings so that the resulting strings have equal length l. Next, the strings are arrayed in n rows and l columns so that each character or dash in each string has a unique location [25]. We can summarize the uses of multiple sequence alignment as follows:

• MSA is considered most important for the purpose of revealing evolutionary history.
• It reveals the critically conserved motifs or the conserved characters in DNA or protein sequences.
• It identifies clues about common biological function.

These features can very easily be missed if only a pair of sequences is compared. Multiple sequence alignment is beyond the scope of our study, so we have covered it only briefly here.

Gap

A gap is any maximal, consecutive run of spaces in a single string of a given alignment [14]. Gaps help to create alignments that better conform to underlying biological models and more closely fit the patterns that one expects to find in meaningful alignments. Consider the following mutations:

(i) ACGA → AGGA [Substitution]
(ii) ACGA → ACCGA [Insertion]
(iii) ACGA → AGA [Deletion]

The mutations (ii) and (iii) will result in gaps in the alignments. That is, insertions and deletions form gaps.

Approaches for Inexact String Matching

We now introduce some established approaches to inexact string matching using sequence alignment. We can define either a distance function or a similarity function for an alignment.

Dynamic Programming
This approach breaks down the string matching problem into a series of incremental steps or sub-problems. It then provides a solution to the main matching problem by combining the solutions to the sub-problems. Dynamic programming [5] can be used when a problem can be divided into sub-problems and these sub-problems are not independent, i.e. when sub-problems share smaller sub-problems of their own. Dynamic programming is used to compute global sequence alignment: it determines the best alignment of two sequences by identifying the best alignment of all the suffixes of the sequences. An example of a dynamic programming solution to global sequence alignment is the Needleman & Wunsch algorithm [26]. Similarly, to compute local sequence alignment, dynamic programming is used to determine the score of the best alignment between a substring of the first sequence and a substring of the second sequence. An example of a dynamic programming solution to local sequence alignment is the Smith & Waterman algorithm [35]. The advantage of the dynamic programming solution is that it produces accurate results. The disadvantage is that it is too slow and not practical when the

sequences are long or the database is very large, when answering a sequence similarity query.

Heuristic Method
This is the most popular approach among biological applications. It establishes the relatedness of two strings by measuring their similarity instead of their distance, which was the measure in the dynamic programming approach described above. Heuristics give approximation algorithms that are used to obtain near-optimal results [10]. These approximations are either with respect to the objective function and its value, or with respect to the constraints. When searching large databases, or when the sequences are long, it is not practical to use dynamic programming because of its time requirement. Using a heuristic method we can find regions of high local similarity in alignments with gaps. Heuristics also help when we search a whole database for matches to a given query. Some examples of popular heuristic algorithms are BLAST [1, 9, 32], FASTA [22, 28, 29], and PSI-BLAST [2]. The advantage of the heuristic sequence alignment algorithms is that they run nearly 50 times faster than dynamic programming. The disadvantage is that heuristics give approximate results, so there is a likelihood of missing an alignment or producing inaccurate output.

2.3 The Suffix Tree

The suffix tree is a data structure that represents the internal structure of a string in a comprehensive manner. We can use it to solve the exact matching problem in linear time O(n), where n is the length of the string. This data structure enables linear-time solutions for numerous problems related to strings. Several algorithms

have been written to build a suffix tree data structure. The first linear-time suffix-tree construction algorithm was developed by Weiner in 1973 [38], who called the structure a position tree. A few years later, McCreight [24] came up with a suffix-tree construction algorithm that achieved a better space complexity. Eventually Ukkonen [37] developed a linear-time suffix-tree construction algorithm that incorporated all the benefits of McCreight's algorithm and also offered a simpler implementation.

2.3.1 Definitions

1. Suffix Tree
A suffix tree τ for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. Figure 1 illustrates an example of a suffix tree. Each internal node, other than the root, has at least two children, and each edge is labeled with a non-empty substring of S. No two edges out of a node can have edge labels beginning with the same character [14]. The key feature of the suffix tree is that for any leaf i, the concatenation of the edge labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i. That is, it spells out S[i..m]. Let us construct a suffix tree for the string S = xabxac.

[Figure 1. Suffix tree for string xabxac]

2. Label
The label of a path from the root that ends at a node is the concatenation, in order, of the substrings labeling the edges of that path. The path-label of a node is the label of the path from the root of τ to that node [14].

3. String-depth
For any node ν in a suffix tree, the string-depth of ν is the number of characters in ν's label [20].

4. Split
A path that ends in the middle of an edge (u, v) splits the label on (u, v) at a designated point. A new node is introduced at the location of the split [14].
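The definitions above can be made concrete with a small sketch. The Python code below builds an uncompressed suffix trie (one character per edge) by inserting every suffix, which takes quadratic time and space. This is a deliberate simplification: it is neither the compact suffix tree of Figure 1 nor Ukkonen's linear-time construction [37]. The names `build_suffix_trie`, `leaf_positions` and `occurrences`, and the `"$leaf"` marker key, are hypothetical choices made for illustration.

```python
def build_suffix_trie(s):
    """Naive suffix trie: insert each suffix of s character by character."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        # Mark the end of the suffix with its 1-based starting position.
        node["$leaf"] = start + 1
    return root


def leaf_positions(node):
    """Collect the starting positions of all suffixes below a node."""
    found = []
    stack = [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "$leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


def occurrences(root, pattern):
    """Exact matching: walk the pattern from the root, then read the leaves."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    return leaf_positions(node)


trie = build_suffix_trie("xabxac")
print(leaf_positions(trie))     # prints [1, 2, 3, 4, 5, 6]
print(occurrences(trie, "xa"))  # prints [1, 4]
```

The walk in `occurrences` mirrors the key feature of the suffix tree: every occurrence of a pattern is a prefix of some suffix, so the leaves below the matched path give its starting positions.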

5. Generalized Suffix Tree (GST)
A generalized suffix tree is a suffix tree that combines the suffixes of a set of strings {S_1, S_2, ..., S_n}. A GST can be constructed easily and quickly. First build a suffix tree (ST) for S_1 (assuming an added terminal character $). Then, starting at the root of this tree, match S_2 against a path in the tree until a mismatch occurs. At that point, add the remaining characters of the suffix of S_2 to the ST. Once every suffix of S_2 has been processed in this way, the tree encodes all the suffixes of S_1 and of S_2 [25]. Let us take two strings S_1 = xabxa and S_2 = babxba. Then we can construct the generalized suffix tree shown in Figure 2.

Figure 2. Generalized suffix tree for strings S_1 and S_2.

In the above figure, a leaf's label consists of two numbers. The first number is the string number and the second number is the starting position of the suffix in that string. For example, the leaf labeled (2,3) corresponds to the suffix of string S_2 = babxba that starts at location 3, i.e. bxba$.

Longest Common Substring Problem of Two Strings

Given two strings S_1 and S_2, the longest common substring is a substring that appears in both S_1 and S_2 and has the largest possible length. A generalized suffix tree is an easy and efficient data structure for solving this problem. We build a GST for S_1 and S_2. Each leaf of the tree represents either a suffix from one of the two strings or a suffix that occurs in both strings. Mark each internal node ν with a 1 (and/or 2) if there is a leaf in the subtree of ν representing a suffix from S_1 (and/or S_2). The path-label of any internal node marked both 1 and 2 is a substring common to both S_1 and S_2, and the longest such string is the longest common substring. The algorithm finds the deepest node (the one with the highest number of characters on the path to it) that is marked both 1 and 2 (and therefore has two leaves). This problem can be solved in linear time, since the construction of the GST can be done in linear time. The time is proportional to the total length of S_1 and S_2. The node markings and the calculations of string-depth can be done by known linear-time tree traversal methods [25]. Consider two strings S_1 = xabxa and S_2 = babxba. The GST with the common substrings of S_1 and S_2 is shown in Figure 3.

Figure 3. GST of strings S_1 and S_2 for the longest common substring problem.

Each internal node maintains a list containing 1 (and/or 2), depending on whether the node's path-label occurs in S_1, in S_2, or in both strings. Here the longest common substring of S_1 and S_2 is found to be abx. We discussed the longest common substring problem here because we will be using a similar concept in our approach to the sequence similarity query. We will differ from this application in three ways:

1. We will use as input a query string and a database of strings.
2. The substring should be common between the query and some other strings in the database.
3. We will identify common substrings that meet certain threshold values.
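The node-marking procedure described in this section can be sketched in a few lines of Python. As a simplifying assumption, the code below uses an uncompressed generalized suffix trie built in quadratic time rather than a linear-time GST, so it illustrates the marking idea but not the linear running time. The function names and the `min_strings` parameter are hypothetical and are not taken from SCT.

```python
def build_generalized_suffix_trie(strings):
    """Insert every suffix of every string; each node records the ids of the strings reaching it."""
    root = {"children": {}, "ids": set()}
    for string_id, s in enumerate(strings, start=1):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(ch, {"children": {}, "ids": set()})
                node["ids"].add(string_id)
    return root


def shared_substrings(strings, min_strings=2):
    """Map each substring found in at least min_strings strings to the ids of those strings."""
    shared = {}
    stack = [(build_generalized_suffix_trie(strings), "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node["children"].items():
            # A child's id set is a superset of its descendants' sets,
            # so subtrees below the threshold can be skipped entirely.
            if len(child["ids"]) >= min_strings:
                shared[label + ch] = set(child["ids"])
                stack.append((child, label + ch))
    return shared


def longest_common_substring(s1, s2):
    """Deepest path label marked with both string ids (ties broken alphabetically)."""
    shared = shared_substrings([s1, s2], min_strings=2)
    return max(sorted(shared), key=len, default="")


print(longest_common_substring("xabxa", "babxba"))  # prints abx
```

The `min_strings` argument plays the role of a simple threshold on how many strings must share a substring; the thresholds actually used by SCT are defined later in the thesis and are not reproduced here.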

For instance, we can have common substrings of n strings, where n = 4. Let the four strings be S_1 = xabxa, S_2 = babxba, S_3 = abx, and S_4 = ba.

Node Children Depth
1 1, 2, ,2, , 2, , 2, , 4 2

Table 1. Common substrings of n strings

Here a generalized suffix tree has been generated for the four strings {1, 2, 3, 4}. Table 1 shows the string depth and the list of the common substrings covered at some internal nodes for these four strings [21]. A substring of length 3, identified at node 4, is common to strings 1, 2 and 3. Another substring of length 2, identified at node 7, is common to strings 2 and 4. Similarly, there are four other common substrings, which are listed with their depth and the strings sharing them.

Applications

Some of the well-known applications of generalized suffix trees are:

1. Longest Common Substring Problem of Two Strings. Given two strings S_1 and S_2, the longest common substring is a substring that appears in both S_1 and S_2 and has the largest possible length. A GST is built for the two strings to obtain the longest common substring.

2. All Pairs Suffix-Prefix Matching Problem. Given a set of sequences {S_1, ..., S_n}, this problem is about finding, for each ordered pair (S_i, S_j) in the set, the longest suffix-prefix match of (S_i, S_j). This problem can be solved in linear time using a GST as the main data structure.

3. All Maximal Repeats Problem in a Single Sequence. A sequence can contain repetitive subsequences. These repeats can occur either adjacent to each other (tandem repeats) or apart, anywhere in the sequence. A GST can be used to find all maximal repeats in linear time.

4. Circular DNA Sequences Problems. A circular sequence S of length n is defined as a sequence in which character n is considered to precede character 1. The characters in the sequence are numbered sequentially from 1 to n, starting at an arbitrary character in S. Given two circular sequences of the same length, a GST can be used to determine in linear time whether they are equal.

2.4 Related Work

This section reviews some earlier work in the area of sequence comparison and alignment.

Dynamic Programming

If two sequences (DNA or protein) are highly similar, this suggests that they have a similar 3D structure or share a similar function. We can measure the similarity of the genomes of different species by knowing their evolutionary distance. Levenshtein (1966) introduced the notion of edit distance, and string similarity can be studied using this concept. Let X = x_1 x_2 ... x_n and Y = y_1 y_2 ... y_m be two strings over an alphabet Σ, with n >= m. To compute the similarity between X and Y, we transform X into Y through a sequence of edit operations, called an edit sequence. For 1 ≤ i ≤ n, the edit operations applicable to the symbols of X to transform it into Y are of three types:

• Insertion: any symbol s in Σ can be inserted before or after x_i,

Deletion: the symbol $x_i$ can be deleted,

Substitution: the symbol $x_i$ can be replaced by a symbol s in $\Sigma$. A substitution operation is a matching substitution if $s = x_i$, and a non-matching substitution if $s \ne x_i$.

A common framework for computing edit distance is the edit graph $G_{X,Y}$ of the strings X and Y, together with a given cost function $\gamma$. The edit graph is a directed acyclic graph whose vertices are the $(n+1)(m+1)$ lattice points $(i,j)$ for $0 \le i \le n$ and $0 \le j \le m$. The top-left extreme point of this rectangular grid is the vertex (0,0) and the bottom-right extreme point is the vertex (n,m). Consider two strings X = aba and Y = bab. The edit graph for X and Y is shown in Figure 4.

Figure 4. Edit graph for transforming X into Y.

An edit path in $G_{X,Y}$ is a directed path from (0,0) to (n,m). The arcs of an edit path correspond to an edit sequence as follows:

A horizontal arc $((i,j-1),(i,j))$ corresponds to the insertion of $y_j$ immediately before $x_i$ (i.e. $\varepsilon \to y_j$),

A vertical arc $((i-1,j),(i,j))$ corresponds to the deletion of $x_i$ (i.e. $x_i \to \varepsilon$),

A diagonal arc $((i-1,j-1),(i,j))$ corresponds to the substitution of symbol $y_j$ for $x_i$ (i.e. $x_i \to y_j$).

Here $\varepsilon$ represents the null string. A cost function $\gamma$ assigns a weight to each edit operation, turning $G_{X,Y}$ into a weighted graph $G_{X,Y,\gamma}$. The edit distance problem seeks an edit sequence with minimum total weight over all edit sequences. The edit distance between X and Y is defined as the weight of such an optimal sequence. The edit distance can be computed in O(nm) time.

Needleman-Wunsch Algorithm

Having introduced the notion of edit distance, we can also formalize the relatedness of two strings by calculating a similarity score rather than their edit distance. The Needleman-Wunsch algorithm [26] introduced global sequence alignment and gave a dynamic programming algorithm whose time complexity is cubic. The time complexity is improved to quadratic by the algorithm in Figure 6. Consider two sequences, X = ATTGT and Y = AGGACAT, as shown in Figure 5.

Figure 5. Alignment of two sequences X and Y.
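The shortest-path view of the edit graph described above can be sketched in Python. This is a minimal illustration, not the thesis implementation: the function name `edit_distance` and the unit costs (`indel=1`, `mismatch=1`, matching substitution free) are assumptions standing in for a general cost function $\gamma$.

```python
def edit_distance(x, y, indel=1, mismatch=1):
    """Weight of a cheapest edit path from (0,0) to (n,m) in the edit graph.

    Assumed cost function: every insertion or deletion costs `indel`,
    a non-matching substitution costs `mismatch`, a matching one costs 0.
    """
    n, m = len(x), len(y)
    # prev[j] is the cost of the cheapest path to vertex (i-1, j).
    prev = [j * indel for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel] + [0] * m
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else mismatch
            curr[j] = min(
                prev[j] + indel,      # vertical arc: delete x_i
                curr[j - 1] + indel,  # horizontal arc: insert y_j
                prev[j - 1] + sub,    # diagonal arc: substitute y_j for x_i
            )
        prev = curr
    return prev[m]
```

Each vertex is visited once and inspects three incoming arcs, which gives the O(nm) running time; only two rows of the grid are kept in memory at a time.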

We can use the edit graph to visualize all possible alignments between the two strings X and Y. In the context of sequence alignment the edit graph is called the alignment graph, and insertions (horizontal arcs) and deletions (vertical arcs) are both called indels. The names match and mismatch refer to matching and mismatching diagonal arcs, respectively. In the simplest scoring scheme, the arcs of the alignment graph are assigned weights determined by non-negative reals $\delta$ (mismatch penalty) and $\mu$ (indel, or gap, penalty). In Figure 6, $s(x_i, y_j)$ denotes the similarity score between the symbols $x_i$ and $y_j$, which is normally 1 for a match ($x_i = y_j$) and $-\delta$ for a mismatch ($x_i \ne y_j$). The optimum global alignment score GA* between X and Y can be computed by the Needleman-Wunsch algorithm as shown in Figure 6. This takes O(nm) time and O(m) space.

Simple scoring scheme. $S_{i,j}$: score achieved at $(i,j)$.

\[ S_{i,j} = \max\{\, S_{i-1,j} - \mu,\; S_{i-1,j-1} + s(x_i, y_j),\; S_{i,j-1} - \mu \,\} \]

where $S_{i,0} = -i\mu$, $S_{0,j} = -j\mu$, and $s(x_i, y_j) = 1$ for a match and $-\delta$ otherwise. Therefore $GA^* = S_{n,m}$, computed in O(nm) time and O(m) space.

Figure 6. Computation of global alignment score GA*.

The steps of the global alignment algorithm in Figure 6 can be outlined as:

1. Assign the similarity values.

2. For each cell, look at all the possible pathways back to the beginning of the sequence (allowing insertions and deletions) and give that cell the value of the maximum scoring pathway.

3. Construct an alignment (pathway) back from the highest scoring cell to give the highest scoring alignment.

Features of the algorithm in Figure 6:

1. Classical algorithm for sequence comparison.

2. Maximizes the similarity score to give the maximum match.

3. Maximum match = largest number of residues of one sequence that can be matched with another, allowing for all possible deletions.

4. Finds the best global alignment of any two sequences.

5. Involves an iterative matrix method of calculation.

All possible pairs of residues (bases or amino acids), one from each sequence, are represented in a two-dimensional array. All possible alignments (comparisons) are represented by pathways through this array.

Smith-Waterman Algorithm

We now discuss an algorithm that provides a dynamic programming solution to the problem of local sequence alignment. It was given by Smith and Waterman [35]. As shown in Figure 7, when two sequences are compared there may be only a relatively small region in each that approximately matches. The Smith-Waterman algorithm therefore aims to detect local similarities in sequences. The difference between a local and a global alignment is that a local alignment may involve any pair of subsequences I and J of X and Y, respectively.
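For reference alongside the local formulation, the global recurrence of Figure 6 can be sketched in Python as follows. This is an illustrative sketch only: the function name `global_alignment_score` and the default penalties `delta=1.0` and `mu=1.0` are assumptions, and the code returns the score GA* without reconstructing the alignment itself.

```python
def global_alignment_score(x, y, delta=1.0, mu=1.0):
    """Global alignment score GA* under the simple scoring scheme.

    Assumptions: a match scores +1, a mismatch scores -delta, and each
    indel scores -mu. Only two rows are stored, so space is O(m).
    """
    n, m = len(x), len(y)
    # Row i = 0: S[0][j] = -j * mu.
    prev = [-j * mu for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column j = 0: S[i][0] = -i * mu.
        curr = [-i * mu] + [0.0] * m
        for j in range(1, m + 1):
            s = 1.0 if x[i - 1] == y[j - 1] else -delta
            curr[j] = max(
                prev[j] - mu,      # deletion of x_i
                prev[j - 1] + s,   # match or mismatch
                curr[j - 1] - mu,  # insertion of y_j
            )
        prev = curr
    return prev[m]
```

Recovering the alignment (step 3 of the outline) would require keeping the full matrix or traceback pointers, which this sketch deliberately omits.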

Figure 7. Local alignment involving subsequences ATTGT and AGGACAT.

In local alignment any pair of subsequences may be involved, but the algorithm computes the optimal alignments on the subsequences. The Smith-Waterman algorithm determines the maximum local alignment score $S_{i,j}$ ending at each vertex $(i,j)$ for the basic scoring scheme. Figure 8 shows the Smith-Waterman formulation for local alignment under the simple scoring scheme:

Simple scoring scheme. $S_{i,j}$: similarity score achieved at $(i,j)$.

\[ S_{i,j} = \max\{\, 0,\; S_{i-1,j} - \mu,\; S_{i-1,j-1} + s(x_i, y_j),\; S_{i,j-1} - \mu \,\} \]

where $S_{i,j} = 0$ whenever $i = 0$ or $j = 0$, and $s(x_i, y_j) = 1$ for a match and $-\delta$ otherwise. Therefore $LA^* = \max_{i,j} S_{i,j}$, computed in O(nm) time and O(m) space.

Figure 8. Computation of local alignment score LA*.
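The recurrence of Figure 8 can be sketched in Python in the same two-row style. As before, the function name `local_alignment_score` and the default penalties are assumptions made for illustration, and only the score LA* is returned.

```python
def local_alignment_score(x, y, delta=1.0, mu=1.0):
    """Local alignment score LA* under the simple scoring scheme.

    Assumptions: a match scores +1, a mismatch scores -delta, and each
    indel scores -mu. The extra 0 in the max lets an alignment restart
    at any vertex, which is what makes the alignment local.
    """
    n, m = len(x), len(y)
    best = 0.0
    prev = [0.0] * (m + 1)          # S[0][j] = 0
    for i in range(1, n + 1):
        curr = [0.0] * (m + 1)      # S[i][0] = 0
        for j in range(1, m + 1):
            s = 1.0 if x[i - 1] == y[j - 1] else -delta
            curr[j] = max(
                0.0,               # start a new local alignment here
                prev[j] - mu,      # deletion of x_i
                prev[j - 1] + s,   # match or mismatch
                curr[j - 1] - mu,  # insertion of y_j
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

For example, `local_alignment_score("XXACGTYY", "ZZACGTWW")` picks out the shared region ACGT and ignores the unrelated flanks.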

The Smith-Waterman algorithm requires only O(m) space because only O(m) entries of the dynamic programming matrix need to be stored at any given time. The algorithm extends a local alignment as long as the resulting score is positive, as shown in Figure 9. Some extensions increase the score while others decrease it.

Figure 9. Extension of local alignment as long as the resulting score is positive.

We can outline the four possible ways of forming a path using the Smith-Waterman algorithm as follows. For every residue in one sequence:

1. Align with the next residue of the second sequence. The score is the previous score plus the similarity score for the two residues.

2. For deletion (i.e. matching a residue of the query with a gap), the score is the previous score minus a gap penalty that depends on the size of the gap and the gap open penalty.

3. For insertion (i.e. matching a residue of the second sequence with a gap), the score is the previous score minus a gap penalty that depends on the size of the gap and the gap open penalty.

4. Stop extending the alignment if the score is less than zero.

Then choose whichever of these is the highest.

Features of the Smith-Waterman algorithm:

1. Instead of looking at each sequence in its entirety, it compares segments of all possible lengths (local alignments) and chooses whichever maximizes the similarity measure.

2. For every cell, the algorithm calculates all possible paths leading to it. These paths can be of any length and can contain insertions, deletions and substitutions.

3. These computations are incorporated into the dynamic programming solution in Figure 6, which yields the Smith-Waterman local alignment algorithm for the simple scoring scheme shown in Figure 8.

Heuristics

Dynamic programming is not practical when the sequences are long. Heuristic sequence alignment algorithms are approximation algorithms that are 10 to 50 times faster and are used to obtain near-optimal results.

BLAST

BLAST stands for Basic Local Alignment Search Tool. BLAST [1] is the dominant search engine for biological sequence databases. It heuristically finds high-scoring local alignments and is typically used to search a query sequence against a database of sequences.

Given two strings $S_1$ and $S_2$, a segment pair is a pair of equal-length substrings of $S_1$ and $S_2$, aligned without spaces. A locally maximal segment pair is a segment pair whose alignment score (without spaces) would fall if the segments were either extended or shortened on either side. A maximal segment pair (MSP) in $S_1$, $S_2$ is a segment pair with the maximum score over all segment pairs in $S_1$, $S_2$. BLAST directly approximates alignments that optimize the maximal segment pair (MSP) score.

This heuristic algorithm can be applied in a variety of contexts, including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and the analysis of multiple regions of similarity in long DNA sequences. The key tradeoff it makes is sensitivity versus speed. Sensitivity can be defined as the ability to detect correct matches: it is the ratio of the number of significant matches detected to the number of significant matches in the database.

History: BLAST1 was created in 1990 and is dedicated to the search for regions of local similarity without gaps. BLAST2 was created as an extension of BLAST1 that allows the insertion of gaps. Two versions of BLAST2 were developed independently, namely NCBI_BLAST2 [NCBIBLAST] by the National Center for Biotechnology Information in 1997 and WU_BLAST2 [WUBLAST] by Washington University.

Algorithm BLAST1

This algorithm concentrates on finding regions of high local similarity in alignments without gaps, evaluated by an alphabet-weight scoring matrix. Alignments with some gaps can be created by chaining together several locally similar regions that BLAST finds. The fundamental objects that concern BLAST are segment pairs, locally maximal segment pairs, and maximal segment pairs.
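To make the definition of a maximal segment pair concrete, the MSP score of two short strings can be computed exactly by scanning every diagonal of the comparison grid. The Python sketch below is an illustration under stated assumptions: the function name `msp_score` is hypothetical, and a plain +1/-1 match/mismatch score stands in for the alphabet-weight scoring matrix. It is an exhaustive O(nm) computation, not the heuristic that BLAST uses.

```python
def msp_score(s1, s2, match=1, mismatch=-1):
    """Score of a maximal segment pair (ungapped) between s1 and s2.

    Assumed scoring: `match` for identical symbols, `mismatch` otherwise.
    Every segment pair lies on one diagonal i - j = d, so the best
    segment on each diagonal is found with a running maximum.
    """
    n, m = len(s1), len(s2)
    best = 0  # the empty segment pair scores 0
    for d in range(-(m - 1), n):
        # First cell of diagonal d inside the n-by-m grid.
        i, j = (d, 0) if d >= 0 else (0, -d)
        run = 0
        while i < n and j < m:
            step = match if s1[i] == s2[j] else mismatch
            # Either extend the current segment or start a new one here.
            run = max(0, run) + step
            best = max(best, run)
            i += 1
            j += 1
    return best
```

For instance, `msp_score("AAAATAAAA", "AAAACAAAA")` bridges the single mismatch in the middle because the flanking matches outweigh it.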

Figure 10. Steps in BLAST

We can outline the steps of BLAST1 shown in Figure 10 as follows.

Given: query sequence Q, word length W, word score threshold T, segment score threshold S.

1. Compile a list of words that score at least T when compared to words from Q.

2. Scan the database for matches to words in the list.

3. Extend all matches to seek high-scoring ungapped alignments. Then return all alignments scoring at least S.

Let us understand this with an example.

EXAMPLE:

Query sequence: QLNFSAGW
Word length w = 2
Word score threshold T = 8

Step 1: Determine all words of length w in the query sequence.

QL LN NF FS SA AG GW

Step 2: Determine all words that score at least T when compared to a word in the query sequence.

Words from query sequence    Words scoring at least T = 8
QL                           QL=11, QM=9, HL=8, ZL=9
LN                           LN=9, LB=8
NF                           NF=12, AF=8, NY=8, DF=10, ..
SA                           none

While searching the database for all occurrences of query words, we apply the following approach: index the database sequences into a table of words (pre-computed), and index the query words into a table (at query time).

Step 3: Extending hits.

1. Extend the hits in both directions without allowing gaps.

2. Terminate the extension in one direction when the score falls a certain distance below the best score found for shorter extensions.

3. Return all the segment pairs scoring at least S.

Figure 11. Extension of a hit between query Q and string P of database DB.

A hit, shown in red, is being extended in Figure 11 without allowing gaps.

Algorithm BLAST2

The most important feature of NCBI-BLAST2 is that it allows local alignment with gaps. The first two steps, leading to the generation of primary hits, are the same as those in BLAST1. The third step includes two major refinements:

1. Two-hit requirement: do extension only when there are two closely spaced hits on the same diagonal.

2. Gapped extension: allow gaps in extensions.

The Two-Hit Method

1. The extension step typically accounts for 90% of BLAST's execution time.

2. Do extension only when there are two hits on the same diagonal within distance A of each other.

3. To maintain sensitivity, lower the T parameter. This results in more single hits being found, of which only a small fraction will have an associated second hit.

Figure 12. Two-hit requirement

In Figure 12 there are two hits, marked in red, on the diagonal. The two hits are within a distance of A units from each other.

Gapped Extension

The following steps are applied in gapped extension:

1. Trigger the gapped alignment if the two-hit extension has a sufficiently high score.

2. Find the length-11 segment with the highest score and use the central pair in this segment as the seed.

3. Run the dynamic programming process both forward and backward from the seed.

4. Prune cells when the local alignment score falls a certain distance below the best score found so far.

Procedure

When comparing all the sequences in a database with a fixed query sequence P, BLAST attempts to find all those database sequences that, together with P, contain an MSP above some cut-off score C. The choice of C is guided by the scoring

matrix and the characteristics of P and of the database sequences. Any sequence with an MSP score above C is considered significant and reported.

Score

The scoring scheme of BLAST is based on PAM matrices. PAM matrices are amino acid substitution matrices that encode and summarize the expected evolutionary change at the amino acid level. Each PAM matrix is designed to be used to compare pairs of sequences that are a specific number of PAM units diverged. Figure 13 shows an example output of a local alignment between a query and a database sequence. It is accompanied by the score for this alignment.

    Score = 28.2 bits (14), Expect = 1.9
    Identities = 20/22 (90%)
    Strand = Plus / Minus

    Query: 390 gttgactgcacttccagccagg 411
    Sbjct: 239 gttgactgatcttccagccagg 218

Figure 13. Local alignment between two strings using BLAST.

Two sequences $S_1$ and $S_2$ are considered one PAM unit diverged if a series of accepted point mutations (and no insertions or deletions) has converted $S_1$ to $S_2$ with an average of one accepted point-mutation event per one hundred amino acids. For any specific pair of amino acid characters, denoted $A_i$, $A_j$, the $(i,j)$ entry in the PAM n matrix reflects the frequency with which $A_i$ is expected to replace $A_j$ in two sequences that are n PAM units diverged. Let $f(i,j)$ denote the resulting frequency, and let $f(i)$ and $f(j)$, respectively, be the frequencies with which amino acids $A_i$ and $A_j$ appear in the sequences.

Then the $(i,j)$th entry for the ideal PAM n matrix is

\[ \log \frac{f(i,j)}{f(i)\, f(j)} . \]

The reason for dividing $f(i,j)$ by $f(i) f(j)$ is to normalize the true (historical) replacement frequency by the frequency one expects due to chance alone.

Summary

BLAST is successful because of its speed, its range of solutions, and the estimate of statistical significance it provides for the matches found in sequences.

FASTA

FASTA [9, 29] is another popular algorithm for string comparison. It was developed in 1985 and later improved.

Procedure

FASTA works by comparing a query string against a single text string. When searching a whole database for matches to a given query, the query string is compared to every string in the database using the FASTA algorithm. When looking for an alignment, one might expect to find a few segments in which there is absolute identity between the two compared strings. The algorithm uses this property and focuses on these identical regions.

Figure 14. Steps in FASTA

Let us take two sequences X and Y. From Figure 14 we can study the four stages of the FASTA algorithm:

1. Specify an integer parameter called ktup (k respective tuples), and look for ktup-length matching substrings between X and Y.

2. Rescore the runs of identities using a PAM substitution matrix. Keep the top-scoring segments.

3. Apply a joining threshold to eliminate segments that are unlikely to be part of the alignment that includes the highest-scoring segment.

4. Use dynamic programming to optimize the alignment in a narrow band that encompasses the top-scoring segments.

FASTA format

    >gi pir TVFV2E TVFV2E envelope protein
    ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGS
    QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLR
    HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHG
    MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWL
    TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVE
    APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGIL
    LAAVEAQQQMLKLTIWGVK

Figure 15. Example of a sequence in FASTA format.

Figure 15 shows the format of a sample sequence as used by the FASTA algorithm.

Summary

The alignment scores obtained with FASTA are found to compare well with those of the optimal alignment, while the algorithm is also much faster than the ordinary dynamic programming algorithm for sequence alignment.
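The first FASTA stage, locating ktup-length identical substrings and grouping them by diagonal, can be sketched in Python as follows. This is a simplified illustration rather than the FASTA program itself: the function name `ktup_diagonal_hits` is hypothetical, and the later rescoring, joining and banded dynamic programming stages are not included.

```python
from collections import defaultdict


def ktup_diagonal_hits(x, y, ktup=2):
    """Count exact ktup-length matches between x and y on each diagonal.

    A match of x[i:i+ktup] with y[j:j+ktup] lies on diagonal i - j.
    Diagonals with many hits are candidates for high-scoring segments.
    """
    # Look-up table: every ktup-length word of x -> its start positions.
    positions = defaultdict(list)
    for i in range(len(x) - ktup + 1):
        positions[x[i:i + ktup]].append(i)

    hits = defaultdict(int)
    for j in range(len(y) - ktup + 1):
        # .get avoids creating empty entries for words absent from x.
        for i in positions.get(y[j:j + ktup], ()):
            hits[i - j] += 1
    return dict(hits)
```

Because the word table is built once and each word of y is looked up directly, the scan avoids comparing every position of X against every position of Y, which is where the speed advantage over full dynamic programming comes from.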

CHAPTER 3: Design & Methodology

This chapter is divided into preliminaries, contributing factors, algorithm and methodology. The methodology broadly comprises generating a generalized suffix tree (GST) of the database sequences, pre-processing the GST to add sequence information, post-processing the GST to extract patterns, isolating the sequences associated with the patterns and ranking them, and finally performing pair-wise local alignment between the query and the chosen sequences. The GST, being the backbone of the method, is constructed with a carefully selected linear-time algorithm.

3.1 Preliminaries

To begin with, we cover some important requirements.

Importance of Data Structure

For the correct representation and use of data, it is extremely important to pick the right data structure for its storage. Knowledge can be retrieved from a wealth of information only if we handle the data logically and follow a clear and simple implementation. The suffix tree is an excellent data structure for solving problems related to all kinds of sequences. The internal details of a biological sequence can be revealed more effectively with this tool, because it offers the robustness of a tree structure and the advantage of linear-time, O(m), construction, where m is the length of the sequence.

Choice of Algorithm

Several algorithms have been proposed for the linear-time construction of the suffix tree. One algorithm stands out among existing work, namely Ukkonen's method [37], given by Esko Ukkonen. Ukkonen's algorithm with O(m)
