Divya R. Singh. Faster Sequence Alignment using Suffix Tree and Data-Mining Techniques.


Faster Sequence Alignment using Suffix Tree and Data-Mining Techniques

A Thesis Presented by Divya R. Singh to The Faculty of the Graduate College of the University of Vermont

In Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science

February 2005

Abstract

This thesis presents an application of a generalized suffix tree, extended by the use of pattern frequency, to perform accurate and faster biological sequence analysis, improving on the computation time of existing tools in this area. The application uses knowledge of the frequency of prefixes shared by two or more sequences in a generalized suffix tree to identify, with good accuracy, the sequences in a database that are highly similar to a given query sequence. The speedup is achieved by reducing the database to the few sequences found closest to the query sequence, which results in a faster computation. The approach can also be viewed as an extension of exact pattern matching, where the cumulative results of matched patterns indicate the closest sequences. The specific strategy is to pick matched patterns of the query sequence and identify sequences in the database that share a large number of these matched patterns. Experiments conducted in this study demonstrate that this application outperforms BLAST in computation time while preserving the accuracy of alignments.

Acknowledgements

I would like to express my gratitude to my thesis advisors, Dr. Abdullah Arslan and Dr. Xindong Wu, for their guidance, support and encouragement. I am very thankful to Dr. Marc S. Greenblatt for his assistance with my final defense. I greatly appreciate the time devoted by Dr. Wu and Dr. Arslan to sharing their ideas, critiques and experience for my research. I appreciate the feedback given by Dr. Arslan on the experimental results and performance considerations. This study was supported by a DOE EPSCoR grant for research in computational biology. This thesis would not have been possible without the encouragement and support of my family, especially that of my husband Rajesh and my parents.

Table of Contents

Acknowledgements
List of Figures
List of Tables

1 Introduction
  Motivation
  Problem Definition
  Why do we need string comparison
  Problem Statement
  The Proposed Technique
  Scope and Contributions
  Organization

2 Background and Related Work
  Useful Terminology and Definitions
  String
  Substring
  Prefix
  Suffix
  S(i)
  Match
  Mismatch
  String Comparison
  Exact string matching
  Inexact string matching
  The Suffix Tree
  2.3.1 Definitions
  Longest common substring problem of two strings
  Applications
  Related Work
  Dynamic Programming
  Heuristics

3 Design & Methodology
  Preliminaries
  Importance of Data Structure
  Choice of Algorithm
  Contributing factors
  Patterns
  Function Formula
  Constraint on Length
  Fold Cross-validation
  Ukkonen's Algorithm
  Methodology
  Steps
  Implementation
  Analysis
  Output

4 Results and Discussion
  Experimental Setup
  Data Sets
  Query data
  Hardware and Software Platform
  Performance Metrics
  Experimental Results
  4.3 Controlled Tests
  Varying the function formula
  Varying the length limit on patterns
  Varying the number of patterns for analysis
  Varying the threshold value
  Motivation
  Comparison with BLAST
  Ecoli.nt Database
  Bacteria_dna Database
  Plasmodium_dna Database
  Scenarios

5 Conclusion
  Summary
  Future Work

Bibliography

List of Figures

1. Suffix Tree for string xabxac
2. Generalized Suffix Tree for strings S_1 and S_2
3. GST of strings S_1 and S_2 for the longest common substring problem
4. Edit Graph for transforming X into Y
5. Alignment of two sequences X and Y
6. Computation of Global alignment score GA*
7. Local alignment involving subsequences ATTGT and AGGACAT
8. Computation of local alignment score LA*
9. Extension of local alignment as long as the resulting score is positive
10. Steps in BLAST
11. Extension of a hit between query Q and string P of database DB
12. Two hits requirement
13. Local Alignment between two strings using BLAST
14. Steps in FASTA
15. Example sequence in FASTA format
16. Runs in 6-fold cross validation
17. Flowchart for SCT steps
18. SCT showing generalized suffix tree for query sequence traversals
19. Processing of the generalized suffix tree
20. Local alignment between the query and a sequence in the database
21. A Bacteria_dna sequence
22. A DNA query sequence
23. Varying the threshold Value
24. SCT vs. BLAST (Ecoli.nt)
25. SCT vs. BLAST (Bacteria_dna)
26. SCT vs. BLAST (Plasmodium_dna)

List of Tables

1. Common substrings of n strings
2. Dataset used
3. Search results using SCT
4. Search results using BLAST alone
5. Varying the function formula
6. Varying the length limit on patterns
7. Varying the number of patterns for analysis

CHAPTER 1: Introduction

1.1 Motivation

String similarity has wide application in areas such as biological sequence analysis, information retrieval, pattern recognition, image and signal processing, and optical character recognition. Biological sequence comparison is extremely important to molecular biologists and other scientists. When sequences are similar, they may have either descended from a common evolutionary ancestor or evolved to perform a similar function. Scientists predict human genes by studying human DNA sequences or by using known and similar, un-annotated genomic sequences of other species such as mouse. Unfortunately, the latter approach in its current state is not suitable for sequence comparison on a large genomic scale [3, 4, 6].

A comprehensive study has been conducted in the area of biological sequence comparison [1, 9, 32, 22, 28, 29]. Among the existing algorithms and heuristics that perform sequence comparison, the most popular are FASTA, BLAST, Smith-Waterman, and algorithms based on suffix arrays. These algorithms use the existence of local similarity between the query sequence and the sequences in a given database for their analysis. Even though these algorithms provide accurate or near-optimal results, they have not proved efficient at finding local alignments between two or more sequences within a good computation time.

In sequence comparison, it is important to find locally similar segments. Two segments can be either highly conserved or poorly conserved fragments of two long genomic sequences. When a given sequence is aligned against a large database of sequences, finding the most similar sequence(s) in the pool is a challenge for biologists. The existing algorithms provide local alignments, but face serious constraints in terms of the time required for the comparison procedure. In this situation, they are likely to miss important

alignments or even generate unrelated fragments, leading to problems in comparative gene prediction and in establishing sequence functions. Achieving accurate local alignment between segments of two or more genomic sequences within a reasonable time poses a big challenge to the biotech and pharmaceutical industry and is currently a serious limitation in the field of computational biology.

Our study has been inspired by the use of frequency and the application of a very efficient data structure, the suffix tree [3, 4, 6], which exposes the internal structure of a sequence in a deep way. This research focuses on how information shared between a given query sequence and a database of sequences can help establish a strong relationship or similarity between sequences. The suffix tree proves to be a powerful tool for exposing this relatedness in the form of shared patterns or substrings. We use the data-mining approach of exploiting the frequency of a shared pattern [15, 38] to preserve accuracy and further strengthen our results.

1.2 Problem Definition

We now give a formal definition of the problem of string comparison. The aim of our research is to provide a means of finding frequent common substrings in a database of sequences that have the highest local similarity with a given query sequence, followed by local alignment.

1.2.1 Why do we need string comparison?

String comparison is an extremely important problem for identifying and presenting the biologically important, yet hidden or widely dispersed, common characteristics of a set of strings. These commonalities can expose evolutionary histories, critical conserved motifs or conserved characters in DNA or proteins, common two- and three-dimensional molecular structures, or clues about the common biological functions of the strings. Such commonalities are also used to

characterize families or superfamilies of proteins. These characterizations are then used in database searches to identify other potential members of a family.

1.2.2 Problem Statement

We define the sequence similarity query as the following problem: Given a string Q of length m and a database of candidate sequences S, determine one or more sequences from the database which are similar to string Q. Our purpose is to improve the computation time of the search compared to the existing methods, while preserving the accuracy of alignments between Q and the chosen sequences. We represent the query string as Q[1..m], and the set of sequences in a database as S = {S_1, S_2, S_3, ..., S_k}.

1.3 The Proposed Technique

The proposed study aims to develop a new comparison tool, SCT (Sequence Comparison Tool), for computational biology that is fast and enables local alignment with a high degree of similarity between segments from a query and a database of genomic sequences. These segments are studied as patterns [30, 15] for search in this approach. Our purpose is to present a way to eliminate those sequences in the database which are not similar to the query sequence. This expedites the search mechanism while conserving all the related segments for alignment. In this way sequence comparison can be done faster, and without losing any important similarity information. This study uses sequential associations [39] to find related segments. Because of its excellent performance in exact pattern matching, we use the generalized suffix tree data structure for detecting the most similar sequences, while discarding the rest from the database. It is then combined with the efficiency of BLAST in determining local alignments between the query and candidate sequences to

obtain high similarity regions. The reduction in computation time is achieved because we preprocess the database to select a small number of sequences potentially similar to the query sequence and use this smaller database for sequence alignment.

1.4 Scope and Contributions

The scope of the thesis is to develop a new space- and time-efficient technique for the sequence similarity query. We make the following contributions in this thesis:

- Present a new technique for answering the sequence similarity query. Ukkonen's suffix tree algorithm [37] is considered the most memory (space) efficient and fast pattern matching tool for generating a generalized suffix tree data structure. Therefore we use this algorithm in SCT. By taking advantage of the speed of comparison in the generalized suffix tree and the efficiency of BLAST, a faster computation of the sequence similarity query is facilitated.
- Demonstrate the speedup achieved with this new technique over using BLAST alone, through a series of experiments.

1.5 Organization

The remaining chapters are organized as follows. Chapter 2 provides some useful terminology and background on the concepts and methods of sequence comparison. Research efforts in this area of computational biology are also discussed in this chapter. Chapter 3 presents our approach and elaborates on the design. Chapter 4 describes the experiments conducted to compare the two techniques and explains the results. Chapter 5 summarizes the thesis and suggests future work.

CHAPTER 2: Background and Related Work

We begin by providing some important terminology and definitions, followed by a high-level view of the approaches, methods and data structures used in the area of string comparison. We also describe some tools and techniques that have been used so far for the purpose of sequence comparison.

2.1 Useful Terminology and Definitions

1. String. A string S is an ordered list of characters written contiguously from left to right.
2. Substring. For a string S, S[i..j] is the contiguous substring of S that starts at position i and ends at position j. S[i..j] is an empty substring if i > j.
3. Prefix. A prefix of string S is a substring S[1..i] of S that begins at position 1 and ends at position i, where i <= |S| and |S| is the number of characters in S, that is, the length of S.
4. Suffix. A suffix of string S is a substring S[i..|S|] of S that begins at position i and ends at the end of the string, where i <= |S| and |S| is the number of characters in S.
5. S(i). Given a string S, S(i) denotes the i-th character of S.

6. Match. In the comparison of two characters, if the characters are equal, they are said to match.
7. Mismatch. In the comparison of two characters, if the characters are not equal, they are said to mismatch.

2.2 String Comparison

String comparison can be categorized into two types, exact matching and inexact matching.

2.2.1 Exact String Matching

The exact string matching problem, given a string P called the pattern and a longer string T called the text, is to find all occurrences, if any, of pattern P in text T. For example, let P = xyz and T = rstxyzuvxyz; then P occurs twice in T, beginning at positions 4 and 9.

Approaches for Exact String Matching

In this section we briefly illustrate some techniques implemented for exact string matching. We categorize them into three groups as follows:

Fundamental Preprocessing

The algorithms following this approach ingeniously skip comparisons between string characters by first spending a modest amount of time studying the internal structure of either the pattern P or the text T. This part of the algorithm is called the preprocessing stage. Preprocessing is followed by a search stage, in

which information gathered during the preprocessing stage is used to reduce the amount of work done while searching for occurrences of P in T. An example of this group of algorithms is the Z algorithm [13].

Classical Comparison Based Methods

The algorithms following this approach first align the pattern P contiguously with the text T and check whether P matches the opposing characters of T. On completion of this check, the pattern P is shifted right relative to T. If P has length n and T has length m, this shift of the pattern is implemented with several clever rules that lead to a method that examines fewer than m+n characters and runs in linear worst-case time. Some examples of this group of algorithms are the Boyer-Moore algorithm [16, 17, 33, 34, 36], the Knuth-Morris-Pratt algorithm [19], and the Aho-Corasick algorithm [20].

Seminumerical String Matching

The algorithms that follow this approach are based on bit operations or arithmetic instead of character comparisons, so they have an entirely different mechanism. Sometimes, however, character comparisons can be found hidden at the internal levels of these methods. Some examples of this group of algorithms are the Shift-And method [7], the Fast Fourier Transform [12, 11], and the Karp-Rabin fingerprint methods [18].

2.2.2 Inexact String Matching

Inexact string matching is also referred to as approximate string matching. As the name suggests, inexact implies that certain errors of different nature are accepted

while matching [31]. Sequence alignment is the primary approach used in sequence comparison of this type. Sequence alignment can be defined as a scheme of writing one sequence on top of another by lining up the characters of the sequences, allowing mismatches, gaps and matches to occur together [23]. This includes permitting the characters of one sequence to be positioned opposite spaces made in the second sequence. These correspond to inserts and deletes in both sequences. Because of the presence of errors in molecular data and because of the active mutational processes that are modeled and revealed by sequence comparison methods, sequence alignment has become a very significant area in computational molecular biology. Broadly there are two types of sequence alignment. The first is pairwise sequence alignment, which comprises global and local alignment. The second is multiple sequence alignment.

Global Sequence Alignment (GSA)

Global sequence alignment (GSA) is used to find the best match of both sequences in their entirety [8]. In GSA, two given sequences S_1 and S_2 are aligned completely. First dashes are inserted anywhere in S_1 and/or S_2, and then the two resulting sequences are placed one above the other so that every character or dash in either sequence is opposite a unique character or a unique dash in the other sequence [20]. The alignment between S_1 and S_2 is shown below.

S_1 =  q   a   c   -    d    b   d
S_2 =  q   a   u   x    -    b   -
       M   M   S   I/D  I/D  M   I/D

Here we have 3 matches (M), 1 substitution (S), and 3 insertions/deletions (I/D). A cost is associated with each GSA. This cost is a combination of the match cost (C_m), substitution cost (C_s), and insertion/deletion cost (C_i). Using these costs, a score is computed as a measure of the similarity of the sequences:

Score = f(M, S, I) = M · C_m + S · C_s + I · C_i

We can summarize the uses of global sequence alignment as follows:

- It is used to deduce evolutionary history by examining sequence differences and similarities for proteins of the same family. Residues in one position are deemed to have a common evolutionary origin, and a position is conserved in evolution if the same letter occurs in both sequences. We can sometimes infer structure or function from sequence similarity.
- To determine similarities between sequences that are found in different species, for example the sequences of human α-globin and mouse α-globin.
- To determine similarities between sequences of the same species that differ because of a gene duplication event, for example the sequences of human α-globin and human β-globin.

Local Sequence Alignment (LSA)

Local sequence alignment finds the best approximate subsequence match [8]. LSA differs from GSA in that, instead of trying to align the two given sequences completely, it searches for and extracts a pair of regions, one from each of the two given strings, that exhibit high similarity. In LSA, given two strings S_1 and S_2, we find substrings α and β of S_1 and S_2, respectively, whose similarity (global alignment value) is maximum over all pairs of substrings from S_1 and S_2. Consider the following two strings S_1 and S_2:

S_1 = p q r a x a b c s t v q
S_2 = x y a x b a c s l l

As with GSA, we use a score as a measure of the similarity of these subsequences. Let us give each match (M) a value of 2, each substitution (S) a value of -2, and each space, i.e. insertion/deletion (I/D), a value of -1. Then we have two substrings α and β of S_1 and S_2 whose optimal global alignment is:

α =  a   x   a    b   -    c   s
β =  a   x   -    b   a    c   s
     M   M   I/D  M   I/D  M   M

Score = f(M, S, I) = (5 × 2) + (2 × -1) = 8

Over all choices of pairs of substrings from the two given strings, the above two substrings have the maximum similarity [27]. So with this scoring scheme, the optimal local alignment of S_1 and S_2 has value 8 and is defined by the substrings α and β. LSA is considered more meaningful than GSA in some applications, especially when comparing long sequences of DNA, because in DNA sequences only some internal sections of the strings may be related. We can highlight the applications of local sequence alignment as follows:

- Useful for comparing protein sequences that share a common motif (conserved pattern) or domain (independently folded unit) but differ elsewhere.
- Useful for comparing DNA sequences that share a similar motif but differ elsewhere.
- Useful for comparing protein sequences against genomic DNA sequences (long stretches of uncharacterized sequence).
- More sensitive when comparing highly diverged sequences.
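The local alignment value in the example above can be checked with a short dynamic programming sketch. The code below is an illustration only, not the implementation used in this thesis: it assumes the standard Smith-Waterman recurrence with a linear space penalty, and the function name and parameter names are hypothetical. The default scores mirror the example (match 2, substitution -2, space -1).

```python
def local_alignment_score(s1, s2, match=2, mismatch=-2, gap=-1):
    """Return the best local alignment score of s1 and s2.

    Illustrative sketch using the Smith-Waterman recurrence with a
    linear space penalty; only two rows of the table are kept.
    """
    cols = len(s2) + 1
    prev = [0] * cols          # scores for the previous row of the table
    best = 0
    for i in range(1, len(s1) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            same = s1[i - 1] == s2[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            # A local alignment may start anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The two strings from the example: the optimal local alignment has value 8.
print(local_alignment_score("pqraxabcstvq", "xyaxbacsll"))
```

The sketch returns only the score; recovering the substrings α and β themselves would require keeping the full table and tracing back from the best cell.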

Multiple Sequence Alignment (MSA)

Multiple sequence alignment can be defined as the alignment of a set of sequences {S_1, S_2, ..., S_n}, where n > 2. Dashes are inserted in each of the n strings so that the resulting strings have equal length l. Next the strings are arrayed in n rows and l columns so that each character or dash in each string has a unique location [25]. We can summarize the uses of multiple sequence alignment as follows:

- MSA is considered most important for the purpose of revealing evolutionary history.
- It reveals the critically conserved motifs or the conserved characters in DNA or protein sequences.
- It identifies clues about common biological function.

These features can very easily be missed if only a pair of sequences is compared. Multiple sequence alignment is beyond the scope of our study, so we have covered it only briefly here.

Gap

A gap is any maximal, consecutive run of spaces in a single string of a given alignment [14]. Gaps help to create alignments that better conform to underlying biological models and more closely fit the patterns that one expects to find in meaningful alignments. Consider the following mutations:

(i) ACGA → AGGA [Substitution]
(ii) ACGA → ACCGA [Insertion]
(iii) ACGA → AGA [Deletion]

The mutations (ii) and (iii) will result in gaps in the alignments. That is, insertions and deletions form gaps.

Approaches for Inexact String Matching

We now introduce some established approaches to inexact string matching using sequence alignment. We can define either a distance or a similarity function for an alignment.

Dynamic Programming

This approach breaks down the string matching problem into a series of incremental steps or sub-problems. It then provides a solution to the main matching problem by combining the solutions to the sub-problems. Dynamic programming [5] can be used when a problem can be divided into sub-problems and when these sub-problems are not independent, i.e. when sub-problems share smaller sub-problems. Dynamic programming is used to compute global sequence alignment so that it determines the best alignment of two sequences by identifying the best alignment of all the suffixes of the sequences. An example of a dynamic programming solution to global sequence alignment is the Needleman & Wunsch algorithm [26]. Similarly, to compute local sequence alignment, dynamic programming is used to determine the score of the best alignment between a substring of the first sequence and a substring of the second sequence. An example of a dynamic programming solution to local sequence alignment is the Smith & Waterman algorithm [35]. The advantage of the dynamic programming solution is that it produces accurate results. The disadvantage is that it is too slow and not practical when the

sequences are long or the database size is very large, when answering a sequence similarity query.

Heuristic Method

This is the most popular approach among biological applications. It establishes the relatedness of two strings by measuring their similarity instead of their distance, which was the measure followed in the earlier approach of dynamic programming. Heuristics give approximation algorithms that are used to obtain near-optimal results [10]. These approximations are either with respect to the objective function and value, or with respect to the constraints. When searching large databases or when the sequences are long, it is not practical to use dynamic programming because of its time requirement. Using a heuristic method we can find regions of high local similarity in alignments with gaps. Heuristics also help when we search a whole database for matches to a given query. Some examples of popular heuristic algorithms are BLAST [1, 9, 32], FASTA [22, 28, 29], and PSI-BLAST [2]. The advantage of the heuristic sequence alignment algorithms is that they run nearly 50 times faster than dynamic programming. The disadvantage is that heuristics give approximate results and there is a likelihood of missing an alignment or giving inaccurate output.

2.3 The Suffix Tree

The suffix tree is a data structure that represents the internal structure of a string in a comprehensive manner. We can use it to solve the exact matching problem in linear time O(n), where n is the length of the string. This data structure enables linear-time solutions for numerous problems related to strings. Several algorithms

have been written to build a suffix tree data structure. The first linear-time suffix-tree construction algorithm was developed by Weiner in 1973 [38]. He referred to it as a position tree. A few years later, McCreight [24] came up with a suffix-tree construction algorithm that achieved a better space complexity. Eventually Ukkonen [37] built a linear-time suffix-tree construction algorithm that incorporated all the benefits of McCreight's algorithm and also offered a simpler implementation.

2.3.1 Definitions

1. Suffix Tree. A suffix tree τ for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. Figure 1 illustrates an example of a suffix tree. Each internal node, other than the root, has at least two children, and each edge is labeled with a non-empty substring of S. No two edges out of a node can have edge labels beginning with the same character [14]. The key feature of the suffix tree is that for any leaf i, the concatenation of the edge labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i. That is, it spells out S[i..m]. Let us construct a suffix tree for the string S = xabxac.

Figure 1. Suffix tree for string xabxac.

2. Label. The label of a path from the root that ends at a node is the concatenation, in order, of the substrings labeling the edges of that path. The path-label of a node is the label of the path from the root of τ to that node [14].
3. String-depth. For any node ν in a suffix tree, the string-depth of ν is the number of characters in ν's label [20].
4. Split. A path that ends in the middle of an edge (u, v) splits the label on (u, v) at a designated point. A new node is introduced at the location of the split [14].

5. Generalized Suffix Tree (GST). A generalized suffix tree is a suffix tree that combines the suffixes of a set of strings {S_1, S_2, ..., S_n}. A GST can be constructed easily and quickly. First build a suffix tree (ST) for S_1 (assuming an added terminal character $). Then, starting at the root of this tree, match S_2 against a path in the tree until a mismatch occurs. At that point add the remaining characters of the suffix of S_2 to the ST. When it is fully processed, the tree will encode all the suffixes of S_1 and of S_2 [25]. Let us take two strings S_1 = xabxa and S_2 = babxba. Then we can construct the generalized suffix tree shown in Figure 2.

Figure 2. Generalized suffix tree for strings S_1 and S_2.

In the above figure, a leaf's label consists of two numbers. The first number is the string number and the second number is the starting position of the suffix in that string. For example, the leaf labeled (2,3) corresponds to the suffix of string S_2 = babxba that starts at location 3, i.e. bxba$.

2.3.2 Longest Common Substring Problem of Two Strings

Given two strings S_1 and S_2, the longest common substring is a substring that appears in both S_1 and S_2 and has the largest possible length. A generalized suffix tree is an easy and efficient data structure for solving this problem. We build a GST for S_1 and S_2. Each leaf of the tree represents either a suffix from one of the two strings or a suffix that occurs in both strings. Mark each internal node ν with a 1 (and/or 2) if there is a leaf in the subtree of ν representing a suffix from S_1 (and/or S_2). The path-label of any internal node marked both 1 and 2 is a substring common to both S_1 and S_2, and the longest such string is the longest common substring. The algorithm finds the deepest node (the one with the highest number of characters on the path to it) that is marked both 1 and 2 (and therefore has at least two leaves below it).

This problem can be solved in linear time, since the construction of the GST can be done in linear time. The time is proportional to the total length of S_1 and S_2. The node markings and the calculations of string-depth can be done by the known linear-time tree traversal methods [25]. Consider two strings S_1 = xabxa and S_2 = babxba. Then the GST with common substrings of S_1 and S_2 can be shown in Figure 3.

Figure 3. GST of strings S_1 and S_2 for the longest common substring problem.

Each internal node maintains a list containing 1 (and/or 2), depending on whether that node is common to both strings. Here the longest common substring of S_1 and S_2 is found to be abx. We discussed the longest common substring problem here because we will be using a similar concept in our approach to the sequence similarity query. We will differ from this application in three ways:

1. We will use as input a query string and a database of strings.
2. The substring should be common between the query and some other strings in the database.
3. We will identify common substrings that meet certain threshold values.
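The node-marking idea described above can be sketched in a few lines. The code below is a simplified illustration and not the construction used in SCT: it assumes a naive, uncompressed generalized suffix trie built in quadratic time rather than a compact suffix tree built with Ukkonen's algorithm, and the function name and the dictionary layout (`children`, `owners`) are hypothetical. Each node records which strings contribute a suffix passing through it, and the deepest node marked with both 1 and 2 spells the longest common substring.

```python
def longest_common_substring(s1, s2):
    """Return a longest common substring of s1 and s2.

    Naive sketch: inserts every suffix of both strings into an
    uncompressed trie and marks each node with the strings (1 and/or 2)
    whose suffixes pass through it.
    """
    root = {"children": {}, "owners": set()}
    for owner, text in ((1, s1), (2, s2)):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "owners": set()}
                )
                node["owners"].add(owner)

    best = ""
    stack = [(root, "")]          # (node, path-label of that node)
    while stack:
        node, label = stack.pop()
        for ch, child in node["children"].items():
            # Only descend through nodes marked both 1 and 2; below an
            # unshared node nothing can be shared either.
            if child["owners"] == {1, 2}:
                path = label + ch
                if len(path) > len(best):
                    best = path
                stack.append((child, path))
    return best


# The strings from Figure 3: the longest common substring is "abx".
print(longest_common_substring("xabxa", "babxba"))
```

The trie trades the linear time and space of a true generalized suffix tree for brevity, so it is suitable only for short strings such as the example.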

For instance, we can have common substrings of n strings, where n = 4. Let the four strings be S_1 = xabxa, S_2 = babxba, S_3 = abx, and S_4 = ba.

Table 1. Common substrings of n strings (columns: Node, Children, Depth).

Here a generalized suffix tree has been generated for the four strings {1, 2, 3, 4}. Table 1 shows the string depth and the list of common substrings covered at some internal nodes for these four strings [21]. A substring of length 3, identified at node 4, is common to strings 1, 2 and 3. Another substring of length 2, identified at node 7, is common to strings 2 and 4. Similarly there are four other common substrings, which have been listed with their depth and the strings sharing them.

2.3.3 Applications

Some of the well-known applications of generalized suffix trees are:

1. Longest Common Substring Problem of Two Strings. Given two strings S_1 and S_2, the longest common substring is a substring that appears in both S_1 and S_2 and has the largest possible length. A GST is built for the two strings to obtain the longest common substring.
2. All Pairs Suffix-Prefix Matching Problem. Given a set of sequences {S_1, ..., S_n}, this problem is about finding, for each ordered pair S_i, S_j in the set, the longest suffix-prefix match of (S_i, S_j). This problem can be solved in linear time using a GST as the main data structure.

3. All Maximal Repeats Problem in a Single Sequence. A sequence can contain repetitive subsequences. These repeats can occur either adjacent to each other (tandem repeats) or apart, anywhere in the sequence. A GST can be used to find all maximal repeats in linear time.
4. Circular DNA Sequences Problems. A circular sequence S of length n is defined as a sequence in which character n is considered to precede character 1. The characters in a sequence are numbered sequentially from 1 to n, starting at an arbitrary character in S. Given two circular sequences of the same length, a GST can be used to compare these two sequences in linear time to determine whether they are equal.

2.4 Related Work

This section reviews some earlier work done in the area of sequence comparison and alignment.

2.4.1 Dynamic Programming

If two sequences (DNA or protein) are highly similar, it implies that they have similar 3D structure or share similar function. We can measure the similarity of the genomes of different species by knowing their evolutionary distance. Levenshtein (1966) introduced the notion of edit distance. String similarity can be studied using the concept of edit distance. Let X = x_1 x_2 ... x_n and Y = y_1 y_2 ... y_m be two strings over an alphabet Σ with n >= m. To compute the similarity between X and Y, we transform X into Y through a sequence of edit operations, called an edit sequence. For 1 ≤ i ≤ n, the edit operations applicable to the symbols of X to transform it into Y are of three types:

- Insertion: any symbol s in Σ can be inserted before or after x_i,

Deletion: the symbol x_i can be deleted,

Substitution: the symbol x_i can be replaced by a symbol s in Σ. A substitution operation is:
a matching substitution if s = x_i,
a non-matching substitution if s ≠ x_i.

A common framework used for computing edit distance is the edit graph G_{X,Y} of the strings X and Y and a given cost function γ. The edit graph is a directed acyclic graph having the (n+1)(m+1) lattice points (i,j), for 0 ≤ i ≤ n and 0 ≤ j ≤ m, as vertices. The top-left extreme point of this rectangular grid is the vertex (0,0) and the bottom-right extreme point is the vertex (n,m). Consider two strings X = aba and Y = bab. The edit graph for X and Y is shown in Figure 4.

[Edit graph for X = aba and Y = bab: vertices (0,0) through (3,3), with arcs labeled by the corresponding edit operations.]

Figure 4. Edit graph for transforming X into Y.

An edit path in G_{X,Y} is a directed path from (0,0) to (n,m). Arcs of an edit path correspond to an edit sequence as follows:

A horizontal arc ((i,j-1),(i,j)) corresponds to the insertion of y_j immediately before x_i (i.e. ε → y_j),
A vertical arc ((i-1,j),(i,j)) corresponds to the deletion of x_i (i.e. x_i → ε),
A diagonal arc ((i-1,j-1),(i,j)) corresponds to the substitution of symbol y_j for x_i (i.e. x_i → y_j).

Here ε represents the null string. A cost function γ assigns a weight to each edit operation, turning G_{X,Y,γ} into a weighted graph. The edit distance problem thus seeks an edit sequence with minimum total weight over all edit sequences. The edit distance between X and Y is defined as the weight of such an optimal sequence. The edit distance can be computed in O(nm) time.

Needleman-Wunsch Algorithm

Having understood the notion of edit distance, we can also formalize the relatedness of two strings by calculating a similarity score rather than their edit distance. The Needleman-Wunsch algorithm [26] introduced global sequence alignment and gave a dynamic programming algorithm whose time complexity is cubic. The time complexity is improved to quadratic by the algorithm in Figure 6. Consider two sequences, X = ATTGT and Y = AGGACAT, as shown in Figure 5.

Figure 5. Alignment of two sequences X and Y.
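The O(nm) edit distance computation described in this section can be sketched as a shortest-path calculation over the lattice points of the edit graph. The sketch below is illustrative only: it assumes a unit cost function (insertions, deletions and non-matching substitutions cost 1, matching substitutions cost 0), which is one possible choice of γ, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum-weight edit path from (0,0) to (n,m), assuming unit costs."""
    n, m = len(x), len(y)
    # d[i][j] = minimum cost of transforming x[:i] into y[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i              # delete the first i symbols of x
    for j in range(1, m + 1):
        d[0][j] = j              # insert the first j symbols of y
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # assumed cost: 0 for a matching substitution, 1 otherwise
            sub = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion of x_i
                d[i][j - 1] + 1,        # insertion of y_j
                d[i - 1][j - 1] + sub,  # substitution of y_j for x_i
            )
    return d[n][m]


print(edit_distance("aba", "bab"))
```

For the strings X = aba and Y = bab of Figure 4, one optimal edit sequence under these assumed costs deletes the leading a and inserts a trailing b.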

We can use the edit graph to visualize all possible alignments between the two strings X and Y. In the context of sequence alignment the edit graph is called the alignment graph, and insertions (horizontal arcs) and deletions (vertical arcs) are both called indels. The names match and mismatch refer to matching and mismatching diagonal arcs, respectively. In the simplest scoring scheme, the arcs of the alignment graph are assigned weights determined by non-negative reals δ (mismatch penalty) and µ (indel, or gap, penalty). In Figure 6, s(x_i, y_j) denotes the similarity score between the symbols x_i and y_j, which is normally 1 for a match (x_i = y_j) and -δ for a mismatch (x_i ≠ y_j). The optimum global alignment score GA* between X and Y can be computed by the Needleman-Wunsch algorithm as shown in Figure 6. This takes O(nm) time and O(m) space.

Simple scoring scheme:
S_{i,j}: score achieved at (i,j).
S_{i,j} = max{ S_{i-1,j} - µ, S_{i-1,j-1} + s(x_i, y_j), S_{i,j-1} - µ }
where S_{i,j} = -iµ when j = 0, S_{i,j} = -jµ when i = 0, and s(x_i, y_j) = 1 for a match, -δ otherwise.
Therefore, GA* = S_{n,m}, in O(nm) time and O(m) space.

Figure 6. Computation of global alignment score GA*.

The steps in the global alignment algorithm in Figure 6 can be outlined as:

1. Assign the similarity values.

2. For each cell, look at all the possible pathways back to the beginning of the sequence (allowing insertions and deletions) and give that cell the value of the maximum scoring pathway.

3. Construct an alignment (pathway) back from the highest scoring cell to give the highest scoring alignment.

Features of the algorithm in Figure 6:

1. Classical algorithm for sequence comparison.
2. Maximizes the similarity score to give the maximum match.
3. Maximum match = largest number of residues of one sequence that can be matched with another, allowing for all possible deletions.
4. Finds the best global alignment of any two sequences.
5. Involves an iterative matrix method of calculation. All possible pairs of residues (bases or amino acids), one from each sequence, are represented in a 2-dimensional array. All possible alignments (comparisons) are represented by pathways through this array.

Smith-Waterman Algorithm

Now we discuss an algorithm that provides a dynamic programming solution to the problem of local sequence alignment. It was given by Smith and Waterman [35]. As shown in Figure 7, when comparing two sequences there may be only a relatively small region in the sequences that actually approximately matches. The Smith-Waterman algorithm therefore aims to detect local similarities in sequences. The difference between a local and a global alignment is that a local alignment may involve any pair of subsequences I and J of X and Y, respectively.
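The global alignment recurrence of Figure 6 can be sketched in a few lines. This is a minimal illustration rather than the thesis implementation: the function name `global_alignment_score` and the default penalties are hypothetical, and it assumes that a mismatch contributes -δ to the score because δ is described as a non-negative penalty. Only two rows of the matrix are kept, which is what gives the O(m) space bound for computing the score alone.

```python
def global_alignment_score(x: str, y: str, delta: float = 1.0, mu: float = 1.0) -> float:
    """Score GA* = S[n][m] under the simple scoring scheme, in O(m) space."""
    n, m = len(x), len(y)
    # Row i = 0: S[0][j] = -j * mu
    prev = [-j * mu for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column j = 0: S[i][0] = -i * mu
        curr = [-i * mu] + [0.0] * m
        for j in range(1, m + 1):
            # assumed similarity: 1 for a match, -delta for a mismatch
            s = 1.0 if x[i - 1] == y[j - 1] else -delta
            curr[j] = max(
                prev[j] - mu,       # vertical arc (indel)
                prev[j - 1] + s,    # diagonal arc (match or mismatch)
                curr[j - 1] - mu,   # horizontal arc (indel)
            )
        prev = curr
    return prev[m]


print(global_alignment_score("ATTGT", "AGGACAT"))
```

Recovering the alignment itself, and not just its score, needs either the full matrix for traceback or a divide-and-conquer scheme; the sketch deliberately returns the score only.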

Figure 7. Local alignment involving subsequences ATTGT and AGGACAT.

In local alignment any pair of subsequences may be involved, but the algorithm computes the optimal alignments on the subsequences. The Smith-Waterman algorithm determines the maximum local alignment score S_{i,j} ending at each vertex (i,j) for the basic scoring scheme. Figure 8 shows the Smith-Waterman formulation for local alignment under the simple scoring scheme:

Simple scoring scheme:
S_{i,j}: similarity score achieved at (i,j).
S_{i,j} = max{ 0, S_{i-1,j} - µ, S_{i-1,j-1} + s(x_i, y_j), S_{i,j-1} - µ }
where S_{i,j} = 0 whenever i = 0 or j = 0, and s(x_i, y_j) = 1 for a match, -δ otherwise.
Therefore, LA* = max S_{i,j}, in O(nm) time and O(m) space.

Figure 8. Computation of local alignment score LA*.
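The local alignment recurrence of Figure 8 differs from the global one only in the extra 0 term and the zero boundary. The following sketch shows this under the same assumptions as before: match score 1, mismatch score -δ (assumed sign), gap penalty µ. The name `local_alignment_score` and the default parameter values are illustrative.

```python
def local_alignment_score(x: str, y: str, delta: float = 1.0, mu: float = 1.0) -> float:
    """Score LA* = max over (i,j) of S[i][j], keeping only two rows in memory."""
    m = len(y)
    prev = [0.0] * (m + 1)       # S[0][j] = 0
    best = 0.0
    for i in range(1, len(x) + 1):
        curr = [0.0] * (m + 1)   # S[i][0] = 0
        for j in range(1, m + 1):
            # assumed similarity: 1 for a match, -delta for a mismatch
            s = 1.0 if x[i - 1] == y[j - 1] else -delta
            curr[j] = max(
                0.0,                # start a new local alignment here
                prev[j] - mu,
                prev[j - 1] + s,
                curr[j - 1] - mu,
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("ATTGT", "AGGACAT"))
```

Because a cell never drops below 0, a poorly matching prefix cannot drag down the score of a good region that follows it.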

The Smith-Waterman algorithm requires only O(m) space because only O(m) entries of the dynamic programming matrix need to be stored at any given time. The algorithm extends a local alignment as long as the resulting score is positive, as shown in Figure 9. Some extensions increase the score while others decrease it.

Figure 9. Extension of local alignment as long as the resulting score is positive.

We can outline the four possible ways of forming a path using the Smith-Waterman algorithm as follows. For every residue in one sequence:

1. Align with the next residue of the second sequence. The score is the previous score plus the similarity score for the two residues.

2. For deletion (i.e. matching a residue of the query with a gap), the score is the previous score minus the gap penalty, which depends on the size of the gap and the gap open penalty.

3. For insertion (i.e. matching a residue of the second sequence with a gap), the score is the previous score minus the gap penalty, which depends on the size of the gap and the gap open penalty.

4. Stop extending the alignment if the score is less than zero.

Then choose whichever of these is the highest.

Features of the Smith-Waterman algorithm:

1. Instead of looking at each sequence in its entirety, it compares segments of all possible lengths (local alignments) and chooses whichever maximizes the similarity measure.
2. For every cell, the algorithm calculates ALL possible paths leading to it. These paths can be of any length and can contain insertions, deletions and substitutions.
3. These computations are incorporated into the dynamic programming solution in Figure 6, and we obtain the Smith-Waterman local alignment algorithm for the simple scoring scheme shown in Figure 8.

Heuristics

Dynamic programming is not practical when the sequences are long. Heuristic sequence alignment algorithms are approximation algorithms that are 10 to 50 times faster than dynamic programming and are used to obtain near-optimal results.

BLAST

BLAST stands for Basic Local Alignment Search Tool. BLAST [1] is the dominant search engine for biological sequence databases. It heuristically finds high-scoring local alignments. It is typically used to search a query sequence against a database of sequences.

Given two strings S_1 and S_2, a segment pair is a pair of equal-length substrings of S_1 and S_2, aligned without spaces. A locally maximal segment pair is a segment pair whose alignment score (without spaces) would fall by either expanding or shortening the segments on either side. A maximal segment pair (MSP) in S_1, S_2 is a segment pair with the maximum score over all segment pairs in S_1, S_2. BLAST directly approximates alignments that optimize the maximal segment pair (MSP) score.

This heuristic algorithm can be applied in a variety of contexts, including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and the analysis of multiple regions of similarity in long DNA sequences. The key tradeoff it makes is sensitivity vs. speed. Sensitivity can be defined as the ability to detect correct matches. It is the ratio of the number of significant matches detected to the number of significant matches in the database.

History: BLAST1 was created in 1990 and was dedicated to the search for regions of local similarity without gaps. BLAST2 was created as an extension of BLAST1 by allowing the insertion of gaps. Two versions of BLAST2 were independently developed, namely NCBI_BLAST2 [NCBIBLAST] by the National Center for Biotechnology Information in 1997 and WU_BLAST2 [WUBLAST] by Washington University.

Algorithm BLAST1

This algorithm concentrates on finding regions of high local similarity in alignments without gaps, evaluated by an alphabet-weight scoring matrix. Alignments with some gaps can be created by chaining together several locally similar regions that BLAST finds. The fundamental objects that concern BLAST are segment pairs, locally maximal segment pairs, and maximal segment pairs.
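To make the notion of a maximal segment pair concrete, the sketch below computes the exact MSP score by brute force: every ungapped segment pair lies on one diagonal of the comparison matrix, so it scans each diagonal and keeps the best-scoring contiguous run. This is not how BLAST works (BLAST approximates the MSP heuristically); it only illustrates the quantity being approximated. The function name `msp_score` and the +1/-1 scoring are assumptions made for illustration, where BLAST would use a scoring matrix.

```python
def msp_score(s1: str, s2: str, match: int = 1, mismatch: int = -1) -> int:
    """Exact maximal segment pair score (ungapped), in O(len(s1) * len(s2)) time."""
    n, m = len(s1), len(s2)
    best = 0
    # Each diagonal is identified by the offset d = i - j.
    for d in range(-(m - 1), n):
        i, j = (d, 0) if d >= 0 else (0, -d)
        run = 0
        for k in range(min(n - i, m - j)):
            pair = match if s1[i + k] == s2[j + k] else mismatch
            # Maximum-sum contiguous run along the diagonal:
            # restart the segment whenever the running score drops to 0.
            run = max(0, run + pair)
            best = max(best, run)
    return best


print(msp_score("ABCDE", "XXCDEX"))
```

A score of 0 here means that no segment pair scores positively under the assumed scheme.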

Figure 10. Steps in BLAST

We can outline the steps of BLAST1 shown in Figure 10 as follows:

Given: query sequence Q, word length W, word score threshold T, segment score threshold S.

1. Compile a list of words that score at least T when compared to words from Q.

2. Scan the database for matches to words in the list.

3. Extend all matches to seek high-scoring ungapped alignments. Then return all alignments scoring at least S.

Let us understand this with an example.

EXAMPLE:
Query sequence: QLNFSAGW
Word length w = 2
Word score threshold T = 8

Step 1: Determine all words of length w in the query sequence.

QL LN NF FS SA AG GW

Step 2: Determine all words that score at least T when compared to a word in the query sequence.

Words from query sequence    Words scoring at least T = 8
QL                           QL=11, QM=9, HL=8, ZL=9
LN                           LN=9, LB=8
NF                           NF=12, AF=8, NY=8, DF=10, ..
SA                           none

While searching the database for all occurrences of query words, we apply the following approach:
Index database sequences into a table of words (pre-compute this).
Index query words into a table (at query time).

Step 3: Extending hits.

1. Extend the hits in both directions without allowing gaps.

2. Terminate the extension in one direction when the score falls a certain distance below the best score found for shorter extensions.

3. Return all the segment pairs scoring at least S.

Figure 11. Extension of a hit between query Q and string P of database DB.

A hit, shown in red, is being extended in Figure 11 without allowing gaps.

Algorithm BLAST2

The most important feature of NCBI-BLAST2 is that it allows local alignment with gaps. The first two steps, leading to the generation of primary hits, are the same as those in BLAST1. The third step includes two major refinements:

1. Two-hits requirement: do extension only when there are two closely spaced hits on the same diagonal.
2. Gapped extension: allow gaps in extensions.

The Two Hit Method

1. The extension step typically accounts for 90% of BLAST's execution time.

2. Do extension only when there are two hits on the same diagonal within distance A of each other.

3. To maintain sensitivity, lower the T parameter. This will result in more single hits being found, and only a small fraction will have an associated second hit.

Figure 12. Two hits requirement

In Figure 12 we can see two hits marked in red on the diagonal. The two hits are within a distance of A units from each other.

Gapped Extension

The following steps are applied in gapped extension:

1. Trigger the gapped alignment if the two-hit extension has a sufficiently high score.
2. Find the length-11 segment with the highest score and use the central pair in this segment as the seed.
3. Run the dynamic programming process both forward and backward from the seed.
4. Prune cells when the local alignment score falls a certain distance below the best score found so far.

Procedure

When comparing all the sequences in a database with a fixed query sequence P, BLAST attempts to find all those database sequences that, together with P, contain an MSP above some cut-off score C. The choice of C is guided by the scoring

matrix and the characteristics of P and of the database sequences. Any sequence with an MSP score above C is considered significant and reported.

Score

The scoring scheme of BLAST is based on PAM matrices. PAM matrices are amino acid substitution matrices that encode and summarize the expected evolutionary change at the amino acid level. Each PAM matrix is designed to be used to compare pairs of sequences that are a specific number of PAM units diverged. Figure 13 shows an example output of a local alignment between a query and a database sequence. It is accompanied by the score for this alignment.

Score = 28.2 bits (14), Expect = 1.9
Identities = 20/22 (90%)
Strand = Plus / Minus
Query: 390 gttgactgcacttccagccagg 411
Sbjct: 239 gttgactgatcttccagccagg 218

Figure 13. Local alignment between two strings using BLAST.

Two sequences S_1 and S_2 are considered one PAM unit diverged if a series of accepted point mutations (and no insertions or deletions) has converted S_1 to S_2 with an average of one accepted point-mutation event per one hundred amino acids. For any specific pair of amino acid characters, denoted A_i, A_j, the (i, j) entry in the PAM n matrix reflects the frequency with which A_i is expected to replace A_j in two sequences that are n PAM units diverged. Let f(i, j) denote the resulting frequency, and let f(i) and f(j), respectively, be the frequencies with which amino acids A_i and A_j appear in the sequences.

Then the (i, j)th entry for the ideal PAM n matrix is log[ f(i, j) / (f(i) * f(j)) ]. The reason for dividing f(i, j) by f(i) * f(j) is to normalize the true (historical) replacement frequency by the frequency one expects due to chance alone.

Summary

BLAST is successful because of its speed, its range of solutions, and the estimate of statistical significance it provides for the matches found in sequences.

FASTA

FASTA [9, 29] is another popular algorithm for string comparison. It was developed in 1985 and subsequently improved.

Procedure

FASTA works by comparing a query string against a single text string. When searching the whole database for matches to a given query, the query string is compared using the FASTA algorithm to every string in the database. When looking for an alignment, one might expect to find a few segments in which there is absolute identity between the two compared strings. The algorithm uses this property and focuses on these identical regions.

Figure 14. Steps in FASTA

Let us take two sequences X and Y. From Figure 14 we can study the four stages in the FASTA algorithm:

1. Specify an integer parameter called ktup (k respective tuples), and look for ktup-length matching substrings between X and Y.

2. Rescore the runs of identities using a PAM substitution matrix. Keep the top-scoring segments.

3. Apply a joining threshold to eliminate segments that are unlikely to be part of the alignment that includes the highest-scoring segment.

4. Use dynamic programming to optimize the alignment in a narrow band that encompasses the top-scoring segments.

FASTA format

>gi pir TVFV2E TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGS
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLR
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHG
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWL
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVE
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGIL
LAAVEAQQQMLKLTIWGVK

Figure 15. Example of a sequence in FASTA format.

Figure 15 shows the format of a sample sequence as used by the FASTA algorithm.

Summary

The alignment scores obtained with FASTA compare well with those of the optimal alignment, while the algorithm is also much faster than the ordinary dynamic programming algorithm for sequence alignment.
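The first FASTA stage, finding ktup-length identities between X and Y, can be sketched with a lookup table of words. The code below is an illustrative assumption rather than the FASTA implementation: it indexes every ktup-length word of X, looks up the words of Y, and counts the hits on each diagonal (offset i - j), since runs of identities that belong to the same ungapped region share a diagonal. The function name `ktup_diagonal_hits` is hypothetical, and the later stages (rescoring, joining, banded dynamic programming) are not shown.

```python
from collections import defaultdict


def ktup_diagonal_hits(x: str, y: str, ktup: int = 2) -> dict:
    """Count ktup-length exact word matches between x and y on each diagonal."""
    # Lookup table: word -> list of start positions in x
    index = defaultdict(list)
    for i in range(len(x) - ktup + 1):
        index[x[i:i + ktup]].append(i)

    # For every word of y, record one hit per matching position in x,
    # keyed by the diagonal offset i - j.
    hits = defaultdict(int)
    for j in range(len(y) - ktup + 1):
        for i in index.get(y[j:j + ktup], ()):
            hits[i - j] += 1
    return dict(hits)


hits = ktup_diagonal_hits("ACGTACGT", "TACG", ktup=2)
print(max(hits, key=hits.get))  # diagonal with the most word hits
```

Diagonals with many hits are the natural candidates for the rescoring step, and a larger ktup trades sensitivity for speed because fewer words match.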

CHAPTER 3: Design & Methodology

This chapter is divided into preliminaries, contributing factors, algorithm and methodology. The methodology broadly comprises generating a generalized suffix tree (GST) of the database sequences, pre-processing the GST to add sequence information, post-processing the GST to extract patterns, isolating the sequences associated with the patterns and ranking them, and finally performing pair-wise local alignment between the query and the chosen sequences. The GST, being the backbone of the method, is constructed with a carefully selected linear-time algorithm.

3.1 Preliminaries

To begin with, we cover some important requirements.

Importance of Data Structure

For the correct representation and use of data, it is extremely important to pick the right data structure for its storage. Knowledge can be retrieved from a wealth of information only if we handle the data logically and follow a clear and simple implementation. The suffix tree is an excellent data structure for solving problems related to all kinds of sequences. The internal details of a biological sequence can be revealed more clearly using this tool, because it offers the robustness of a tree structure and the advantage of linear-time construction, O(m), where m is the length of the sequence.

Choice of Algorithm

Several algorithms have been written for the linear-time construction of the suffix tree. One algorithm stands out among the existing works, namely Ukkonen's method [37], given by Esko Ukkonen. Ukkonen's algorithm with O(m)

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT IADIS International Conference Applied Computing 2006 USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT Divya R. Singh Software Engineer Microsoft Corporation, Redmond, WA 98052, USA Abdullah

More information

Computational Molecular Biology

Computational Molecular Biology Computational Molecular Biology Erwin M. Bakker Lecture 3, mainly from material by R. Shamir [2] and H.J. Hoogeboom [4]. 1 Pairwise Sequence Alignment Biological Motivation Algorithmic Aspect Recursive

More information

Biology 644: Bioinformatics

Biology 644: Bioinformatics Find the best alignment between 2 sequences with lengths n and m, respectively Best alignment is very dependent upon the substitution matrix and gap penalties The Global Alignment Problem tries to find

More information

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST Alexander Chan 5075504 Biochemistry 218 Final Project An Analysis of Pairwise

More information

Sequence analysis Pairwise sequence alignment

Sequence analysis Pairwise sequence alignment UMF11 Introduction to bioinformatics, 25 Sequence analysis Pairwise sequence alignment 1. Sequence alignment Lecturer: Marina lexandersson 12 September, 25 here are two types of sequence alignments, global

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the

More information

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic programming is a group of mathematical methods used to sequentially split a complicated problem into

More information

Reconstructing long sequences from overlapping sequence fragment. Searching databases for related sequences and subsequences

Reconstructing long sequences from overlapping sequence fragment. Searching databases for related sequences and subsequences SEQUENCE ALIGNMENT ALGORITHMS 1 Why compare sequences? Reconstructing long sequences from overlapping sequence fragment Searching databases for related sequences and subsequences Storing, retrieving and

More information

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading: 24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, 2010 3 BLAST and FASTA This lecture is based on the following papers, which are all recommended reading: D.J. Lipman and W.R. Pearson, Rapid

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 04: Variations of sequence alignments http://www.pitt.edu/~mcs2/teaching/biocomp/tutorials/global.html Slides adapted from Dr. Shaojie Zhang (University

More information

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha Motivation Sequence homology to a known protein suggest function of newly sequenced protein Bioinformatics

More information

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA.

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA. Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA. Fasta is used to compare a protein or DNA sequence to all of the

More information

Data Mining Technologies for Bioinformatics Sequences

Data Mining Technologies for Bioinformatics Sequences Data Mining Technologies for Bioinformatics Sequences Deepak Garg Computer Science and Engineering Department Thapar Institute of Engineering & Tecnology, Patiala Abstract Main tool used for sequence alignment

More information

From Smith-Waterman to BLAST

From Smith-Waterman to BLAST From Smith-Waterman to BLAST Jeremy Buhler July 23, 2015 Smith-Waterman is the fundamental tool that we use to decide how similar two sequences are. Isn t that all that BLAST does? In principle, it is

More information

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be 48 Bioinformatics I, WS 09-10, S. Henz (script by D. Huson) November 26, 2009 4 BLAST and BLAT Outline of the chapter: 1. Heuristics for the pairwise local alignment of two sequences 2. BLAST: search and

More information

TCCAGGTG-GAT TGCAAGTGCG-T. Local Sequence Alignment & Heuristic Local Aligners. Review: Probabilistic Interpretation. Chance or true homology?

TCCAGGTG-GAT TGCAAGTGCG-T. Local Sequence Alignment & Heuristic Local Aligners. Review: Probabilistic Interpretation. Chance or true homology? Local Sequence Alignment & Heuristic Local Aligners Lectures 18 Nov 28, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall

More information

BLAST, Profile, and PSI-BLAST

BLAST, Profile, and PSI-BLAST BLAST, Profile, and PSI-BLAST Jianlin Cheng, PhD School of Electrical Engineering and Computer Science University of Central Florida 26 Free for academic use Copyright @ Jianlin Cheng & original sources

More information

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1 Database Searching In database search, we typically have a large sequence database

More information

Bioinformatics for Biologists

Bioinformatics for Biologists Bioinformatics for Biologists Sequence Analysis: Part I. Pairwise alignment and database searching Fran Lewitter, Ph.D. Director Bioinformatics & Research Computing Whitehead Institute Topics to Cover

More information

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm.

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm. FASTA INTRODUCTION Definition (by David J. Lipman and William R. Pearson in 1985) - Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence

More information

Dynamic Programming Part I: Examples. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, / 77

Dynamic Programming Part I: Examples. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, / 77 Dynamic Programming Part I: Examples Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, 2011 1 / 77 Dynamic Programming Recall: the Change Problem Other problems: Manhattan

More information

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations:

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations: Lecture Overview Sequence Alignment & Search Karin Verspoor, Ph.D. Faculty, Computational Bioscience Program University of Colorado School of Medicine With credit and thanks to Larry Hunter for creating

More information

Special course in Computer Science: Advanced Text Algorithms

Special course in Computer Science: Advanced Text Algorithms Special course in Computer Science: Advanced Text Algorithms Lecture 6: Alignments Elena Czeizler and Ion Petre Department of IT, Abo Akademi Computational Biomodelling Laboratory http://www.users.abo.fi/ipetre/textalg

More information

Lecture 10. Sequence alignments

Lecture 10. Sequence alignments Lecture 10 Sequence alignments Alignment algorithms: Overview Given a scoring system, we need to have an algorithm for finding an optimal alignment for a pair of sequences. We want to maximize the score

More information

Sequence Alignment & Search

Sequence Alignment & Search Sequence Alignment & Search Karin Verspoor, Ph.D. Faculty, Computational Bioscience Program University of Colorado School of Medicine With credit and thanks to Larry Hunter for creating the first version

More information

Combinatorial Pattern Matching. CS 466 Saurabh Sinha

Combinatorial Pattern Matching. CS 466 Saurabh Sinha Combinatorial Pattern Matching CS 466 Saurabh Sinha Genomic Repeats Example of repeats: ATGGTCTAGGTCCTAGTGGTC Motivation to find them: Genomic rearrangements are often associated with repeats Trace evolutionary

More information

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010 Principles of Bioinformatics BIO540/STA569/CSI660 Fall 2010 Lecture 11 Multiple Sequence Alignment I Administrivia Administrivia The midterm examination will be Monday, October 18 th, in class. Closed

More information

BLAST & Genome assembly

BLAST & Genome assembly BLAST & Genome assembly Solon P. Pissis Tomáš Flouri Heidelberg Institute for Theoretical Studies May 15, 2014 1 BLAST What is BLAST? The algorithm 2 Genome assembly De novo assembly Mapping assembly 3

More information

Bioinformatics explained: Smith-Waterman

Bioinformatics explained: Smith-Waterman Bioinformatics Explained Bioinformatics explained: Smith-Waterman May 1, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com

More information

Today s Lecture. Edit graph & alignment algorithms. Local vs global Computational complexity of pairwise alignment Multiple sequence alignment

Today s Lecture. Edit graph & alignment algorithms. Local vs global Computational complexity of pairwise alignment Multiple sequence alignment Today s Lecture Edit graph & alignment algorithms Smith-Waterman algorithm Needleman-Wunsch algorithm Local vs global Computational complexity of pairwise alignment Multiple sequence alignment 1 Sequence

More information

Dynamic Programming & Smith-Waterman algorithm

Dynamic Programming & Smith-Waterman algorithm m m Seminar: Classical Papers in Bioinformatics May 3rd, 2010 m m 1 2 3 m m Introduction m Definition is a method of solving problems by breaking them down into simpler steps problem need to contain overlapping

More information

Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it.

Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it. Sequence Alignments Overview Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it. Sequence alignment means arranging

More information
