Metric Indexing of Protein Databases and Promising Approaches


WDS'07 Proceedings of Contributed Papers, Part I, 91-97, MATFYZPRESS

D. Hoksza
Charles University, Faculty of Mathematics and Physics, Prague, Czech Republic.

Abstract. The most widely used biological databases today are nucleotide and protein databases. They are crucial for determining the biological functions of living organisms with respect to their DNA structure. The biological function of a protein can be derived from its similarity to another protein with a known function stored in a database, so the chance of finding the biological function of a given protein or DNA sequence grows with the size of the database. For this reason the databases grow exponentially, which in turn calls for sublinear search methods. The optimal solution is to align the query sequence with all sequences in the queried database. Since aligning two sequences is computationally expensive, fast heuristic methods (e.g. BLAST [Altschul et al., 1997]) are used, although they only approximate the optimal solution and give no bound on the resulting error. In this paper we try to use metric access methods (MAMs) for exact and approximate searching in protein databases. As the experiments show, such a straightforward use of MAMs is not very suitable, so we also outline possible further directions in indexing protein sequences based on the facts learned so far.

Introduction

Databases of protein sequences exist because similar proteins perform similar biological functions, so it makes sense to store such sequences in a database. This similarity is shared even among different species, and for this reason these databases have grown exponentially over the last ten years. One way to handle exponentially growing data is indexing, since searching an index has sublinear complexity.
There have been attempts to incorporate indexing techniques into protein database search, but almost all of them split sequences into q-grams and use distance functions that are easy to index but neglect the biological meaning of protein similarity (see the Current Indexing Approaches section). As a result, the most widely used method today is BLAST, a heuristic approach with linear time complexity.

Protein Databases [1]

DNA molecules consist of two strings of nucleotides (conventionally labeled A, C, G, T) which can be transcribed to RNA and later translated to proteins, which are linear polymers of 20 types of amino acids. The rule for translating every triplet of nucleotides (codon) into an amino acid is called the genetic code. A combination of proteins performs a biological function, which is determined by the three-dimensional structure of the proteins involved. Importantly, proteins with similar sequences have similar three-dimensional structures and therefore similar functions [2]. When the amino acid sequence of a protein is determined, we usually want to know its function. The easiest way to find out is to search a database of protein sequences with known functions. This gives a possible purpose of the examined sequence, which we can use as a clue to its function. To increase the probability of finding a similar protein, the searched database should be as large as possible.

[1] We do not consider nucleotide databases in this paper; they are handled similarly (but for different purposes).
[2] Moreover, more extensive sequence similarity can be viewed as evidence of common ancestry, and therefore as a basis for reconstructing the phylogenetic history of organisms and their genes.

Similarity Search in Sequence Databases

When comparing two sequences [3] we can say that they are somehow similar. Intuitively, this means that the two sequences have a sufficient number of identical (or similar) letters at identical (or similar) positions, where gaps may be inserted into both sequences. Finding the optimal alignment therefore means inserting gaps into both sequences so that the score of the alignment is as good as possible. The score can then be used to retrieve sequences similar to a query sequence.

Common string distance measures. If we take the distance to be the number of positions at which the letters differ (thus not allowing gaps), we get the Hamming distance (HD), which is defined on sequences of equal length. If we loosen this condition we get the Levenshtein (or edit) distance, given by the minimum number of operations needed to transform one sequence into the other, where an operation is an insertion, deletion, or replacement of a single letter. Finally, we can reflect the similarity of letters by assigning a weight to each pair of letters, which gives the weighted edit distance (with a scoring matrix). The weights for each pair of letters are stored in a square scoring matrix. Many different scoring matrices are used in bioinformatics for various purposes (e.g. PAM matrices [Dayhoff et al., 1978], BLOSUM matrices [Henikoff S. and Henikoff J.G., 1992], etc.). The score of an alignment is then the sum of the gap penalties and the scoring-matrix values of the matches/mismatches at the aligned positions (see Figure 1).

Figure 1. Global and local alignment (and their scores) of protein sequences NPHGIIMGLAE and HGLGL according to the BLOSUM62 scoring matrix.

Global and Local Alignment Measures. We further distinguish global and local alignment.
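The two unweighted string distances described above can be illustrated with a short Python sketch. The function names `hamming` and `levenshtein` are chosen here for illustration and do not come from the paper; the weighted variant with a scoring matrix is not shown.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-letter insertions, deletions and replacements."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # replace x by y, or match
        prev = curr
    return prev[-1]
```

Both functions count unit-cost operations only; replacing the constant costs by scoring-matrix lookups would give the weighted edit distance mentioned in the text.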
Both are weighted edit distances, but global alignment aligns the whole sequences, whereas local alignment globally aligns every pair of substrings of the two sequences and declares the best of these alignments to be the local alignment [4] (see Figure 1). Enumerating all possible alignments takes exponential time for both problems, but both have dynamic programming solutions that compute the alignment in $O(mn)$ time and space. The algorithm for global alignment was published by Needleman and Wunsch [Needleman and Wunsch, 1970]; in 1981 Smith and Waterman [Smith and Waterman, 1981] proposed an extension (SW) that computes the local alignment with the same time and space complexity as the global one.

The core of the solution is an $n \times m$ matrix $s$, where cell $s_{i,j}$ stands for the optimal (e.g. maximal) score of aligning the prefixes of lengths $i$ and $j$ of the two sequences. In the initialization stage, cell $s_{0,0}$ is set to 0, and the 0th row and column are filled with the scores of aligning the empty prefix of one sequence with the $i$-th prefix of the other. The recursive formula for filling the cells of $s$ when computing the global alignment is

$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j} + \sigma \\
s_{i,j-1} + \sigma \\
s_{i-1,j-1} + \delta(a_i, b_j)
\end{cases}
\qquad (1)
$$

where $a$ and $b$ are the sequences to be aligned, $\sigma$ is the score for a gap and $\delta$ is the scoring matrix. Since $s_{i,j}$ contains the score of the global alignment of the $i$-long prefix of $a$ and the $j$-long prefix of $b$, the cell $s_{|a|,|b|}$ contains the alignment score for the whole sequences. For computing the local alignment, the only change to the recursive formula is adding 0 to the values from which the maximum is taken. In this way the alignment built so far is abandoned and a new local alignment is started at position $[i, j]$ whenever that improves the score. The optimal local alignment score is then the highest value in the matrix.
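The recurrence for global alignment and its local variant can be sketched in Python as follows. This is a minimal illustration with a single gap score `sigma`, not the implementation used in the paper; `align_score` and `toy_delta` are hypothetical names, and `toy_delta` is a stand-in for a real scoring matrix such as BLOSUM62.

```python
def align_score(a, b, delta, sigma, local=False):
    """Score of the global (default) or local alignment of sequences a and b.

    delta(x, y) is the substitution score and sigma the (negative) gap score.
    """
    rows, cols = len(a) + 1, len(b) + 1
    s = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: a prefix aligned with the empty prefix is all gaps.
        for i in range(1, rows):
            s[i][0] = i * sigma
        for j in range(1, cols):
            s[0][j] = j * sigma
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            value = max(s[i - 1][j] + sigma,                          # gap in b
                        s[i][j - 1] + sigma,                          # gap in a
                        s[i - 1][j - 1] + delta(a[i - 1], b[j - 1]))  # (mis)match
            if local:
                value = max(value, 0)   # restart the alignment here if that is better
                best = max(best, value)
            s[i][j] = value
    return best if local else s[-1][-1]


def toy_delta(x, y):
    """Toy substitution score (assumption): +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1
```

For example, `align_score("XXHGLXX", "HGL", toy_delta, -2, local=True)` scores only the shared substring, whereas the global variant would also pay for the four unmatched letters.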
Finally, there is usually another modification that distinguishes between opening and extending a gap: extending a gap is penalized considerably less than opening one.

[3] Multiple sequences can also be mutually aligned for some purposes, but this is outside the scope of this paper.
[4] Local alignment is used when aligning protein sequences.
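The distinction between opening and extending a gap can be sketched with the commonly used Gotoh-style recurrences, shown here for the local (Smith-Waterman) score. The paper does not say which affine-gap formulation it has in mind, so the scheme below (a gap of length k costs `gap_open + (k - 1) * gap_extend`) and all names are assumptions made for illustration.

```python
NEG_INF = float("-inf")


def local_score_affine(a, b, delta, gap_open, gap_extend):
    """Local alignment score with separate gap-open and gap-extend scores.

    Both gap scores are expected to be negative, with gap_extend closer to 0.
    """
    cols = len(b) + 1
    h_prev = [0] * cols          # best scores of the previous row
    f_prev = [NEG_INF] * cols    # previous row, alignments ending in a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        h_curr = [0] * cols
        f_curr = [NEG_INF] * cols
        e = NEG_INF              # current row, alignment ending in a horizontal gap
        for j in range(1, cols):
            e = max(h_curr[j - 1] + gap_open, e + gap_extend)
            f_curr[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_extend)
            h = max(0,                                            # start a new alignment
                    h_prev[j - 1] + delta(a[i - 1], b[j - 1]),    # (mis)match
                    e,
                    f_curr[j])
            h_curr[j] = h
            best = max(best, h)
        h_prev, f_prev = h_curr, f_curr
    return best
```

With a match score of 2, a mismatch score of -3, `gap_open = -3` and `gap_extend = -1`, aligning "AAAXXAAA" with "AAAAAA" bridges the two X letters with one gap of length two at a cost of 4, giving a score of 8; with a flat penalty of 3 per gap position the bridge would not pay off and the best local score would be 6.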

Heuristic Approaches to Retrieval of Similar Sequences

When the Smith-Waterman algorithm was invented, it was the optimal way to search protein sequences for similarities. But, as mentioned earlier, database sizes have been growing exponentially since then, and it is no longer feasible to align every database sequence with the query sequence in quadratic time. Heuristic methods have therefore been proposed to reduce the search time, for example FASTA [Pearson and Lipman, 1988] and BLAST (Basic Local Alignment Search Tool) [Altschul et al., 1997]. Since BLAST is the most widely used method today, we describe it here (other heuristic methods work in a similar way):

1) Remove low-complexity regions from the query sequence (those with no meaningful alignment).
2) Generate all n-grams (substrings of length n) of the query sequence.
3) Compute the similarity between every sequence of length n (over the given alphabet) and each n-gram from the previous step.
4) Filter out sequences with similarity lower than a cut-off score (called the neighborhood word score threshold).
5) The remaining high-scoring sequences (organized in a search tree) are then used to search all database sequences for exact matches.
6) High-scoring sequences within a given distance (those on the same diagonal if we imagine the sequences in a matrix) are then connected by gapped alignment and extended as long as the score grows [5]. Such alignments are called high-scoring segment pairs (HSPs).
7) All HSPs with scores below a given threshold are excluded.
8) The scores of the remaining sequences are refined by the classic Smith-Waterman algorithm.

Statistical Relevance. To see how significant a given alignment is, the so-called E-value is defined, which indicates whether an alignment happened just by chance or whether it is somehow important.
Thus the E-value is the expected number of alignments between sequences of lengths m and n with a score of at least S in a database with N residues (amino acids), computed as [6]

$$
E = K \, m_{\mathrm{eff}} \, n_{\mathrm{eff}} \, e^{-\lambda S} \, \frac{N}{n} \qquad (2)
$$

where $K$ and $\lambda$ are characteristics of the SW score distribution [Karlin and Altschul, 1990] and $m_{\mathrm{eff}}$ and $n_{\mathrm{eff}}$ are effective sequence lengths that compensate for the higher probability of an alignment occurring in the middle of the sequences.

Current Indexing Approaches

Although BLAST shows good speed and accuracy, it still slows down by 64% every year (as stated in [Cameron et al., 2004]) because of the exponential growth of the databases it searches. This is why there have been efforts to find methods that search for similarities in sublinear time, e.g. by indexing. Next, we describe some of these indexing methods (both metric and non-metric).

SSAHA. SSAHA [Ning et al., 2001] is primarily meant for nucleotide sequences, but the idea can be generalized to protein sequences. It hashes all q-grams of the stored sequences; for each q-gram occurrence there is one tuple consisting of the index of the sequence and the offset of the q-gram. When searching, the query sequence is split into q-grams and every hit of every such q-gram is converted to a triplet (index, offset - t, offset), where t is the position of the q-gram in the query sequence. The list of these triplets is then sorted by the first two attributes. Consecutive runs of n list entries with identical first two attributes are hits of length q·n.

BLAT. BLAT (BLAST-like alignment tool) [Kent, 2002] is a direct improvement of BLAST that uses indexing to improve its efficiency. Probably the most important modification is that, whereas BLAST builds a search tree for the query sequence and scans the database letter by letter, BLAT does the opposite.
It organises non-overlapping q-grams of the database sequences into an index, and this index is searched for similarity with every q-gram of the query sequence. Because of the splitting into q-grams, the method is still heuristic, much like BLAST. BLAT was principally intended for nucleotide sequences.

FT#N(s). FT#N(s) [Ozturk and Ferhatosmanoglu, 2003] is a method that transforms sequences (strings) into multidimensional numeric vectors, which are then indexed with an R-tree (any other indexing method can be used, of course). Distance functions are then defined on these vectors to support similarity search among them (i.e. among the sequences). Frequencies of q-grams are used as vectors (e.g. FT2N(s) for the 2-grams of a sequence s) and the distance is FD#N(s_1, s_2), defined as the maximum of the differences over the dimensions [7]. This distance is proved to be a lower bound of the edit distance between the original strings. As the length of the q-grams grows, the length of the transformed vector grows exponentially; therefore [Hsieh et al., 2006] introduced a method which uses similar transformations and distance functions but grows linearly with the number of features it uses instead of q-grams.

[5] This applies to BLAST2; previous versions of BLAST did not connect sequences on diagonals (and therefore used a higher value for the cut-off score in step 4).
[6] This is how BLAST computes it (other heuristics can do it in a different way).

mpam. The authors of the mpam method used the observation made in [Sellers, 1974] that global alignment with a metric scoring matrix is a metric function. They therefore revisited the mathematics of the PAM matrices, which resulted in the mpam (metric PAM) substitution matrix, whose sensitivity is similar to that of the PAM matrix [Xu and Miranker, 2004]. This matrix is then used in global alignment to define a measure for indexing q-grams with an MVP-tree (multi-vantage-point tree). This is very similar to BLAT, but unlike BLAT the mpam method uses global edit distance with a substitution matrix when aligning q-grams (BLAT uses exact or near-exact matching, or Hamming distance [8]).

Metric Sequence Indexing & Search

Almost all of the methods mentioned above use simpler distance functions than the weighted local edit distance; hence they are not well comparable with the optimal alignment or with the methods in daily use (e.g. BLAST), and they definitely cannot give biologically optimal (correct) solutions. Our idea was to preserve the commonly used distance and turn it into a distance metric δ which satisfies the metric properties (reflexivity, non-negativity, symmetry and triangle inequality). Such a metric could then be utilized by various metric access methods (MAMs) [Zezula et al., 2005]. MAMs use the triangle inequality to organize database objects into metric regions, and only those regions that have a nonempty intersection with the query region need to be examined.
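To make the triangle inequality requirement concrete, the following Python sketch counts how many triplets of objects violate it for a given distance function. It is only an illustration: `triangle_violations` and the toy distance `squared_difference` are hypothetical names, the toy data are numbers rather than protein sequences, and the square root is just one example of a concave function applied to the distances.

```python
from itertools import permutations
from math import sqrt


def triangle_violations(objects, dist, modifier=lambda d: d):
    """Fraction of ordered triplets (x, y, z) with f(d(x,z)) > f(d(x,y)) + f(d(y,z)).

    `modifier` is the function f applied to every distance (identity by default).
    """
    total = violated = 0
    for x, y, z in permutations(objects, 3):
        total += 1
        direct = modifier(dist(x, z))
        detour = modifier(dist(x, y)) + modifier(dist(y, z))
        if direct > detour + 1e-12:   # small tolerance for floating-point noise
            violated += 1
    return violated / total if total else 0.0


def squared_difference(x, y):
    """Toy semi-metric: reflexive, non-negative and symmetric, but not a metric."""
    return (x - y) ** 2


points = [0.0, 1.0, 2.0, 4.0]
raw_error = triangle_violations(points, squared_difference)
fixed_error = triangle_violations(points, squared_difference, modifier=sqrt)
```

In this toy case the squared difference violates the inequality for some triplets (for instance 0, 1, 2), while taking the square root turns it into the absolute difference, for which no violation remains.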
Creating the Metric

To be able to compare our method with practical solutions (e.g. BLAST), we need to use the E-value as the distance measure. Since the E-value is not a metric, we need to turn it into one. The original E-value satisfies only the non-negativity property, but it can easily be modified to satisfy reflexivity and symmetry as well. To enforce reflexivity, we assign identical sequences an E-value of zero (the probability that a non-identical alignment would have a lower E-value is in practice very low). Symmetry requires a more substantial change. The problem is the query length used in computing the E-value, but because the average query length is similar to the average database sequence length (the query object is usually a protein sequence), we replace n in the formula by max(m, n).

TriGen Algorithm. The hardest part remains: fulfilling the triangle inequality. We use the TriGen algorithm [Skopal, 2006], designed for turning semi-metrics into metrics (or approximations of metrics) by applying concave similarity-preserving modifiers (functions). Such a modifier is applied to a training set of triplets and makes them satisfy the triangle inequality. Moreover, one can specify the proportion of triplets from the training set that may violate the triangle inequality (the so-called T-error tolerance), thus allowing faster but approximate searching.

LAESA. The LAESA method [Micó et al., 1994] is a pivot-based MAM which uses m pivots (objects from the database) to map each object into an m-dimensional vector; hence the database is represented by an n × m matrix. When querying, the query is mapped into the pivot space and the matrix is scanned sequentially for (candidate) objects that overlap with the query. Candidates are then filtered in the original space. LAESA is very powerful in its pruning effectiveness; however, due to the expensive selection of pivots and the sequential processing of the distance matrix, its usage in dynamic database environments is limited.
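The pivot-based filtering described for LAESA can be sketched as a range query over a precomputed object-to-pivot distance table. This is a simplified illustration of the idea as described in the text, not the original LAESA algorithm; `build_pivot_table` and `range_query` are hypothetical names, and the pruning is only correct when `dist` satisfies the triangle inequality.

```python
def build_pivot_table(database, pivots, dist):
    """Map every database object to its vector of distances to the pivots."""
    return [[dist(obj, p) for p in pivots] for obj in database]


def range_query(query, radius, database, pivots, table, dist):
    """Return (objects within `radius` of `query`, number of distance computations)."""
    q_vec = [dist(query, p) for p in pivots]   # map the query into the pivot space
    computations = len(pivots)
    result = []
    for obj, row in zip(database, table):
        # By the triangle inequality, |d(q, p) - d(o, p)| <= d(q, o) for every pivot p.
        lower_bound = max((abs(qd - od) for qd, od in zip(q_vec, row)), default=0.0)
        if lower_bound > radius:
            continue                           # pruned without computing d(q, o)
        computations += 1                      # candidate: check in the original space
        if dist(query, obj) <= radius:
            result.append(obj)
    return result, computations


# Toy usage with numbers and the absolute difference as the metric (assumption).
database = [1, 5, 9, 14, 20]
pivots = [1, 20]
table = build_pivot_table(database, pivots, lambda x, y: abs(x - y))
hits, cost = range_query(10, 2, database, pivots, table, lambda x, y: abs(x - y))
```

In the toy run only one of the five objects survives the pivot filter, so the query costs three distance computations (two to the pivots, one to the candidate) instead of five. When the distance is distorted so that lower bounds become loose, fewer objects are pruned and the cost approaches that of a sequential scan.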
M-tree & PM-tree

A typical tree-based MAM designed for database environments is the M-tree [Ciaccia et al., 1997]. The M-tree recursively bounds objects into balls specified by a center data object (one of the indexed ones).

[7] A transformation and distance based on wavelet functions is also introduced in the paper.
[8] When aligning nucleotide sequences (for which BLAT is primarily intended), substitution matrices are not usually used, so it is all right not to use weighted edit distance.

The inner nodes of an M-tree index contain routing entries, consisting of a region ball and a pointer to the subtree (all objects in a subtree must fall into the parent region ball). The leaf nodes contain ground entries, the DB objects themselves. When querying, only the nodes intersecting the query ball (which is not constant in the case of kNN queries) are processed further.

The PM-tree, proposed in [Skopal, 2004; Skopal et al., 2005], is a combination of the two methods mentioned above, LAESA and the M-tree. The entries also contain a set of pivots which prune the region ball; thus the total volume of a PM-tree's data region is always smaller than that of an equivalent M-tree region. The number of pivots in the routing and ground entries may differ.

The data regions of M-trees and PM-trees can overlap as a result of bad splitting. To fix this, the Slim-down algorithm [Skopal et al., 2003] was proposed to optimize an already built index. Slimming down is very expensive, but it can significantly speed up querying (up to 10x).

Experimental Results

As the dataset we used a random subset of the Swiss-Prot database [Bairoch et al., 2004] containing 3000 sequences. Another 100 random sequences were chosen as query sequences. All of the sequences had a maximal length of 1000, which causes no problem given that only 9191 sequences in the whole Swiss-Prot are longer, which makes 3% (the average sequence length is 365 in the whole Swiss-Prot and 335 in the reduced variant). These longer sequences could be treated in a special way, since they are just a small part of the whole.
We do not show a time comparison in our tests because we do not yet have an efficient implementation of Smith-Waterman, which is the crucial component of the running time. Our method can nevertheless be easily compared to SSEARCH (part of the FASTA package), since SSEARCH is equivalent to a sequential scan. To be able to compare index-based methods with BLAST, we distinguish the number of distance computations from computational costs. We define computational cost here as the number of comparisons of two letters. The computational cost of the tree-based methods is therefore estimated as the number of distance computations multiplied by the average size of the Smith-Waterman distance matrix (for the derivation of the computational costs of BLAST see [Hoksza and Skopal, 2007]). To be able to compare index-based methods with sequential scan, we show the number of distance computations too.

Four indexing methods were tested (M-tree, PM-tree, slimmed PM-tree and LAESA), each of them using the same set of distance modifiers [9] generated by the TriGen algorithm. As can be seen in Figure 2a, with zero error tolerance the weight of the modifier makes the number of distance computations almost equal to that of a sequential scan. When the weight is too big, it makes the triangle inequality hold, but at the price of increasing the intrinsic dimension. However, the M-tree performs slightly better than the sequential scan, which means that not all of the nodes were inspected when searching the tree (because of the inner nodes, the number of objects in the tree exceeds 3000). On the other hand, the PM-tree and the slimmed PM-tree show worse results, since there are additional distance computations to the pivots (mapping the query). The closest to sequential scan is the LAESA method, which uses a constant number of distance computations independently of the query range.

The situation changes slightly when we allow a small error (Figure 2b). The number of distance computations then decreases by about six percent compared to zero error tolerance.
Here, the PM-tree and the slimmed PM-tree outperform the M-tree, and the difference is about 1%. LAESA, on the other hand, showed only a slight improvement. But in both cases BLAST is evidently more efficient, since the efficiency of the index is close to that of a sequential scan even if a small error is allowed.

Why do the PM-tree and the slimmed PM-tree behave better when some error is allowed? The answer can be seen in Figure 2c, which shows, for a range query of five, the relation between the declared TriGen error tolerance and the real error experienced in the test. Here we can see that the real error of the PM-tree grows more quickly than the error of the M-tree; moreover, the slopes of those lines are almost inverse, which means that the PM-tree's gain in distance computations is counterbalanced by the error. From these two graphs it can also be seen that for a real error of 50%, approximately 1200 distance computations still have to be done. That means that even if BLAST were such a bad heuristic that it had only a 50% success rate, it would still be noticeably more efficient.

[9] Fractional power modifiers were used.

Figure 2. Relation between E-value and number of computations for range queries (a, b), and relation between real error and distance computations (c). Panels (a) and (b) show computational costs (millions) and distance computations for PM-tree(32,16) and slimmed PM-tree(32,16) compared with BLAST and sequential scan; panel (c) shows real error and distance computations against TriGen error tolerance.

Conclusions and new promising approaches

In this paper we have tested the suitability of metric access methods for indexing protein sequences. It has been shown that these methods are not applicable to the sequence alignment problem without modification. This is primarily because of the nature of the data to be indexed and of the distance function used to define similarity between them. This distance function is highly non-metric and requires strong modifications to make it metric. The modification distorts distances in a way that strongly increases the intrinsic dimension of the data, so the efficiency of the search is almost the same as that of a sequential scan. Compared with a sequential scan, however, it has the advantage that precision can be specified and thus traded off for efficiency.

These findings can direct further research to several areas. To name a few:

- Examining TriGen modifiers and finding modifiers that minimize the real error while distributing objects (i.e. sequences) in the space in a way suitable for indexing methods (i.e. decreasing the intrinsic dimension).
- Modifying the search structures. For example, examining the possibility of cutting sequences into q-grams while being able to quantify the resulting error (caused by splitting and thus losing information contained in the whole sequence) and (optimally) minimize it.
- Modifying the computation of Smith-Waterman local alignment. The idea is to change the computation so that it is faster and the resulting scores do not violate the metric properties as much as they do now (for example by using borders to limit the computational space in the distance matrix).

Acknowledgments. This research has been supported by a GAUK grant provided by the Grant Agency of Charles University.

References

Altschul, S.F., T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, D.J. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res., 25, 1997.
Bairoch, A., B. Boeckmann, S. Ferro, E. Gasteiger, Swiss-Prot: Juggling between evolution and stability, Brief. Bioinform., 5, 39-55, 2004.
Cameron, M., H.E. Williams, A. Cannane, Improved Gapped Alignment in BLAST, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(3), 2004.
Ciaccia, P., M. Patella, P. Zezula, M-tree: An Efficient Access Method for Similarity Search in Metric Spaces, VLDB 97, 1997.
Dayhoff, M.O., R.M. Schwartz, B.C. Orcutt, A model for evolutionary change in proteins, Atlas of Protein Sequence and Structure, 5, 1978.
Henikoff, S., J.G. Henikoff, Amino acid substitution matrices from protein blocks, Proc. Natl. Acad. Sci., 89, 1992.
Hoksza, D., T. Skopal, Index-based approach to similarity search in protein and nucleotide databases, DATESO 07, 67-80, 2007.
Hsieh, T., H. Kuo, J. Huang, Filtering Bio-sequence Based on Sequence Descriptor, BioDM 06, 14-23, 2006.
Karlin, S., S.F. Altschul, Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes, Proc. Natl. Acad. Sci., 87, 1990.
Kent, W.J., BLAT - The BLAST-Like Alignment Tool, Genome Research, 12(4), 2002.
Mao, R., W. Xu, S. Ramakrishnan, G. Nuckolls, D.P. Miranker, On optimizing distance-based similarity search for biological databases, Proc IEEE Comput Syst Bioinform Conference.
Micó, M.L., J. Oncina, E. Vidal, A new version of the nearest-neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements, Pattern Recognition Letters, 15, 9-17, 1994.
Needleman, S.B., C.D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology, 48, 1970.
Ning, Z., A.J. Cox, J.C. Mullikin, SSAHA: A Fast Search Method for Large DNA Databases, Genome Research, 11(10), 2001.
Ozturk, O., H. Ferhatosmanoglu, Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases, Proceedings of the 3rd IEEE Symposium on BioInformatics and BioEngineering, 359-, 2003.
Pearson, W.R., D.J. Lipman, Improved Tools for Biological Sequence Analysis, Proc. Natl. Acad. Sci., 85, 1988.
Sellers, P.H., On the theory and computation of evolutionary distances, SIAM Journal on Applied Mathematics, 26(4), 1974.
Skopal, T., Pivoting M-tree: A Metric Access Method for Efficient Similarity Search, DATESO 04, 21-31, 2004.
Skopal, T., On Fast Non-metric Similarity Search by Metric Access Methods, EDBT 06, 2006.
Skopal, T., J. Pokorný, V. Snášel, Nearest Neighbours Search using the PM-tree, DASFAA 05, 2005.
Smith, T.F., M.S. Waterman, Identification of common molecular subsequences, Journal of Molecular Biology, 147, 1981.
Weimin, Ch., K. Aberer, Efficient querying on genomic databases by using metric space indexing techniques, DEXA 07, 148.
Xu, W., D.P. Miranker, A Metric Model of Amino Acid Substitution, Bioinformatics, 20(8), 2004.
Zezula, P., G. Amato, V. Dohnal, M. Batko, Similarity Search: The Metric Space Approach, Advances in Database Systems, 2005.


More information

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm.

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm. FASTA INTRODUCTION Definition (by David J. Lipman and William R. Pearson in 1985) - Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence

More information

Sequence Alignment & Search

Sequence Alignment & Search Sequence Alignment & Search Karin Verspoor, Ph.D. Faculty, Computational Bioscience Program University of Colorado School of Medicine With credit and thanks to Larry Hunter for creating the first version

More information

Bioinformatics explained: BLAST. March 8, 2007

Bioinformatics explained: BLAST. March 8, 2007 Bioinformatics Explained Bioinformatics explained: BLAST March 8, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com Bioinformatics

More information

Heuristic methods for pairwise alignment:

Heuristic methods for pairwise alignment: Bi03c_1 Unit 03c: Heuristic methods for pairwise alignment: k-tuple-methods k-tuple-methods for alignment of pairs of sequences Bi03c_2 dynamic programming is too slow for large databases Use heuristic

More information

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic programming is a group of mathematical methods used to sequentially split a complicated problem into

More information

Jyoti Lakhani 1, Ajay Khunteta 2, Dharmesh Harwani *3 1 Poornima University, Jaipur & Maharaja Ganga Singh University, Bikaner, Rajasthan, India

Jyoti Lakhani 1, Ajay Khunteta 2, Dharmesh Harwani *3 1 Poornima University, Jaipur & Maharaja Ganga Singh University, Bikaner, Rajasthan, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 6 ISSN : 2456-3307 Improvisation of Global Pairwise Sequence Alignment

More information

Bioinformatics explained: Smith-Waterman

Bioinformatics explained: Smith-Waterman Bioinformatics Explained Bioinformatics explained: Smith-Waterman May 1, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com

More information

BLAST, Profile, and PSI-BLAST

BLAST, Profile, and PSI-BLAST BLAST, Profile, and PSI-BLAST Jianlin Cheng, PhD School of Electrical Engineering and Computer Science University of Central Florida 26 Free for academic use Copyright @ Jianlin Cheng & original sources

More information

Database Similarity Searching

Database Similarity Searching An Introduction to Bioinformatics BSC4933/ISC5224 Florida State University Feb. 23, 2009 Database Similarity Searching Steven M. Thompson Florida State University of Department Scientific Computing How

More information

doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague

doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague Praha & EU: Investujeme do vaší budoucnosti Evropský sociální fond course: Searching the Web and Multimedia Databases (BI-VWM) Tomáš Skopal, 2011 SS2010/11 doc. RNDr. Tomáš Skopal, Ph.D. Department of

More information

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT IADIS International Conference Applied Computing 2006 USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT Divya R. Singh Software Engineer Microsoft Corporation, Redmond, WA 98052, USA Abdullah

More information
