USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT
IADIS International Conference Applied Computing 2006

Divya R. Singh
Software Engineer, Microsoft Corporation, Redmond, WA 98052, USA

Abdullah N. Arslan
Assistant Professor, Computer Science Department, University of Vermont, Burlington, VT 05405, USA

Xindong Wu
Professor, Chair, Computer Science Department, University of Vermont, Burlington, VT 05405, USA

ABSTRACT

An important problem in computational biology is the alignment of a given query sequence with the sequences in a database, in order to find database sequences that are (locally or globally) similar to the query. Many heuristic algorithms for this problem are based on the idea of locating a fixed-length matching pair of substrings (called a seed) to start an alignment, and then extending this alignment using dynamic programming. We generalize this idea and take it one step further in a tool we develop, namely Sequence Comparison Tool (SCT). SCT preprocesses the database to create a special generalized suffix tree from the sequences in the database. This tree extends the definition of a generalized suffix tree by containing additional information at the nodes for the length and frequency (number of occurrences) of the corresponding substrings (patterns). A pattern is regarded as significant if it is sufficiently long and it appears many times in the database. A significant pattern shared by two sequences is an indication that the sequences are locally similar. SCT ranks the sequences with respect to the number of significant patterns they share with the query sequence. SCT reduces the database by selecting a given number of sequences with the topmost ranks. It then invokes an ordinary local alignment algorithm on this reduced database. We conducted experiments on real biological sequences, and compared SCT's performance with that of BLAST, a popular alignment tool. In these tests we used the 6-fold cross validation technique of data mining. The tests show that SCT effectively reduces the database and obtains results very similar to those of BLAST in approximately half the time taken by BLAST.

KEYWORDS

Sequence alignment, suffix tree, 6-fold cross-validation.

1. INTRODUCTION

Sequence alignment is an important problem for identifying and presenting the biologically important, yet hidden or widely dispersed, common characteristics of a set of sequences. These commonalities can reveal evolutionary histories, critical conserved motifs, or molecular structures that give clues about common biological functions. Such commonalities are also used to characterize families or super-families of proteins. These characterizations are then used in database searches to identify other potential members of a family. The sequence similarity query can be formally defined as the following problem: given a sequence Q and a database of sequences, determine one or more sequences from the database which are the closest to sequence Q. Our objective is to improve the computation time of the search compared to the existing methods while preserving the accuracy.
Local pairwise sequence alignment seeks similar segments in a given pair of sequences. A classical algorithm for this problem is the Smith-Waterman algorithm [9], which uses dynamic programming. This quadratic-time algorithm is too slow to be practical for long sequences. For the pairwise sequence alignment problem there are several heuristic algorithms, such as FASTA [6] and BLAST [1, 2]. BLAST is approximately times faster than the Smith-Waterman algorithm. One important feature of BLAST is its ability to compare a query with a database of sequences. Given the rapid growth of database sizes, this problem demands ever-growing computational resources and remains a computational challenge.

The main idea in heuristic algorithms is to first locate a fixed-length matching pair (called a seed) to start an alignment, and then use dynamic programming to extend the alignment in both directions. In this paper, we generalize this idea and take it one step further. Our research has been inspired by the use of frequency and the application of a very efficient data structure, the suffix tree [4, 5].

The suffix tree is a powerful tool to determine common patterns in sequences. It represents the internal structure of a string in a comprehensive manner. A suffix tree can be constructed in linear time [4, 5, 7, 8, 10]. A suffix tree T for an m-character-long string S is a rooted directed tree such that for any leaf i, the concatenation of the edge labels on the path from the root to leaf i spells out exactly the suffix S[i..m] of S that starts at position i. Figure 2 Part (a) shows an example, the suffix tree for $S_1$ = TATAA. A generalized suffix tree (GST) [5] is a suffix tree that combines the suffixes of a set of strings {$S_1, S_2, \ldots, S_n$}. In a GST a node may be shared by suffixes of more than one string. This is indicated by including string identifiers in the leaves. In the GST in Figure 2 Part (b), each leaf of the tree represents a suffix either from one of the two strings or from both.

Delcher et al. [3] showed that suffix trees are efficient in whole-genome pairwise alignments. Their tool MUMmer finds maximal unique matches (MUMs) using a suffix tree and combines these matches into larger alignments by using indels (inserts/deletes). In this paper we concentrate not on pairwise sequence alignment but on the sequence similarity query in a database. We aim to answer sequence similarity queries faster by first filtering the database to select the most promising sequences. The query sequence can then be compared with these sequences using a suitable sequence alignment tool such as BLAST or MUMmer.

We preprocess the database to create a generalized suffix tree from the sequences in the database. We extend the generalized suffix tree by including additional information at the nodes: at each node we compute and store the length and frequency (the number of occurrences in the database) of the corresponding pattern. For each substring (pattern) p in the database there exists on this tree a node i such that we obtain p when we spell out (concatenate) the labels of the edges on the path from the root to node i. We consider several functions that, at each node, take into account the frequency and the length of the corresponding pattern, and assign a score to each node. Our hypothesis is that high frequency indicates that the pattern points to a conserved region, and that the length parameter measures the information value of the pattern. A pattern is regarded as significant if it is sufficiently long and it appears many times in the database. We determine significant patterns by checking whether their scores are larger than or equal to a given threshold. We expect that a significant pattern is contained in biologically related sequences.
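The extended tree described above can be illustrated with a small sketch. It is not the SCT implementation: for brevity it builds a naive, uncompressed suffix trie (quadratic space) instead of a linear-time suffix tree, and the names `Node`, `build_suffix_trie`, `find_node`, and `score` are hypothetical. The score uses the length-times-frequency form normalized by database size; taking the database size to be the total number of characters is an assumption made only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One trie node; the path from the root to it spells a pattern."""
    length: int = 0                                # length of the pattern
    freq: int = 0                                  # occurrences in the database
    seq_ids: set = field(default_factory=set)      # sequences containing it
    children: dict = field(default_factory=dict)   # next character -> Node


def build_suffix_trie(sequences):
    """Insert every suffix of every sequence, updating counts on the way."""
    root = Node()
    for sid, seq in enumerate(sequences):
        for start in range(len(seq)):
            node = root
            for ch in seq[start:]:
                child = node.children.get(ch)
                if child is None:
                    child = Node(length=node.length + 1)
                    node.children[ch] = child
                # Each suffix passing through a node is one occurrence
                # of the pattern that the node represents.
                child.freq += 1
                child.seq_ids.add(sid)
                node = child
    return root


def find_node(root, pattern):
    """Return the node spelling `pattern`, or None if it never occurs."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


def score(node, db_size):
    """Frequency times length, normalized by the (assumed) database size."""
    return node.freq * node.length / db_size


# The two database strings used in the paper's Figure 2.
root = build_suffix_trie(["TATAA", "AACGA"])
aa = find_node(root, "AA")
print(aa.freq, aa.length, sorted(aa.seq_ids), score(aa, db_size=10))
```

A production version would use a compressed suffix tree so that memory stays linear in the database size; the per-node bookkeeping is the same idea.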
For each query we temporarily add the query sequence to the tree to determine sequences that share high-scoring significant patterns with the query. We identify the sequences in the database which share these patterns. We then rank the sequences with respect to the number of significant patterns they share with the query sequence. Next, we reduce the database by selecting only a given number of sequences with the topmost ranks. We expect that these are the closest sequences to the query. In the end, we apply a local alignment algorithm on the reduced database. We implemented this method in a new tool, SCT (Sequence Comparison Tool). We conducted experiments on real biological sequences. In these experiments we used the 6-fold cross validation technique of data mining. We compared SCT's performance with that of a popular alignment tool, BLAST. The tests show that our method works faster without losing accuracy.

We organize this paper as follows: We describe our method for answering sequence similarity queries in Section 2. We summarize the experimental results in Section 3. We provide concluding remarks and pointers for future work in Section 4.

2. DESIGN AND METHODOLOGY

We generate a generalized suffix tree (GST) from a given database of sequences, and then process this tree to add information that we use later to select sequences that are potentially similar to a given query sequence.
We invoke BLAST to perform pairwise sequence alignment between the query sequence and the chosen sequences in the database. Central to our approach is the processing of the tree and the selection of the sequences. We designed these steps based on the following observations:

- The substrings of a string S in the GST can be used as patterns to identify similarity between homologous sequences. This is because similar sequences contain conserved regions (or common substrings).
- We can quantify the significance of a pattern by computing a score. For a given pattern, a scoring function must take into account the length of the pattern, the number of occurrences (frequency) of the pattern in the database, and possibly the size of the database subjected to search, in order to normalize the frequency over the database size. Ideally, a given pattern with a high score carries an important feature (information) that belongs to a family of sequences with a high likelihood. For a pattern:
  - the length is an important parameter in measuring the biological information it carries;
  - the frequency is important because a pattern conserved in many biologically related sequences has a high frequency. In an ideal case, the higher the frequency, the higher the chances that it points to a family of similar sequences.
- We classify a given pattern $p$ as significant if it satisfies the following two constraints:
  - the length of $p$ is at least a given length-threshold: a significant pattern must be sufficiently long to carry important biological information, and
  - the score of $p$ is at least a given score-threshold: a significant pattern must have a sufficiently high score.
- Sequences similar to the query sequence in the database are expected to share many significant patterns with the query sequence.

We implemented our method in a tool, SCT. The flowchart in Figure 1 shows the steps of SCT:

Figure 1. Flowchart of SCT

1. Read all the sequences in the database into memory.
2. Construct a generalized suffix tree from the input sequences {$S_1, S_2, \ldots, S_n$} in the database. We use Ukkonen's algorithm [8].
3. While constructing the suffix tree, at each node $i$ for the corresponding pattern $p_i$, store the length $l(p_i)$ and the frequency $f(p_i)$ (i.e. the number of occurrences of $p_i$ in the database): we increment the frequency by one at each visited node during the addition of a new suffix.
4. For a given node $i$ we use a function $W$ to assign a score to the pattern $p_i$ corresponding to node $i$. We define $W(p_i) = f(p_i) \cdot l(p_i) / DB$, where $DB$ denotes the size of the database. We compute $W(p_i)$ when we construct the GST. We use $W(p_i)$ to measure the significance of the pattern $p_i$. As we explained earlier, our hypothesis is that biologically significant patterns are shared by many substrings of sequences, and that as the length of the pattern increases so does the information value. Therefore we incorporate both factors in $W$. We experimented with several different functions. The reason we also include $DB$ is to normalize the frequency over databases with different sizes. This is important only if we want to determine and use a score-threshold independent of the underlying database.
5. Prompt the user to obtain the number of sequences $n$ to select from the database.
6. Read in the query sequence.
7. Temporarily add the suffixes of the query sequence Q onto the generalized suffix tree. This enables us to determine which suffixes of the query are shared by the sequences in the database. The query sequence is only temporarily added to the tree so that SCT is not affected for future sequence searches. Initially the colors of all the nodes in the GST are 0. When we add the suffixes of the query sequence Q we change the color of the nodes visited in the GST to 1. This expedites the search for common patterns within the GST because we only examine those paths in the tree for patterns that contain substrings of the query sequence. Consider a query sequence Q = $S_3$ = AATGT and two sequences in the database, $S_1$ = TATAA and $S_2$ = AACGA. Figure 2 Part (c) shows the coloring of the nodes after temporarily adding the query sequence to the GST.

Figure 2. For $S_1$ = TATAA and $S_2$ = AACGA: (a) The suffix tree of $S_1$, (b) The generalized suffix tree of $S_1$ and $S_2$, and (c) The suffix tree after the query string Q = $S_3$ = AATGT is added to the GST of $S_1$ and $S_2$ in part (b). The vertices with color 1 are shown by filled circles. The color of other vertices is 0.

8. Post-process the generated tree to extract significant patterns shared by the query sequence: in this step we do the actual traversal of the generalized suffix tree to extract significant patterns. Starting at the root and in a depth-first manner we visit all nodes whose colors are 1. If the current node has no child whose color is 1 then we backtrack to its parent node. During this traversal we collect all significant patterns into a set G. The sequences may have other common patterns that are not significant. An optimal alignment between these two sequences in an ideal case contains all significant patterns.
9. Delete the query sequence from the GST once the significant patterns have been collected.
10. Pick the top 10 significant patterns from G and store them in a set P.
11. Extract into a set R the sequences that contain the significant patterns in P.
12. Do a reverse check to compute a weight for each sequence in R. We define the weight of a sequence as the number of patterns from P that it contains. Our hypothesis is that the higher this number (weight), the greater the similarity of the corresponding sequence to the query.
13. Rank the sequences according to these weights.
14. Pick the top $n$ (a user-specified number) sequences from R and write them to a new database.
15. Apply BLAST to the query and the new database of sequences.
16. Output the results of BLAST.

We use the BLAST2 implementation we obtained from NCBI's website. BLAST2 allows for insertion of gaps in alignments. For very large databases the sequences in the database can be loaded into memory in parts, the suffix tree can be created in parts, and all significant patterns can be collected in G by repeating Steps 7-8 for each part.

3. EXPERIMENTS AND RESULTS

We have used real DNA sequence databases from three species in our tests. Table 1 lists these data sets: (1) Escherichia coli; (2) Bacillus anthracis; (3) Plasmodium falciparum. The source for these databases is the website of the National Center for Biotechnology Information, established in

Table 1. Data sets we obtained from NCBI's website and used in our experiments

Database         Sequence-Length   Number of Sequences
Ecoli.nt         800               1,000
Bacteria.dna     700               1,000
Plasmodium.dna   127               1,
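Before turning to the evaluation protocol, the selection stage of Section 2 (Steps 10 to 14) can be illustrated with a minimal sketch. It assumes the significant patterns shared with the query have already been collected with their scores, that "top" patterns means highest score, and that containment is tested by plain substring search instead of tree lookups. The function name `select_candidates` and its parameters are hypothetical and are not part of SCT.

```python
def select_candidates(pattern_scores, database, top_patterns=10, n=2):
    """Rank database sequences by how many top patterns they contain.

    pattern_scores: dict mapping each significant pattern to its score
    database:       list of sequences (strings)
    Returns the indices of the n best sequences and all non-zero weights.
    """
    # Step 10: keep the highest-scoring patterns (ties broken alphabetically).
    top = sorted(pattern_scores, key=lambda p: (-pattern_scores[p], p))
    top = top[:top_patterns]

    # Steps 11-12: weight of a sequence = number of top patterns it contains.
    weights = {}
    for idx, seq in enumerate(database):
        w = sum(1 for p in top if p in seq)
        if w > 0:
            weights[idx] = w

    # Steps 13-14: rank by weight (ties broken by index) and keep the top n.
    ranked = sorted(weights, key=lambda i: (-weights[i], i))
    return ranked[:n], weights


database = ["TATAA", "AACGA", "GGGGG", "CTAAC"]
scores = {"AA": 0.9, "TA": 0.6, "AT": 0.5, "GG": 0.1}  # made-up scores
selected, weights = select_candidates(scores, database, top_patterns=3, n=2)
print(selected, weights)
```

The selected indices identify the reduced database that would then be handed to the alignment tool in Step 15.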
In designing our tests, we used the 6-fold cross validation approach, which is based on the idea "train on 5 folds, test on 1 fold", as illustrated in Figure 3. The argument in support of the combined approaches is that with a limited amount of training data, the individual classifier may not represent the true hypotheses. On the other hand, a combined classifier may produce a good approximation for the true hypotheses. This data mining technique compares two learned models. For the implementation we randomly divided training data into two disjoint sets: a training set and a validation set. The two data sets consist of DNA sequences from NCBI's website. There are three steps in implementing the 6-fold cross validation: (a) Use 5 folds for training and 1 fold for testing; (b) Run until every fold has been used for testing; (c) Calculate the average of the results from the 6 runs. The goal was to verify that the results of these runs are consistent. So the data set was split, and iteratively 5/6 of the data were used for training and the remaining 1/6 were used for testing. The average of the six runs was computed for analysis. This established the consistency in the timing for database search using SCT.

Figure 3. 6-fold cross validation approach

Table 2 summarizes the results that we obtained after conducting the experiments with SCT. In the table we also include, in bold, the results obtained by applying BLAST alone for the same queries. For each query the first row shows the results of SCT, and the second row shows those obtained by applying BLAST alone. Experiments are performed on each of the three data sets individually. Six searches are executed on each database to implement 6-fold cross-validation using different queries. The idea is to test whether the results are consistent for all the queries on a particular database in terms of computation time.
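The fold rotation described above can be sketched as follows. This is only an illustration of the generic 6-fold split, not the authors' test harness: the function name `k_fold_splits`, the fixed random seed, and the round-robin assignment of items to folds are assumptions.

```python
import random


def k_fold_splits(items, k=6, seed=0):
    """Yield (train, test) pairs; each fold serves as the test set once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)      # random, reproducible order
    folds = [shuffled[i::k] for i in range(k)]  # k disjoint folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def average(values):
    """Mean of the per-run results, e.g. query times in milliseconds."""
    values = list(values)
    return sum(values) / len(values)


# Example: 12 sequence indices give 6 runs with 10 training and 2 test items.
for train, test in k_fold_splits(range(12)):
    print(len(train), len(test))
```

In the setting of this paper, the quantity averaged over the six runs would be the measured query time for each database.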
The results show that in our tests we obtain consistent results, that our tool performs sequence comparison with good accuracy, and that a practical time improvement is achieved over BLAST.

Table 2. Average query times for Ecoli.nt 971 (1,699), for Bacteria_dna 918 (1,714), for Plasmodium_dna 1,009 (1,828). The results obtained by BLAST alone are shown within parentheses here, and in boldface in the table. Numbers in the column named Chosen Sequences are the indices of chosen sequences in the database (it does not include the query)

Columns, repeated for each of the three data sets: Data Set, Query, Chosen Sequences, Time (ms)
Ecoli 1 2, 14, Bacteria. 1 25, 6, Pasmod. 1 24, 2, 40 1,078 2, 14, 872 1,693 25, 6, 706 1,713 24, 2, 120 1, , 20, , 30, , 12, , 20, 205 1,726 6, 30, 512 1,726 8, 12, 99 1, , , , , 28 1,596 32, 11 1,711 5, 106 1, , 41, , 51, , 84, , 41, 519 1,742 8, 51, 262 1,712 63, 84, 179 1, , , , 72, , 112 1,749 2, 95 1,712 11, 72, 155 1, , 95, , , 25, 28 1,006 31, 95, 154 1,688 5, 78 1,713 14, 25, 310 1,849

We have conducted a set of controlled experiments to test the effect of each parameter alone. We compare the output of SCT with the results obtained when BLAST is used directly on the original database.

Scoring patterns: A scoring function $W$ changes the significance of patterns. We tested with $W(p_i) = f(p_i) \cdot l(p_i)$, and then with $W(p_i) = f(p_i) \cdot l(p_i)$ for a pattern $p_i$ corresponding to node $i$. The former performed better in picking the closest sequences. This suggests that in measuring the significance of a pattern the length carries more weight than the frequency.

Length-threshold: We set the threshold to 3, 4, and 5 separately. Even though the closest sequence remained the same in each case, other similar sequences were affected. We observed that the accuracy was the best when the threshold was set to 3.

Number of patterns: We experimented with setting the number of patterns to be used to 7, 8, and 10 separately. In each case, the closest sequence obtained was the same. Other sequences obtained were affected. The accuracy was the highest when the number of patterns was set to

Score-threshold: We did a set of tests with different values for the score-threshold. The best results were obtained when we set it to

It is easy to think of a worst-case and a best-case scenario for SCT. If a large number of substrings (patterns) are common to almost all sequences in the database, then based on these patterns SCT will not be able to distinguish sequences close to the query sequence. In the best case the database contains a family of sequences that share very long patterns with Q, or a sequence P which is almost identical to the query sequence Q, and SCT will be able to identify the common patterns and return the closest sequence(s) in the answer to the query very quickly.

4. CONCLUDING REMARKS AND FUTURE WORK

In this paper we have presented SCT (Sequence Comparison Tool) for answering alignment-based similarity queries against a database. SCT preprocesses the database to create a generalized suffix tree that we extend by adding frequency and length information for the patterns. The tree is resident in memory, and it is used for answering future queries. The tree can be created, fetched into memory, and used in parts in answering the similarity query by repeating certain steps for each part. SCT distinguishes patterns by computing significance scores. A pattern is regarded as significant if it is long enough and it appears frequently enough in the database. The scoring function takes into account a pattern's length and frequency and the given threshold values, and determines whether a pattern is significant. Using these, for a given query sequence SCT reduces the database to only a few sequences that share the most significant patterns with the query. This reduction in database size speeds up the local alignment of the query sequence against the database. Experimental results have shown that SCT provides a speed-up over BLAST, which is currently the dominant search engine for database searches.
SCT is able to curtail the time of a database search to nearly half the time originally taken by BLAST. Results from BLAST in our tests have shown that this method is experimentally effective, as we obtain accurate sequence alignment results from SCT. Combined with the extended suffix tree, SCT has the advantage of using BLAST to do the local sequence alignment. We applied the 6-fold cross validation technique of data mining to attain a greater accuracy in our results. The 6 runs of the cross validation help us establish that SCT performs consistently well for all the queries for a particular database included in our tests.

With this effort we have obtained very promising results. We selected a small domain of sequences from the universe, for which our experiments showed that our method works well. Our method can be further enhanced by covering databases of protein sequences to determine domain-specific parameters such as score and length thresholds, and scoring functions based on the natural frequency of amino acid patterns. We can allow for approximate matches in significant patterns. Suffix arrays can be used to improve the performance.

REFERENCES

[1] Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. Lipman, Basic local alignment search tool. Journal of Molecular Biology, 215:
[2] Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, September. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):
[3] Delcher, A. L., Kasif, S., Fleischmann, R. D., Peterson, J., White, O., and Salzberg, S. L., Alignment of whole genomes. Nucleic Acids Research, 27(11):
[4] Giegerich, R. and Kurtz, S., From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction. Algorithmica, 19:
[5] Gusfield, Dan, Algorithms on Strings, Trees, and Sequences. Cambridge University Press.
[6] Lipman, D. J. and Pearson, W. R., Rapid and sensitive protein similarity searches. Science, 227,
[7] McCreight, E. M., A space-economical suffix tree construction algorithm. J. of the ACM, 23(2),
[8] Ukkonen, E., On-line construction of suffix-trees. Algorithmica, 14:
[9] Waterman, M. S., Introduction to Computational Biology, Chapman & Hall.
[10] Weiner, P., Linear pattern matching algorithms. Proceedings of the 14th IEEE Symposium on Switching and Automata Theory,
More informationFASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm.
FASTA INTRODUCTION Definition (by David J. Lipman and William R. Pearson in 1985) - Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence
More informationCompares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA.
Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA. Fasta is used to compare a protein or DNA sequence to all of the
More informationC E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 13 G R A T I V. Iterative homology searching,
C E N T R E F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U Introduction to bioinformatics 2007 Lecture 13 Iterative homology searching, PSI (Position Specific Iterated) BLAST basic idea use
More informationMAP: Searching Large Genome Databases. T. Kahveci, A. Singh. Pacific Symposium on Biocomputing 8: (2003)
MAP: Searching Large Genome Databases T. Kahveci, A. Singh Pacific Symposium on Biocomputing 8:303-314(2003) MAP: SEARCHING LARGE GENOME DATABASES a TAMER KAHVECI AMBUJ SINGH Department of Computer Science
More informationBasic Local Alignment Search Tool (BLAST)
BLAST 26.04.2018 Basic Local Alignment Search Tool (BLAST) BLAST (Altshul-1990) is an heuristic Pairwise Alignment composed by six-steps that search for local similarities. The most used access point to
More informationA DNA Index Structure Using Frequency and Position Information of Genetic Alphabet
A DNA Index Structure Using Frequency and Position Information of Genetic Alphabet Woo-Cheol Kim 1, Sanghyun Park 1, Jung-Im Won 1, Sang-Wook Kim 2, and Jee-Hee Yoon 3 1 Department of Computer Science,
More informationResearch on Pairwise Sequence Alignment Needleman-Wunsch Algorithm
5th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering (ICMMCCE 2017) Research on Pairwise Sequence Alignment Needleman-Wunsch Algorithm Xiantao Jiang1, a,*,xueliang
More informationAn I/O device driver for bioinformatics tools: the case for BLAST
An I/O device driver for bioinformatics tools 563 An I/O device driver for bioinformatics tools: the case for BLAST Renato Campos Mauro and Sérgio Lifschitz Departamento de Informática PUC-RIO, Pontifícia
More informationCISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment
CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment Courtesy of jalview 1 Motivations Collective statistic Protein families Identification and representation of conserved sequence features
More informationA NEW GENERATION OF HOMOLOGY SEARCH TOOLS BASED ON PROBABILISTIC INFERENCE
205 A NEW GENERATION OF HOMOLOGY SEARCH TOOLS BASED ON PROBABILISTIC INFERENCE SEAN R. EDDY 1 eddys@janelia.hhmi.org 1 Janelia Farm Research Campus, Howard Hughes Medical Institute, 19700 Helix Drive,
More informationComputational Genomics and Molecular Biology, Fall
Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the
More informationBioinformatics for Biologists
Bioinformatics for Biologists Sequence Analysis: Part I. Pairwise alignment and database searching Fran Lewitter, Ph.D. Director Bioinformatics & Research Computing Whitehead Institute Topics to Cover
More informationBioinformatics explained: Smith-Waterman
Bioinformatics Explained Bioinformatics explained: Smith-Waterman May 1, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com
More informationAcceleration of Ungapped Extension in Mercury BLAST. Joseph Lancaster Jeremy Buhler Roger Chamberlain
Acceleration of Ungapped Extension in Mercury BLAST Joseph Lancaster Jeremy Buhler Roger Chamberlain Joseph Lancaster, Jeremy Buhler, and Roger Chamberlain, Acceleration of Ungapped Extension in Mercury
More informationA Coprocessor Architecture for Fast Protein Structure Prediction
A Coprocessor Architecture for Fast Protein Structure Prediction M. Marolia, R. Khoja, T. Acharya, C. Chakrabarti Department of Electrical Engineering Arizona State University, Tempe, USA. Abstract Predicting
More informationHIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT
HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT - Swarbhanu Chatterjee. Hidden Markov models are a sophisticated and flexible statistical tool for the study of protein models. Using HMMs to analyze proteins
More informationPROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota
Marina Sirota MOTIVATION: PROTEIN MULTIPLE ALIGNMENT To study evolution on the genetic level across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein
More informationSequence Alignment as a Database Technology Challenge
Sequence Alignment as a Database Technology Challenge Hans Philippi Dept. of Computing and Information Sciences Utrecht University hansp@cs.uu.nl http://www.cs.uu.nl/people/hansp Abstract. Sequence alignment
More informationCS313 Exercise 4 Cover Page Fall 2017
CS313 Exercise 4 Cover Page Fall 2017 Due by the start of class on Thursday, October 12, 2017. Name(s): In the TIME column, please estimate the time you spent on the parts of this exercise. Please try
More informationScoring and heuristic methods for sequence alignment CG 17
Scoring and heuristic methods for sequence alignment CG 17 Amino Acid Substitution Matrices Used to score alignments. Reflect evolution of sequences. Unitary Matrix: M ij = 1 i=j { 0 o/w Genetic Code Matrix:
More informationAlignment of Long Sequences
Alignment of Long Sequences BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2009 Mark Craven craven@biostat.wisc.edu Pairwise Whole Genome Alignment: Task Definition Given a pair of genomes (or other large-scale
More informationFrom Smith-Waterman to BLAST
From Smith-Waterman to BLAST Jeremy Buhler July 23, 2015 Smith-Waterman is the fundamental tool that we use to decide how similar two sequences are. Isn t that all that BLAST does? In principle, it is
More informationTHE SPECTRUM KERNEL: A STRING KERNEL FOR SVM PROTEIN CLASSIFICATION
THE SPECTRUM KERNEL: A STRING KERNEL FOR SVM PROTEIN CLASSIFICATION CHRISTINA LESLIE, ELEAZAR ESKIN, WILLIAM STAFFORD NOBLE a {cleslie,eeskin,noble}@cs.columbia.edu Department of Computer Science, Columbia
More informationAn Efficient Algorithm for Identifying the Most Contributory Substring. Ben Stephenson Department of Computer Science University of Western Ontario
An Efficient Algorithm for Identifying the Most Contributory Substring Ben Stephenson Department of Computer Science University of Western Ontario Problem Definition Related Problems Applications Algorithm
More informationDynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014
Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic programming is a group of mathematical methods used to sequentially split a complicated problem into
More informationProfiles and Multiple Alignments. COMP 571 Luay Nakhleh, Rice University
Profiles and Multiple Alignments COMP 571 Luay Nakhleh, Rice University Outline Profiles and sequence logos Profile hidden Markov models Aligning profiles Multiple sequence alignment by gradual sequence
More informationFastCluster: a graph theory based algorithm for removing redundant sequences
J. Biomedical Science and Engineering, 2009, 2, 621-625 doi: 10.4236/jbise.2009.28090 Published Online December 2009 (http://www.scirp.org/journal/jbise/). FastCluster: a graph theory based algorithm for
More informationA Fast Algorithm for Optimal Alignment between Similar Ordered Trees
Fundamenta Informaticae 56 (2003) 105 120 105 IOS Press A Fast Algorithm for Optimal Alignment between Similar Ordered Trees Jesper Jansson Department of Computer Science Lund University, Box 118 SE-221
More informationChapter 4: Blast. Chaochun Wei Fall 2014
Course organization Introduction ( Week 1-2) Course introduction A brief introduction to molecular biology A brief introduction to sequence comparison Part I: Algorithms for Sequence Analysis (Week 3-11)
More informationReconstructing long sequences from overlapping sequence fragment. Searching databases for related sequences and subsequences
SEQUENCE ALIGNMENT ALGORITHMS 1 Why compare sequences? Reconstructing long sequences from overlapping sequence fragment Searching databases for related sequences and subsequences Storing, retrieving and
More informationPacific Symposium on Biocomputing 4: (1999)
EFFECTIVE QUERY FILTERING FOR FAST HOMOLOGY SEARCHING HUGH E. WILLIAMS Department of Computer Science, RMIT University, GPO Box 2476V, Melbourne 3001, Australia hugh@cs.rmit.edu.au To improve the accuracy
More informationNotes on Dynamic-Programming Sequence Alignment
Notes on Dynamic-Programming Sequence Alignment Introduction. Following its introduction by Needleman and Wunsch (1970), dynamic programming has become the method of choice for rigorous alignment of DNA
More informationDatabase Searching Using BLAST
Mahidol University Objectives SCMI512 Molecular Sequence Analysis Database Searching Using BLAST Lecture 2B After class, students should be able to: explain the FASTA algorithm for database searching explain
More informationAn improved algorithm for the regular expression constrained multiple sequence alignment problem
An improved algorithm for the regular expression constrained multiple sequence alignment problem Abdullah N. Arslan and Dan He Department of Computer Science University of Vermont Burlington, VT 05405,
More informationA CAM(Content Addressable Memory)-based architecture for molecular sequence matching
A CAM(Content Addressable Memory)-based architecture for molecular sequence matching P.K. Lala 1 and J.P. Parkerson 2 1 Department Electrical Engineering, Texas A&M University, Texarkana, Texas, USA 2
More information1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998
7 Multiple Sequence Alignment The exposition was prepared by Clemens Gröpl, based on earlier versions by Daniel Huson, Knut Reinert, and Gunnar Klau. It is based on the following sources, which are all
More informationKeywords Pattern Matching Algorithms, Pattern Matching, DNA and Protein Sequences, comparison per character
Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Index Based Multiple
More informationICB Fall G4120: Introduction to Computational Biology. Oliver Jovanovic, Ph.D. Columbia University Department of Microbiology
ICB Fall 2008 G4120: Computational Biology Oliver Jovanovic, Ph.D. Columbia University Department of Microbiology Copyright 2008 Oliver Jovanovic, All Rights Reserved. The Digital Language of Computers
More informationsplitmem: graphical pan-genome analysis with suffix skips Shoshana Marcus May 7, 2014
splitmem: graphical pan-genome analysis with suffix skips Shoshana Marcus May 7, 2014 Outline 1 Overview 2 Data Structures 3 splitmem Algorithm 4 Pan-genome Analysis Objective Input! Output! A B C D Several
More informationRevisiting the Speed-versus-Sensitivity Tradeoff in Pairwise Sequence Search
Revisiting the Speed-versus-Sensitivity Tradeoff in Pairwise Sequence Search Ashwin M. Aji and Wu-chun Feng The Synergy Laboratory Department of Computer Science Virginia Tech {aaji,feng}@cs.vt.edu Abstract
More informationDatabase Similarity Searching
An Introduction to Bioinformatics BSC4933/ISC5224 Florida State University Feb. 23, 2009 Database Similarity Searching Steven M. Thompson Florida State University of Department Scientific Computing How
More informationFINDING APPROXIMATE REPEATS WITH MULTIPLE SPACED SEEDS
FINDING APPROXIMATE REPEATS WITH MULTIPLE SPACED SEEDS FINDING APPROXIMATE REPEATS IN DNA SEQUENCES USING MULTIPLE SPACED SEEDS By SARAH BANYASSADY, B.S. A Thesis Submitted to the School of Graduate Studies
More informationAn Efficient Algorithm for Finding Similar Short Substrings from Large Scale String Data
An Efficient Algorithm for Finding Similar Short Substrings from Large Scale String Data Takeaki Uno uno@nii.jp, National Institute of Informatics 2-1-2, Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
More informationA New Method for Database Searching and Clustering
90 \ A New Method for Database Searching and Clustering Antje Krause Martin Vingron a.krause@dkfz-heidelberg.de m.vingron@dkfz-heidelberg.de Deutsches Krebsforschungszentrum (DKFZ), Abt. Theoretische Bioinformatik
More informationJET 2 User Manual 1 INSTALLATION 2 EXECUTION AND FUNCTIONALITIES. 1.1 Download. 1.2 System requirements. 1.3 How to install JET 2
JET 2 User Manual 1 INSTALLATION 1.1 Download The JET 2 package is available at www.lcqb.upmc.fr/jet2. 1.2 System requirements JET 2 runs on Linux or Mac OS X. The program requires some external tools
More informationA Scalable Coprocessor for Bioinformatic Sequence Alignments
A Scalable Coprocessor for Bioinformatic Sequence Alignments Scott F. Smith Department of Electrical and Computer Engineering Boise State University Boise, ID, U.S.A. Abstract A hardware coprocessor for
More informationBIOL 7020 Special Topics Cell/Molecular: Molecular Phylogenetics. Spring 2010 Section A
BIOL 7020 Special Topics Cell/Molecular: Molecular Phylogenetics. Spring 2010 Section A Steve Thompson: stthompson@valdosta.edu http://www.bioinfo4u.net 1 Similarity searching and homology First, just
More informationCombinatorial Pattern Matching. CS 466 Saurabh Sinha
Combinatorial Pattern Matching CS 466 Saurabh Sinha Genomic Repeats Example of repeats: ATGGTCTAGGTCCTAGTGGTC Motivation to find them: Genomic rearrangements are often associated with repeats Trace evolutionary
More informationIntroduction to Phylogenetics Week 2. Databases and Sequence Formats
Introduction to Phylogenetics Week 2 Databases and Sequence Formats I. Databases Crucial to bioinformatics The bigger the database, the more comparative research data Requires scientists to upload data
More informationINTRODUCTION TO BIOINFORMATICS
Molecular Biology-2017 1 INTRODUCTION TO BIOINFORMATICS In this section, we want to provide a simple introduction to using the web site of the National Center for Biotechnology Information NCBI) to obtain
More informationAlgorithms in Bioinformatics: A Practical Introduction. Database Search
Algorithms in Bioinformatics: A Practical Introduction Database Search Biological databases Biological data is double in size every 15 or 16 months Increasing in number of queries: 40,000 queries per day
More informationPSIST: Indexing Protein Structures using Suffix Trees
PSIST: Indexing Protein Structures using Suffix Trees Feng Gao and Mohammed J. Zaki {gaof,zaki}@cs.rpi.edu Department of Computer Science Rensselaer Polytechnic Institute 110 8th Street, Troy, NY, 1180
More informationPAPER Constructing the Suffix Tree of a Tree with a Large Alphabet
IEICE TRANS. FUNDAMENTALS, VOL.E8??, NO. JANUARY 999 PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet Tetsuo SHIBUYA, SUMMARY The problem of constructing the suffix tree of a tree is
More informationLecture 7 February 26, 2010
6.85: Advanced Data Structures Spring Prof. Andre Schulz Lecture 7 February 6, Scribe: Mark Chen Overview In this lecture, we consider the string matching problem - finding all places in a text where some
More informationTCCAGGTG-GAT TGCAAGTGCG-T. Local Sequence Alignment & Heuristic Local Aligners. Review: Probabilistic Interpretation. Chance or true homology?
Local Sequence Alignment & Heuristic Local Aligners Lectures 18 Nov 28, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall
More information15-780: Graduate Artificial Intelligence. Computational biology: Sequence alignment and profile HMMs
5-78: Graduate rtificial Intelligence omputational biology: Sequence alignment and profile HMMs entral dogma DN GGGG transcription mrn UGGUUUGUG translation Protein PEPIDE 2 omparison of Different Organisms
More informationBioinformatics I, WS 09-10, D. Huson, February 10,
Bioinformatics I, WS 09-10, D. Huson, February 10, 2010 189 12 More on Suffix Trees This week we study the following material: WOTD-algorithm MUMs finding repeats using suffix trees 12.1 The WOTD Algorithm
More informationSemi-supervised protein classification using cluster kernels
Semi-supervised protein classification using cluster kernels Jason Weston Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany weston@tuebingen.mpg.de Dengyong Zhou, Andre Elisseeff
More informationA BANDED SMITH-WATERMAN FPGA ACCELERATOR FOR MERCURY BLASTP
A BANDED SITH-WATERAN FPGA ACCELERATOR FOR ERCURY BLASTP Brandon Harris*, Arpith C. Jacob*, Joseph. Lancaster*, Jeremy Buhler*, Roger D. Chamberlain* *Dept. of Computer Science and Engineering, Washington
More informationCAP BLAST. BIOINFORMATICS Su-Shing Chen CISE. 8/20/2005 Su-Shing Chen, CISE 1
CAP 5510-6 BLAST BIOINFORMATICS Su-Shing Chen CISE 8/20/2005 Su-Shing Chen, CISE 1 BLAST Basic Local Alignment Prof Search Su-Shing Chen Tool A Fast Pair-wise Alignment and Database Searching Tool 8/20/2005
More informationComparison of Sequence Similarity Measures for Distant Evolutionary Relationships
Comparison of Sequence Similarity Measures for Distant Evolutionary Relationships Abhishek Majumdar, Peter Z. Revesz Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln,
More informationBIOINFORMATICS. Mismatch string kernels for discriminative protein classification
BIOINFORMATICS Vol. 1 no. 1 2003 Pages 1 10 Mismatch string kernels for discriminative protein classification Christina Leslie 1, Eleazar Eskin 1, Adiel Cohen 1, Jason Weston 2 and William Stafford Noble
More informationAccurate Long-Read Alignment using Similarity Based Multiple Pattern Alignment and Prefix Tree Indexing
Proposal for diploma thesis Accurate Long-Read Alignment using Similarity Based Multiple Pattern Alignment and Prefix Tree Indexing Astrid Rheinländer 01-09-2010 Supervisor: Prof. Dr. Ulf Leser Motivation
More informationPyMod Documentation (Version 2.1, September 2011)
PyMod User s Guide PyMod Documentation (Version 2.1, September 2011) http://schubert.bio.uniroma1.it/pymod/ Emanuele Bramucci & Alessandro Paiardini, Francesco Bossa, Stefano Pascarella, Department of
More information