USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT


IADIS International Conference Applied Computing 2006

Divya R. Singh
Software Engineer, Microsoft Corporation, Redmond, WA 98052, USA

Abdullah N. Arslan
Assistant Professor, Computer Science Department, University of Vermont, Burlington, VT 05405, USA

Xindong Wu
Professor, Chair, Computer Science Department, University of Vermont, Burlington, VT 05405, USA

ABSTRACT

An important problem in computational biology is aligning a query sequence against the sequences in a database to find the database sequences that are similar, locally or globally, to the query. Many heuristic algorithms for this problem locate a fixed-length matching pair of substrings (called a seed) to start an alignment, and then extend this alignment using dynamic programming. We generalize this idea and take it one step further in a tool we develop, the Sequence Comparison Tool (SCT). SCT preprocesses the database to create a special generalized suffix tree from its sequences. This tree extends the definition of a generalized suffix tree by storing additional information at the nodes: the length and the frequency (number of occurrences) of the corresponding substrings (patterns). A pattern is regarded as significant if it is sufficiently long and appears many times in the database. A significant pattern shared by two sequences indicates that the sequences are locally similar. SCT ranks the sequences by the number of significant patterns they share with the query sequence, reduces the database by selecting a given number of top-ranked sequences, and then invokes an ordinary local alignment algorithm on this reduced database. We conducted experiments on real biological sequences and compared SCT's performance with that of the popular alignment tool BLAST. In these tests we used the 6-fold cross validation technique of data mining. The tests show that SCT effectively reduces the database and obtains results very similar to those of BLAST in approximately half the time taken by BLAST.

KEYWORDS

Sequence alignment, suffix tree, 6-fold cross-validation.

1. INTRODUCTION

Sequence alignment is an important technique for identifying and presenting the biologically important, yet hidden or widely dispersed, common characteristics of a set of sequences. These commonalities can reveal evolutionary histories, critical conserved motifs, or molecular structures that give clues about common biological functions. Such commonalities are also used to characterize families or super-families of proteins. These characterizations are then used in database searches to identify other potential members of a family.

The sequence similarity query can be formally defined as the following problem: given a sequence Q and a database of sequences, determine one or more sequences from the database which are the closest to Q. Our objective is to improve the computation time of the search compared to the existing methods while preserving the accuracy.

Local pairwise sequence alignment seeks similar segments in a given pair of sequences. A classical algorithm for this problem is the Smith-Waterman algorithm [9], which uses dynamic programming. This quadratic-time algorithm is too slow to be practical for long sequences. For the pairwise sequence alignment problem there are several heuristic algorithms such as FASTA [6] and BLAST [1, 2]. BLAST is considerably faster than the Smith-Waterman algorithm. One important feature of BLAST is its ability to compare a query with a database of sequences. Given the rapid growth of database sizes, this problem demands ever-growing computational resources and remains a computational challenge.

The main idea in heuristic algorithms is to first locate a fixed-length matching pair (called a seed) to start an alignment, and then to use dynamic programming to extend the alignment in both directions. In this paper, we generalize this idea and take it one step further. Our research has been inspired by the use of frequency and by the application of a very efficient data structure, the suffix tree [4, 5].

The suffix tree is a powerful tool for determining common patterns in sequences. It represents the internal structure of a string in a comprehensive manner, and it can be constructed in linear time [4, 5, 7, 8, 10]. A suffix tree T for an m-character-long string S is a rooted directed tree such that, for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix S[i..m] of S that starts at position i. Figure 2 Part (a) shows an example, the suffix tree for S_1 = TATAA. A generalized suffix tree (GST) [5] is a suffix tree that combines the suffixes of a set of strings {S_1, S_2, ..., S_n}. In a GST a node may be shared by suffixes of more than one string. This is indicated by including string identifiers in the leaves. In the GST in Figure 2 Part (b), each leaf of the tree represents a suffix either from one of the two strings or from both.

Delcher et al. [3] showed that suffix trees are efficient in whole-genome pairwise alignments. Their tool MUMmer finds maximal unique matches (MUMs) using a suffix tree and combines these matches into larger alignments by using indels (inserts/deletes). In this paper we concentrate not on pairwise sequence alignment but on the sequence similarity query in a database. We aim to answer a sequence similarity query faster by first filtering the database to select the most promising sequences. The query sequence can then be compared with these sequences using a suitable sequence alignment tool such as BLAST or MUMmer.

We preprocess the database to create a generalized suffix tree from its sequences. We extend the generalized suffix tree by including additional information at the nodes: we compute, and store at each node, the length and the frequency (the number of occurrences in the database) of the corresponding pattern. For each substring (pattern) p in the database there exists a node i on this tree such that we obtain p when we spell out (concatenate) the labels of the edges on the path from the root to node i. We consider several functions that, at each node, take into account the frequency and the length of the corresponding pattern and assign a score to the node. Our hypothesis is that high frequency indicates that the pattern points to a conserved region, and that the length measures the information value of the pattern. A pattern is regarded as significant if it is sufficiently long and appears many times in the database. We determine significant patterns by checking whether their scores are larger than or equal to a given threshold. We expect a significant pattern to be contained in biologically-related sequences.
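The extended tree described above can be illustrated with a minimal Python sketch. It is not the SCT implementation: it builds a naive, uncompressed generalized suffix trie in quadratic time rather than a linear-time suffix tree, and the names Node, build_extended_trie, and lookup are hypothetical. It only shows how a length, a frequency, and the identifiers of the containing sequences can be kept at every node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    length: int = 0      # l(p): length of the pattern spelled from the root
    freq: int = 0        # f(p): occurrences of the pattern in the database
    seq_ids: set = field(default_factory=set)    # sequences containing p
    children: dict = field(default_factory=dict)


def build_extended_trie(sequences):
    """Insert every suffix of every sequence (naive, quadratic sketch)."""
    root = Node()
    for sid, seq in enumerate(sequences):
        for start in range(len(seq)):
            node = root
            for ch in seq[start:]:
                child = node.children.get(ch)
                if child is None:
                    child = Node(length=node.length + 1)
                    node.children[ch] = child
                # Each suffix passing through a node is one occurrence
                # of the pattern that the node represents.
                child.freq += 1
                child.seq_ids.add(sid)
                node = child
    return root


def lookup(root, pattern):
    """Return the node for `pattern`, or None if it does not occur."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return None
    return node
```

For the two strings of Figure 2, build_extended_trie(["TATAA", "AACGA"]) yields a node for the pattern AA that records length 2, frequency 2, and both sequence identifiers. A production version would use a compressed suffix tree built in linear time, as the paper does.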
For each query we temporarily add the query sequence to the tree and identify the sequences in the database that share high-scoring significant patterns with it. We then rank the sequences with respect to the number of significant patterns they share with the query sequence. Next, we reduce the database by selecting only a given number of top-ranked sequences. We expect these to be the closest sequences to the query. Finally, we apply a local alignment algorithm to the reduced database.

We implemented this method in a new tool, SCT (Sequence Comparison Tool). We conducted experiments on real biological sequences, using the 6-fold cross validation technique of data mining, and compared SCT's performance with that of the popular alignment tool BLAST. The tests show that our method works faster without losing accuracy.

We organize this paper as follows: We describe our method for answering sequence similarity queries in Section 2. We summarize the experimental results in Section 3. We provide concluding remarks and pointers for future work in Section 4.

2. DESIGN AND METHODOLOGY

We generate a generalized suffix tree (GST) from a given database of sequences, and then process this tree to add information that we later use to select sequences that are potentially similar to a given query sequence.

We invoke BLAST to perform pairwise sequence alignment between the query sequence and the chosen sequences in the database. Central to our approach are the processing of the tree and the selection of the sequences. We designed these steps based on the following observations:

The substrings of a string S in the GST can be used as patterns to identify similarity between homologous sequences, because similar sequences contain conserved regions (common substrings).

We can quantify the significance of a pattern by computing a score. For a given pattern, a scoring function must take into account the length of the pattern, the number of occurrences (frequency) of the pattern in the database, and possibly the size of the database being searched, in order to normalize the frequency over the database size. Ideally, a pattern with a high score carries an important feature (information) that belongs, with high likelihood, to a family of sequences.

For a pattern, the length is an important parameter in measuring the biological information it carries, and the frequency is important because a pattern conserved in many biologically-related sequences has a high frequency. In an ideal case, the higher the frequency, the higher the chance that the pattern points to a family of similar sequences.

We classify a given pattern p as significant if it satisfies the following two constraints: (i) the length of p ≥ a given length-threshold, since a significant pattern must be sufficiently long to carry important biological information; and (ii) the score of p ≥ a given score-threshold, since a significant pattern must have a sufficiently high score.

Sequences in the database that are similar to the query sequence are expected to share many significant patterns with it.

We implemented our method in a tool, SCT. The flowchart in Figure 1 shows the steps of SCT:

Figure 1. Flowchart of SCT

1. Read all the sequences in the database into memory.
2. Construct a generalized suffix tree from the input sequences {S_1, S_2, ..., S_n} in the database. We use Ukkonen's algorithm [8].
3. While constructing the suffix tree, at each node i, for the corresponding pattern p_i, store the length l(p_i) and the frequency f(p_i) (i.e. the number of occurrences of p_i in the database): we increment the frequency by one at each node visited during the addition of a new suffix.
4. For a given node i, use a function W to assign a score to the pattern p_i corresponding to node i. We define

   W(p_i) = f(p_i) * l(p_i) / DB

   where DB denotes the size of the database. We compute W(p_i) when we construct the GST, and we use it to measure the significance of the pattern p_i. As explained earlier, our hypothesis is that biologically significant patterns are shared by many sequences, and that the information value of a pattern increases with its length. Therefore we incorporate both factors in W. We experimented with several different functions. We include DB to normalize the frequency over databases of different sizes. This matters only if we want to determine and use a score-threshold that is independent of the underlying database.
5. Prompt the user for the number of sequences n to select from the database.
6. Read in the query sequence.
7. Temporarily add the suffixes of the query sequence Q to the generalized suffix tree. This enables us to determine which suffixes of the query are shared by the sequences in the database.

   The query sequence is added to the tree only temporarily, so that future sequence searches with SCT are not affected. Initially the colors of all the nodes in the GST are 0. When we add the suffixes of the query sequence Q, we change the color of the visited nodes to 1. This expedites the search for common patterns within the GST, because we examine only those paths in the tree that spell substrings of the query sequence. Consider a query sequence Q = S_3 = AATGT and two sequences in the database, S_1 = TATAA and S_2 = AACGA. Figure 2 Part (c) shows the coloring of the nodes after temporarily adding the query sequence to the GST.

Figure 2. For S_1 = TATAA and S_2 = AACGA: (a) the suffix tree of S_1, (b) the generalized suffix tree of S_1 and S_2, and (c) the suffix tree after the query string Q = S_3 = AATGT is added to the GST of S_1 and S_2 in part (b). The vertices with color 1 are shown by filled circles. The color of the other vertices is 0.

8. Post-process the generated tree to extract the significant patterns shared by the query sequence. In this step we do the actual traversal of the generalized suffix tree. Starting at the root, we visit in a depth-first manner all nodes whose colors are 1. If the current node has no child whose color is 1, we backtrack to its parent node. During this traversal we collect all significant patterns into a set G. The sequences may have other common patterns that are not significant. In an ideal case, an optimal alignment between the query and a similar database sequence contains all of their shared significant patterns.
9. Delete the query sequence from the GST once the significant patterns have been collected.
10. Pick the top 10 significant patterns from G and store them in a set P.
11. Extract into a set R the sequences that contain the significant patterns in P.
12. Do a reverse check to compute a weight for each sequence in R. We define the weight of a sequence as the number of patterns from P that it contains. Our hypothesis is that the higher this weight, the greater the similarity of the corresponding sequence to the query.
13. Rank the sequences according to these weights.
14. Pick the top n (a user-specified number) sequences from R and write them to a new database.
15. Apply BLAST to the query and the new database of sequences.
16. Output the results of BLAST.

We use the BLAST2 implementation that we obtained from NCBI's website. BLAST2 allows for the insertion of gaps in alignments.

For very large databases, the sequences can be loaded into memory in parts, the suffix tree can be created in parts, and all significant patterns can be collected in G by repeating Steps 7-8 for each part.

3. EXPERIMENTS AND RESULTS

We used real DNA sequence databases from three species in our tests. Table 1 lists these data sets: (1) Escherichia coli; (2) Bacillus anthracis; (3) Plasmodium falciparum. The source for these databases is the website of the National Center for Biotechnology Information (NCBI).

Table 1. Data sets we obtained from NCBI's website and used in our experiments

Database          Sequence-Length   Number of Sequences
Ecoli.nt          800               1,000
Bacteria.dna      700               1,000
Plasmodium.dna    127               1,
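The filtering steps of Section 2 (scoring, collecting shared significant patterns, and ranking) can be sketched in a few lines of Python. This is a hedged illustration, not SCT itself: it swaps the colored suffix tree for a plain dictionary of substrings, which is far less efficient but easier to read. The function names, the default thresholds, the reading of DB as the total number of characters in the database, and the reading of "top" patterns as those with the highest score W are all assumptions made for the example.

```python
from collections import defaultdict


def index_patterns(sequences, min_len):
    """Map each substring of length >= min_len to
    [occurrence count, ids of the sequences that contain it]."""
    index = defaultdict(lambda: [0, set()])
    for sid, seq in enumerate(sequences):
        for i in range(len(seq)):
            for j in range(i + min_len, len(seq) + 1):
                entry = index[seq[i:j]]
                entry[0] += 1
                entry[1].add(sid)
    return index


def rank_sequences(sequences, query, min_len=3, min_score=0.0,
                   top_patterns=10, top_n=2):
    # Assumption: DB is the total number of characters in the database.
    db_size = sum(len(s) for s in sequences)
    index = index_patterns(sequences, min_len)

    # Set G: significant patterns shared by the query and the database.
    shared = {}
    for i in range(len(query)):
        for j in range(i + min_len, len(query) + 1):
            p = query[i:j]
            if p in index:
                score = index[p][0] * len(p) / db_size   # W = f * l / DB
                if score >= min_score:
                    shared[p] = score

    # Set P: the highest-scoring patterns (ties broken alphabetically).
    best = sorted(shared, key=lambda p: (-shared[p], p))[:top_patterns]

    # Weight of a sequence = number of patterns from P that it contains.
    weights = defaultdict(int)
    for p in best:
        for sid in index[p][1]:
            weights[sid] += 1

    ranked = sorted(weights, key=lambda sid: (-weights[sid], sid))
    return ranked[:top_n], best
```

The returned indices identify the reduced database that would then be handed to a local alignment tool; sequences that share no significant pattern with the query never appear in the ranking.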

In designing our tests, we used the 6-fold cross validation approach, which is based on the idea "train on 5 folds, test on 1 fold", as illustrated in Figure 3. The argument in support of such combined approaches is that, with a limited amount of training data, an individual classifier may not represent the true hypotheses, whereas a combined classifier may produce a good approximation of them. This data mining technique compares two learned models. For the implementation we randomly divided the data into two disjoint sets: a training set and a validation set. Both sets consist of DNA sequences from NCBI's website. There are three steps in implementing the 6-fold cross validation: (a) use 5 folds for training and 1 fold for testing; (b) repeat until every fold has been used once for testing; (c) calculate the average of the results from the 6 runs. The goal was to verify that the results of these runs are consistent. The data set was therefore split, and iteratively 5/6 of the data were used for training and the remaining 1/6 for testing. The average of the six runs was computed for analysis. This established the consistency of the timing for database search using SCT.

Figure 3. 6-fold cross validation approach

Table 2 summarizes the results that we obtained from the experiments with SCT. The table also includes, in bold, the results obtained by applying BLAST alone to the same queries. For each query, the first row gives the results of SCT and the second row those of BLAST alone. Experiments were performed on each of the three data sets individually. Six searches, with different queries, were executed on each database to implement 6-fold cross-validation. The idea is to test whether the results are consistent, in terms of computation time, for all the queries on a particular database.
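The 6-fold protocol described above can be sketched as follows. This is a generic illustration in Python, not the authors' test harness: the function names, the fixed random seed, and the measure callback (which stands for whatever quantity is recorded per run, such as a query time) are assumptions.

```python
import random


def k_fold_splits(items, k=6, seed=0):
    """Yield (train, test) pairs so that each fold is the test set once."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for t in range(k):
        test = folds[t]
        train = [x for i, fold in enumerate(folds) if i != t for x in fold]
        yield train, test


def average_over_folds(items, measure, k=6, seed=0):
    """Run measure(train, test) on every split and average the results."""
    results = [measure(train, test)
               for train, test in k_fold_splits(items, k, seed)]
    return sum(results) / len(results)
```

With 12 items and k=6, each run trains on 10 items and tests on the remaining 2, and every item is tested exactly once across the six runs.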
The results show that our tests give consistent results: our tool performs sequence comparison with good accuracy and achieves a practical time improvement over BLAST.

Table 2. Average query times (ms): Ecoli.nt 971 (1,699), Bacteria_dna 918 (1,714), Plasmodium_dna 1,009 (1,828). The results obtained by BLAST alone are shown within parentheses here, and in boldface in the table. Numbers in the column named Chosen Sequences are the indices of the chosen sequences in the database (the query is not included).

[Table 2 rows not recoverable. Columns: Data Set, Query, Chosen Sequences, Time (ms), repeated for Ecoli, Bacteria, and Plasmodium.]

We conducted a set of controlled experiments to test the effect of each parameter alone. We compare the output of SCT with the results obtained when BLAST is used directly on the original database.

Scoring patterns: The scoring function W changes the significance of patterns. We tested two variants of W, both built from f(p_i) and l(p_i), for a pattern p_i corresponding to node i. The first variant performed better in picking the closest sequences. This suggests that, in measuring the significance of a pattern, the length carries more weight than the frequency.

Length-threshold: We set the threshold to 3, 4, and 5 separately. Even though the closest sequence remained the same in each case, the other similar sequences were affected. We observed that the accuracy was best when the threshold was set to 3.

Number of patterns: We experimented with setting the number of patterns to be used to 7, 8, and 10 separately.
In each case, the closest sequence obtained was the same. Other sequences obtained were affected. The accuracy was the highest when the number of patterns was set to

Score-threshold: We did a set of tests with different values for the score-threshold. The best results were obtained when we set it to

It is easy to think of a worst-case and a best-case scenario for SCT. If a large number of substrings (patterns) are common to almost all sequences in the database, then SCT will not be able to use these patterns to distinguish the sequences that are close to the query sequence. In the best case the database contains a family of sequences that share very long patterns with Q, or a sequence P which is almost identical to the query sequence Q. SCT will then be able to identify the common patterns and return the closest sequence(s) in answer to the query very quickly.

4. CONCLUDING REMARKS AND FUTURE WORK

In this paper we have presented SCT (Sequence Comparison Tool) for answering alignment-based similarity queries against a database. SCT preprocesses the database to create a generalized suffix tree that we extend by adding frequency and length information for the patterns. The tree is resident in memory and is used for answering future queries. The tree can also be created, fetched into memory, and used in parts, by repeating certain steps for each part. SCT distinguishes patterns by computing significance scores. A pattern is regarded as significant if it is long enough and appears frequently enough in the database. The scoring function takes into account a pattern's length and frequency and, together with the given threshold values, determines whether the pattern is significant. Using these, for a given query sequence SCT reduces the database to only a few sequences that share the most significant patterns with the query. This reduction in database size speeds up the local alignment of the query sequence against the database. Experimental results have shown that SCT provides a speed-up over BLAST, which is currently the dominant search engine for database searches.
It is able to cut the time of a database search to nearly half the time originally taken by BLAST. Comparison with the results from BLAST in our tests has shown that this method is experimentally effective, as we obtain accurate sequence alignment results from SCT. Combined with the extended suffix tree, SCT has the advantage of using BLAST to do the local sequence alignment. We applied the 6-fold cross validation technique of data mining to attain greater accuracy in our results. The 6 runs of the cross validation help us establish that SCT performs consistently well for all the queries on each database included in our tests.

With this effort we have obtained very promising results. We selected a small domain of sequences, for which our experiments showed that our method works well. Our method can be further enhanced by covering databases of protein sequences to determine domain-specific parameters, such as score and length thresholds, and scoring functions based on the natural frequency of amino acid patterns. We can allow for approximate matches in significant patterns. Suffix arrays can be used to improve the performance.

REFERENCES

[1] Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. Lipman, Basic local alignment search tool. Journal of Molecular Biology, 215:
[2] Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, September. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25(17):
[3] Delcher, A. L., Kasif, S., Fleischmann, R. D., Peterson, J., White, O., and Salzberg, S. L., Alignment of whole genomes. Nucleic Acids Research, 27(11):
[4] Giegerich, R. and Kurtz, S., From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction. Algorithmica, 19:
[5] Gusfield, Dan, Algorithms on Strings, Trees, and Sequences. Cambridge University Press.
[6] Lipman, D. J. and Pearson, W. R., Rapid and sensitive protein similarity searches. Science, 227,
[7] McCreight, E. M., A space-economical suffix tree construction algorithm. J. of the ACM, 23(2),
[8] Ukkonen, E., On-line construction of suffix-trees. Algorithmica, 14:
[9] Waterman, M. S., Introduction to Computational Biology, Chapman & Hall.
[10] Weiner, P., Linear pattern matching algorithms. Proceedings of the 14th IEEE Symposium on Switching and Automata Theory,


More information

Comparative Analysis of Protein Alignment Algorithms in Parallel environment using CUDA

Comparative Analysis of Protein Alignment Algorithms in Parallel environment using CUDA Comparative Analysis of Protein Alignment Algorithms in Parallel environment using BLAST versus Smith-Waterman Shadman Fahim shadmanbracu09@gmail.com Shehabul Hossain rudrozzal@gmail.com Gulshan Jubaed

More information

OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT

OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT Asif Ali Khan*, Laiq Hassan*, Salim Ullah* ABSTRACT: In bioinformatics, sequence alignment is a common and insistent task. Biologists align

More information

Biology 644: Bioinformatics

Biology 644: Bioinformatics Find the best alignment between 2 sequences with lengths n and m, respectively Best alignment is very dependent upon the substitution matrix and gap penalties The Global Alignment Problem tries to find

More information

Special course in Computer Science: Advanced Text Algorithms

Special course in Computer Science: Advanced Text Algorithms Special course in Computer Science: Advanced Text Algorithms Lecture 8: Multiple alignments Elena Czeizler and Ion Petre Department of IT, Abo Akademi Computational Biomodelling Laboratory http://www.users.abo.fi/ipetre/textalg

More information

A Suffix Tree Construction Algorithm for DNA Sequences

A Suffix Tree Construction Algorithm for DNA Sequences A Suffix Tree Construction Algorithm for DNA Sequences Hongwei Huo School of Computer Science and Technol Xidian University Xi 'an 710071, China Vojislav Stojkovic Computer Science Department Morgan State

More information
