Correlogram-based method for comparing biological sequences


Debasis Mitra*, Gandhali Samant* and Kuntal Sengupta+
* Department of Computer Sciences, Florida Institute of Technology, Melbourne, Florida, USA, {dmitra, gsamant}@fit.edu
+ Authentec Corporation, Melbourne, Florida, USA

Abstract. In this article we propose an abstract representation for a sequence using a constant-sized 3D matrix. The representation may subsequently be utilized for many analytical purposes. We have used it for comparing sequences, and we have analyzed the method's asymptotic complexity. Providing a metric for sequence comparison is an operation underlying many bioinformatics applications. To show the effectiveness of the proposed sequence comparison technique, we have generated phylogenies over two sets of bio-sequences and compared them with the ones available in the literature. The results show that our technique is comparable to the standard ones. The technique, called the correlogram-based method, is borrowed from the image analysis area. We have also done some experiments with synthetically generated sequences in order to compare the correlogram-based method with the well-known dynamic programming method. Finally, we discuss some other possibilities for how our method can be used or extended.

1. Introduction

Sequence comparison is one of the most fundamental operations in many problems in bioinformatics. For this reason many sequence comparison techniques have been developed in the literature, sometimes targeting specific problems. In this article we propose a novel comparison method and show a few of its uses. Possibly the most accurate sequence comparison technique is the dynamic programming algorithm of Smith and Waterman [1981]. The primary objective of this algorithm is to align two sequences optimally.
When the objective is to obtain a distance or a similarity value between two sequences, global alignment provides a mechanism to achieve that. The value of the optimizing function in that case is typically utilized as a similarity parameter. Another popular and efficient sequence comparison method is the BLAST [Altschul, 1990] algorithm. However, its primary purpose is to find homologous longest common subsequences between two bio-sequences. BLAST is a problem-specific algorithm and is not a competitor to our proposed method.

Our proposed technique is based on a similar method introduced for comparing two images [Huang et al, 1999]. While an image is a two-dimensional organization of pixels, a bio-sequence is a one-dimensional organization of characters from a finite alphabet (typically nucleic acids or amino acids). We create a mathematical representation of a sequence and subsequently use that representation for comparison. In the following sections we describe the method (section 2) and some experiments toward using it for sequence comparison (section 3). We conclude with a discussion of a few other possibilities for using the correlogram representations of sequences (section 4).

The concept of the correlogram has been used in the field of bioinformatics before. Macchiato et al. [1995] used correlograms to analyze autocorrelation characteristics of active polypeptides. Further, correlograms have been used for analyzing spatial patterns in various experiments, e.g., Bertorelle et al [1995] used correlograms to study DNA diversity. Rosenberg et al [2003] used correlograms in their studies of patterns of transitional mutation biases within and among mammalian genomes. However, the representation has not been used for sequence comparison before.

2. Correlogram and its usage in sequence comparison

2.1. Correlogram of a sequence

Let a sequence be indicated by $s = a_1 a_2 \ldots a_n$, where $|s| = n$ and, for all $i$, $a_i \in \Sigma$. Here $\Sigma$ is the finite alphabet over nucleic acids ({A, T, G, C} for a DNA or {A, U, G, C} for an RNA) or over the twenty amino acids for a protein. Let $|\Sigma| = m$ ($m$ is 4 or 20).

Definition 1: A correlogram for $s$ is a 3-dimensional matrix of size ($m \times m \times d$), where $0 < d < n$ is a predefined integer (typically between 4 and 7). The first two dimensions of the matrix both represent the characters of $\Sigma$, and the third dimension is over the integer index $i$, $0 \le i \le d$.
For $x, y \in \Sigma$ and $0 \le i \le d$, let $\mathrm{Freq}_s(x, y, i)$ be the frequency of occurrence of the pair $(x, y)$ at a distance $i$ on the sequence $s$. Each entry of the correlogram matrix for the sequence $s$, $\mathrm{Corr}_s(x, y, i)$, is the normalized frequency $\mathrm{Freq}_s(x, y, i) / N$, where $N = (n - i)$ is the total number of pairs in the sequence at a distance $i$. The normalization is needed to compensate for the sequence length, so that sequences of different lengths can subsequently be compared. Otherwise, longer sequences would tend to have higher frequencies of the pairs $(x, y)$.

Figure 1: A shell of a correlogram over {A, T, G, C} and d = 3 (two axes over the alphabet, the third axis over the distance i).

Example 1: Let $\Sigma$ = {A, C, G, T} and the string S = AGCTTAGTCT. The plane $\mathrm{Corr}_S(x, y, 1)$ for $i = 1$ is the matrix in Figure 2. Each entry is the corresponding frequency $\mathrm{Freq}_S(x, y, 1)$ divided by 9, the number of pairs at distance 1. Rows give the first character x, columns give the second character y.

         A      T      G      C
    A    0      0     2/9     0
    T   1/9    1/9     0     1/9
    G    0     1/9     0     1/9
    C    0     2/9     0      0

Figure 2: A layer of a sample correlogram for S

Note that the list of distances corresponding to the planes of a correlogram need not be all the integers between 0 and d ($0 \le i \le d$). Rather, it could be a predetermined finite set of integers, each less than n, the length of the sequence. For example, the set of values of i could be {0, 3, 5, 7, 11}. The particular application determines this list.
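Example 1 can be checked mechanically. The following is a minimal sketch, not the authors' code; the function name `pair_plane` and the dictionary-based layout are assumptions made for illustration. It counts ordered pairs at a fixed distance and normalizes by the number of such pairs, using exact fractions so the entries can be compared directly with Figure 2.

```python
from collections import Counter
from fractions import Fraction


def pair_plane(seq, i):
    """Normalized frequencies of ordered pairs (seq[j], seq[j+i]) at distance i.

    Hypothetical helper: returns only the nonzero entries of one plane.
    """
    total = len(seq) - i  # N = n - i pairs exist at distance i
    counts = Counter((seq[j], seq[j + i]) for j in range(total))
    return {pair: Fraction(c, total) for pair, c in counts.items()}


plane = pair_plane("AGCTTAGTCT", 1)
for (x, y), value in sorted(plane.items()):
    print(x, y, value)
```

Pairs that never occur are simply absent from the dictionary, which corresponds to the zero (blank) cells of the plane.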

The correlogram plane for the distance $i = 0$ is simply a normalized histogram, representing the normalized frequencies of occurrence of the characters in the sequence. The corresponding plane (for $i = 0$) is a 2D diagonal matrix.

2.2. Computing Correlogram

The following algorithm computes a correlogram given an input sequence S and a value of d.

Algorithm ComputeCorrelogram (string S, integer d)
// Let S = a_1 a_2 ... a_n, where each a_j is in Σ
(1) for each x, y in Σ do
(2)     for i = 0 through d do
(3)         Corr_S(x, y, i) = 0;
(4) for i = 0 through d do
(5)     for integer j = 1 through (n - i) do
(6)         Corr_S(a_j, a_{j+i}, i) = Corr_S(a_j, a_{j+i}, i) + (1/(n - i));
(7) return Corr_S;
End Algorithm.

The complexity of the initialization in lines 1 through 3 is $O(m^2 d)$, where $|\Sigma| = m$. The complexity of the main computing loops in lines 4 through 6 is $O(nd)$. So the total complexity is $O(\max\{m^2 d, nd\})$. For a large sequence $n \gg m$, and hence the complexity is $O(nd)$. Also, since m is a constant (m = 20 or 4), and so is d, ComputeCorrelogram is a linear algorithm with respect to the sequence length.

2.3. Using Correlograms to Compare Sequences

Once a sequence is transformed into a correlogram, it is possible to measure a distance between the two correlograms corresponding to two sequences.

Definition 2: The distance between two sequences S and T is $l_{ST} = L(\mathrm{Corr}_S, \mathrm{Corr}_T)$, where the function L is one of the standard L-norm distance metrics.

For the $L_0$-norm: $l_{ST} = \sum_{x, y \in \Sigma,\; 0 \le i \le d} |\mathrm{Corr}_S(x, y, i) - \mathrm{Corr}_T(x, y, i)| \,/\, (|S| + |T| + 1)$

For the $L_1$-norm: $l_{ST} = \left( \sum_{x, y \in \Sigma,\; 0 \le i \le d} [\mathrm{Corr}_S(x, y, i) - \mathrm{Corr}_T(x, y, i)]^2 \right) / (|S| + |T| + 1)$

Higher order L-norm distance metrics may be defined accordingly. Since both correlogram matrices are of the same dimension, the computation of any of these L-norm distances is of the order of $O(m^2 d)$. We used the $L_1$-norm for our experiments. The distance measure appears to be a metric, as evidenced in some of our preliminary experiments (not presented here).
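ComputeCorrelogram and the distance of Definition 2 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the `corr[i][x][y]` index order, and the default DNA alphabet are choices made here, and the denominator of the distance is read as the two sequence lengths plus one.

```python
def compute_correlogram(seq, d, alphabet="ACGT"):
    """Return corr[i][x][y] for 0 <= i <= d (index order is an assumption)."""
    n = len(seq)
    if not 0 < d < n:
        raise ValueError("d must satisfy 0 < d < len(seq)")
    index = {ch: k for k, ch in enumerate(alphabet)}
    m = len(alphabet)
    # Initialization, O(m^2 d): one m x m plane per distance i = 0..d.
    corr = [[[0.0] * m for _ in range(m)] for _ in range(d + 1)]
    # Main loops, O(n d): each pair at distance i contributes 1/(n - i).
    for i in range(d + 1):
        weight = 1.0 / (n - i)
        for j in range(n - i):
            corr[i][index[seq[j]]][index[seq[j + i]]] += weight
    return corr


def correlogram_distance(s, t, d, alphabet="ACGT", squared=True):
    """Distance of Definition 2.

    squared=True uses squared differences (the article's L1-norm);
    squared=False uses absolute differences (the article's L0-norm).
    Assumption: the denominator is len(s) + len(t) + 1.
    """
    cs = compute_correlogram(s, d, alphabet)
    ct = compute_correlogram(t, d, alphabet)
    total = 0.0
    for plane_s, plane_t in zip(cs, ct):
        for row_s, row_t in zip(plane_s, plane_t):
            for a, b in zip(row_s, row_t):
                diff = a - b
                total += diff * diff if squared else abs(diff)
    return total / (len(s) + len(t) + 1)


corr = compute_correlogram("AGCTTAGTCT", 3)
print(corr[1][0][2])  # entry for the pair (A, G) at distance 1
print(correlogram_distance("AGCTTAGTCT", "AGCTTAGTCA", 3))
```

Because the two correlograms always have the same shape, the comparison loop touches a fixed number of cells regardless of the sequence lengths; only building the correlograms depends on n.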

3. Experiments

We carried out three sets of experiments to study the effectiveness of the correlogram method in sequence comparison. They are described below.

3.1. Experiments with synthetic data

In this set of experiments we compared our proposed technique with Smith-Waterman's [1981] Dynamic Programming (DP) method over synthetically generated sequences. In some experiments, we start with a target sequence S, deform it systematically to S', and measure the distance (or similarity, with DP) between the two sequences ($l_{SS'}$) using the two methods (correlogram and DP). In other experiments, we start with two arbitrary sequences S1 and S2, deform one of them (S2) systematically to S2', and measure how the distance between them ($l_{S1S2'}$) changes with the deformation. To the best of our knowledge, such experiments with synthetic sequences have not been done before. For lack of space we provide only sample results from this set [for details see the Tech Report, Samant et al, 2005]. The same conclusion holds over all such experiments: the correlogram method is more sensitive to the deformation of a sequence than the DP method.

Figure 3: Correlogram vs. DP scores against character deletion positions (iterations). The plot shows the Correlogram Score and DP Score series, with Scores on the vertical axis and Iterations on the horizontal axis.

Figure 3 shows the result ($l_{S1S2'}$) obtained by deleting a character from the second sequence S2. The position j of the deleted character is systematically varied over $1 \le j \le n$, for $|S2| = n$. As expected, DP is not very sensitive to the position of the character being deleted, whereas the correlogram method shows some fluctuation as the position progresses over the string S2. A cautionary note here is that the absolute values produced by the two methods should not be compared, as the two methods measure different things: distances and similarities. Rather, their relative change with respect to the control parameter (the position of deletion in this experiment) should be compared. Our conclusion is that the correlogram method, overall, is more suitable for comparing sequences when character deletion takes place. This is expected, as we create a richer abstract representation (a correlogram) for each of the sequences before we compare them, in contrast to the DP method.

Figure 4 shows the results from a similar experiment where a sequence is wrapped around systematically (first by one character, then by 2 characters, and so on; the iterations on the X-axis of the figure indicate this number), and the distance is measured between the original and the deformed sequence. For example, wrapping around the string AGCTTAGTCT with i = 2 gives CTAGCTTAGT. The higher sensitivity of the correlogram-based method is evident in Figure 4 as well. Other experiments, done with systematic character addition and with reversal of a sequence, led to the same conclusion as the character-deletion experiment.

Figure 4: Correlogram vs. DP scores against systematic circular permutation of the source sequence. The plot shows the Correlogram Score and DP Score series, with Score on the vertical axis and Iterations on the horizontal axis.

3.2. Experiment with Equine influenza virus

In this experiment we used a set of protein sequences that is important for immunity against the horse influenza virus. The protein sequences are products of the hemagglutinin (HA) gene, which has gone through multiple mutations over ten years (1990 through 2000) as the infected horse moved through different parts of the USA. We constructed the phylogeny tree from the distance values generated by the correlogram method and compared the tree with a standard work on the same data available in the literature [Lai et al, 2004]. Figures 5a and 5b show the two trees over the set of viruses (SA90/AF197243, SU90/X68437, LM92/X85087, HK92/L27597, KY91/L39918, KY92/L39917, KY94/L39914, KY95/AF197247, KY96/AF197248, KY97/AF197249, KY98/AF197241, FL93/L39916, FL94/AF197242, AR93/L39913, AR94/AF197245, AR95/AF197244, AR96/AF197246, NY99/AY273167, OK00/AY273168). The first part of each element in this set is the common identifier of the corresponding influenza-A virus, whereas the second part is the respective accession number in the database (EMBL-EBI, European Bioinformatics Institute). The strings are a few hundred characters long.

Lai et al [2004] used GeneTool version 1.1 to generate the distance matrix and then ran the Phylip software (Neighbor-joining method) from the University of Washington to draw the tree (Figure 5a). We used the correlogram method to generate the distance matrix and used the same program from the Phylip package (evolution.genetics.washington.edu/phylip.html) for the phylogeny construction. The two trees are small enough to be compared manually. The similarity between the two trees supports the usability of the correlogram method in drawing phylogenies. The minor differences between the two trees call for further investigation of their biological significance.

3.3. Experiment with Parvo-virus

The Parvo-virus family resides in the intestines of higher organisms. These viruses are known to cause illness or death in children and are a focus of medical research. We used the RNA sequences of viruses of this family from different organisms and measured the distances between the sequences using our proposed method.
The sequences are from the set (B19 virus, Bovine parvovirus, Canine parvovirus strain B, Feline panleukopenia virus (strain 193), Murine minute virus (strain MVMI), Porcine parvovirus (strain NADL-2), Raccoon parvovirus, Adeno-associated virus 2, Galleria mellonella densovirus) studied in the literature [Chapman et al, 1993]. The strings are around 5000 characters long. We have drawn the phylogeny tree from the generated distance matrix over the family. Again, the striking similarity (Figures 6a and 6b) with Chapman et al's tree supports the strength of the correlogram-based method. The two viruses AAV2 and B19 have low sequence similarity and high structural similarity compared to the other viruses. That they come closer to each other in the correlogram-based tree suggests that our proposed method may have a stronger capability to classify structures. However, this needs further experimentation to verify.
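The experiments of section 3 share one pattern: deform or collect sequences, compute pairwise correlogram distances, and pass the distances on for plotting or tree building. The sketch below illustrates that pattern under stated assumptions; all function names are hypothetical, the correlogram is stored sparsely in a `Counter`, d = 3 is an arbitrary choice, and the distance uses the squared-difference form with the denominator read as the two sequence lengths plus one. It is not the code used for the reported results.

```python
from collections import Counter


def correlogram(seq, d):
    """Sparse correlogram: {(x, y, i): normalized frequency}, 0 <= i <= d."""
    n = len(seq)
    corr = Counter()
    for i in range(d + 1):
        for j in range(n - i):
            corr[(seq[j], seq[j + i], i)] += 1.0 / (n - i)
    return corr


def distance(s, t, d=3):
    """Squared-difference correlogram distance (assumed normalizer)."""
    cs, ct = correlogram(s, d), correlogram(t, d)
    keys = set(cs) | set(ct)  # missing keys read as 0 from a Counter
    return sum((cs[k] - ct[k]) ** 2 for k in keys) / (len(s) + len(t) + 1)


def delete_at(seq, j):
    """Delete the character at 1-based position j (deletion experiment)."""
    return seq[: j - 1] + seq[j:]


def wrap_around(seq, k):
    """Move the last k characters to the front (circular permutation)."""
    k %= len(seq)
    return seq[-k:] + seq[:-k] if k else seq


def distance_matrix(seqs, d=3):
    """Square matrix of pairwise distances, e.g. as input to a tree builder."""
    return [[distance(a, b, d) for b in seqs] for a in seqs]


source = "AGCTTAGTCT"
print(wrap_around(source, 2))
for j in range(1, len(source) + 1):
    print(j, distance(source, delete_at(source, j)))
```

A matrix produced by `distance_matrix` would still need to be written in whatever input format the chosen tree-building program expects; that step is left out here.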

Figure 5: Phylogeny trees of the Horse Influenza HA1: a. Lai et al (2004), b. Correlogram-based

Figure 6: Phylogeny trees of the Parvovirus RNA sequences: a. Chapman et al (1993), b. Correlogram-based

4. Conclusion and pointers

In this research we have proposed a new approach to sequence representation and investigated a few of its capabilities. To fully understand the significance of the representation, many new questions and challenges need to be addressed. Some of them are posed below.

4.1. Information content of a correlogram

Is a correlogram reversible? In other words, can one reconstruct a string given its correlogram? Obviously, when the distance range of the correlogram matrix is maximal, i.e., $d = |s| - 1$ for a string s, no information is lost and the corresponding correlogram should be reversible. A relevant question is: does there exist a smaller list of integers $i < |s|$ such that a correlogram over this list of values of i (refer to the discussion in section 2.1) creates a lossless representation? In that case, the correlogram representation may be used for string encryption.

4.2. Using correlogram for finding patterns

Any sequence comparison method can be utilized for finding a given pattern P over a longer target string T ($|T| > |P|$). However, the cost of the method could be prohibitive for such pattern finding. For example, the DP method has quadratic time complexity for each comparison (P with each subsequence of T of size $|P|$) while scanning the target sequence. We have implemented a modified version of the correlogram method that scans and searches for a pattern over a target sequence in linear time [Samant et al, 2005]. As our experiments indicate that the correlogram method may have potential for finding structurally similar patterns, it may have a significant impact in bioinformatics.
For example, protein docking may be expedited by such a candidate pattern-matching pre-scan.

4.3. Extending correlogram with gap handling capability

When a pair of characters, say AG, appears at a distance i = 7 on a sequence (as A ... G), it may appear at a distance 8 or 6 on a corresponding mutated sequence, where a new character is inserted into or deleted from the original sequence. With the objective of using the correlogram-based representation for comparing such modified sequences, we extended the basic correlogram technique. In this Gapped-correlogram representation of a sequence, we make a weighted distribution of the frequency counts over the respective adjacent cells of the basic correlogram (over the distance i). Thus, e1*Freq(x, y, i) is added to Freq(x, y, i+1) and to Freq(x, y, i-1), and e2*Freq(x, y, i) is added to Freq(x, y, i+2) and to Freq(x, y, i-2), where 0 < e1, e2 < 1 are the weight factors. For example, e1 and e2 may be 0.5 and 0.25 respectively. This type of weighting can be extended to an arbitrary number of adjacent cells, not just to two cell-distances. The weight vector itself may be normalized as a probability density distribution that adds up to 1, e.g., e0 + 2*(e1 + e2) = 1 above, where e0 is the weight factor for Freq(x, y, i) itself, which was e0 = 1 before.
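The weighted redistribution described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the counts are kept unnormalized, the plane for i = 0 (the character histogram) is left untouched, and contributions that would fall outside the range 1..d are dropped. The source does not specify these boundary choices.

```python
def pair_counts(seq, d):
    """freq[i] maps (x, y) -> raw count of the pair at distance i, i = 0..d."""
    freq = []
    for i in range(d + 1):
        plane = {}
        for j in range(len(seq) - i):
            pair = (seq[j], seq[j + i])
            plane[pair] = plane.get(pair, 0) + 1
        freq.append(plane)
    return freq


def gapped_frequencies(freq, weights=(1.0, 0.5, 0.25)):
    """Spread each count over neighbouring distances.

    weights = (e0, e1, e2, ...): e_k scales the contribution to the cells
    k distances away (both i - k and i + k); e0 applies to the cell itself.
    Assumptions: plane 0 is copied unchanged; targets outside 1..d are dropped.
    """
    d = len(freq) - 1
    gapped = [dict(freq[0])] + [dict() for _ in range(d)]
    for i in range(1, d + 1):
        for pair, count in freq[i].items():
            for k, e in enumerate(weights):
                for t in ({i} if k == 0 else {i - k, i + k}):
                    if 1 <= t <= d:
                        gapped[t][pair] = gapped[t].get(pair, 0.0) + e * count
    return gapped


freq = pair_counts("AGCTTAGTCT", 4)
gapped = gapped_frequencies(freq)
for i in range(1, 5):
    print(i, gapped[i].get(("A", "G"), 0.0))
```

Passing a weight tuple that satisfies e0 + 2*(e1 + e2) = 1, such as (0.4, 0.2, 0.1), gives the normalized variant mentioned in the text; the default tuple mirrors the unnormalized example values e0 = 1, e1 = 0.5, e2 = 0.25.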

The Gapped-correlogram has obvious biological appeal in sequence comparison. However, our preliminary experiments with gapped-correlograms over the problems addressed here did not show any significant difference from the results obtained using the basic correlograms [Samant et al, 2005]. We suspect broader experiments may show some impact.

Acknowledgement: Mavis McKenna provided some data and insight for this work.

References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool, Journal of Molecular Biology, 215.

Bertorelle, G., and Barbujani, G. (1995) Analysis of DNA diversity by spatial auto correlation, Genetics, 140(2).

Chapman, M. S., and Rossmann, M. G. (1993) Structure, Sequence and Function Correlations among Parvoviruses, Virology, 194(2).

Huang, J., Mitra, M., Zhu, W.J. and Zabih, R. (1999) Image Indexing using color correlograms, International Journal of Computer Vision, 35(3).

Lai, A. C.K., Rogers, K. M., Glaser, A., Tudor, L., and Chambers, T. (2004) Alternate circulation of recent equine-2 influenza viruses (H3N8) from two distinct lineages in the United States, Virus Res., 100(2).

Macchiato, M. F., Cuomo, V., and Tramontano, A. (1995) Determination of the autocorrelation orders of proteins, Genetics, 140.

Rosenberg, M. S., Subramanian, S., and Kumar S. (2003) Patterns of Transitional Mutation Biases Within and Among Mammalian Genomes, Mol Biol Evol., 20(6).

Samant, G., and Mitra, D. (2005) Correlogram method for Comparing Bio-Sequences, Florida Institute of Technology Technical Report No. CS.

Smith, T.F., and Waterman, M.S. (1981) Identification of common molecular sequences, Journal of Molecular Biology, 147.

BLAST, Profile, and PSI-BLAST

BLAST, Profile, and PSI-BLAST BLAST, Profile, and PSI-BLAST Jianlin Cheng, PhD School of Electrical Engineering and Computer Science University of Central Florida 26 Free for academic use Copyright @ Jianlin Cheng & original sources

More information

Bioinformatics explained: BLAST. March 8, 2007

Bioinformatics explained: BLAST. March 8, 2007 Bioinformatics Explained Bioinformatics explained: BLAST March 8, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com Bioinformatics

More information

Data Mining Technologies for Bioinformatics Sequences

Data Mining Technologies for Bioinformatics Sequences Data Mining Technologies for Bioinformatics Sequences Deepak Garg Computer Science and Engineering Department Thapar Institute of Engineering & Tecnology, Patiala Abstract Main tool used for sequence alignment

More information

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading: 24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, 2010 3 BLAST and FASTA This lecture is based on the following papers, which are all recommended reading: D.J. Lipman and W.R. Pearson, Rapid

More information

OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT

OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT Asif Ali Khan*, Laiq Hassan*, Salim Ullah* ABSTRACT: In bioinformatics, sequence alignment is a common and insistent task. Biologists align

More information

Acceleration of Algorithm of Smith-Waterman Using Recursive Variable Expansion.

Acceleration of Algorithm of Smith-Waterman Using Recursive Variable Expansion. www.ijarcet.org 54 Acceleration of Algorithm of Smith-Waterman Using Recursive Variable Expansion. Hassan Kehinde Bello and Kazeem Alagbe Gbolagade Abstract Biological sequence alignment is becoming popular

More information

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio. 1990. CS 466 Saurabh Sinha Motivation Sequence homology to a known protein suggest function of newly sequenced protein Bioinformatics

More information

The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval. Kevin C. O'Kane. Department of Computer Science

The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval. Kevin C. O'Kane. Department of Computer Science The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval Kevin C. O'Kane Department of Computer Science The University of Northern Iowa Cedar Falls, Iowa okane@cs.uni.edu http://www.cs.uni.edu/~okane

More information

Bioinformatics explained: Smith-Waterman

Bioinformatics explained: Smith-Waterman Bioinformatics Explained Bioinformatics explained: Smith-Waterman May 1, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com

More information

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1 Database Searching In database search, we typically have a large sequence database

More information

Jyoti Lakhani 1, Ajay Khunteta 2, Dharmesh Harwani *3 1 Poornima University, Jaipur & Maharaja Ganga Singh University, Bikaner, Rajasthan, India

Jyoti Lakhani 1, Ajay Khunteta 2, Dharmesh Harwani *3 1 Poornima University, Jaipur & Maharaja Ganga Singh University, Bikaner, Rajasthan, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 6 ISSN : 2456-3307 Improvisation of Global Pairwise Sequence Alignment

More information

Computational Molecular Biology

Computational Molecular Biology Computational Molecular Biology Erwin M. Bakker Lecture 3, mainly from material by R. Shamir [2] and H.J. Hoogeboom [4]. 1 Pairwise Sequence Alignment Biological Motivation Algorithmic Aspect Recursive

More information

Notes on Dynamic-Programming Sequence Alignment

Notes on Dynamic-Programming Sequence Alignment Notes on Dynamic-Programming Sequence Alignment Introduction. Following its introduction by Needleman and Wunsch (1970), dynamic programming has become the method of choice for rigorous alignment of DNA

More information

An I/O device driver for bioinformatics tools: the case for BLAST

An I/O device driver for bioinformatics tools: the case for BLAST An I/O device driver for bioinformatics tools 563 An I/O device driver for bioinformatics tools: the case for BLAST Renato Campos Mauro and Sérgio Lifschitz Departamento de Informática PUC-RIO, Pontifícia

More information

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm.

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm. FASTA INTRODUCTION Definition (by David J. Lipman and William R. Pearson in 1985) - Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence

More information

ON HEURISTIC METHODS IN NEXT-GENERATION SEQUENCING DATA ANALYSIS

ON HEURISTIC METHODS IN NEXT-GENERATION SEQUENCING DATA ANALYSIS ON HEURISTIC METHODS IN NEXT-GENERATION SEQUENCING DATA ANALYSIS Ivan Vogel Doctoral Degree Programme (1), FIT BUT E-mail: xvogel01@stud.fit.vutbr.cz Supervised by: Jaroslav Zendulka E-mail: zendulka@fit.vutbr.cz

More information

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST Alexander Chan 5075504 Biochemistry 218 Final Project An Analysis of Pairwise

More information

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT IADIS International Conference Applied Computing 2006 USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT Divya R. Singh Software Engineer Microsoft Corporation, Redmond, WA 98052, USA Abdullah

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 04: Variations of sequence alignments http://www.pitt.edu/~mcs2/teaching/biocomp/tutorials/global.html Slides adapted from Dr. Shaojie Zhang (University

More information

Biological Sequence Matching Using Fuzzy Logic

Biological Sequence Matching Using Fuzzy Logic International Journal of Scientific & Engineering Research Volume 2, Issue 7, July-2011 1 Biological Sequence Matching Using Fuzzy Logic Nivit Gill, Shailendra Singh Abstract: Sequence alignment is the

More information

Biologically significant sequence alignments using Boltzmann probabilities

Biologically significant sequence alignments using Boltzmann probabilities Biologically significant sequence alignments using Boltzmann probabilities P Clote Department of Biology, Boston College Gasson Hall 16, Chestnut Hill MA 0267 clote@bcedu Abstract In this paper, we give

More information

Research on Pairwise Sequence Alignment Needleman-Wunsch Algorithm

Research on Pairwise Sequence Alignment Needleman-Wunsch Algorithm 5th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering (ICMMCCE 2017) Research on Pairwise Sequence Alignment Needleman-Wunsch Algorithm Xiantao Jiang1, a,*,xueliang

More information

Heuristic methods for pairwise alignment:

Heuristic methods for pairwise alignment: Bi03c_1 Unit 03c: Heuristic methods for pairwise alignment: k-tuple-methods k-tuple-methods for alignment of pairs of sequences Bi03c_2 dynamic programming is too slow for large databases Use heuristic

More information

Today s Lecture. Multiple sequence alignment. Improved scoring of pairwise alignments. Affine gap penalties Profiles

Today s Lecture. Multiple sequence alignment. Improved scoring of pairwise alignments. Affine gap penalties Profiles Today s Lecture Multiple sequence alignment Improved scoring of pairwise alignments Affine gap penalties Profiles 1 The Edit Graph for a Pair of Sequences G A C G T T G A A T G A C C C A C A T G A C G

More information

Highly Scalable and Accurate Seeds for Subsequence Alignment

Highly Scalable and Accurate Seeds for Subsequence Alignment Highly Scalable and Accurate Seeds for Subsequence Alignment Abhijit Pol Tamer Kahveci Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA, 32611

More information

) I R L Press Limited, Oxford, England. The protein identification resource (PIR)

) I R L Press Limited, Oxford, England. The protein identification resource (PIR) Volume 14 Number 1 Volume 1986 Nucleic Acids Research 14 Number 1986 Nucleic Acids Research The protein identification resource (PIR) David G.George, Winona C.Barker and Lois T.Hunt National Biomedical

More information

Bioinformatics for Biologists

Bioinformatics for Biologists Bioinformatics for Biologists Sequence Analysis: Part I. Pairwise alignment and database searching Fran Lewitter, Ph.D. Director Bioinformatics & Research Computing Whitehead Institute Topics to Cover

More information

Comparative Analysis of Protein Alignment Algorithms in Parallel environment using CUDA

Comparative Analysis of Protein Alignment Algorithms in Parallel environment using CUDA Comparative Analysis of Protein Alignment Algorithms in Parallel environment using BLAST versus Smith-Waterman Shadman Fahim shadmanbracu09@gmail.com Shehabul Hossain rudrozzal@gmail.com Gulshan Jubaed

More information

A Design of a Hybrid System for DNA Sequence Alignment

A Design of a Hybrid System for DNA Sequence Alignment IMECS 2008, 9-2 March, 2008, Hong Kong A Design of a Hybrid System for DNA Sequence Alignment Heba Khaled, Hossam M. Faheem, Tayseer Hasan, Saeed Ghoneimy Abstract This paper describes a parallel algorithm

More information

Bioinformatics. Sequence alignment BLAST Significance. Next time Protein Structure

Bioinformatics. Sequence alignment BLAST Significance. Next time Protein Structure Bioinformatics Sequence alignment BLAST Significance Next time Protein Structure 1 Experimental origins of sequence data The Sanger dideoxynucleotide method F Each color is one lane of an electrophoresis

More information
