Notes on Dynamic-Programming Sequence Alignment

Introduction. Following its introduction by Needleman and Wunsch (1970), dynamic programming has become the method of choice for rigorous alignment of DNA and protein sequences. For a number of useful alignment-scoring schemes, this method is guaranteed to produce an alignment of two given sequences having the highest possible score. For alignment scores that are popular with molecular biologists, dynamic-programming alignment of two sequences requires quadratic time, i.e., time proportional to the product of the two sequence lengths. In particular, this holds for affine gap costs, that is, scoring schemes under which a gap of length $k$ is penalized $g + ek$, where $g$ is a fixed gap-opening penalty and $e$ is a gap-extension penalty (Gotoh, 1982). (More general alignment scores, which are more expensive to optimize, were considered by Waterman et al., 1976, but have not found widespread use.) Quadratic time is necessitated by the inspection of every pair $(i, j)$, where $i$ is a position in the first sequence and $j$ is a position in the second sequence. For many applications, e.g., database searches, such an exhaustive examination of position pairs may not be worth the effort, and a number of faster methods have been proposed, such as BLAST.

The Dynamic-Programming Alignment Algorithm. It is quite helpful to recast the problem of aligning two sequences as an equivalent problem of finding a maximum-score path in a certain graph, as has been observed by a number of authors, including Myers and Miller (1989). This alternative formulation allows the problem to be visualized in a way that permits the use of geometric intuition. We find this visual imagery critical for keeping track of the low-level details that arise in development and implementation of dynamic-programming alignment algorithms.
An alignment of two sequences, say S and T, is a rectangular array of symbols having two rows, such that removing all dash characters from the first row (if any are there) gives S, and removing dashes from the second row gives T. Also, we do not allow columns containing two dash symbols. For instance, AAGAA-A A-GAA is an alignment of AAGAAA and AGAA.

For the current discussion, we assume the following simple alignment-scoring scheme. For each possible aligned pair $\begin{bmatrix} x \\ y \end{bmatrix}$, where each of $x$ and $y$ is either a normal sequence entry or the symbol $-$, there is an assigned score $\sigma\left(\begin{bmatrix} x \\ y \end{bmatrix}\right)$. The score of a pairwise alignment is defined to be the sum of the $\sigma$-values of its aligned pairs (i.e., columns). For instance, if we score each match (i.e., column of identical symbols) $1$, and each other column $-1$, then the above alignment scores $1 - 1 + 1 + 1 - 1 + 1 - 1 + 1 = 2$.

Recall that a directed graph $G = (V, E)$ consists of a set $V$ of nodes (also called vertices) and a set $E$ of edges. The edge from node $u$ to node $v$, if it exists, is denoted $u \to v$. A sequence of consecutive edges $u_1 \to u_2$, $u_2 \to u_3$, ..., $u_{k-1} \to u_k$ is a path from $u_1$ to $u_k$. If each edge $u \to v$ is assigned a score $\sigma(u \to v)$, then the score of such a path is $\sum_{i=1}^{k-1} \sigma(u_i \to u_{i+1})$.

We now describe the relationship between maximum-score paths and optimal alignments. Consider two sequences, $A = a_1 a_2 \ldots a_M$ and $B = b_1 b_2 \ldots b_N$. That is, $A$ contains $M$ symbols and $B$ contains $N$ symbols, where the symbols are from an arbitrary alphabet that does not contain the dash symbol, $-$. The alignment graph for $A$ and $B$, denoted $G_{A,B}$, is an edge-labeled directed graph. The nodes of $G_{A,B}$ are the pairs $(i, j)$ where $i \in [0, M]$ and $j \in [0, N]$. (We use the notation $[p, q]$ for the set $\{p, p+1, \ldots, q-1, q\}$.) When graphed, these nodes are arrayed in $M + 1$ rows (row $i$ corresponds to $a_i$ for $i \in [1, M]$, with an additional row 0) and $N + 1$ columns (column $j$ corresponds to $b_j$ for $j \in [1, N]$, with an additional column 0). The edge set for $G_{A,B}$ consists of the following edges, labeled as indicated.

1. $(i-1, j) \to (i, j)$ for $i \in [1, M]$ and $j \in [0, N]$, labeled $\begin{bmatrix} a_i \\ - \end{bmatrix}$
2. $(i, j-1) \to (i, j)$ for $i \in [0, M]$ and $j \in [1, N]$, labeled $\begin{bmatrix} - \\ b_j \end{bmatrix}$
3. $(i-1, j-1) \to (i, j)$ for $i \in [1, M]$ and $j \in [1, N]$, labeled $\begin{bmatrix} a_i \\ b_j \end{bmatrix}$

Fig. 1 provides an example of the construction.

FIG. 1. Alignment graph $G_{A,B}$ for a two-symbol sequence $A$ and a three-symbol sequence $B$.

It is instructive to look for a path from $(0, 0)$ (the upper left corner of the graph of Fig. 1) to $(2, 3)$ (the lower right) such that the labels along the path spell the alignment whose first row is a dash followed by $a_1 a_2$ and whose second row is $b_1 b_2 b_3$. The first aligned pair is $\begin{bmatrix} - \\ b_1 \end{bmatrix}$, so the first edge must be horizontal. The second pair is $\begin{bmatrix} a_1 \\ b_2 \end{bmatrix}$, so the second edge must be diagonal. The third pair is $\begin{bmatrix} a_2 \\ b_3 \end{bmatrix}$, so the third edge must be diagonal.

Generally, when a path descends from row $i - 1$ to row $i$, it picks up an aligned pair with top entry $a_i$. A path from $(0, 0)$ to $(M, N)$ has zero or more horizontal edges, then a vertical or diagonal edge to row 1, then zero or more horizontal edges, then an edge to row 2, and so on, so the top entries of the labels along the path are $a_1, a_2, \ldots, a_M$, possibly with some interspersed dashes. Similarly, the bottom entries spell $B$ if dashes are ignored, so the aligned pairs spell an alignment of $A$ and $B$. Indeed, alignments are in general equivalent to paths, as we now state more precisely.

Fact: Let $G_{A,B}$ be the alignment graph for sequences $A$ and $B$. With each path from $(0, 0)$ to $(M, N)$ associate the alignment formed by concatenating the edge labels along the path, i.e., the alignment spelled by the path. Then every such path determines an alignment of $A$ and $B$, and every alignment of $A$ and $B$ is determined by a unique path.
In other words, there is a one-to-one correspondence between paths in $G_{A,B}$ from $(0, 0)$ to $(M, N)$ and alignments of $A$ and $B$. Furthermore, if the score $\sigma(\pi)$ is assigned to each edge of $G_{A,B}$, where $\pi$ is the aligned pair labeling that edge, then a path's score is exactly the score of the corresponding alignment.
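The correspondence between paths and alignments can be illustrated with a minimal Python sketch. The function names (`sigma`, `path_to_alignment`, `alignment_score`) and the sequences `TC` and `CTC` are hypothetical, chosen only for illustration; the scoring scheme is the match = 1, other = -1 scheme used in these notes.

```python
def sigma(x, y, match=1, other=-1):
    """Score of the aligned pair x-over-y: `match` if identical, else `other`."""
    return match if x == y else other


def path_to_alignment(a, b, path):
    """Spell the alignment determined by a path of nodes (i, j) from (0, 0) to (M, N)."""
    top, bottom = [], []
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        step = (i1 - i0, j1 - j0)
        if step == (1, 0):        # vertical edge: a_i over dash
            top.append(a[i1 - 1])
            bottom.append("-")
        elif step == (0, 1):      # horizontal edge: dash over b_j
            top.append("-")
            bottom.append(b[j1 - 1])
        elif step == (1, 1):      # diagonal edge: a_i over b_j
            top.append(a[i1 - 1])
            bottom.append(b[j1 - 1])
        else:
            raise ValueError(f"not an edge of the alignment graph: {step}")
    return "".join(top), "".join(bottom)


def alignment_score(top, bottom):
    """Sum of sigma over the columns of an alignment given as two equal-length rows."""
    return sum(sigma(x, y) for x, y in zip(top, bottom))


# Hypothetical example with M = 2 and N = 3: one horizontal edge, then two diagonals.
rows = path_to_alignment("TC", "CTC", [(0, 0), (0, 1), (1, 2), (2, 3)])
print(rows, alignment_score(*rows))
```

Because each edge contributes exactly one column, summing `sigma` over the columns gives the same value as summing edge scores along the path.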

At each node, the score is computed from the scores of immediate predecessors and of entering edges, which are pictured in Fig. 2. The procedure of Fig. 3 computes the maximum alignment score by considering rows of $G_{A,B}$ in order, sweeping left to right within each row. $S[i, j]$ denotes the maximum score of a path from $(0, 0)$ to $(i, j)$. Lines 7-10 mirror Fig. 2. In row 0 there is but a single edge entering a node (lines 2-3), and similarly for column 0 (line 5). This is a quadratic-space procedure since it uses the $(M+1)$-by-$(N+1)$ array $S$ to hold all node scores.

FIG. 2. Edges entering node $(i, j)$ and their scores: the vertical edge from $(i-1, j)$ scores $\sigma\left(\begin{bmatrix} a_i \\ - \end{bmatrix}\right)$, the diagonal edge from $(i-1, j-1)$ scores $\sigma\left(\begin{bmatrix} a_i \\ b_j \end{bmatrix}\right)$, and the horizontal edge from $(i, j-1)$ scores $\sigma\left(\begin{bmatrix} - \\ b_j \end{bmatrix}\right)$.

    1.  $S[0, 0] \leftarrow 0$
    2.  for $j \leftarrow 1$ to $N$ do
    3.      $S[0, j] \leftarrow S[0, j-1] + \sigma\left(\begin{bmatrix} - \\ b_j \end{bmatrix}\right)$
    4.  for $i \leftarrow 1$ to $M$ do
    5.      $S[i, 0] \leftarrow S[i-1, 0] + \sigma\left(\begin{bmatrix} a_i \\ - \end{bmatrix}\right)$
    6.      for $j \leftarrow 1$ to $N$ do
    7.          Vertical $\leftarrow S[i-1, j] + \sigma\left(\begin{bmatrix} a_i \\ - \end{bmatrix}\right)$
    8.          Diagonal $\leftarrow S[i-1, j-1] + \sigma\left(\begin{bmatrix} a_i \\ b_j \end{bmatrix}\right)$
    9.          Horizontal $\leftarrow S[i, j-1] + \sigma\left(\begin{bmatrix} - \\ b_j \end{bmatrix}\right)$
    10.         $S[i, j] \leftarrow \max\{$Vertical, Diagonal, Horizontal$\}$
    11. write "Maximum alignment score is" $S[M, N]$

FIG. 3. Score-only alignment algorithm.

An Example. Consider sequences AGG of length M = 3 and AG of length N = 4. The algorithm systematically fills in entries of a 4-by-5 matrix, where the entry at the intersection of row $i$ and column $j$ is the highest score of any alignment between the first $i$ entries of the first sequence and the first $j$ entries of the second sequence. For this example, suppose that a match scores $1$ and any other column scores $-1$. Consider computation of the 2,2-entry, given:

                A       G
            0  -1  -2  -3  -4
        A  -1   1   0  -1  -2
        G  -2   0   ?

What is the best of the three ways to fill in the entry? Moving from the entry 0 to the left of the new position, or down from the 0 just above, we would add $-1$ (horizontal or vertical moves always pay the cost for a gap, which is $-1$ in this example). Coming from the 1 just above and left, we would add the score for aligning the G of the first sequence to the second symbol of the second sequence, a mismatch scoring $-1$, getting 0, so this is the best move and fills in a 0. Now, what about the next entry?

                A       G
            0  -1  -2  -3  -4
        A  -1   1   0  -1  -2
        G  -2   0   0   ?

Again the unique best move is down the diagonal, which gives the score 1. Filling in the entire matrix this way gives:

                A       G
            0  -1  -2  -3  -4
        A  -1   1   0  -1  -2
        G  -2   0   0   1   0
        G  -3  -1  -1   1   0

Thus the best score for aligning these two sequences is 0. Before reading further, find two alignments that attain this score.

Determining the Alignment. Given the filled-in score array, an optimal alignment can be constructed (in reversed order of its columns) by a traceback operation, which walks backwards along an optimal path from the lower right corner to the upper left corner. The basic operation is to identify which of the three possibilities was chosen to reach the current position, which (1) gives the nature of the corresponding alignment column (horizontal move means insertion, diagonal move means substitution, and vertical move means deletion) and (2) tells the previous position on an optimal path. For instance, the lower-right corner in the above score array can be attained with either a horizontal or a diagonal move. Thus, the last column of an optimal alignment can be either the insertion of a dash over the last symbol of the second sequence or the substitution of G over that symbol.

Local Alignment. In many applications, a global (i.e., end-to-end) alignment of the two given sequences is inappropriate; instead, a local alignment (i.e., involving only a part of each sequence) is desired. In other words, one seeks a high-scoring path that need not terminate at the corners of the dynamic-programming grid (Smith and Waterman, 1981).
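Returning to the global case, the score-only procedure of Fig. 3 and the traceback described under Determining the Alignment can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the names `sigma`, `fill_scores`, and `traceback` are hypothetical, the preference order among tied moves (diagonal, then vertical, then horizontal) is an arbitrary choice, and the four-symbol second sequence `ACGT` is an assumption chosen to agree with the score array above.

```python
def sigma(x, y, match=1, other=-1):
    """Score of the aligned pair x-over-y: `match` if identical, else `other`."""
    return match if x == y else other


def fill_scores(a, b):
    """Quadratic-space score array: S[i][j] is the best score of a path (0,0) -> (i,j)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):                       # row 0: horizontal edges only
        S[0][j] = S[0][j - 1] + sigma("-", b[j - 1])
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + sigma(a[i - 1], "-")  # column 0: vertical edges only
        for j in range(1, n + 1):
            vertical = S[i - 1][j] + sigma(a[i - 1], "-")
            diagonal = S[i - 1][j - 1] + sigma(a[i - 1], b[j - 1])
            horizontal = S[i][j - 1] + sigma("-", b[j - 1])
            S[i][j] = max(vertical, diagonal, horizontal)
    return S


def traceback(a, b, S):
    """Walk back from (M, N) to (0, 0), collecting alignment columns in reverse order."""
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sigma(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1])   # substitution (diagonal)
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + sigma(a[i - 1], "-"):
            top.append(a[i - 1]); bottom.append("-")        # deletion (vertical)
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])        # insertion (horizontal)
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


# Assumed example: first sequence AGG, second sequence ACGT (hypothetical).
S = fill_scores("AGG", "ACGT")
print(S[-1][-1])                     # maximum alignment score
print(*traceback("AGG", "ACGT", S), sep="\n")
```

Changing the order of the tests in `traceback` selects a different optimal alignment whenever several moves attain the same score, as happens at the lower-right corner of the example array.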
The highest local alignment score can be computed as follows:

$$
S[i, j] \leftarrow \max
\begin{cases}
0 & \text{if } 0 \le i \le M \text{ and } 0 \le j \le N \\
S[i-1, j] + \sigma\left(\begin{bmatrix} a_i \\ - \end{bmatrix}\right) & \text{if } 1 \le i \le M \text{ and } 0 \le j \le N \\
S[i-1, j-1] + \sigma\left(\begin{bmatrix} a_i \\ b_j \end{bmatrix}\right) & \text{if } 1 \le i \le M \text{ and } 1 \le j \le N \\
S[i, j-1] + \sigma\left(\begin{bmatrix} - \\ b_j \end{bmatrix}\right) & \text{if } 0 \le i \le M \text{ and } 1 \le j \le N
\end{cases}
$$

The highest local alignment score is then the largest value $S[i, j]$ over all nodes $(i, j)$.

Further complications arise when one seeks $k$ best alignments, where $k > 1$. For computing an arbitrary number of non-intersecting and high-scoring local alignments, Waterman and Eggert (1987) developed a very time-efficient method.

REFERENCES

Gotoh, O. (1982) An improved algorithm for matching biological sequences. J. Mol. Biol. 162, 705-708.

Needleman, S. B. and C. D. Wunsch (1970) A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol. 48, 443-453.

Smith, T. F. and M. S. Waterman (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.

Waterman, M. S., T. F. Smith and W. A. Beyer (1976) Some biological sequence metrics. Adv. Math. 20, 367-387.

Waterman, M. S. and M. Eggert (1987) A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons. J. Mol. Biol. 197, 723-728.
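As a supplement to the Local Alignment section, the local-alignment recurrence can be sketched in Python. This is a minimal, score-only sketch under the same match = 1, other = -1 scheme; the names `sigma` and `local_alignment_score` and the test sequences are hypothetical and not taken from the notes.

```python
def sigma(x, y, match=1, other=-1):
    """Score of the aligned pair x-over-y: `match` if identical, else `other`."""
    return match if x == y else other


def local_alignment_score(a, b):
    """Highest local alignment score: the largest entry of the zero-floored array S."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(m + 1):
        for j in range(n + 1):
            candidates = [0]                      # a local path may start at any node
            if i >= 1:                            # vertical edge into (i, j)
                candidates.append(S[i - 1][j] + sigma(a[i - 1], "-"))
            if i >= 1 and j >= 1:                 # diagonal edge into (i, j)
                candidates.append(S[i - 1][j - 1] + sigma(a[i - 1], b[j - 1]))
            if j >= 1:                            # horizontal edge into (i, j)
                candidates.append(S[i][j - 1] + sigma("-", b[j - 1]))
            S[i][j] = max(candidates)
            best = max(best, S[i][j])             # a local path may end at any node
    return best


# Hypothetical sequences sharing the segment AGG, flanked by non-matching symbols.
print(local_alignment_score("XXAGGXX", "YYAGGYY"))
```

The zero candidate lets a path start anywhere, and tracking the maximum over all entries lets it end anywhere, which is what distinguishes this computation from the global procedure of Fig. 3.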