Sequence Alignments

Overview Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it. Sequence alignment means arranging two sequences so that regions of similarity line up. There are several ways that alignments can be reported and there is no simple, universal format that can present all the information encoded in an alignment.

Displaying alignments We use a visual display that uses various extra characters to help us interpret the lineup. For example, one character (commonly "-") may indicate a gap, another (commonly "|") is used to display a match, and the "." character may be used to display a mismatch. Usually, we read and interpret the alignment as if we were comparing the bottom sequence against the top one. In the case above, we could say that it is an alignment with deletions. The second sequence has missing bases relative to the first.

Displaying alignments We could display this same alignment the other way around, in which case the bottom sequence would have insertions relative to the top one.
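As a minimal sketch of such a display, the helper below builds the marker line that sits between two already-aligned rows. The function name `match_line`, the marker characters ("|" for a match, "." for a mismatch, a blank under a gap) and the example sequences are assumptions made for illustration; real tools differ in the characters they use.

```python
def match_line(top, bottom, gap="-"):
    """Build the marker line shown between two aligned rows.

    Both rows must already contain gap characters and have equal length.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    markers = []
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            markers.append(" ")   # nothing to compare in a gap column
        elif a == b:
            markers.append("|")   # identical residues
        else:
            markers.append(".")   # mismatch
    return "".join(markers)


# Hypothetical example: the bottom row has one deletion and one mismatch.
top_row = "ACGTAC"
bottom_row = "AC-TTC"
print(top_row)
print(match_line(top_row, bottom_row))
print(bottom_row)
```

Swapping the two rows produces the same marker line, which is why the same alignment can be read either as a deletion in one sequence or as an insertion in the other.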

PAIRWISE ALIGNMENT

Overview The most basic sequence analysis task is to ask if two sequences are related. This is done by first aligning the sequences and then deciding whether that alignment is more likely to have occurred because the sequences are related or just by chance. The key issues are: 1) what sorts of alignment should be considered; 2) the scoring system used to rank alignments; 3) the algorithm used to find optimal scoring alignments; 4) the statistical methods used to evaluate the significance of an alignment score.

The scoring model When we compare sequences, we are looking for evidence that they have diverged from a common ancestor by a process of mutation and selection. The basic mutational processes that are considered are substitutions, which change residues in a sequence, and insertions and deletions, which add or remove residues. Insertions and deletions are together referred to as gaps.

The scoring model The total score we assign to an alignment is a sum of terms for each aligned pair of residues, plus terms for each gap. Probabilistic interpretation: the score is the logarithm of the relative likelihood that the sequences are related, compared to being unrelated: $\sum_i \log \frac{p_{a_i b_i}}{q_{a_i} q_{b_i}}$

The scoring model We expect identities and conservative substitutions to be more likely in alignments than we expect by chance, and so to contribute positive score terms. Non-conservative changes are expected to be observed less frequently in real alignments than we expect by chance, and so these contribute negative score terms.

The scoring model Using an additive scoring scheme corresponds to an assumption that we can consider mutations at different sites in a sequence to have occurred independently. All the algorithms for finding optimal alignments depend on such a scoring scheme. The assumption of independence appears to be a reasonable approximation for DNA sequences. However, it is inaccurate for protein sequences and structural RNAs.

Substitution matrices We need score terms for each aligned residue pair. I will derive substitution scores from a probabilistic model. Some notation: consider a pair of sequences, $x$ and $y$, of length $n$. Let $x_i$ be the $i$th symbol in $x$ and $y_j$ be the $j$th symbol of $y$. These symbols come from the four bases $\{A, G, C, T\}$ in the case of DNA. We denote symbols by lower-case letters like $a$ and $b$.

Substitution matrices Given a pair of aligned sequences, we want to assign a score to the alignment that gives a measure of the relative likelihood that the sequences are related as opposed to being unrelated. The random model $R$: $P(x, y \mid R) = \prod_i q_{x_i} q_{y_i}$, which assumes that letter $a$ occurs independently with some frequency $q_a$. The match model $M$: $P(x, y \mid M) = \prod_i p_{x_i y_i}$, where $p_{ab}$ is the probability that the residues $a$ and $b$ have each independently been derived from some unknown original residue in their common ancestor.

Substitution matrices The ratio of these two likelihoods is known as the odds ratio: $\frac{P(x, y \mid M)}{P(x, y \mid R)} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$. The log-odds ratio: $S = \sum_i s(x_i, y_i)$, where $s(a, b) = \log \frac{p_{ab}}{q_a q_b}$.
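The log-odds score $s(a, b) = \log \frac{p_{ab}}{q_a q_b}$ can be sketched in a few lines. The probabilities below are hypothetical values chosen only so that they sum to 1; they are not taken from any published matrix, and the function names are illustrative.

```python
import math


def log_odds_matrix(p, q):
    """Return s(a, b) = log(p_ab / (q_a * q_b)) for every pair in p."""
    return {(a, b): math.log(p[(a, b)] / (q[a] * q[b])) for (a, b) in p}


def alignment_log_odds(x, y, s):
    """Sum s(x_i, y_i) over the columns of an ungapped alignment."""
    if len(x) != len(y):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(s[(a, b)] for a, b in zip(x, y))


# Hypothetical model: uniform background, identities favoured in the match model.
bases = "ACGT"
q = {a: 0.25 for a in bases}
p = {(a, b): (0.22 if a == b else 0.01) for a in bases for b in bases}

s = log_odds_matrix(p, q)
print(s[("A", "A")], s[("A", "C")])
print(alignment_log_odds("ACGT", "ACGA", s))
```

With these numbers identities get a positive score and substitutions a negative one, because identities are more frequent under the match model than under the random model.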

Substitution matrices $s(a, b)$ is known as a score matrix or a substitution matrix. An example of a substitution matrix for DNA is EDNAFULL (or NUC4.4).

Gap penalties The standard cost associated with a gap of length $g$ is given either by a linear score, $\gamma(g) = -gd$, or by an affine score, $\gamma(g) = -d - (g - 1)e$, where $d$ is the gap-open penalty and $e$ is the gap-extension penalty. Usually $d > e$, allowing long insertions and deletions to be penalized less than they would be by the linear gap cost.
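The two gap costs, $\gamma(g) = -gd$ and $\gamma(g) = -d - (g - 1)e$, can be compared directly. This is a small sketch; the penalty values d = 8 and e = 2 are hypothetical and only illustrate the effect of choosing e smaller than d.

```python
def linear_gap(g, d):
    """Linear gap score: every gap position costs d."""
    return -g * d


def affine_gap(g, d, e):
    """Affine gap score: d to open the gap, e for each further position."""
    return -d - (g - 1) * e


# Hypothetical penalties: opening is expensive, extending is cheap.
d, e = 8, 2
for g in (1, 2, 5):
    print(g, linear_gap(g, d), affine_gap(g, d, e))
```

For a gap of length 1 the two schemes agree; as the gap grows, the affine score falls more slowly than the linear one.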

Alignment algorithms: Overview Given a scoring system, we need to have an algorithm for finding an optimal alignment for a pair of sequences. The algorithm for finding optimal alignment given an additive alignment score is called dynamic programming. Dynamic programming algorithms are guaranteed to find the optimal scoring alignment or set of alignments. In most cases heuristic methods have also been developed to perform the same type of search. These can be very fast, but they make additional assumptions and will miss the best match for some sequence pairs.

Alignment algorithms: Overview We want to maximize the score (represented by log-odds ratios) to find the optimal alignment.

Global alignment The first problem is that of obtaining the optimal global alignment between two sequences, allowing gaps. The dynamic programming algorithm for solving this problem is known as the Needleman-Wunsch algorithm. The idea is to build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences.

Dynamic programming To set about developing an algorithm based on dynamic programming, one needs a collection of subproblems derived from the original problem that satisfies a few basic properties: 1) There are only a polynomial number of subproblems. 2) The solution to the original problem can be easily computed from the solutions to the subproblems. 3) There is a natural ordering on subproblems from smallest to largest, together with an easy-to-compute recurrence that allows one to determine the solution to a subproblem from the solutions to some number of smaller subproblems.

Alignment Suppose we are given two sequences $x$ and $y$, where $x$ consists of the sequence of symbols $x_1 x_2 \cdots x_m$ and $y$ consists of the sequence of symbols $y_1 y_2 \cdots y_n$. Consider the sets $\{1, 2, \ldots, m\}$ and $\{1, 2, \ldots, n\}$ as representing the different positions in the sequences $x$ and $y$, and consider a matching of these sets. A matching is a set of ordered pairs with the property that each item occurs in at most one pair. A matching $M$ of these two sets is an alignment if there are no crossing pairs: if $(i, j), (i', j') \in M$ and $i < i'$, then $j < j'$.

Alignment An alignment gives a way of lining up the two sequences, by telling us which pairs of positions will be lined up with one another. For example, the lineup of stop- over -tops corresponds to the alignment $\{(2, 1), (3, 2), (4, 3)\}$.
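The definition can be checked mechanically. The sketch below tests whether a set of position pairs is a matching with no crossing pairs; the function name `is_alignment` is an assumption, and positions are 1-based as in the text.

```python
def is_alignment(pairs):
    """Return True if pairs is a matching with no crossing pairs."""
    xs = [i for i, _ in pairs]
    ys = [j for _, j in pairs]
    # Matching: each position of x and of y occurs in at most one pair.
    if len(set(xs)) != len(xs) or len(set(ys)) != len(ys):
        return False
    # No crossings: sorted by i, the j values must be strictly increasing.
    ordered = sorted(pairs)
    return all(j1 < j2 for (_, j1), (_, j2) in zip(ordered, ordered[1:]))


print(is_alignment({(2, 1), (3, 2), (4, 3)}))  # the stop/tops example
print(is_alignment({(1, 2), (2, 1)}))          # these two pairs cross
```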

Optimal alignment Suppose $M$ is a given alignment between $x$ and $y$. First, there is a parameter $d$ that defines a gap penalty. For each position of $x$ and $y$ that is not matched in $M$, we incur a cost of $d$. Second, for each pair of distinct letters $a, b$ in our alphabet, there is a mismatch score of $s(a, b) < 0$ for lining up $a$ with $b$. Thus, for each $(i, j) \in M$, we add the appropriate score $s(x_i, y_j)$. One generally assumes that $s(a, a) > 0$ for each letter $a$. The score of $M$ is the sum of its gap penalties, mismatch scores, and match scores. We seek an alignment of maximum score.
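As a sketch, the score of a given alignment $M$ can be computed by adding $s(x_i, y_j)$ for every matched pair and subtracting $d$ for every unmatched position. The scoring function (match +1, mismatch -1), the value d = 2 and the function names are assumptions for illustration.

```python
def alignment_score(x, y, pairs, s, d):
    """Score an alignment given as a set of 1-based position pairs (i, j)."""
    matched_x = {i for i, _ in pairs}
    matched_y = {j for _, j in pairs}
    # Every unmatched position in either sequence costs one gap penalty d.
    unmatched = (len(x) - len(matched_x)) + (len(y) - len(matched_y))
    pair_score = sum(s(x[i - 1], y[j - 1]) for i, j in pairs)
    return pair_score - d * unmatched


def simple_score(a, b):
    """Hypothetical scoring function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


# The stop/tops alignment: three matches and two unmatched positions.
print(alignment_score("stop", "tops", {(2, 1), (3, 2), (4, 3)}, simple_score, 2))
```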

Optimal alignment The process of maximizing this score is referred to as sequence alignment in the biology literature. The quantities $d$ and $s(a, b)$ are external parameters that must be plugged into software for sequence alignment. The higher the score, the more similar we declare the sequences to be.

Designing the algorithm [Theorem] Let $M$ be any alignment of $x$ and $y$. If $(m, n) \notin M$, then either the $m$th position of $x$ or the $n$th position of $y$ is not matched in $M$. Proof. Suppose by way of contradiction that $(m, n) \notin M$, and there are numbers $i < m$ and $j < n$ so that $(m, j) \in M$ and $(i, n) \in M$. But this contradicts our definition of alignment: we have $(i, n), (m, j) \in M$ with $i < m$, but $n > j$, so the pairs $(i, n)$ and $(m, j)$ cross.

Designing the algorithm There is an equivalent way to write the theorem that exposes three alternative possibilities, and leads directly to the formulation of a recurrence. In an optimal alignment $M$, at least one of the following is true: 1) $(m, n) \in M$; or 2) the $m$th position of $x$ is not matched; or 3) the $n$th position of $y$ is not matched.

Designing the algorithm Let $F(i, j)$ denote the maximum score of an alignment between $x_1 \cdots x_i$ and $y_1 \cdots y_j$. If case 1) holds, we add $s(x_m, y_n)$ and we get $F(m, n) = F(m-1, n-1) + s(x_m, y_n)$. If case 2) holds, we pay a gap penalty of $d$ since the $m$th position of $x$ is not matched and we get $F(m, n) = F(m-1, n) - d$. If case 3) holds, we pay a gap penalty of $d$ since the $n$th position of $y$ is not matched and we get $F(m, n) = F(m, n-1) - d$.

Designing the algorithm Using the same argument for the subproblem of finding the maximum-score alignment between $x_1 \cdots x_i$ and $y_1 \cdots y_j$, we get the following fact: the maximum alignment scores satisfy the following recurrence for $i \ge 1$ and $j \ge 1$: $F(i, j) = \max\{F(i-1, j-1) + s(x_i, y_j),\ F(i-1, j) - d,\ F(i, j-1) - d\}$. Moreover, $(i, j)$ is in an optimal alignment $M$ for this subproblem if and only if the maximum is achieved by the first of these values.

Designing the algorithm We build up the values of $F(i, j)$ using the recurrence. There are only $O(mn)$ subproblems, and $F(m, n)$ is the value we are seeking. We now specify the algorithm to compute the value of the optimal alignment. For the purpose of initialization, we note that $F(i, 0) = -id$ and $F(0, j) = -jd$ for all $i$ and $j$, since the only way to line up an $i$-letter word with a 0-letter word is to use $i$ gaps.

Designing the algorithm
Alignment(x, y)
  Array F[0..m, 0..n]
  Initialize F[i,0] = -id for each i
  Initialize F[0,j] = -jd for each j
  For j = 1, ..., n
    For i = 1, ..., m
      Use the recurrence to compute F[i,j]
    Endfor
  Endfor
  Return F[m,n]

Designing the algorithm To find the alignment itself, we must find the path of choices that led to this final value. This procedure is known as a traceback.
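The table fill and the traceback can be sketched together in Python. This is a minimal illustration of the Needleman-Wunsch recurrence with a linear gap penalty d, not a reference implementation: the function names, the scoring function (match +1, mismatch -1) and d = 1 are assumptions, and the exact equality tests in the traceback rely on integer scores.

```python
def needleman_wunsch(x, y, s, d):
    """Global alignment with linear gap penalty d.

    Returns (score, aligned_x, aligned_y), using '-' for gaps.
    """
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -i * d          # x[0:i] against the empty word: i gaps
    for j in range(1, n + 1):
        F[0][j] = -j * d          # empty word against y[0:j]: j gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)

    # Traceback: walk back from F[m][n], re-deriving which case gave the max.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            top.append(x[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1])
            j -= 1
    return F[m][n], "".join(reversed(top)), "".join(reversed(bottom))


def simple_score(a, b):
    """Hypothetical scoring function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


score, row_x, row_y = needleman_wunsch("stop", "tops", simple_score, 1)
print(score)
print(row_x)
print(row_y)
```

When several alignments share the optimal score, this traceback reports only one of them; it prefers the diagonal step, then a gap in y, then a gap in x.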

Example

Running time The algorithm takes $O(mn)$ time and $O(mn)$ memory. $O(mn)$ is standard notation, called big-O notation, meaning "of order $mn$". The computation time or memory storage required to solve the problem scales as the product of the sequence lengths $mn$, up to a constant factor.

Local alignment A much more common situation is where we are looking for the best alignment between subsequences of $x$ and $y$. This arises for example when it is suspected that two protein sequences may share a common domain, or when comparing two very highly diverged sequences. The highest scoring alignment of subsequences of $x$ and $y$ is called the best local alignment.

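Smith-Waterman algorithm The algorithm for finding optimal local alignments is closely related to that for global alignments. There are two differences. First, the recurrence gains an extra option, 0, so that $F(i, j)$ never drops below zero: $F(i, j) = \max\{0,\ F(i-1, j-1) + s(x_i, y_j),\ F(i-1, j) - d,\ F(i, j-1) - d\}$.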

Smith-Waterman algorithm Taking the option 0 corresponds to starting a new alignment. If the best alignment up to some point has a negative score, it is better to start a new one. Note that the initialization also changes: $F(i, 0) = 0$ and $F(0, j) = 0$.

Smith-Waterman algorithm Second, an alignment can end anywhere in the matrix. Instead of taking the value at $F(m, n)$ for the best score, we look for the highest value of $F(i, j)$ over the whole matrix, and start the traceback from there. The traceback ends when we meet a cell with value 0, which corresponds to the start of the alignment.
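Both differences from the global algorithm show up in a short sketch: the extra 0 in the maximum (with a zero border), and a traceback that starts at the highest cell and stops at a cell with value 0. As before, the function names, the scoring function (match +1, mismatch -1) and d = 1 are illustrative assumptions, and the equality tests assume integer scores.

```python
def smith_waterman(x, y, s, d):
    """Local alignment with linear gap penalty d.

    Returns (score, aligned_x, aligned_y) for one best local alignment.
    """
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]   # border stays at 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Traceback from the highest cell until a cell with value 0 is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:
            top.append(x[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


def simple_score(a, b):
    """Hypothetical scoring function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


print(smith_waterman("stop", "tops", simple_score, 1))
```

On the stop/tops pair the local alignment drops the unmatched ends and keeps only the shared "top" region.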

Smith-Waterman algorithm The local version of the dynamic programming sequence alignment algorithm is known as the Smith-Waterman algorithm.