Pairwise Sequence Alignment: Dynamic Programming Algorithms. COMP Spring 2015 Luay Nakhleh, Rice University

Similar documents
Pairwise Sequence Alignment: Dynamic Programming Algorithms COMP 571 Luay Nakhleh, Rice University

Lecture 10. Sequence alignments

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST

Sequence Alignment. part 2

Biology 644: Bioinformatics

Brief review from last class

Mouse, Human, Chimpanzee

Algorithmic Approaches for Biological Data, Lecture #20

Sequence Comparison: Dynamic Programming. Genome 373 Genomic Informatics Elhanan Borenstein

Computational Genomics and Molecular Biology, Fall

Profiles and Multiple Alignments. COMP 571 Luay Nakhleh, Rice University

Acceleration of Algorithm of Smith-Waterman Using Recursive Variable Expansion.

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment

Lecture 2 Pairwise sequence alignment. Principles Computational Biology Teresa Przytycka, PhD

Today s Lecture. Edit graph & alignment algorithms. Local vs global Computational complexity of pairwise alignment Multiple sequence alignment

Lecture 3: February Local Alignment: The Smith-Waterman Algorithm

Programming assignment for the course Sequence Analysis (2006)

Notes on Dynamic-Programming Sequence Alignment

Bioinformatics 1: lecture 4. Followup of lecture 3? Molecular evolution Global, semi-global and local Affine gap penalty

Today s Lecture. Multiple sequence alignment. Improved scoring of pairwise alignments. Affine gap penalties Profiles

Local Alignment & Gap Penalties CMSC 423

Sequence analysis Pairwise sequence alignment

Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it.

Sequence comparison: Local alignment

EECS730: Introduction to Bioinformatics

Sequence Alignment. COMPSCI 260 Spring 2016

Computational Molecular Biology

Sequence Analysis '17: lecture 4. Molecular evolution Global, semi-global and local Affine gap penalty

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations:

Sequence Alignment & Search

Special course in Computer Science: Advanced Text Algorithms

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Computational Biology Lecture 6: Affine gap penalty function, multiple sequence alignment Saad Mneimneh

Heuristic methods for pairwise alignment:

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

Sequence alignment algorithms

BLAST MCDB 187. Friday, February 8, 13

Bioinformatics for Biologists

Bioinformatics explained: Smith-Waterman

Computational Biology Lecture 4: Overlap detection, Local Alignment, Space Efficient Needleman-Wunsch Saad Mneimneh

Chapter 6. Multiple sequence alignment (week 10)

Reconstructing long sequences from overlapping sequence fragment. Searching databases for related sequences and subsequences

Acceleration of the Smith-Waterman algorithm for DNA sequence alignment using an FPGA platform

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:

Sequence Alignment AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--

Lecture 9: Core String Edits and Alignments

1 Computing alignments in only linear space

EECS730: Introduction to Bioinformatics

12 Dynamic Programming (2) Matrix-chain Multiplication Segmented Least Squares

A CAM(Content Addressable Memory)-based architecture for molecular sequence matching

Chapter 8 Multiple sequence alignment. Chaochun Wei Spring 2018

Computational Molecular Biology

TCCAGGTG-GAT TGCAAGTGCG-T. Local Sequence Alignment & Heuristic Local Aligners. Review: Probabilistic Interpretation. Chance or true homology?

Cache and Energy Efficient Alignment of Very Long Sequences

Alignment of Long Sequences

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA.

Outline. Sequence Alignment. Types of Sequence Alignment. Genomics & Computational Biology. Section 2. How Computers Store Information

In this section we describe how to extend the match refinement to the multiple case and then use T-Coffee to heuristically compute a multiple trace.

Multiple Sequence Alignment. Mark Whitsitt - NCSA

CS 284A: Algorithms for Computational Biology Notes on Lecture: BLAST. The statistics of alignment scores.

CS273: Algorithms for Structure Handout # 4 and Motion in Biology Stanford University Thursday, 8 April 2004

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching

Comparative Analysis of Protein Alignment Algorithms in Parallel environment using CUDA

Pairwise alignment II

Stephen Scott.

Distributed Protein Sequence Alignment

Biochemistry 324 Bioinformatics. Multiple Sequence Alignment (MSA)

OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment

Keywords -Bioinformatics, sequence alignment, Smith- waterman (SW) algorithm, GPU, CUDA

Gaps ATTACGTACTCCATG ATTACGT CATG. In an edit script we need 4 edit operations for the gap of length 4.

DNA Alignment With Affine Gap Penalties

Biological Sequence Matching Using Fuzzy Logic

Bioinformatics explained: BLAST. March 8, 2007

Alignment Based Similarity distance Measure for Better Web Sessions Clustering

Scoring and heuristic methods for sequence alignment CG 17

Similarity searches in biological sequence databases

A Study On Pair-Wise Local Alignment Of Protein Sequence For Identifying The Structural Similarity

EECS 4425: Introductory Computational Bioinformatics Fall Suprakash Datta

Biologically significant sequence alignments using Boltzmann probabilities

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

Software Implementation of Smith-Waterman Algorithm in FPGA

The implementation of bit-parallelism for DNA sequence alignment

Using Blocks in Pairwise Sequence Alignment

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm.

A New Approach For Tree Alignment Based on Local Re-Optimization

Global Alignment Scoring Matrices Local Alignment Alignment with Affine Gap Penalties

GPU Accelerated Smith-Waterman

Central Issues in Biological Sequence Comparison

Algorithms IV. Dynamic Programming. Guoqiang Li. School of Software, Shanghai Jiao Tong University

Dynamic Programming Part I: Examples. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, / 77

BLAST, Profile, and PSI-BLAST

BLAST & Genome assembly

Dynamic Programming part 2

Alignment ABC. Most slides are modified from Serafim s lectures

CMSC423: Bioinformatic Algorithms, Databases and Tools Lecture 8. Note

.. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar..

Multiple sequence alignment. November 20, 2018

1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998

Fast Sequence Alignment Method Using CUDA-enabled GPU

Transcription:

Pairwise Sequence Alignment: Dynamic Programming Algorithms COMP 571 - Spring 2015 Luay Nakhleh, Rice University

DP Algorithms for Pairwise Alignment The number of all possible pairwise alignments (if gaps are allowed) is exponential in the length of the sequences Therefore, the approach of score every possible alignment and choose the best is infeasible in practice Efficient algorithms for pairwise alignment have been devised using dynamic programming (DP)

DP Algorithms for Pairwise Alignment The key property of DP is that the problem can be divided into many smaller parts and the solution can be obtained from the solutions to these smaller parts

The Needleman-Wunch Algorithm for Global Pairwise Alignment The problem is to align two sequences x (x1x2...xm) and y (y1y2...yn) finding the best scoring alignment in which all residues of both sequences are included The score is assumed to be a measure of similarity, so the highest score is desired Alternatively, the score could be an evolutionary distance, in which case the smallest score would be sought, and all uses of max in the algorithm would be replaced by min

The Needleman-Wunch Algorithm for Global Pairwise Alignment The key concept in all these algorithms is the matrix S of optimal scores of subsequence alignments The matrix has (m+1) rows labeled 0 m and (n+1) columns labeled 0 n The rows correspond to the residues of sequence x, and the columns correspond to the residues of sequence y

The Needleman-Wunch Algorithm for Global Pairwise Alignment We ll use as a working example the two sequences x=thisline and y=isaligned with BLOSUM-62 substitution matrix as the scoring matrix and linear gap penalty g=e The optimal alignment of these two sequences is T H I S - L I - N E - - - I S A L I G N E D

The Matrix S E=-8 Si,j stores the score for the optimal alignment of all residues up to xi of sequence x will all residues up to yj of sequence y S i,0 = S 0,i = ig

The Matrix S To complete the matrix, we use the formula S i,j = max S i 1,j 1 + s(x i,y j ) S i 1,j + g S i,j 1 + g

The Matrix S E=-8 S i,j = max S i 1,j 1 + s(x i,y j ) S i 1,j + g S i,j 1 + g The global alignment can be obtained by traceback

The alignment we obtained is not the one we expected in that it contains no gaps (the gap at the end is added only because the two sequences are of different lengths) The problem is that the worst score in the BLOSUM-62 matrix is -4, which is significantly less than the gap penalty (-8), so gaps are unlikely to be present in the optimal alignment

If instead we use a linear gap penalty of -4, inserting a gap becomes less severe, and a gapped alignment is more likely to be obtained Therefore it is very important to match the gap penalty to the substitution matrix used

The Matrix S E=-4 S i,j = max S i 1,j 1 + s(x i,y j ) S i 1,j + g S i,j 1 + g The global alignment can be obtained by traceback

Multiple Optimal Alignments There may be more than one optimal alignment During traceback this is indicated by encountering an element that was derived from more than one of the three possible alternatives The algorithm does not distinguish between these possible alignments, although there may be reasons (such as knowledge of molecular structure of function) for preferring one to the others Most programs will arbitrarily report just one single alignment

Semiglobal Alignments In certain applications, we don t want to penalize gaps at the beginning or end of either of the two sequences in the alignment The resulting alignment is called semiglobal alignment To obtain it, we have two modifications to the Needleman- Wunch algorithm: (1) we initialize Si,0 and S0,j to 0 for all i and j, and (2) the traceback starts at the highest-scoring element in the bottom row or right-most column

Semiglobal Alignment

General Gap Penalty The algorithm we presented works with a linear gap penalty of the form g(ngap) = -ngape For a general gap penalty model g(ngap), one must consider the possibility of arriving at Si,j directly via insertion of a gap of length up to i in sequence x or j in sequence y

General Gap Penalty

General Gap Penalty The algorithm now has to be modified to S i,j = max S i 1,j 1 + s(x i,y j ) (S i ngap1,j + g(n gap1 )) 1 ngap1 i (S i,j ngap2 + g(n gap2 )) 1 ngap2 j This algorithm takes time that is proportional to mn 2, where m and n are the sequence lengths with n>m

Affine Gap Penalty For an affine gap penalty g(ngap)=-i-(ngap-1)e, we can refine this algorithm to obtain an O(mn) algorithm Define two matrices V i,j = max{s i ngap1,j + g(n gap1 )} 1 ngap1 i W i,j = max{s i,j ngap2 + g(n gap2 )} 1 ngap2 j

Affine Gap Penalty These matrices can be defined recursively as V i,j = max W i,j = max { Si 1,j I V i 1,j E { Si,j 1 I W i,j 1 E Now, the matrix S can be written as S i,j = max S i 1,j 1 + s(x i,y j ) V i,j W i,j

Local Pairwise Alignment As mentioned before, sometimes local alignment is more appropriate (e.g., aligning two proteins that have just one domain in common) The algorithmic differences between the algorithm for local alignment (Smith-Waterman algorithm) and the one for global alignment: Whenever the score of the optimal sub-alignment is less than zero, it is rejected (the matrix element is set to 0) Traceback starts from the highest-scoring matrix element

Local Pairwise Alignment S i,j = max S i 1,j 1 + s(x i,y j ) (S i ngap1,j + g(n gap1 )) 1 ngap1 i (S i,j ngap2 + g(n gap2 )) 1 ngap2 j 0

Local Pairwise Alignment E=-8

Local Pairwise Alignment E=-4

Linear Space Alignment Is there a linear-space algorithm for the problem? If only the maximal score is needed, the problem is simple But even if the alignment itself is needed, there is a linear-space algorithm (originally due to Hirschberg 1975, and introduced into computational biology by Myers and Miller 1988)

Linear Space Alignment Main observation S i,j = max 0 k n {S i 2,k + Sr i 2,n k} S r i,j score of the best alignment of last i residues of x with last j residues of y

Linear Space Alignment Compute and save row m/2 Compute and save row m/2 Find k* that satisfies Recurse S m,n S r m,n S m 2,k + S r m 2,n k = S m,n