Sequence Alignment (chapter 6) p The biological problem p Global alignment p Local alignment p Multiple alignment

Similar documents
Sequence analysis Pairwise sequence alignment

Biology 644: Bioinformatics

Computational Molecular Biology

FastA & the chaining problem

FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:

Outline. Sequence Alignment. Types of Sequence Alignment. Genomics & Computational Biology. Section 2. How Computers Store Information

Lectures by Volker Heun, Daniel Huson and Knut Reinert, in particular last years lectures

CMSC423: Bioinformatic Algorithms, Databases and Tools Lecture 8. Note

Today s Lecture. Edit graph & alignment algorithms. Local vs global Computational complexity of pairwise alignment Multiple sequence alignment

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm.

Today s Lecture. Multiple sequence alignment. Improved scoring of pairwise alignments. Affine gap penalties Profiles

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST

Sequence Comparison: Dynamic Programming. Genome 373 Genomic Informatics Elhanan Borenstein

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA.

Bioinformatics for Biologists

Sequence alignment algorithms

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be

Lecture 2 Pairwise sequence alignment. Principles Computational Biology Teresa Przytycka, PhD

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations:

CISC 889 Bioinformatics (Spring 2003) Multiple Sequence Alignment

Sequence Alignment & Search

Scoring and heuristic methods for sequence alignment CG 17

EECS730: Introduction to Bioinformatics

Finding homologous sequences in databases

.. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar..

EECS730: Introduction to Bioinformatics

BLAST - Basic Local Alignment Search Tool

Pairwise Sequence alignment Basic Algorithms

B I O I N F O R M A T I C S

Brief review from last class

Computational Genomics and Molecular Biology, Fall

Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it.

Dynamic Programming: Sequence alignment. CS 466 Saurabh Sinha

Bioinformatics. Sequence alignment BLAST Significance. Next time Protein Structure

TCCAGGTG-GAT TGCAAGTGCG-T. Local Sequence Alignment & Heuristic Local Aligners. Review: Probabilistic Interpretation. Chance or true homology?

Software Implementation of Smith-Waterman Algorithm in FPGA

Lecture 10. Sequence alignments

Algorithmic Approaches for Biological Data, Lecture #20

Reconstructing long sequences from overlapping sequence fragment. Searching databases for related sequences and subsequences

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching

Programming assignment for the course Sequence Analysis (2006)

Computational Molecular Biology

Multiple Sequence Alignment: Multidimensional. Biological Motivation

An FPGA-Based Web Server for High Performance Biological Sequence Alignment

15-780: Graduate Artificial Intelligence. Computational biology: Sequence alignment and profile HMMs

Alignment of Long Sequences

BLAST, Profile, and PSI-BLAST

Chapter 8 Multiple sequence alignment. Chaochun Wei Spring 2018

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

Bioinformatics explained: Smith-Waterman

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

Multiple Sequence Alignment. Mark Whitsitt - NCSA

Chapter 4: Blast. Chaochun Wei Fall 2014

Chapter 6. Multiple sequence alignment (week 10)

Pairwise Sequence Alignment. Zhongming Zhao, PhD

BLAST & Genome assembly

BGGN 213 Foundations of Bioinformatics Barry Grant

Heuristic methods for pairwise alignment:

The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval. Kevin C. O'Kane. Department of Computer Science

Global Alignment Scoring Matrices Local Alignment Alignment with Affine Gap Penalties

Biochemistry 324 Bioinformatics. Multiple Sequence Alignment (MSA)

Basic Local Alignment Search Tool (BLAST)

Salvador Capella-Gutiérrez, Jose M. Silla-Martínez and Toni Gabaldón

Sequence Alignment Heuristics

Research on Pairwise Sequence Alignment Needleman-Wunsch Algorithm

Sequence alignment theory and applications Session 3: BLAST algorithm

Special course in Computer Science: Advanced Text Algorithms

BLAST & Genome assembly

Dynamic Programming & Smith-Waterman algorithm

BLAST MCDB 187. Friday, February 8, 13

Neighbour Joining. Algorithms in BioInformatics 2 Mandatory Project 1 Magnus Erik Hvass Pedersen (971055) November 2004, Daimi, University of Aarhus

Database Searching Using BLAST

From Smith-Waterman to BLAST

Pairwise Sequence Alignment: Dynamic Programming Algorithms. COMP Spring 2015 Luay Nakhleh, Rice University

Dynamic Programming Part I: Examples. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, / 77

Pairwise alignment II

Bioinformatics explained: BLAST. March 8, 2007

Jyoti Lakhani 1, Ajay Khunteta 2, Dharmesh Harwani *3 1 Poornima University, Jaipur & Maharaja Ganga Singh University, Bikaner, Rajasthan, India

Alignment of Pairs of Sequences

Pairwise Sequence Alignment: Dynamic Programming Algorithms COMP 571 Luay Nakhleh, Rice University

Acceleration of Algorithm of Smith-Waterman Using Recursive Variable Expansion.

The Dot Matrix Method

EECS 4425: Introductory Computational Bioinformatics Fall Suprakash Datta

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 13 G R A T I V. Iterative homology searching,

Multiple Sequence Alignment Sum-of-Pairs and ClustalW. Ulf Leser

CS 284A: Algorithms for Computational Biology Notes on Lecture: BLAST. The statistics of alignment scores.

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment

Sequence Alignment. part 2

BLAST. Basic Local Alignment Search Tool. Used to quickly compare a protein or DNA sequence to a database.

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

Central Issues in Biological Sequence Comparison

A Study On Pair-Wise Local Alignment Of Protein Sequence For Identifying The Structural Similarity

A Design of a Hybrid System for DNA Sequence Alignment

3.4 Multiple sequence alignment

HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT

BIOL591: Introduction to Bioinformatics Alignment of pairs of sequences

Profiles and Multiple Alignments. COMP 571 Luay Nakhleh, Rice University

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

Transcription:

Sequence lignment (chapter 6) p The biological problem p lobal alignment p Local alignment p Multiple alignment

Local alignment: rationale p Otherwise dissimilar proteins may have local regions of similarity -> Proteins may share a function Human bone morphogenic protein receptor type II precursor (left) has a 3 aa region that resembles 9 aa region in TF- receptor (right). The shared function here is protein kinase.

Local alignment: rationale B Regions of similarity p lobal alignment would be inadequate p Problem: find the highest scoring local alignment between two sequences p Previous algorithm with minor modifications solves this problem (Smith & Waterman 98)

From global to local alignment p Modifications to the global alignment algorithm Look for the highest-scoring path in the alignment matrix (not necessarily through the matrix), or in other words: llow preceding and trailing indels without penalty 3

Scoring local alignments = a a a 3 a n, B = b b b 3 b m Let I and J be intervals (substrings) of and B, respectively: Best local alignment score: where S(I, J) is the alignment score for substrings I and J. 4

llowing preceding and trailing indels p First row and column initialised to zero: 3 4 M i, = M,j = - b b b 3 b 4 - a b b b3 --a a 3 a 3 5

6 Recursion for local alignment p M i,j = max{ M i-,j- + s(a i, b i ), M i-,j, M i,j-, } T T - T T - llow alignment to start anywhere in sequences

7 Finding best local alignment p Optimal score is the highest value in the matrix = max i,j M i,j p Best local alignment can be found by backtracking from the highest value in M p What is the best local alignment in this example? T T - T T -

8 Local alignment: example 8 7 6 5 T 4 3 - T T - 9 8 7 6 5 4 3 M i,j = max { M i-,j- + s(a i, b i ), M i-,j, M i,j-, } Scoring (for example) Match: + Mismatch: - Indel: -

9 Local alignment: example 8 7 6 5 T 4 3 - T T - 9 8 7 6 5 4 3 M i,j = max { M i-,j- + s(a i, b i ), M i-,j, M i,j-, } Scoring (for example) Match: + Mismatch: - Indel: -

T T Local alignment: example Scoring (for example) Match: + Mismatch: - Indel: - Optimal local alignment: 4 3 4 8 3 5 4 3 7 3 4 6 5 6 3 3 4 3 5 4 T 4 3 3 - T T - 9 8 7 6 5 4 3

Multiple optimal alignments Non-optimal, good-scoring alignments 4 3 4 8 3 5 4 3 7 3 4 6 5 6 3 3 4 3 5 4 T 4 3 3 - T T - 9 8 7 6 5 4 3 How can you find. Optimal alignments if more than one exist?. Non-optimal, good-scoring alignments?

Overlap alignment p Overlap matrix used by Overlap-Layout- onsensus algorithm can be computed with dynamic programming p Initialization: O i, = O,j = for all i, j p Recursion: O i,j = max{ O i-,j- + s(a i, b i ), O i-,j, O i,j-, } Best overlap: maximum value from rightmost column and bottom row

Non-uniform mismatch penalties p We used uniform penalty for mismatches: s(, ) = s(, ) = = s(, T ) = µ p Transition mutations (->, ->, ->T, T->) are approximately twice as frequent than transversions (->T, T->, ->, ->T) use non-uniform mismatch penalties collected into a substitution matrix T - -.5 - - - -.5 -.5 - - T - -.5-3

aps in alignment p ap is a succession of indels in alignment T - - T p Previous model scored a length k gap as w(k) = -k p Replication processes may produce longer stretches of insertions or deletions In coding regions, insertions or deletions of codons may preserve functionality 4

ap open and extension penalties () p We can design a score that allows the penalty opening gap to be larger than extending the gap: w(k) = - (k ) p ap open cost, ap extension cost p lignment algorithms can be extended to use w(k) (not discussed on this course) 5

mino acid sequences p We have discussed mainly DN sequences p mino acid sequences can be aligned as well p However, the design of the substitution matrix is more involved because of the larger alphabet p More on the topic in the course Biological sequence analysis 6

Demonstration of the EBI web site p European Bioinformatics Institute (EBI) offers many biological databases and bioinformatics tools at http://www.ebi.ac.uk/ Sequence alignment: Tools -> Sequence nalysis -> lign 7

Sequence lignment (chapter 6) p The biological problem p lobal alignment p Local alignment p Multiple alignment 8

Multiple alignment p onsider a set of n sequences on the right Orthologous sequences from different organisms Paralogs from multiple duplications p How can we study relationships between these sequences? aggcgagctgcgagtgcta cgttagattgacgctgac ttccggctgcgac gacacggcgaacgga agtgtgcccgacgagcgaggac gcgggctgtgagcgcta aagcggcctgtgtgcccta atgctgctgccagtgta agtcgagccccgagtgc agtccgagtcc actcggtgc 9

Optimal alignment of three sequences p lignment of = a a a i and B = b b b j can end either in (-, b j ), (a i, b j ) or (a i, -) p = 3 alternatives p lignment of, B and = c c c k can end in 3 ways: (a i, -, -), (-, b j, -), (-, -, c k ), (-, b j, c k ), (a i, -, c k ), (a i, b j, -) or (a i, b j, c k ) p Solve the recursion using three-dimensional dynamic programming matrix: O(n 3 ) time and space p eneralizes to n sequences but impractical with even a moderate number of sequences

Multiple alignment in practice p In practice, real-world multiple alignment problems are usually solved with heuristics p Progressive multiple alignment hoose two sequences and align them hoose third sequence w.r.t. two previous sequences and align the third against them Repeat until all sequences have been aligned Different options how to choose sequences and score alignments Note the similarity to Overlap-Layout-onsensus

Multiple alignment in practice p Profile-based progressive multiple alignment: LUSTLW onstruct a distance matrix of all pairs of sequences using dynamic programming Progressively align pairs in order of decreasing similarity LUSTLW uses various heuristics to contribute to accuracy

dditional material p R. Durbin, S. Eddy,. Krogh,. Mitchison: Biological sequence analysis p N.. Jones, P.. Pevzner: n introduction to bioinformatics algorithms p ourse Biological sequence analysis in period II, 8 3

Rapid alignment methods: FST and BLST p The biological problem p Search strategies p FST p BLST 4

The biological problem p lobal and local alignment algoritms are slow in practice p onsider the scenario of aligning a query sequence against a large database of sequences New sequence with unknown function NBI enbank size in January 7 was 65 369 9 95 bases (6 3 599 sequences) Feb 8: 85 759 586 764 bases (8 853 685 sequences) 5

Problem with large amount of sequences p Exponential growth in both number and total length of sequences p Possible solution: ompare against model organisms only p With large amount of sequences, chances are that matches occur by random Need for statistical analysis 6

Rapid alignment methods: FST and BLST p The biological problem p Search strategies p FST p BLST 7

FST p FST is a multistep algorithm for sequence alignment (Wilbur and Lipman, 983) p The sequence file format used by the FST software is widely used by other sequence analysis software p Main idea: hoose regions of the two sequences (query and database) that look promising (have some degree of similarity) ompute local alignment using dynamic programming in these regions 8

FST outline p FST algorithm has five steps:. Identify common k-words between I and J. Score diagonals with k-word matches, identify best diagonals 3. Rescore initial regions with a substitution score matrix 4. Join initial regions using gaps, penalise for gaps 5. Perform dynamic programming to find final alignments 9

Search strategies p How to speed up the computation? Find ways to limit the number of pairwise comparisons p ompare the sequences at word level to find out common words Word means here a k-tuple (or a k-word), a substring of length k 3

nalyzing the word content p Example query string I: TTTT p For k = 8, the set of k-words (substring of length k) of I is TTT TT TT TT T 3

nalyzing the word content p There are n-k+ k-words in a string of length n p If at least one word of I is not found from another string J, we know that I differs from J p Need to consider statistical significance: I and J might share words by chance only p Let n= I and m= J 3

Word lists and comparison by content p The k-words of I can be arranged into a table of word occurences L w (I) p onsider the k-words when k= and I=T:,, T, T,,, T: 3 : : 5 :, 7 : 6 T: 4 Start indecies of k-word in I Building L w (I) takes O(n) time 33

ommon k-words 34 p Number of common k-words in I and J can be computed using L w (I) and L w (J) p For each word w in I, there are L w (J) occurences in J p Therefore I and J have common words p This can be computed in O(n + m + 4 k ) time O(n + m) time to build the lists O(4 k ) time to calculate the sum (in DN strings)

ommon k-words p I = T p J = TT L w (I) T: 3 : : 5 :, 7 : 6 T: 4 L w (J) T: 3, 9 :, 8 :, 7 : 5, : 6 T: 4, ommon words in total 35

Properties of the common word list p Exact matches can be found using binary search (e.g., where TT occurs in I?) O(log4 k ) time p For large k, the table size is too large to compute the common word count in the previous fashion p Instead, an approach based on merge sort can be utilised (details skipped) p The common k-word technique can be combined with the local alignment algorithm to yield a rapid alignment approach 36

FST outline p FST algorithm has five steps:. Identify common k-words between I and J. Score diagonals with k-word matches, identify best diagonals 3. Rescore initial regions with a substitution score matrix 4. Join initial regions using gaps, penalise for gaps 5. Perform dynamic programming to find final alignments 37

Dot matrix comparisons p Word matches in two sequences I and J can be represented as a dot matrix p Dot matrix element (i, j) has a dot, if the word starting at position i in I is identical to the word starting at position j in J p The dot matrix can be plotted for various k i j i I = TT J = TTT j 38

k= k=4 Dot matrix (k=,4,8,6) for two DN sequences X85973. (875 bp) Y93. (3 bp) k=8 k=6 39

k= k=4 Dot matrix (k=,4,8,6) for two protein sequences B5. (53 aa) 768. (588 aa) k=8 k=6 Shading indicates now the match score according to a score matrix (Blosum6 here) 4

omputing diagonal sums p We would like to find high scoring diagonals of the dot matrix p Lets index diagonals by the offset, l = i - j I 4 T T * * * * * T * * * * * J k= Diagonal l = i j = -6

omputing diagonal sums p s an example, lets compute diagonal sums for I = T, J = TT, k = p. onstruct k-word list L w (J) p. Diagonal sums S l are computed into a table, indexed with the offset and initialised to zero l - -9-8 -7-6 -5-4 -3 - - 3 4 5 6 S l 4

omputing diagonal sums p 3. o through k-words of I, look for matches in L w (J) and update diagonal sums I 43 T T * * * * * T * * * * * J For the first -word in I,, L (J) = {6}. We can then update the sum of diagonal l = i j = 6 = -5 to S -5 := S -5 + = + =

omputing diagonal sums p 3. o through k-words of I, look for matches in L w (J) and update diagonal sums I 44 T T * * * * * T * * * * * J Next -word in I is, for whichl (J) = {, 8}. Two diagonal sums are updated: l = i j = = S := S + = + = I = i j = 8 = -6 S -6 := S -6 + = + =

omputing diagonal sums p 3. o through k-words of I, look for matches in L w (J) and update diagonal sums I 45 T T * * * * * T * * * * * J Next -word in I is T, for whichl T (J) = {3, 9}. Two diagonal sums are updated: l = i j = 3 3 = S := S + = + = I = i j = 3 9 = -6 S -6 := S -6 + = + =

omputing diagonal sums fter going through the k-words of I, the result is: l - -9-8 -7-6 -5-4 -3 - - 3 4 5 6 S l 4 4 I 46 T T * * * * * T * * * * * J

lgorithm for computing diagonal sum of scores S l := for all m l n ompute L w (J) for all words w for i := to n k do end w := I i I i+ I i+k- for j L w (J) do end l := i j S l := S l + Match score is here 47

FST outline p FST algorithm has five steps:. Identify common k-words between I and J. Score diagonals with k-word matches, identify best diagonals 3. Rescore initial regions with a substitution score matrix 4. Join initial regions using gaps, penalise for gaps 5. Perform dynamic programming to find final alignments 48

Rescoring initial regions p Each high-scoring diagonal chosen in the previous step is rescored according to a score matrix p This is done to find subregions with identities shorter than k p Non-matching ends of the diagonal are trimmed I: T T J: T I : T T J : T T 75% identity, no 4-word identities 33% identity, one 4-word identity 49

Joining diagonals p Two offset diagonals can be joined with a gap, if the resulting alignment has a higher score p Separate gap open and extension are used p Find the best-scoring combination of diagonals High-scoring diagonals Two diagonals joined by a gap 5

FST outline p FST algorithm has five steps:. Identify common k-words between I and J. Score diagonals with k-word matches, identify best diagonals 3. Rescore initial regions with a substitution score matrix 4. Join initial regions using gaps, penalise for gaps 5. Perform dynamic programming to find final alignments 5

Local alignment in the highest-scoring region p Last step of FST: perform local alignment using dynamic programming around the highestscoring p Region to be aligned covers w and +w offset diagonal to the highestscoring diagonals p With long sequences, this region is typically very small compared to the whole n x m matrix Dynamic programming matrix M filled only for the green region w w 5

Properties of FST p Fast compared to local alignment using dynamic programming only Only a narrow region of the full matrix is aligned p Increasing parameter k decreases the number of hits: Increases specificity Decreases sensitivity Decreases running time p FST can be very specific when identifying long regions of low similarity Specific method does not find many incorrect results Sensitive method finds many of the correct results 53

Properties of FST p FST looks for initial exact matches to query sequence Two proteins can have very different amino acid sequences and still be biologically similar This may lead into a lack of sensitivity with diverged sequences 54

Demonstration of FST at EBI p http://www.ebi.ac.uk/fasta/ p Note that parameter ktup in the software corresponds to parameter k in lectures 55