TCCAGGTG-GAT TGCAAGTGCG-T. Local Sequence Alignment & Heuristic Local Aligners. Review: Probabilistic Interpretation. Chance or true homology?

Similar documents
EECS730: Introduction to Bioinformatics

Global Alignment Scoring Matrices Local Alignment Alignment with Affine Gap Penalties

Computational Molecular Biology

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be

Biology 644: Bioinformatics

An Analysis of Pairwise Sequence Alignment Algorithm Complexities: Needleman-Wunsch, Smith-Waterman, FASTA, BLAST and Gapped BLAST

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:

Bioinformatics for Biologists

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm.

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA.

Sequencing Alignment I

Lecture 2 Pairwise sequence alignment. Principles Computational Biology Teresa Przytycka, PhD

Scoring and heuristic methods for sequence alignment CG 17

Alignment ABC. Most slides are modified from Serafim s lectures

Basic Local Alignment Search Tool (BLAST)

Database Searching Using BLAST

FastA & the chaining problem

FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:

BLAST, Profile, and PSI-BLAST

Algorithms in Bioinformatics: A Practical Introduction. Database Search

BLAST MCDB 187. Friday, February 8, 13

From Smith-Waterman to BLAST

Algorithmic Approaches for Biological Data, Lecture #20

Lecture 10. Sequence alignments

Lecture 10: Local Alignments

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Heuristic methods for pairwise alignment:

BLAST. Basic Local Alignment Search Tool. Used to quickly compare a protein or DNA sequence to a database.

Lectures by Volker Heun, Daniel Huson and Knut Reinert, in particular last years lectures

B L A S T! BLAST: Basic local alignment search tool. Copyright notice. February 6, Pairwise alignment: key points. Outline of tonight s lecture

Computational Genomics and Molecular Biology, Fall

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations:

Sequence Alignment & Search

OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT

Sequence Alignment Heuristics

Pairwise Sequence Alignment: Dynamic Programming Algorithms COMP 571 Luay Nakhleh, Rice University

Sequence Alignment. part 2

L4: Blast: Alignment Scores etc.

Lecture 3: February Local Alignment: The Smith-Waterman Algorithm

BGGN 213 Foundations of Bioinformatics Barry Grant

Bioinformatics explained: Smith-Waterman

Sequence alignment algorithms

Sequence Alignment AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--

Sequence analysis Pairwise sequence alignment

Today s Lecture. Edit graph & alignment algorithms. Local vs global Computational complexity of pairwise alignment Multiple sequence alignment

Brief review from last class

Bioinformatics explained: BLAST. March 8, 2007

Dynamic Programming Part I: Examples. Bioinfo I (Institut Pasteur de Montevideo) Dynamic Programming -class4- July 25th, / 77

.. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar..

Pairwise Sequence Alignment: Dynamic Programming Algorithms. COMP Spring 2015 Luay Nakhleh, Rice University

Reconstructing long sequences from overlapping sequence fragment. Searching databases for related sequences and subsequences

Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it.

Sequence Alignment (chapter 6) p The biological problem p Global alignment p Local alignment p Multiple alignment

FINDING APPROXIMATE REPEATS WITH MULTIPLE SPACED SEEDS

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Chapter 4: Blast. Chaochun Wei Fall 2014

Today s Lecture. Multiple sequence alignment. Improved scoring of pairwise alignments. Affine gap penalties Profiles

CS 284A: Algorithms for Computational Biology Notes on Lecture: BLAST. The statistics of alignment scores.

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching

Mouse, Human, Chimpanzee

Finding homologous sequences in databases

Divya R. Singh. Faster Sequence Alignment using Suffix Tree and Data-Mining Techniques. February A Thesis Presented by

Dynamic Programming: Sequence alignment. CS 466 Saurabh Sinha

Alignment of Pairs of Sequences

Sequence alignment theory and applications Session 3: BLAST algorithm

A CAM(Content Addressable Memory)-based architecture for molecular sequence matching

Bioinformatics. Sequence alignment BLAST Significance. Next time Protein Structure

Alignments BLAST, BLAT

CS2220: Introduction to Computational Biology Lecture 5: Essence of Sequence Comparison. Limsoon Wong

Dynamic Programming & Smith-Waterman algorithm

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

EECS 4425: Introductory Computational Bioinformatics Fall Suprakash Datta

A Design of a Hybrid System for DNA Sequence Alignment

Example of repeats: ATGGTCTAGGTCCTAGTGGTC Motivation to find them: Genomic rearrangements are often associated with repeats Trace evolutionary

Pairwise Sequence Alignment. Zhongming Zhao, PhD

BIOL591: Introduction to Bioinformatics Alignment of pairs of sequences

BLAST & Genome assembly

Multiple Sequence Alignment II

Biological Sequence Analysis. CSEP 521: Applied Algorithms Final Project. Archie Russell ( ), Jason Hogg ( )

Similarity searches in biological sequence databases

Darwin-WGA. A Co-processor Provides Increased Sensitivity in Whole Genome Alignments with High Speedup

Similarity Searches on Sequence Databases

ICB Fall G4120: Introduction to Computational Biology. Oliver Jovanovic, Ph.D. Columbia University Department of Microbiology

Multiple Sequence Alignment. With thanks to Eric Stone and Steffen Heber, North Carolina State University

BLAST - Basic Local Alignment Search Tool

Computational Molecular Biology

A TUTORIAL OF RECENT DEVELOPMENTS IN THE SEEDING OF LOCAL ALIGNMENT

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 13 G R A T I V. Iterative homology searching,

Proceedings of the 11 th International Conference for Informatics and Information Technology

BIOL 7020 Special Topics Cell/Molecular: Molecular Phylogenetics. Spring 2010 Section A

Introduction to Computational Molecular Biology

NGS Data and Sequence Alignment

Data Mining Technologies for Bioinformatics Sequences

Comparative Analysis of Protein Alignment Algorithms in Parallel environment using CUDA

CS313 Exercise 4 Cover Page Fall 2017

15-780: Graduate Artificial Intelligence. Computational biology: Sequence alignment and profile HMMs

SimSearch: A new variant of dynamic programming based on distance series for optimal and near-optimal similarity discovery in biological sequences

Transcription:

Local Sequence Alignment & Heuristic Local Aligners Lectures 18 Nov 28, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall (JHN) 022 1 Review: Probabilistic Interpretation X: Y: TCCAGGTG-GAT TGCAAGTGCG-T Chance or true homology? Sharing a common ancestor 2 1

Review: Likelihood Ratio X: Y: TCCAGGTG-GAT TGCAAGTGCG-T Pr(Data Homology) Pr(Data Chance) 3 Review: Log Likelihood Ratio Score The most commonly used alignment score of aligning two sequences is the log likelihood ratio of the alignment under two models Common ancestry By chance Score log log i Pr Prxi yi i Prxi Pryi Prx i yi x Pr y i i i s x, y i i 4 2

Outline: Scoring Alignments Scoring alignments Probabilistic meaning Scoring matrices PAM: scoring based on evolutionary statistics BLOSUM: tuning to evolutionary conservation Gaps revisited Local vs global alignment Database search FASTA BLAST 5 Gap Initiation and Extension TCCACCGTG-GA TGCA--GTGCGA 6 3

Gap Initiation and Extension TCCACCGTG-GA CSCCDDCCCICC TGCA--GTGCGA Insertion / deletion (indel) 7 Scoring Indels: Naive Approach A fixed penalty d is given to every indel: -d for 1 indel, -2d for 2 consecutive indels -3d for 3 consecutive indels, etc. Can be too severe penalty for a series of 100 consecutive indels! 8 4

Affine Gap Penalties In nature, a series of k indels often come as a single event rather than a series of k single nucleotide events: This is more likely. Normal scoring would give the same score for both alignments This is less likely. 9 Gap Initiation and Extension TCCACCGTG-GA Init CSCCDDCCCICC Extend Init TGCA--GTGCGA Insertion / deletion (indel) 10 5

Scoring the Gaps More Accurately Current model: (n) Gap of length incurs penalty n nd However, gaps usually occur in bunches Convex gap penalty function: (n): for all n, (n + 1) - (n) (n) - (n 1) (n) 11 General Gap Dynamic Programming Initialization: same (n) Iteration: V(i, j) = max V(i-1, j-1) + s(x i, y j ) max V(i-1,j) k=0 i-1 + s(x V(k,j) i,-) (i-k) max V(i,j-1) k=0 j-1 + s(-,y V(i,k) j ) (j-k) Previously T[j] : V(i-2,j) V(i-1,j-1) V(i-1,j) Termination: same S[i].. V(i,j-2) V(i,j-1) V(i,j) Running Time: O(N 2 M) Space: O(NM) (assume N>M) 12 6

Accounting for Gaps Gaps- contiguous sequence of spaces in one of the rows Score for a gap of length x is: -(d + ex) where d >0 is the penalty for introducing a gap: gap opening penalty d will be large relative to e: gap extension penalty because you do not want to add too much of a penalty for extending the gap. 13 Affine Gap Penalties Gap penalties: -d - e when there is 2 indel -d -2e when there are 3 indels -d -3e when there are 4 indels, etc. -d (n-1) e when there are n indels Somehow reduced penalties (as compared to naïve scoring) are given to runs of horizontal and vertical arrows in the V matrix 14 7

Needleman-Wunsch With Affine Gaps (n) = d + (n 1)e gap gap open extend To compute optimal alignment, (n) e d At position i,j, need to remember best score if gap is open best score if gap is not open F(i, j): score of alignment x 1 x i to y 1 y j if x i aligns to y j G(i, j): score if x i or y j aligns to a gap 15 Needleman-Wunsch With Affine Gaps Initialization: F(i, 0) = d + (i 1)e F(0, j) = d + (j 1)e Iteration: F(i, j) = max F(i 1, j 1) + s(x i, y j ) G(i 1, j 1) + s(x i, y j ) Termination: G(i, j) = max same F(i 1, j) d F(i, j 1) d G(i, j 1) e G(i 1, j) e 16 8

Outline: Scoring Alignments Scoring alignments Probabilistic meaning Scoring matrices PAM: scoring based on evolutionary statistics BLOSUM: tuning to evolutionary conservation Gaps revisited Local vs global alignment Database search FASTA BLAST 17 Local vs. Global Alignment The Global Alignment Problem tries to find the highest scoring alignment between input sequences S (of length n) and T (of length m) S[1-n] and T[1-m]. The Local Alignment Problem tries to find the highest scoring alignment between the substrings S[i-i ] and T[j-j ], where i,j>0, i <n+1 and j <m+1. In the V matrix (alignment scores of substrings) with negatively-scored arrows, Local Alignment may score higher than Global Alignment 18 9

Local vs. Global Alignment (cont d) Global alignment --T -CC-C-AGT -TATGT-CAGGGGACACG A-GCATGCAGA-GAC AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG T-CAGAT--C Local alignment: better alignment to find conserved segment tcccagttatgtcaggggacacgagcatgcagagac aattgccgccgtcgttttcagcagttatgtcagatc 19 Local Alignments: Why? Genes are shuffled between genomes Two genes in different species may be similar over short conserved regions and dissimilar over remaining regions. Portions of proteins (domains) are often conserved 20 10

Local Alignment: Example Local alignment Compute a mini Global Alignment to get Local Global alignment 21 Local Alignment: Example Local run time O(n 4 ): In the grid of size n x n, there are ~n 2 vertices (i,j) that may serve as a source. For each such vertex computing alignments from (i,j) to (i,j ) takes O(n 2 ) time. This can be remedied by giving free rides. 22 11

Local Alignment: Free Rides Yeah, a free ride! Vertex (0,0) The dashed arrows represent the free rides from (0,0) to every other entry in the V matrix. 23 The Local Alignment Problem Goal: Find the best local alignment between two sequences Input : Sequences S, T and scoring matrix σ Output : Alignment of sequences S and T whose alignment score is maximum among all possible alignment of all possible substrings 24 12

The Smith-Waterman Algorithm Idea: Ignore badly aligning regions Modifications to Needleman-Wunsch: Initialization: V(0, j) = V(i, 0) = 0 0 Iteration: V(i, j) = max V(i 1, j) d V(i, j 1) d V(i 1, j 1) + s(x i, y j ) Power of ZERO: there is only this change from the original recurrence of a Global Alignment - since there is only one free ride arrow entering into every vertex 25 The Smith-Waterman Algorithm Termination: 1. If we want the best local alignment V OPT = max i,j V(i, j) 2. If we want all local alignments scoring > t For all i, j find V(i, j) > t, and trace back 26 13

Outline: Scoring Alignments Scoring alignments Probabilistic meaning Scoring matrices PAM: scoring based on evolutionary statistics BLOSUM: tuning to evolutionary conservation Gaps revisited Local vs global alignment Database search FASTA BLAST 27 Database Search The problems: Dynamic programming: prohibitively complex Exact matching: prohibitively mismatch-sensitive q=tacgaat..??? ATAAGAATATACGAATCCACGAT.. TCGATACGTTAGCAATACTAG CGAAATATAGGTTAGCAATAC.. ACGACATCGAAGAATAAATAT.... acacgaatatacgaatccacga-t.. _tacgaat-tacgaat-tacgaat 28 14

State of Biological Databases Sequenced Genomes: Human 310 9 Yeast 1.210 7 Mouse 2.710 9 12 different strains Rat 2.610 9 Neurospora 410 7 14 more fungi within next year Fugu fish 3.310 8 Tetraodon 310 8 ~250 bacteria/viruses Mosquito 2.810 8 Drosophila 1.210 8 Worm 1.010 8 2 sea squirts 1.610 8 Current rate of sequencing: Rice 1.010 9 4 big labs 3 10 9 bp /year/lab Arabidopsis 1.210 8 10s small labs 29 State of Biological Databases Number of genes Vertebrate: ~30,000 Insects: ~14,000 Worm: ~17,000 Fungi: ~6,000-10,000 Small organisms: 100s-1,000s Each known or predicted gene has an associated protein sequence >1,000,000 known / predicted protein sequences 30 15

Some Useful Applications of Alignments Given a newly discovered gene, Does it occur in other species? How fast does it evolve? Assume we try Smith-Waterman: Our new gene 10 4 The entire genomic database 10 10-10 11 31 Some Useful Applications of Alignments Given a newly sequenced organism, Which subregions align with other organisms? Potential genes Other biological characteristics Assume we try Smith-Waterman: Our newly sequenced mammal 310 9 The entire genomic database 10 10-10 11 32 16

Reconsider DP Geometry Diagonal matching segments: Basis for alignment Alignment: Connecting matching diagonals With mismatched diagonals or horizontal/vertical gaps Score: Additive contributions of diagonals and connectors Connectors may reduce the score Focus: high score diagonals, positive score connectors 33 Dot Matrix Heuristics Rule 1: Find high-scoring diagonals Search small diagonal segments Extend to max diagonal matches Connect diagonals to max score Rule 2: Focus on meaningful alignments Filter out low-scoring diagonals * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * 34 17

FASTA Key idea (Pearson & Lipman 88): Find short diagonals by indexing the DB Extend these to high scoring diagonals Use DP to connect them A 4 steps process 35 Step (a): Find diagonal matches by indexing Key idea: k-mers index of of the DB Preprocess: Scan database to index words of size k (k-mers) [k=1..5] ( Query: Scan query to index k-mers Compare hashes to find all diagonal matches of length k Merge short diagonals into maximal diagonal matches 36 18

Find Diagonal Matches by Indexing Example: Database d: TATCGATCGA Position: 1 2 3 4 5 6 7 8 9 10 Query q: GATCG Position: 1. Extract index TAT ATC TCG CGA GAT Ktup=3 1 2,6 3,7 4,8 5 d 2 3 1 q 1 2 3 4 5 0,-4 0,-4-4 Offset=q-d TATCGATCGA GATCG 2. Find matches 3. Merge diagonal matches FASTA Steps (b-d): Optimize score 1. Filter low-score diagonals 2. Extend diagonals to max score; keep high-scoring segments 3. Use DP in a narrow band around the high scoring segments 38 19

BLAST: Basic Local Alignment Search Tool Altschul & Karlin [1990]; a family of algorithms BLAST, WU-BLAST, BlastZ, MegaBLAST, BLAT Idea: find matches with significant score statistics Find maximal segment pairs (MSP): segments with significant score 39 BLAST Algorithm Step 1: index DB for words of size W (W-mers); index query sequence for W-mers with score >Threshold W= 3 for protein, 11 for nucleotides Step 2: search for matches with high score (HSP=high scoring pairs) Step 3: extend hits to maximal score segments Step 4: report matches with score above S 40 20

BLAST Step 1-3: Finding Short High-Scoring Pairs (HSP) Create an index of W-mers for database & query For proteins W=3 a dictionary of 20 3 =8000 words Match W-mers that score above a threshold T FASTA searches for exact matches of k-mers BLAST searches for high scoring pairs (HSP) Key idea: exploit fast part of the search to max the score rather than push maximization for later, slower, phases 41 From A. Baxevanis: Nucleotide and Protein Sequence Analysis via Kellis & Indyk, MIT, BLAST & Database Search, Lecture 2 Blast Steps 3-4: Extending Short HSPs The short HSPs are extended to increase the score Score Max score extension Report above threshold HSPs and their scores 42 21