4 FastA and the chaining problem

We will discuss:

- Heuristics used by the FastA program for sequence alignment
- The chaining problem

4.1 Sources for this lecture

- Lectures by Volker Heun, Daniel Huson and Knut Reinert, in particular last year's lectures
- Dan Gusfield: Algorithms on Strings, Trees, and Sequences. Computer Science and Computational Biology. Cambridge University Press, Cambridge, 1997. ISBN 0-521-58519-8
- David W. Mount: Bioinformatics. Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, New York, 2001. ISBN 0-87969-608-7
- File fasta3x.doc from the FastA distribution

4.2 Motivation

A local alignment of two sequences x, y can be computed by the Smith-Waterman algorithm, which requires O(|x| |y|) time and space in its original form. While the space requirement can be pushed down to O(|x| + |y|) using a divide-and-conquer algorithm, the running time remains O(|x| |y|), which is not feasible for many applications. Therefore, heuristic approaches are of great importance in practice.

Heuristics. A heuristic is an (often problem-specific) algorithm that yields reasonable results, even if it is not provably optimal or even lacks a performance guarantee. In most cases it is intuitively clear that the method works / is fast / is space efficient (in all / many / typical / special cases), but we do not (yet?) have a sound analysis for it.

4.3 FastA

FastA is a program for heuristic pairwise local alignment of nucleic and amino acid sequences. FastA is much faster than the exact algorithms based on dynamic programming and can be used to search a database of sequences for homologs of a query sequence. However, since its speed rests on heuristic considerations, it may fail to detect similar sequences in certain cases.

The name FastA is an abbreviation for "fast all". (There was a version for nucleic acids before.) The official pronunciation is "fast-ay". The first version is by Pearson and Lipman, 1988; the most recent version (3.4) is from August 2002.

4.4 Main steps of the FastA algorithm

1. Find hot-spots. A hot-spot is a short, exact match between the two sequences.
2. Find diagonal runs. A diagonal run is a collection of hot-spots on the same diagonal within a short distance from each other.
3. Rescore the best diagonal runs. This is done using a substitution matrix. The best initial region is INIT1.
4. Chain several initial regions. This is where the chaining problem comes up. The result is INITN.
5. Moreover, compute an optimal local alignment in a band around INIT1. The result is called OPT.
6. Use Smith-Waterman alignments to display the final results.

4.5 Hot-spots

Essentially, a hot-spot is a dot in a dot plot.

Definition. Let the input sequences be x = x[1..m] and y = y[1..n]. Then a hot-spot is a pair (i, j) such that the substrings of length k starting at positions i and j are equal, i.e.,

    x[i..i+k-1] = y[j..j+k-1].

Here k = ktup (for "k-tuple") is a user-specified parameter.

The idea behind hot-spots is that in most cases, homologous sequences will contain at least a few identical segments. By default, FastA uses ktup = 2 for amino acids and ktup = 6 for nucleic acids. Larger values of ktup increase speed (because fewer hot-spots have to be considered in subsequent steps) and precision, but decrease recall.

4.6 Precision and recall

Precision and recall are concepts from information retrieval and can be defined as follows:

Precision: the probability that a retrieved item is correct,

    precision = TP / (TP + FP).

Recall: the probability that a correct item is retrieved,

    recall = TP / (TP + FN)

(where TP is the number of true positives, FP of false positives, FN of false negatives). A perfect precision score of 1.0 means that every result retrieved by a search was relevant (but says nothing about whether all relevant items were retrieved), whereas a perfect recall score of 1.0 means that all relevant items were retrieved by the search (but says nothing about how many irrelevant items were also retrieved).
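As a quick illustration of these two formulas, here is a minimal Python sketch (the counts are made-up example values, not FastA output):

    def precision(tp, fp):
        # fraction of the retrieved items that are correct
        return tp / (tp + fp)

    def recall(tp, fn):
        # fraction of the correct items that are retrieved
        return tp / (tp + fn)

    # e.g. a search reporting 80 true hits and 20 false hits, while missing 40 true ones:
    print(precision(80, 20))  # 0.8
    print(recall(80, 40))     # 0.666...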

4.7 Diagonal runs

If there are similar regions, they will show up in the dot plot as a number of hot-spots which lie roughly on a diagonal line. FastA gains efficiency by relying on the heuristic assumption that a local alignment will contain several hot-spots along the same diagonal line.

Definition. A diagonal run is a set of hot-spots {(i_1, j_1), ..., (i_l, j_l)} such that i_p - j_p = i_q - j_q for all 1 <= p, q <= l. (They lie on exactly the same diagonal.) A diagonal can contain more than one diagonal run, and a diagonal run need not contain all hot-spots on the same diagonal.

Each diagonal run is assigned a score. Hot-spots score positive. The gaps between them score negative, depending on their length. At this stage, FastA does not yet use a substitution matrix; it just considers whether the sequence positions are identical or not. For each diagonal, the hot-spots on it are joined by FastA into diagonal runs.

The crucial observation is that diagonal runs can be found without looking at the whole matrix. FastA does not create a dot plot! Since ktup is small, we can hash all positions i of x into the bins of a lookup table, using the substring x[i..i+k-1] as a key. (There are 20^2 = 400 bins for amino acids and 4^6 = 4096 bins for nucleic acids, using the default values for ktup.) Collisions in the hash table can be resolved, e.g., by linked lists. We can fill the hash table in a single pass over x. This way the collision lists are already in sorted order.

Example. We use a two-letter alphabet {a, b} and ktup = 2 for simplicity.

    sequence x:  aababbaaab
    positions:   0123456789

    key | collision list
    ----+---------------
    aa  | 0, 6, 7
    ab  | 1, 3, 8
    ba  | 2, 5
    bb  | 4

Using the hash table we can compute the diagonal runs as we walk along sequence y: when we are at position j of y, we look at the positions i_1, i_2, ... of x which we find in the hash bin indexed by y[j..j+k-1]. Each pair (i_1, j), (i_2, j), ... is a hot-spot, and we can update the corresponding diagonal runs (and their scores) accordingly. Two hot-spots (i, j), (i', j') are on the same diagonal iff i - j = i' - j'. Hence the growing diagonal runs (along the diagonals) can be addressed by an array of size m + n - 1. Thus the time needed to compute the diagonal runs is proportional to the number of hot-spots, which is usually much smaller than |x| |y|.

Figure A shows the diagonal runs and the hot-spots which are found using the hash table.
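The lookup-table idea can be sketched in a few lines of Python (this is only an illustration, not FastA's actual implementation; grouping hot-spots by diagonal stands in for the full diagonal-run scoring):

    from collections import defaultdict

    def find_hotspots(x, y, ktup):
        """Return all hot-spots (i, j) with x[i:i+ktup] == y[j:j+ktup]."""
        table = defaultdict(list)
        for i in range(len(x) - ktup + 1):      # single pass over x
            table[x[i:i+ktup]].append(i)        # collision lists end up sorted
        hotspots = []
        for j in range(len(y) - ktup + 1):      # walk along y
            for i in table.get(y[j:j+ktup], []):
                hotspots.append((i, j))
        return hotspots

    def hotspots_by_diagonal(x, y, ktup):
        """Group hot-spots by their diagonal i - j."""
        diagonals = defaultdict(list)
        for (i, j) in find_hotspots(x, y, ktup):
            diagonals[i - j].append((i, j))
        return diagonals

    # toy input from the example above
    print(find_hotspots("aababbaaab", "abba", 2))

The running time is proportional to the number of hot-spots (plus the lengths of the two sequences), as claimed above.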

[Figure A: dot plot of sequence x (horizontal axis, with positions i_1, i_2, i_3) against sequence y (vertical axis, position j), showing the hot-spots found via the hash table and the diagonal runs they form.]

4.8 Rescoring the best diagonal runs

Next, the 10 highest-scoring diagonal runs are identified and rescored using a substitution matrix. Note that thus far, FastA still does not allow any indels in the alignment; it just counts matches and mismatches.

Definition. The best diagonal run is called INIT1. It is marked with a * in Figure B below.

4.9 Chaining of initial regions

To account for indels, FastA tries to chain several initial regions into a better alignment.

Definition. The resulting alignment is called INITN.

The chaining problem can be modeled as a purely graph-theoretical problem...

4.10 The chaining problem

Chaining problem (or heaviest-path-in-DAG problem): Find a heaviest path in a directed acyclic graph G = (V, E) with weights w : V ∪ E -> Z on the vertices and edges.

Two diagonal runs can be chained iff they do not overlap in either of the sequences. This is the case iff the lower right end of one diagonal run lies to the left of and above the upper left end of the other one.

Formally, let

    r  = ( x[i .. i+p-1],   y[j .. j+p-1] )   and
    r' = ( x[i' .. i'+q-1], y[j' .. j'+q-1] )

be diagonal runs; then we define

    r < r'  :<=>  i + p - 1 < i'  and  j + p - 1 < j'.

Let G = (V, E) be the directed graph whose vertex set V consists of all diagonal runs and whose edge set is E := {(r, r') | r < r'}. Clearly, G is acyclic, so we can find a topological ordering of G in O(|V| + |E|) time. That is, we can write V = {r_1, ..., r_N} such that r_i < r_j implies i < j.

Moreover, FastA assigns a score s(r) > 0 to every diagonal run r in V and a score s(e) < 0 to every edge e in E. Note that Dijkstra's algorithm is not applicable, because we have negative edge weights. However, we can exploit the fact that the graph is acyclic.

Example. For simplicity, each letter scores 2 and all other columns of the (resulting) alignment score -1 (diagonal shortcuts allowed).

[Figure: dot plot with diagonal runs labeled a through k, and the corresponding DAG with vertex scores (b/12, a/8, h/8, g/6, f/6, j/6, k/4, c/4, e/4, i/4, d/4) and negative edge weights (-1, -2, -4, -6, -7 among them).]
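The chainability test and the construction of the DAG edges are straightforward to write down; here is a small Python sketch (the representation of a diagonal run as a tuple (i, j, length) is an assumption made for brevity):

    def chainable(r1, r2):
        """r1 < r2: r1 ends strictly before r2 starts, in both sequences."""
        i1, j1, p = r1
        i2, j2, _ = r2
        return i1 + p - 1 < i2 and j1 + p - 1 < j2

    def chain_dag_edges(runs):
        """All edges (r, r') of the chaining DAG."""
        return [(r1, r2) for r1 in runs for r2 in runs if chainable(r1, r2)]

    # hypothetical diagonal runs (i, j, length)
    runs = [(0, 0, 3), (5, 4, 2), (2, 6, 2)]
    print(chain_dag_edges(runs))   # [((0, 0, 3), (5, 4, 2))]

Since only a handful of initial regions are chained, the quadratic number of candidate edges is not an issue in practice.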

Formally, we have to compute a path r_π(1) < r_π(2) < ... < r_π(l) that maximizes

    sum_{i=1}^{l-1} ( s(r_π(i)) + s(r_π(i), r_π(i+1)) ) + s(r_π(l)).

In the following, we show how to use the algorithm DAG-shortest-paths for this task.

    DAG-shortest-paths(G, w, s, d, π)
    // G = (V, E): directed acyclic graph
    // Adj : V -> P(V): set of successor nodes, this implements E
    // w : E -> Z: edge weights
    // d : V -> Z: distance from s as computed by the algorithm
    // π : V -> V: predecessor
    // V = {s = v_1, v_2, ..., v_n}: topological ordering starting with s
    {
        d[s] <- 0; π[s] <- nil;
        for each v in V \ {s} do d[v] <- +infinity; π[v] <- nil; od
        for i = 1 to n do
            for each v in Adj(v_i) do
                c <- d[v_i] + w(v_i, v);
                if c < d[v] then
                    d[v] <- c;
                    π[v] <- v_i;
                fi
            od
        od
    }

Theorem. Let G = (V, E) be a directed acyclic graph, let s ∈ V be a vertex without predecessors and let w : E -> R be a weight function on the edges (which may take negative values). Let v_1, ..., v_n be the topological ordering used by algorithm DAG-shortest-paths. Then the following holds: after the loop for v_i has been run through, d[v_i] equals dist_w(s, v_i), the weight of a shortest path from s to v_i, and one such path is given by s, ..., π(π(v_i)), π(v_i), v_i.

Proof. We apply induction on i = 1, ..., n. For v_i = s there is nothing to be shown. Next we derive two inequalities from which together the claim will follow.

Since backtracking from v_i yields a path s, ..., π(π(v_i)), π(v_i), v_i of weight d[v_i], we have

    d[v_i] >= dist_w(s, v_i).

Now let P = (s = u_0, u_1, ..., u_k = v_i) be a shortest path. Since u_{k-1} is a predecessor of v_i, and the loop runs through the vertices in a topological ordering, u_{k-1} was processed before v_i, and by our inductive assumption, d[u_{k-1}] = dist_w(s, u_{k-1}). Due to the relaxation step we have

    d[v_i] = d[u_k] <= d[u_{k-1}] + w(u_{k-1}, u_k) = dist_w(s, u_{k-1}) + w(u_{k-1}, u_k) = dist_w(s, v_i).

By combining these two inequalities, it follows that d[v_i] = dist_w(s, v_i).
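For concreteness, a direct Python transcription of this pseudocode might look as follows (a sketch; the adjacency-dict graph representation and the toy instance are assumptions, not part of the lecture):

    import math

    def dag_shortest_paths(vertices, adj, w, s):
        """vertices: list in topological order, starting with s;
           adj[u]: successors of u; w[(u, v)]: edge weight (may be negative)."""
        d = {v: math.inf for v in vertices}
        pred = {v: None for v in vertices}
        d[s] = 0
        for u in vertices:                      # process vertices in topological order
            for v in adj.get(u, []):
                c = d[u] + w[(u, v)]            # relaxation step
                if c < d[v]:
                    d[v] = c
                    pred[v] = u
        return d, pred

    # toy DAG with a negative edge
    vertices = ["s", "a", "t"]
    adj = {"s": ["a", "t"], "a": ["t"]}
    w = {("s", "a"): 2, ("a", "t"): -5, ("s", "t"): 1}
    d, pred = dag_shortest_paths(vertices, adj, w, "s")
    print(d["t"], pred["t"])   # -3 a

Backtracking through pred from any vertex reconstructs a shortest path, exactly as in the proof above.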

4.11 Reduction of the problem

The chaining problem in FastA appears in a slightly different form from the one expected by the DAG shortest-path algorithm, so we need to transform the input before we can apply it. There are three issues:

1. Allow any vertex to play the role of s and t.
2. Eliminate the vertex weights.
3. Reverse the optimization goal.

Assume that G = (V, E), w : V ∪ E -> Z is an instance of the heaviest-path-in-DAG problem.

Arbitrary source and target. Add a new vertex s with outgoing edges to all other vertices. Let w(s, x) = 0 and w(s) = 0. Add a new vertex t with incoming edges from all other vertices. Let w(x, t) = 0 and w(t) = 0.

Eliminating the vertex weights. Replace w(u, v) by w'(u, v) := w(u, v) + w(v), for all edges (u, v). Then any path s = v_1, v_2, ..., v_k = t has weight

    w(v_1) + w(v_1, v_2) + w(v_2) + w(v_2, v_3) + ... + w(v_{k-1}, v_k) + w(v_k)

in both problems (note that w(v_1) = w(s) = 0).

Reversing the optimization goal. Replace w' by w'' := -w'. Replace max by min.

Example. (Finished on blackboard)
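A minimal Python sketch of this reduction (the dict-based representation of the instance is an assumption; the result can be fed directly to a DAG shortest-path routine such as the one above):

    def reduce_heaviest_path(vertex_w, edge_w):
        """Turn a heaviest-path instance with vertex and edge weights into a
           single-pair shortest-path instance with edge weights only."""
        new_w = {}
        # fold each vertex weight into its incoming edges, then negate (max -> min)
        for (u, v), wuv in edge_w.items():
            new_w[(u, v)] = -(wuv + vertex_w[v])
        # artificial source s and target t with weight 0
        for v in vertex_w:
            new_w[("s", v)] = -vertex_w[v]
            new_w[(v, "t")] = 0
        return new_w

    # small instance matching the path c -> k -> f discussed in the example below
    # (the individual edge weights -1 and -3 are inferred from the totals given there)
    vertex_w = {"c": 4, "k": 4, "f": 6}
    edge_w = {("c", "k"): -1, ("k", "f"): -3}
    print(reduce_heaviest_path(vertex_w, edge_w))

In this reduced instance the shortest s-t path has weight -10, the negative of the heaviest-path weight 10 before the reduction.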

[Figure: the chaining DAG from the example above, with vertex scores (c/4, i/4, b/12, k/4, g/6, d/4, a/8, f/6, h/8, j/6, e/4) and negative edge weights.]

To illustrate the idea, consider the path c -> k -> f.

    path before reduction:   c        k        f
    weights:                 4   -1   4   -3   6        total: 10

    path after reduction:    s    c    k    f    t
    edge weights:              4    3    3    0         total: 10

(After the final negation w'' = -w', the path weight becomes -10 and maximization turns into minimization.)

Figure C shows the result (INITN) of chaining the initial regions in Figure B.

4.12 Banded alignment

Apart from solving the chaining problem, FastA makes another heuristic attempt to improve INIT1.

Definition. A banded alignment is an alignment that stays within a certain diagonal band of the DP matrix.

FastA computes a banded alignment around the diagonal containing INIT1.

Definition. This alignment is called OPT.

Note that OPT is not necessarily optimal! We use the (confusing) terminology from the original paper. %-)

To compute a banded alignment, we simply ignore entries of the DP matrix which are outside the band. Of course, the recurrence has to be modified in the obvious way for cells at the boundary of the band. This reduces the running time and the space requirement.

A band can be specified by an interval I and the condition that computed entries (i, j) satisfy i - j ∈ I. An example with I = {-1, 0, 1} (only the cells marked * are computed):

    j:    1 2 3 4 5
    i=1   * *
    i=2   * * *
    i=3     * * *
    i=4       * * *
    i=5         * *

4.13 Final application of Smith-Waterman

As is usual practice with heuristic alignment programs, FastA computes an exact local alignment using the Smith-Waterman algorithm (or some variant thereof) before it displays the final result. This is possible in reasonable time, because the search has been narrowed down to a small region.

4.14 Concluding remarks on FastA

FastA is actually a collection of programs which are customized for various tasks (DNA, proteins, peptide collections, mixed input types).

If FastA is used to search through a large database, one will often find a relatively good hit simply by chance. Therefore, FastA also computes a statistical significance score. This information is very important for assessing the results; however, the underlying theory is beyond this lecture...
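To make the banded recurrence concrete, here is a minimal Python sketch of a banded global alignment score under a simple linear gap penalty (an illustration only; FastA's banded OPT computation, its scoring parameters, and its local-alignment variant differ):

    import math

    def banded_alignment_score(x, y, band, match=2, mismatch=-1, gap=-2):
        """Global alignment score, computed only for DP cells (i, j) with i - j in band."""
        m, n = len(x), len(y)
        D = {}
        def get(i, j):
            # cells outside the band (or outside the matrix) behave like -infinity
            return D.get((i, j), -math.inf)
        for i in range(m + 1):
            for j in range(n + 1):
                if i - j not in band:
                    continue                        # ignore cells outside the band
                if i == 0 and j == 0:
                    D[(i, j)] = 0
                    continue
                best = -math.inf
                if i > 0 and j > 0:
                    s = match if x[i - 1] == y[j - 1] else mismatch
                    best = max(best, get(i - 1, j - 1) + s)   # diagonal step
                if i > 0:
                    best = max(best, get(i - 1, j) + gap)     # gap in y
                if j > 0:
                    best = max(best, get(i, j - 1) + gap)     # gap in x
                D[(i, j)] = best
        return D.get((m, n), -math.inf)

    # band I = {-1, 0, 1} as in the example above
    print(banded_alignment_score("GATTACA", "GATTTCA", band={-1, 0, 1}))   # 11

The work is proportional to the band width times the sequence length rather than to |x| |y|, which is exactly the saving the banded OPT computation exploits.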