
Bioinformatics I, WS 09-10, S. Henz (script by D. Huson), November 26, 2009

4 BLAST and BLAT

Outline of the chapter:
1. Heuristics for the pairwise local alignment of two sequences
2. BLAST: search and statistics
3. BLAT: search and statistics

4.1 Sequence searches - challenges

A fundamental task in bioinformatics: Given a large database of sequences D and a query sequence Q, find all sequences in D that are homologous to Q.

As of August 15, 2008, GenBank contained 95033791652 bases from 92748599 reported sequences.

The search procedure should
- be fast,
- filter out most sequences (because they are unrelated to the query),
- align only the homologous ones.

Most popular algorithms use a seed-and-extend approach that operates in two steps:
1. Find a set of small exact matches (called seeds).
2. Try to extend each seed match to obtain a long inexact match.

4.2 Sensitivity and Specificity

Classifications:

An event or signal (for example: a DNA sequence is orthologous to a second one, a given DNA sequence is contained in a given coding region, or a gene is differentially expressed) can
- be predicted to occur: Predicted Positive,
- be predicted not to occur: Predicted Negative,
- actually occur: Actual Positive,
- actually not occur: Actual Negative.

The sets of these four types of situations are denoted PP, PN, AP and AN, respectively.

4.3 Sensitivity and Specificity

Based on these classifications, one can compute the number of:

Signal   Detected   Name             Definition
Yes      Yes        True Positive    $TP = |PP \cap AP|$
No       No         True Negative    $TN = |PN \cap AN|$
Yes      No         False Negative   $FN = |PN \cap AP|$
No       Yes        False Positive   $FP = |PP \cap AN|$

4.4 Sensitivity and Specificity

Sensitivity: the probability of correctly predicting a positive example,

$$Sn = \frac{TP}{TP + FN}$$

Specificity: the probability of correctly predicting a negative example,

$$Sp = \frac{TN}{TN + FP}$$

or the probability that a positive prediction is correct,

$$Sp = \frac{TP}{TP + FP}$$

4.5 BLAST: Overview

The BLAST (Basic Local Alignment Search Tool) algorithm [1] is a heuristic for computing optimal local alignments between a query sequence and a database containing one or more subject sequences.

[1] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman: Basic local alignment search tool. J. Molecular Biology 215:403-410 (1990).

BLAST has two main parts:

1. A search algorithm for finding local alignments.
2. An associated theory for estimating the statistical significance of solutions, to help distinguish truly significant similarities from ones that are due to chance.

BLAST searches for words of length k in the query that have a similarity score $\geq T$ with another word of length k in the database. These words are seeds that are extended to HSPs (High-Scoring Segment Pairs). An HSP has the property that it cannot be extended further to the left or right without the score dropping significantly below the best score achieved on part of the HSP. The original BLAST algorithm performs the extension without gaps.

4.5.1 Poisson distribution

The Karlin and Altschul theory (Karlin-Altschul statistics) for local alignments (without gaps) is based on Poisson and extreme value distributions. The details of that theory are beyond the scope of this lecture, but the basics are sketched in the following.

Definition 4.5.1 The Poisson distribution with parameter v is given by

$$P(X = x) = \frac{v^x}{x!}\, e^{-v} \tag{4.1}$$

Note that v is the expected value as well as the variance. From equation (4.1) it follows that the probability that a variable X will have a value of at least x is

$$P(X \geq x) = 1 - \sum_{i=0}^{x-1} \frac{v^i}{i!}\, e^{-v} \tag{4.2}$$

4.5.2 Statistical significance of an HSP

Assume we are given an HSP (s, t) with score $\sigma(s, t)$. How significant is this match (i.e. local alignment)?

To analyze how high a score is likely to arise by chance, a model of random sequences over the alphabet $\Sigma$ is needed. Given the scoring matrix S(a, b), the expected score for aligning a random pair of amino acids or bases is required to be negative:

$$E = \sum_{a,b \in \Sigma} p_a\, p_b\, S(a, b) < 0$$

Were this not the case, long alignments would tend to have a high score independently of whether the aligned segments were related, and the statistical theory would break down.
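As a small numerical illustration of equations (4.1) and (4.2) and of the negative-expected-score condition, the following Python sketch may help. The function names, the uniform base frequencies and the +1/-1 scoring scheme are assumptions chosen for the example and are not taken from the script.

```python
from math import exp, factorial


def poisson_pmf(x, v):
    """Equation (4.1): P(X = x) for a Poisson distribution with parameter v."""
    return v ** x / factorial(x) * exp(-v)


def poisson_at_least(x, v):
    """Equation (4.2): P(X >= x) = 1 - sum_{i=0}^{x-1} P(X = i)."""
    return 1.0 - sum(poisson_pmf(i, v) for i in range(x))


def expected_score(freqs, score):
    """Expected score of aligning a random pair of letters:
    sum over all a, b of p_a * p_b * S(a, b)."""
    return sum(freqs[a] * freqs[b] * score(a, b) for a in freqs for b in freqs)


# Assumed toy model: uniform DNA background, match = +1, mismatch = -1.
uniform_dna = {base: 0.25 for base in "ACGT"}


def toy_score(a, b):
    return 1 if a == b else -1


if __name__ == "__main__":
    print(poisson_at_least(1, 2.0))                # probability of at least one event
    print(expected_score(uniform_dna, toy_score))  # must be negative
```

With this toy scheme a random pair matches with probability 1/4, so the expected score is negative, as the theory requires.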
4.5.3 Statistical significance

Assume that the lengths m and n of the query and the database, respectively, are sufficiently large.

The number of random HSPs (s, t) with $\sigma(s, t) \geq S$ can be described by a Poisson distribution with parameter $v = Kmn\, e^{-\lambda S}$. The number of HSPs with score $\geq S$ that we expect to see due to chance is then the parameter v, also called the E-value:

$$E(S) = Kmn\, e^{-\lambda S}$$

The parameters K and $\lambda$ depend on the background probabilities of the symbols and on the employed scoring matrix. We define $\lambda$ as the unique positive value for y that satisfies the equation

$$\sum_{a,b \in \Sigma} p_a\, p_b\, e^{S(a,b)\, y} = 1$$

K and $\lambda$ are scaling factors for the search space and for the scoring scheme, respectively.

Hence the probability of finding exactly x HSPs with a score $\geq S$ is given by

$$P(X = x) = e^{-E}\, \frac{E^x}{x!},$$

where E is the E-value for S. The probability of finding at least one HSP by chance is

$$P(S) = 1 - P(X = 0) = 1 - e^{-E}.$$

Thus we see that the probability distribution of the scores follows an extreme value distribution.

BLAST reports E-values rather than P-values because it is easier to interpret the difference between E-values than the difference between P-values.

The raw scores S are of little use without detailed knowledge of the scoring system used, that is, of the statistical parameters K and $\lambda$. Therefore a normalized score called the bit score $S'$ is introduced, defined as

$$S' = \frac{\lambda S - \ln K}{\ln 2}.$$

E-values and bit scores are related by

$$E = mn\, 2^{-S'} \quad \text{(exercise!)}$$

4.6 Gapped BLAST

A new version of BLAST called BLAST 2.0 [2] allows gaps in the extension phase.

[2] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17):3389-3402 (1997).

4.7 The BLAST family

- BLASTN: compares a DNA query sequence to a DNA sequence database

- BLASTP: compares a protein query sequence to a protein sequence database
- TBLASTN: compares a protein query sequence to a DNA sequence database (6-frame translation)
- BLASTX: compares a DNA query sequence (6-frame translation) to a protein sequence database
- TBLASTX: compares a DNA query sequence (6-frame translation) to a DNA sequence database (6-frame translation)
- PHI-BLAST: Pattern-Hit Initiated BLAST; searches for particular patterns in protein queries; incorporated into PSI-BLAST
- PSI-BLAST: Position-Specific Iterated BLAST
  - a profile of the hits is computed
  - the database is searched with the profile
  - many iterations
  - designed to detect weak relationships between the query and members of the database that are not necessarily detectable by standard BLAST searches
  - results in increased sensitivity

4.8 Available BLAST implementations

- NCBI BLAST: Implementation of all BLAST programs, maintained by NCBI.
- AB-BLAST (formerly WU-BLAST): Alternative implementation of all BLAST programs (except for PHI- and PSI-BLAST) but the other BLAST families are

4.9 BLAT

BLAT = BLAST-Like Alignment Tool [3]

Motivation for the development of BLAT:
- For the public assembly of the human genome, 3 million ESTs and 13 million whole genome shotgun reads needed to be mapped to the human genome.
- For EST against genome alignment: 1.75 Gb in 3.72 million ESTs against 2.88 Gb of human DNA.
- Application in particular for large query sequences, e.g. genomes.
- Analyzing vertebrate genomes requires rapid mRNA/DNA and cross-species protein alignments.

BLAT is especially designed for very fast and accurate alignments of both DNA and protein sequences.

BLAST preprocesses the query. BLAT preprocesses the database: it builds an index of all non-overlapping K-mers in the database (genome).

Several stages:
- Use the index to find regions in the genome that are possibly homologous to the query sequence.
- Perform an alignment between such regions.
- Stitch together the aligned regions (often exons) into larger alignments (typically genes).

[3] W. J. Kent: BLAT - The BLAST-Like Alignment Tool. Genome Res. 12:656-664 (2002).

4.9.1 BLAT - Step 1

Preprocessing of the database:
- Index the database with k-words (only once, independent of the query):
  - typically k = 8...16 for nucleotide sequences,
  - typically k = 3...5 for protein sequences.
- For each k-word, store in which sequence of the database it appears (via hashing).

4.10 BLAT vs BLAST

BLAT is similar to BLAST: the program rapidly scans for relatively short matches (hits) and extends these into HSPs. However, BLAT differs from BLAST in some important ways:
- BLAST builds an index of the query string and then scans linearly through the database. BLAT builds an index of the database and then scans linearly through the query.
- BLAST triggers an extension when one or two hits occur. BLAT can trigger extensions on any given number of perfect or near-perfect matches.
- BLAST returns each area of homology as a separate alignment. BLAT stitches them together into larger alignments.
- BLAST delivers a list of exons sorted by size, with alignments extending slightly beyond the edge of each exon. BLAT unsplices mRNA onto the genome, giving a single alignment that uses each base of the mRNA only once, with correctly positioned splice sites.

4.11 Seed-and-extend

Like all fast alignment programs, BLAT uses the two-stage seed-and-extend approach:
- in the seed stage, the program detects regions of the two sequences that are likely to be homologous, and
- in the extend stage, these regions are examined in detail and alignments are produced for the regions that are indeed homologous according to some criterion.

BLAT provides three different methods for the seed stage:
- single perfect K-mer matches,
- single near-perfect K-mer matches, and
- multiple perfect K-mer matches.

4.11.1 3 different seed strategies

The simplest seed method is to look for subsequences of a given size K that are shared by the query and the database: compare every K-mer in the query sequence with all non-overlapping K-mers in the database sequence.

We want to analyze:
1. how many homologous regions are missed (FN), and
2. how many non-homologous regions are passed to the extension stage (FP) using this criterion, thus increasing the running time of the application.

4.11.2 Some definitions

- K: the K-mer size.
- M: match ratio between homologous areas; 98% for cDNA/genomic alignments within the same species, 89% for protein alignments between human and mouse.
- H: the size of a homologous area. For a human exon this is typically 50-200 bp.
- G: database size, e.g. 3 Gb for human.
- Q: query size.
- A: alphabet size; 20 for amino acids, 4 for nucleotides.

[Figure: matches between the query sequence (e.g. cDNA) and the database sequence (e.g. genome).]

4.11.3 Strategy 1: Single perfect matches

Assuming that each letter is independent of the previous letter, the probability that a specific K-mer in a homologous region of the database matches perfectly the corresponding K-mer in the query is

$$p_1 = M^K.$$

Let $T = \lfloor H/K \rfloor$ denote the number of non-overlapping K-mers in a homologous region of length H.

Sensitivity: The probability (of a hit) that at least one non-overlapping K-mer in the homologous region matches perfectly with the corresponding K-mer in the query is

$$P = 1 - (1 - p_1)^T = 1 - (1 - M^K)^T.$$

Specificity: The number of non-overlapping K-mers that are expected to match by chance, assuming all letters are equally likely, is

$$F = (Q - K + 1) \cdot \frac{G}{K} \cdot \left(\frac{1}{A}\right)^K.$$

These formulas can be used to predict the sensitivity and specificity of single perfect nucleotide K-mer matches as a seed-search criterion:

(Source: Kent 2002)

1. For EST alignments, we would like to find seeds for 99% of all homologous regions that have 5% or less sequencing noise (and so are at least 95% identical). From Table 3 we see that, in order to achieve Sn = 0.99, K = 7..14 will work. For K = 14, we can expect that 399 random hits per query will be produced. A smaller value of K will produce significantly more random hits.
2. Comparing mouse and human at the nucleotide level, where there is only 86% identity, is not feasible: Table 3 implies that K = 7 must be used to find 99% of all true hits, but this value generates 13 million random hits per query.

The mouse and human genomes have on average 89% identity at the amino acid level. To find true seeds for 99% of all translated mouse reads requires K = 5 or less. For K = 5, each read will generate 62625 random hits.

(Source: Kent 2002)

4.11.4 Strategy 2: Single near-perfect matches

Now consider the case of near-perfect matches, that is, hits with at most one letter mismatch. The probability that a non-overlapping K-mer in a homologous region of the database matches near-perfectly the corresponding K-mer in the query is (with $T := \lfloor H/K \rfloor$, as above):

$$p_1 = K \cdot M^{K-1} (1 - M) + M^K.$$

Sensitivity: The probability that there exists a non-overlapping K-mer in the homologous region that matches the corresponding K-mer in the query with at most one mismatch is

$$P = 1 - (1 - p_1)^T.$$

Specificity: The number of K-mers that match near-perfectly by chance is

$$F = (Q - K + 1) \cdot \frac{G}{K} \cdot \left( K \left(\frac{1}{A}\right)^{K-1} \left(1 - \frac{1}{A}\right) + \left(\frac{1}{A}\right)^K \right).$$

(Source: Kent 2002)

1. EST alignments: K = 12..22 produce true seeds for 99% of all queries, with one random hit for K = 21.
2. A comparison of mouse reads and the human genome (86% identity) at the nucleotide level would require K = 12 or K = 13 to detect true seeds for 99% of the reads, while generating 68775 random hits (for K = 13).

Sensitivity and specificity of single near-perfect amino acid K-mer matches as a seed-search criterion:

(Source: Kent 2002)

1. For a comparison of translated mouse reads and the human genome, Table 6 indicates that K = 8 would detect true seeds for 99% of all mouse reads, while only generating 749 random hits.

BLAT implements near-perfect matches, allowing one mismatch in a hit, as follows:
- A non-overlapping index of all K-mers in the database is generated.
- Every possible K-mer that matches a K-mer of the query sequence in all positions, or in all but one position, is looked up.
- Hence, this means $K \cdot (A - 1) + 1$ lookups per K-mer. For an amino acid search with K = 8, for example, 153 lookups are required per occurring K-mer.

For a given level of sensitivity, however, the near-perfect match criterion runs slower than the multiple-perfect match criterion and thus is not so useful in practice.

4.11.5 Strategy 3: Multiple perfect matches

An alternative seeding strategy is to require multiple perfect matches that are constrained to be near each other. For example, consider a situation where there are two hits between the query and the database sequences that lie on the same diagonal and are close to each other (within some given distance W), such as a and b here:

[Figure: hits a, b, c and d between the query sequence (e.g. cDNA) and the database sequence (e.g. genome), with the K-mer size K and the window W marked.]

For N = 1, the probability that a non-overlapping K-mer in a homologous region of the database matches perfectly the corresponding K-mer in the query is (as discussed above):

$$p_1 = M^K.$$

The probability that there are exactly n matches within the homologous region is

$$P_n = p_1^n\, (1 - p_1)^{T-n}\, \frac{T!}{n!\,(T-n)!},$$

and the probability that there are N or more matches is the sum

$$P = P_N + P_{N+1} + \dots + P_T.$$

Again, we are interested in the number of matches generated by chance. The expected number of such chains for N = 1 is simply

$$F_1 = (Q - K + 1) \cdot \frac{G}{K} \cdot \left(\frac{1}{A}\right)^K.$$

The probability of a second match occurring within W letters after the first is

$$S = 1 - \left(1 - \left(\frac{1}{A}\right)^K\right)^{W/K},$$

because the second match can occur within any of the W/K non-overlapping K-mers in the database that lie within W letters after the first match.

The number of size-N chains of K-mers in which any two consecutive hits are not more than W apart is

$$F_N = F_1 \cdot S^{N-1}.$$

Prediction of the sensitivity and specificity of multiple (2 and 3) perfect nucleotide K-mer matches as a seed-search criterion:

(Source: Kent 2002)

Prediction of the sensitivity and specificity of multiple (2 and 3) perfect amino acid K-mer matches as a seed-search criterion:

(Source: Kent 2002)

4.12 Generating alignments

BLAT builds a non-overlapping index of all K-mers in the database, ignoring those K-mers that occur too often in the database, those containing ambiguity codes and, optionally, those in lower case ("soft-screened regions").

BLAT then looks up each overlapping K-mer of the query sequence in the index, obtaining a list L of hits. Each hit consists of a database position and a query position.

A number of heuristics are used to generate an alignment of the query sequence to the database. This involves chaining the hits, aligning the gaps between consecutive hits and attempting to place large gaps at splice boundaries.
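The index construction and hit lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not BLAT's actual implementation: the function names, the `max_occ` threshold and the restriction to the upper-case letters A, C, G, T are assumptions made for the example.

```python
from collections import defaultdict


def build_index(db, k, max_occ=None):
    """Index the NON-overlapping k-mers of db: k-mer -> list of start positions.

    K-mers containing anything other than upper-case A, C, G, T are skipped,
    which covers ambiguity codes (e.g. N) and soft-masked lower-case letters.
    If max_occ is given, k-mers occurring more often than that are dropped.
    """
    index = defaultdict(list)
    for pos in range(0, len(db) - k + 1, k):      # step k: non-overlapping
        kmer = db[pos:pos + k]
        if any(c not in "ACGT" for c in kmer):
            continue
        index[kmer].append(pos)
    if max_occ is not None:
        return {w: p for w, p in index.items() if len(p) <= max_occ}
    return dict(index)


def find_hits(query, index, k):
    """Look up every OVERLAPPING k-mer of the query in the index.

    Returns the hit list L as (database position, query position) pairs.
    """
    hits = []
    for qpos in range(len(query) - k + 1):        # step 1: overlapping
        for dpos in index.get(query[qpos:qpos + k], ()):
            hits.append((dpos, qpos))
    return hits


if __name__ == "__main__":
    idx = build_index("ACGTACGTTTGGCCAANNACGT", 4)
    print(idx)
    print(find_hits("GGTTGGCC", idx, 4))
```

Indexing only every K-th position keeps the index roughly K times smaller than an overlapping one, while sliding over the query in steps of one ensures that a shared K-mer is still found regardless of its phase in the query.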

4.13 Clumping hits

As described in Section 4.12, BLAT looks up each overlapping K-mer of the query sequence in the non-overlapping K-mer index of the database, obtaining a list L of hits. Each hit consists of a database position and a query position.

The next step is to form clumps of hits that represent regions in the database sequence that are homologous to the query sequence. Each such clump consists of a number of hits (exceeding a given minimum number of hits) that form a chain in which two consecutive hits are not too far apart from each other and in which the gap size in either sequence does not exceed a given threshold.

Multiple hits are clumped together as follows:
- The hit list L is sorted by database coordinate.
- The list L is split into buckets of size 64 kb each, based on the database coordinate.
- Each bucket is sorted along the diagonal, i.e. hits are sorted by the value of database position minus query position.
- Hits that are within the gap limit are grouped together into proto-clumps.
- Hits within proto-clumps are then sorted by their database coordinate and put into real clumps if they are within the window limit on the database coordinate.
- Clumps within 300 bp or 100 amino acids of each other in the database are merged, and then 500 bp are added to each end of a clump.

[Figure: six hits between the query sequence and the database sequence, shown in three orders. A list of hits: 2 3 4 6 1 5. Sorted by database coordinate: 1 2 3 4 5 6. Sorted along the diagonal: 1 2 3 5 4 6.]

4.14 Nucleotide alignments

Clumping is the first part of the extension stage. In the case of nucleotide alignments, each clump is then processed as follows:
- A hit list is generated between the query sequence q and the homologous region h in the database, looking for smaller, perfect K-mers.
- If a K-mer w in q matches multiple K-mers in h, then w is repeatedly extended by one letter until the match is unique or exceeds a certain size.
- The hits are extended as far as possible, without mismatches.
- Overlapping hits are merged.
- If there are gaps in the alignment in both the query and the database, then the algorithm recurses to fill in the gaps, using a smaller K.
- Then extensions using indels followed by matches are considered.
- Large gaps in the query sequence often correspond to introns, and they are slid around to find the best GT/AG consensus sequence for the intron ends.

4.15 Protein alignments

In the case of amino acid sequences, each clump is processed as follows:
- All hits obtained in the seed stage are extended into maximally scoring ungapped alignments (HSPs) using a score function where a match is worth 2 and a mismatch is worth -1.
- A graph is built with HSPs as nodes.
- If HSP A starts before HSP B in both sequences, then an edge is put from A to B that is weighted by the score of B minus a gap penalty based on the distance between A and B.
- If A and B overlap, then an optimal crossover position x is determined that maximizes the sum of the score of A up to x and of B starting from x, and the edge weight is set accordingly.
- A dynamic programming algorithm then extracts the maximal scoring alignment by traversing the graph.
- The HSPs contained in the path are removed, and if any HSPs are left then the dynamic program is run again.
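The graph-based chaining of HSPs can be illustrated with the following Python sketch. It is a simplification under stated assumptions: HSPs are given as tuples `(q_start, q_end, d_start, d_end, score)` with exclusive ends, only non-overlapping HSPs are connected (the crossover computation for overlapping HSPs is omitted), and the linear `gap_cost` is a hypothetical penalty, not the one used by BLAT.

```python
def best_chain(hsps, gap_cost=0.1):
    """Return (score, indices) of the best-scoring chain of HSPs.

    An edge A -> B exists if A ends before B starts in both sequences; its
    weight is score(B) minus gap_cost times the total distance between them.
    """
    if not hsps:
        return 0.0, []
    # Sorting by start position guarantees predecessors are processed first.
    order = sorted(range(len(hsps)), key=lambda i: (hsps[i][0], hsps[i][2]))
    best, prev = {}, {}
    for rank, i in enumerate(order):
        q_start, _, d_start, _, score = hsps[i]
        best[i], prev[i] = score, None            # chain consisting of i alone
        for j in order[:rank]:
            _, q_end_j, _, d_end_j, _ = hsps[j]
            if q_end_j <= q_start and d_end_j <= d_start:
                gap = (q_start - q_end_j) + (d_start - d_end_j)
                candidate = best[j] + score - gap_cost * gap
                if candidate > best[i]:
                    best[i], prev[i] = candidate, j
    end = max(best, key=best.get)
    chain, node = [], end
    while node is not None:                       # trace back the path
        chain.append(node)
        node = prev[node]
    return best[end], chain[::-1]


def all_chains(hsps, gap_cost=0.1):
    """Repeatedly extract the best chain and remove its HSPs until none remain."""
    remaining, chains = list(hsps), []
    while remaining:
        _, idxs = best_chain(remaining, gap_cost)
        used = set(idxs)
        chains.append([remaining[i] for i in idxs])
        remaining = [h for i, h in enumerate(remaining) if i not in used]
    return chains


if __name__ == "__main__":
    example = [(0, 10, 100, 110, 20), (12, 20, 112, 120, 16), (5, 15, 300, 310, 18)]
    print(best_chain(example, gap_cost=0.5))
    print(all_chains(example, gap_cost=0.5))
```

Because the HSP graph is acyclic, processing the nodes in sorted order and keeping the best predecessor for each node is sufficient to find the maximal scoring path.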

4.15.1 Mouse/Human alignment choices

The similarity between the human and mouse genomes is 86% at the nucleotide level and 89% at the amino acid level (for coding regions). The following table compares DNA vs amino acid alignments, and different seeding strategies:

(Source: Kent 2002)
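The sensitivity and specificity formulas of Sections 4.11.3 to 4.11.5 can be evaluated directly, which makes it easy to explore such trade-offs for other parameter choices. The Python sketch below is a plain transcription of those formulas; the function names and the example values Q = 500, H = 100 and W = 100 are assumptions for illustration and are not taken from Kent's tables.

```python
from math import comb


def sens_perfect(M, H, K, N=1):
    """P(at least N perfect K-mer matches among the T = H // K K-mers)."""
    T = H // K
    p1 = M ** K
    return sum(comb(T, n) * p1 ** n * (1 - p1) ** (T - n) for n in range(N, T + 1))


def sens_near_perfect(M, H, K):
    """P(at least one K-mer matches with at most one mismatch)."""
    T = H // K
    p1 = K * M ** (K - 1) * (1 - M) + M ** K
    return 1 - (1 - p1) ** T


def random_hits_perfect(Q, G, K, A, N=1, W=100):
    """F_N = F_1 * S^(N-1): expected number of chance chains of N perfect hits."""
    F1 = (Q - K + 1) * (G / K) * (1 / A) ** K
    S = 1 - (1 - (1 / A) ** K) ** (W / K)
    return F1 * S ** (N - 1)


def random_hits_near_perfect(Q, G, K, A):
    """Expected number of K-mers matching near-perfectly by chance."""
    p = K * (1 / A) ** (K - 1) * (1 - 1 / A) + (1 / A) ** K
    return (Q - K + 1) * (G / K) * p


def near_perfect_lookups(K, A):
    """Index lookups per query K-mer when one mismatch is allowed."""
    return K * (A - 1) + 1


if __name__ == "__main__":
    # Assumed example: 500-letter query, 3 Gb genome, 100 bp homologous region.
    print(sens_perfect(0.95, 100, 10))
    print(sens_near_perfect(0.95, 100, 10))
    print(random_hits_perfect(500, 3e9, 14, 4))
    print(random_hits_perfect(500, 3e9, 11, 4, N=2))
    print(near_perfect_lookups(8, 20))
```

Since the values of Q and H behind the published tables are not stated here, the printed numbers should be read as orders of magnitude rather than as a reproduction of those tables.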

Bioinformatics I, WS 09-10, S. Henz, November 26, 2009

4.16 FASTA algorithm

The FASTA algorithm [4] [5] uses four steps to calculate three scores that characterize sequence similarity. The two main flavors of the FASTA algorithm are FASTA (for nucleotides) and FASTAP (amino acids).

4.16.1 Step 1

Using a lookup table (see the short explanation of a lookup table below), all identities or groups of identities between two sequences are determined. The word length is set by the ktup parameter (for amino acids: normally ktup = 2, sometimes 1; for DNA: $1 \leq ktup \leq 6$, where 4 and 6 are recommended). In conjunction with the lookup table, all regions of similarity between the two sequences are found by using the diagonal method, counting ktup matches and penalizing intervening mismatches.

- Determine all exact substring matches of length k, i.e. ktups. These seeds are not allowed to contain mismatches before they are combined into regions (seed stage).
- Combine adjacent ktup regions within a diagonal into regions. Every diagonal can contain more than one region.
- ktups are assessed by

  $$v(ktup) = e \cdot (\text{number of matches}) + r \cdot (\text{number of mismatches})$$

  with scores $e > 0$ and $r < 0$.
- ktups are combined if the score v increases, i.e. if

  $$v(ktup_1) + v(ktup_2) + r \cdot (\text{number of intervening mismatches}) > \max(v(ktup_1), v(ktup_2)).$$

  This last step is repeated as long as the combined regions fulfill this inequality.
- The best 10 (say) such regions with the highest density of identities are saved.

[4] D. J. Lipman and W. R. Pearson: Rapid and sensitive protein similarity searches. Science 227:1435-1441 (1985).
[5] W. R. Pearson and D. J. Lipman: Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85:2444-2448 (1988).
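The seed stage of Step 1 (finding all shared ktups and grouping them by diagonal) can be sketched as follows in Python. This shows only the lookup and diagonal bookkeeping, not the scoring and merging of regions; the function name and the example sequences are assumptions for illustration.

```python
from collections import defaultdict


def ktup_diagonals(s, t, ktup):
    """Group all exact ktup matches between s and t by diagonal.

    A match of s[i:i+ktup] with t[j:j+ktup] lies on diagonal i - j.
    Returns a dict: diagonal -> list of (i, j) match positions.
    """
    # Lookup table: every ktup of s -> positions where it occurs in s.
    table = defaultdict(list)
    for i in range(len(s) - ktup + 1):
        table[s[i:i + ktup]].append(i)

    diagonals = defaultdict(list)
    for j in range(len(t) - ktup + 1):
        for i in table.get(t[j:j + ktup], ()):
            diagonals[i - j].append((i, j))
    return dict(diagonals)


if __name__ == "__main__":
    found = ktup_diagonals("HARFYAAQIVL", "VDMAAQIA", 2)
    # Diagonals with many ktup matches are candidates for high-scoring regions.
    for diag, matches in sorted(found.items(), key=lambda kv: -len(kv[1])):
        print(diag, matches)
```

Consecutive matches on the same diagonal, such as the three overlapping ktups covering AAQI in this example, are what Step 1 merges into a single region.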

4.16.2 Step 2

- The best 10 (say) regions with the highest density of identities are rescanned using a substitution matrix (PAM or BLOSUM matrices).
- The ends of each region are trimmed to include only those residues contributing to the highest score.
- Each region is a partial alignment without gaps, which has an assigned initial score init1.
- These scores are used to rank the library sequences.

4.16.3 Step 3

- Combine regions covered by different diagonals into a longer alignment that has a higher score. This stage entails the insertion of gaps.
- Regions below a given threshold T are neglected.
- Gaps contribute a negative score (linear gap score d).
- These new scores are named initn, with

  $$initn = \sum init1 - (\text{number of gaps}) \cdot d.$$

- The scores initn are not optimized. The best set of regions has to be found (optimization problem).

Formulation as a graph problem:
- Each region is represented by a weighted node.
- Edges with weights represent gaps, where the weight reflects the assessment of the gap.
- Generate an edge (u, v) if
  - region u starts at position (i, j) and terminates at position (i + d, j + d),
  - region v starts at position (i', j'),
  - $i' > i + d$, i.e. v follows after u.
- In this way a directed acyclic graph is generated.
- Find the maximal weighted path in the graph.
- The start and end points can be anywhere (local alignment).
- All shortest paths: Floyd-Warshall, complexity $O(V^3)$.

4.16.4 Step 4

Open question: How good is the score of the alignment found compared to the optimal one? To address this, calculate alternative alignments.

K-band alignment:
- Search for a better alignment score around init1, which was the best region of Step 2.
- Use K = 16, i.e. consider only those residues that lie in a band 32 residues wide centered on the best initial region found in Step 2 (i.e., consider 32 diagonals).
- The optimal alignment within this K-band is reported as the opt score.

4.16.5 FASTA result

The FASTA algorithm uses and reports three scores: init1, initn and opt.

Complexity of the FASTA algorithm: $O(n^3)$, where n is the length of the sequences.

The BLAST algorithm was introduced as a faster alternative to FASTA and is more widely used.
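The K-band idea of Step 4 can be illustrated with a banded local alignment that only fills the cells of the dynamic programming matrix close to a chosen diagonal. The following Python sketch is an assumption-laden illustration rather than FASTA's implementation: it uses a simple match/mismatch score instead of a substitution matrix, a linear gap penalty, and hypothetical parameter names (`diag`, `half_width`).

```python
def banded_local_score(s, t, diag, half_width, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only cells (i, j) of the DP matrix
    with |(i - j) - diag| <= half_width (1-based i into s, j into t)."""
    minus_inf = float("-inf")
    prev = {}                       # scores of row i - 1, keyed by column j
    best = 0
    for i in range(1, len(s) + 1):
        cur = {}
        lo = max(1, i - diag - half_width)
        hi = min(len(t), i - diag + half_width)
        for j in range(lo, hi + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score = max(
                0,                                  # start a new local alignment
                prev.get(j - 1, 0) + sub,           # diagonal move stays in band
                prev.get(j, minus_inf) + gap,       # gap; cell may be outside band
                cur.get(j - 1, minus_inf) + gap,    # gap; cell may be outside band
            )
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best


if __name__ == "__main__":
    # The common substring starts at offset 3 in s, i.e. on diagonal i - j = 3.
    print(banded_local_score("AAAACGTACGT", "ACGTACGT", diag=3, half_width=1))
    print(banded_local_score("AAAACGTACGT", "ACGTACGT", diag=0, half_width=1))
```

With a band of fixed width the work grows linearly with the sequence length instead of quadratically, at the price of missing alignments that leave the band, as the second call shows.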

4.16.6 Supplement: Lookup table

A lookup table provides a rapid method for finding the position of a residue in a sequence. One way to find the A in the sequence NDAPL is to compare A to each residue in the sequence. A faster method is to make a table of all possible residues (20, or 23 with ambiguity codes, for proteins) so that the computer representation of the residue (i.e. A is 1, R is 2, N is 3) is the same as its position in the table. A value is then placed in the table that indicates whether the residue is present in the sequence and, if it is, where it is present. For this example, the table has the value 1 at position 3, 2 at position 4, 3 at position 1, 4 at position 15, 5 at position 11, and the remaining 18 positions are 0. The position of the A in the sequence can then be determined in a single step by looking it up at position 1 in the table.
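The NDAPL example can be reproduced with a short Python sketch. The residue order beyond A, R, N and the three extra symbols B, Z, X are assumptions (a commonly used ordering) chosen so that the table positions agree with the example; Python lists are 0-based, so table position p corresponds to index p - 1.

```python
# Assumed residue order: 20 amino acids followed by 3 ambiguity codes.
RESIDUES = "ARNDCQEGHILKMFPSTWYVBZX"


def build_lookup(seq):
    """Return a table with one entry per residue type: the 1-based position
    of that residue in seq, or 0 if it does not occur.

    Simplification: if a residue occurs several times, only the last
    position is kept, which suffices for this example.
    """
    table = [0] * len(RESIDUES)
    for pos, residue in enumerate(seq, start=1):
        table[RESIDUES.index(residue)] = pos
    return table


def position_of(residue, table):
    """Single-step lookup of a residue's position (0 means absent)."""
    return table[RESIDUES.index(residue)]


if __name__ == "__main__":
    table = build_lookup("NDAPL")
    print(table)
    print(position_of("A", table))
```

For FASTA's ktup lookup the same idea is applied to words of length ktup rather than to single residues, and each table entry holds all positions of the word.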