Sequence Alignment Heuristics


1 Sequence Alignment Heuristics
Some slides from:
- Iosif Vaisman, GMU, mason.gmu.edu/~mmasso/binf630alignment.ppt
- Serafim Batzoglou, Stanford
- Geoffrey J. Barton, Oxford: Protein Sequence Alignment and Database Scanning
CG 2015

2 Why Heuristics?
Motivation: dynamic programming guarantees an optimal solution and is efficient, but it is not fast enough when searching a database of size ~$10^{12}$ with a query of length [...] bp.

3 GenBank Growth [figure]

4 Possible Solutions
Solutions:
- Implement on hardware. (COMPUGEN)
- Parallel hardware. (MASSPAR)
- Ad-hoc implementations using specific hardware.
- Use faster heuristic algorithms.
- Limit the number of allowed indels.
- Look for long matching subsequences.
- Use indexing/hashing.
Common heuristics: FASTA, BLAST
CG Ron Shamir, 09

5 Key observations
- Even O(m+n) time would be problematic when the db size is huge.
- Substitutions are much more likely than indels.
- Homologous sequences contain many matches.
- Numerous queries are run on the same db.
- Preprocessing of the db is desirable.

6 Indexing-based local alignment
- Dictionary: all words of length k (~10). An alignment is initiated between words of alignment score ≥ T.
- Alignment: ungapped extensions until the score falls below a statistical threshold.
- Output: all local alignments with score > statistical threshold.
[Figure: the query is scanned against the indexed DB.]
CS262 Lecture 3, Win06, Batzoglou

7 Detour: Banded Alignment
Assume we know that x and y are very similar.
Assumption: # gaps(x, y) < k(N) (say N > M).
Then $x_i$ aligned to $y_j$ implies $|i - j| < k(N)$.
We can align x and y more efficiently: Time, Space: $O(N \cdot k(N)) \ll O(N^2)$
CS262 Lecture 2, Win06, Batzoglou

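8 Banded Alignment
[Figure: DP matrix with $x_1 \dots x_M$ on one axis and $y_1 \dots y_N$ on the other; only a band of width k(N) around the main diagonal is filled.]
Initialization: F(i,0), F(0,j) undefined for i, j > k
Iteration:
For i = 1 ... M
  For j = max(1, i - k) ... min(N, i + k)
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i,j-1) - d, & \text{if } j > i - k(N) \\ F(i-1,j) - d, & \text{if } j < i + k(N) \end{cases}$$
Termination: same
Easy to extend to the affine gap case
CS262 Lecture 2, Win06, Batzoglou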
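The banded recurrence above can be sketched in Python as follows. This is a minimal illustration, not the lecture's code: the function name `banded_global_score`, the fixed band half-width `k`, and the scoring values (match +1, mismatch -1, linear gap penalty d = 2) are all assumptions chosen for the example. Cells outside the band are simply never stored, so a lookup outside the band behaves like minus infinity.

```python
def banded_global_score(x, y, k, match=1, mismatch=-1, d=2):
    """Global alignment score restricted to cells with |i - j| <= k.

    Illustrative sketch: scoring values and the fixed band width k
    are assumptions, not parameters taken from the slides.
    """
    m, n = len(x), len(y)
    if abs(m - n) > k:
        raise ValueError("band too narrow to reach the final cell")
    neg_inf = float("-inf")
    # Only cells inside the band are stored: O(M * k) time and space.
    F = {(0, 0): 0}
    for j in range(1, min(n, k) + 1):
        F[(0, j)] = -d * j
    for i in range(1, min(m, k) + 1):
        F[(i, 0)] = -d * i
    for i in range(1, m + 1):
        for j in range(max(1, i - k), min(n, i + k) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[(i, j)] = max(
                F.get((i - 1, j - 1), neg_inf) + s,  # (mis)match
                F.get((i, j - 1), neg_inf) - d,      # gap in x
                F.get((i - 1, j), neg_inf) - d,      # gap in y
            )
    return F[(m, n)]
```

When the true optimal alignment stays inside the band, the result equals the unbanded Needleman-Wunsch score; a band that is too narrow can only lower the score.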

9 Alignment Dot-Plot Matrix
[Figure: dot-plot matrix of two short DNA sequences; an asterisk marks each cell where the row base equals the column base.]

10 Dot plots. Example 1: close protein homologs (man and mouse) [figure]

11 Example 2: remote protein homologs (man and Bacillus) [figure]

12 Example 1: dot for 4+ matches in window of 5 [figure]

13 Example 2: dot for 4+ matches in window of 5 [figure]
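The filtered dot plots above ("4+ matches in window of 5") can be reproduced with a few lines of Python. This is a sketch under assumptions: the function name `dot_plot` and its parameters are hypothetical, and a dot is attributed to the start position of each window.

```python
def dot_plot(a, b, window=1, min_matches=1):
    """Return the set of (i, j) such that the windows a[i:i+window] and
    b[j:j+window] agree in at least min_matches positions.

    window=1, min_matches=1 gives the plain dot plot;
    window=5, min_matches=4 gives the filtered plot from the slides.
    """
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + t] == b[j + t] for t in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots
```

The window filter removes isolated chance matches, so diagonals that correspond to real similarity stand out more clearly.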

14 FASTA: A Heuristic Method for Sequence Comparison
- History: Lipman and Pearson, 1985, 1988.
- Key idea: a good local alignment must contain exactly matching subsequences.
- Algorithm evaluation: the resulting alignment scores well compared to the optimal alignment (shown experimentally), and the method is much faster than dynamic programming.

15 Disclaimer
Highly popular software tools get numerous updates, revisions, versions, variants, etc. Implementation details differ considerably among versions, and it is hard to single out one ultimate version. We present the basic ideas; details may vary.

16 [Figure: dot matrix of two DNA sequences; asterisks mark matching bases, and the hot spots are labeled.]

17 FASTA overview
ktup = required minimum length of a perfect match
1. Find hot spots = matches of length ktup.
2. Find the 10 best diagonal runs = almost consecutive hot spots on the same diagonal. Best solution = init1.
2.1 Find an optimal sub-alignment in each diagonal.
3. Combine close sub-alignments. Best solution = initn.
4. Compute the best DP solution in a band around initn. Result = opt.

18 FASTA Step 1: Find hot spots (runs of matches of length ktup). [Figure: dot matrix of Sequence A versus Sequence B.]
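Step 1 is usually implemented with a lookup table of ktup-words. The sketch below is only an illustration of that idea, not FASTA's actual code: the names `hot_spots` and `hits_per_diagonal` are hypothetical. A hit (i, j) lies on diagonal j - i, so counting hits per diagonal already hints at where the best diagonal runs of step 2 will be.

```python
from collections import defaultdict


def hot_spots(query, target, ktup):
    """Return all (i, j) with query[i:i+ktup] == target[j:j+ktup]."""
    # Index every ktup-word of the query by its start positions.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    # One linear pass over the target looks each word up in the index.
    hits = []
    for j in range(len(target) - ktup + 1):
        for i in index.get(target[j:j + ktup], ()):
            hits.append((i, j))
    return hits


def hits_per_diagonal(hits):
    """Count hot spots on each diagonal (diagonal = j - i)."""
    counts = defaultdict(int)
    for i, j in hits:
        counts[j - i] += 1
    return dict(counts)
```

For example, `hot_spots("ACGTAC", "TTACGT", 3)` finds three hot spots, two of which share diagonal 2.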

19 FASTA Step 2: Rescoring using a substitution matrix. The score of the highest-scoring initial region is saved as the init1 score. [Figure: diagonal regions between Sequence A and Sequence B, marked from high score to low score.]

20 FASTA Step 3: Joining. A joining threshold eliminates disjointed segments. Non-overlapping regions are joined; the score equals the sum of the scores of the regions minus a gap penalty. The score of the highest-scoring region at the end of this step is saved as the initn score. [Figure: Sequence A versus Sequence B.]

21 FASTA Algorithm (2)
2. Find the 10 best diagonal runs and init1.
3. Allowing indels, combine close diagonal runs:
- Construct an alignment graph: nodes = sub-alignments (SAs), with weight = alignment score (from 1).
- Edges between SAs that can fit together; the edge weight is negative and depends on the size of the corresponding gap.
- Find a maximum-weight path in the graph: initn.
[Figure: alignment graph.]
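The maximum-weight path in the alignment graph can be found with a simple dynamic program over the sub-alignments, because an edge always leads from an earlier region to a later one. The sketch below is an assumption-laden illustration: the region format `(i_start, j_start, length, score)`, the function name `best_chain`, and the penalty form (`join_penalty` plus `shift_penalty` per diagonal shifted) are hypothetical and are not FASTA's actual joining rule.

```python
def best_chain(regions, join_penalty=20, shift_penalty=1):
    """Return (initn_score, chain) for ungapped diagonal regions.

    regions: list of (i_start, j_start, length, score).
    chain:   indices into regions, in alignment order.
    The penalty form is an illustrative assumption.
    """
    if not regions:
        return 0, []
    order = sorted(range(len(regions)), key=lambda r: regions[r][0])
    best, prev = {}, {}
    for v in order:
        vi, vj, _, vscore = regions[v]
        score, back = vscore, None
        for u in best:  # regions that start no later than v in the query
            ui, uj, ulen, _ = regions[u]
            # u can precede v only if it ends before v starts in both sequences
            if ui + ulen <= vi and uj + ulen <= vj:
                shift = abs((vj - vi) - (uj - ui))
                cand = best[u] - (join_penalty + shift_penalty * shift) + vscore
                if cand > score:
                    score, back = cand, u
        best[v], prev[v] = score, back
    end = max(best, key=best.get)
    chain, node = [], end
    while node is not None:
        chain.append(node)
        node = prev[node]
    return best[end], chain[::-1]
```

With a large enough joining penalty no regions are joined, and the result degenerates to the single best region, i.e. init1.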

22 FASTA Step 4: Alignment optimization using dynamic programming. The score for this alignment is the opt score. [Figure: Sequence A versus Sequence B.]

23 FASTA Output
The information on each hit includes:
- General information and statistics
- SW score, % identity and length of overlap

24 Statistical significance
Key question: how significant is the score x that was obtained?
Scores are not normally distributed.
Solution 1: view the distribution of scores over all database entries and see how far out x is.

25 Output of Fasta 2

26 Distribution of initial scores with ktup=2.
[Figure: histogram of the initn and init1 scores over the database entries.]
Statistics exclude scores greater than 73
mean initn score: 26.8 (7.79)
mean init1 score: 26.0 (6.05)
5349 scores better than 33 saved, ktup: 2, variable pamfact
joining threshold: 28
The best scores are listed in the columns: initn, init1, opt

27 Statistical significance (2)
Key question: how significant is the score x that was obtained?
Solution 2: $\mu$ = average score of a random sequence; $\sigma$ = standard deviation.
Z-score: $z = (x - \mu) / \sigma$
Rule of thumb: z > 3 possibly significant, z > 6 probably significant, z > 10 significant.
Issues: sensitivity vs. selectivity. Pertinence to biology is the bottom line.
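The Z-score and the rule of thumb translate directly into code. In this sketch the background scores stand for scores against random (for example shuffled) sequences; the function names and the use of the sample standard deviation are assumptions made for the example.

```python
from statistics import mean, stdev


def z_score(x, background_scores):
    """z = (x - mu) / sigma, with mu and sigma estimated from background scores."""
    mu = mean(background_scores)
    sigma = stdev(background_scores)  # sample standard deviation (an assumption)
    return (x - mu) / sigma


def significance(z):
    """Rule of thumb from the slide."""
    if z > 10:
        return "significant"
    if z > 6:
        return "probably significant"
    if z > 3:
        return "possibly significant"
    return "not significant"
```

For instance, a score of 38 against background scores 24, 26, 28 (mean 26, standard deviation 2) gives z = 6.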

28 August 1997: NCBI Director David Lipman (far left) coaches Vice President Gore (seated) as he searches PubMed. NIH Director Harold Varmus (center) and NLM Director Donald Lindberg look on.

29 Bill Pearson
Bill Pearson received his Ph.D. in Biochemistry in 1977 from the California Institute of Technology. He then did postdoctoral fellowships at the Caltech Marine Station in Corona del Mar, CA, and at the Department of Molecular Biology and Genetics at Johns Hopkins. In 1983 he joined the Department of Biochemistry at the University of Virginia.

30 BLAST: Basic Local Alignment Search Tool
Altschul, Gish, Miller, Myers and Lipman
- Motivation: need to increase the speed of FASTA by finding fewer and better spots during the algorithm.
- The core of the algorithm: finding fewer and better hot spots, but not insisting on perfect matches in them.
- Some statistical results on the significance of the results.
- Different versions for protein, DNA, ...


32 BLAST outline
- Compile a list of high-scoring words with the query.
- Scan the database for hits.
- Extend hits.

33 BLAST Algorithm 1
- Query sequence of length L.
- Maximum of L-w+1 words (typically w = 3 for proteins).
- Word list: for each word from the query sequence, find the list of words with high score using a substitution matrix.
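The word-list construction can be sketched by brute force: enumerate every word of length w and keep those scoring at least a threshold against a query word. This is an illustration only. The names `neighborhood`, `word_list`, and `toy_score` are hypothetical, and a toy match/mismatch scoring function stands in for a real substitution matrix such as BLOSUM62. Enumerating all words is feasible only for small w.

```python
from itertools import product


def toy_score(a, b):
    """Toy substitution score (assumption): match +2, mismatch -1."""
    return 2 if a == b else -1


def neighborhood(word, alphabet, score, threshold):
    """All words of len(word) whose ungapped score against word is >= threshold."""
    out = []
    for cand in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, cand))
        if total >= threshold:
            out.append("".join(cand))
    return out


def word_list(query, w, alphabet, score, threshold):
    """Map each high-scoring word to the query positions it was derived from."""
    words = {}
    for pos in range(len(query) - w + 1):  # L - w + 1 query words
        for nb in neighborhood(query[pos:pos + w], alphabet, score, threshold):
            words.setdefault(nb, []).append(pos)
    return words
```

With the toy scores, the word "acg" and threshold 3 yield 10 neighbors: the word itself plus the 9 words with exactly one mismatch.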

34 BLAST Algorithm 2: Find exact matches of words from the word list to the database sequences.

35 BLAST Algorithm 3: Maximal Segment Pairs (MSPs). For each exact word match, the alignment is extended in both directions to find high-score segments.

36 A second viewpoint: BLAST - Basic Definitions
- Given two sequences $S_1$ and $S_2$, a segment pair is a pair of equal-length subsequences of $S_1$ and $S_2$, respectively, aligned without spaces.
- A locally maximal segment pair is a segment pair (aligned without spaces, but possibly with mismatches) whose alignment score cannot be improved by extending it or shortening it.
- A maximal segment pair (MSP) in $S_1$, $S_2$ is a segment pair with the maximum score over all segment pairs in $S_1$, $S_2$.
Example (match +2, mismatch -1):
S1 = a g c t g g t t t a
S2 = c t t g a t g g t a
[Figure: the same two sequences shown three times with different segment pairs highlighted.]
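For short sequences the MSP can be computed exactly by scanning every diagonal and finding the best-scoring contiguous stretch on it (a maximum-subarray scan). The sketch below illustrates the definition and is not how BLAST searches; the function name `maximal_segment_pair` is hypothetical, and the default scores are the match +2, mismatch -1 values of the example above.

```python
def maximal_segment_pair(s1, s2, match=2, mismatch=-1):
    """Return (score, start_in_s1, start_in_s2, length) of a maximal segment pair.

    Exhaustive O(len(s1) * len(s2)) scan over all diagonals.
    """
    best = (0, 0, 0, 0)
    for offset in range(-(len(s1) - 1), len(s2)):
        i0, j0 = max(0, -offset), max(0, offset)
        length = min(len(s1) - i0, len(s2) - j0)
        run_score, run_start = 0, 0
        for t in range(length):
            if run_score <= 0:  # a non-positive prefix never helps: restart here
                run_score, run_start = 0, t
            run_score += match if s1[i0 + t] == s2[j0 + t] else mismatch
            if run_score > best[0]:
                best = (run_score, i0 + run_start, j0 + run_start, t - run_start + 1)
    return best
```

On the two example sequences this returns score 9 for the pair gctggt / gatggt (five matches and one mismatch).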

37 BLAST - The Algorithm
- Fix: word length w, thresholds t, C.
- Seek segment pairs of length w and score ≥ t: compile, for each w-long subsequence of the query, the list of all w-long words with similarity score ≥ t to it.
- Scan the database with a shifting w-long window: find every exact occurrence of a word in the list (linear in text length). These occurrences are the hits.
- Extend each such pair and test whether it is contained within a segment pair of score ≥ C (local MSP).
- Typical w values: 3-5 for amino acids, ~12 for nucleotides.

38 Sensitivity-Speed Tradeoff
[Figure: sensitivity and speed compared for long words (k = 15) versus short words (k = 7).]
CS262 Lecture 3, Win06, Batzoglou; Kent WJ, Genome Research 2002

39 Gene Myers, Webb Miller, Warren Gish

40 BLAST statistics
- Theory of Karlin, Altschul, and Dembo on the distribution of the MSP score at random.
- Define parameters $K$, $\lambda$ (depending on the AA distribution).
- Pr(finding a pair of score > S in comparing two random seqs of lengths m, n) = $1 - e^{-y}$, where $y = K m n e^{-\lambda S}$.
- Extreme value distribution (or Gumbel distribution).
- Allows the calculated choice of the smallest C.
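The formula above is easy to evaluate numerically. In the sketch below the function names are hypothetical, and $K$ and $\lambda$ must be supplied by the caller: they depend on the scoring system and residue frequencies, and the values used in the usage note are made up for illustration. `smallest_cutoff` shows what "calculated choice of the smallest C" can mean: the smallest integer score whose probability of arising by chance is at most a chosen level `alpha`.

```python
import math


def expected_hits(score, m, n, K, lam):
    """y = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * score)


def p_value(score, m, n, K, lam):
    """Pr(an MSP of score > S arises by chance) = 1 - exp(-y)."""
    # expm1 keeps precision when y is tiny, where 1 - exp(-y) ~ y.
    return -math.expm1(-expected_hits(score, m, n, K, lam))


def smallest_cutoff(m, n, K, lam, alpha):
    """Smallest integer C with p_value(C) <= alpha."""
    y_max = -math.log1p(-alpha)  # solves 1 - exp(-y) = alpha
    return math.ceil(math.log(K * m * n / y_max) / lam)
```

For example, with the made-up values K = 0.1 and lambda = 0.3, two sequences of length 1000 and alpha = 0.01 give a cutoff of C = 54.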

41 Sam Karlin, Steve Altschul, Amir Dembo

42 Improvement: Gapped BLAST (Altschul et al. 97)
- The original BLAST extends several HSPs and then attempts to combine them without gaps.
- The new version allows gapped extensions for the best segments passing the two-hit condition.
- Approximately one in 50 target sequences reaches the gapped extension step.
- Uses DP on a dynamically changing area (not a band).

43 [Figure: the sensitivity of the two-hit and one-hit heuristics as a function of HSP score.]


44 Gapped BLAST outline
- Find two nearby hits: find two non-overlapping w-long words with score ≥ t, each on the same diagonal within distance A.
- Perform ungapped extension.
- If the score exceeds S, perform gapped extension.
- Apply DP on a changing region: stop the extension when the score falls $X_g$ below the best score attained so far.
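The two-hit condition can be sketched by remembering, for each diagonal, the most recent hit seen while scanning. This is an illustration under assumptions, not the BLAST source: the function name `two_hit_seeds` is hypothetical, hits are given as (query position, database position) word starts, a hit overlapping the remembered one is ignored, and "within distance A" is taken to mean a start-to-start distance of at most A.

```python
def two_hit_seeds(hits, w, A):
    """Return pairs (first_hit, second_hit) that trigger an ungapped extension.

    hits: iterable of (i, j) start positions of w-long word hits.
    A pair qualifies if both hits lie on the same diagonal (j - i),
    do not overlap, and start at most A positions apart.
    """
    last_on_diag = {}
    seeds = []
    for i, j in sorted(hits, key=lambda h: h[1]):  # database scan order
        diag = j - i
        prev = last_on_diag.get(diag)
        if prev is not None:
            gap = j - prev[1]
            if gap < w:
                continue  # overlaps the remembered hit: ignore the newer one
            if gap <= A:
                seeds.append((prev, (i, j)))
        last_on_diag[diag] = (i, j)
    return seeds
```

Because only one hit per diagonal is stored, the check adds constant work per hit, which is what makes lowering the word threshold affordable.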

45 Figure 2. The BLAST comparison of broad bean leghemoglobin I (87) (SWISS-PROT accession no. P02232) and horse β-globin (88) (SWISS-PROT accession no. P02062). The 15 hits with score at least 13 are indicated by plus signs. An additional 22 non-overlapping hits with score at least 11 are indicated by dots. Of these 37 hits, only the two indicated pairs are on the same diagonal and within distance 40 of one another. Thus the two-hit heuristic with T = 11 triggers two extensions, in place of the 15 extensions invoked by the one-hit heuristic with T = 13. Because this is just one example, the relative numbers of hits and extensions at the various settings of T correspond only roughly to the ratios found in a full database search. An ungapped extension of the leftward of the two hit pairs yields an HSP with nominal score 45, or 23.6 bits, calculated using $\lambda_u$ and $K_u$.

46 Figure 3. A gapped extension generated by BLAST for the comparison of broad bean leghemoglobin I (87) and horse β-globin (88). (a) The region of the path graph explored when seeded by the alignment of alanine residues at respective positions 60 and 62. This seed derives from the HSP generated by the leftward of the two ungapped extensions illustrated in Figure 2. The $X_g$ dropoff parameter is the nominal score 40, used in conjunction with BLOSUM-62 substitution scores and a cost of 10 + k for gaps of length k. (b) The path corresponding to the optimal local alignment generated, superimposed on the hits described in Figure 2. The original BLAST program, using the one-hit heuristic with T = 11, is able to locate three of the five HSPs included in this alignment, but only the first and last achieve a score sufficient to be reported. (c) The optimal local alignment, with nominal score 75 and normalized score 32.4 bits. In the context of a search of SWISS-PROT (26), release 34 ([...] residues), using the leghemoglobin sequence (143 residues) as query, the E-value is 0.54 if no edge-effect correction (22) is invoked. The original BLAST program locates the first and last ungapped segments of this alignment. Using sum-statistics with no edge-effect correction, this combined result has an E-value of 31 (21,22). On the central lines of the alignment, identities are echoed and substitutions to which the BLOSUM-62 matrix (18) gives a positive score are indicated by a '+'.


48 Figure 4. The path graph region explored by BLAST during a gapped extension for the comparison of broad bean leghemoglobin I and the E1B protein small T-antigen from human adenovirus type 4 (89) (SWISS-PROT accession no. P10406). The $X_g$ dropoff parameter is the nominal score 40, used in conjunction with BLOSUM-62 substitution scores and 10 + k gap costs. The 22.7 bit HSP that triggers this extension, involving leghemoglobin residues [...] and adenovirus residues [...], is merely a random similarity, and not part of a larger and higher-scoring alignment. The gapped extension is seeded by the alignment of residues 124 and 106. The optimal alignment score through points in the path graph drops steadily as one moves beyond the triggering HSP, and the reverse extension terminates before the beginning of either protein is reached. A total of 2766 path graph cells are explored, with the reverse extension accounting for 2047 of these cells.

49 Relative times spent by the original and gapped BLAST programs on various algorithmic stages

Stage | Original BLAST | Gapped BLAST
Overhead: database scanning, output, etc. | 8 (8%) | 8 (24%)
Calculating whether hits qualify for ungapped extension | - | 12 (37%)
Ungapped extensions | 92 (92%) | 5 (15%)
Gapped extensions | - | 8 (24%)

Speed: ~3 times faster than the original BLAST

50 PSI-BLAST team: Thomas Madden, David Lipman, Alex Schaeffer, Steve Altschul

51 PSI-BLAST
1. Execute BLAST.
2. Compile a PSSM (position-specific score matrix) from the resulting hits.
3. Execute BLAST with the new profile.
4. Iterate to convergence.
Idea: converge on a family.

52 Substitution Matrices

53 Amino Acid Substitution Matrices
- Used to score alignments.
- Reflect evolution of sequences.
- Unitary matrix: $M_{ij} = 1$ if $i = j$, and $0$ otherwise.
- Genetic code matrix: $M_{ij}$ = minimum number of base changes needed to alter a codon of i to a codon of j.

54 Scoring Matrices
Probability theory implies that more similar pairs of sequences will require different matrices than more divergent pairs. Several families of matrices were constructed, to be used according to the level of divergence:
- Probabilistic/evolutionary (global) approach: PAM.
- Functional (local) approach: BLOSUM.
Higher-numbered PAM and lower-numbered BLOSUM matrices are for more divergent sequences.

55 PAM Matrices (Dayhoff et al., 78)
- PAM = Percent (or Point) Accepted Mutation.
- A measuring unit of evolutionary distance between proteins, and a substitution matrix for comparing proteins that distance apart.
- Protein sequences $S_1$, $S_2$ are at an evolutionary distance of one PAM if $S_1$ has converted to $S_2$ with an average of one accepted point mutation per 100 AAs.
- PAM1 should be used for sequences whose evolutionary distance causes 1% difference (Percent Accepted Mutation) between them.
- PAM2 should be used for sequences twice as distant.

56 PAM Matrices (2)
Generating PAM:
- Start with aligned sequences, highly similar, with known evolutionary trees.
- Collect statistics on exchanges.
- Compute the matrix $M_{ij}$ = prob. (j changes to i in one unit).
- Now $M^k$ gives the change probabilities in k units.
- "Log odds": $\log \frac{f(j)\, M^k(i,j)}{f(i)\, f(j)} = \log \frac{M^k(i,j)}{f(i)}$
[Figure: evolutionary tree over the aligned sequences ABCD, AGCD, AGCF, ABIJ, ADIJ, CBIJ, with the observed exchanges marked on its branches.]
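The step from the one-unit matrix $M$ to the k-unit log-odds scores can be sketched in plain Python. Everything concrete here is an assumption made for illustration: the function names, the base-10 logarithm (the slide does not fix the base), and the requirement that all entries of $M^k$ be positive. As on the slide, $M_{ij}$ is the probability that j changes to i, so each column of $M$ sums to 1.

```python
import math


def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, k):
    """M multiplied by itself k times (k >= 0)."""
    n = len(M)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, M)
    return result


def pam_log_odds(M, f, k):
    """Log-odds scores log10(M^k(i, j) / f(i)) for evolutionary distance k.

    M: column-stochastic one-unit mutation matrix; f: background frequencies.
    """
    Mk = mat_pow(M, k)
    n = len(f)
    return [[math.log10(Mk[i][j] / f[i]) for j in range(n)] for i in range(n)]
```

On a toy two-letter alphabet with M = [[0.99, 0.02], [0.01, 0.98]] and f = [2/3, 1/3], the k = 1 scores are positive on the diagonal and negative off it, while for k = 250 all scores are close to 0 because $M^k(i,j)$ approaches $f(i)$.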

57 Properties and caveats
- Markovian model: the state at time n depends only on the state at time n-1.
- Same model for all AA positions.
- Multiple mutations can, and will, occur at the same point.
- We count only accepted (recorded) mutations.
- Assumes a constant molecular clock.
- Ignores indels.
- k PAM difference ≠ k% difference!!!

58 [Figure: plot with axes "Observed % difference" and "Evolutionary distance in PAMs".]

59 Dayhoff's Data
- 71 manually curated evolutionary trees (34 superfamilies)
- Sequences within a tree were <15% different
- 1,572 substitutions overall


63 Margaret Oakley Dayhoff (1925-1983)
A pioneer in the use of computers in chemistry and biology, beginning with her PhD thesis project in 1948. Her work was multi-disciplinary, and used her knowledge of chemistry, mathematics, biology and computer science to develop an entirely new field. She is credited today as one of the founders of the field of Bioinformatics. Dr. Dayhoff was the first woman in the field of Bioinformatics. She was also the first woman to hold office in the Biophysical Society, serving first as Secretary and later as President.


65 BLOSUM (Henikoff & Henikoff, 92)
- PAM: based on highly similar global alignments.
- BLOSUM (BLOcks SUbstitution Matrix): based on short, gapless local alignments.
- Identify blocks: conserved segments in the alignment of proteins from the same family.
- Eliminate sequences that are >x% identical (by deletion/clustering).
- Collect statistics on pairs in each column:
  $q_{ij}$ = prob. of the AA pair $(A_i, A_j)$ in the same column
  $p_i$ = prob. of observing $A_i$
  $e_{ij}$ = freq. of the pair $(A_i, A_j)$ assuming independence: $e_{ij} = p_i^2$ if $i = j$, $2 p_i p_j$ if $i \neq j$
- Odds matrix: $q_{ij} / e_{ij}$. Log odds: $s_{ij} = \log (q_{ij} / e_{ij})$.
- BLOSUM X matrix: $2 s_{ij}$, discretized.
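The pair statistics above can be computed for a single toy block as follows. This is a sketch under stated assumptions: the function name `blosum_scores` is hypothetical, the logarithm is taken base 2 (so that $2 s_{ij}$ is in half-bit units), $p_i$ is derived from the pair frequencies as $q_{ii} + \sum_{j \neq i} q_{ij}/2$, rounding to the nearest integer stands in for "discretized", and the clustering step that removes sequences more than x% identical is omitted.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_scores(block):
    """Half-bit log-odds scores from one gap-free block of aligned sequences.

    Returns {(a, b): score} with a <= b, for pairs observed in the block.
    """
    # Count unordered residue pairs within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    q = {pair: count / total for pair, count in pair_counts.items()}
    # p_i: probability that residue i appears in a pair.
    p = Counter()
    for (a, b), q_ab in q.items():
        if a == b:
            p[a] += q_ab
        else:
            p[a] += q_ab / 2
            p[b] += q_ab / 2
    scores = {}
    for (a, b), q_ab in q.items():
        e_ab = p[a] ** 2 if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(q_ab / e_ab))
    return scores
```

For the toy block AAS, AAS, ASS, AAS this gives +1 for A-A, +2 for S-S and -3 for A-S: identical pairs are seen more often than independence predicts, mixed pairs less often.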

66 Blosum62 [figure]

67 Comparing matrices [figure]

68 PAM vs BLOSUM in different algorithms [figure]

69 Steven & Jorja Henikoff

70 One recipe for selecting a matrix
- Compared sequences are related: 200 PAM or 250 PAM
- Database scanning: 120 PAM
- Local alignment search: 40 PAM, 120 PAM, 250 PAM
- Detection of related sequences using BLAST: BLOSUM 62
- Low PAM: short segments, high similarity
- High PAM: long segments, low similarity
THERE IS NO ONE-SIZE-FITS-ALL MATRIX!
