Lecture: Algorithms and Data Structures for Bioinformatics, WS 2015/16, Week 9


1 Lecture: Algorithms and Data Structures for Bioinformatics, WS 2015/16, Week 9. Tim Conrad, AG Medical Bioinformatics, Institut für Mathematik & Informatik, Freie Universität Berlin. Contains material from Stefan Burkhardt and William Noble (U Washington).

2 Heuristic String Matching

5 The purpose of sequence alignment: homology; function identification. By now, most of the genes of M. jannaschii have been assigned a function, mainly using sequence similarity.

6 DATABASE SEARCHES (AKA HEURISTICS) Tim Conrad, VL AlDaBi, WT015/16 6

7 Possible Result
The best scores are: init1 initn opt z-sc E( )..
SW:PPI1_HUMAN Begin: 1 End: 269! Q00169 homo sapiens (human). phosph e-117
SW:PPI1_RABIT Begin: 1 End: 269! P48738 oryctolagus cuniculus (rabbi e-116
SW:PPI1_RAT Begin: 1 End: 270! P16446 rattus norvegicus (rat). pho e-116
SW:PPI1_MOUSE Begin: 1 End: 270! P53810 mus musculus (mouse). phosph e-116
SW:PPI2_HUMAN Begin: 1 End: 270! P48739 homo sapiens (human). phosph e-96
SPTREMBL_NEW:BAC25830 Begin: 1 End: 270! Bac25830 mus musculus (mouse). 10, e-95
SP_TREMBL:Q8N5W1 Begin: 1 End: 268! Q8n5w1 homo sapiens (human). simila e-95
SW:PPI2_RAT Begin: 1 End: 269! P53812 rattus norvegicus (rat). pho e-94

8 Alignments
SCORES Init1: 1515 Initn: 1565 Opt: 1687 z-score: E(): 2.3e-58
>>GB_IN3:DMU09374 (2038 nt)
initn: 1565 init1: 1515 opt: 1687 Z-score: expect(): 2.3e % identity in 875 nt overlap (83-957: )
u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
DMU09374     AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC
u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA
DMU09374     GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC
u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC
DMU09374     AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC
u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC
DMU09374     AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT

9 Database searches: Why? To discover or verify the identity of a newly sequenced gene. To find other members of a multigene family. To classify groups of genes.

10 Alignment in Real Life. One of the major uses of alignments is to find sequences in a database. The current protein database contains about 10^8 residues! Searching with a target sequence of length 10^3 requires evaluating about 10^11 matrix cells, which takes about three hours at a rate of 10^7 evaluations per second. Quite annoying when, say, 10^3 sequences are waiting to be searched: about four months will be required to complete the analysis!
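
The estimate on this slide is plain arithmetic and can be checked in a few lines. The function name and the 30-day month are illustrative assumptions, and the evaluation rate is left as a parameter; a rate of 10^7 cell evaluations per second reproduces the figures of about three hours per query and about four months for 10^3 queries.

```python
def search_time_seconds(db_residues, query_length, cells_per_second):
    """Time to fill one DP cell per (database residue, query residue) pair."""
    return db_residues * query_length / cells_per_second


# One query of length 10^3 against 10^8 residues: 10^11 cells.
one_query = search_time_seconds(10**8, 10**3, 10**7)
hours_per_query = one_query / 3600

# 10^3 queries, expressed in 30-day months (an assumption for illustration).
months_for_1000 = 1000 * one_query / (3600 * 24 * 30)

print(f"{hours_per_query:.2f} hours per query")
print(f"{months_for_1000:.2f} months for 1000 queries")
```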

11 Database searching. In practice, we cannot use Smith-Waterman to search for sequences in a database. Databases are huge (GenBank: ~30 million sequences; Swiss-Prot: well over 100,000 sequences). S-W is slow: time is proportional to N·n^2, where n = sequence length and N = number of sequences in the database. Instead, use faster heuristic approaches: FASTA and BLAST. Tradeoff: sensitivity vs. false positives. Smith-Waterman is slower, but more sensitive.


13 Heuristic Search. Rather than struggling to find the optimal alignment, we may save a lot of time by employing heuristic algorithms. Execution is much faster, but the search may completely miss the optimal alignment. Two important algorithms: BLAST and FASTA.

14 Database searching: heuristic search algorithms. FASTA (Pearson 1995): uses heuristics to avoid calculating the full dynamic programming matrix; speeds up searches by an order of magnitude compared to full Smith-Waterman; the statistical side of FASTA is still stronger than BLAST. BLAST (Altschul 1990, 1997): uses rapid word lookup methods to completely skip most of the database entries; extremely fast (one order of magnitude faster than FASTA, two orders of magnitude faster than Smith-Waterman); almost as sensitive as FASTA.

15 Basic Intuition 1: Seeds. Observation: real-life matches often contain long strings with gap-less matches. Idea: try to find significant gap-less matches and then extend them.

16 Basic Intuition 2: Banded DP. Observation: if the optimal alignment of s and t has few gaps, then the path of the alignment will be close to the diagonal. Action: to find such a path, it suffices to search in a diagonal band of the matrix. If the diagonal band consists of k diagonals (width k), then dynamic programming takes O(kn), much faster than the O(n^2) of standard DP. [Figure: DP matrix of s against t with a band of width k; labelled cells V(i, i+k/2), V(i, i+k/2+1) and V(i+1, i+k/2+1), with a region marked "out of range".]
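
A minimal sketch of the banded idea, assuming unit-cost edit distance rather than a similarity score (a swapped-in cost model chosen only to keep the example short). The function name and the `half_width` parameter are illustrative: the band holds the k = 2*half_width + 1 diagonals with |i - j| <= half_width, so only O(kn) cells are filled.

```python
def banded_edit_distance(s, t, half_width):
    """Edit distance of s and t using only cells with |i - j| <= half_width.

    Returns float('inf') if the end cell lies outside the band.
    """
    INF = float("inf")
    n, m = len(s), len(t)
    if abs(n - m) > half_width:
        return INF

    # Row 0: aligning the empty prefix of s with t[:j] costs j insertions.
    prev = {j: j for j in range(min(m, half_width) + 1)}

    for i in range(1, n + 1):
        cur = {}
        lo, hi = max(0, i - half_width), min(m, i + half_width)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = i  # i deletions
                continue
            substitute = prev.get(j - 1, INF) + (s[i - 1] != t[j - 1])
            delete = prev.get(j, INF) + 1      # cell above, may be out of band
            insert = cur.get(j - 1, INF) + 1   # cell to the left
            cur[j] = min(substitute, delete, insert)
        prev = cur

    return prev.get(m, INF)
```

Cells outside the band are simply treated as infinitely expensive, which is how the "out of range" neighbours are handled here. If the true optimal path leaves the band, the result is only an upper bound on the unrestricted distance.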

17 Banded DP for Local Alignment. Problem: the band need not lie around the main diagonal when looking for a good local alignment. This is also the case when the lengths of s and t differ. Solution: heuristically find potential diagonals and evaluate them using banded DP. [Figure: DP matrix of s against t.]

18 FASTA

19 FASTA. Input: two sequences s and t; the parameter ktup defines the length of seeds (typically ktup = 1-2 for proteins and ktup = 4-6 for DNA/RNA). Output: the best local alignment between s and t.

20 FASTA Algorithm Outline 1. Find regions in s and t containing a high density of seeds 2. Re-score the 10 regions with the highest scores using a PAM matrix 3. Eliminate segments that are unlikely to be part of alignments 4. Optimize the best alignment using the banded DP algorithm

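21 Step 1: Finding Seeds [Figure: matrix of s against t]

22 Step 2: Re-scoring Segments, Keeping Top 10 [Figure: matrix of s against t]

23 Step 3: Eliminating Unlikely Segments [Figure: matrix of s against t]

24 Step 4: Finding the Best Alignment [Figure: matrix of s against t]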

25 Finding Seeds Efficiently. Prepare an index table of the database sequence s such that for any sequence of length ktup, one gets the list of its positions in s. March along the query sequence t while using the index table to list all matches with the database sequence s.
Index Table (ktup=2), positions in s:
AA  -
AC  -
AG  5, 19
AT  11, 15
CA  10
CC  9
CG  7, 21
TT  16
s=****agcgccatggattgagcga*
t=**tgcgacattgatcgaccta**
Position 7 of t (ac): no match, (-,7). Position 8 of t (ca): one match, (10,8). Position 9 of t (at): two matches, (11,9) and (15,9).
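
The index table can be sketched as a dictionary from each ktup-word to its 1-based positions in s. This is a minimal illustration using the slide's s and t; the function names are hypothetical, and treating `*` as an unknown character that never forms a word is an assumption.

```python
from collections import defaultdict


def build_index(s, ktup):
    """Map every word of length ktup to its 1-based start positions in s."""
    index = defaultdict(list)
    for i in range(len(s) - ktup + 1):
        word = s[i:i + ktup].upper()
        if "*" not in word:  # assumption: '*' marks unknown characters
            index[word].append(i + 1)
    return index


def find_seeds(index, t, ktup):
    """List all (position in s, position in t) pairs sharing a ktup-word."""
    seeds = []
    for j in range(len(t) - ktup + 1):
        word = t[j:j + ktup].upper()
        for i in index.get(word, []):
            seeds.append((i, j + 1))
    return seeds


s = "****agcgccatggattgagcga*"
t = "**tgcgacattgatcgaccta**"
index = build_index(s, 2)
seeds = find_seeds(index, t, 2)
```

Building the table costs one pass over s, and each position of t then costs one lookup plus the number of matches it reports.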

26 Connecting Seeds on the Same Diagonal. The maximal size of the index table is |Σ|^ktup, where |Σ| is the alphabet size (4 or 20). For small ktup, the entire table is stored. For large ktup values, one should keep only entries for tuples actually found in the database; in this case, hashing is needed. Typical values of ktup are 1-2 for proteins and 4-6 for DNA. The index table is prepared for each database sequence ahead of users' matching requests, at compilation time. Matching time is O(|t| · max{m, n}).

27 Identifying Potential Diagonals. Input: sets of pairs, e.g. (6,4), (10,8), (14,12), (15,10), (20,4). Task: locate sets of pairs that are on the same diagonal. Method: sort according to the difference i-j, e.g. 6-4=2, 10-8=2, 14-12=2, 15-10=5, 20-4=16.
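
The sorting step can be sketched as follows, using the pairs from the slide. The function name is illustrative; pairs with the same difference i-j end up in the same group, ordered by i.

```python
from collections import defaultdict


def group_by_diagonal(pairs):
    """Group seed pairs (i, j) by their diagonal number i - j."""
    diagonals = defaultdict(list)
    for i, j in sorted(pairs, key=lambda p: (p[0] - p[1], p[0])):
        diagonals[i - j].append((i, j))
    return dict(diagonals)


groups = group_by_diagonal([(6, 4), (10, 8), (14, 12), (15, 10), (20, 4)])
```

Here the three pairs with difference 2 form one candidate diagonal, while (15,10) and (20,4) are isolated.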

28 From the paper

29 BLAST

30 Basic Local Alignment Search Tool (BLAST). Publications: ungapped BLAST (Altschul et al., 1990); gapped BLAST and PSI-BLAST (Altschul et al., 1997). Input: a query (target) sequence, either DNA, RNA or protein; a scoring scheme (gap penalties, a substitution matrix for proteins, identity/mismatch scores for DNA/RNA); a word length W (typically W=3 for proteins and W=11 for DNA/RNA). Output: statistically significant matches.

31 BLAST Algorithm Outline 1. List all words of length W that score at least T when aligned with the query sequence s 2. Scan the database DB for seeds, namely words from the list that appear in sequences of DB 3. Find High Scoring Pairs (HSPs) by extending the seeds in both directions; keep the best-scoring HSPs 4. Combine several HSPs using the banded DP algorithm

32 BLAST Algorithm. Basic Local Alignment Search Tool: fast alignment technique(s), similar to the FASTA algorithms (not used much now). There are more accurate ones, but they're slower. BLAST makes heavy use of lookup tables. Idea: statistically significant alignments (hits) will have regions of at least 3 identical letters, or at least regions that score highly with respect to the BLOSUM matrix. Based on small local alignments: the pair CCNDHRKMTCSPNDNNRK / TTNDHRMTACSPDNNNKH is more likely than the pair CCNDHRKMTCSPNDNNRK / YTNHHMMTTYSLDNNNKK.

33 BLAST Overview. Given a query sequence Q, there are seven main stages: 1. Remove (filter) low complexity regions from Q 2. Harvest k-tuples (triples) from Q 3. Expand each triple into ~50 high scoring words 4. Seed a set of possible alignments 5. Generate high scoring pairs (HSPs) from the seeds 6. Test significance of matches from HSPs 7. Report the alignments found from the HSPs

34 BLAST Algorithm Part 1: Removing Low-complexity Segments. Imagine matching HHHHHHHHKMAY and HHHHHHHHURHD. KMAY and URHD are the interesting parts, but this pair scores highly using BLOSUM. It's a good idea to remove the HHHHHHHs (low complexity) from the query sequence.

35 Removing Low-complexity Segments. Given a segment of length L, with each amino acid occurring n_1, n_2, ..., n_20 times, use a measure of compositional complexity. To use this measure: slide a window of ~12 residues along the query sequence Q; use a threshold to determine low-complexity windows; use a minimise routine to replace the segment with an optimal minimised segment (or just an X).
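
The formula itself did not survive in the slide text. One widely used definition, from the SEG filter of Wootton and Federhen, is K = (1/L) · log_N( L! / (n_1! · n_2! · ... · n_N!) ) with N = 20 for proteins; that this is the measure meant on the slide is an assumption. The sketch below computes it with log-gamma to avoid huge factorials. The function names, the window size default and the threshold value are illustrative, not taken from the lecture.

```python
import math
from collections import Counter


def compositional_complexity(segment, alphabet_size=20):
    """K = (1/L) * log_N( L! / prod(n_i!) ), assumed SEG-style measure."""
    L = len(segment)
    counts = Counter(segment)
    log_multinomial = math.lgamma(L + 1) - sum(
        math.lgamma(n + 1) for n in counts.values()
    )
    return log_multinomial / (L * math.log(alphabet_size))


def low_complexity_windows(query, window=12, threshold=0.3):
    """Start positions (0-based) of windows whose complexity is below threshold."""
    return [
        i
        for i in range(len(query) - window + 1)
        if compositional_complexity(query[i:i + window]) < threshold
    ]
```

A window made of a single repeated residue gets K = 0, and K grows as the composition becomes more even.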

36 BLAST Algorithm Part 2: Harvesting k-tuples. Collect all the k-tuples of elements in Q; k is set to 3 for residues and 11 for DNA (can vary). Triples are called words; call this set W. Example: the sequence S T S L S T S D K L M R yields the words STS, TSL, SLS, LST, and so on.
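
Harvesting the words is a sliding window over Q. A minimal sketch with the slide's example sequence; the function name is illustrative. Because W is a set, a word that occurs twice in Q (here STS) is stored only once.

```python
def harvest_words(query, k=3):
    """Return the k-tuples of the query in order of first appearance, without repeats."""
    seen = set()
    words = []
    for i in range(len(query) - k + 1):
        word = query[i:i + k]
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words


W = harvest_words("STSLSTSDKLMR")
```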

37 BLAST Algorithm Part 3: Finding High Scoring Triples. Given a word w from W, find all other words w' of the same length (3) which appear in some database sequence and for which Blosum(w,w') > a threshold T. Choose T to limit the number to around 50. Call these the high scoring triples (words) for w. Example: let w = PQG and set T to be 13. Suppose that PQG, PEG, PSG, PQA are found in the database. Blosum(PQG,PQG) = 18, Blosum(PQG,PEG) = 15, Blosum(PQG,PSG) = 13, Blosum(PQG,PQA) = 12. Hence, only PQG and PEG are kept.
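
The example can be reproduced with the handful of BLOSUM62 entries it needs. This is a sketch, not a full implementation: only the six substitution scores used by the example are included, and the names `BLOSUM62_SUBSET`, `word_score` and `high_scoring_words` are illustrative.

```python
# Only the BLOSUM62 entries needed for the slide's example.
BLOSUM62_SUBSET = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "S"): 0, ("G", "A"): 0,
}


def pair_score(a, b):
    """Symmetric lookup; raises KeyError for pairs outside the subset."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(w, v):
    """Sum of position-wise substitution scores of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(w, v))


def high_scoring_words(w, candidates, T):
    """Keep candidates scoring strictly above the threshold T against w."""
    return [v for v in candidates if word_score(w, v) > T]


kept = high_scoring_words("PQG", ["PQG", "PEG", "PSG", "PQA"], T=13)
```

PSG scores exactly 13 and is dropped because the comparison is strict, which matches the slide's conclusion.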

38 BLOSUM62 Substitution Matrix. Zero: by chance; +: more than chance; -: less than chance. Arranged by side groups, so high scores lie in the end boxes. Example: M, I, L, V are interchangeable.

39 Example Calculation. Query = S S H L D K L M R; Dbase = H S H L K L L M G. The score is the sum of the per-position substitution scores; total score = 21. Write Blosum(Query,Dbase) = 21. It is not standard to do this.

40 Finding High Scoring Triples. For each w in W, find all the high scoring words. Organise these sets of words, remembering all the places where w was found in Q. Each high scoring triple is going to be a seed used to generate possible alignment(s); one seed can generate more than one alignment. This is the end of the first half of the algorithm; the second half finds the alignments.

41 BLAST Algorithm Part 4: Seeding Possible Alignments. Look at the first triple V in the query sequence Q (actually from Q, not from W, which has omissions). Retrieve the set of ~50 high scoring words; call this set H_V. Retrieve the list of places in Q where V occurs; call this set P_V. For every pair (word, pos), where word is from H_V and pos is from P_V, find all the database sequences D which have an exact match with word, and store an alignment between Q and D with V at pos in Q matched to the occurrence of word in D. Repeat this for the second triple in Q, and so on.

42 Extracting Seeds [Figure: matrix of s against t]

43 Seeding Possible Alignments: Example. Suppose Q = QQGPHUIQEGQQG. Suppose V = QQG and H_V = {QQG, QEG}. Then P_V = {1, 11}. Suppose we are looking in the database at D = PKLMMQQGKQEG. Then four alignments of Q against D are seeded, one for each pair (word, pos): (QQG, 1), (QQG, 11), (QEG, 1), (QEG, 11).
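
A small sketch of the seeding step on the slide's example. It assumes that a seed pairs an occurrence of V in Q (from P_V) with an occurrence in D of a word from H_V; the function names and the 1-based position convention are illustrative.

```python
def occurrences(word, seq):
    """1-based start positions of exact matches of word in seq."""
    k = len(word)
    return [i + 1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word]


def seed_alignments(Q, V, H_V, D):
    """Return (word, position in Q, position in D) for every seeded alignment."""
    P_V = occurrences(V, Q)
    seeds = []
    for word in H_V:
        for d_pos in occurrences(word, D):
            for q_pos in P_V:
                seeds.append((word, q_pos, d_pos))
    return seeds


seeds = seed_alignments("QQGPHUIQEGQQG", "QQG", ["QQG", "QEG"], "PKLMMQQGKQEG")
```

With this input, QQG occurs in D at position 6 and QEG at position 10, so each of the four (word, pos) pairs yields exactly one seed.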

44 BLAST Algorithm Part 5: Generating High Scoring Pairs (HSPs). For each alignment A, where sequences Q and D are matched and the original matching region was M: extend M to the left until the Blosum score begins to decrease; extend M to the right until the Blosum score begins to decrease. A larger stretch of sequence now matches and may have a higher score than the original triple. Call these high scoring pairs. Throw away any alignments for which the score S of the extended region M is lower than some cutoff score.

45 Finding HSPs [Figure: matrix of s against t]

46 Combining HSPs [Figure: matrix of s against t]

47 Extending Alignment Regions: Example. Q = QQGPHUIQEGQQGKEEDPP, D = PKLMMQQGKQEGM.
Blosum(QQG,QQG) = 16
Blosum(QQGK,QQGK) = 21
Blosum(QQGKE,QQGKQ) = 23
Blosum(QQGKEE,QQGKQE) = 28
Blosum(QQGKEED,QQGKQEG) = 27
So, the extension to the right stops here. The HSP (before left extension) is QQGKEE, scoring 28.
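
The right extension in this example can be sketched with the simple rule stated in the lecture: stop as soon as the next aligned pair would lower the score. Only the BLOSUM62 entries needed here are included, and the names are illustrative. Note that production BLAST uses a drop-off threshold instead of stopping at the first decrease; that refinement is not shown.

```python
# Only the BLOSUM62 entries needed for the slide's example.
SCORES = {
    ("Q", "Q"): 5, ("G", "G"): 6, ("K", "K"): 5,
    ("E", "Q"): 2, ("E", "E"): 5, ("D", "G"): -1,
}


def score(a, b):
    """Symmetric lookup; raises KeyError for pairs outside the subset."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]


def extend_right(q, d, q_start, d_start, length):
    """Extend an ungapped match to the right; starts are 0-based.

    Returns (extended length, score of the extended region).
    """
    total = sum(score(q[q_start + k], d[d_start + k]) for k in range(length))
    while q_start + length < len(q) and d_start + length < len(d):
        step = score(q[q_start + length], d[d_start + length])
        if step < 0:  # the score would begin to decrease
            break
        total += step
        length += 1
    return length, total


Q = "QQGPHUIQEGQQGKEEDPP"
D = "PKLMMQQGKQEGM"
# Seed QQG at position 11 of Q and position 6 of D (0-based 10 and 5).
length, total = extend_right(Q, D, 10, 5, 3)
```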

48 BLAST Algorithm Part 6: Checking Statistical Significance. The reason we extended the alignment regions is to give a more accurate picture of the probability of that BLOSUM score occurring by chance. Question: is an HSP significant? Suppose we have an HSP that scores S for a region of length L in sequences Q and D. Then the probability of two random sequences Q' and D' scoring S in a region of length L is calculated, where Q' has the same length as Q and D' has the same length as D. This probability needs to be low for significance.

49 BLAST Algorithm Part 7: Reporting the Alignments. For each statistically significant HSP, the alignment is reported. If a sequence D has two HSPs with the query Q, two different alignments are reported. Later versions of BLAST try to unify the two alignments.

50 NCBI BLAST Server (protein-protein)

51 BLAST Notes. Listing words: a higher T means lower sensitivity and faster execution. Extracting seeds: done using hash tables to make the process faster. Finding HSPs: only seeds located on the same diagonal as some other seed, at a distance smaller than some threshold, will be extended. Gapped alignment: triggered only for HSPs whose score is higher than a threshold.


57 SIGNIFICANCE?

58 The purpose of sequence alignment: homology; function identification.

59 Similarity. How similar do the sequences have to be to infer homology? Two possibilities when similarity is detected: the similarity is by chance, or the sequences evolved from a common ancestor and hence have similar functions.

60 Significance of scores. [Figure: the sequences HPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKT and LENENQGKCTIAEYKYDGKKASVYNSFVSNGVKE are fed into a homology detection algorithm, which outputs the score 45.] Low score = unrelated; high score = homologs. How high is high enough?

61 Are these proteins homologs?
SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY
Matches: L P W L Y N Y C L
SEQ 2: QFFPLMPPAPYWILATDYENLPLVYSCTTFFWLF
NO (score = 9)
SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY
Matches: L P W LDATYKNYA Y C L
SEQ 2: QFFPLMPPAPYWILDATYKNYALVYSCTTFFWLF
MAYBE (score = 15)
SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY
Matches: RVV L PS W LDATYKNYA Y CDVTYKL
SEQ 2: RVVPLMPSAPYWILDATYKNYALVYSCDVTYKLF
YES (score = 24)

62 Measures of similarity. Percent identity: 40% similar, 70% similar. Problems with percent identity? Scoring matrices: matching of some amino acids may be more significant than matching of other amino acids (PAM matrix in 1970, BLOSUM in 1992). Problems?

63 Statistical Significance. Goal: to provide a universal measure for inferring homology. How different is the result from a random match, or from a match between unrelated sequences? Given a set of sequences not related to the query (or a set of random sequences), what is the probability of finding a match with the same alignment score by chance? Different statistical measures: p-value, E-value, z-score.

64 Statistical significance measures. p-value: the probability that at least one sequence will produce the same or a better score by chance. E-value: the expected number of sequences that will produce the same or a better score by chance. z-score: measures how many standard deviations the score lies above the mean of the score distribution.
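
These measures can be illustrated on a list of scores obtained against unrelated or random sequences. The sketch below is an assumption-laden illustration rather than the procedure used by any particular tool: the function names are hypothetical, the p-value is a simple empirical estimate with a +1 correction, and the conversion from E-value to p-value assumes that the number of chance hits is Poisson distributed.

```python
import math


def z_score(score, null_scores):
    """Number of standard deviations by which score exceeds the null mean."""
    n = len(null_scores)
    mean = sum(null_scores) / n
    variance = sum((x - mean) ** 2 for x in null_scores) / n
    return (score - mean) / math.sqrt(variance)


def empirical_p_value(score, null_scores):
    """Fraction of null scores at least as high as score (with +1 correction)."""
    hits = sum(1 for x in null_scores if x >= score)
    return (hits + 1) / (len(null_scores) + 1)


def p_from_e(e_value):
    """P(at least one chance hit) if chance hits are Poisson with mean e_value."""
    return 1.0 - math.exp(-e_value)


null = [8, 10, 12, 10, 10]
z = z_score(14, null)
p = empirical_p_value(14, null)
```

For small E-values the p-value and the E-value are nearly equal, which is why the two are often used interchangeably for strong hits.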

65 Search Significance Scores. A search will always return some hits. How can we determine how unusual a particular alignment score is? ORFs. Assumptions.

66 Assessing significance requires a distribution. I have an apple of diameter 12 cm. Is that unusual? [Figure: histogram of frequency against diameter (cm).]

67 Is a match significant? Match scores for aligning my sequence with random sequences depend on: the scoring system; the database; the sequence to search for (length, composition). How do we determine the random sequences? [Figure: histogram of frequency against match score.]

68 The null hypothesis We are interested in characterizing the distribution of scores from sequence comparison algorithms. We would like to measure how surprising a given score is, assuming that the two sequences are not related. The assumption is called the null hypothesis. The purpose of most statistical tests is to determine whether the observed results provide a reason to reject the hypothesis that they are merely a product of chance factors.

69 Sequence similarity score distribution. Search a randomly generated database of DNA sequences using a randomly generated DNA query. What will be the form of the resulting distribution of pairwise sequence comparison scores? [Figure: frequency against sequence comparison score.]

70 Empirical score distribution The picture shows a distribution of scores from a real database search using BLAST. This distribution contains scores from nonhomologous and homologous pairs. High scores from homology.

71 Empirical null score distribution This distribution is similar to the previous one, but generated using a randomized sequence database.
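
A randomized database can be approximated by shuffling each sequence, which keeps its length and composition but destroys any real similarity. The sketch below is illustrative only: the scoring function is a simple best ungapped local score with +1/-1 match/mismatch values (not BLAST's scoring), and all names, default values and the fixed seed are assumptions.

```python
import random


def shuffled(seq, rng):
    """Return a random permutation of seq (same length and composition)."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def best_ungapped_score(a, b, match=1, mismatch=-1):
    """Best-scoring ungapped local segment over all diagonals of a vs. b."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):  # offset = i - j
        run = 0
        for i in range(max(0, offset), min(len(a), len(b) + offset)):
            j = i - offset
            run = max(0, run + (match if a[i] == b[j] else mismatch))
            best = max(best, run)
    return best


def null_scores(query, target, trials=100, seed=0):
    """Scores of the query against shuffled copies of the target."""
    rng = random.Random(seed)
    return [best_ungapped_score(query, shuffled(target, rng)) for _ in range(trials)]
```

The resulting list plays the role of the null distribution: an observed score is unusual if few or none of the shuffled scores reach it.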

72 More information online at medicalbioinformatics.de/teaching. Tim Conrad, AG Medical Bioinformatics. Further questions?
