Bi03c_1 Unit 03c: Heuristic methods for pairwise alignment: k-tuple methods

k-tuple methods for alignment of pairs of sequences Bi03c_2
- Dynamic programming is too slow for large databases.
- Use heuristic methods based on shared subsequences (with only a little sacrifice of sensitivity):
  - FASTA
  - BLAST + Gapped BLAST
FASTA Bi03c_3
- Use a hash table of the short words of the query sequence (short = 2 to 6 characters).
- Go through the database and look for matches in the query hash table (computing time is linear in the size of the database).
- Score matching segments based on the content of these matches: first by number of occurrences, second by correct order.
[Figure: database sequences Seq0 ... SeqN and words Word 0 ... Word N; from Altman (1999)]

BLAST (Basic Local Alignment Search Tool) Bi03c_4
- Very heuristic! But the most successful!
- Detailed description in: Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J. Mol. Biol. 1990; 215:403-410
- Uses substitution matrices to compute scores, e.g. PAM120 for proteins; +5 for matching amino acids, -5 for a mismatch
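The word lookup described on the FASTA slide can be sketched as follows. This is only a minimal illustration, not the FASTA program itself: the function names, the toy sequences and the choice to rank by the best diagonal are assumptions made here. Counting hits per diagonal (database position minus query position) is one simple way to reward matches that occur often and in the correct order.

```python
from collections import Counter, defaultdict


def build_word_table(query, k):
    """Map every k-letter word of the query to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    return table


def diagonal_hits(table, subject, k):
    """Count word matches per diagonal (subject position minus query position)."""
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        # .get avoids creating empty entries in the defaultdict
        for i in table.get(subject[j:j + k], ()):
            diagonals[j - i] += 1
    return diagonals


def rank_database(query, database, k=2):
    """Rank database sequences by the number of hits on their best diagonal."""
    table = build_word_table(query, k)
    scores = {}
    for name, subject in database.items():
        diagonals = diagonal_hits(table, subject, k)
        scores[name] = max(diagonals.values(), default=0)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy_db = {"seq_a": "XXACDEFGXX", "seq_b": "GFEDCA"}  # hypothetical data
    print(rank_database("ACDEFG", toy_db, k=2))
```

Each database sequence is scanned once and every word is looked up in constant time, which is where the linear running time comes from.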
BLAST (Basic Local Alignment Search Tool), ctd. Bi03c_5
- Define maximal segment pair (MSP) := maximum-scoring pair of identical-length segments chosen from 2 sequences.
- Define local maximum scoring pair := high-scoring pair (HSP) whose score cannot be improved by extending or shortening.
- BLAST seeks all locally aligned HSPs with scores above some cutoff (and ranks them).
- This yields a list of local alignments without gaps.

BLAST implementation Bi03c_6
Involved! (See the article above.)
1) Compile a list of high-scoring words (k-tuples scoring at least T when compared to the query sequence, e.g. using a PAM substitution matrix).
2) Scan the database for hits.
3) Extend the hits.
- Fast and most widely used!
- Open to several variations of algorithm, strategy & parameters (e.g. substitution matrices, threshold, word lengths).
- Setting options other than the defaults needs an understanding of the background concepts.
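The three implementation steps can be illustrated with a toy sketch. It is not the BLAST code: the match/mismatch scores (+5/-4) stand in for a real substitution matrix, the nucleotide alphabet keeps the word list small, and the drop-off rule used to stop an extension as well as all function names are assumptions made for this example.

```python
from itertools import product

MATCH, MISMATCH = 5, -4  # assumed toy scores, not a PAM matrix


def score(a, b):
    return MATCH if a == b else MISMATCH


def neighborhood(query, k, threshold, alphabet="ACGT"):
    """Step 1: all k-letter words scoring at least `threshold` against a query word."""
    words = {}
    for i in range(len(query) - k + 1):
        q_word = query[i:i + k]
        for cand in product(alphabet, repeat=k):
            if sum(score(a, b) for a, b in zip(q_word, cand)) >= threshold:
                words.setdefault("".join(cand), []).append(i)
    return words


def scan(words, subject, k):
    """Step 2: yield (query position, subject position) for every word hit."""
    for j in range(len(subject) - k + 1):
        for i in words.get(subject[j:j + k], ()):
            yield i, j


def extend_one_side(query, subject, qi, sj, step, drop):
    """Walk away from the seed; return (best score gain, length at that best)."""
    gain = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        gain += score(query[qi], subject[sj])
        length += 1
        if gain > best:
            best, best_len = gain, length
        elif best - gain > drop:  # assumed stop rule: score fell too far
            break
        qi += step
        sj += step
    return best, best_len


def extend(query, subject, i, j, k, drop=10):
    """Step 3: ungapped extension; returns (score, query start, query end)."""
    seed = sum(score(query[i + t], subject[j + t]) for t in range(k))
    right, r_len = extend_one_side(query, subject, i + k, j + k, +1, drop)
    left, l_len = extend_one_side(query, subject, i - 1, j - 1, -1, drop)
    return seed + left + right, i - l_len, i + k + r_len


def hsps(query, subject, k=3, threshold=15, cutoff=20):
    """Ranked ungapped segment pairs as (score, q_start, q_end, diagonal)."""
    words = neighborhood(query, k, threshold)
    found = {extend(query, subject, i, j, k) + (j - i,)
             for i, j in scan(words, subject, k)}
    return sorted((h for h in found if h[0] >= cutoff), reverse=True)


if __name__ == "__main__":
    for hit in hsps("ACGTACGT", "TTACGTACGTTT"):
        print(hit)
```

Lowering `threshold` below `k * MATCH` admits words with mismatches into the list, which is the knob that trades speed for sensitivity.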
BLAST services Bi03c_7
BLAST provided by NCBI (National Center for Biotechnology Information):
- WWW BLAST
- Stand-alone BLAST
- Network BLAST
- BLAST URL API (HTTP-encoded requests to the NCBI web server)
BLAST is also provided by many other institutions.

BLAST: types of programs & searches Bi03c_8
[Figure offline: insert1.gif]
Further information not shown, see http://www.ncbi.nlm.nih.gov/education/blastinfo/query_tutorial.html
BLAST: databases to select Bi03c_9
[Figure offline: insert2.gif]
Further information not shown, see http://www.ncbi.nlm.nih.gov/education/blastinfo/query_tutorial.html

BLAST: databases to select, ctd. Bi03c_10
[Figure offline: insert3.gif]
Further information not shown, see http://www.ncbi.nlm.nih.gov/education/blastinfo/query_tutorial.html
BLAST: parameters to select Bi03c_11
Substitution matrix & gap penalties
- No single scoring scheme is best for all purposes.
- Experience & understanding of the background are necessary for an appropriate choice.
Suggested combinations of amino-acid substitution matrices and affine gap penalties
$\gamma(g) = -d - (g - 1)\,e$, with d = gap opening (existence) and e = gap extension:

Substitution matrix    d     e
PAM30                  9     1
PAM70                  10    1
BLOSUM80               10    1
BLOSUM62               11    1
BLOSUM45               15    2

BLAST, options Bi03c_12
Substitution matrices, background
- small effect: replacement occurs often
- large effect: replacement occurs rarely (seldom)
[Figure: amino acids aa_i, aa_j, aa_k; matrices]
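A small sketch of the affine gap penalty, using the formula as stated on the slide, gamma(g) = -d - (g - 1) e, and the suggested (d, e) pairs from the table. The dictionary and function names are chosen here for illustration only.

```python
# (d, e) = (gap opening, gap extension), copied from the table on the slide
SUGGESTED_GAP_PENALTIES = {
    "PAM30": (9, 1),
    "PAM70": (10, 1),
    "BLOSUM80": (10, 1),
    "BLOSUM62": (11, 1),
    "BLOSUM45": (15, 2),
}


def gap_penalty(g, d, e):
    """Affine gap score gamma(g) = -d - (g - 1) * e for a gap of length g >= 1."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -d - (g - 1) * e


if __name__ == "__main__":
    d, e = SUGGESTED_GAP_PENALTIES["BLOSUM62"]
    for g in (1, 2, 5, 10):
        print(g, gap_penalty(g, d, e))
```

Because e is much smaller than d, one long gap is penalized far less than several short ones of the same total length.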
BLAST, options Bi03c_13
Substitution matrix & gap penalties
- No single scoring scheme is best for all purposes.
- Experience & understanding of the background are necessary for an appropriate choice.
- A given class of alignments is best distinguished from chance by the substitution matrix whose target frequencies characterize the class.
Should one BLAST proteins or rather nucleotides?
- Since more than one codon codes for a particular amino acid, BLASTing proteins is more reliable than BLASTing nucleotides for finding similarities between sequences.

BLAST, options Bi03c_14
(Some) more options
- Filtering low-complexity regions (yes/no)
  - SEG algorithm for protein BLAST (masked residues appear as Xs in the alignment of a protein with itself!)
  - DUST algorithm for nucleotide BLAST (masked bases appear as Ns in the alignment of a nucleotide sequence with itself!)
  - Beware of regions with highly biased amino-acid composition: ... A L M M M M M M L K M M M M M K M M M ...
- Selecting the word size
  - default: 3 for protein BLAST
  - default: 11 for nucleotide BLAST
  - short words increase sensitivity & computation time
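To make the idea of low-complexity filtering concrete, here is a deliberately simplified sketch. It is neither SEG nor DUST: it masks every window whose Shannon entropy falls below a cutoff, and the window length, the cutoff and the function names are arbitrary assumptions for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Entropy in bits of the letter composition of `segment`."""
    counts = Counter(segment)
    total = len(segment)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2, mask_char="X"):
    """Replace every window with entropy below `min_entropy` by `mask_char`."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            masked[start:start + window] = mask_char * window
    return "".join(masked)


if __name__ == "__main__":
    biased = "ALMMMMMMLKMMMMMKMMM"  # the biased stretch shown on the slide
    print(mask_low_complexity(biased))
    print(mask_low_complexity("ACDEFGHIKLMNPQRSTVWY"))
```

A stretch dominated by one residue has low entropy and is masked, while a compositionally diverse stretch passes through unchanged.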
BLAST, interpretation of results Bi03c_15
Which score is high enough to be significant?
"High" means high compared to scores obtained by chance!

BLAST, interpretation of results, ctd. Bi03c_16
Number of hits by chance
Compute the expected number (E) of HSPs with a score of at least S if the query were compared to random sequences:
$E = K\,m\,n\,e^{-\lambda S}$
- m, n: sequence lengths of the query (m) and the (whole) database (n)
- K, λ: factors (can be computed)
- S: raw score, obtained directly via the substitution matrix; it cannot be interpreted quantitatively on its own
The interpretation makes intuitive sense: E grows with the size of the search space m·n and falls exponentially with the score S.
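The E-value formula can be tried out numerically. The sketch below evaluates E = K m n e^(-λS) and inverts it to find the raw score needed for a target E. The values of K and λ in the usage part are arbitrary illustrative numbers, not parameters of any real scoring system, and the function names are made up here.

```python
import math


def expected_hsps(raw_score, m, n, K, lam):
    """E = K * m * n * exp(-lam * S): expected chance HSPs scoring at least S."""
    return K * m * n * math.exp(-lam * raw_score)


def score_for_evalue(target_e, m, n, K, lam):
    """Raw score S at which the expected number of chance HSPs equals target_e."""
    return math.log(K * m * n / target_e) / lam


if __name__ == "__main__":
    K, lam = 0.1, 0.3          # assumed values, for illustration only
    m, n = 300, 50_000_000     # hypothetical query and database lengths
    for s in (40, 60, 80, 100):
        print(s, expected_hsps(s, m, n, K, lam))
    print(score_for_evalue(1e-3, m, n, K, lam))
```

Doubling the database length doubles E, whereas a modest increase in S lowers E by orders of magnitude.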
BLAST, interpretation of results, ctd. Bi03c_17
Compute the normalized score (believe it, or read Altschul et al. 1990...):
$S' = \frac{\lambda S - \ln K}{\ln 2}$ (bit score)
$E = m\,n\,2^{-S'}$
E is the expected number of HSPs with bit score S' (listed in the BLAST output).

BLAST, interpretation of results, ctd. Bi03c_18
Probabilities for finding several ($n_{HSP}$) random HSPs:
$p(n_{HSP}) = e^{-E}\,\frac{E^{n_{HSP}}}{n_{HSP}!}$
p = probability of finding $n_{HSP}$ HSPs if E HSPs are expected (i.e. with score S)
[Figure offline: Distribution of HSPs.gif]
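A short numerical check of the two slides: the bit score S' = (λS - ln K)/ln 2 absorbs K and λ, so that E = m n 2^(-S') gives the same value as K m n e^(-λS), and the number of chance HSPs follows a Poisson distribution with mean E. The function names and the numbers in the usage part are illustrative assumptions.

```python
import math


def bit_score(raw_score, K, lam):
    """Normalized score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def evalue_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


def prob_n_hsps(n_hsp, E):
    """Poisson probability of exactly n_hsp chance HSPs when E are expected."""
    return math.exp(-E) * E ** n_hsp / math.factorial(n_hsp)


if __name__ == "__main__":
    K, lam, m, n = 0.1, 0.3, 300, 50_000_000   # assumed illustrative values
    S = 80
    bits = bit_score(S, K, lam)
    print(bits, evalue_from_bits(bits, m, n))
    print(K * m * n * math.exp(-lam * S))       # same E from the raw score
    print([prob_n_hsps(k, 0.5) for k in range(4)])
```

Because K and λ are folded into S', bit scores obtained with different scoring systems can be compared directly.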
BLAST, interpretation of results, ctd. Bi03c_19
Compute the probability of finding at least 1 HSP:
$p(n_{HSP} \ge 1) = 1 - p(n_{HSP} = 0) = 1 - e^{-E}\,\frac{E^{0}}{0!} = 1 - e^{-E}$
The expected number of HSPs with score S is usually very small! Expanding the exponential:
$1 - e^{-E} = 1 - \left(1 + \frac{(-E)}{1!} + \frac{(-E)^{2}}{2!} + \dots\right)$
so that $p(n_{HSP} \ge 1) \approx E$ for $E \ll 1$.

BLAST, interpretation of results, ctd. Bi03c_20
Probabilities of finding at least 1 HSP: how accurate is the approximation?
[Figure: accurate formula $p(n_{HSP} \ge 1) = 1 - e^{-E}$ compared with the approximation $p(n_{HSP} \ge 1) \approx E$ for $E \ll 1$]
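The accuracy of the approximation p ≈ E can be checked directly by comparing it with the exact expression 1 - e^(-E) for a few values of E. This is only a small sketch; the function name and the chosen values of E are arbitrary.

```python
import math


def prob_at_least_one(E):
    """Exact probability 1 - exp(-E) of at least one chance HSP."""
    return -math.expm1(-E)  # expm1 keeps precision for very small E


if __name__ == "__main__":
    for E in (1e-3, 0.01, 0.1, 0.5, 1.0):
        exact = prob_at_least_one(E)
        rel_error = (E - exact) / exact
        print(f"E={E:<6} exact={exact:.6f} approx={E:.6f} rel.error={rel_error:.1%}")
```

The relative error of the approximation is roughly E/2 for small E, so it is negligible for the E-values at which a hit is usually considered significant.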
BLAST, some more points to consider Bi03c_21
- The E-value of the above equation refers to a 2-sequence alignment. For comparison with a whole database of sequences, E is adjusted:
  - Mode chosen in FASTA: E is scaled by the number of sequences in the db.
  - Mode chosen in BLAST: E is scaled by the total length of the db.
- In a strict sense, the E-value is valid only for ungapped alignments. But: it proves to be o.k. for gapped ones as well.
- Filter out low-complexity regions!
  - DUST algorithm for nucleotide sequences
  - SEG algorithm for protein sequences
  (Filtering is applied only to the query sequence, not to the db!)

BLAST for human HFE gene product (hemochromatosis protein) Bi03c_22
Human HFE Gene, Graphic Display Bi03c_23

Human HFE Gene, Graphic Display, detail for 1st splicing variant Bi03c_24
Protein for Splicing Variant 1 Bi03c_25

Get FASTA for protein Bi03c_26
data: Variant1FASTA.txt
Launch BLAST via Entrez Bi03c_27

Submit BLAST Query Bi03c_28
Format BLAST result Bi03c_29

Add Up: Aligned Conserved Domains Bi03c_30
BLAST results, graphic Bi03c_31

BLAST results, graphic, ctd. Bi03c_32
Interpreting BLAST alignment Bi03c_33
data: blosum62.html

Bi03c_34
Correction items:
- get article from Altschul
- page 18: figure import not ok; set "HSP" in n_HSP as a subscript, also in the formula!
- Unit 3a, p. 22: insert links to ScoringMatrix2.html and scoringmatrices_tut.html!
- General: complete contents (starting at database searches); insert screen shots for queries