Finding homologous sequences in databases

Hilary Henderson
5 years ago
1 Finding homologous sequences in databases

2 There are multiple algorithms for searching sequence databases:
BLAST (EMBL, NCBI, DDBJ, local)
FASTA (EMBL, local)
For protein-only databases: scans via Smith-Waterman alignments

3 FASTA Pearson and Lipman 1988 PNAS 85:

4 Aim: look for homologs of, or sequences similar to, a query sequence in a database. The search can be performed on remote computers at the EMBL (or DDBJ) web sites.

5 Problem: Smith-Waterman is slow. You have a 200 bp sequence and you wish to find homologues in the NCBI/GenBank database, but the latter holds hundreds of billions of nucleotides. To make matters worse, you are searching not for an exact match but for something that is related (potentially distantly related) to your query sequence. NCBI receives (as of 2006) over [...] requests for such searches every day.

6 Problem: Smith-Waterman is slow. Consider a related problem. To assemble reads into contigs you must align them pairwise to find small overlaps. If you have N = 10^7 reads (for a genome with 500 bp reads, this is enough for 5x coverage), then you need N(N-1)/2 pairwise comparisons. That is roughly 5 x 10^13 alignments, each requiring [...] computations.
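
The size of this problem can be checked with a few lines of arithmetic. The sketch below assumes that the slide's "N = 107" is 10^7 with a lost superscript; the variable names are illustrative, and the genome size is simply the value implied by the stated read length and coverage.

```python
# Back-of-the-envelope count of read-vs-read comparisons for assembly.
# Assumption: N = 10**7 reads (the slide's "107" read as 10^7), 500 bp each.
n_reads = 10**7
read_length = 500
coverage = 5

# Number of unordered pairs of reads: N(N-1)/2.
n_pairs = n_reads * (n_reads - 1) // 2

# Genome size implied by the stated coverage: total sequenced bases / coverage.
genome_size = n_reads * read_length // coverage

print(f"pairwise alignments: {n_pairs:.3e}")
print(f"implied genome size: {genome_size:.1e} bp")
```

Even before the cost of each individual alignment is counted, the number of pairs alone rules out exhaustive Smith-Waterman comparison.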

7 So why not search for larger segments than just one nucleotide? A lack of keywords would indicate that there is no need to search further. E.g. for computational biology you might first search for the keywords "nucleotides", "alignment" and "databases", and then subsequently refine the search. If you did such a search in Jane Austen's Sense & Sensibility, then you (or the computer) would know not to search any further within this book.

8 So why not search for larger segments than just one nucleotide? A lack of keywords indicates there is no need to search further. But you might wish to search for... nucleotides alignment databases. If you did such a search in Jane Austen's Sense & Sensibility, then you (or the computer) would have to search within this book. Looking for these short segments might extend your search and slow it down.

9 The K-tuple (Ktup, or word size) value determines how many consecutive identities are required for a match to be declared:
For protein searches, Ktup = 2
For nucleotide sequences, Ktup = 4 or 6
For short sequences (<20 nt), Ktup = 1
This value determines the speed and the sensitivity of the search.
HASH table: divide the database into a series of tables that contain an ordered (alphabetized) list of words along with links to their locations in the database.
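
A minimal sketch of such a word table, assuming a toy in-memory "database" of two short nucleotide sequences. The function name `build_word_index` and the entry names `seqA` and `seqB` are made up for illustration; real implementations use far more compact structures.

```python
from collections import defaultdict

def build_word_index(sequences, ktup):
    """Map each ktup-length word to the (name, 1-based position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        for start in range(len(seq) - ktup + 1):
            index[seq[start:start + ktup]].append((name, start + 1))
    return index

# Hypothetical two-entry database, for illustration only.
database = {"seqA": "GCATCGGC", "seqB": "CCATCG"}
index = build_word_index(database, ktup=2)

# Print the words in alphabetical order, with links to their locations.
for word in sorted(index):
    print(word, index[word])
```

A larger `ktup` gives fewer, more specific words (faster but less sensitive); a smaller one gives more hits per word (slower but more sensitive).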

13 Establishment of a 'table' of sequences of variable length
'Table' sorted in alphabetical order
Comparison of sequences to the query sequence
init1: assigned to each region of a sequence, using the BLOSUM50 matrix to score mismatches
initn: sum of the init1 of each region, less 20 per join: initn = init1 - 20 + init1 - 20 + init1. If initn < any init1, the initn is discarded and the regions are not joined; initn = init1max
opt: score for the Smith-Waterman alignment

14 1. Identification of regions of identity (figure: Sequence plotted against Query)

15 2. Selection of the region with the best score (figure: Sequence plotted against Query)

16 3. Join the segments (figure: Sequence plotted against Query, three regions each labelled init1)

17 4. Compute the new score for the whole sequence: initn = init1 - 20 + init1 - 20 + init1 (figure: Sequence plotted against Query, three regions each labelled init1)
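
The joining rule can be written out as a small function. This is a simplified sketch, not the FASTA implementation: it assumes a fixed penalty of 20 per join (read from the slide's "init1-20"), joins all the regions it is given rather than choosing the best compatible subset, and the name `initn_score` is made up for illustration.

```python
JOIN_PENALTY = 20  # assumed per-join penalty, taken from the slide's "init1-20"

def initn_score(init1_scores, join_penalty=JOIN_PENALTY):
    """Sum of the init1 scores minus one penalty per join.

    If joining scores lower than the best single region, the regions
    are not joined and the best init1 is used instead.
    """
    if not init1_scores:
        return 0
    joined = sum(init1_scores) - join_penalty * (len(init1_scores) - 1)
    return max(joined, max(init1_scores))

print(initn_score([50, 40, 30]))  # three regions, two joins
print(initn_score([50, 15]))      # joining does not pay off here
```

With three regions there are two joins, so two penalties are subtracted; in the second call the joined total falls below the best single region, so that region's init1 is kept.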

18 This is a visual (aka vague) description... How does a computer do it?

19 A k-tuple listing for sequences
J: CCATCGCCATCG
I: GCATCGGC
K = 2. For each sequence, list the 16 possible 2-tuples in alphabetical order: AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT

20 A k-tuple listing for sequences
J: CCATCGCCATCG
I: GCATCGGC
K = 2. Positions at which each 2-tuple starts (2-tuples absent from both sequences are omitted):
2-tuple   J        I
AT        3, 9     3
CA        2, 8     2
CC        1, 7     -
CG        5, 11    5
GC        6        1, 7
GG        -        6
TC        4, 10    4

21 A k-tuple listing for sequences J: CCATCGCCATCG and I: GCATCGGC (same table as slide 20). For GC (i=1) in sequence I, we know there is a match in sequence J at position 6 (j=6). This is on the 1-6 = -5 diagonal.

22 A k-tuple listing for sequences J: CCATCGCCATCG and I: GCATCGGC (same table as slide 20). For GC (i=1) in sequence I, we know there is a match in sequence J at position 6 (j=6). This is on the 1-6 = -5 diagonal. For CA (i=2) in sequence I, we know there is a match in sequence J at position 2 (j=2). This is on the 2-2 = 0 diagonal (the 'main' diagonal). With a single pass through this table we can calculate the number of matches on each diagonal.

23 A k-tuple listing for sequences J: CCATCGCCATCG and I: GCATCGGC (same table as slide 20). Number of shared k-tuples on each diagonal (i - j):
S0 = 4 (four k-tuples on the 'main' diagonal)
S1 = 1 (one k-tuple 1 above the 'main' diagonal)
S-5 = 1 (one k-tuple 5 below the 'main' diagonal)
S-6 = 4 (four k-tuples 6 below the 'main' diagonal)
All this can be done with a single pass through this lookup table.
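
The diagonal counts for J and I can be reproduced with a short script. This is a sketch of the lookup-table idea only; the function names are made up for illustration, positions are 1-based as on the slides, and a diagonal is identified by i - j.

```python
from collections import Counter, defaultdict

def ktuple_positions(seq, k):
    """Lookup table: k-tuple -> list of 1-based start positions in seq."""
    positions = defaultdict(list)
    for start in range(len(seq) - k + 1):
        positions[seq[start:start + k]].append(start + 1)
    return positions

def diagonal_counts(seq_i, seq_j, k):
    """Count shared k-tuples on each diagonal d = i - j in one pass over seq_i."""
    j_positions = ktuple_positions(seq_j, k)
    counts = Counter()
    for i in range(1, len(seq_i) - k + 2):
        word = seq_i[i - 1:i - 1 + k]
        for j in j_positions.get(word, []):
            counts[i - j] += 1
    return counts

J = "CCATCGCCATCG"
I = "GCATCGGC"
counts = diagonal_counts(I, J, k=2)
for d in sorted(counts):
    print(f"S{d} = {counts[d]}")
```

Only the table for J is built in advance; sequence I is then scanned once, and each of its k-tuples is looked up directly rather than compared against every position in J.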

24 Steps for FASTA
1 - Calculate k-tuples for the query sequence
2 - Score diagonals and identify the 10 best diagonals
3 - Rescore with a scoring matrix (allowing for less than k-tuple matches) and identify high-scoring regions (init1)
4 - Join the initial regions with joining penalties (initn)
5 - Perform a full alignment for sequences with high initn scores (opt)
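
Step 3 can be illustrated by rescoring a single diagonal and keeping its best-scoring ungapped run. The sketch below is an assumption-laden simplification: it uses placeholder match/mismatch values (+5/-4) instead of a real scoring matrix, 0-based positions, and a hypothetical function name `best_ungapped_region`.

```python
def best_ungapped_region(seq_i, seq_j, diagonal, match=5, mismatch=-4):
    """Best-scoring ungapped run on diagonal d = i - j (0-based i and j).

    Returns (score, start_i, end_i), with end_i inclusive.
    The match/mismatch values are placeholders, not a real scoring matrix.
    """
    best = (0, None, None)
    running, run_start = 0, None
    for i in range(len(seq_i)):
        j = i - diagonal
        if not 0 <= j < len(seq_j):
            continue  # this cell of the diagonal lies outside seq_j
        if running <= 0:
            running, run_start = 0, i  # start a fresh run here
        running += match if seq_i[i] == seq_j[j] else mismatch
        if running > best[0]:
            best = (running, run_start, i)
    return best

J = "CCATCGCCATCG"
I = "GCATCGGC"
print(best_ungapped_region(I, J, diagonal=0))
print(best_ungapped_region(I, J, diagonal=-6))
```

Because mismatches are scored rather than forbidden, the run on the main diagonal can extend through a mismatch when the matches on either side outweigh it, which is what "allowing for less than k-tuple matches" amounts to.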

25 How do we know that the match we obtained (the opt score) is unusual? For pairwise alignments we obtained the distribution to compare our score against from random permutations. We can't do permutations for every score from every sequence in the databases! So what do we compare just the 'best' score to?

26 How do we know that the match we obtained (the opt score) is unusual? Originally this was done empirically: compare just the 'best' score to a random collection of alignments (database matches) in order to determine how variable the opt scores are with or without true matches.

27 (Figure: opt score plotted against ln(sequence length), binned in discrete units.) Fit a linear regression line. Calculate a z-score.

28 (Figure: opt score plotted against ln(sequence length), binned in discrete units.)
Z = (score - mean)/stddev
mean: from the linear regression line
stddev: from the spread (variation) of the points
Z: a measure of the number of stddev above/below the mean
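
The regression and z-score calculation can be sketched in plain Python. The numbers below are invented toy data, not FASTA output, and the sketch skips the binning by length shown in the figure; the function names are illustrative.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def z_scores(lengths, opt_scores):
    """Z = (score - mean)/stddev, with the mean taken from the regression on ln(length)."""
    xs = [math.log(n) for n in lengths]
    slope, intercept = fit_line(xs, opt_scores)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, opt_scores)]
    stddev = math.sqrt(sum(r * r for r in residuals) / (len(residuals) - 2))
    return [r / stddev for r in residuals]

# Toy data: library sequence lengths and their opt scores (made up).
lengths = [100, 200, 400, 800, 1600]
opt_scores = [30, 35, 38, 46, 90]
zs = z_scores(lengths, opt_scores)
for length, z in zip(lengths, zs):
    print(length, round(z, 2))
```

The sequence whose score lies furthest above the regression line gets the largest z-score, regardless of its raw opt score.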

29 Obviously this lacks the beauty and simplicity of permutations. If the query or database sequence has unusual amino acid content, this is ignored. Other than on a gross scale, differences due to length are ignored. Placing a probability on the Z-score is dubious at best, since its distribution is unknown. More recently this is done in a more advanced fashion, the one used for BLAST searches. Since this advance came with BLAST, we will defer its discussion till then.

30 Finding protein or nucleic acid sequences or domains in different databases
fasta3: the current main version; it compares a DNA query vs. a DNA database (K-tuple = 6) or an AA query vs. an AA database (K-tuple = 2)
fastx3/fasty3: translation of a DNA query into the 3 frames, compared as AA vs. an AA database
fastx3: indels between codons
fasty3: indels anywhere

31 tfastx3/tfasty3: AA query vs. a DNA database translated into the 6 frames
fastf3/tfastf3: ordered AA mixture (peptide degradation) query vs. an AA or DNA database, respectively
fasts3/tfasts3: set of AA fragments (mass spec analysis) as query vs. an AA or DNA database, respectively
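
Translating DNA into its six reading frames can be sketched as follows. This is an illustration of the idea only, assuming the standard genetic code and uppercase A/C/G/T input; it does not model the frameshift handling of the translating programs, and the function names are made up.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def six_frames(dna):
    """Return translations of the three forward and three reverse-complement frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {f"+{f + 1}": translate(dna[f:]) for f in range(3)}
    frames.update({f"-{f + 1}": translate(reverse[f:]) for f in range(3)})
    return frames

print(six_frames("ATGGCC"))
```

A protein query can then be compared against each of the six translations, which is why a translated search costs several times more than a protein-vs-protein search of the same size.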

32 Remote query: ferredoxin gene of Halobacterium halobium
Example: in a mail to fasta@ebi.ac.uk
LIB UNIPROT
WORD 1
LIST 50
TITLE HALHA
SEQ
PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAE
GEYILEAAEAQGYDWPFSCRAGACANCASIVKEGEIDMDMQQILSD
EEVEEKDVRLTCIGSPAADEVKIVYNAKHLDYLQNRVI

33 Options
first line: data libraries to be searched
second line: word size or k-tuple value
List n: display the top n scores
Title: subject of the mail message
Align n: align the top n to the query sequence
ONE: compare only the given strand to the database
Prot: force the query sequence to be a protein
PATH string: mail the results back to the address given in string

34 Protein search
UniProt          A non-redundant collection of all proteins
UniRef100        As for UniProt but eliminating identical proteins
UniRef90         As for UniProt but eliminating proteins > 90% identical
UniRef50         As for UniProt but eliminating proteins > 50% identical
UniParc          As for UniProt but including archived proteins (shows changes to an entry)
swissprot        Proteins in the SwissProt database
ipi              Proteins in the International Protein Index
prints           Proteins in the FingerPrints database
sgt              Proteins in the Structural Genomics Targets database
pdb              Proteins in the 3D structural database PDB at Rutgers
imgthlap         Proteins in the Immunogenetics Database
Euro Patents     Proteins in the European patents database
Japan Patents    Proteins in the Japanese patents database
USPTO Patents    Proteins in the American patents database

35 Nucleotide search
EMBL             The entire EMBL database
Fungi, INVERTEBRATES, HUMAN, MAMMALS, ORGANELLES, BACTERIOPHAGE, PLANT, PROKARYOTES, RODENTS, MOUSE: subsections of EMBL
STSs             Sequence Tagged Sites
SYNTHETIC, UNCLASSIFIED, VIRUSES, VERTEBRATES

36 Nucleotide search
ESTs             Expressed Sequence Tags
GSSs             Genome Survey Sequences
HTGs             High Throughput Genomics Sequences
PATENTS
VECTORS
EMBLNEW          Sequences new since the last major database release
EMBLALL          EMBL + EMBLNEW
IMGTLIGM         Immunoglobulins and T cell receptors database
IMGTHLA          Human Major Histocompatibility Complex (MHC/HLA) database
HGVBASE          Human Genome Variation database

37 Web query: direct query on the EMBL web sites (or mirrors). Depending on the time of day, queries can be really slow.

42 Output
The output begins with some header information:
FASTA searches a protein or DNA sequence data bank
Version Feb. 20, 2010
Please cite: W.R. Pearson & D.J. Lipman PNAS (1988) 85:
>>>EMBOSS_ aa
Library: UniProtKB residues in sequences
Annotations on the slide: version; references; search libraries; sequence type; query length

46 Output
Tabular output
Graphical view
Listing
Histogram (optional): shows the number of sequences with various scores

55 residues in sequences statistics sampled from to sequences Expectation_n fit: rho(ln(x))= / ; mu= / mean_var= / , 0's: 22 Z-trim: 43 B-trim: 0 in 0/68 Lambda=

56 (Figure labels: Score; Expected number)

57 Sequences related to ferredoxin

63 (Figure labels: scores; length of the sequence; name of the sequence; reference number)

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA. Fasta is used to compare a protein or DNA sequence to all of the