Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA.


Diana Mills
6 years ago

3 FASTA compares a protein sequence to another protein sequence or to a protein database, or a DNA sequence to another DNA sequence or to a DNA library.

4 FASTA is used to compare a protein or DNA query sequence to all of the entries in a sequence library. For example, FASTA can compare a protein sequence to all of the sequences in the NBRF PIR or NCBI protein sequence database.

5 SEQUENCE ALIGNMENT - a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
SEQUENCE SIMILARITY SEARCHING - a method of searching sequence databases by using alignment to a query sequence. By statistically assessing how well database and query sequences match, one can infer homology and transfer information to the query sequence.

6 Protein<>protein searches
DNA<>DNA searches
protein:translated DNA searches (with frameshifts)
Ordered or unordered peptide searches

7 The FASTA programs find regions of local or global (new) similarity between protein or DNA sequences, either by searching protein or DNA databases or by identifying local duplications within a sequence. Several programs provide information on the statistical significance of an alignment. Like BLAST, FASTA can be used to infer functional and evolutionary relationships between sequences and to help identify members of gene families.

8 The FASTA package provides SSEARCH, an implementation of the optimal Smith-Waterman algorithm. SSEARCH searches for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein).

9 Other search tools: GGSEARCH performs optimal global-global alignment searches using the Needleman-Wunsch algorithm. GLSEARCH performs an optimal sequence search using alignments that are global in the query but local in the database sequence. This can be useful when you want to match all of a short query sequence to part of a larger database sequence.

10 FASTA Sequence format

12 EXAMPLE: FASTA FORMAT SEQUENCE (PROTEIN SEQUENCE) [Figure: a one-line header followed by lines of sequence data]

13 The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The identifier follows immediately after the ">" symbol, and the rest of the line is the description. There should be no space between the ">" and the first letter of the identifier.

14 The details in the description line are separated by the pipe (|) symbol.

15 Red: GenBank identifier (gi) number. Yellow: gb (GenBank) accession number. Blue: name of genome.

16 Following the description line is the sequence in standard one-letter code. Anything other than a valid code is ignored, for example spaces, tabs, and asterisks. The sequence begins on the line directly after the description line. Each line of a sequence should have fewer than 80 characters. Sequences may be protein or nucleic acid sequences, and they can contain gaps or alignment characters.
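The format rules above can be sketched as a small reader. This is a minimal illustration, not the parser used by the FASTA programs: the function name `parse_fasta` and the example record (identifier and sequence) are made up, and the sketch assumes that only letters and the gap character "-" are kept as sequence data.

```python
def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    ident, desc, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines rather than failing on them
        if line.startswith(">"):
            if ident is not None:
                records.append((ident, desc, "".join(chunks)))
            # identifier = text up to the first space, description = the rest
            ident, _, desc = line[1:].partition(" ")
            chunks = []
        elif ident is not None:
            # assumption: keep letters and '-', drop spaces, tabs, digits, '*'
            kept = "".join(c for c in line if c.isalpha() or c == "-")
            chunks.append(kept.upper())  # lower case is mapped to upper case
    if ident is not None:
        records.append((ident, desc, "".join(chunks)))
    return records


# Hypothetical record, invented for illustration only.
example = """>gi|0000000|gb|XX000000.0| hypothetical example protein
mkvlaa gllall
AAQPAM A
"""

records = parse_fasta(example)
ident, desc, seq = records[0]
fields = ident.split("|")  # pipe-separated details of the identifier
```

Here `fields[0]` and `fields[2]` hold the tags ("gi", "gb") and `fields[1]` and `fields[3]` hold the corresponding numbers.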

17 ANOTHER EXAMPLE OF SEQUENCE IDENTIFIER..

18 Sequences should be represented in the standard IUB/IUPAC nucleic acid and amino acid codes. Lower-case letters are accepted and are changed to upper-case letters.

19 In amino acid sequences, U and * are acceptable letters. The amino acid codes supported are listed here (25 amino acids and 3 special codes). Numerical digits are not allowed, but some databases use them to indicate the position in the sequence.

21 The four steps of the FASTA algorithm:
1. HASHING
2. RESCORING INITIAL REGIONS
3. JOINING DIAGONALS
4. PERFORM DYNAMIC PROGRAMMING TO FIND OPTIMAL ALIGNMENT

22 STEP 1 : HASHING Find the exact matches between the two sequences using the k-tup value (word size). Each match is a hot spot given by (i, j), where i and j are the positions of the match in the query and the database sequence respectively. The difference i - j gives the offset, which identifies the diagonal the hot spot lies on. Hot spots are grouped by diagonal, and the 10 best diagonal regions are saved. [Figure: hot spots between a query (i) and a database sequence (j); offset = i - j] FASTA can also use PAM250 scoring for aligned residues.

24 A hash table is used to store the k-tups (words, k-mers) efficiently. A table of 4^k entries is required to store all possible words of length k over the four nucleotides, where k is the k-tup (word size).
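The hashing step can be illustrated with a short sketch. It is a simplified reading of the description above, not the FASTA source code: the function names are hypothetical, a Python dictionary stands in for the fixed table of 4^k entries, and diagonals are ranked only by their number of hot spots.

```python
from collections import defaultdict


def index_kmers(seq, k):
    """Map every word of length k in seq to the positions where it starts."""
    table = defaultdict(list)
    for j in range(len(seq) - k + 1):
        table[seq[j:j + k]].append(j)
    return table


def hot_spots(query, database_seq, k):
    """Return (i, j) pairs where a query word matches a database word exactly."""
    table = index_kmers(database_seq, k)
    spots = []
    for i in range(len(query) - k + 1):
        for j in table.get(query[i:i + k], []):
            spots.append((i, j))
    return spots


def best_diagonals(spots, top=10):
    """Rank diagonals (offset = i - j) by how many hot spots they contain."""
    counts = defaultdict(int)
    for i, j in spots:
        counts[i - j] += 1
    return sorted(counts, key=lambda d: (-counts[d], d))[:top]


spots = hot_spots("ATCGGA", "GGATCGG", k=2)
diagonals = best_diagonals(spots)
```

In this toy example most hot spots fall on the diagonal with offset -2, which corresponds to the shared segment ATCGG.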

25 STEP 2 : RESCORING INITIAL REGIONS The 10 best diagonal regions from step 1 are processed further. These regions are called the initial regions. Each region is rescored using a scoring matrix (PAM or BLOSUM). The score obtained for the best region is reported as init1.

26 STEP 3 : JOINING DIAGONALS High-scoring regions from step 2 are joined, with a gap between them, to form a single larger alignment. The score of this alignment is reported as initn. There are two types of gap penalty (gap open penalty and gap extend penalty). The formula for initn: initn = (sum of the scores of the joined regions) - (gap penalty)

27 STEP 4 : PERFORM DYNAMIC PROGRAMMING TO FIND OPTIMAL ALIGNMENT The optimal alignment is obtained using the Smith-Waterman algorithm (dynamic programming). The score of the resulting alignment is reported as opt.
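To make the dynamic programming step concrete, here is a minimal Smith-Waterman scoring sketch. It is an illustration under simplifying assumptions: it uses a fixed match/mismatch score and a single per-position gap penalty (the values are arbitrary) instead of a PAM or BLOSUM matrix with separate open and extend penalties, and it returns only the best local score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                # start a new local alignment here
                prev[j - 1] + s,  # align a[i-1] with b[j-1]
                prev[j] + gap,    # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


identical = smith_waterman_score("ACGT", "ACGT")
unrelated = smith_waterman_score("AAAA", "TTTT")
with_gap = smith_waterman_score("ACGTACGT", "ACGTTACGT")
```

The zero in the `max` is what makes the alignment local: a segment with a negative running score is dropped and a new alignment can start anywhere.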

28 Z-score, expected value, bit score, Smith-Waterman score

29 FASTA OUTPUT-ALIGNMENT

30 1. Rescore using a PAM or BLOSUM matrix. The new score is saved as init1.
2. Short diagonals are removed and long diagonals are joined by gaps. The score for the joined region is reported as initn.
3. The best alignment is obtained using Smith-Waterman dynamic programming, which gives the opt value.

31 OUTPUT OF FASTA: Z-score, bit score, expected value, Smith-Waterman score, similarity. How do we evaluate which database sequence matches our input sequence?

32 Z-score for a single alignment:

$$Z = \frac{S - \mu}{\sigma}$$

where S is the score of the query against a database sequence, μ is the mean score from the database, and σ is the standard deviation of the scores from the database:

$$\sigma = \sqrt{\frac{\sum_i (S_i - \mu)^2}{N}}$$

with N the number of database scores. The Z-score measures the significance of a score based on the mean and standard deviation of the random score distribution. The larger the difference between the real score and the mean, the higher the Z-score and the more significant the match.
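As a small worked example of the Z-score definition, the sketch below computes it for one query score against a list of library scores. The scores are invented for illustration, and the sketch assumes the population standard deviation (dividing by N); the FASTA programs themselves may estimate the mean and deviation differently.

```python
import statistics


def z_score(score, library_scores):
    """(score - mean of library scores) / standard deviation of library scores."""
    mean = statistics.mean(library_scores)
    sd = statistics.pstdev(library_scores)  # assumption: population std. dev.
    return (score - mean) / sd


# Hypothetical similarity scores of unrelated library sequences.
library = [20, 30, 40, 50, 60]
z_hit = z_score(110, library)      # a score far above the library mean
z_average = z_score(40, library)   # a score equal to the library mean
```

A score equal to the mean gives a Z-score of zero, and the Z-score grows as the score moves further above the mean.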

33 The Z-score is used to calculate the number of library sequences that could obtain a score greater than or equal to the score obtained in the search. The Z-score can be used with the extreme value distribution and the Poisson distribution.

34 Because Z-scores depend only on the sequences themselves and are independent of the database size, they can be compared to each other. Because of this, the Z-score is very useful for all-against-all pairwise sequence comparisons and is used instead of other

35 $$E(S) = K m n \, e^{-\lambda S}$$

m = length of query (in nucleotides or amino acids)
n = size of database (in nucleotides or amino acids)
K and λ = parameters that depend on the matrix used (e.g. BLOSUM62: K = 0.14, λ = 0.318)
S = similarity score of the database sequence and the query sequence (depends on the matrix used)

The expect value (E) describes the number of alignments with a given score that are expected to be seen purely by chance when searching a database of random sequences. As the score (S) increases, the expected value decreases exponentially. The expected value calculation takes into consideration the distribution of the scores in the database searched, so it is database specific.
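The expect value formula can be tried out numerically. This sketch assumes the standard form with a negative exponent, E = K m n e^(-λS), and uses the BLOSUM62 values quoted above as defaults; the query length, database size, and scores are arbitrary illustration values.

```python
import math


def expect_value(score, m, n, K=0.14, lam=0.318):
    """Expected number of chance alignments scoring at least `score`."""
    return K * m * n * math.exp(-lam * score)


m, n = 100, 1000                    # hypothetical query length and database size
e_low = expect_value(50, m, n)      # weaker hit
e_high = expect_value(60, m, n)     # stronger hit, smaller E
ratio = expect_value(51, m, n) / expect_value(50, m, n)
```

Each extra point of score multiplies E by e^(-λ), which is the exponential decrease described above, while doubling m or n doubles E.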

36 The closer the E-value is to zero, the more significant the match. An increase in Z-score results in a reduction in E-value. Higher E-values fit the idea that shorter sequences have a higher probability of occurring in the database purely by chance. In a protein search of a database of 10,000 entries, sequences with E() values less than 0.01 are almost always homologous.

37 There is another expected value formula when we have the bit score:

$$S' = \frac{\lambda S - \ln K}{\ln 2}, \qquad E = m n \, 2^{-S'}$$

S = similarity score of the database sequence and the query sequence (depends on the matrix used, e.g. BLOSUM62)
S' = bit score
K and λ = parameters that depend on the matrix used
m = length of query (in nucleotides or amino acids)
n = size of database (in nucleotides or amino acids)

The bit score indicates how good a particular alignment is; the significance of the alignment increases as the score increases.

38 The bit score calculation takes into account the number of gaps and substitutions in each alignment. The statistical significance of a bit score depends on the lengths of the query and library sequences and on the library size. A 1-bit increase in score results in a 2-fold reduction in expectation, while a 10-bit increase corresponds to a 1000-fold reduction in expectation. Because bit scores have been standardized with respect to the scoring system, they can be used to compare alignment scores from different searches.
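The 2-fold and roughly 1000-fold reductions can be checked with a few lines of code. The sketch assumes the standard conversion S' = (λS - ln K) / ln 2 and E = m n 2^(-S'); the default K and λ are the BLOSUM62 values quoted earlier, and the lengths are arbitrary illustration values.

```python
import math


def bit_score(raw_score, K=0.14, lam=0.318):
    """Convert a raw similarity score to a bit score."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expect_from_bits(bits, m, n):
    """Expect value for a bit score, query length m, database size n."""
    return m * n * 2.0 ** (-bits)


m, n = 100, 1000  # hypothetical query length and database size
one_bit = expect_from_bits(11, m, n) / expect_from_bits(10, m, n)
ten_bits = expect_from_bits(20, m, n) / expect_from_bits(10, m, n)
```

`one_bit` is 0.5 and `ten_bits` is 1/1024, which is the "about 1000-fold" reduction. Because K and λ are absorbed into the bit score, the second formula no longer needs them.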

39 The Smith-Waterman score is used to evaluate similar regions between two nucleotide or protein sequences. Instead of looking at the full length of each sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure. The higher the value, the more significant the match. It can be used to identify conserved domains in proteins, which may not extend over the entire sequence.

40 Best score
initn: 952 init1: 952 opt: 952 Z-score: 1102.1 bits: 209.9 E(): 2.7e-52
Smith-Waterman score: 952; 100.0% identity (100% similar) in 148 aa overlap (1-148:1-148)

Good score
initn: 438 init1: 161 opt: 418 Z-score: bits: E(): 5.1e-21
Smith-Waterman score: 531; 72.4% identity (74.5% similar) in 145 aa overlap (1-116:2-141)

Mediocre score
initn: 250 init1: 107 opt: 167 Z-score: bits: 53.3 E(): 1.6e-05
Smith-Waterman score: 233; 53.6% identity (58.0% similar) in 112 aa overlap (1-75:2-113)
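Score lines like these can be read programmatically and ranked by E(). The sketch below is only an illustration: the regular expression is written for the field names shown on this slide, not for every FASTA output variant, and fields whose values are missing in the transcript are stored as `None`.

```python
import re

# Field name, a colon, then an optional number (the number may be missing).
FIELD = re.compile(r"(initn|init1|opt|Z-score|bits|E\(\)):\s*([0-9][0-9.eE+-]*)?")


def parse_score_line(line):
    """Return a dict of the numeric fields found in one score line."""
    fields = {}
    for name, value in FIELD.findall(line):
        fields[name] = float(value) if value else None
    return fields


hits = {
    "best": "initn: 952 init1: 952 opt: 952 Z-score: 1102.1 bits: 209.9 E(): 2.7e-52",
    "good": "initn: 438 init1: 161 opt: 418 Z-score: bits: E(): 5.1e-21",
    "mediocre": "initn: 250 init1: 107 opt: 167 Z-score: bits: 53.3 E(): 1.6e-05",
}

parsed = {label: parse_score_line(line) for label, line in hits.items()}
# Smaller E() means a more significant match, so sort in ascending order.
ranked = sorted(parsed, key=lambda label: parsed[label]["E()"])
```

Sorting by E() alone reproduces the best, good, mediocre order of the slide.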

41 DEMO

43 There are 4 steps to use FASTA:
1. Select the databank
2. Input the protein/DNA/RNA sequence
3. Set the parameters
4. Submit the job

44 Multiple databases can be selected. The sequence will be searched against each selected database. Eg:

45 The sequence can be in GCG, FASTA, EMBL, GenBank format. ence_formats.html

46 1) Choose program: FASTA, TFASTX, TFASTY

47 2) Matrix used
1. BLOSUM50
2. BLOSUM62
3. BLASTP62
4. BLOSUM80
5. PAM120
6. PAM250
Gap open penalty: -12 for protein, -16 for DNA
Gap extend penalty: -2 for protein, -4 for DNA

48 What is a gap? A gap is a maximal consecutive run of spaces in an alignment. It represents an atomic insertion or deletion of a substring. When the gap penalty is too low, high-scoring alignments are achievable. Causes: a single mutation, translocation of DNA.

49 Gaps can occur:
- before the first character in a string
- inside a string
- after the last character in a string
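The gap definition and the two penalties can be combined in a short sketch. It assumes one common convention, in which the open penalty is charged for the first position of a gap and the extend penalty for each additional position; some tools instead charge the open penalty plus the extend penalty for every position, so the totals below are illustrative. The function names and the example string are made up.

```python
def gap_lengths(aligned):
    """Lengths of the maximal runs of '-' in one row of an alignment."""
    runs, run = [], 0
    for c in aligned:
        if c == "-":
            run += 1
        elif run:
            runs.append(run)
            run = 0
    if run:
        runs.append(run)
    return runs


def affine_gap_penalty(length, gap_open=-12, gap_extend=-2):
    """Penalty for one gap (assumed convention: open + (length - 1) * extend)."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


row = "--AC---GT-"  # gaps before, inside, and after the string
lengths = gap_lengths(row)
total = sum(affine_gap_penalty(n) for n in lengths)
```

With the protein defaults from the slide, one gap of length 3 costs less than three separate gaps of length 1, which is why a run of spaces is treated as a single event.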

50 4) Ktup Limits the word length used in the search. Use 2 for protein databases.
5) Expectation Upper Limit Sets the upper expectation value limit for the scores and alignments reported. Sequences with E() values less than 0.01 are almost always homologous.
6) Expectation Lower Limit Sets the lower expectation value limit for the scores and alignments reported. This option filters out the best matches and allows more distant relationships to be displayed.

51 6) Strand Choose which DNA strand to search with when comparing a DNA sequence against the DNA databanks. Not required for protein/protein searches.

52 7) Histogram Turn on or off. Shows a qualitative view of how well the statistical theory fits the similarity scores calculated by the program.

53 8) Filter Filters out unwanted segments of the query sequence, for example repeat regions. This can reduce the reporting of unrelated sequences that match by chance.

54 9) Scores Maximum number of match summaries reported in the search result. Default value is: 50

55 10) Alignments Maximum number of match alignments reported in the search result. Default value is: 50
10) Sequence Range Specify a range or section of the input sequence to use in the search. Example: specifying '35-90' for an input sequence of total length 100. Default value is: START-END

56 11) Database Range Specify the lengths of the sequences to search in a database. For example, a range of 100 to 250 will search all sequences in a database with a length between 100 and 250. Default value is: START-END

58 FASTA BLAST

59 FASTA AND BLAST
FASTA: "Fast-All" (the A stands for All). Introduced in 1985 for protein sequences only, later modified to search DNA as well.
BLAST: Basic Local Alignment Search Tool. Works on the principle of comparing two sequences after short-listing.

60 FASTA AND BLAST SIMILARITIES Both programs compare biological sequences (DNA or other nucleotide sequences, and amino acid or protein sequences) from different species and look for similarities. BLAST uses input data in FASTA format. Both are very fast, viable, and time-saving.

61 FASTA BLAST DIFFERENCES
BLAST: can be modified. FASTA: cannot be modified.
BLAST: better for closely matched sequences. FASTA: better for dissimilar sequences.
BLAST is faster than FASTA.
BLAST is more accurate than FASTA.
BLAST is more versatile and more widely used than FASTA.

62 Which is the best score, good score and mediocre score?
i) initn: 438 init1: 161 opt: 418 Z-score: bits: E(): 5.1e-21 Smith-Waterman score: 531;
ii) initn: 1234 init1: 952 opt: 900 Z-score: bits: E(): 2.7e-52 Smith-Waterman score: 952;
iii) initn: 250 init1: 107 opt: 167 Z-score: bits: 53.3 E(): 1.6e-05 Smith-Waterman score: 233;

63 THANKS

FASTA. Besides that, the FASTA package provides SSEARCH, an implementation of the optimal Smith-Waterman algorithm. FASTA INTRODUCTION Definition (by David J. Lipman and William R. Pearson in 1985) - Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence