FASTA INTRODUCTION

Definition (by David J. Lipman and William R. Pearson in 1985)
- Compares a protein sequence to another protein sequence or to a protein database, or a DNA sequence to another DNA sequence or to a DNA library.

Key Notes:
- Sequence Alignment: a way of arranging DNA, RNA, or protein sequences to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
- Sequence Similarity Searching: a method of searching sequence databases by aligning their entries to a query sequence. By statistically assessing how well database and query sequences match, one can infer homology and transfer information to the query sequence.

The current FASTA package contains programs for:
- protein:protein searches
- DNA:DNA searches
- protein:translated DNA searches (with frameshifts)
- ordered or unordered peptide searches

The FASTA programs find regions of local or global (new) similarity between protein or DNA sequences, either by searching protein or DNA databases or by identifying local duplications within a sequence. Other programs provide information on the statistical significance of an alignment. Like BLAST, FASTA can be used to infer functional and evolutionary relationships between sequences and to help identify members of gene families.

The FASTA package also provides SSEARCH, an implementation of the optimal Smith-Waterman algorithm. SSEARCH searches for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein).
Other search tools:
- GGSEARCH performs optimal global:global alignment searches using the Needleman-Wunsch algorithm.
- GLSEARCH performs an optimal sequence search using alignments that are global in the query but local in the database sequence. This is useful when you want to match all of a short query sequence to part of a larger database sequence.

FASTA Sequence Format in NCBI (Faiz)

The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column.
The identifier follows immediately after the ">" symbol, and the rest of the line is the description. There should be no space between the ">" and the first letter of the identifier. The fields in the description line are separated by the pipe ("|") symbol.

The sequence, in standard one-letter code, starts on the line after the description line, with no blank line in between. Anything other than a valid code (for example spaces, tabs, and asterisks) is ignored. Sequences should be written in the standard IUB/IUPAC amino acid and nucleic acid codes. Lower-case letters are accepted and are converted to upper case.

Meaning of the sequence identifier gi|42494633|gb|AAS17640.1| E [Murine hepatitis virus]:
- gi 42494633: GenInfo Identifier (GI) number
- gb AAS17640.1: GenBank (gb) accession number
- E [Murine hepatitis virus]: name of the protein, with the source organism in brackets

Another example of the FASTA sequence identifier format, as used by other services:
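The rules for the NCBI-style description line can be illustrated with a small sketch. It assumes the identifier fields come in tag/value pairs separated by pipes, with the free-text description after the last pipe. The function names parse_defline and clean_sequence are hypothetical and not part of any FASTA or NCBI tool.

```python
def parse_defline(line):
    """Split an NCBI-style FASTA description line into ID fields and description."""
    if not line.startswith(">"):
        raise ValueError("a FASTA description line must start with '>'")
    parts = line[1:].rstrip().split("|")
    # Assumption: fields before the last pipe are (tag, value) pairs,
    # e.g. gi/42494633 and gb/AAS17640.1; the remainder is the description.
    description = parts[-1].strip()
    tags = parts[:-1]
    ids = dict(zip(tags[0::2], tags[1::2]))
    return ids, description


def clean_sequence(lines):
    """Join sequence lines, drop non-letter characters, convert to upper case."""
    joined = "".join(lines)
    return "".join(ch for ch in joined if ch.isalpha()).upper()


ids, description = parse_defline(
    ">gi|42494633|gb|AAS17640.1| E [Murine hepatitis virus]"
)
print(ids)          # identifier fields keyed by their tag
print(description)  # free-text part of the line
print(clean_sequence(["acg t*", "ttA"]))
```

Real description lines vary between services, so a parser like this would need adjusting for other identifier layouts.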
FASTA steps

Step 1: Hashing
- Find the exact matches between the two sequences using words of length ktup (the word size).
- The lower the ktup value, the slower but more sensitive the search.
- Each match is recorded as a hot spot (i, j), where i and j are the positions of the word in the query and the database sequence respectively.
- The difference between i and j gives the offset.
- Hot spots with the same offset lie on the same diagonal and mark a region where the two sequences align.
- The 10 best regions, whether on the same or on different diagonals, are saved.
- FASTA can also use PAM250 scoring for the aligned residues.

Example: Hash table
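A minimal sketch of the hashing step, assuming plain Python strings for both sequences. It is not the FASTA source code; the function names are hypothetical and the toy sequences are chosen only to show how hot spots and offsets arise.

```python
from collections import Counter, defaultdict


def build_word_table(query, ktup):
    """Hash table: each ktup-length word of the query -> list of its positions."""
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)
    return table


def find_hot_spots(query, subject, ktup=2):
    """Return hot spots (i, j): the same word starts at query[i] and subject[j]."""
    table = build_word_table(query, ktup)
    hot_spots = []
    for j in range(len(subject) - ktup + 1):
        for i in table.get(subject[j:j + ktup], []):
            hot_spots.append((i, j))
    return hot_spots


def diagonal_counts(hot_spots):
    """Count hot spots per offset (i - j); a busy offset marks a candidate region."""
    return Counter(i - j for i, j in hot_spots)


spots = find_hot_spots("ACGT", "TACGT", ktup=2)
print(spots)
print(diagonal_counts(spots).most_common(1))
```

With ktup=2 the three shared words all fall on the same offset, which is the kind of diagonal the later steps would rescore.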
Step 2: Rescoring initial regions
- The 10 best regions from step 1 are processed further. These regions are called the initial regions.
- Each region is rescored using a scoring matrix (PAM or BLOSUM).
- The score of the best rescored region is reported as init1.

Step 3: Joining diagonals
- High-scoring regions from step 2 are joined, with gaps between them, to form a single larger alignment.
- The score of this alignment is reported as initn.
- There are two types of gap penalty: the gap open penalty and the gap extend penalty.
- The formula for initn: initn = (sum of the scores of the joined initial regions) - (gap penalty for joining them)
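Steps 2 and 3 can be sketched as follows. This is an illustration under stated assumptions, not FASTA's implementation: toy_score is a made-up match/mismatch scheme standing in for a PAM or BLOSUM matrix, the joining penalty is a single constant per join, and the check that regions are compatible (in order, non-overlapping) is left out.

```python
def diagonal_pairs(query, subject, offset):
    """Yield residue pairs (query[i], subject[j]) on the diagonal i - j == offset."""
    for i in range(len(query)):
        j = i - offset
        if 0 <= j < len(subject):
            yield query[i], subject[j]


def rescore_diagonal(query, subject, offset, score):
    """Best-scoring ungapped segment on one diagonal (maximum-subarray scan)."""
    best = running = 0
    for a, b in diagonal_pairs(query, subject, offset):
        running = max(0, running + score(a, b))
        best = max(best, running)
    return best


def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 5 if a == b else -4


def join_regions(region_scores, join_penalty):
    """initn-style score: sum of region scores minus one penalty per join."""
    if not region_scores:
        return 0
    return sum(region_scores) - join_penalty * (len(region_scores) - 1)


print(rescore_diagonal("ACGT", "TACGT", -1, toy_score))  # one rescored region
print(join_regions([50, 30], join_penalty=20))           # two regions joined
```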
Step 4: Perform dynamic programming to find the optimal alignment
- The optimal alignment is obtained with the Smith-Waterman algorithm (dynamic programming).
- The score of the resulting alignment is reported as opt.

FASTA steps summary
1. Identify the 10 best regions shared by the two sequences.
2. Rescore them using a PAM or BLOSUM matrix. The best rescored score is saved as init1.
3. Short diagonals are removed and long diagonals are joined with gaps. The score of the joined region is reported as initn.
4. The best alignment is obtained by Smith-Waterman dynamic programming, which gives the opt value.
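The dynamic programming in step 4 can be sketched with a score-only Smith-Waterman routine. The match, mismatch, and gap values below are illustrative assumptions (a real search would use a PAM or BLOSUM matrix and separate gap open and gap extend penalties), and the function name is hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    cols = len(b) + 1
    prev = [0] * cols  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGT", "ACGT"))
print(smith_waterman_score("ACGT", "ACT"))
```

Only two rows of the matrix are kept because the sketch reports the score alone; recovering the alignment itself would require the full matrix and a traceback.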
Significance
Sequence alignment and database searching are key to all of bioinformatics. Understanding the significance of alignments requires an understanding of statistics and distributions.

FASTA results
The FASTA output is shown below. The first two or three rows give the name and details of the sequence being aligned. They are followed by the statistics calculated from the alignment: the Z-score, bit score, expectation value, Smith-Waterman score, and similarity score.
Figure 1 FASTA output result

Z-score
The Z-score measures the significance of a score based on the mean and standard deviation of the random score distribution. The difference between the similarity score and the mean of the random score distribution is standardized by the standard deviation of that distribution. Used together with the extreme value distribution and the Poisson distribution, the Z-score makes it possible to calculate the number of library sequences that could obtain a score greater than or equal to the score obtained in the search. The larger the difference between the real score and the mean (in standard deviation units), the higher the Z-score and the more significant the match. Because Z-scores depend only on the sequences themselves and not on the database size, they can be compared with each other. This makes the Z-score very useful for all-against-all pairwise sequence comparisons, where it is used instead of the other scores.
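The standardization described above can be sketched in a few lines. This is only the textbook form, (score - mean) / standard deviation; the FASTA programs use a more elaborate estimate, and the random scores below are made-up numbers for illustration.

```python
from statistics import mean, pstdev


def z_score(score, random_scores):
    """Standardize a similarity score against a sample of random scores."""
    mu = mean(random_scores)
    sigma = pstdev(random_scores)
    if sigma == 0:
        raise ValueError("random scores have zero spread; Z-score is undefined")
    return (score - mu) / sigma


# Hypothetical random-score sample with mean 50 and standard deviation 5.
print(z_score(70, [45, 55]))
```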
Figure 2 Formula of Z-score

$Z = (S - \mu) / \sigma$, where S is the similarity score and $\mu$ and $\sigma$ are the mean and standard deviation of the random score distribution.

Expected value
Using the distribution of Z-scores in the database, the FASTA program can predict the number of sequences that would be expected, purely by chance, to produce a Z-score greater than or equal to the Z-score obtained in the search. This is called E() or the expectation value. The expectation value (E) describes the number of alignments with a given score that are expected to be seen simply by chance when searching a database of random sequences. As the score (S) increases, the expectation value decreases exponentially. The calculation takes into account the distribution of scores in the database searched, so it is database specific. The closer the E-value is to zero, the more significant the match. Virtually identical short alignments get comparatively high E-values, which matches the idea that shorter sequences have a higher probability of occurring in the database purely by chance. An increase in Z-score results in a reduction in E-value. In a protein search of a 10,000-entry protein database, sequences with E() values less than 0.01 are almost always homologous.

$E(S) = K m n e^{-\lambda S}$

- m = length of the query (in nucleotides or amino acids)
- n = size of the database (in nucleotides or amino acids)
- K and λ = parameters that depend on the scoring matrix used (e.g. BLOSUM62: K = 0.14, λ = 0.318)
- S = similarity score of the database sequence and the query sequence (depends on the matrix used)

Bit Score
The bit score is a log-scaled version of a score.
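To make the relationship between raw score, bit score, and expectation value concrete, here is a small sketch. It assumes the standard Karlin-Altschul forms E = K m n e^(-λS) and bit score = (λS - ln K) / ln 2; the function names and the example query and database sizes are hypothetical.

```python
import math


def expect_value(raw_score, m, n, K, lam):
    """E-value from a raw score: E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """Log-scaled (bit) score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expect_from_bits(bits, m, n):
    """E-value from a bit score: E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative values: the K and lambda quoted for BLOSUM62 in the text,
# a 148-residue query, and a made-up database size.
K, lam, m, n = 0.14, 0.318, 148, 5_000_000
raw = 100
print(expect_value(raw, m, n, K, lam))
print(expect_from_bits(bit_score(raw, K, lam), m, n))  # same value via bits
```

Both routes give the same E-value, which is why bit scores can be compared across searches that used different scoring systems.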
Another expected value formula, for use when the bit score is available:

$S' = (\lambda S - \ln K) / \ln 2$, $E = m n 2^{-S'}$

- S = similarity score of the database sequence and the query sequence (depends on the matrix used, e.g. BLOSUM62)
- S' = bit score
- K and λ = parameters that depend on the matrix used
- m = length of the query (in nucleotides or amino acids)
- n = size of the database (in nucleotides or amino acids)

The bit score indicates how good a particular alignment is: the higher the score, the more significant the alignment. The bit score calculation takes into account the number of gaps and substitutions in each alignment. The statistical significance of a bit score depends on the lengths of the query and library sequences and on the library size. For instance, a 1-bit increase in score gives a 2-fold reduction in expectation, while a 10-bit increase corresponds to a roughly 1000-fold reduction. Because bit scores have been standardized with respect to the scoring system, they can be used to compare alignment scores from different searches.

Smith-Waterman Score
The Smith-Waterman score is used to evaluate similar regions between two nucleotide or protein sequences. Instead of looking at the sequences as a whole, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure. The higher the Smith-Waterman score, the more significant the match. It can be used to identify conserved domains in proteins, which may not extend over the entire sequence.

Evaluating the results

Best score
initn: 952 init1: 952 opt: 952 Z-score: 1102.1 bits: 209.9 E(): 2.7e-52
Smith-Waterman score: 952; 100.0% identity (100% similar) in 148 aa overlap (1-148:1-148)

This is the best result: identity and similarity are both 100%. It has an extremely high Z-score of 1102.1 and a very low expectation value of 2.7e-52, which indicate very high significance; a very high bit score of 209.9, which indicates a very good alignment; and an extremely high Smith-Waterman score of 952, which indicates very strong local similarity.

Good score
initn: 438 init1: 161 opt: 418 Z-score: 540.6 bits: 105.6 E(): 5.1e-21
Smith-Waterman score: 531; 72.4% identity (74.5% similar) in 145 aa overlap (1-116:2-141)

This is a good result: similarity and identity are 74.5% and 72.4% respectively. It has a high Z-score of 540.6 and a low expectation value of 5.1e-21, which indicate high significance; a high bit score of 105.6, which indicates a good alignment; and a high Smith-Waterman score of 531, which indicates strong local similarity.

Mediocre score
initn: 250 init1: 107 opt: 167 Z-score: 262.4 bits: 53.3 E(): 1.6e-05
Smith-Waterman score: 233; 53.6% identity (58.0% similar) in 112 aa overlap (1-75:2-113)

This is a mediocre result: similarity and identity are 58.0% and 53.6% respectively. It has a moderate Z-score of 262.4 and a moderate expectation value of 1.6e-05, which indicate medium significance; a moderate bit score of 53.3, which indicates a mediocre alignment; and a moderate Smith-Waterman score of 233, which indicates intermediate local similarity.

1.1 FASTA program
Sample of the FASTA program on the EMBL-EBI website:
Figure 3 FASTA program in EMBL-EBI

Demo
There are 4 steps to using FASTA:
1. Select the databank
2. Input the protein/DNA/RNA sequence
3. Set the parameters
4. Submit the job
Step 1: Select databank
Multiple databases can be selected. The query sequence is searched against the selected databases.

Step 2: Input protein/DNA sequence
The sequence can be in GCG, FASTA, EMBL, GenBank, PIR, NBRF, PHYLIP or UniProtKB/Swiss-Prot format.
Step 3: Set Parameters

1) Choose Program
- FASTA
- TFASTX
- TFASTY

2) Matrix used
1. BLOSUM50
2. BLOSUM62
3. BLASTP62
4. BLOSUM80
5. PAM120
6. PAM250

3) Gap penalties
- Gap Open Penalty: -12 for protein, -16 for DNA
- Gap Extend Penalty: -2 for protein, -4 for DNA

What is a gap?
- A maximal consecutive run of spaces in an alignment.
- It represents a single (atomic) insertion or deletion of a substring.
- When the gap penalty is too low, gaps are inserted freely and a high-scoring alignment can be obtained even between unrelated sequences.

Causes of gaps:
- A single mutation can create a gap (very common).
- Unequal crossover in meiosis can lead to the insertion or deletion of strings of bases.
- DNA slippage during replication can result in the repetition of a string.
- Retrovirus insertions.
- Translocations of DNA between chromosomes.

Gaps can occur:
- Before the first character of a string
- Inside a string
- After the last character of a string

4) Ktup
Sets the word length used in the search. Use 2 for protein databases.

5) Expectation Upper Limit
Sets the upper expectation value limit for the scores and alignments reported. Sequences with E() values less than 0.01 are almost always homologous.

6) Expectation Lower Limit
Sets the lower expectation value limit for the scores and alignments reported. This option filters out the best matches and allows more distant relationships to be displayed.

7) Strand
Chooses which DNA strand to search with when comparing a DNA sequence against the DNA databanks. Not required for protein:protein searches.
- Top: the sequence is searched as it is input.
- Bottom: the reverse complement of the input sequence is searched.

8) Histogram
Turns the histogram in the FASTA result on or off. The histogram gives a qualitative view of how well the statistical theory fits the similarity scores calculated by the program.

9) Filter
Filters out unwanted segments of the query sequence, for example repeat regions. This reduces the reporting of unrelated sequences that match by chance.

10) Scores
- Maximum number of match summary scores reported in the result
- Default value is: 50

11) Alignments
- Maximum number of match alignments reported in the result
- Default value is: 50

12) Sequence Range
Specifies a range or section of the input sequence to use in the search. Example: specifying '35-90' for an input sequence of total length 100. Default value is: START-END

13) Database Range
Specifies the sizes of the sequences to search in a database. For example, 100-250 searches all sequences in the database with a length between 100 and 250. Default value is: START-END

Step 4: Submit Job
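Three of the Step 3 settings can be illustrated with a short sketch: the gap open and gap extend penalties, the bottom-strand option, and the sequence range. The helper names are hypothetical, and the gap cost convention used here (open penalty plus extend penalty for every gap position) is an assumption; programs differ on whether the first position is charged the extend penalty as well.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def gap_cost(length, gap_open=-12, gap_extend=-2):
    """Affine gap cost, assuming open + extend * length (protein defaults)."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


def reverse_complement(dna):
    """Bottom-strand query: complement each base, then reverse the sequence."""
    return dna.translate(COMPLEMENT)[::-1]


def apply_range(sequence, spec="START-END"):
    """Select a 1-based inclusive range such as '35-90'; 'START-END' keeps all."""
    if spec == "START-END":
        return sequence
    start, end = (int(part) for part in spec.split("-"))
    return sequence[start - 1:end]


print(gap_cost(3))                 # cost of a 3-residue gap under the assumption
print(reverse_complement("ATGC"))
print(len(apply_range("A" * 100, "35-90")))
```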
COMPARISON: FASTA VS BLAST

FASTA
- Name stands for "FAST-All"
- Developed in 1985 for protein, later modified to search DNA as well

BLAST
- Basic Local Alignment Search Tool
- Developed in 1990

SIMILARITIES
- Both programs compare biological sequences (DNA or other nucleotide sequences, and amino acid or protein sequences) of different species and look for similarities.
- BLAST uses input data in FASTA format.
- Both are very fast and viable, and save time.

DIFFERENCES

FASTA
- Cannot be modified
- Better for dissimilar sequences

BLAST
- Can be modified
- Better for closely matched sequences
- Faster than FASTA
- More accurate than FASTA
- More versatile and more widely used than FASTA