FASTA


FASTA INTRODUCTION

Definition (by David J. Lipman and William R. Pearson in 1985) - FASTA compares a protein sequence to another protein sequence or to a protein database, or a DNA sequence to another DNA sequence or to a DNA library.

Key Notes:

Sequence Alignment - a way of arranging sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.

Sequence Similarity Searching - a method of searching sequence databases by aligning their entries to a query sequence. By statistically assessing how well database and query sequences match, one can infer homology and transfer information to the query sequence.

The current FASTA package contains programs for:
- protein:protein searches
- DNA:DNA searches
- protein:translated DNA searches (with frameshifts)
- ordered or unordered peptide searches

The FASTA programs find regions of local or global (new) similarity between protein or DNA sequences, either by searching protein or DNA databases or by identifying local duplications within a sequence. Other programs provide information on the statistical significance of an alignment. Like BLAST, FASTA can be used to infer functional and evolutionary relationships between sequences and to help identify members of gene families.

In addition, the FASTA package provides SSEARCH, an implementation of the optimal Smith-Waterman algorithm. SSEARCH searches for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein).

Other search tools:
- GGSEARCH performs optimal global-global alignment searches using the Needleman-Wunsch algorithm.
- GLSEARCH performs an optimal sequence search using alignments that are global in the query but local in the database sequence. This is useful when you want to match all of a short query sequence to part of a larger database sequence.

FASTA Sequence Format in NCBI (Faiz)

The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column.

The identifier follows immediately after the ">" symbol, and the rest of the line is the description. There should be no space between the ">" and the first letter of the identifier. The fields of the identifier are separated by the pipe (|) symbol.

The sequence follows the description line, in the standard one-letter code. Anything other than a valid code, such as spaces, tabs and asterisks, is ignored. There should be no blank line between the description line and the sequence. Sequences should be written in the standard IUB/IUPAC amino acid and nucleic acid codes. Lower-case letters are accepted and are converted to upper case.

Meaning of the sequence identifier:
- gi|42494633: the GI (GenInfo Identifier) number
- gb|AAS17640.1: the GenBank (gb) accession number
- E [Murine hepatitis virus]: the name of the protein and, in brackets, the genome (organism) it comes from

Other services use different identifier formats in the FASTA header line.
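The header layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the NCBI parser: the function names `parse_fasta` and `split_identifier` are hypothetical, the header is reassembled from the identifier fields listed above, and the short sequence is made up for the example.

```python
def parse_fasta(text):
    """Parse FASTA text into (identifier, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate stray blank lines instead of failing
        if line.startswith(">"):
            if header is not None:
                records.append(_finish(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_finish(header, chunks))
    return records


def _finish(header, chunks):
    # The identifier runs up to the first space; the rest is the description.
    identifier, _, description = header.partition(" ")
    # Drop spaces, tabs, asterisks and other non-letters; map to upper case.
    raw = "".join(chunks)
    sequence = "".join(ch for ch in raw if ch.isalpha() or ch == "-").upper()
    return identifier, description.strip(), sequence


def split_identifier(identifier):
    """Split a pipe-separated identifier into {tag: value} pairs."""
    fields = [f for f in identifier.split("|") if f]
    return dict(zip(fields[0::2], fields[1::2]))


# Assumed example record (sequence residues are invented for illustration).
example = ">gi|42494633|gb|AAS17640.1| E [Murine hepatitis virus]\nmfnl ftdt*\n"
records = parse_fasta(example)
```

Here `records[0]` holds the identifier, the description and the cleaned upper-case sequence, and `split_identifier` turns the identifier into its `gi` and `gb` parts.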

FASTA steps

Step 1: Hashing
- Find the exact matches between the two sequences using words of length ktup (the word size).
- The lower the ktup value, the slower but more sensitive the search.
- Each match is recorded as a hot spot (i, j), where i and j are the positions of the word in the query and in the database sequence respectively.
- The difference between i and j gives the offset of the hot spot.
- Hot spots with the same offset lie on the same diagonal, so groups of words with similar offsets mark a candidate region of alignment between the two sequences.
- The 10 best regions, whether on the same or on different diagonals, are saved.
- FASTA can also use PAM250 scores for the aligned residues.

Example: hash table
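The hashing step can be sketched as follows. This is a simplified illustration under stated assumptions, not the FASTA implementation: the function names are hypothetical, and diagonals are ranked simply by the number of hot spots they contain, whereas FASTA scores runs of nearby hot spots.

```python
from collections import Counter, defaultdict


def build_ktup_table(seq, ktup):
    """Hash table: each word of length ktup -> list of start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        table[seq[i:i + ktup]].append(i)
    return table


def find_hot_spots(query, target, ktup=2):
    """Return (i, j) pairs where a ktup word of query matches target."""
    table = build_ktup_table(query, ktup)
    hot_spots = []
    for j in range(len(target) - ktup + 1):
        for i in table.get(target[j:j + ktup], []):
            hot_spots.append((i, j))
    return hot_spots


def best_diagonals(hot_spots, top=10):
    """Rank diagonals (offset = i - j) by how many hot spots fall on them."""
    counts = Counter(i - j for i, j in hot_spots)
    return counts.most_common(top)


hot = find_hot_spots("ACGTAC", "TTACGT", ktup=2)
diagonals = best_diagonals(hot, top=10)
```

For these two toy sequences the words AC, CG and GT all match with offset -2, so that diagonal collects the most hot spots and would be the first candidate region.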

Step 2: Rescoring initial regions
- The 10 best regions from Step 1 are processed further. These are called the initial regions.
- Each region is rescored using a scoring matrix (PAM or BLOSUM).
- The best score obtained is reported as init1.

Step 3: Joining diagonals
- High-scoring regions from Step 2 are joined, allowing gaps between them, to form a single larger alignment.
- The score of this alignment is reported as initn.
- There are two types of gap penalty: the gap open penalty and the gap extend penalty.
- The formula for initn: initn = (sum of the scores of the joined regions) - (joining penalty for each gap)
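As a small worked example of the joining step, the sketch below scores one chain of regions that has already been chosen. It assumes that initn is the sum of the scores of the joined regions minus a joining penalty for each gap between them; the function name, the region scores and the penalty value are hypothetical, and the search for the best compatible chain is not shown.

```python
def initn_score(region_scores, joining_penalty):
    """Score a chain of joined regions: sum of scores minus one penalty per join."""
    if not region_scores:
        return 0
    joins = len(region_scores) - 1  # one gap between each pair of neighbours
    return sum(region_scores) - joins * joining_penalty


# Two regions scoring 100 and 60, joined by one gap with an assumed penalty of 20.
joined = initn_score([100, 60], joining_penalty=20)

# A single region has nothing to join, so initn equals its own score.
single = initn_score([952], joining_penalty=20)
```

With a single region the result equals that region's score, which is consistent with output lines where initn and init1 are identical.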

Step 4: Perform dynamic programming to find the optimal alignment
- The optimal alignment is obtained with the Smith-Waterman algorithm (dynamic programming).
- The score of the resulting alignment is reported as opt.

FASTA steps summary
1. Identify the 10 best regions shared by the two sequences.
2. Rescore them using a PAM or BLOSUM matrix. The best rescored value is saved as init1.
3. Short diagonals are removed and long diagonals are joined across gaps. The score for the joined region is reported as initn.
4. The best alignment is obtained by Smith-Waterman dynamic programming, which gives the opt value.
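The dynamic programming used in Step 4 can be illustrated with a compact Smith-Waterman scorer. This is a minimal sketch that assumes a simple match/mismatch score and a linear gap penalty (the values are arbitrary); FASTA itself uses a substitution matrix, separate gap open and extend penalties, and restricts the computation to a band around the best initial region.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


score_identical = smith_waterman_score("ACGT", "ACGT")
score_embedded = smith_waterman_score("ACGT", "TTACGTTT")
score_unrelated = smith_waterman_score("AAAA", "TTTT")
```

Because cell values are never allowed to drop below zero, the score depends only on the best-matching segments, which is why a short query embedded in a longer sequence scores the same as an exact full-length match.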

Significance

Sequence alignments and database searching are key to all of bioinformatics. Understanding the significance of alignments requires an understanding of statistics and distributions.

FASTA results

The following is a FASTA output result. The first two or three rows show the name and details of the sequence being aligned. These are followed by the statistics calculated for the alignment: the Z-score, bit score, expected value, Smith-Waterman score and similarity score.

Figure 1 FASTA output result

Z-score

The Z-score measures the significance of a score relative to the mean and standard deviation of the random score distribution: the difference between the similarity score and the mean of the random scores is divided by the standard deviation of the random scores. Together with the extreme value distribution and the Poisson distribution, the Z-score can be used to calculate the number of library sequences expected to obtain a score greater than or equal to the score obtained in the search. The larger the difference between the real score and the mean (in standard deviation units), the higher the Z-score and the more significant the match. Because Z-scores depend only on the sequences themselves and not on the database size, they can be compared with each other. This makes the Z-score very useful for all-against-all pairwise sequence comparisons, where it is used in place of the other scores.
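The definition above amounts to Z = (score - mean) / standard deviation. The sketch below computes it from a list of scores against random or unrelated sequences. It is a simplified illustration: the function name and the sample scores are hypothetical, and it omits the sequence-length correction and rescaling that the FASTA programs apply before reporting Z-scores.

```python
from statistics import mean, pstdev


def z_score(score, random_scores):
    """Standardize a similarity score against a random score distribution."""
    mu = mean(random_scores)
    sigma = pstdev(random_scores)  # population standard deviation
    if sigma == 0:
        raise ValueError("random scores have zero spread")
    return (score - mu) / sigma


# Invented background scores with mean 50 and standard deviation 10.
background = [40, 60]
z = z_score(70, background)
```

A score that lies two standard deviations above the background mean gives Z = 2, whatever the size of the database the background was drawn from.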

Figure 2 formula of Z-score

Expected value

The FASTA program uses the distribution of Z-scores in the database to predict the number of sequences that would be expected, purely by chance, to produce a Z-score greater than or equal to the Z-score obtained in the search. This number is called E() or the expect value. The expect value (E) describes the number of alignments with a given score that are expected to be seen, simply by chance, when searching a database of random sequences.

As the score (S) increases, the expected value decreases exponentially. The calculation takes into account the distribution of scores in the database searched, so the E-value is database specific. The closer the E-value is to zero, the more significant the match. Virtually identical short alignments get comparatively high E-values, which matches the idea that shorter sequences have a higher probability of occurring in the database purely by chance. An increase in Z-score results in a reduction in E-value. For a search of a 10,000-entry protein database, sequences with E() values below 0.01 are almost always homologous.

$E(S) = K\,m\,n\,e^{-\lambda S}$

where
- S = similarity score of the database sequence and the query sequence (depends on the matrix used)
- m = length of the query (in nucleotides or amino acids)
- n = size of the database (in nucleotides or amino acids)
- K and λ = parameters that depend on the matrix used (e.g. BLOSUM62: K = 0.14, λ = 0.318)

Bit Score

The bit score is a log-scaled version of the score.
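The relation between raw score, bit score and expect value can be sketched numerically. The code below assumes the Karlin-Altschul form E = K m n e^(-λS) with the BLOSUM62 parameters quoted in the text (K = 0.14, λ = 0.318) and the standard conversion S' = (λS - ln K) / ln 2; the function names, the query length and the database size are made up for illustration, and this is not how the FASTA programs fit their statistics to a particular search.

```python
import math


def e_value(raw_score, m, n, K=0.14, lam=0.318):
    """Expected number of chance hits scoring at least raw_score."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K=0.14, lam=0.318):
    """Convert a raw score to bits, removing the dependence on K and lambda."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """Expect value computed from a bit score: E = m * n * 2^(-bits)."""
    return m * n * 2.0 ** (-bits)


m, n = 150, 1_000_000  # hypothetical query length and database size
e_raw = e_value(100, m, n)
e_bits = e_value_from_bits(bit_score(100), m, n)
```

Both routes give the same E-value, and each additional bit halves it, so ten extra bits reduce the expectation by a factor of 1024.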

Another formula for the expected value, used when the bit score is available:

$S' = \dfrac{\lambda S - \ln K}{\ln 2}, \qquad E = m\,n\,2^{-S'}$

where
- S = similarity score of the database sequence and the query sequence (depends on the matrix used, e.g. BLOSUM62)
- S' = bit score
- K and λ = parameters that depend on the matrix used
- m = length of the query (in nucleotides or amino acids)
- n = size of the database (in nucleotides or amino acids)

The bit score indicates how good a particular alignment is: the significance of the alignment increases as the score increases. The calculation takes into account the number of gaps and substitutions in each aligned sequence. The statistical significance of a bit score depends on the lengths of the query and library sequences and on the library size. For instance, a 1-bit increase in score gives a 2-fold reduction in expectation, while a 10-bit increase corresponds to a roughly 1000-fold reduction. Because bit scores are standardized with respect to the scoring system, they can be used to compare alignment scores from different searches.

Smith-Waterman Score

The Smith-Waterman score is used to evaluate similar regions between two nucleotide or protein sequences. Instead of looking at the sequences as a whole, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure. The higher the Smith-Waterman score, the more significant the alignment. It can be used to identify conserved domains in proteins, which may not extend over the entire sequence.

Evaluating the results

Best score
initn: 952 init1: 952 opt: 952 Z-score: 1102.1 bits: 209.9 E(): 2.7e-52

Smith-Waterman score: 952; 100.0% identity (100% similar) in 148 aa overlap (1-148:1-148)

The result above is the best score: both similarity and identity are 100%. It has an extremely high Z-score of 1102.1 and a very low expected value of 2.7e-52, which indicate very high significance; a very high bit score of 209.9, which indicates a very good alignment; and an extremely high Smith-Waterman score of 952, which indicates very strong local similarity.

Good score
initn: 438 init1: 161 opt: 418 Z-score: 540.6 bits: 105.6 E(): 5.1e-21
Smith-Waterman score: 531; 72.4% identity (74.5% similar) in 145 aa overlap (1-116:2-141)

The result above is a good score, with similarity and identity of 74.5% and 72.4% respectively. It has a high Z-score of 540.6 and a low expected value of 5.1e-21, which indicate high significance; a high bit score of 105.6, which indicates a good alignment; and a high Smith-Waterman score of 531, which indicates strong local similarity.

Mediocre score
initn: 250 init1: 107 opt: 167 Z-score: 262.4 bits: 53.3 E(): 1.6e-05
Smith-Waterman score: 233; 53.6% identity (58.0% similar) in 112 aa overlap (1-75:2-113)

The result above is a mediocre score, with similarity and identity of 58.0% and 53.6% respectively. It has a moderate Z-score of 262.4 and a moderate expected value of 1.6e-05, which indicate medium significance; a moderate bit score of 53.3, which indicates a mediocre alignment; and a moderate Smith-Waterman score of 233, which indicates intermediate local similarity.

1.1 FASTA program

Sample of the FASTA program on the EMBL-EBI website:

Figure 3 FASTA program in EMBL-EBI

Demo

There are 4 steps to use FASTA:
1. Select the databank
2. Input the protein/DNA/RNA sequence
3. Set the parameters
4. Submit the job

Step 1: Select databank
Multiple databases can be selected. The query sequence is searched against the selected databases.

Step 2: Input protein/DNA sequence
The sequence can be in GCG, FASTA, EMBL, GenBank, PIR, NBRF, PHYLIP or UniProtKB/Swiss-Prot format.

Step 3: Set Parameters

1) Choose Program:
- FASTA
- TFASTX
- TFASTY

2) Matrix used:
1. BLOSUM50
2. BLOSUM62
3. BLASTP62
4. BLOSUM80
5. PAM120
6. PAM250

Gap Open Penalty
- -12 for protein
- -16 for DNA

Gap Extend Penalty
- -2 for protein
- -4 for DNA

What is a gap?
- A maximal consecutive run of spaces in an alignment.
- It represents an atomic insertion or deletion of a substring.
- When the gap penalties are set too low, alignments containing many gaps can reach high scores.

Causes of gaps:
- A single mutation can create a gap (very common).
- Unequal crossover in meiosis can lead to insertion or deletion of strings of bases.
- DNA slippage during replication can result in the repetition of a string.
- Retrovirus insertions.
- Translocations of DNA between chromosomes.

Gaps can occur:
- before the first character of a string
- inside a string
- after the last character of a string
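To make the gap open and gap extend penalties listed above concrete, the sketch below computes the cost of a single gap. It assumes one common convention, in which a gap of length L costs open + extend * L; other programs charge the extend penalty only from the second position onward, so the exact totals here are an assumption. The function name is hypothetical.

```python
def affine_gap_penalty(length, gap_open=-12, gap_extend=-2):
    """Cost of one gap of the given length under an affine penalty.

    Assumed convention: total = gap_open + gap_extend * length.
    Defaults are the protein values quoted above.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


one_long_gap = affine_gap_penalty(3)          # a single gap of 3 residues
three_short_gaps = 3 * affine_gap_penalty(1)  # three separate 1-residue gaps
dna_gap = affine_gap_penalty(3, gap_open=-16, gap_extend=-4)
```

One gap of three residues is penalized far less than three separate single-residue gaps, which reflects the idea that a gap is one atomic insertion or deletion event.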

4) Ktup
Sets the word length used in the search. Use 2 for a protein database.

5) Expectation limits
- Expectation Upper Limit: sets the upper expectation value limit for the scores and alignments reported. Sequences with E() values below 0.01 are almost always homologous.
- Expectation Lower Limit: sets the lower expectation value limit for the scores and alignments reported. This option filters out the best matches and allows more distant relationships to be displayed.

6) Strand
Chooses which DNA strand to search with when comparing a DNA sequence against the DNA databanks. It is not required for protein/protein searches.

- Top sequence: the input sequence is searched as entered.
- Bottom sequence: the reverse complement of the input sequence is searched.

7) Histogram
Turns the histogram in the FASTA result on or off. The histogram gives a qualitative view of how well the statistical theory fits the similarity scores calculated by the program.

8) Filter
Filters out unwanted segments of the sequence, for example repeat regions in the query sequence. This reduces the reporting of unrelated sequences that match by chance.

9) Scores
- Maximum number of match summaries reported in the result
- Default value is: 50

10) Alignments
- Maximum number of match alignments reported in the result
- Default value is: 50

11) Sequence Range
Specifies a range or section of the input sequence to use in the search. Example: specifying '35-90' for an input sequence of total length 100 searches with positions 35 to 90 only. Default value is: START-END

12) Database Range
Specifies the lengths of the database sequences to be searched. For example, 100-250 searches all sequences in the database with a length between 100 and 250. Default value is: START-END

Step 4: Submit job
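Two of the Step 3 parameters, Strand and Sequence Range, are easy to illustrate in code. The sketch below shows what searching with the bottom strand and restricting the query to a range amount to. The function names are hypothetical, the range is assumed to be 1-based and inclusive, and only the four unambiguous DNA bases are handled.

```python
# Translation table for complementing A/C/G/T in either case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def bottom_strand(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def select_range(seq, spec="START-END"):
    """Return the part of seq named by a range such as '35-90'.

    Assumes positions are 1-based and inclusive; 'START-END' keeps everything.
    """
    if spec == "START-END":
        return seq
    start, end = (int(part) for part in spec.split("-"))
    return seq[start - 1:end]


query = "ATGCCGTA"
reverse_complement = bottom_strand(query)
sub_query = select_range(query, "3-5")
```

Applying `bottom_strand` twice returns the original sequence, which is a quick sanity check for any reverse-complement routine.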

COMPARISON FASTA VS BLAST

FASTA
- "Fast A", developed in 1985 for protein, later modified to conduct searches on DNA

BLAST
- Basic Local Alignment Search Tool
- Developed in 1990

SIMILARITIES
- Both programs compare biological sequences (DNA, nucleotides, amino acids, proteins) of different species and look for similarities
- BLAST uses input data in FASTA format
- Both are very fast, practical and time-saving

DIFFERENCES

FASTA
- Cannot be modified
- Better for dissimilar sequences

BLAST
- Can be modified
- Better for closely matched sequences
- BLAST is faster than FASTA
- BLAST is more accurate than FASTA
- BLAST is more versatile and more widely used than FASTA