INTRODUCTION TO BIOINFORMATICS

Molecular Biology-2017

In this section, we provide a simple introduction to using the web site of the National Center for Biotechnology Information (NCBI) to obtain sequence information.

Link to NCBI web site: http://www.ncbi.nlm.nih.gov/

GENERAL SEARCH

1. The first tool we will explore is the basic search engine. As with Google, you can enter any combination of search terms, or the specific accession number of the sequence of interest, in the search box. You can also specify which database to search from the drop-down menu to the left of the search box.

2. Let's say we are interested in finding information about myosin, a muscle protein. Enter the word myosin in the search box and then click on Search. A new page will be displayed showing the number of records found within the different databases.

3. The databases most frequently used in this course are the nucleotide and the protein databases. Click on the nucleotide database to display its list of records.

4. To refine your search, you may then choose the species or the molecule type from the menus on the left, or a specific taxon from the top organisms menu on the right. For this example, we will first choose mRNA from the menu for molecule type. Then, from the new page that is displayed, we will choose records specific to zebrafish (Danio rerio) from the top organisms menu.

5. A list of records corresponding to your search criteria will then be displayed. From there, you can search for and access the specific record of interest. The information that can be obtained from these records is explained further on in this exercise.

6. For your assignment, use this approach to find the protein accession number of the first record for actin, cytoplasmic 2 isoform 1 from the mouse (Mus musculus).

7. Use the general search engine to obtain the record with the accession number NG_009024. Once you've obtained the record, answer the following questions for your assignment: Is this a nucleotide or a protein record? What was the source of the sequence: protein, mRNA, or genomic DNA?
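The same filtered search (keyword, then molecule type, then organism) can also be written as a single Entrez query. The sketch below only builds the query URL; it does not contact NCBI. It assumes the public E-utilities "esearch" endpoint and the Entrez field tags [Organism] and [PROP]; the function name build_search_url and its parameters are hypothetical and are not part of this exercise.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NCBI E-utilities search service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(keyword, organism=None, molecule=None, db="nuccore", retmax=20):
    """Build (but do not send) an Entrez search URL.

    keyword  -- free-text search term, e.g. "myosin"
    organism -- optional species name, e.g. "Danio rerio"
    molecule -- optional molecule type, e.g. "mRNA"
    db       -- Entrez database name ("nuccore" is the nucleotide database)
    """
    terms = [f"{keyword}[All Fields]"]
    if organism:
        terms.append(f"{organism}[Organism]")
    if molecule:
        # Molecule-type filter, written as a property term (assumed syntax).
        terms.append(f"biomol_{molecule.lower()}[PROP]")
    query = " AND ".join(terms)
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": query, "retmax": retmax})


# The myosin / mRNA / zebrafish example from steps 2 to 4.
url = build_search_url("myosin", organism="Danio rerio", molecule="mRNA")
print(url)
```

Pasting the printed URL into a browser should return a list of record identifiers rather than the formatted web page, so for this exercise the web interface remains the simpler route.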

SEARCHING WITH A NUCLEOTIDE SEQUENCE

1. The most common search engine used with either nucleotide or protein sequences is the Basic Local Alignment Search Tool (BLAST). You can access this search engine either from the Popular Resources menu on the right, or through the Resource List (A-Z) menu on the left.

2. Resource List (A-Z): This page contains most of the links you will be using throughout the year.

3. Let's explore BLAST. Click on the BLAST link to open the BLAST home page. BLAST is a set of similarity search engines designed to explore all of the available sequence databases regardless of whether the query is protein or DNA.

Nucleotide BLAST (blastn) compares a nucleotide query sequence against a nucleotide sequence database.

Protein BLAST (blastp) compares an amino acid query sequence against a protein sequence database.

blastx compares a nucleotide query sequence translated in all six reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.

tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames.

tblastx compares a nucleotide query sequence translated in all six reading frames against a nucleotide sequence database dynamically translated in all six reading frames.
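To make "translated in all reading frames" concrete, here is a minimal sketch of a six-frame translation: three frames read from the sequence as given and three from its reverse complement. It assumes the standard genetic code and a sequence containing only A, C, G and T; the function names are illustrative, and this is not how BLAST itself is implemented.

```python
# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X', stops '*'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return the six translations keyed '+1', '+2', '+3', '-1', '-2', '-3'."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


# Print every frame for the short sequence used as the FASTA example in step 5.
for name, protein in six_frame_translations(
    "AACGTCGGATTCAGGTACCCAGGAAAACTACATCTC"
).items():
    print(name, protein)
```

blastx and tblastx search with all six of these translations of the query, which is why they can find a protein match even when the reading frame and strand of the coding region are unknown.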

We will first use this program to gain information on different sequences that you will be working with. Note that one of these sequences represents the plasmid insert which you must verify in lab exercise 2.

4. Click on the nucleotide BLAST (blastn) option to open the query page.

5. Before we can enter a sequence query, we must make sure that its format is compatible with the program. Most sequence analysis software can handle a format called FASTA. A FASTA file is a plain text file in which the sequence, free of numbers or any other annotation, is preceded by a single descriptive line of text. Here is an example:

>John's sequence123 (Press enter after this line)
AACGTCGGATTCAGGTACCCAGGAAAACTACATCTC

The first line of your file must begin with the symbol ">". This symbol informs the program that this line of text is for descriptive purposes only and that the sequence information starts on the next line. You can write anything on this line to identify the sequence. The next line contains the actual sequence.

6. Obtain the text document of unknown sequences available on the BIO3151 web page by following the link Sequences > Unknown genes. This document contains five sequences numbered 1-5. Convert each of these to FASTA format. You can do this in Notepad.
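The conversion in step 6 can also be done with a few lines of code instead of by hand. The sketch below assumes that the raw text contains only the sequence letters plus position numbers and white space to be discarded; the function name to_fasta and the 70-letter line width are illustrative choices, not requirements of this exercise.

```python
import re


def to_fasta(description, raw_sequence, width=70):
    """Return one FASTA record: a '>' description line, then the sequence.

    Everything that is not a letter (position numbers, spaces, line breaks)
    is removed from raw_sequence, and the result is wrapped at `width`
    letters per line.
    """
    sequence = re.sub(r"[^A-Za-z]", "", raw_sequence).upper()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + description.strip() + "\n" + "\n".join(lines) + "\n"


# Hypothetical raw text: the example sequence from step 5 with numbering added.
raw = """1 AACGTCGGAT TCAGGTACCC
21 AGGAAAACTA CATCTC"""

record = to_fasta("John's sequence123", raw)
print(record)
```

The returned text can be pasted directly into the BLAST query box or saved as a plain text file.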

7. Copy and paste the first sequence into the nucleotide BLAST query box. In the Choose Search Set menu, choose the database on which the search will be performed: select Other, and then "nucleotide collection (nr/nt)" from the drop-down menu.

8. Now choose the program that will do the search from the Program Selection menu. Choose: Somewhat similar sequences (blastn). Check the box "Show results in a new page" to display the results in a new browser window.

9. Click on BLAST. A new page will appear asking you to wait for the completion of your request. This may be quite fast or slow, depending on how heavy the demand on the NCBI server is.
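For reference, the choices made in steps 7 to 9 (query, database, program) correspond to the parameters of a BLAST request. The sketch below only assembles such a request and sends nothing. It assumes the parameter names of the NCBI BLAST URL API (CMD, PROGRAM, DATABASE, QUERY) and assumes that "nt" is the name of the nucleotide collection; treat both as assumptions to check against the current NCBI documentation. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed address of the NCBI BLAST service.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blast_submission(fasta_text, program="blastn", database="nt"):
    """Return (url, encoded_form_body) for a BLAST search request.

    Nothing is sent over the network; this only shows how the query,
    the database and the program fit together in one request.
    """
    params = {
        "CMD": "Put",          # assumed keyword for submitting a new search
        "PROGRAM": program,    # step 8: blastn
        "DATABASE": database,  # step 7: nucleotide collection (assumed name)
        "QUERY": fasta_text,   # step 7: the sequence in FASTA format
    }
    return BLAST_URL, urlencode(params)


query = ">John's sequence123\nAACGTCGGATTCAGGTACCCAGGAAAACTACATCTC\n"
url, body = build_blast_submission(query)
print(url)
print(body)
```

As on the web page, a submitted search is queued and its results have to be requested once it completes, so a real script would also need a waiting step that is not shown here.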

10. Once your request has been completed, a new page will appear indicating the results of your search.

11. Before analyzing the results, we will change the formatting options. Click on Formatting options at the top of the page. In the menu that appears, choose the option Old view and then click on Reformat.

12. The potential matches to your sequence will now be presented in three formats: first a graphical format; then, if you scroll down, a textual list of the matching records; and further down, the actual sequence alignments.

For this exercise, the format we are interested in is the list of records representing matches. Among the information that can be obtained are the following values:

Query coverage: This value indicates how much of your input sequence (the original query) aligns with the sequence record found. For instance, if the original query is 631 nucleotides long and BLAST can align all 631 nucleotides of this query against a hit, then that is 100% coverage. Remember, query coverage does not take into account the length of the hit, only the percentage of the query that aligns with the hit.

Expect value (E): This parameter describes the number of hits one can expect to see by chance when searching a database of a particular size. Essentially, the E value describes the random background noise. For example, an E value of 1 assigned to a hit means that, in a database of the current size, one might expect to see 1 match with a similar score simply by chance. The lower the E value, that is, the closer it is to zero, the more significant the match.
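The two values defined above can be illustrated with a short calculation. This is a simplified sketch rather than BLAST's exact procedure: coverage is computed from the query positions that take part in at least one alignment, and the E value uses the textbook relation E = m x n x 2^(-S'), where m is the query length, n the database length and S' the bit score. BLAST itself applies length corrections that are ignored here, and the function names are illustrative.

```python
def query_coverage(query_length, aligned_ranges):
    """Percentage of query positions covered by at least one alignment.

    aligned_ranges -- list of (start, end) query positions, 1-based, inclusive.
    Overlapping alignments are only counted once.
    """
    covered = set()
    for start, end in aligned_ranges:
        covered.update(range(start, end + 1))
    return 100.0 * len(covered) / query_length


def expect_value(bit_score, query_length, database_length):
    """Approximate number of chance hits scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# The example from the text: all 631 nucleotides of the query are aligned.
print(query_coverage(631, [(1, 631)]))

# For a fixed query and database, a higher bit score gives a lower E value.
# The database length below is an arbitrary illustrative number.
for score in (20, 40, 80):
    print(score, expect_value(score, 631, 1_000_000_000))
```

Note how the E value depends on the database size: the same alignment score becomes less significant as the database grows, because there are more opportunities for a match to arise by chance.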

Ident.: BLAST calculates the percentage identity between the query and the hit in a nucleotide-to-nucleotide alignment. How do you explain the fact that more than one sequence possesses an identity of 100%?

Note that some of the sequences represent whole genome sequences, for example the first one from this search. For this exercise you wish to obtain the sequence of the gene, not the genome. Records for genes are sometimes followed by the letter G. Notice in the above example that the record followed by a G states a 100% identity but only 42% coverage. What does that mean?

13. Click on the accession number to view the record. You should obtain a record similar to the one shown below:

[Figure: sample nucleotide sequence record, with the link used to convert to FASTA and numbered callouts 1 to 8 marking the fields of the record]
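Percentage identity is calculated only over the aligned region, which is why it has to be read together with the coverage. The sketch below uses a made-up 50-nucleotide query of which only the first 21 nucleotides align to a hit; the sequences and the function name are hypothetical and serve only to illustrate the calculation.

```python
def percent_identity(aligned_query, aligned_hit):
    """Percentage of alignment columns in which both sequences are identical.

    Both strings must have the same length; gaps are written as '-'.
    """
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(
        1 for q, h in zip(aligned_query, aligned_hit) if q == h and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


# Hypothetical query of 50 nucleotides; only positions 1 to 21 align to the hit.
query = "ACGTTGCAAGCTTAGGCATCGATTACGGCTAAGCTTCGATCGGATCCATG"
aligned_part_of_query = query[:21]
aligned_part_of_hit = query[:21]   # a perfect match over the aligned region

identity = percent_identity(aligned_part_of_query, aligned_part_of_hit)
coverage = 100.0 * len(aligned_part_of_query) / len(query)
print(identity, coverage)
```

Here the aligned region matches perfectly, yet it represents less than half of the query, so a high identity alone does not show that the whole query was found in the hit.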

14. Information that can be obtained from a nucleotide sequence record:

Definition (#1): Provides a brief description of the sequence; includes information such as the source organism, the gene name or protein name, or some description of the sequence's function.

Accession number (#2): The unique identifier for a sequence record.

Organism (#3): The formal scientific name for the source organism (genus and species).

Source (#4): Information including an abbreviated form of the organism name, sometimes followed by a molecule type.

CDS (#5): Coding sequence; the region of nucleotides that corresponds to the sequence of amino acids in a protein (the location includes the start and stop codons). By clicking on this link you may obtain the mRNA sequence from the start codon to the stop codon.

o Gene (#6): The name of the gene.
o Product (#7): The name of the gene's protein product.
o Protein_id (#8): This is the protein's accession number. By clicking on this link, you can obtain the protein record.

15. In several of the future exercises you will be required to obtain and save these sequences in FASTA format. To change the format to FASTA, choose FASTA at the top of the sequence record. You should be redirected to a page like the following one:

16. If you wish to save the sequence in this format, you can now select and copy the description line that is preceded by the symbol ">", together with the sequence, and paste them into a program such as Notepad.

17. For your assignment, obtain the following information for each of the unknown sequences on this course's web site (Sequences > Unknown genes):

Accession number
Coverage
Ident.
E value
The definition
The organism from which this sequence was obtained
The gene name
The gene's product name
The protein's accession number
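Several of the items requested in step 17 are fields of the sequence record described in step 14, and they can be picked out of the text of a record automatically. The sketch below works on a made-up, heavily shortened record: the accession numbers, gene name and product in SAMPLE_RECORD are placeholders, not real data. It assumes that each field fits on a single line, which is not always true of real records, and the function names are illustrative.

```python
import re

# Hypothetical, shortened record used only to demonstrate the parsing.
SAMPLE_RECORD = """\
LOCUS       XX000000     120 bp    mRNA    linear   VRT 01-JAN-2017
DEFINITION  Hypothetical example gene (exg1), mRNA.
ACCESSION   XX000000
SOURCE      Danio rerio (zebrafish)
  ORGANISM  Danio rerio
FEATURES             Location/Qualifiers
     CDS             10..99
                     /gene="exg1"
                     /product="example protein"
                     /protein_id="YY_000000.1"
"""


def first_match(pattern, text):
    """Return the first captured group of pattern in text, or None."""
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None


def summarize_record(text):
    """Collect the record fields numbered 1 to 8 in step 14 (source omitted)."""
    return {
        "definition": first_match(r"^DEFINITION\s+(.+)$", text),
        "accession": first_match(r"^ACCESSION\s+(\S+)", text),
        "organism": first_match(r"^\s+ORGANISM\s+(.+)$", text),
        "cds_location": first_match(r"^\s+CDS\s+(\S+)", text),
        "gene": first_match(r'/gene="([^"]+)"', text),
        "product": first_match(r'/product="([^"]+)"', text),
        "protein_id": first_match(r'/protein_id="([^"]+)"', text),
    }


info = summarize_record(SAMPLE_RECORD)
for field, value in info.items():
    print(f"{field}: {value}")
```

The coverage, identity and E value are not part of the record itself; they come from the BLAST results list and still have to be noted from there.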