Biostatistics and Bioinformatics Molecular Sequence Databases

Hubert Johnson

Description of Module
Subject Name, Paper Name, Module Name/Title
Dr. Vijaya Khader, Dr. MC Varadaraj

1. Objectives:

In the present module, students will learn about:

1. Encoding linear sequences of nucleic acids (DNA/RNA) and proteins using single-letter codes
2. Creating sequence files with Notepad in the different sequence formats required by different programs
3. International public-domain sequence archives and databases
4. Retrieval systems used by the different sequence databases
5. Browsing genomes to understand the arrangement of genes along chromosomes
6. Converting one sequence format into another for use in other sequence analysis programs

2. Concept Map

- Sequence Data Encoding
- Formats for Handling Sequence Data
- Retrieval Systems
- Sequence Archives
- Sequence Format Conversion
- Genome Browsers

3. Molecular sequence data are the known linear sequences of deoxyribonucleic acid (DNA), ribonucleic acid (RNA) and proteins. Functional information may be derived from sequence data. The sequence data also carry useful attached information about these molecules, known as annotations. The sequences and their annotations are stored in specific formats that are particular to each database, and each database has developed its own retrieval system for accessing its sequence data. This module covers the major online sequence gateways used to retrieve and browse sequence data, as well as conversion between sequence formats.

3.1. Sequence Data Encoding

Bioinformatics tools enable biochemists to derive useful information from nucleotide or protein sequence data for biochemical analyses. Nucleotide and protein sequence data are therefore an important resource for understanding the biochemical function of genes and proteins. Linear sequences are represented by single-letter codes for the residues, and the sequence databases store nucleotide and protein sequences as linear strings of these codes.

Recommended single-letter codes for residues in nucleic acids (DNA and RNA):

Symbol    Represented base
A         Adenine
C         Cytosine
G         Guanine
T         Thymine
M         A/C
R         A/G
W         A/T
S         C/G
Y         C/T
K         G/T
V         A/C/G
H         A/C/T
D         A/G/T
B         C/G/T
N or X    A/T/C/G, i.e. any base

Recommended single-letter codes for residues in proteins, i.e. amino acids:

Symbol    Amino acid       Logic used to assign the single-letter code
C         Cysteine         Only one amino acid begins with this letter
H         Histidine        (as above)
I         Isoleucine       (as above)
M         Methionine       (as above)
S         Serine           (as above)
V         Valine           (as above)
A         Alanine          More than one amino acid begins with this letter, so the letter is assigned to the most commonly occurring amino acid
G         Glycine          (as above)
L         Leucine          (as above)
P         Proline          (as above)
T         Threonine        (as above)
F         Phenylalanine    Phonetically suggestive
R         Arginine         (as above)
Y         Tyrosine         (as above)
Q         Glutamine        Phonetically suggestive ("Qlutamine")
W         Tryptophan       Double ring present in the side chain
D         Aspartic acid    A letter close to the initial is used (near A)
E         Glutamic acid    A letter close to the initial is used (near G)
K         Lysine           A letter close to the initial is used (near L)
N         Asparagine       Contains N, which is not assigned to any other amino acid
B         D/N
Z         E/Q
X         Any amino acid

3.2. Formats for handling sequence data

Bioinformatics software packages and online tools can read sequence data only in the standard formats they recognise. The situation resembles ordinary documents: a file saved in MS Word format opens in MS Word but not in Adobe Reader, because Adobe Reader does not support the MS Word format, and likewise MS Word cannot open a file saved in Adobe PDF format. A given program opens only files in the formats it supports. Note, however, that a sequence saved in an MS Word file is not in any of the sequence formats supported by bioinformatics tools. Several specific sequence formats are available for saving and storing sequences.

To save a sequence in a file, we need to provide two values in the Save As dialog box. The first is the file name, which specifies the primary name of the file; the second is the Save as type, which specifies the extension of the file. The two names are joined automatically with a dot (full stop or period).
For example, suppose that in Notepad, which comes with the Windows operating system, we enter the sequence of a nucleic acid or protein as plain text, choose Save As from the File menu, give mysequence as the file name and keep the default Save as type, Text Documents (*.txt). The sequence is then saved as the file mysequence.txt. Sequence analysis programs do not recognise a file with the extension .txt, and even if mysequence.txt is opened with such a program, the plain text in it is not recognised because it is not in a standard sequence format. The plain sequence in mysequence.txt therefore cannot be read or used by sequence analysis software. Some online sequence analysis programs do, however, allow a plain-text sequence to be pasted into the input text box.

To understand what a sequence format is, let us look at the most commonly used standard sequence format, the FASTA format. A sequence in FASTA format can be saved even with Notepad or any other text editor. There are two steps: first enter the sequence and related information in Notepad, then save the file with the FASTA extension .fa, so that it can be read by all software packages that require sequence information in FASTA format.

The sequence information is itself entered in two steps. The first is to enter the first line, known as the comment line, which starts with a greater-than sign (>) followed by an identifying name or comment for the sequence. For a sequence named mysequence, we enter the following in Notepad:

    >mysequence

We can continue this comment line with any other information, such as annotation features. Keep typing words or text on the same line without pressing the Enter key. The entered information continues on one line, so the beginning of the line scrolls out of view. To view the whole line in one window, select the Word Wrap command from the Format menu.


Word Wrap displays the complete entry as one paragraph spread over multiple rows, three rows in the present case. The comment line is therefore a single paragraph which may occupy several rows on the screen but is actually a single line. This comment line contains three pieces of annotation information separated by the delimiter character \. The three pieces of information are the name of the sequence, the source from which the sequence was isolated and the technique used to sequence this protein. After entering this information, press the Enter key so that the cursor moves to the next line. On this next line (equivalent to the next paragraph), enter the sequence of the protein or nucleic acid, here the protein sequence THISISTHESEQUENCEOFMYPROTEIN. Then save the file: open the Save As dialog box, enter the file name mysequence.fa, select All Files under Save as type and click Save.
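The same two-line file can also be produced without Notepad. The following is a minimal sketch in Python, not part of the original module: the helper names write_fasta and read_single_fasta are hypothetical, the annotation text in the comment line is made up for illustration, and the use of | as the delimiter is an assumption (any delimiter can be substituted).

```python
import os
import tempfile


def write_fasta(path, header, sequence):
    """Write one FASTA record: a '>' comment line, then the sequence line."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write(">" + header + "\n")
        handle.write(sequence + "\n")


def read_single_fasta(path):
    """Return (header, sequence) for a file holding exactly one record."""
    with open(path, encoding="ascii") as handle:
        lines = [line.strip() for line in handle if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA file: the first line must start with '>'")
    return lines[0][1:], "".join(lines[1:])


# Hypothetical annotation text and delimiter; only the name "mysequence"
# and the protein sequence are taken from the walkthrough.
HEADER = "mysequence|example source organism|example sequencing technique"
SEQUENCE = "THISISTHESEQUENCEOFMYPROTEIN"

# Save under the .fa extension in a temporary folder, then read it back.
fasta_path = os.path.join(tempfile.mkdtemp(), "mysequence.fa")
write_fasta(fasta_path, HEADER, SEQUENCE)
header, sequence = read_single_fasta(fasta_path)
print(header)
print(len(sequence))  # number of residues in the sequence
```

Because the file is plain ASCII text, it remains readable in Notepad or any other text editor.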


To open the saved file mysequence.fa, select All Files in the Open dialog box and click Open. What matters in the Open dialog box is the file-type dropdown list: always select All Files there if FASTA is not listed as a choice. The saved file then opens.

The file mysequence.fa can now be used with any software package that recognises the FASTA format. It can also be used with any text editor, such as MS Word, that recognises plain text stored in ASCII/ANSI format. The file mysequence.fa can therefore be opened with MS Word after selecting it in the Open dialog box. The file is displayed in the text window, where we can select the sequence. After selecting the sequence, click on the word count in the status bar at the bottom to open the Word Count dialog box. The values in the Word Count dialog box show that this protein sequence is 28 amino acids long.

Besides a single sequence, any number of sequences can be stored in one file in FASTA format. Simply press the Enter key after the sequence to move to the next line, add another comment line starting with the > sign, press Enter again and enter the second sequence on the following line.


In this way one can concatenate in one FASTA file as many sequences as one wants to analyse. This is useful for pairwise and multiple sequence alignment as well as for phylogenetic analysis.

3.3. Molecular Sequence Archives

The International Nucleotide Sequence Database Collaboration (INSDC) is the main archive of nucleotide sequences, with three collaborators: GenBank at NCBI, the DNA DataBank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL). These three organizations exchange data on a daily basis. NCBI integrates the nucleotide sequence database GenBank with other gene information databases so that they can be searched in an integrated manner.

The GenBank sequence record format can be seen at http://www.ncbi.nlm.nih.gov/genbank/samplerecord/. The NCBI nucleotide sequence gateway can also be reached directly at www.ncbi.nlm.nih.gov/nuccore/.

Similarly, the Universal Protein Resource (UniProtKB) is a comprehensive archive of protein sequences. An independent protein sequence gateway is also available at NCBI. The ExPASy (Expert Protein Analysis System) server at SIB integrates the UniProtKB database with other protein information databases so that they can be searched in an integrated way.

In addition to the sequence gateways, which provide access and retrieval separately for nucleotide and protein sequences, there are integrated genome browsers for individual organisms, where both gene and protein sequences are available together with additional annotated information. Both the sequence retrieval systems (nucleotide or protein sequences with annotations) and the integrated genome browsers (nucleotide and protein sequences with annotations) are discussed next.

3.3.1. Retrieval Systems

Each sequence archive has its own retrieval system. A partial list:

- Entrez (pronounced as Aahntray) at NCBI
- Expert Protein Analysis System (ExPASy) at SIB
- SRS at EMBL
- DBGET at DDBJ

3.3.1.1. Entrez

Entrez is NCBI's primary text search and retrieval system (gateway); Entrez help is available on the NCBI website. In the present example we will retrieve and download the nucleotide and protein sequences for Hpr from Enterococcus faecalis, a gene encoding an 88-amino-acid phosphocarrier protein. We have two key pieces of information for the search: the organism, Enterococcus faecalis, and the name, Hpr. Visit NCBI, select Nucleotide in the dropdown list of databases on the left, enter Hpr from Enterococcus faecalis in the text box and click Search.

The search returns a long list of results to browse. Therefore use the Advanced search feature available below the search text box and, in the Builder section of the ensuing page, select the fields to search and the values to be matched: select Title and enter Hpr, then select Organism and enter Enterococcus faecalis, and click the Search button. The ensuing results page shows only one record, in GenBank format.
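The fielded search built above (Title and Organism combined with AND) can also be written as a single Entrez query string. The sketch below is an illustration only and is not part of the original module: it assumes NCBI's E-utilities ESearch service as the programmatic route, the function names are hypothetical, and the code only builds the URL without sending any request.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_entrez_term(title, organism):
    """Combine a Title field and an Organism field with AND, as in the Builder."""
    return f"{title}[Title] AND {organism}[Organism]"


def build_esearch_url(database, term):
    """Return an ESearch URL that looks up `term` in `database`."""
    return ESEARCH_URL + "?" + urlencode({"db": database, "term": term})


# "nuccore" is the Entrez name of the Nucleotide database.
term = build_entrez_term("Hpr", "Enterococcus faecalis")
search_url = build_esearch_url("nuccore", term)
print(term)
print(search_url)
```

The query string printed first can also be pasted directly into the NCBI search box instead of using the Builder.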

The GenBank format has three sections. The first is the HEADER section, with general information about the locus, source organism, literature references, etc. The second is the FEATURES section, with gene and coding sequence (CDS) information and external database (db_xref) links: CAA79533 for the NCBI protein database and P07515 for the UniProtKB/SwissProt protein database. One can click on these links to reach the protein sequences. The third is the sequence section.

Now, to download the nucleotide sequence in FASTA format, open the download options on the record page and select FASTA as the desired format.


Click on the Create File button and save the file in the Save As dialog box, entering a full name (such as mysequence.fa) and selecting All Files in the Save as type dropdown list. Had any other format been selected, say GenBank, we would likewise have entered a full name (such as GenBankHprProteinSequence.gbk) and selected All Files in the Save as type dropdown list before clicking Save in the Save As dialog box. Now click on Graphics to change the display, and in the window that appears click on the Tools button to expand the list.

This page provides tools for BLAST and primer search as well as for downloading the sequence. Clicking on the external database (db_xref) link CAA79533 for NCBI protein in the features section will take you to the protein sequence entry at NCBI. The features section of that record lists important sites by residue number. Clicking on the external database (db_xref) link to CDD in that record will open the conserved protein domain family entry in the CDD database at NCBI.


CDD is a protein annotation resource that consists of conserved domains in protein sequences; it explicitly defines domain boundaries and provides insights into the relationships from sequence to structure and then to function. Clicking on the external database (db_xref) link P07515 for UniProtKB/SwissProt in the features section will take you to the protein sequence entry in the UniProtKB protein database. The features section of this record lists important sites by residue number. The most important element of this page is the Display menu, from which one can jump to any feature with a single click. The features include function, names & taxonomy, subcellular location, post-translational modifications & processing, interactions with other proteins, 3-D structures, conserved families and domains, sequence & external links to other sequence databases, and publications & literature information.

3.3.1.2. ExPASy

ExPASy (Expert Protein Analysis System) is the gateway for all protein sequence information available at UniProtKB. Before 2002, PIR produced the Protein Sequence Database (PIR-PSD), SIB produced the manually curated SwissProt, and EMBL produced TrEMBL, a database of computationally translated coding sequences awaiting manual annotation for inclusion in SwissProt. In 2002 the three institutes pooled their resources and produced UniProtKB, which has two components. UniProtKB/SwissProt is the manually annotated component: it contains manually reviewed and annotated proteins with information extracted from the literature and curator-evaluated computational analysis. UniProtKB/TrEMBL, on the other hand, contains computationally analysed proteins that await manual review and annotation with information extracted from the literature before their transfer into the UniProtKB/SwissProt component. Now let us download the Hpr protein of Enterococcus faecalis from the UniProtKB database gateway.

Click on Reviewed (5) to display only the SwissProt sequences. To download the sequence in FASTA format, adjust the settings in the Download tab and click Go.
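The FASTA downloads described in this section, from NCBI and from UniProtKB, can also be expressed as direct download addresses. The sketch below is an illustration only and is not part of the original module: it assumes the NCBI E-utilities EFetch service and the UniProt REST address pattern, the function names are hypothetical, and the code only builds the URLs without sending any request. The accessions CAA79533 and P07515 are the db_xref links of the Hpr record.

```python
from urllib.parse import urlencode

# Assumed service addresses: NCBI E-utilities EFetch and the UniProt REST API.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
UNIPROT_URL = "https://rest.uniprot.org/uniprotkb"


def ncbi_fasta_url(database, accession):
    """Return an EFetch URL asking for one record as plain-text FASTA."""
    params = {
        "db": database,
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def uniprot_fasta_url(accession):
    """Return the address of one UniProtKB entry in FASTA format."""
    return f"{UNIPROT_URL}/{accession}.fasta"


ncbi_url = ncbi_fasta_url("protein", "CAA79533")
uniprot_url = uniprot_fasta_url("P07515")
print(ncbi_url)
print(uniprot_url)
```

Opening either address in a browser, or saving its contents to a file with the extension .fa, should give the same kind of FASTA text that the Download tab produces.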


The browser window then displays the retrieved FASTA sequence.

3.3.2. Genome Browsers

Since in the present case we are specifically interested in Enterococcus faecalis, we will try to get the nucleic acid and protein sequences, as well as the associated information, for Enterococcus faecalis using a genome browser. Search for Enterococcus faecalis genome browser on Google and click on the first link to reach the Enterococcus faecalis genome browser page. This is a bacterial genome browser page where we can browse the complete genomes of various bacteria and archaea, and where we can change to other organisms.


However, without changing the group and genome organism, enter Phosphocarrier protein Hpr in the search text box and press the Enter key. You will reach the gene EF0709, encoding the phosphocarrier protein Hpr, displayed in the genome browser window. Bring your mouse over the gene number displayed on the left side and then over the corresponding gene displayed next to it. Now click on the gene to reach a page with a link to all sequences for the EF0709 gene. Click on predicted protein, and your browser will show the following protein sequence in FASTA format. Copy the complete FASTA sequence and save it as EfaecalisHpr.FA using Notepad.

>EF0709 length=88
MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGV
MSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE
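The record above wraps its sequence over two lines, and a FASTA file may hold several such records. The following minimal sketch, which is not part of the original module, shows one way to read such text in Python and to check the length=88 stated in the comment line; the function names parse_fasta and declared_length are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples.

    Sequence lines that belong to one record are joined, so a sequence
    wrapped over several lines comes back as a single string.
    """
    records = []
    header = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new comment line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def declared_length(header):
    """Return the integer after 'length=' in a header, or None if absent."""
    for field in header.split():
        if field.startswith("length="):
            return int(field[len("length="):])
    return None


EF0709_FASTA = """>EF0709 length=88
MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGV
MSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE
"""

records = parse_fasta(EF0709_FASTA)
header, sequence = records[0]
# Print the gene name, the counted length and the declared length.
print(header.split()[0], len(sequence), declared_length(header))
```

Counting the residues this way replaces the Word Count check used earlier and confirms that the sequence has the 88 amino acids stated in its comment line.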

3.4. Interconverting sequence formats

Sequence formats were designed by specific database developers, groups or companies to hold the sequence data and other information about the sequence for use in their own programs and software packages. There are several sequence analysis software packages and online sequence analysis tools, and a given package or tool supports only some recognised standard formats. Many sequence formats therefore exist, but a few internationally recognised standard formats are much more common than the others. Almost every sequence database, such as GenBank, EMBL, SwissProt or PIR, stores its data in its own format but also allows sequence data to be downloaded in additional formats. If the sequence data are not available in the desired format, we can download them in the database's own format and convert them to another format for use with the desired sequence analysis package. To convert a sequence from one format to another, go to the Sequence Format Converters web page.


Now choose Launch EMBOSS Seqret and follow the three steps in the browser window that appears. The first step is to upload the already saved file GenBankHprProteinSequence.gbk, which is in GenBank format; then choose to convert it to SwissProt entry format (swissnew) and click the Submit button. The resulting window displays the converted sequence of the histidine-containing phosphocarrier protein Hpr from Enterococcus faecalis, which can be downloaded and saved.

This site also provides the ReadSeq program for sequence conversion, with several input and output options. In addition, the site provides MView, a web interface that transforms a sequence similarity search result into a multiple sequence alignment or reformats a multiple sequence alignment using the MView program. Another implementation of EMBOSS Seqret is also available online: paste the FASTA sequence in the text box, select the input and output sequence formats from the dropdown lists and click the Submit Request button.


The result appears in the browser window, which displays the sequence of the histidine-containing phosphocarrier protein Hpr from Enterococcus faecalis in SwissProt format.
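To make concrete what a format converter does, the sketch below turns a GenBank-style flat-file record into FASTA in Python. It is not part of the original module and is not how EMBOSS Seqret or ReadSeq are implemented: it handles only this one direction, keeps only the LOCUS name and the ORIGIN sequence, and discards all annotations. The function name genbank_to_fasta and the sample record are hypothetical.

```python
def genbank_to_fasta(genbank_text, line_width=60):
    """Convert one GenBank-style flat-file record to a FASTA record."""
    name = None
    residues = []
    in_origin = False
    for line in genbank_text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            if len(fields) > 1:
                name = fields[1]  # the locus name follows the LOCUS keyword
        elif line.startswith("ORIGIN"):
            in_origin = True  # sequence lines start after this keyword
        elif line.startswith("//"):
            break  # end-of-record marker
        elif in_origin:
            # Keep letters only: this drops the position numbers and spaces.
            residues.extend(ch for ch in line if ch.isalpha())
    if name is None or not residues:
        raise ValueError("no LOCUS name or ORIGIN sequence found")
    sequence = "".join(residues).upper()
    wrapped = [sequence[i:i + line_width] for i in range(0, len(sequence), line_width)]
    return ">" + name + "\n" + "\n".join(wrapped) + "\n"


# Hypothetical, heavily shortened record used only for illustration.
SAMPLE_RECORD = """LOCUS       EXAMPLE1                  12 bp    DNA     linear   UNK
DEFINITION  Hypothetical record used only for illustration.
ORIGIN
        1 atggaaaaga aa
//
"""

fasta_text = genbank_to_fasta(SAMPLE_RECORD)
print(fasta_text, end="")
```

A full converter such as Seqret also carries annotations across formats where the target format supports them, which this sketch deliberately leaves out.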

4. Summary

In this lecture we learnt about:

- Encoding linear sequences of nucleic acids (DNA/RNA) and proteins using single-letter codes
- Creating sequence files with Notepad in the different sequence formats required by different programs
- International public-domain sequence archives and databases
- Retrieval systems used by the different sequence databases
- Browsing genomes to understand the arrangement of genes along chromosomes
- Converting one sequence format into another for use in other sequence analysis programs
