12. Key features involved in building biological databases



Central to the discipline of bioinformatics is the need to store biological information systematically in structured databases. The first databases were really just simple formatted text files, organised so that particular records within them could be easily identified and linked. However, as the complexity and types of available biological data grew, and biologists wanted to ask more complex questions, database architectures also became more sophisticated. This topic guide looks at a range of aspects involved in creating biological databases, and at how these may have to evolve to meet future needs as new technologies give rise to new types of data.

On successful completion of this topic you will:
- understand key features involved in building biological databases (LO3).

To achieve a Pass in this unit you need to show that you can:
- summarise the design and management of biological databases (3.1)
- identify a range of records within a file (3.2)
- discuss the need for biological databases to store, organise and index basic biological processes (3.3)
- discuss the nature of new data available and the types of database and resources that might be used (3.4).

Key terms
Coding region: The portion of an mRNA sequence that is translatable into a polypeptide.
Flat-file database: A plain-text file containing a number of data entries that lack structured interrelationships.

1 Building biological databases

Complete genome sequencing was a major achievement. However, just amassing more data does not instantly make us more knowledgeable or provide miraculous understanding of the information we are collecting. Gaining biological and biomedical insights from raw genomic data is a huge undertaking. For example, as part of the process:
- genes must be located and their structures properly assembled
- coding regions must be translated
- functions must be assigned to genes and their products
- disease associations must be discovered, etc.

These tasks depend on using the right computational tools and finding the right balance between human and machine input.

Online access to databases gave scientists the ability to use public data in their own private research projects. Databases thus became invaluable as repositories of biological information. They are also important because their annotations allow logical connections to be made to related information in different resources. Annotations are the intelligence or clues we attach to raw data to make them meaningful to, and reusable by, other researchers. For example, linear strings of nucleotide bases or amino acid residues are virtually useless on their own, but allied with information about their evolutionary relationships, biological functions, roles, interactions, disease associations, etc., they become elements of knowledge.

The evolution of biological flat-file databases

Database annotations add value to raw data, in principle allowing them to be reused quickly and conveniently. The more annotations there are, the richer the database content.
The problem is, the more information added, the greater the need for disciplined approaches to data archiving: if computers are to be able to access particular annotations reliably, they must be stored in a structured way. This raises two questions: (i) what kinds of annotation are crucial, and (ii) how should they be organised?

We already saw that, for sequence data, adding notes about potential biological relationships, functions, roles, etc., is useful. To add further value, it is also helpful to add database-specific details, such as when the sequence was submitted and when the database entry was last updated. Links to relevant scientific literature (for example, to an article that describes the biological function of a sequence) and cross-references to information in related databases are also informative.

Structuring such information sensibly and systematically to facilitate computer access is challenging. Plain text, like the page you are reading now, is accessible to humans, but has no inherent structure that a computer can interpret. Nevertheless, the earliest biological databases were created as plain-text files, or flat-files. This meant that particular pieces of information had to be pinpointed with specific tags to help computers identify the types of data being stored in those parts of the file. Inevitably, several different flat-file formats evolved to store different types of biological data. Of these, one particular format became popular (because its structure was relatively simple) and was adapted for a variety of different data-types by a number of databases that are still in use today (e.g., Swiss-Prot, TrEMBL, PROSITE). This simple flat-file format was the one originally devised to store nucleotide sequences in the EMBL database. The figure on creating a flat-file database gives a flavour of how such a database is constructed.

Figure: Creating a flat-file database. Numerous plain-text or flat-files are appended to create a flat-file database. Each file, or database record, contains different data fields, each of which is identified by a specific tag. Here, the zoomed-in section shows a variety of tags found in the EMBL flat-file format, exemplified with a range of fields typical of UniProtKB entries.

ID   A human-readable identifier
AC   A computer-readable code
DT   Date of creation of database record
DE   A descriptive title for the entry
GN   Gene name
OS   Organism source details
OC   Organism classification
RN   Cross references to publications
CC   Description of the function, etc.
DR   Cross-links to related databases
KW   Keywords
FT   Table listing sequence features
SQ   Sequence details

Key terms
Rhodopsin: A light-sensitive biological pigment, found in the rod-shaped photoreceptor cells of the retinas of most vertebrates, that mediates vision in dim light; rhodopsin belongs to the superfamily of G protein-coupled receptors to which it gives its name.
PubMed: An online interface to millions of biomedical literature citations from the MEDLINE database, from life science journals, online books, etc.; PubMed is a service of the National Center for Biotechnology Information (NCBI).

Database fields and tags

The EMBL flat-file format uses a series of two-letter tags to describe the data stored on each line of the file, as shown in the figure illustrating the flat-file format of a UniProtKB/Swiss-Prot entry.
The file begins with an identifying (ID) code (here, OPSD_HUMAN) and an accession (AC) number (here, P08100). The AC number is designed for computers to read; the ID code is more meaningful to humans (in this case, OPSD_HUMAN denotes human rhodopsin). Together, the AC and ID codes specify a given database entry. In principle, the AC number is invariant, so that this sequence can always be tracked in any version of the database.

Other important pieces of information within the flat-file include:
- DT, the date a sequence entered the database and when changes were last made to its entry
- DE, the description or title of the stored entity (here, the protein rhodopsin)
- GN, the source gene name (here, RHO)
- OS, a more precise specification of the organism species (here, Homo sapiens)
- OC, a more precise specification of the organism classification (Eukaryota, Metazoa, Chordata, etc.).

In addition, the file includes bibliographic citations:
- RN is the reference number
- RP gives the subject
- RM gives the literature database (PubMed) cross-reference
- RA lists the authors
- RL gives the place of publication.

Figure: Illustration of the flat-file format of a UniProtKB/Swiss-Prot entry.

ID   OPSD_HUMAN STANDARD; PRT; 348 AA.
AC   P08100;
DT   01-AUG-1988 (REL. 08, CREATED)
DT   01-AUG-1988 (REL. 08, LAST SEQUENCE UPDATE)
DT   01-MAR-1992 (REL. 21, LAST ANNOTATION UPDATE)
DE   RHODOPSIN.
GN   RHO.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RM
RA   NATHANS J., HOGNESS D.S.;
RL   PROC. NATL. ACAD. SCI. U.S.A. 81: (1984).
RN   [3]
RP   VARIANTS RETINITIS PIGMENTOSA.
RM
RA   DRYJA T.P., HAHN L.B., COWLEY G.S., MCGEE T.L., BERSON E.L.;
RL   PROC. NATL. ACAD. SCI. U.S.A. 88: (1991).
CC   -!- FUNCTION: VISUAL PIGMENTS ARE THE LIGHT-ABSORBING MOLECULES THAT
CC       MEDIATE VISION. THEY CONSIST OF AN APOPROTEIN, OPSIN, COVALENTLY
CC       LINKED TO CIS-RETINAL.
CC   -!- TISSUE SPECIFICITY: ROD SHAPED PHOTORECEPTOR CELLS WHICH MEDIATES
CC       VISION IN DIM LIGHT.
CC   -!- DISEASE: AUTOSOMAL DOMINANT RETINITIS PIGMENTOSA CAN BE DUE TO A
CC       DEFECT IN RHO. PATIENTS TYPICALLY HAVE NIGHT VISION BLINDNESS AND
CC       LOSS OF MIDPERIPHERAL VISUAL FIELD; AS THEIR CONDITION PROGRESSES,
CC       THEY LOSE THEIR FAR PERIPHERAL VISUAL FIELD AND EVENTUALLY CENTRAL
CC       VISION AS WELL.
CC   -!- SIMILARITY: TO ALL OTHER G-PROTEIN COUPLED RECEPTORS. STRONGEST TO
CC       ALL OTHER OPSINS.
DR   EMBL; K02281; HSOPS.
DR   MIM; ; NINTH EDITION.
DR   PROSITE; PS00237; G_PROTEIN_RECEPTOR.
DR   PROSITE; PS00238; OPSIN.
KW   PHOTORECEPTOR; RETINAL PROTEIN; TRANSMEMBRANE; GLYCOPROTEIN; VISION;
KW   PHOSPHORYLATION; LIPOPROTEIN; G-PROTEIN COUPLED RECEPTOR; ACETYLATION;
KW   RETINITIS PIGMENTOSA.
FT   DOMAIN 1 36 EXTRACELLULAR.
FT   TRANSMEM
FT   DOMAIN CYTOPLASMIC.
FT   DOMAIN EXTRACELLULAR.
FT   TRANSMEM
FT   DOMAIN CYTOPLASMIC.
FT   MOD_RES 1 1 ACETYLATION (BY SIMILARITY).
FT   CARBOHYD 2 2 BY SIMILARITY.
FT   BINDING RETINAL CHROMOPHORE.
FT   LIPID PALMITATE (BY SIMILARITY).
FT   DISULFID BY SIMILARITY.
FT   VARIANT T -> M (IN RETINITIS PIGMENTOSA).
FT   VARIANT P -> S (IN RETINITIS PIGMENTOSA).
SQ   SEQUENCE 348 AA; MW; CN;
     MNGTEGPNFY VPFSNATGVV RSPFEYPQYY LAEPWQFSML AAYMFLLIVL GFPINFLTLY
     VTVQHKKLRT PLNYILLNLA VADLFMVLGG FTSTLYTSLH GYFVFGPTGC NLEGFFATLG
     GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLAGWSRYIP
     EGLQCSCGID YYTLKPEVNN ESFVIYMFVV HFTIPMIIIF FCYGQLVFTV KEAAAQQQES
     ATTQKAEKEV TRMVIIMVIA FLICWVPYAS VAFYIFTHQG SNFGPIFMTI PAFFAKSAAI
     YNPVIYIMMN KQFRNCMLTT ICCGKNPLGD DEASATVSKT ETSQVAPA
//
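To see why the two-letter tags matter to software, it can help to sketch how a program might read such a record. The Python example below is a minimal illustration, not the parser used by any real database: it assumes that each line starts with its tag followed by three spaces, that untagged lines hold sequence residues, and that `//` ends the entry. The function name `parse_record` and the abbreviated record (a few lines taken from the rhodopsin entry above, with the sequence cut to its first 20 residues) are chosen purely for illustration.

```python
from collections import defaultdict

# Abbreviated record based on the rhodopsin entry in the figure above.
# Assumption: tag in columns 1-2, content from column 6 onwards.
RECORD = """\
ID   OPSD_HUMAN STANDARD; PRT; 348 AA.
AC   P08100;
DE   RHODOPSIN.
GN   RHO.
OS   HOMO SAPIENS (HUMAN).
DR   EMBL; K02281; HSOPS.
DR   PROSITE; PS00237; G_PROTEIN_RECEPTOR.
DR   PROSITE; PS00238; OPSIN.
SQ   SEQUENCE 348 AA;
     MNGTEGPNFY VPFSNATGVV
//
"""


def parse_record(text):
    """Group the lines of one flat-file record by their two-letter tag."""
    fields = defaultdict(list)
    residues = []
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-entry marker
            break
        tag, content = line[:2], line[5:].strip()
        if tag.strip() == "":            # untagged lines carry the sequence
            residues.append(content.replace(" ", ""))
        else:
            fields[tag].append(content)
    parsed = dict(fields)
    parsed["sequence"] = "".join(residues)
    return parsed


entry = parse_record(RECORD)
accession = entry["AC"][0].rstrip(";")
# Each DR line becomes a (database, accession, identifier) tuple.
cross_refs = [
    tuple(part.strip() for part in dr.rstrip(".").split(";"))
    for dr in entry["DR"]
]
print(accession, entry["DE"][0], cross_refs)
```

Because every field is addressed by its tag, the program never has to guess where the gene name or the cross-references are; this is the sense in which tags make plain text accessible to computers.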

Key terms
MIM or OMIM (Online Mendelian Inheritance in Man): A comprehensive database of human genes and genetic disorders.
Transmembrane domain: A hydrophobic segment of amino acids within a protein that crosses a membrane.
Single-letter code: Letters of the alphabet used to denote the amino acids (A for alanine, P for proline, V for valine, etc.).

A rich seam of annotation is stored in the comment (CC) field. In this example, we learn about the protein's function, tissue-specificity, disease associations, family relationships, etc. To facilitate swift computational processing of the file, many of the terms used here are also included as keywords (KW). Links to related information in other databases are made in the DR lines (such as here to EMBL, MIM and PROSITE). Further enriching the entry, various characteristics of the sequence itself are documented in the Feature Table (FT): for example, here we learn about the locations of the protein's transmembrane (TM) domains and functional sites (lipid and carbohydrate attachment sites, other binding sites, etc.), and about its sequence variants. Finally, the sequence is stored in the SQ field using the single-letter code, together with attributes such as its length and molecular weight. The entry terminates with the // symbol.

Activity
The link shows the history of the sequence of human insulin from its earliest Swiss-Prot entry. Scroll down and click on the fifth version (5.txt). What is the function of the protein? With what disease is the protein associated? How many amino acid residues are there in the active hormone?

Database interoperability: indexing biological data

As we have already seen, storing information in databases is not very useful unless computers can access the data and help humans to interrogate and analyse the knowledge they contain. Achieving this requires adherence to standard data formats. For example, for protein and nucleotide sequences using the EMBL format, use of a common Feature Table format helps to improve data consistency and reliability, and facilitates database interoperation. Regulating the content, and the vocabulary and syntax used to describe the documented features, helps to ensure that the data can be readily accessed and manipulated by computer software.

The principal means by which computers access database information is via their entries' unique AC numbers and ID codes. This allows data from very different resources to be connected, whether from a nucleotide or protein sequence database, a protein family or structure database, a literature database, and so on. The more internal cross-references a database stores, the greater the web of connectivity that is possible from it.
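The idea of using AC numbers as points of access can be sketched in a few lines of Python. The example below is a toy illustration under stated assumptions: the two "databases" are short strings rather than real files, the record `DEMO_HUMAN` with accession `P00000` is hypothetical, and the layout of the reciprocal DR line in the PROSITE-style record is simplified for illustration rather than copied from the real PROSITE format. The helper names `build_index`, `fetch` and `linked_accessions` are likewise invented here.

```python
# Toy flat files; contents are abbreviated and partly hypothetical.
SWISSPROT = (
    "ID   OPSD_HUMAN\n"
    "AC   P08100;\n"
    "DR   PROSITE; PS00237; G_PROTEIN_RECEPTOR.\n"
    "//\n"
    "ID   DEMO_HUMAN\n"          # hypothetical second record
    "AC   P00000;\n"
    "//\n"
)
PROSITE = (
    "ID   G_PROTEIN_RECEPTOR\n"
    "AC   PS00237;\n"
    "DR   SWISS-PROT; P08100; OPSD_HUMAN.\n"   # simplified reciprocal link
    "//\n"
)


def build_index(text):
    """Map each accession number to the offset where its record starts."""
    index, offset, record_start = {}, 0, 0
    for line in text.splitlines(keepends=True):
        if line.startswith("AC"):
            for ac in line[5:].split(";"):
                if ac.strip():
                    index[ac.strip()] = record_start
        offset += len(line)
        if line.startswith("//"):      # next record begins after this line
            record_start = offset
    return index


def fetch(text, index, accession):
    """Jump straight to one record instead of scanning the whole file."""
    start = index[accession]
    end = text.find("\n//", start)
    return text[start:end + 1]


def linked_accessions(record, database):
    """Collect the accession numbers in DR lines that point at `database`."""
    links = []
    for line in record.splitlines():
        if line.startswith("DR"):
            parts = [p.strip() for p in line[5:].rstrip(".").split(";")]
            if parts[0] == database:
                links.append(parts[1])
    return links


sp_index = build_index(SWISSPROT)
ps_index = build_index(PROSITE)
rhodopsin = fetch(SWISSPROT, sp_index, "P08100")
targets = linked_accessions(rhodopsin, "PROSITE")
pattern = fetch(PROSITE, ps_index, targets[0])
print(pattern)
```

The index lets a query go directly to the relevant record, and the DR line supplies the accession number needed to look up the related record in the second resource. Note that the link works only as long as the accession number stored in the DR line still exists in the target index.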

Figure: Illustration of flat-file indexing. Data fields in flat-file databases can be linked via their two-letter tags. The main points of connectivity are the accession number (AC) or identifier (ID) tags. Here, literature (RP) and database cross-references (DR) in UniProtKB/Swiss-Prot are linked to MEDLINE via a PubMed ID (PMID), to the PDB via the PDB ID, to EMBL via the EMBL AC tag and to PROSITE via the PROSITE AC tag. Reciprocal links from EMBL and PROSITE link back to UniProtKB/Swiss-Prot via the Swiss-Prot AC tag. (Diagram panels: EMBL, PROSITE, UniProtKB/Swiss-Prot, PDB and MEDLINE records for the example entry ABCD_YEAST, AC P02700.)

Key terms
Flat-file index: An address or set of coordinates that allows query software to access specific parts of a flat-file database by means of designated tags.
Interoperability: The ability of software systems or databases to communicate or exchange information seamlessly (to interoperate) without restriction.
Relational database: A database in which data and their attributes are structured and stored in non-redundant tables in such a way as to facilitate information retrieval.
Next-generation sequencing: Low-cost, parallelised, high-throughput technology capable of producing thousands or millions of sequences simultaneously (for example, 454 pyrosequencing, Illumina (Solexa) and SOLiD sequencing).
Third-generation sequencing: Low-cost, single-molecule sequencing technology that aims to reduce the cost of sequencing a single human genome to US $1000 or less.

The first tool to exploit this web of cross-references was SRS, the Sequence Retrieval System. SRS is an information-indexing tool that allows any flat-file database to be indexed to any other, permitting highly specific queries across different databases via a single interface, irrespective of their underlying data-types. The figure on flat-file indexing illustrates how integrated access to diverse information across different flat-file databases is made possible via links to and from their entries' respective AC numbers and ID codes.

The need for relational database management systems

We have seen that flat-file databases can interoperate if they have been indexed or cross-referenced. This makes database queries fast and efficient, because they can be directed to specific parts of the file, rather than to the whole database(s). However, although this is effective for data integration, the approach is very brittle. Consider the role of AC numbers. If an AC number changes, its associated database entry suddenly becomes invisible to all resources to which it was formerly connected; to remain visible, all connected databases must incorporate the new AC number.

Owing to its ease of use, the flat-file format was popular for many years. In time, as the pace of data acquisition increased and the accompanying body of scientific literature grew, keeping database data and annotations up to date became more time-consuming (and error-prone, because much of the work was manual). This prompted the use of relational database systems to help structure data more formally: here, data are managed in tables in such a way that changes in one table can be readily propagated to others, easing data-management burdens. For more complex resources too, like data warehouses, removing redundancy between databases and ensuring data consistency are easier to achieve using relational systems.

The challenge now is not so much what we want to do with such systems today, but how they will need to adapt to future needs. The quantity of data that next- and third-generation sequencing technologies will produce is unprecedented, and will likely have a major impact on future database design.
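The way a relational system propagates a change from one table to another can be shown with a small sketch. The example below uses Python's built-in sqlite3 module with an in-memory database; the table and column names (`entry`, `cross_ref`, `entry_ac`, `target_db`, `target_ac`) are invented for illustration and do not reflect the schema of any real biological database, and the replacement accession `P99999` is hypothetical. It assumes a simple design in which cross-references point at their parent entry through a foreign key declared with ON UPDATE CASCADE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE entry (
    ac          TEXT PRIMARY KEY,   -- accession number
    id          TEXT NOT NULL,      -- human-readable identifier
    description TEXT
);
CREATE TABLE cross_ref (
    entry_ac  TEXT NOT NULL REFERENCES entry(ac) ON UPDATE CASCADE,
    target_db TEXT NOT NULL,
    target_ac TEXT NOT NULL
);
""")

with conn:  # commit the inserts together
    conn.execute(
        "INSERT INTO entry VALUES (?, ?, ?)",
        ("P08100", "OPSD_HUMAN", "RHODOPSIN"),
    )
    conn.executemany(
        "INSERT INTO cross_ref VALUES (?, ?, ?)",
        [("P08100", "EMBL", "K02281"), ("P08100", "PROSITE", "PS00237")],
    )

# Simulate an accession change (P99999 is a made-up value). In a set of
# flat files, every cross-reference would need to be edited by hand.
with conn:
    conn.execute("UPDATE entry SET ac = ? WHERE ac = ?", ("P99999", "P08100"))

rows = conn.execute(
    "SELECT entry_ac, target_db, target_ac FROM cross_ref ORDER BY target_db"
).fetchall()
print(rows)
```

After the single UPDATE statement, both rows in `cross_ref` carry the new accession, because the database itself maintains the link between the tables. This is one concrete sense in which relational systems are less brittle than cross-referenced flat-files.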

Take it further
More detailed information about flat-file database formats can be found in Chapter 3 of Introduction to Bioinformatics (Attwood and Parry-Smith, 1999), Prentice Hall. More detailed information on building biological databases, and the use of MySQL, can be found in Building Bioinformatics Solutions: with Perl, R and MySQL (Bessant, Shadforth and Oakley, 2008), OUP.

Link
Find out more about DNA and RNA structure and coding regions in Unit 7: Molecular biology and genetics.

Activity
Read the following news article describing the road to the US $1000 human genome from the National Human Genome Research Institute website. What is one type of innovation that is being explored in order to allow revolutionary 3Gen technologies to deliver the $1000 genome?

Further reading
Attwood, T. and Parry-Smith, D. (1999) Introduction to Bioinformatics, Prentice Hall. Chapter 3 contains more information about flat-file database formats.
Higgs, P. and Attwood, T. (2005) Bioinformatics and Molecular Evolution, Wiley-Blackwell. Refer to Chapter 5 for further details of biological databases.
Bessant, C., Shadforth, I. and Oakley, D. (2008) Building Bioinformatics Solutions: with Perl, R and MySQL, OUP. Contains more detailed information on building biological databases, and the use of MySQL.
Find out more about the road to the $1000 genome as described in the following news article from the National Human Genome Research Institute website.

Checklist
At the end of this topic guide you should be familiar with the following ideas about bioinformatics:
- the flat-file database (essentially a plain-text file) was the original means of managing raw sequence data
- the EMBL flat-file format was adopted by different databases because its structure made it easy to use and to adapt
- the EMBL format uses a series of two-letter tags to denote different database fields (ID and AC tags for the entry identifier and accession number, DE and GN tags for the descriptive title and gene name, CC for comments and FT for the Feature Table, etc.)
- tags allow the data in different parts of flat-file databases to be indexed, which allows them to be cross-linked to information in related databases
- flat-file databases are simple to understand but are brittle to changes in the data structure
- more sophisticated relational database management systems were devised to store and manage bioinformatics data more efficiently and more robustly.

Acknowledgements
The publisher would like to thank the following for their kind permission to reproduce their photographs: PhotoDisc: Lawrence Lawry. All other images © Pearson Education.
We are grateful to the following for permission to reproduce copyright material: Realia showing coding of a flat-file format of a UniProtKB/Swiss-Prot entry. Produced by UniProt Consortium. Used by permission.
In some instances we have been unable to trace the owners of copyright material, and we would appreciate any information that would enable us to do so.
