Information Resources in Molecular Biology
Marcela Davila-Lopez (marcela.davila@medkem.gu.se)

How many and where

Data growth

DB: What and Why
A database is a shared collection of logically related data, together with a description of that data, designed to meet the information needs of a community of users.

1900s - 1950s punched cards (1951) 1790s Magnetic tape

Fasta format
Tab delimited

GenBank format

Flat-file DB
  Pros:
    Text files: easy!
    Search: Linux commands, customized scripts (Perl)
  Drawbacks:
    Incompatible file formats
    Duplication of data
    Hard to relate data
    Hard to update

Relational DB
  Search: Structured Query Language (SQL)
  No duplication of data
  Easy to relate data
  No incompatible file formats
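The "customized scripts" route for searching a flat file can be sketched in a few lines. The example below uses Python rather than Perl and assumes a tiny, hypothetical two-record FASTA file held in memory; the function name parse_fasta and the record IDs are illustrative only.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record_id: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: take the first word after '>' as the record ID.
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append to the record currently being read.
            records[current_id] += line
    return records


# Hypothetical flat file with two records (made-up sequences).
EXAMPLE = """>seq1 example record
ACGTACGT
ACGT
>seq2 another record
TTGACA
"""

records = parse_fasta(EXAMPLE.splitlines())

# "Search" the flat file: which records contain the motif TTGA?
hits = [rid for rid, seq in records.items() if "TTGA" in seq]
print(records)
print(hits)
```

Every new file format needs its own parser of this kind, which is the "incompatible file formats" drawback in practice.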

Object-oriented DB
In this model, objects contain data and the actions that can be taken on it.
  Pros:
    Can easily hold unstructured data: video, audio, photographs
    Handles highly complex information
    Provides a high performance level
    Compatible with OO programming languages such as Python, Java, Visual Basic .NET, C++
  Search: Object Query Language (OQL)
  Drawbacks:
    Difficult to use
    Expensive to develop
    Lack of experience, standards, and support for security
    Incompatible with relational DBs (more popular)

Data Webhouse (Atlas UbiC)
A distributed data warehouse that is implemented over the Web with no central data repository.
  Advantages:
    Deals with data heterogeneity
    Holds historical data
    Supports SQL, C++, Java, Perl, Toolbox
  Disadvantages:
    Underestimation of data loading
    Required data not captured
    Increased end-user demands
    High maintenance

Distributed DB (Reciprocal Net)
Data at different locations but shared; common DBMS.
  Advantages:
    Local autonomy
    Faults/overloads/modifications in one DB system won't affect others
    $: network of small CPUs < 1 large CPU
    Data is located near the site of greatest demand
  Drawbacks:
    Extra work must also be done to maintain/secure multiple systems
    Extensive infrastructure means extra labour costs
    It's a young field

Local vs on-line access

DB Types in Molecular Biology
  Primary sequence databases:
    GenBank - U.S. (NCBI)
    EMBL - Europe (The European Molecular Biology Laboratory)
    DDBJ - DNA Data Bank of Japan
    UniProt - Universal Protein Resource
  Meta-databases (databases of databases): they collect data from different sources and usually make them available in a new and more convenient form, or with an emphasis on a particular disease or organism.
    Entrez (NCBI)
    Bioinformatic Harvester (Karlsruhe Institute of Technology)

What is in there?
  Data
    Primary: nucleotide sequence
    Secondary: protein domains
  Data quality
    Consistency: guidelines, nucleotide vs protein sequence, existence of cross-references, alternative names replaced with approved ones, misspellings...
    Redundancy: repetition of analysis, several entries; non-redundant database
    Updates: new entries, correction of existing ones; version numbers; weekly, monthly...
    Curation: annotation from experimental data

Ontologies: controlled vocabularies, consistent descriptions used to classify/organize data. They may define relationships between the terms, making them structured vocabularies.
  Gene Ontology Project: describes gene product attributes in any organism across databases
  Microarray Gene Expression Data: describes gene expression experiments
  Sequence Ontology Project: describes features of nucleotide or protein sequences
  Multiple Alignment Ontology: describes multiple sequence alignments and their methods, as well as structural or functional information

The Gene Ontology DB (1998)
The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases (FlyBase, SGD, MGD; plants, animals, microbial genomes).
  Organizing principles:
    Cellular component: location
    Molecular function: activities, jobs (transporting things around, binding to things, holding things together, changing one thing into another)
    Biological process: series of events made up of molecular functions
  The gene product cytochrome c can be described by the terms:
    Molecular function: oxidoreductase
    Biological process: oxidative phosphorylation and induction of cell death
    Cellular component: mitochondrial matrix and mitochondrial inner membrane
  Application of GO terms:
    What genes are related to cell division?
    Which GO terms are over-represented in cancer tissue as compared to normal tissue?

NCBI: National Center for Biotechnology Information
  Nucleotide
  Taxonomy
  Genome Project
  Protein
  Structure
  Literature (MeSH)
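The question of which GO terms are over-represented in one gene set compared with a background is commonly approached with a one-sided hypergeometric test. The sketch below is one possible way to compute it, not a method taken from these slides; the function name enrichment_p_value and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when n genes are drawn from N, of which K carry the GO term.

    k: genes in the sample annotated with the term
    n: sample size (e.g. genes up-regulated in cancer tissue)
    K: genes in the background annotated with the term
    N: background size
    """
    total = comb(N, n)
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical counts: 40 of 2000 background genes carry the term,
# and 8 of the 100 genes in the sample carry it (2 expected by chance).
p = enrichment_p_value(k=8, n=100, K=40, N=2000)
print(f"p-value for over-representation: {p:.3g}")
```

In practice many GO terms are tested at once, so the resulting p-values would normally also be corrected for multiple testing.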

GenBank: An annotated collection of all publicly available nucleotide sequences.

RefSeq: Collection of sequences (DNA, transcripts, proteins); integrated, non-redundant, well-annotated.

dbSNP: Broad collection of simple genetic polymorphisms. These are small genetic changes, or variations, that can occur within a person's DNA sequence. They are the most common variations, occurring approximately once every 100 to 300 bases, and can point to heritable phenotypes. They are useful for evaluating predisposition to disease or as a diagnostic tool. They also help predict the response to drug regimens and are used as biological markers for the mapping of genes.

dbEST: Contains sequence data and other information on "single-pass" cDNA sequences called Expressed Sequence Tags. These are small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing either one or both ends of an expressed gene. They can be used to study cells, tissues, and organs under certain conditions, as a gene identification method, and in the study of hereditary diseases.

DB-specific search
Entrez: Global Query Cross-Database Search System
eutils: Entrez Programming Utilities

UCSC: UC Santa Cruz Genome Bioinformatics group
Contains assemblies for a large collection of genomes.
  Genome Browser
  Table Browser
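The Entrez Programming Utilities are accessed through URLs, so a DB-specific search can be prepared from a script. The sketch below only builds an ESearch URL and sends no request; the helper name build_esearch_url and the example search term are illustrative assumptions, while db, term and retmax are ESearch parameters.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for one Entrez database (no network access)."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


# Hypothetical query: human BRCA1 records in the nucleotide database.
url = build_esearch_url(
    "nucleotide", "BRCA1[Gene Name] AND human[Organism]", retmax=5
)

# Decode the query string again to check what would be sent.
params = parse_qs(urlparse(url).query)
print(url)
print(params)
```

Fetching the URL (for example with urllib.request.urlopen) would return the matching record IDs as XML; NCBI's usage guidelines on request rates apply to any real script.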

Ensembl
  EBI and the Wellcome Trust Sanger Institute
  1999: data of the HGP (Human Genome Project)
  Centralized resource
  Researchers studying genomes: vertebrates, model organisms, and plants, fungi, bacteria, protists
  All sequence data is fed through a software pipeline (Perl) into a relational DB for analysis and display

BioMart: Query-oriented data integration system
  Based on distributed data warehousing ideas
  Single or multiple databases
  Results in table format
  User friendly

UniProt: Universal Protein Resource
  1980s: Protein sequence database. High-quality, detailed curation
  EBI + SIB: Quick release of data not yet annotated. TrEMBL (Translation of EMBL nucleotide sequences). Only computationally annotated entries
  2002: EBI + SIB + PIR, UniProt Consortium

MySQL

Structured Query Language (SQL): a computer language designed for the retrieval and management of data in relational database management systems, database schema creation and modification, and database object access control management. It is an interactive programming language for querying information from and updating a database.

+----------+--------+---------+-----+------------+------------+
| Name     | Owner  | Species | Sex | Birth      | Death      |
+----------+--------+---------+-----+------------+------------+
| Fluffy   | Harold | cat     | f   | 1993-02-04 |            |
| Claus    | Gwen   | cat     | m   | 1994-03-17 |            |
| Buffy    | Harold | dog     | f   | 1989-05-13 |            |
| Fang     | Benny  | dog     | m   | 1990-08-27 |            |
| Bowser   | Diane  | dog     | m   | 1979-08-31 | 1995-07-25 |
| Chirpy   | Gwen   | bird    | m   | 1998-09-11 |            |
| Whistler | Gwen   | bird    |     | 1997-12-09 |            |
| Slim     | Benny  | snake   | m   | 1996-04-29 |            |
| Puffball | Diane  | hamster | f   | 1999-03-30 |            |
+----------+--------+---------+-----+------------+------------+

Some commands/syntax:

  shell> mysql
  mysql>
  mysql> QUIT
  Bye

  mysql> USE mystore;
  Database changed
  mysql> SHOW TABLES;
  mysql> SELECT DATABASE();
  mysql> DESCRIBE pet;

  mysql> LOAD DATA LOCAL INFILE '/path/pet.txt' INTO TABLE pet;
  mysql> INSERT INTO pet
      -> VALUES ('Puffball', 'Diane', 'hamster', 'f', '1999-03-30', NULL);
  mysql> UPDATE pet SET sex = 'f' WHERE name = 'Chirpy';
  mysql> DELETE FROM pet WHERE owner = 'Harold';

  SELECT what_to_select
  FROM which_table
  WHERE conditions_to_satisfy;

  mysql> SELECT * FROM pet;
  mysql> SELECT * FROM pet WHERE name = 'Bowser';
  mysql> SELECT * FROM pet WHERE birth >= '1998-1-1';

  {Comparison operators: =, <>, <, >, !=, <=, >=}

  mysql> SELECT * FROM pet WHERE species = 'dog' AND sex = 'f';
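The SELECT ... FROM ... WHERE statements above can be tried without a MySQL server. The sketch below swaps in SQLite through Python's standard sqlite3 module, which is an assumption made for convenience: it rebuilds the pet table in memory and runs three of the queries. Dates are stored as zero-padded text, so the date comparison uses '1998-01-01'.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pet (name TEXT, owner TEXT, species TEXT, "
    "sex TEXT, birth TEXT, death TEXT)"
)

# Rows of the pet table shown above; None is stored as SQL NULL.
pets = [
    ("Fluffy", "Harold", "cat", "f", "1993-02-04", None),
    ("Claus", "Gwen", "cat", "m", "1994-03-17", None),
    ("Buffy", "Harold", "dog", "f", "1989-05-13", None),
    ("Fang", "Benny", "dog", "m", "1990-08-27", None),
    ("Bowser", "Diane", "dog", "m", "1979-08-31", "1995-07-25"),
    ("Chirpy", "Gwen", "bird", "m", "1998-09-11", None),
    ("Whistler", "Gwen", "bird", None, "1997-12-09", None),
    ("Slim", "Benny", "snake", "m", "1996-04-29", None),
    ("Puffball", "Diane", "hamster", "f", "1999-03-30", None),
]
conn.executemany("INSERT INTO pet VALUES (?, ?, ?, ?, ?, ?)", pets)

# SELECT * FROM pet WHERE name = 'Bowser';
bowser = conn.execute("SELECT * FROM pet WHERE name = ?", ("Bowser",)).fetchall()

# SELECT ... WHERE birth >= '1998-01-01'; (ISO dates compare correctly as text)
young = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM pet WHERE birth >= '1998-01-01' ORDER BY birth"
    )
]

# SELECT ... WHERE species = 'dog' AND sex = 'f';
female_dogs = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM pet WHERE species = 'dog' AND sex = 'f'"
    )
]

print(bowser)
print(young)
print(female_dogs)
```

The "?" placeholders pass values separately from the SQL text, which avoids quoting mistakes when values come from user input.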

  {Logical operators: AND, OR, NOT}

  mysql> SELECT * FROM pet WHERE species = 'snake' OR species = 'bird';
  mysql> SELECT * FROM pet
      -> WHERE (species = 'cat' AND sex = 'm')
      -> OR (species = 'dog' AND sex = 'f');

  mysql> SELECT owner FROM pet;
  mysql> SELECT name, birth FROM pet;
  mysql> SELECT name, species FROM pet WHERE species = 'cat'
      -> OR species = 'dog';
  mysql> SELECT DISTINCT owner FROM pet;

  mysql> SELECT name, birth FROM pet ORDER BY birth;
  mysql> SELECT name, birth FROM pet ORDER BY birth DESC;

Pattern matching: LIKE or NOT LIKE
  _ matches any single character
  % matches an arbitrary number of characters (including zero)
  case-insensitive by default

  mysql> SELECT * FROM pet WHERE name LIKE 'b%';
  mysql> SELECT * FROM pet WHERE owner LIKE b ;
  mysql> SELECT * FROM pet WHERE name LIKE ;

Pattern matching: REGEXP or NOT REGEXP
  . matches any single character
  [...] matches any character within the brackets
  {n} repeats the previous element n times
  * matches zero or more instances of the previous character
  ^ anchors the match at the beginning
  $ anchors the match at the end
  case-insensitive by default

  mysql> SELECT * FROM pet WHERE name REGEXP 'e';
  mysql> SELECT * FROM pet WHERE species REGEXP '^[WF]';
  mysql> SELECT * FROM pet WHERE name REGEXP '^...$';
  mysql> SELECT * FROM pet WHERE name REGEXP '^.{5}$';

Counting rows:

  mysql> SELECT COUNT(*) FROM pet;
  mysql> SELECT species, COUNT(*) FROM pet GROUP BY species;
  mysql> SELECT owner, COUNT(*) AS petsno
      -> FROM pet
      -> GROUP BY owner;
  mysql> SELECT species, sex, COUNT(*) FROM pet
      -> WHERE species = 'dog' OR species = 'cat'
      -> GROUP BY species, sex;
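Pattern matching and grouping can be tried the same way. This sketch again assumes SQLite via Python's sqlite3 module in place of MySQL. SQLite has no built-in REGEXP, so a small helper (the hypothetical function regexp below) is registered to provide it using Python's re module, with case-insensitive matching to mirror the MySQL default described above.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Back SQLite's REGEXP operator: 'X REGEXP Y' calls regexp(Y, X)."""
    if value is None:
        return 0
    return 1 if re.search(pattern, value, re.IGNORECASE) else 0


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE pet (name TEXT, owner TEXT, species TEXT)")

# Name, owner and species columns of the pet table used in these slides.
conn.executemany(
    "INSERT INTO pet VALUES (?, ?, ?)",
    [
        ("Fluffy", "Harold", "cat"),
        ("Claus", "Gwen", "cat"),
        ("Buffy", "Harold", "dog"),
        ("Fang", "Benny", "dog"),
        ("Bowser", "Diane", "dog"),
        ("Chirpy", "Gwen", "bird"),
        ("Whistler", "Gwen", "bird"),
        ("Slim", "Benny", "snake"),
        ("Puffball", "Diane", "hamster"),
    ],
)

# LIKE: names starting with "b" (SQLite's LIKE ignores ASCII case by default).
b_names = [
    r[0]
    for r in conn.execute(
        "SELECT name FROM pet WHERE name LIKE 'b%' ORDER BY name"
    )
]

# REGEXP: names that are exactly five characters long.
five_chars = [
    r[0]
    for r in conn.execute(
        "SELECT name FROM pet WHERE name REGEXP '^.{5}$' ORDER BY name"
    )
]

# GROUP BY: number of pets per species.
species_counts = conn.execute(
    "SELECT species, COUNT(*) FROM pet GROUP BY species ORDER BY species"
).fetchall()

print(b_names)
print(five_chars)
print(species_counts)
```

In MySQL itself no helper is needed, because REGEXP is a built-in operator.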