Information Resources in Molecular Biology Marcela Davila-Lopez How many and where

Information Resources in Molecular Biology Marcela Davila-Lopez (marcela.davila@medkem.gu.se) How many and where

Data growth

DB: What and Why
A database is a shared collection of logically related data, together with a description of that data, designed to meet the information needs of a community of users.

Storage media over time:
- 1900s-1950s: punched cards
- 1951-1970s: magnetic tape

Example file formats: FASTA format, tab delimited

GenBank format

Flat-file DB
Pros:
- Text files: easy!
- Search with Linux commands or customized scripts (Perl)
Drawbacks:
- Incompatible file formats
- Duplication of data
- Hard to relate data
- Hard to update

Relational DB
- Search with Structured Query Language (SQL)
- No duplication of data
- Easy to relate data
- No incompatible file formats
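The contrast between the two approaches can be sketched in a few lines. This is only an illustration: the accessions, gene names and table layout are made up, and SQLite (through Python's built-in `sqlite3` module) stands in for a full relational DBMS. The flat file repeats the organism name on every row and is searched by scanning the text, while the relational version stores each organism once and relates the tables with a JOIN.

```python
import csv
import io
import sqlite3

# Hypothetical tab-delimited flat file; the organism name is duplicated per row.
FLAT_FILE = (
    "accession\tgene\torganism\n"
    "A0001\tcytc\tHomo sapiens\n"
    "A0002\ttp53\tHomo sapiens\n"
    "A0003\tcytc\tMus musculus\n"
)


def flat_file_search(text, organism):
    """Scan every row of the flat file, as a customized script would."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [row["accession"] for row in reader if row["organism"] == organism]


def build_relational(text):
    """Load the same records into two related tables (organism stored once)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    conn.execute(
        "CREATE TABLE sequence (accession TEXT PRIMARY KEY, gene TEXT, "
        "organism_id INTEGER REFERENCES organism(id))"
    )
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        conn.execute(
            "INSERT OR IGNORE INTO organism (name) VALUES (?)", (row["organism"],)
        )
        (org_id,) = conn.execute(
            "SELECT id FROM organism WHERE name = ?", (row["organism"],)
        ).fetchone()
        conn.execute(
            "INSERT INTO sequence VALUES (?, ?, ?)",
            (row["accession"], row["gene"], org_id),
        )
    return conn


def relational_search(conn, organism):
    """Relate the two tables with a JOIN instead of scanning text."""
    rows = conn.execute(
        "SELECT s.accession FROM sequence AS s "
        "JOIN organism AS o ON o.id = s.organism_id "
        "WHERE o.name = ? ORDER BY s.accession",
        (organism,),
    )
    return [accession for (accession,) in rows]
```

Both searches return the same accessions; the difference is that renaming an organism touches one row in the relational version and every matching line in the flat file.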

Object-oriented DB
In this model, objects contain both data and the actions that can be taken on it.
Pros:
- Can easily hold unstructured data: video, audio, photographs
- Handles highly complex information
- Provides a high performance level
- Compatible with OO programming languages such as Python, Java, Visual Basic .NET, C++
- Search with Object Query Language (OQL)
Drawbacks:
- Difficult to use
- Expensive to develop
- Lack of experience, standards and support for security
- Incompatible with relational DBs (which are more popular)

Data Webhouse (Atlas UbiC)
A distributed data warehouse that is implemented over the Web, with no central data repository.
Advantages:
- Deals with data heterogeneity
- Holds historical data
- Supports SQL, C++, Java, Perl, Toolbox
Disadvantages:
- Underestimation of data loading
- Required data not captured
- Increased end-user demands
- High maintenance

Distributed DB (Reciprocal Net)
Data is stored at different locations but shared through a common DBMS.
Advantages:
- Local autonomy
- Faults, overloads or modifications in one DB system won't affect the others
- A network of small computers costs less than one large computer
- Data is located near the site of greatest demand
Drawbacks:
- Extra work must also be done to maintain and secure multiple systems
- Extensive infrastructure means extra labour
- It's a young field
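To make the object-oriented model described earlier more concrete, here is a minimal Python sketch of an object that bundles data with the actions that can be taken on it. The class name, fields and methods are hypothetical and are not taken from any particular object database.

```python
from dataclasses import dataclass

# Translation table used to complement DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


@dataclass
class SequenceRecord:
    """Hypothetical object: the data (accession, sequence) plus its actions."""

    accession: str
    sequence: str

    def gc_content(self) -> float:
        """Fraction of G and C bases in the sequence (0.0 for an empty one)."""
        seq = self.sequence.upper()
        if not seq:
            return 0.0
        return (seq.count("G") + seq.count("C")) / len(seq)

    def reverse_complement(self) -> "SequenceRecord":
        """Return a new record holding the reverse-complemented sequence."""
        rc = self.sequence.upper().translate(_COMPLEMENT)[::-1]
        return SequenceRecord(self.accession + "_rc", rc)
```

In a relational design the sequence would sit in a table and functions like these would live in separate scripts; here they travel with the data.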

Local vs on-line access

DB Types in Molecular Biology
Primary sequence databases:
- GenBank: U.S. (NCBI)
- EMBL: Europe (The European Molecular Biology Laboratory)
- DDBJ: DNA Data Bank of Japan
- UniProt: Universal Protein Resource
Meta-databases (databases of databases): they collect data from different sources and usually make it available in a new and more convenient form, or with an emphasis on a particular disease or organism.
- Entrez (NCBI)
- Bioinformatic Harvester (Karlsruhe Institute of Technology)

What is in there?
Data
- Primary: nucleotide sequence
- Secondary: protein domains
Data quality
- Consistency: guidelines, nucleotide vs protein sequence, existence of cross-references, alternative names replaced with approved ones, misspellings...
- Redundancy: repetition of analysis, several entries for the same data; non-redundant databases
- Updates: new entries, correction of existing ones; version numbers; weekly, monthly...
- Curation: annotation from experimental data
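Two of the data-quality points above, redundancy and version numbers, lend themselves to a short sketch. The function names and the accessions are hypothetical; the only outside assumption is the `accession.version` notation used for sequence records. Real non-redundant databases apply more elaborate rules than exact sequence identity.

```python
from collections import defaultdict


def make_non_redundant(entries):
    """Group accessions that share an identical sequence (case-insensitive).

    entries: iterable of (accession, sequence) pairs.
    Returns a dict mapping each distinct sequence to its accessions.
    """
    groups = defaultdict(list)
    for accession, sequence in entries:
        groups[sequence.upper()].append(accession)
    return dict(groups)


def latest_versions(accession_versions):
    """Keep the highest version per accession, e.g. 'X1.3' beats 'X1.1'."""
    latest = {}
    for accver in accession_versions:
        accession, _, version = accver.partition(".")
        number = int(version or 0)  # no suffix is treated as version 0
        if number > latest.get(accession, -1):
            latest[accession] = number
    return latest
```

For example, three entries with two distinct sequences collapse to two groups, and only the newest version of each accession is kept.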

Ontologies
A controlled vocabulary: consistent descriptions used to classify and organize data. An ontology may also define relationships between the terms, making it a structured vocabulary.
- Gene Ontology Project: describes gene product attributes in any organism, across databases
- Microarray Gene Expression Data: describes gene expression experiments
- Sequence Ontology Project: describes features of nucleotide or protein sequences
- Multiple Alignment Ontology: describes multiple sequence alignments and their methods, as well as structural or functional information

The Gene Ontology DB (1998)
The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases (FlyBase, SGD, MGD; plants, animals, microbial genomes).
Organizing principles:
- Cellular component: location
- Molecular function: activities or jobs (transporting things around, binding to things, holding things together, changing one thing into another)
- Biological process: a series of events accomplished by molecular functions
The gene product cytochrome c can be described by the terms:
- Molecular function: oxidoreductase activity
- Biological process: oxidative phosphorylation and induction of cell death
- Cellular component: mitochondrial matrix and mitochondrial inner membrane

Application: GO terms
- What genes are related to cell division?
- Which GO terms are over-represented in cancer tissue as compared to normal tissue?

NCBI: National Center for Biotechnology Information
- Nucleotide
- Taxonomy
- Genome Project
- Protein
- Structure
- Literature (MeSH)
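The over-representation question above is commonly approached with a hypergeometric test, although the slides do not say which method is meant. The sketch below assumes that choice; the function and parameter names are hypothetical, and a real analysis would also correct for testing many GO terms at once.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """P(X >= hits) for one GO term.

    population: number of genes in the background set
    annotated:  background genes that carry the GO term
    sample:     number of genes in the list of interest (e.g. cancer tissue)
    hits:       genes in that list that carry the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total
```

A small p-value suggests the term appears in the gene list more often than expected by chance when drawing the same number of genes from the background.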

GenBank: an annotated collection of all publicly available nucleotide sequences.

RefSeq: an integrated, non-redundant, well-annotated collection of sequences (DNA, transcripts, proteins).

dbSNP: a broad collection of simple genetic polymorphisms. These are small genetic changes, or variations, that can occur within a person's DNA sequence. They are the most common type of variation, occurring approximately once every 100 to 300 bases, and can point to heritable phenotypes. They are useful for evaluating predisposition to disease or as a diagnostic tool. They also help to predict the response to drug regimens and are used as biological markers for mapping genes.

dbEST: contains sequence data and other information on "single-pass" cDNA sequences called Expressed Sequence Tags (ESTs). These are small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing either one or both ends of an expressed gene. They can be used to study cells, tissues and organs under certain conditions, as a gene identification method, and in the study of hereditary diseases.

DB-specific search
Entrez: global cross-database query and search system
eutils: Entrez Programming Utilities

UCSC: UC Santa Cruz Genome Bioinformatics group
Contains assemblies for a large collection of genomes.
- Genome Browser
- Table Browser
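As an illustration of how the Entrez Programming Utilities are addressed, the sketch below only builds the URL for an ESearch request and sends nothing over the network. The helper name `esearch_url` and the example search term are hypothetical; the base address and the `db`, `term` and `retmax` parameters are the ones used by ESearch, but the official E-utilities documentation and usage policy should be checked before issuing real queries.

```python
from urllib.parse import urlencode

# Base address of the Entrez Programming Utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch URL for one Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


# Example: search the nucleotide database for a free-text term.
example = esearch_url("nucleotide", "cytochrome c", retmax=5)
```

The resulting address can be opened in a browser or fetched from a script; the reply lists the identifiers of the matching records.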

Ensembl
- EBI and the Wellcome Trust Sanger Institute
- 1999: data of the HGP
- Centralized resource for researchers studying genomes: vertebrates, model organisms, and plants, fungi, bacteria, protists
- All sequence data is fed through a software pipeline (Perl) into a relational DB for analysis and display

BioMart
- Query-oriented data integration system
- Based on distributed data warehousing ideas
- Single or multiple databases
- Results in table format
- User friendly

UniProt: Universal Protein Resource
- 1980s: protein sequence database with high-quality, detailed curation (EBI + SIB)
- Quick release of data not yet annotated: TrEMBL (Translation of EMBL nucleotide sequences), which holds only computationally annotated entries
- 2002: EBI + SIB + PIR form the UniProt Consortium
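The name TrEMBL refers to translating nucleotide sequences into protein sequences. The basic step can be sketched with the standard genetic code as below; this is an illustration of translation only, not the actual TrEMBL pipeline, and the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate a coding sequence in frame 1, stopping at the first stop codon.

    Codons containing unknown characters are reported as 'X'.
    """
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```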

MySQL
Structured Query Language (SQL): a computer language designed for the retrieval and management of data in relational database management systems, database schema creation and modification, and database object access control management. It is an interactive programming language for querying information from and updating a database.

+----------+--------+---------+-----+------------+------------+
| Name     | Owner  | Species | Sex | Birth      | Death      |
+----------+--------+---------+-----+------------+------------+
| Fluffy   | Harold | cat     | f   | 1993-02-04 |            |
| Claus    | Gwen   | cat     | m   | 1994-03-17 |            |
| Buffy    | Harold | dog     | f   | 1989-05-13 |            |
| Fang     | Benny  | dog     | m   | 1990-08-27 |            |
| Bowser   | Diane  | dog     | m   | 1979-08-31 | 1995-07-25 |
| Chirpy   | Gwen   | bird    | m   | 1998-09-11 |            |
| Whistler | Gwen   | bird    |     | 1997-12-09 |            |
| Slim     | Benny  | snake   | m   | 1996-04-29 |            |
| Puffball | Diane  | hamster | f   | 1999-03-30 |            |
+----------+--------+---------+-----+------------+------------+

Some commands/syntax:

shell> mysql
mysql>
mysql> QUIT
Bye

mysql> USE mystore;
Database changed
mysql> SHOW TABLES;
mysql> SELECT DATABASE();
mysql> DESCRIBE pet;

mysql> LOAD DATA LOCAL INFILE '/path/pet.txt' INTO TABLE pet;
mysql> INSERT INTO pet
    -> VALUES ('Puffball', 'Diane', 'hamster', 'f', '1999-03-30', NULL);
mysql> UPDATE pet SET sex = 'f' WHERE name = 'Chirpy';
mysql> DELETE FROM pet WHERE owner = 'Harold';

SELECT what_to_select
FROM which_table
WHERE conditions_to_satisfy;

mysql> SELECT * FROM pet;
mysql> SELECT * FROM pet WHERE name = 'Bowser';
mysql> SELECT * FROM pet WHERE birth >= '1998-1-1';
{Comparison operators: =, <>, <, >, !=, <=, >=}
mysql> SELECT * FROM pet WHERE species = 'dog' AND sex = 'f';
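The statements so far can be tried without a MySQL server. The sketch below is an assumption-laden stand-in: it uses SQLite through Python's built-in `sqlite3` module instead of MySQL, stores dates as text, and the helper names are hypothetical. The rows are those of the pet table, and the SQL is the same apart from the zero-padded date literal.

```python
import sqlite3

# The first eight rows of the pet table (None stands for SQL NULL).
PETS = [
    ("Fluffy", "Harold", "cat", "f", "1993-02-04", None),
    ("Claus", "Gwen", "cat", "m", "1994-03-17", None),
    ("Buffy", "Harold", "dog", "f", "1989-05-13", None),
    ("Fang", "Benny", "dog", "m", "1990-08-27", None),
    ("Bowser", "Diane", "dog", "m", "1979-08-31", "1995-07-25"),
    ("Chirpy", "Gwen", "bird", "m", "1998-09-11", None),
    ("Whistler", "Gwen", "bird", None, "1997-12-09", None),
    ("Slim", "Benny", "snake", "m", "1996-04-29", None),
]


def make_pet_db():
    """Create an in-memory database holding the pet table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pet (name TEXT, owner TEXT, species TEXT, "
        "sex TEXT, birth TEXT, death TEXT)"
    )
    conn.executemany("INSERT INTO pet VALUES (?, ?, ?, ?, ?, ?)", PETS)
    return conn


def names(conn, where, params=()):
    """Return the pet names matching a WHERE clause, sorted by name."""
    sql = f"SELECT name FROM pet WHERE {where} ORDER BY name"
    return [name for (name,) in conn.execute(sql, params)]


conn = make_pet_db()
# INSERT and UPDATE as shown in the MySQL session.
conn.execute(
    "INSERT INTO pet VALUES ('Puffball', 'Diane', 'hamster', 'f', '1999-03-30', NULL)"
)
conn.execute("UPDATE pet SET sex = 'f' WHERE name = 'Chirpy'")

female_dogs = names(conn, "species = ? AND sex = ?", ("dog", "f"))
# SQLite compares these dates as text, so the literal is zero-padded here.
born_since_1998 = names(conn, "birth >= ?", ("1998-01-01",))

conn.execute("DELETE FROM pet WHERE owner = 'Harold'")
remaining = conn.execute("SELECT COUNT(*) FROM pet").fetchone()[0]
```

Passing values as `?` parameters rather than pasting them into the SQL string is a habit worth keeping once queries are built from user input.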

{Logical operators: AND, OR, NOT}
mysql> SELECT * FROM pet WHERE species = 'snake' OR species = 'bird';
mysql> SELECT * FROM pet
    -> WHERE (species = 'cat' AND sex = 'm')
    -> OR (species = 'dog' AND sex = 'f');

mysql> SELECT owner FROM pet;
mysql> SELECT name, birth FROM pet;
mysql> SELECT name, species FROM pet WHERE species = 'cat'
    -> OR species = 'dog';
mysql> SELECT DISTINCT owner FROM pet;
mysql> SELECT name, birth FROM pet ORDER BY birth;
mysql> SELECT name, birth FROM pet ORDER BY birth DESC;

LIKE or NOT LIKE (case-insensitive by default):
_ matches any single character
% matches an arbitrary number of characters (including zero)

mysql> SELECT * FROM pet WHERE name LIKE 'b%';
mysql> SELECT * FROM pet WHERE owner LIKE 'b';
mysql> SELECT * FROM pet WHERE name LIKE ;

REGEXP or NOT REGEXP (case-insensitive by default):
. matches any single character
[...] matches any character within the brackets
{n} repeats the previous character n times
* matches zero or more instances of the previous character
^ anchors the match at the beginning
$ anchors the match at the end

mysql> SELECT * FROM pet WHERE name REGEXP 'e';
mysql> SELECT * FROM pet WHERE species REGEXP '^[WF]';
mysql> SELECT * FROM pet WHERE name REGEXP '^...$';
mysql> SELECT * FROM pet WHERE name REGEXP '^.{5}$';

mysql> SELECT COUNT(*) FROM pet;
mysql> SELECT species, COUNT(*) FROM pet GROUP BY species;
mysql> SELECT owner, COUNT(*) AS petsno
    -> FROM pet
    -> GROUP BY owner;
mysql> SELECT species, sex, COUNT(*) FROM pet
    -> WHERE species = 'dog' OR species = 'cat'
    -> GROUP BY species, sex;
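Pattern matching and grouping can be tried the same way. As before, this is a sketch that uses SQLite through Python's `sqlite3` module rather than MySQL. SQLite has no built-in REGEXP, so a small user function backed by Python's `re` module is registered here; its regular expression dialect is close to, but not identical with, the one MySQL uses. Only the columns needed for these queries are loaded.

```python
import re
import sqlite3

# name, owner, species, sex for the nine pets (None stands for SQL NULL).
PETS = [
    ("Fluffy", "Harold", "cat", "f"),
    ("Claus", "Gwen", "cat", "m"),
    ("Buffy", "Harold", "dog", "f"),
    ("Fang", "Benny", "dog", "m"),
    ("Bowser", "Diane", "dog", "m"),
    ("Chirpy", "Gwen", "bird", "m"),
    ("Whistler", "Gwen", "bird", None),
    ("Slim", "Benny", "snake", "m"),
    ("Puffball", "Diane", "hamster", "f"),
]


def regexp(pattern, value):
    """Back the SQL REGEXP operator: case-insensitive, NULL never matches."""
    if value is None:
        return 0
    return int(re.search(pattern, value, re.IGNORECASE) is not None)


conn = sqlite3.connect(":memory:")
# SQLite evaluates "X REGEXP Y" by calling regexp(Y, X).
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE pet (name TEXT, owner TEXT, species TEXT, sex TEXT)")
conn.executemany("INSERT INTO pet VALUES (?, ?, ?, ?)", PETS)


def column(sql):
    """Return the first column of every result row."""
    return [row[0] for row in conn.execute(sql)]


starts_with_b = column("SELECT name FROM pet WHERE name LIKE 'b%' ORDER BY name")
five_letters = column(
    "SELECT name FROM pet WHERE name REGEXP '^.{5}$' ORDER BY name"
)
per_species = dict(conn.execute("SELECT species, COUNT(*) FROM pet GROUP BY species"))
per_owner = dict(
    conn.execute("SELECT owner, COUNT(*) AS petsno FROM pet GROUP BY owner")
)
cats_and_dogs = conn.execute(
    "SELECT species, sex, COUNT(*) FROM pet "
    "WHERE species = 'dog' OR species = 'cat' "
    "GROUP BY species, sex ORDER BY species, sex"
).fetchall()
```

Note that `LIKE 'b%'` finds Bowser and Buffy even though the pattern is lower case, which mirrors the case-insensitive default mentioned above.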