Deliverable D4.3 Release of pilot version of data warehouse


Date: 10.05.17
HORIZON 2020 - INFRADEV
Implementation and operation of cross-cutting services and solutions for clusters of ESFRI

Grant Agreement number: 654008
Project acronym:
Contract start date: 01/06/2015
Project website address: www.embric.eu
Due date of deliverable: 30/05/17 / month 24
Dissemination level: Public

Document properties
Partner responsible: EMBL-EBI
Author(s)/editor(s): COCHRANE Guy, RAJAN Jeena and SILVESTER Nicole
Version: 1

Abstract
The pilot version of the data warehouse has been developed to link genomic data with data in ELIXIR chemical resources. An SQLite database has been created with tables to link chemical data in the Chemical Entities of Biological Interest (ChEBI) database with taxonomy information, protein data in UniProt, chemical compound and target information in the ChEMBL database, as well as sequence and genome assembly information available in the European Nucleotide Archive (ENA), providing a powerful search function. This SQLite database is available to download from the ENA public FTP site and support is offered for its use.

Table of Contents
1 Introduction
  1.1 The Pilot Data Warehouse
  1.2 Databases
2 Tables in SQLite database
  2.1 Description of tables
  2.2 Example Queries
3 Limitations
4 Conclusion and Future

1 Introduction

1.1 The Pilot Data Warehouse

The main objective of work package 4 is to provide sustainable data management services for the marine science community. One of its tasks is the development of a data warehouse to facilitate better utilisation of data resources across the whole cluster. A pilot version of the data warehouse has been developed to link genomic and chemical data from ELIXIR resources.

Following a WP2.1 meeting in Plymouth in December 2016 with participants from WP2 and WP3, and a Chemical Biology workshop in Paris in February 2017 organised by WP4, it became clear that there is a need to link data from specific resources. For this pilot we have fulfilled some of these requests; for example, it is now possible to search for the source organism of a compound. The data that can be mapped are limited by their availability in the resources and by how they are structured and linked. We have focused on connecting information from the ChEBI, UniProt, ChEMBL and ENA databases.

The SQLite database and its README file are available to download here: ftp://ftp.ebi.ac.uk/pub/databases/ena/collaboration/embricdb_v1.tar.gz

ENA already provides cross-references to several resources, both internal and external to EBI. A full list can be viewed here: http://www.ebi.ac.uk/ena/data/xref/source

1.2 Databases

This section describes the databases from which information has been extracted to populate the tables of the SQLite database.

ChEBI - Chemical Entities of Biological Interest is a freely available dictionary of molecular entities focused on small chemical compounds. The term molecular entity refers to any constitutionally or isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer, etc., identifiable as a separately distinguishable entity. The molecular entities in question are either products of nature or synthetic products used to intervene in the processes of living organisms.

UniProt - The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data.

ChEMBL - ChEMBL is a database of bioactive drug-like small molecules.

ENA - The European Nucleotide Archive (ENA) provides a comprehensive record of the world's nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation.
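As a first step after downloading and unpacking the archive from section 1.1, it can help to list the tables and columns the SQLite file actually contains. The sketch below is illustrative only: it uses Python's standard sqlite3 module, the helper name list_tables is our own, and the demonstration runs against a small in-memory database because the file name inside the archive is not stated in this report (consult the README).

```python
import sqlite3


def list_tables(conn):
    """Return {table_name: [column names]} for every table in the database."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    schema = {}
    for name in names:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        columns = conn.execute('PRAGMA table_info("{}")'.format(name)).fetchall()
        schema[name] = [col[1] for col in columns]
    return schema


# Demonstration on an in-memory stand-in. For the real warehouse, replace
# ":memory:" with the path to the unpacked SQLite file (name not given here).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE ena_assemblies (ASSEMBLY_ACC TEXT, NCBI_TAX_ID INTEGER, ORGANISM TEXT)")
for table, columns in list_tables(demo).items():
    print(table, columns)
```

The column types in the stand-in table are assumptions; the report lists column names only.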

2 Tables in SQLite database

2.1 Description of tables

Below is the schema of the SQLite database. Each table name is followed by its columns.

chebi_chembl_link
  -CHEBI_ID
  -ORGANISM
  -NCBI_TAX_ID
  -CHEMBL_ID
  -OTHER_TAX_ID

chembl_data
  -UNIPROT_ID
  -CHEMBL_ID
  -DESCRIPTION
  -TARGET_TYPE

ena_assemblies
  -ASSEMBLY_ACC
  -NCBI_TAX_ID
  -ORGANISM

chebi_data
  -CHEBI_ID
  -ORGANISM
  -NCBI_TAX_ID
  -OTHER_TAX_ID
  -COMPONENT_TEXT
  -COMPONENT_ACCESSION
  -STRAIN_TEXT
  -STRAIN_ACCESSION

chebi_reference
  -CHEBI_ID
  -UNIPROT_ID
  -DESCRIPTION

chembl_targets
  -CHEMBL_ID
  -CHEMBL_TARGET_ID
  -TARGET_TYPE
  -NCBI_TAX_ID
  -ORGANISM

The column headers in the tables contain the following information:

CHEBI_ID              ChEBI database identifier
ORGANISM              organism name at species or genus level
NCBI_TAX_ID           NCBI Taxonomy database identifier
CHEMBL_ID             ChEMBL database identifier (can be compound or target)
OTHER_TAX_ID          non-NCBI taxonomy identifiers
UNIPROT_ID            UniProt database identifier
ASSEMBLY_ACC          ENA assembly accession number
COMPONENT_TEXT        part of the organism the extract originates from
COMPONENT_ACCESSION   accession number of component part
STRAIN_TEXT           strain name
STRAIN_ACCESSION      strain accession in culture collection
TARGET_TYPE           e.g. single protein, protein nucleic-acid complex, protein family
CHEMBL_TARGET_ID      internal ChEMBL target ID number

In addition, there are two text files containing the following information:

1) UniProt/SwissProt IDs to INSDC sequence data (from ENA)
   Columns: UNIPROT_ID SEQUENCE_ACC PROTEIN_ACC
2) UniProt/TrEMBL IDs to INSDC sequence data (from ENA)
   Columns: UNIPROT_ID SEQUENCE_ACC PROTEIN_ACC

It was not possible to include the UniProt to ENA mappings contained in these two text files in the SQLite database, because the sheer volume of data would make the database excessively large.

NCBI taxonomy is the unified taxonomic index used across most ELIXIR resources.

2.2 Example Queries

1) Search for the ChEBI ID, ChEMBL ID, NCBI Tax ID and ENA assembly accession for the organism Nocardiopsis dassonvillei

Search the chebi_chembl_link and ena_assemblies tables:

sqlite> select CHEBI_ID, CHEMBL_ID, chebi_chembl_link.NCBI_TAX_ID, ASSEMBLY_ACC from chebi_chembl_link inner join ena_assemblies on ena_assemblies.organism = chebi_chembl_link.organism where chebi_chembl_link.organism = 'Nocardiopsis dassonvillei';

Returns:
ChEBI IDs -> CHEBI:69705, CHEBI:69706, CHEBI:69707, CHEBI:69708, CHEBI:69709, CHEBI:69710, CHEBI:69711
NCBI Tax ID -> 2014
ENA assembly accession -> GCA_001877055
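Example query 1 can also be run programmatically. The following is a minimal sketch using Python's standard sqlite3 module; it builds an in-memory copy of the two tables involved so that it runs without the downloaded file. The column types are assumptions (the report gives names only), the two sample rows reuse values quoted in the Returns section, CHEMBL_ID is left as NULL because no value is listed there, and the function name assemblies_for_organism is our own. This variant returns the NCBI Tax ID alongside the other columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names follow section 2.1; the column types are assumed for illustration.
conn.executescript("""
CREATE TABLE chebi_chembl_link (
    CHEBI_ID TEXT, ORGANISM TEXT, NCBI_TAX_ID INTEGER,
    CHEMBL_ID TEXT, OTHER_TAX_ID TEXT
);
CREATE TABLE ena_assemblies (
    ASSEMBLY_ACC TEXT, NCBI_TAX_ID INTEGER, ORGANISM TEXT
);
""")
# Sample rows built from the values quoted for example query 1.
conn.executemany(
    "INSERT INTO chebi_chembl_link VALUES (?, ?, ?, ?, ?)",
    [("CHEBI:69705", "Nocardiopsis dassonvillei", 2014, None, None),
     ("CHEBI:69706", "Nocardiopsis dassonvillei", 2014, None, None)])
conn.execute(
    "INSERT INTO ena_assemblies VALUES (?, ?, ?)",
    ("GCA_001877055", 2014, "Nocardiopsis dassonvillei"))


def assemblies_for_organism(conn, organism):
    """Join compound links to ENA assemblies on organism name."""
    sql = """
        SELECT l.CHEBI_ID, l.CHEMBL_ID, l.NCBI_TAX_ID, a.ASSEMBLY_ACC
        FROM chebi_chembl_link AS l
        INNER JOIN ena_assemblies AS a ON a.ORGANISM = l.ORGANISM
        WHERE l.ORGANISM = ?
        ORDER BY l.CHEBI_ID
    """
    # The "?" placeholder lets sqlite3 bind the organism name safely.
    return conn.execute(sql, (organism,)).fetchall()


rows = assemblies_for_organism(conn, "Nocardiopsis dassonvillei")
for row in rows:
    print(row)
```

Against the real warehouse, the same function could be called on a connection opened with the path to the unpacked SQLite file instead of ":memory:".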

2) Retrieve the CHEBI_IDs, CHEMBL_IDs and ENA genome accession IDs for all organisms that occur in both the chebi_chembl_link and ena_assemblies tables

sqlite> select CHEBI_ID, CHEMBL_ID, ASSEMBLY_ACC from chebi_chembl_link inner join ena_assemblies on ena_assemblies.organism = chebi_chembl_link.organism;

3) Search using CHEBI:82642

Search the chebi_chembl_link table:

sqlite> select * from chebi_chembl_link where CHEBI_ID = 'CHEBI:82642';

Returns:
Organism -> Neopetrosia exigua
NCBI Tax ID -> 595373
ChEMBL ID -> CHEMBL371814

4) Retrieve all targets in ChEMBL that are SINGLE PROTEIN

sqlite> select * from chembl_targets where TARGET_TYPE = 'SINGLE PROTEIN';
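The single-table lookups above (by ChEBI ID and by target type) lend themselves to small parameterised helpers. The sketch below is one possible way to wrap them with Python's standard sqlite3 module. It again uses an in-memory stand-in with assumed column types; the chebi_chembl_link row reuses the values returned for CHEBI:82642, while the two chembl_targets rows (TARGET_A, TARGET_B and their numeric IDs) are invented placeholders, not real ChEMBL records. The function names are our own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names follow section 2.1; the column types are assumed for illustration.
conn.executescript("""
CREATE TABLE chebi_chembl_link (
    CHEBI_ID TEXT, ORGANISM TEXT, NCBI_TAX_ID INTEGER,
    CHEMBL_ID TEXT, OTHER_TAX_ID TEXT
);
CREATE TABLE chembl_targets (
    CHEMBL_ID TEXT, CHEMBL_TARGET_ID INTEGER, TARGET_TYPE TEXT,
    NCBI_TAX_ID INTEGER, ORGANISM TEXT
);
""")
# Row taken from the result quoted for CHEBI:82642.
conn.execute(
    "INSERT INTO chebi_chembl_link VALUES (?, ?, ?, ?, ?)",
    ("CHEBI:82642", "Neopetrosia exigua", 595373, "CHEMBL371814", None))
# Placeholder target rows: identifiers here are hypothetical, for illustration only.
conn.executemany(
    "INSERT INTO chembl_targets VALUES (?, ?, ?, ?, ?)",
    [("TARGET_A", 1, "SINGLE PROTEIN", 9606, "Homo sapiens"),
     ("TARGET_B", 2, "PROTEIN FAMILY", 9606, "Homo sapiens")])


def find_by_chebi_id(conn, chebi_id):
    """Return (organism, NCBI tax ID, ChEMBL ID) rows for one ChEBI identifier."""
    sql = ("SELECT ORGANISM, NCBI_TAX_ID, CHEMBL_ID "
           "FROM chebi_chembl_link WHERE CHEBI_ID = ?")
    return conn.execute(sql, (chebi_id,)).fetchall()


def targets_of_type(conn, target_type):
    """Return all chembl_targets rows with the given TARGET_TYPE."""
    sql = ("SELECT * FROM chembl_targets WHERE TARGET_TYPE = ? "
           "ORDER BY CHEMBL_TARGET_ID")
    return conn.execute(sql, (target_type,)).fetchall()


print(find_by_chebi_id(conn, "CHEBI:82642"))
print(targets_of_type(conn, "SINGLE PROTEIN"))
```

Binding values with "?" rather than pasting them into the SQL string avoids quoting mistakes such as an unterminated string literal.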

3 Limitations

The current limitations of the pilot data warehouse in connecting data from different resources are due to the availability of the data in the resources and to the mapping and structure of this data. For example, within the ChEBI database there are only approximately 7,000 entries with source species information. Strain information is also not readily available in ChEBI, either because it has not been captured or because it has not been added, which means it is not currently possible to map data via strain information to the culture collections. More species information exists than is currently publicly accessible, and we will liaise with ChEBI to ensure the addition of their species data and strain information where available.

4 Conclusion and Future

This report describes the implementation of the pilot Data Warehouse, an SQLite database connecting genomic data with data in ELIXIR chemical resources as described above. The database is available to download from the ENA public FTP site, along with a README file describing the tables and their contents. Queries can be run that connect small molecules with taxonomy information, compounds, protein, sequence and assembly data.

SQLite was chosen for the database distribution because, while comparatively raw, direct access to a database such as this retains full potential for all kinds of query, whereas a more intuitive interface would limit potential uses. The interface offered at the moment is not intended for end users; rather, it is provided for those involved in moving from prototype to final tool, to allow us to understand what kinds of additional content and usage need to be supported. This is simply a prototype, and the WP2 deliverable (including integration of the programmatic service interface we provide) is not due until towards the end of the project.

The content of the warehouse will be imported into the Marine Metagenomics Portal at University of Tromsø within a couple of months, offering an online service. Recognising that not all partners will have sufficient technical skills or support, we offer direct support for project participants in formulating queries against the database; please contact us at datasubs@ebi.ac.uk.

The final version of the Data Warehouse is due towards the end of the project, in month 45. For that version we plan to increase the mappings within the currently linked databases and to provide a searchable and more intuitive interface. The design of this eventual interface will be informed by those who use the pilot Data Warehouse and by the queries that they construct in this first phase. We also intend to expand the number of databases that we can map to, including databases such as OBIS (Ocean Biogeographic Information System). We will further explore the use of strain data to link to culture collections and to the literature, to facilitate data integration across small molecules, culture collections and genomics databases.