EBI is an Outstation of the European Molecular Biology Laboratory.

InterPro is a database that:
- groups predictive protein signatures together
- combines 11 member databases into a single searchable resource
- provides functional analysis of proteins by classifying them into families and predicting domains and important sites
- enables whole-genome analysis

InterPro Consortium: a consortium of 11 major signature databases

Protein signatures
- More sensitive homology searches
- Each member database creates signatures using different methods and methodologies:
  - manually created sequence alignments
  - automatic processes with some human input and correction
  - entirely automatic processes

Why do we need predictive annotation tools?

What are protein signatures?
[Figure: protein family/domain -> multiple sequence alignment -> build model -> search UniProt -> significant matches (e.g. ITWKGPVCGLDGKTYRNECALL, AVPRSPVCGSDDVTYANECELK) -> mature model, used for protein analysis]
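The align, build and search cycle on the slide above can be illustrated with the simplest kind of signature, a pattern. The sketch below is only an illustration, not how any member database builds its models: it assumes the two example fragments from the slide form a gap-free alignment, and the function names `pattern_from_alignment` and `has_signature` are hypothetical.

```python
import re


def pattern_from_alignment(rows):
    """Build a regular expression from equal-length aligned sequences."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    parts = []
    for column in zip(*rows):
        residues = sorted(set(column))
        # A fully conserved column becomes a literal residue;
        # a variable column becomes a character class of the observed residues.
        if len(residues) == 1:
            parts.append(residues[0])
        else:
            parts.append("[" + "".join(residues) + "]")
    return "".join(parts)


# The two fragments shown on the slide, treated here as an alignment (assumption).
ALIGNMENT = ["ITWKGPVCGLDGKTYRNECALL", "AVPRSPVCGSDDVTYANECELK"]
SIGNATURE = re.compile(pattern_from_alignment(ALIGNMENT))


def has_signature(protein_sequence):
    """Return True if the pattern occurs anywhere in the sequence."""
    return SIGNATURE.search(protein_sequence) is not None
```

A pattern built this way accepts any mix of the residues seen in each column, so it is more permissive than the input sequences but rejects a change at a conserved position.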

Member databases
Methods: hidden Markov models, fingerprints, profiles, patterns, sequence clusters, structural domains
Focus: protein features (active sites, ...), functional annotation of families/domains, prediction of conserved domains

InterPro entry

The InterPro entry: types
- Family: proteins share a common evolutionary origin, as reflected in their related functions, sequences or structure
- Domain: distinct functional, structural or sequence units that may exist in a variety of biological contexts
- Repeats: short sequences typically repeated within a protein
- Sites: PTM, active site, binding site, conserved site

InterPro entry
- Groups similar signatures together
- Adds extensive annotation
- Links to other databases
- Structural information and viewers
- Quality control
- Removes redundancy

InterPro entry
- Groups similar signatures together
- Adds extensive annotation
- Links to other databases
- Structural information and viewers
- Hierarchical classification

InterPro hierarchies: Families
Families can have parent/child relationships with other families. Parent/child relationships are based on:
- Comparison of protein hits:
  - a child should be a subset of its parent
  - siblings should not have matches in common
- Existing hierarchies in member databases
- Biological knowledge of curators
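The two protein-hit rules on the slide above are set relations, so they can be checked mechanically. The following is a minimal sketch under stated assumptions: the entry names and protein identifiers are made up for illustration, and the helper names are hypothetical rather than part of any InterPro tool.

```python
from itertools import combinations


def is_valid_child(parent_hits, child_hits):
    """A child entry should match only proteins that its parent also matches."""
    return set(child_hits) <= set(parent_hits)


def overlapping_siblings(hits_by_sibling):
    """Return (name, name, shared proteins) for sibling pairs that share hits."""
    problems = []
    for (name_a, hits_a), (name_b, hits_b) in combinations(sorted(hits_by_sibling.items()), 2):
        shared = set(hits_a) & set(hits_b)
        if shared:
            problems.append((name_a, name_b, sorted(shared)))
    return problems


# Made-up protein identifiers, for illustration only.
PARENT = {"P1", "P2", "P3", "P4"}
CHILDREN = {"child_a": {"P1", "P2"}, "child_b": {"P3"}, "child_c": {"P2", "P4"}}
```

Here every child is a subset of the parent, but `child_a` and `child_c` share `P2`, which is the kind of conflict a curator would need to resolve.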

InterPro hierarchies: Domains DOMAINS can have parent/child relationships with other domains

Domains and Families may be linked through Domain Organisation Hierarchy

InterPro entry
- Groups similar signatures together
- Adds extensive annotation
- Links to other databases
- Structural information and viewers

InterPro entry
- Groups similar signatures together
- Adds extensive annotation
- Links to other databases
- Structural information and viewers
The Gene Ontology project provides a controlled vocabulary of terms for describing gene product characteristics.

InterPro entry
- Groups similar signatures together
- Adds extensive annotation
- Links to other databases
- Structural information and viewers
Linked resources: UniProt, KEGG, Reactome, IntAct, UniProt taxonomy, PANDIT, MEROPS, Pfam clans, PubMed, ...

InterPro entry
- Groups similar signatures together
- Adds extensive annotation
- Links to other databases
- Structural information and viewers
Structural resources:
- PDB: 3-D structures
- SCOP: structural domains
- CATH: structural domain classification

InterProScan
[Figure: protein sequence -> analysis algorithm (using predictive models) -> raw matches -> filtering algorithm -> reported matches]

InterProScan access
- Interactive: http://www.ebi.ac.uk/tools/pfa/iprscan/
- Web service (SOAP and REST):
  http://www.ebi.ac.uk/tools/webservices/services/pfa/iprscan_rest
  http://www.ebi.ac.uk/tools/webservices/services/pfa/iprscan_soap
- Downloadable: ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/

Why redesign InterProScan?
InterProScan 4:
- complicated installation
- complicated update
- limited queuing system (only guaranteed with LSF)
- limited configurability
- reliability

InterProScan 5.0 aims
- Easy install and configuration
- Modular
- Expandable
- Easily integrated into existing pipelines
- Incorporate new data model / XML exchange format
- Easy to port to different architectures: desktop machine, simple LAN, LSF, PBS, Sun Grid Engine, ... cloud? GRID?
- Reliability

InterProScan 5 Technology

Architecture
[Figure: layered architecture.
Entry points: cluster platform, web services, Java API, InterPro website.
Layers: JMS (monitoring queues); job management (scheduling analyses); business logic (performing analyses); database access (Oracle, PostgreSQL, HSQLDB); XML data model; file I/O (file system).]
One-way dependencies + replaceable layers = low coupling + maintainable

Java Messaging Service
- Master: schedules tasks and sub-tasks, and places them on a queue
- Broker: manages queues and topics; starts workers on demand
- Workers: take tasks off queues; each worker performs a task or sub-task and reports back to the Broker
- Monitoring & management application: web or stand-alone app to monitor and manage InterProScan

- Simple and robust programming model
- Mature and stable standard (current JMS version released in 2002)
- Guaranteed message delivery to a single worker
- Easy to monitor
- Flexible: easy to implement on multiple platforms
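The master/worker pattern described above can be sketched without a JMS broker. The example below is an analogy only: it uses Python's in-process `queue.Queue` and threads in place of JMS queues and remote workers, and `run_master` and its parameters are hypothetical names. It shows the two properties the slide stresses: the master only places tasks on a queue, and each task is taken by exactly one worker, which reports its result back.

```python
import queue
import threading


def run_master(tasks, analyse, n_workers=3):
    """Place tasks on a queue, let workers process them, collect the reports."""
    task_queue = queue.Queue()
    report_queue = queue.Queue()

    def worker():
        while True:
            task = task_queue.get()      # each task is handed to exactly one worker
            if task is None:             # sentinel: no more work for this worker
                task_queue.task_done()
                break
            report_queue.put((task, analyse(task)))  # report back
            task_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for task in tasks:
        task_queue.put(task)
    for _ in threads:
        task_queue.put(None)             # one sentinel per worker
    task_queue.join()
    for thread in threads:
        thread.join()

    reports = {}
    while not report_queue.empty():
        task, result = report_queue.get()
        reports[task] = result
    return reports
```

For instance, `run_master(["MAAEEG", "KDEF"], len)` returns the length of each sequence regardless of which worker handled it. Error handling inside `analyse` is deliberately left out of this sketch.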

Beta release functionality

Installation
Requirements: Java 1.6, Linux, Perl
Installation process:
    wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/i5-dist.tar.gz
    tar xzf i5-dist.tar.gz
Ready to use.

./interproscan.sh -i test_proteins.fasta -o test_proteins.tsv -goterms

A2YIW7	f927b0d241297dcc9a1c5990b58bf3c4	122	Pfam	PF00085	Thioredoxin	9	112	1.3E-28	T	08-07-2011	IPR013766	Thioredoxin domain	Biological Process:cell redox homeostasis (GO:0045454)
A2YIW7	f927b0d241297dcc9a1c5990b58bf3c4	122	ProSitePatterns	PS00194	Thioredoxin family active site.	32	50	T	08-07-2011	IPR017937	Thioredoxin, conserved site	Biological Process:cell redox homeostasis (GO:0045454)
A2YIW7	f927b0d241297dcc9a1c5990b58bf3c4	122	PIRSF	PIRSF000077	null	4	113	1.50000307E-27	T	08-07-2011	IPR005746	Thioredoxin	Molecular Function:protein disulfide oxidoreductase activity (GO:0015035), Biological Process:glycerol ether metabolic process (GO:0006662), Biological Process:cell redox homeostasis (GO:0045454), Molecular Function:electron carrier activity (GO:0009055)
A2YIW7	f927b0d241297dcc9a1c5990b58bf3c4	122	PRINTS	PR00421	Thioredoxin family signature	39	48	T	08-07-2011	IPR005746	Thioredoxin	Molecular Function:protein disulfide oxidoreductase activity (GO:0015035), Biological Process:glycerol ether metabolic process (GO:0006662), Biological Process:cell redox homeostasis (GO:0045454), Molecular Function:electron carrier activity (GO:0009055)
A2YIW7	f927b0d241297dcc9a1c5990b58bf3c4	122	PRINTS	PR00421	Thioredoxin family signature	78	89	T	08-07-2011	IPR005746	Thioredoxin	Molecular Function:protein disulfide oxidoreductase activity (GO:0015035), Biological Process:glycerol ether metabolic process (GO:0006662), Biological Process:cell redox homeostasis (GO:0045454), Molecular Function:electron carrier activity (GO:0009055)
A2YIW7	f927b0d241297dcc9a1c5990b58bf3c4	122	PRINTS	PR00421	Thioredoxin family signature	31	39	T	08-07-2011	IPR005746	Thioredoxin	Molecular Function:protein disulfide oxidoreductase activity (GO:0015035), Biological Process:glycerol ether metabolic process (GO:0006662), Biological Process:cell redox homeostasis (GO:0045454), Molecular Function:electron carrier activity (GO:0009055)

Default tab-separated values output
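Tab-separated output like this is straightforward to post-process. The sketch below parses one row with Python's `csv` module. The column names in `COLUMNS` are labels chosen here for readability and are an assumption, not an official header; the sample row is the Pfam match from the slide, with the hyphens lost in extraction restored (`1.3E-28`, `08-07-2011`). Rows that lack a score field would need separate handling.

```python
import csv
import io
import re

# Hypothetical column labels for a 14-field row, chosen for this sketch.
COLUMNS = [
    "protein_accession", "sequence_md5", "sequence_length", "analysis",
    "signature_accession", "signature_description", "start", "end",
    "score", "status", "date", "interpro_accession",
    "interpro_description", "go_terms",
]

SAMPLE_ROW = "\t".join([
    "A2YIW7", "f927b0d241297dcc9a1c5990b58bf3c4", "122", "Pfam", "PF00085",
    "Thioredoxin", "9", "112", "1.3E-28", "T", "08-07-2011", "IPR013766",
    "Thioredoxin domain",
    "Biological Process:cell redox homeostasis (GO:0045454)",
])


def parse_tsv(text):
    """Return one dict per non-empty row, keyed by the labels in COLUMNS."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(COLUMNS, row)) for row in reader if row]


def go_ids(row):
    """Extract GO identifiers such as GO:0045454 from the GO terms field."""
    return re.findall(r"GO:\d{7}", row.get("go_terms", ""))
```

With this in place, `parse_tsv(SAMPLE_ROW)[0]["signature_accession"]` gives the member database accession, and `go_ids` pulls out the GO identifiers for downstream enrichment or filtering.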

./interproscan.sh -i test_proteins.fasta -o test_proteins.xml -goterms -F xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<protein-matches xmlns="http://www.ebi.ac.uk/schema/interpro">
  <protein>
    <sequence md5="f927b0d241297dcc9a1c5990b58bf3c4">MAAEEGVVIACHNKDEFDAQMTKAKEAGKVVIIDFTASWCGPCRFIAPVFAEYAKKFPGAVFLKVDVDELKEVAEKYNVEAMPTFLFIKDGAEADKVVGARKDDLQNTIVKHVGATAASASA</sequence>
    <xref id="A2YIW7"/>
    <matches>
      <fingerprints-match graphscan="iii" evalue="2.500000864E-7">
        <signature name="thioredoxin" desc="thioredoxin family signature" ac="PR00421">
          <models>
            <model name="thioredoxin" desc="thioredoxin family signature" ac="PR00421"/>
          </models>
          <signature-library-release version="41.1" library="prints"/>
        </signature>
        <locations>
          <fingerprints-location score="0.0" pvalue="0.0" motifnumber="3" end="48" start="39"/>
          <fingerprints-location score="0.0" pvalue="0.0" motifnumber="2" end="89" start="78"/>
          <fingerprints-location score="0.0" pvalue="0.0" motifnumber="1" end="39" start="31"/>
        </locations>
      </fingerprints-match>
      <hmmer2-match score="100.5" evalue="-INF">
        <signature name="thioredoxin" ac="PIRSF000077">
          <models>
            <model name="thioredoxin" ac="PIRSF000077"/>
          </models>
          <signature-library-release version="2.74" library="pirsf"/>
        </signature>
        <locations>
          <hmmer2-location hmm-length="0" hmm-end="108" hmm-start="1" evalue="1.50000307E-27" score="0.0" end="113" start="4"/>
        </locations>
      </hmmer2-match>
      ...etc

XML output
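The XML form carries the same matches in a nested structure, which suits programmatic access. The sketch below reads fingerprint match locations with Python's standard `xml.etree.ElementTree`. It works on a small, closed fragment modelled on the slide, not on real InterProScan output; the hyphenated element names (`protein-matches`, `fingerprints-match`, `fingerprints-location`) assume that hyphens were lost when the slide text was extracted, and `fingerprint_locations` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute shown on the slide.
NS = "{http://www.ebi.ac.uk/schema/interpro}"

# Shortened, well-formed fragment modelled on the slide (assumed element names).
SAMPLE_XML = """<protein-matches xmlns="http://www.ebi.ac.uk/schema/interpro">
  <protein>
    <xref id="A2YIW7"/>
    <matches>
      <fingerprints-match>
        <signature ac="PR00421"/>
        <locations>
          <fingerprints-location start="39" end="48"/>
          <fingerprints-location start="78" end="89"/>
          <fingerprints-location start="31" end="39"/>
        </locations>
      </fingerprints-match>
    </matches>
  </protein>
</protein-matches>"""


def fingerprint_locations(xml_text):
    """Return (signature accession, start, end) tuples sorted by start."""
    root = ET.fromstring(xml_text)
    found = []
    for match in root.iter(NS + "fingerprints-match"):
        signature = match.find(NS + "signature")
        accession = signature.get("ac") if signature is not None else None
        for location in match.iter(NS + "fingerprints-location"):
            found.append((accession, int(location.get("start")), int(location.get("end"))))
    return sorted(found, key=lambda item: item[1])
```

Because the document declares a default namespace, every element name has to be qualified with it when searching; forgetting this is a common reason for empty results.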

Pre-calculated match lookup
- BerkeleyDB-backed REST web service
- Includes matches for all of UniParc (27 million sequences): 250 million matches
- Fast response
- Integrated into i5
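The idea behind a pre-calculated lookup is to avoid re-running analyses for sequences that have been seen before. The sketch below illustrates this with an in-memory dictionary standing in for the BerkeleyDB-backed service. Keying on the MD5 digest of the sequence is an assumption, suggested by the `md5` attribute in the XML output rather than stated on the slide; the class and function names are hypothetical.

```python
import hashlib


def sequence_md5(sequence):
    """MD5 hex digest of a sequence, ignoring whitespace and letter case."""
    cleaned = "".join(sequence.split()).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


class MatchLookup:
    """In-memory stand-in for a pre-calculated match store (illustration only)."""

    def __init__(self):
        self._matches_by_md5 = {}

    def add(self, sequence, matches):
        self._matches_by_md5[sequence_md5(sequence)] = list(matches)

    def get(self, sequence):
        """Return stored matches, or None if the sequence must be analysed."""
        return self._matches_by_md5.get(sequence_md5(sequence))
```

A caller would try `get` first and fall back to running the analyses only when it returns `None`.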

Other functionality
- Increased reliability
- Pre-calculated match lookup
- Configuration: simple properties file
- Nucleotide sequence: getorf; map matches to nucleotide coordinates
- Pathway mapping: KEGG, Reactome, MetaCyc, UniPathway
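Mapping a protein match back to nucleotide coordinates is simple arithmetic once the position of the open reading frame is known. The sketch below covers only the forward strand and 1-based inclusive coordinates; it is not InterProScan's implementation, and `protein_to_nucleotide` is a hypothetical helper.

```python
def protein_to_nucleotide(orf_start, aa_start, aa_end):
    """Map a 1-based inclusive protein range to 1-based nucleotide coordinates.

    orf_start is the nucleotide position of the first base of the ORF's
    first codon. Forward strand only.
    """
    if orf_start < 1 or aa_start < 1 or aa_end < aa_start:
        raise ValueError("coordinates must be 1-based with aa_start <= aa_end")
    nt_start = orf_start + 3 * (aa_start - 1)   # first base of the first codon
    nt_end = orf_start + 3 * aa_end - 1         # last base of the last codon
    return nt_start, nt_end
```

A match covering residues 9 to 112 of an ORF that starts at nucleotide 1 spans 104 codons, that is nucleotides 25 to 336.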

Future functionality
- Web service
- Interact directly with architecture: LAN, LSF, PBS, Sun Grid Engine
- Database persistence: Oracle, MySQL, Postgres, etc.
- Graphical output
- Other functionality: ask!

InterProScan 5 timeline
- Beta release: August 2011 (InterProScan 4 still maintained)
- Full release: early 2012 (InterProScan 4 deprecated)
Contact: interproscan-5-dev@googlegroups.com

Acknowledgements
Roles: team leader, developers, bioinformaticians, curators
Sarah Hunter, Matthew Fraser, Anthony Quinn, Phil Jones, Craig McAnulla, Alex Mitchell, Sebastien Pesseat, Maxim Scheremetjew, Siew-Yit Yong, Amaia Sangrador

Any questions? Stand 302

Come and see us at booths 9 and 10!
- Job opportunities
- PhD and postdoc positions
- Training in person and online
- Services
- Industry programme