SELF-SERVICE SEMANTIC DATA FEDERATION

SELF-SERVICE SEMANTIC DATA FEDERATION: WE'LL MAKE YOU A DATA SCIENTIST
Contact: IPSNP Computing Inc., Chris Baker, CEO, Chris.Baker@ipsnp.com, (506) 721 8241

BIG VISION: SELF-SERVICE DATA FEDERATION
Biomedical researchers and clinicians use data and knowledge from multiple sources: online and in-house databases and Web services; nomenclatures and ontologies; Web sites, scientific publications, patents, etc.
Our query engine HYDRA allows non-technical users to query distributed, heterogeneous data sources as a single database, without help from programmers.

SELF-SERVICE DATA FEDERATION WITH SEMANTIC WEB SERVICES

QUERY EXAMPLES
Find the names of drugs that contain chemical category Y as active ingredients.
Find documents mentioning enzyme activity X, extract information on protein mutations and visualize the mutations on a 3D structure.
Find patients with precondition X diagnosed with infections Y resulting from procedure Z.
Find patients who were diagnosed with ABC after the same patients were diagnosed with XYZ.
Find patients diagnosed with X while taking drug C.

ANATOMIZED QUERY EXAMPLE
Query: Find documents mentioning "haloalkane dehalogenase activity", extract information about mutations and visualize the mutations on 3D protein structure images.
To execute this query, HYDRA automatically finds and orchestrates 5 services from our demo registry:
PubMed search: keyword query → document PubMed IDs
PDF retrieval: PubMed ID → PDF file URL
ASCII extraction: PDF file → ASCII text
Text mining: ASCII text → mutation info
Visualization: mutation & protein → 3D image (Jmol)
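
The orchestration of the five services can be illustrated with a toy planner that chains services whose output type matches the next service's input type. This is only a sketch, not HYDRA's proprietary algorithm: the registry dictionary, the `plan_chain` function and the simplified type labels ("PDF file" for the PDF URL, "mutation info" for the visualization input) are assumptions made for illustration.

```python
from collections import deque

# Hypothetical registry: service name -> (input type, output type).
# The labels paraphrase the slide; they are not real SADI class URIs.
REGISTRY = {
    "PubMed search": ("keyword query", "PubMed ID"),
    "PDF retrieval": ("PubMed ID", "PDF file"),
    "ASCII extraction": ("PDF file", "ASCII text"),
    "Text mining": ("ASCII text", "mutation info"),
    "Visualization": ("mutation info", "3D image"),
}


def plan_chain(registry, start, goal):
    """Breadth-first search for the shortest service chain from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        current, path = queue.popleft()
        if current == goal:
            return path
        for name, (src, dst) in registry.items():
            if src == current and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [name]))
    return None  # no chain of registered services connects the two types


chain = plan_chain(REGISTRY, "keyword query", "3D image")
print(" -> ".join(chain))
```

The user only states what goes in (a keyword query) and what should come out (a 3D image); the chain in between is derived from the service descriptions.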

WHAT IS SO COOL ABOUT IT?
Data federation at its best: the data sources are independent and heterogeneous (PubMed document search, PubMed Central for PDFs).
Not only data is integrated: ASCII extraction, text mining and 3D visualisation are algorithms!
Execution is completely automatic: HYDRA finds and invokes the services without any help from the user.

HYDRA MINI DEMO

FIND PUBMED IDS OF DOCUMENTS MENTIONING PROTEIN P22607 AND CO-MENTIONED PROTEINS

SERVICES IN THE REGISTRY

SPARQL GENERATION AND EXECUTION

EXPORTED RESULTS IN AN EXCEL SPREADSHEET

WHAT IS SO COOL ABOUT IT?
Querying is semantic, so it can be self-service: users need not know anything about how the data is organised. They just need to know the terminology of the problem domain.
Reasoning is used to compute answers: the query says just "document" and "protein", while the data may use more specific document and protein types. The class hierarchies are taken from domain ontologies.

HOW IS THIS ALL POSSIBLE?

Key ingredient: the SADI framework for Semantic Web services.
SADI = Semantic Automated Discovery and Integration.
In a nutshell, SADI services are RESTful services (GET/POST) that consume and produce a single format (RDF!) and carry semantic descriptions (in OWL!) that fully define their functionality.

SADI SERVICE I/O
Input: an RDF graph describing some named resource (i.e., a URI, not a blank node). In other words, an RDF graph defining an object with the given URI.
Output: another RDF graph providing additional (computed) information about the input object or, in other words, linking it to other objects.
Since all SADI services talk the same language (RDF), they are 100% syntactically interoperable: the output of one SADI service can be directly consumed by any other SADI service.

computeBMI SERVICE I/O

SEMANTIC SERVICE FUNCTIONALITY DESCRIPTION
OWL syntax is (ab)used to define what RDF graphs are acceptable as input, and what RDF graphs may be produced in the output.
Input(computeBMI) = Person and (has_height exactly 1 (Measurement and (has_value exactly 1 float)))
Output(computeBMI) = has_bmi exactly 1 float
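
To make the input and output classes concrete, here is a minimal sketch of a computeBMI-style service over triples held as plain Python tuples. It is not the SADI library API. The slide's input class shows only the height restriction; because BMI needs a mass as well, the sketch assumes an analogous `has_mass` restriction, and it assumes metres and kilograms as units. The `ex:` identifiers are invented.

```python
def objects(graph, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def single_measurement_value(graph, person, prop):
    """Enforce 'prop exactly 1 (Measurement and (has_value exactly 1 float))'."""
    measurements = objects(graph, person, prop)
    if len(measurements) != 1:
        raise ValueError(f"expected exactly 1 {prop}, found {len(measurements)}")
    node = measurements[0]
    if "Measurement" not in objects(graph, node, "rdf:type"):
        raise ValueError(f"{node} is not typed as Measurement")
    values = objects(graph, node, "has_value")
    if len(values) != 1 or not isinstance(values[0], float):
        raise ValueError(f"{node} needs exactly 1 float has_value")
    return values[0]


def compute_bmi(graph, person):
    """Check the input class, then return the triples the service adds."""
    if "Person" not in objects(graph, person, "rdf:type"):
        raise ValueError(f"{person} is not typed as Person")
    height_m = single_measurement_value(graph, person, "has_height")
    # Assumption: has_mass is not on the slide; BMI cannot be computed without it.
    mass_kg = single_measurement_value(graph, person, "has_mass")
    return {(person, "has_bmi", mass_kg / height_m ** 2)}


patient = {
    ("ex:alice", "rdf:type", "Person"),
    ("ex:alice", "has_height", "ex:h1"),
    ("ex:h1", "rdf:type", "Measurement"),
    ("ex:h1", "has_value", 2.0),
    ("ex:alice", "has_mass", "ex:m1"),
    ("ex:m1", "rdf:type", "Measurement"),
    ("ex:m1", "has_value", 80.0),
}
result = compute_bmi(patient, "ex:alice")
print(result)
```

The output graph attaches `has_bmi` to the same URI that was passed in, which is what lets a client merge it back into the input data.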

PRACTICAL IMPLICATIONS
The OWL expressions for a service's I/O completely define what the service expects and can accept as input, and what RDF assertions the service can add to the input.
Together with total syntactic interoperability, this feature ENABLES THE DEVELOPMENT OF CLIENTS THAT, GIVEN A SPARQL QUERY, CAN FIGURE OUT WHICH SERVICES CAN BE USED TO COMPUTE IT, AND INVOKE THEM AUTOMATICALLY!
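
A deliberately naive sketch of that idea: pull the predicates out of a query's triple patterns and look up which registered services declare that they add those predicates. Real clients reason over the OWL input and output classes; this one only compares strings. The service names and the `SERVICE_OUTPUTS` registry are hypothetical, and the query is the Parkinson's example from the Q and A slides.

```python
# Hypothetical registry: service name -> predicate its output class attaches.
SERVICE_OUTPUTS = {
    "getProteinsByKeyword": "ont:hastag",
    "getGOTerms": "pred:hasgoterm",
    "getTermName": "pred:hastermname",
    "computeBMI": "has_bmi",
}

QUERY = (
    "SELECT ?term ?name WHERE { "
    "?protein ont:hastag keyword:parkinson . "
    "?protein pred:hasgoterm ?term . "
    "?term pred:hastermname ?name }"
)


def query_predicates(sparql):
    """Extract the predicate of each 'subject predicate object' pattern.

    Handles only flat patterns separated by ' .'; this is not a SPARQL parser.
    """
    body = sparql[sparql.index("{") + 1 : sparql.rindex("}")]
    predicates = []
    for pattern in body.split(" ."):
        parts = pattern.split()
        if len(parts) == 3:
            predicates.append(parts[1])
    return predicates


def select_services(sparql, outputs):
    """Return the services whose declared output predicate the query uses."""
    by_predicate = {pred: name for name, pred in outputs.items()}
    return [by_predicate[p] for p in query_predicates(sparql) if p in by_predicate]


selected = select_services(QUERY, SERVICE_OUTPUTS)
print(selected)
```

Services that add nothing the query asks for (here the BMI service) are never selected, so the user does not have to name any service explicitly.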

RESOURCE PUBLISHING WITH SADI
Specify the source of data or software you want to publish with SADI. For example, a database about drugs treating specific diseases can support a service linking drug IDs to disease IDs.
Model the data semantically: find ontologies describing your domains and decide how your data will be expressed in terms of these ontologies.

Define your services' I/O semantically: decide how to describe the operation of your services in terms of the domain ontologies, i.e., what will be written in the input and output classes.
Code the business logic of your services in Java, Perl or Python. If a service wraps a database, convert the input RDF into a query and the query results back to RDF. This effort is usually tiny compared to the modelling.
Overall development costs may be considerable, but they are well amortized because SADI services are highly reusable, owing to their unprecedented degree of interoperability and discoverability.
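
The database-wrapping step can be sketched with the drug and disease example: read drug URIs from the input triples, run a query, and turn the rows back into triples. The sketch uses an in-memory SQLite database and plain tuples instead of a SADI library; the table name, the `treats` predicate and all identifiers are invented placeholders.

```python
import sqlite3


def make_demo_db():
    """Build a throwaway in-memory database with placeholder IDs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE treats (drug_id TEXT, disease_id TEXT)")
    conn.executemany(
        "INSERT INTO treats VALUES (?, ?)",
        [
            ("drug:D1", "disease:X1"),
            ("drug:D1", "disease:X2"),
            ("drug:D2", "disease:X3"),
        ],
    )
    return conn


def drug_to_disease_service(conn, input_graph):
    """Input triples -> SQL query -> output triples about the same drug URIs."""
    output = set()
    for subject, predicate, obj in input_graph:
        if predicate == "rdf:type" and obj == "Drug":
            rows = conn.execute(
                "SELECT disease_id FROM treats WHERE drug_id = ?", (subject,)
            )
            for (disease_id,) in rows:
                output.add((subject, "treats", disease_id))
    return output


conn = make_demo_db()
links = drug_to_disease_service(conn, {("drug:D1", "rdf:type", "Drug")})
print(sorted(links))
```

The conversion code is short; deciding that the relation should be called `treats` and what counts as a `Drug` is the modelling work the slide refers to.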

SERVICE DEVELOPMENT COST CONSIDERATIONS
Wrapping existing functionality as SADI services is usually trivial: available libraries (Java, Perl, Python) hide the communication-related issues.
However, some programming is always required to process I/O: RDF (Jena) → internal representation → RDF (Jena).
Complete or partial automation is desirable!

AUTHORING SADI SERVICE DESCRIPTIONS: TUTORIAL MATERIALS
CSHALS 2013: http://www.iscb.org/cshals2013-program/cshals2013-tutorial#sadi
Introduction: http://www.iscb.org/cms_addon/conferences/cshals2013/tutorial/sadi_0_intro.pdf
Writing SADI Services Tutorial Part 1: http://www.iscb.org/cms_addon/conferences/cshals2013/tutorial/sadi_1_bmi_demo.pdf
Writing SADI Services Tutorial Part 2: http://www.iscb.org/cms_addon/conferences/cshals2013/tutorial/sadi_2_get_measurements_demo.pdf

AUTOMATIC SERVICE GENERATION
Observation: some patterns in service code are very frequent, which is an opportunity for automation.
Example: the online database LinkDB (Kyoto U.) maps between many biological entity IDs, e.g., KEGG Gene to UniProt.
o There are too many almost identical services to program manually, so we wrote a program that generates service code automatically from its semantic description.
Nearly completely automatic generation is possible for relational data: SQL databases, triplestores, spreadsheets, CSV, etc. (forthcoming PhD thesis by Md. Sadnan Al-Manir).
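
The "many almost identical services" pattern can be illustrated with a factory that builds one mapping service per declarative description. Note the substitution: the generator described on the slide emits service source code, whereas this sketch builds closures at run time, which is simpler to show. The description fields, class and predicate names, and the placeholder IDs are all assumptions, not LinkDB data.

```python
def make_mapping_service(description, lookup):
    """Build a service function from a declarative description and an ID table."""
    input_class = description["input_class"]
    predicate = description["output_predicate"]

    def service(input_graph):
        output = set()
        for s, p, o in input_graph:
            if p == "rdf:type" and o == input_class:
                for target in lookup.get(s, []):
                    output.add((s, predicate, target))
        return output

    service.__name__ = description["name"]
    return service


# Hypothetical descriptions: one entry per ID-to-ID mapping to publish.
DESCRIPTIONS = [
    {"name": "keggGeneToUniProt", "input_class": "KEGG_Gene",
     "output_predicate": "has_uniprot_id"},
    {"name": "uniProtToKeggGene", "input_class": "UniProt_Record",
     "output_predicate": "has_kegg_gene_id"},
]

# Placeholder lookup tables standing in for the wrapped database.
LOOKUPS = {
    "keggGeneToUniProt": {"kegg:gene1": ["uniprot:protA"]},
    "uniProtToKeggGene": {"uniprot:protA": ["kegg:gene1"]},
}

services = {
    d["name"]: make_mapping_service(d, LOOKUPS[d["name"]]) for d in DESCRIPTIONS
}
forward = services["keggGeneToUniProt"]({("kegg:gene1", "rdf:type", "KEGG_Gene")})
print(forward)
```

Adding another mapping then means adding one description entry rather than writing another service by hand.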

BRIEF HISTORY AND CURRENT STATE OF SADI
Proposed by Mark Wilkinson in 2009 and originally developed by his group at UBC.
Over $1.6M invested in academic and commercial SADI-related projects.
Open-source Java, Perl and Python APIs for writing SADI services.
One open-source and one commercial query client (SHARE and HYDRA).
Multiple case studies in Genomics, Cheminformatics, Lipidomics, Ecotoxicology and Clinical Intelligence.

IPSNP'S HYDRA AND QUERY GUI

HYDRA QUERY ENGINE
Given a SPARQL query, HYDRA analyses it using an intelligent logic-based algorithm (proprietary).
HYDRA requests descriptions of potentially useful services from the available SADI service registries.
HYDRA processes the descriptions and figures out which services have to be invoked, on what data and in what order.

HYDRA VS SHARE
SHARE is a first prototype of a query engine for SADI, developed by Mark Wilkinson's group at UBC.
It proves the concept well (it inspired us, anyway).
But it's gradware: the architecture is simultaneously naive and unnecessarily complex, and it cannot scale up by incremental improvement.
HYDRA's architecture was developed from the start with scalability in mind (details are proprietary).

REASONING-ENABLED QUERYING
Some queries are too complex unless generality can be exploited:
o For example, a query concerning all antibiotics requires generalisation; otherwise all types of antibiotics would have to be enumerated in the query.
A much better way to do this is to import a classification of drugs and use it in query execution.
HYDRA facilitates such reasoning, and even more complex reasoning with rules.
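
The antibiotics case can be sketched as subclass reasoning over an imported classification: the query names only the general class, and the hierarchy supplies the specific ones. The tiny hierarchy, the prescription records and the function names below are illustrative assumptions, not HYDRA's reasoner or a real drug ontology.

```python
# Toy classification: child class -> parent class.
SUBCLASS_OF = {
    "Penicillin": "BetaLactam",
    "Cephalosporin": "BetaLactam",
    "BetaLactam": "Antibiotic",
    "Macrolide": "Antibiotic",
    "Antibiotic": "Drug",
    "Statin": "Drug",
}

# Invented records: (patient, class of the drug taken).
PRESCRIPTIONS = [
    ("patient1", "Penicillin"),
    ("patient2", "Statin"),
    ("patient3", "Macrolide"),
]


def is_a(cls, ancestor, hierarchy):
    """True if cls equals ancestor or reaches it by following parent links."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = hierarchy.get(cls)
    return False


def patients_taking(drug_class, prescriptions, hierarchy):
    """Patients whose drug falls anywhere under drug_class."""
    return sorted({p for p, d in prescriptions if is_a(d, drug_class, hierarchy)})


print(patients_taking("Antibiotic", PRESCRIPTIONS, SUBCLASS_OF))
```

The query mentions "Antibiotic" once; penicillins and macrolides are matched through the hierarchy instead of being listed by hand.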

QUERY COMPOSITION GUI FOR NON-TECHNICAL USERS
SPARQL is nicer than many other query languages, but it's still too complex for non-technical users.
IPSNP's browser-based GUI facilitates
o intuitive graphical query composition and editing;
o Google-like keyword-based query interpretation.

QUERY COMPOSITION
Queries are built from Google-like keyphrases.
Keyphrase: document mentions protein P22607

A QUERY GRAPH IS GENERATED FOR THE KEYPHRASE
document mentions protein P22607

ADDING AN ADDITIONAL KEYPHRASE
Keyphrase: has pubmed id

QUERY GRAPH IS EXTENDED WITH SECOND KEYPHRASE
Keyphrase: document mentions protein P22607
Keyphrase: has pubmed id

OPTION 2: MANUALLY ADD CLASSES, INCOMING AND OUTGOING PROPERTIES

MANUALLY ADDED PROPERTY

PLANNED HYDRA-BASED PRODUCTS
HYDRA + GUI = self-service query tool, in both standalone and cloud-based editions, with a critical mass of SADI services.
HYDRA as a Java API to be used as middleware; it may be embedded in OEM partners' software.
IPSNP will also implement HYDRA-based turnkey solutions for prominent Bioinformatics and Clinical IT problems.

WHO WE ARE

IPSNP COMPUTING INC.
Startup based in NB, Canada, building on and commercializing prior academic research on SADI.
Founded to develop an industrial-strength query client for SADI, to supersede the research proof-of-concept prototype SHARE (UBC).
Development stage: HYDRA is advanced alpha, the GUI is pre-alpha.
Ongoing pilots: customers in Bioinformatics / Clinical IT.
Looking for angel investors and/or customers.

THANK YOU
Further materials/services are available on request:
Live demos.
Publications on previous case studies.
Business info packs for investors.
Training/consulting.
Pitch slides for hospital pilots on hospital-acquired infection surveillance and clinical trial cohort selection.

Q AND A SLIDES

OVERALL REFERENCE ARCHITECTURE FOR CLINICAL TRIALS

DIAGNOSTIC ELEMENTS AS QUERIES PATIENT HAS LEUKOPENIA

CHECKING ELIGIBILITY CRITERIA
Cost reduction: automatic pre-selection of patients.
Timely alerts about eligible patients.

QUERYING SADI WEB SERVICES
Find Gene Ontology terms (biological process, cellular component, and molecular function annotations) for proteins associated with Parkinson's disease:
    SELECT ?term ?name
    WHERE {
      ?protein ont:hastag keyword:parkinson .
      ?protein pred:hasgoterm ?term .
      ?term pred:hastermname ?name
    }
[Diagram: client, query engine, service registry, services]