SELF-SERVICE SEMANTIC DATA FEDERATION: WE'LL MAKE YOU A DATA SCIENTIST
Contact: IPSNP Computing Inc., Chris Baker, CEO, Chris.Baker@ipsnp.com, (506) 721 8241
BIG VISION: SELF-SERVICE DATA FEDERATION Biomedical researchers and clinicians use data and knowledge from multiple sources: online and in-house databases and Web services; nomenclatures and ontologies; Web sites, scientific publications, patents, etc. Our query engine HYDRA allows non-technical users to query distributed heterogeneous data sources as if they were a single database, without help from programmers.
SELF-SERVICE DATA FEDERATION WITH SEMANTIC WEB SERVICES
QUERY EXAMPLES
o Find the names of drugs that contain chemical category Y as active ingredients.
o Find documents mentioning enzyme activity X, extract info on protein mutations and visualize the mutations on the 3D structure.
o Find patients with precondition X diagnosed with infections Y resulting from procedure Z.
o Find patients that were diagnosed with ABC after the same patients were diagnosed with XYZ.
o Find patients diagnosed with X while taking drug C.
ANATOMIZED QUERY EXAMPLE
Query: Find documents mentioning "haloalkane dehalogenase activity", extract information about mutations and visualize the mutations on 3D protein structure images.
To execute this query, HYDRA automatically finds and orchestrates 5 services from our demo registry:
o PubMed search: keyword query -> document PubMed IDs
o PDF retrieval: PubMed ID -> PDF file URL
o ASCII extraction: PDF file -> ASCII text
o Text mining: ASCII text -> mutation info
o Visualization: mutation & protein -> 3D image (Jmol)
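The chaining of the five services can be illustrated with a small sketch. This is not HYDRA's algorithm, which is proprietary; it is a naive illustration, assuming each service is described only by one input type and one output type. The `Service` tuple, the type labels and the stand-in lambdas are hypothetical, and the visualization step is simplified to take mutation info alone.

```python
from typing import Callable, NamedTuple


class Service(NamedTuple):
    name: str
    consumes: str              # type of data the service accepts
    produces: str              # type of data the service returns
    run: Callable[[str], str]  # stand-in for the real service call


# Hypothetical demo registry, deliberately listed out of order.
REGISTRY = [
    Service("Visualization", "mutation info", "3D image", lambda x: f"image({x})"),
    Service("PubMed search", "keyword query", "PubMed ID", lambda x: f"pmid({x})"),
    Service("Text mining", "ASCII text", "mutation info", lambda x: f"mutations({x})"),
    Service("PDF retrieval", "PubMed ID", "PDF URL", lambda x: f"pdf({x})"),
    Service("ASCII extraction", "PDF URL", "ASCII text", lambda x: f"text({x})"),
]


def plan(registry, start, goal):
    """Chain services so that each one consumes what the previous one produced."""
    by_input = {s.consumes: s for s in registry}
    chain, current = [], start
    while current != goal:
        service = by_input.get(current)
        if service is None or service in chain:
            raise LookupError(f"no service chain from {start!r} to {goal!r}")
        chain.append(service)
        current = service.produces
    return chain


def execute(chain, value):
    """Feed the output of each service into the next one."""
    for service in chain:
        value = service.run(value)
    return value


chain = plan(REGISTRY, "keyword query", "3D image")
names = [s.name for s in chain]
result = execute(chain, "haloalkane dehalogenase activity")
```

Here `names` lists the services in the order they would be invoked, and `result` shows the nesting of the stand-in calls.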
WHAT IS SO COOL ABOUT IT? Data federation at its best: independent, heterogeneous data sources (PubMed doc search, PubMed Central for PDFs); not only data is integrated: ASCII extraction, text mining and 3D visualisation are algorithms! Execution is completely automatic: HYDRA finds and invokes the services without any help from the user.
HYDRA MINI DEMO
FIND PUBMED IDS OF DOCUMENTS MENTIONING PROTEIN P22607 AND CO-MENTIONED PROTEINS
SERVICES IN THE REGISTRY
SPARQL GENERATION AND EXECUTION
EXPORTED RESULTS IN AN EXCEL SPREADSHEET
WHAT IS SO COOL ABOUT IT? Querying is semantic, so it can be self-service: users need not know anything about how the data is organised. They just need to know the terminology of the problem domain. Reasoning is used to compute answers: the query says just document and protein, and the data may use specific document and protein types. The class hierarchies are taken from domain ontologies.
HOW IS THIS ALL POSSIBLE?
HOW IS THIS ALL POSSIBLE? Key ingredient: the SADI framework for Semantic Web services. SADI = Semantic Automated Discovery and Integration. In a nutshell, SADI services are RESTful services (GET/POST) consuming and producing one format (RDF!), with semantic descriptions (in OWL!) fully defining their functionality.
SADI SERVICE I/O Input: an RDF graph describing some named resource (i.e., a URI, not a blank node). In other words, an RDF graph defining an object with the given URI. Output: another RDF graph providing additional (computed) info about the input object or, in other words, linking it to other objects. Since all SADI services talk the same language (RDF), they are 100% syntactically interoperable: the output of one SADI service can be directly consumed by any other SADI service.
computeBMI SERVICE I/O
SEMANTIC SERVICE FUNCTIONALITY DESCRIPTION OWL syntax is (ab)used to define what RDF graphs are acceptable as input, and what RDF graphs may be produced in the output.
Input(computeBMI) = Person and (has_height exactly 1 (Measurement and (has_value exactly 1 float)))
Output(computeBMI) = has_bmi exactly 1 float
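A minimal sketch of what these two class expressions mean operationally, assuming RDF graphs are represented as plain lists of (subject, predicate, object) tuples rather than with a real RDF library. The `http://example.org/` namespace is hypothetical. The slide shows only the has_height restriction; a has_mass measurement modelled the same way is assumed here because BMI cannot be computed without it, and metres and kilograms are assumed as units.

```python
RDF_TYPE = "rdf:type"
EX = "http://example.org/"  # hypothetical namespace, not from the slides


def objects(graph, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def single_measurement(graph, person, prop):
    """Return the float if `person` has exactly 1 `prop` pointing to a
    Measurement with exactly 1 float has_value; otherwise None."""
    measurements = objects(graph, person, EX + prop)
    if len(measurements) != 1:
        return None
    m = measurements[0]
    if EX + "Measurement" not in objects(graph, m, RDF_TYPE):
        return None
    vals = objects(graph, m, EX + "has_value")
    if len(vals) == 1 and isinstance(vals[0], float):
        return vals[0]
    return None


def compute_bmi(graph, person):
    """Check the input class, then return the output graph (has_bmi exactly 1 float)."""
    if EX + "Person" not in objects(graph, person, RDF_TYPE):
        raise ValueError("input resource is not a Person")
    height_m = single_measurement(graph, person, "has_height")
    mass_kg = single_measurement(graph, person, "has_mass")  # assumed property
    if height_m is None or mass_kg is None:
        raise ValueError("input graph does not satisfy the input class")
    return [(person, EX + "has_bmi", mass_kg / height_m ** 2)]


alice = EX + "alice"
graph = [
    (alice, RDF_TYPE, EX + "Person"),
    (alice, EX + "has_height", EX + "h1"),
    (EX + "h1", RDF_TYPE, EX + "Measurement"),
    (EX + "h1", EX + "has_value", 1.80),
    (alice, EX + "has_mass", EX + "m1"),
    (EX + "m1", RDF_TYPE, EX + "Measurement"),
    (EX + "m1", EX + "has_value", 81.0),
]
output = compute_bmi(graph, alice)
```

The output graph contains a single triple attached to the same URI as the input, which is what lets another service consume it directly.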
PRACTICAL IMPLICATIONS The OWL expressions for a service I/O completely define: what the service expects and can accept as input, and what RDF assertions the service can add to the input. Together with total syntactic interoperability, this feature ENABLES THE DEVELOPMENT OF CLIENTS THAT, GIVEN A SPARQL QUERY, CAN FIGURE OUT WHICH SERVICES CAN BE USED TO COMPUTE IT, AND INVOKE THEM AUTOMATICALLY!
RESOURCE PUBLISHING WITH SADI Specify the source of data / software you want to publish with SADI. For example, a database about drugs treating specific diseases can support a service linking drug IDs to disease IDs. Model data semantically: find ontologies describing your domains and decide how your data will be expressed in the terms of these ontologies.
RESOURCE PUBLISHING WITH SADI Define your services' I/O semantically: decide how to describe the operation of your services in the terms of the domain ontologies, i.e., what will be written in the input and output classes. Code the business logic of your services in Java, Perl or Python. If a service wraps a DB, convert the input RDF into a query and the query results back to RDF. This coding effort is usually tiny compared to the modelling. Overall development costs may be considerable, but they are well amortized because SADI services are highly reusable, due to their high degree of interoperability and discoverability.
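The DB-wrapping step (input RDF to query, query results back to RDF) might look roughly like the sketch below. It uses the drug-to-disease example with an in-memory SQLite database and plain tuples in place of a real RDF library; the table, the IDs, and the `ex:Drug` and `treats` terms are all hypothetical, and the SADI libraries' own APIs are not shown.

```python
import sqlite3

RDF_TYPE = "rdf:type"
DRUG = "ex:Drug"                        # hypothetical input class
TREATS = "http://example.org/treats"    # hypothetical output property


def make_demo_db():
    """Create a tiny in-memory table standing in for the wrapped database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE treatment (drug_id TEXT, disease_id TEXT)")
    conn.executemany(
        "INSERT INTO treatment VALUES (?, ?)",
        [("drug:D1", "disease:X1"), ("drug:D1", "disease:X2"), ("drug:D2", "disease:X3")],
    )
    return conn


def drug_to_disease_service(conn, input_graph):
    """For every resource typed as a Drug, look up its diseases and emit triples."""
    output = []
    for subject, predicate, obj in input_graph:
        if predicate == RDF_TYPE and obj == DRUG:
            # Input RDF -> parameterized SQL query.
            rows = conn.execute(
                "SELECT disease_id FROM treatment WHERE drug_id = ? ORDER BY disease_id",
                (subject,),
            )
            # Query results -> output RDF triples about the same resource.
            output.extend((subject, TREATS, disease) for (disease,) in rows)
    return output


conn = make_demo_db()
output = drug_to_disease_service(conn, [("drug:D1", RDF_TYPE, DRUG)])
```

The business logic is a single parameterized query, which is consistent with the point that coding is small next to the modelling work.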
SERVICE DEVELOPMENT COST CONSIDERATIONS Wrapping existing functionality as SADI services is usually trivial: available libraries (Java, Perl, Python) hide the communication-related issues. However, some programming is always required to process I/O: RDF (Jena) -> internal representation -> RDF (Jena). Complete or partial automation is desirable!
AUTHORING SADI SERVICE DESCRIPTIONS: TUTORIAL MATERIALS
o CSHALS 2013: http://www.iscb.org/cshals2013-program/cshals2013-tutorial#sadi
o Introduction: http://www.iscb.org/cms_addon/conferences/cshals2013/tutorial/sadi_0_intro.pdf
o Writing SADI Services Tutorial Part 1: http://www.iscb.org/cms_addon/conferences/cshals2013/tutorial/sadi_1_bmi_demo.pdf
o Writing SADI Services Tutorial Part 2: http://www.iscb.org/cms_addon/conferences/cshals2013/tutorial/sadi_2_get_measurements_demo.pdf
AUTOMATIC SERVICE GENERATION Observation: some patterns in service code are very frequent, which is an opportunity for automation. Example: the online DB LinkDB (Kyoto U.) maps between many biological entity IDs, e.g., KEGG Gene to UniProt.
o Too many almost identical services to program manually, so we wrote a program that generates service code automatically from its semantic description.
Nearly completely automatic generation is possible for relational data: SQL DBs, triplestores, spreadsheets, CSV, etc. (forthcoming PhD thesis by Md. Sadnan Al-Manir).
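The idea of generating near-identical mapping services from descriptions can be sketched with a source-code template. This is only an illustration of the pattern, not the generator described above: the description fields, the table layout and the IDs are hypothetical, and the description is a plain dictionary rather than an OWL class expression.

```python
import sqlite3
from string import Template

# Template for one ID-mapping service; every $field comes from the description.
SERVICE_TEMPLATE = Template('''\
def $name(conn, input_ids):
    """Generated service: attach $predicate links to each input ID."""
    sql = "SELECT $target_col FROM $table WHERE $source_col = ?"
    return [(i, "$predicate", row[0])
            for i in input_ids
            for row in conn.execute(sql, (i,))]
''')


def generate_service(description):
    """Render the service source from its description and load it as a function."""
    source = SERVICE_TEMPLATE.substitute(description)
    namespace = {}
    exec(source, namespace)  # acceptable here because the template is trusted
    return source, namespace[description["name"]]


# Hypothetical description of one mapping among many similar ones.
description = {
    "name": "kegg_gene_to_uniprot",
    "table": "linkdb",
    "source_col": "kegg_gene",
    "target_col": "uniprot",
    "predicate": "ex:has_uniprot_id",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE linkdb (kegg_gene TEXT, uniprot TEXT)")
conn.execute("INSERT INTO linkdb VALUES ('kegg:G1', 'uniprot:U1')")

source, service = generate_service(description)
links = service(conn, ["kegg:G1", "kegg:G2"])
```

Adding another mapping then only requires another description, not another hand-written service.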
BRIEF HISTORY AND CURRENT STATE OF SADI Proposed by Mark Wilkinson in 2009 and originally developed by his group at UBC. Over $1.6M invested in academic and commercial SADI-related projects. Open-source Java, Perl and Python APIs for writing SADI services. One open-source and one commercial query client (SHARE and HYDRA). Multiple case studies in Genomics, Cheminformatics, Lipidomics, Ecotoxicology and Clinical Intelligence.
IPSNP'S HYDRA AND QUERY GUI
HYDRA QUERY ENGINE Given a SPARQL query, HYDRA analyses it using an intelligent logic-based algorithm (proprietary). HYDRA requests descriptions of potentially useful services from the available SADI service registries. HYDRA processes the descriptions and figures out which services have to be invoked, on what data and in what order.
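HYDRA's actual algorithm is proprietary, so the following is only a naive greedy sketch of the general idea: match each triple pattern of a query to a service whose description says it attaches that predicate, and order the calls so that a pattern's subject is bound before its service is invoked. The predicate names, service names and the dictionary registry are hypothetical.

```python
from typing import NamedTuple


class Pattern(NamedTuple):
    subject: str    # a constant, or a variable starting with "?"
    predicate: str
    obj: str


# Hypothetical registry: predicate a service attaches -> service name.
REGISTRY = {
    "hasGoTerm": "getGoTerms",
    "hasTermName": "getTermName",
    "hasBmi": "computeBMI",
}


def plan(patterns, registry):
    """Return (service, input) steps in an order where inputs are already bound."""
    bound = set()
    pending = list(patterns)
    steps = []
    while pending:
        for p in pending:
            ready = not p.subject.startswith("?") or p.subject in bound
            if ready and p.predicate in registry:
                steps.append((registry[p.predicate], p.subject))
                bound.add(p.obj)   # the service call binds the object variable
                pending.remove(p)
                break              # restart the scan with the new bindings
        else:
            raise LookupError(f"no service can make progress on {pending}")
    return steps


# Patterns given out of execution order on purpose.
query = [
    Pattern("?term", "hasTermName", "?name"),
    Pattern("uniprot:P22607", "hasGoTerm", "?term"),
]
steps = plan(query, REGISTRY)
```

The planner never calls `computeBMI` because no pattern in the query mentions the predicate it attaches.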
HYDRA VS SHARE SHARE is a first prototype of a query engine for SADI, developed by Mark Wilkinson's group at UBC. It proves the concept well (inspired us, anyway). But it's gradware: the architecture is naive and unnecessarily complex at the same time, so it cannot scale up by incremental improvement. HYDRA's architecture was developed from the start with scalability in mind (details are proprietary).
REASONING-ENABLED QUERYING Some queries are too complex unless generality can be exploited:
o For example, a query concerning all antibiotics requires generalisation; otherwise all types of antibiotics would have to be enumerated in the query.
A much better way to do this is to import a classification of drugs and use it in query execution. HYDRA facilitates such reasoning, and even more complex reasoning with rules.
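The antibiotics example can be sketched as a subclass lookup over an imported classification. This is a minimal illustration, assuming a single-parent hierarchy stored in a dictionary; the class fragment and the patient data are hypothetical, and real ontologies allow multiple parents and richer axioms.

```python
# Hypothetical fragment of a drug classification: class -> direct superclass.
SUBCLASS_OF = {
    "Penicillin": "BetaLactam",
    "Cephalosporin": "BetaLactam",
    "BetaLactam": "Antibiotic",
    "Macrolide": "Antibiotic",
    "Antibiotic": "Drug",
    "Statin": "Drug",
}

# Hypothetical data: the records use specific drug types only.
PRESCRIPTIONS = [
    ("patient1", "Penicillin"),
    ("patient2", "Statin"),
    ("patient3", "Macrolide"),
]


def is_a(cls, ancestor):
    """True if `cls` equals `ancestor` or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def patients_taking(drug_class):
    """Answer a query phrased with a general class against specific data."""
    return sorted({p for p, drug in PRESCRIPTIONS if is_a(drug, drug_class)})


antibiotic_patients = patients_taking("Antibiotic")
```

The query names only "Antibiotic"; the hierarchy supplies the specific drug types, so none of them has to be enumerated by the user.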
QUERY COMPOSITION GUI FOR NON-TECHNICAL USERS SPARQL is nicer than many other query languages, but it's still too complex for non-technical users. IPSNP's browser-based GUI facilitates:
o intuitive graphical query composition and editing;
o Google-like keyword-based query interpretation.
QUERY COMPOSITION Queries built based on entry of Google-like Keyphrases: Keyphrase: document mentions protein P22607
A QUERY GRAPH IS GENERATED FOR THE KEYPHRASE document mentions protein P22607
ADDING AN ADDITIONAL KEYPHRASE Keyphrase: has pubmed id
QUERY GRAPH IS EXTENDED WITH SECOND KEYPHRASE Keyphrase: document mentions protein P22607 Keyphrase: has pubmed id
OPTION 2: MANUALLY ADD CLASSES, INCOMING AND OUTGOING PROPERTIES
MANUALLY ADDED PROPERTY
PLANNED HYDRA-BASED PRODUCTS
o HYDRA + GUI = self-service query tool, in both standalone and cloud-based editions, with a critical mass of SADI services.
o HYDRA as a Java API to be used as middleware; may be embedded in OEM partners' software.
o IPSNP will also implement HYDRA-based turnkey solutions for prominent Bioinformatics and Clinical IT problems.
WHO WE ARE
IPSNP COMPUTING INC. Startup based in NB, Canada, building on and commercializing prior academic research on SADI. Founded to develop an industrial-strength query client for SADI, to supersede the research proof-of-concept prototype SHARE (UBC). Development stage: HYDRA is advanced alpha, the GUI is pre-alpha. Ongoing pilots: customers in Bioinformatics / Clinical IT. Looking for angel investors and/or customers.
THANK YOU Further materials/services are available on request: Live demos. Publications on previous case studies. Business info packs for investors. Training/consulting. Pitch slides for hospital pilots on hospital-acquired infections surveillance, and clinical trial cohort selection.
Q AND A SLIDES
OVERALL REFERENCE ARCHITECTURE FOR CLINICAL TRIALS
DIAGNOSTIC ELEMENTS AS QUERIES PATIENT HAS LEUKOPENIA
CHECKING ELIGIBILITY CRITERIA Cost reduction: automatic pre-selection of patients Timely alerts about eligible patients
QUERYING SADI WEB SERVICES
Find Gene Ontology terms (biological process, cellular component, and molecular function annotations) for proteins associated with Parkinson's disease:
SELECT ?term ?name
WHERE {
  ?protein ont:hastag keyword:parkinson .
  ?protein pred:hasgoterm ?term .
  ?term pred:hastermname ?name
}
[Diagram: client, query engine, service registry, services]