The National Center for Biomedical Ontology: Current State and Future State Architecture

Benjamin Dai, Chief Software Architect
March 2009

Table of Contents

1 Executive Summary
2 Introduction
3 NCBO Software Principles
4 NCBO Products & Technology Landscape
5 Current State: High Level Architecture
6 Current State: BioPortal
  6.1 BioPortal Web Application
    6.1.1 Service Oriented Architecture
    6.1.2 Consolidating the BioPortal back-end
  6.2 BioPortal Widgets
7 Current State: Annotations and Ontology-Based Services
8 Current State: NCBO PURL Server
9 NCBO Fabric
  9.1 Protégé 4 Plugin for BioPortal
  9.2 NLM License Server Integration for UMLS
  9.3 OBO to OWL Converter Wrapper
  9.4 BioPortal FOAF User
10 Future Technologies and Solutions
  10.1 Future State: Advancing Visualization
    10.1.1 History Rich Ontology Navigation
    10.1.2 Ontology Mapping
    10.1.3 User Interface Mashups
    10.1.4 BioPortal User Studies
  10.2 Future State: NCBO Fabric - Web Service Harmonization
  10.3 Future State: IT Operations Management
  10.4 Future State: NCBO Node
    10.4.1 The Layered Architecture of an NCBO node
    10.4.2 Business Logic Tier and Storage Tier
    10.4.3 Services Tier
    10.4.4 Consumers of NCBO Node Services
  10.5 Future State: RDF Triple Store Back-end
11 Conclusion

1 Executive Summary

The National Center for Biomedical Ontology (NCBO) vision is that all biomedical knowledge and data are disseminated on the Internet using principled ontologies, such that the knowledge and data are semantically interoperable and useful for furthering biomedical science and clinical care. The architecture and solutions presented in this paper represent a step toward that vision: they deliver repositories and software tools for using this biomedical information in research and, ultimately, for enhancing scientific discovery.

2 Introduction

This report presents the NCBO current-state architecture and future-state architecture roadmap. It adopts the definition of architecture given in ANSI/IEEE Standard 1471-2000: "The fundamental organization of a system, embodied in its components, their relationships to each other and the environment, and the principles governing its design and evolution." In the context of this definition, an architecture can have multiple designs, whereas a design has one implementation.

This document starts with the current state of the architecture and design of the NCBO products and technologies. It then presents the technology and solutions expected in the short term and for the NIH grant renewal. The underlying principle of both the current and future state is a web-based architecture that draws on the principles of service-oriented architecture (SOA) and resource-oriented architecture.

This document incorporates contributions from Stanford University (Michael Dorf, Nigam Shah, Natasha Noy, Tim Redmond, Clement Jonquet, and Nipun Bhatia), Mayo Clinic (Harold Solbrig, Jim Buntrock, and Tom Johnson), and University of Victoria (Chris Callendar, Sean Falconer, and Lars Grammel).
3 NCBO Software Principles

All technology decisions, strategic direction, and architecture choices are guided by the NCBO Software Development Principles [1], established by the Chief Software Architect (CSA) in collaboration with all NCBO developers. The guidelines listed below can also be found on the NCBO GForge site [2].

- Each NCBO software release should deliver the simplest capability with the biggest positive end-user impact.
- Where possible, NCBO developers will re-use existing NCBO technologies.
- Software technology selection must consider software industry best practice, support, and strong developer communities.
- Software developers are responsible for long-term maintainability of their software.
- All NCBO software and related materials should be managed in a central repository.
- Software developers should directly or indirectly (e.g., through an end-user representative) connect with end-users to ensure software capabilities deliver value.
- Software developers are responsible for documenting their code as well as developing appropriate supporting artifacts (e.g., user guides).
- There should be a requirements contract (i.e., rules of engagement) to ensure that software developers have the artifacts they need to deliver features driven by end-users.

The first principle is the core concept of iterative software development. To avoid the classic analysis-paralysis risk, common in many failed software projects, the NCBO will deliver the simplest software that has known immediate user value in each release.

These principles are guides to assist the NCBO with effective decision-making. However, there have been and will be circumstances where principles are bent or broken. For example, regarding the last principle of a requirements contract, subject-matter experts often have no time to capture requirements effectively (resulting in, e.g., a lack of detail or requirements that change late in a project plan).

An underlying principle that drives all NCBO software is the use of service-oriented and resource-oriented architectures that build on web standards and approaches. In this way, the NCBO solutions are easily integrated and consumed by end-user communities. These principles are elaborated throughout this document.

[1] Dai, B. NCBO BioPortal Principles Document. January 16, 2008. https://bmirgforge.stanford.edu/gf/download/docmanfileversion/253/364/ncbosoftwaredevelopmentprinciples20080404.doc.
[2] GForge Site. November 27, 2008. https://bmir-gforge.stanford.edu/gf/project/ncbo/.

4 NCBO Products & Technology Landscape

The NCBO software deliverables include a suite of products and technologies. Figure 1 gives a high-level picture of these products, their associated technologies, how they relate to one another, and how NCBO end-users interface with the NCBO deliverables. The blue circles represent internal NCBO efforts. The green textured circles represent sample external partners (only a subset, to serve the purpose of this diagram). The dotted relationships with the NCBO PURL Server and the NCBO Fabric represent future-state plans that will be rolled out in 2009.

Figure 1. Overview of the NCBO products, technologies, and their desired relationship. The diagram presents a high-level landscape of NCBO products, associated technologies, how they ultimately will relate to one another, and how NCBO end-users interface with them. The blue circles represent internal NCBO efforts. The green textured circles represent the contributions of sample external partners (only a subset, to serve the purpose of this diagram). Note that the relationships with the NCBO PURL Server and the NCBO Fabric represent plans that will be rolled out in 2009.

The majority of integration points between NCBO products will be through REST web services. The NCBO has selected the Java/J2EE software infrastructure as the development platform of choice for all products. Below is a description of each NCBO product:

- BioPortal: A J2EE web application for accessing, visualizing, analyzing, uploading, and searching a large repository of biomedical ontologies, terminologies, and annotations.
- Ontology-Based Services (OBS): A set of inter-related Java-based REST/SOAP services that use ontologies in BioPortal to perform concept recognition, ontology traversal, and concept expansion, and to determine concept similarity.
- Resources (short for Open Biomedical Resources [OBR]): A set of annotations generated automatically. These annotations are presented through integration with the BioPortal UI, which enables researchers to search for biomedical resources associated (or annotated) with specific ontology terms.
- Annotator (formerly OBA): A Java-based REST/SOAP web service that, given some free text (e.g., a journal abstract of 200 to 300 words), returns a set of recognized concepts and related concepts, using the Ontology-Based Services for concept recognition and ontology traversal [3][4].

The following are the five key technologies that the NCBO leverages (for the sake of brevity, not all technologies are listed):

- Protégé: A server-side application, developed at Stanford, which provides the ability to store, update, and query OWL ontologies from a relational database. It is also an ontology editor and a knowledge-base editor, which allows domain experts to build knowledge-based systems.
- LexGrid: A server-side application, developed at Mayo, which provides support for a distributed network of lexical resources such as terminologies and ontologies.
- FlexViz: An Adobe Flex plug-in, developed at University of Victoria, that serves as an advanced graph-based tool for browsing and searching ontologies. It lets the user navigate a single ontology, displaying the concepts as nodes in a graph and the relationships between concepts as arcs. The visualization can show either the immediate neighborhood of a particular concept or the hierarchy to root. Several graph layouts, such as a vertical tree layout and a grid layout, can be run to rearrange the nodes on the screen.
- NCBO PURL Server: A PURL (Persistent URL) server dedicated to the biomedical communities to enable identification and sharing of persistent URLs for biomedical resources (e.g., ontologies, concepts, and mappings).
- NCBO Fabric: Based on a NetKernel resource-oriented architecture engine that enables rapid development of workflows, composition of REST services, and REST service transformations to serve the needs of the NCBO end-user communities. Four prototype examples are presented in Section 9 of this document: NCBO UMLS Service, Protégé 4 Integration with BioPortal, Berkeley OBOtoOWL Service, and BioPortal FOAF RDF User Service. The NCBO plans to more fully integrate the NCBO Fabric in 2009.

[3] Jonquet C and Shah NH. Ontrez Project Report. SMI Technical report - 1289. Stanford, CA, 2008.
[4] Jonquet C, Musen MA and Shah NH. A System for Ontology-Based Annotation of Biomedical Data. Accepted at Data Integration in Life Sciences 2008 conference, Evry, France.

5 Current State: High Level Architecture

Figure 3 presents a high-level architecture view of how the NCBO products are delivered to NCBO end users. In addition to accessing NCBO products through the BioPortal (using a Web browser), end users can access the BioPortal, Resources, Annotator, and OBS content through API calls (i.e., REST web services). These API calls can be consumed by external applications. For example, the BioLit project used NCBO services to develop a Microsoft Word plugin for ontology-based tagging while authors are writing their manuscripts; think of smart tags that are driven by an ontology.

The layered architecture of the BioPortal is described in the BioPortal Web Application section. The principle of innovation through consumption of the NCBO REST services is discussed in the Service Oriented Architecture section.

Figure 3. Overview of the Current NCBO Architecture and Design. A high-level architecture view of how the NCBO products will be delivered to NCBO end-users. In addition to accessing NCBO products through the BioPortal (using a Web browser), end users can access the BioPortal, Resources, Annotator, and OBS content through API calls (i.e., REST web services). These API calls can be consumed by external applications (e.g., the BioLit project's Microsoft Word plugin for ontology-based tagging).

Note that there is implicit integration among NCBO products that is not shown in Figure 3. For example, BioPortal integrates with LexGrid to store and serve RRF and OBO ontologies.

From an API perspective, the NCBO provides three categories of REST web services:

- Ontology services include both BioPortal and UMLS Services, which enable access to ontologies, concepts, and related meta-data.
- Annotation services enable annotation of biomedical free text, identifying biomedical concepts by using the ontologies and concepts stored in our repositories.
- Data services enable querying of the annotations that associate biomedical resources (e.g., clinical trials, medical reports, etc.) with biomedical concepts through the use of ontologies and concepts.

Figure 4 presents an overview of these services and the corresponding NCBO users of these services.

Figure 4. Overview of NCBO Services. Ontology services include both BioPortal and UMLS Services, which enable access to ontologies, concepts, and related meta-data. Annotation services enable annotation of biomedical free text, identifying biomedical concepts by using the ontologies and concepts stored in our repositories. Data services enable querying of the annotations that associate biomedical resources (e.g., clinical trials, medical reports, etc.) with biomedical concepts through the use of ontologies and concepts.

6 Current State: BioPortal

Starting in late 2007, the NCBO began a significant BioPortal refactoring to improve software flexibility and maintainability, to increase integration across NCBO products (e.g., Resources [OBR] consumption of BioPortal web services), and to enable more collaboration with external partners. The core of the architecture delivered by the BioPortal will be re-used for all other web-based NCBO applications as they move from Research to Production. For example, Resources is in the process of migrating to the same layered BioPortal solution, which builds on both Spring and Hibernate (leading open source Java frameworks).

6.1 BioPortal Web Application

The NCBO BioPortal architecture uses a classic layered approach, which decouples the logic and domain object models of each layer from those of the others. This approach also decouples the versioning of, and changes in, one layer from another. Thus, as software modules evolve (which they always do), the impact on the rest of the software sub-systems is significantly reduced. Furthermore, the BioPortal enforces decoupling through web services (e.g., REST) and the enterprise pattern of Dependency Injection. Note that Protégé 3.x has been incorporated into the business logic layer (in addition to the pre-existing LexGrid) to enable full support of OWL ontologies and the capability to store ontology-based meta-data, mappings, and marginal notes.

The following diagram presents a simple overview of the layers in the BioPortal architecture, followed by a description of each.

Figure 5. NCBO BioPortal Layered Architecture. The BioPortal uses a classic layered approach, which decouples the logic and domain object models from one layer to another. Furthermore, the BioPortal enforces decoupling through REST web services (possibly SOAP if necessary) and the enterprise pattern of Dependency Injection. Note that Protégé 3.x has been incorporated into the business logic layer (in addition to the pre-existing LexGrid) to enable full support of OWL ontologies and the capability to store ontology-based meta-data, mappings, and marginal notes.

The Presentation Tier delivers the BioPortal user interface, which currently uses Ruby on Rails. Ruby on Rails is a leading, mature UI framework supported by large software development communities. It enables rapid prototyping as well as solid integration with web services.

The Interface Tier consists of REST web services that present all BioPortal capabilities to the upper tiers (e.g., upload ontology, download ontology, display concept, administrative functions, etc.). The Presentation Tier is primarily driven by these REST services (currently implemented with the Restlet libraries). Thus, the BioPortal Presentation Tier is just one ordinary consumer of the BioPortal REST services. In keeping with the principle of Service-Oriented Architecture, any number of partner developers could consume the set of existing REST services for any number of purposes. For example, one could implement a completely different UI from the one currently exposed by the BioPortal. Or one could consume a single REST service to integrate it into a back-end workflow with no UI at all. Please see the NCBO wiki page named NCBO REST Services to learn about consuming BioPortal REST services.

The Business Logic Tier uses the Spring framework, which enables a partner to insert any software implementation that conforms to the NCBO-defined interfaces. This is achieved through the Dependency Injection enterprise pattern, which is core to the Spring framework. For example, if NCI requires a different flavor of LexGrid, NCI could implement a module against the NCBO interfaces and deploy it without having to modify the core BioPortal software. This architectural approach enables new capability to be incorporated into the BioPortal through packaging and configuration, and it significantly reduces the need for a major BioPortal software release to support a customized implementation. Please see the NCBO wiki page named NCBO-OOR Server-Side Customization for examples of how to customize the back-end modules.

The Persistence Tier uses Hibernate as a basic object-relational mapping to the back-end relational database. Hibernate is used for storing administrative data (e.g., user information) and external ontology data (e.g., ontology attributes specified at upload time). All ontology content is stored in Protégé and LexGrid, as shown in the Business Logic layer.

6.1.1 Service Oriented Architecture

The BioPortal architecture approach expands our focus from BioPortal user-interface features (i.e., the presentation layer) to re-usable BioPortal services. Service-Oriented Architecture (SOA) also enables decoupling of software subsystems. Decoupling is achieved through the use of REST web services, which implement all BioPortal capabilities (e.g., get ontology concept, upload ontology, and search ontology). In this design, the BioPortal user interface (i.e., the presentation tier) is the first consumer of the set of services. The REST web services that the Interface Tier exposes to the Presentation Tier (shown in Figure 4) provide a natural decoupling, which enables any internal or external application (such as those developed by collaborating grant projects, e.g., the cananolab project led by Nathan Baker and David Paik) to consume those web services without concern for the internal implementation of the services. Note that the NCBO uses a type of SOA named Resource-Oriented Architecture, in which every NCBO asset is a resource with associated representations.

Figure 6 presents an example of how a BioPortal REST web service is requested and consumed by the BioPortal user interface to display a list of ontologies. The request begins with a standard HTTP URL request to the List Ontology service. XML is returned from the HTTP request and presented through the BioPortal user interface.
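As a rough illustration of the consuming side of this request/response cycle, the sketch below parses an XML listing of ontologies the way a presentation-tier client might. The payload, its element names (`ontologies`, `ontology`, `id`, `displayLabel`), and the function `parse_ontology_list` are assumptions made for this example; they are not the actual BioPortal response schema, and the HTTP request itself is left out so that the sketch runs offline.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: the real BioPortal XML schema is not given in this
# document, so these element names are invented for illustration.
SAMPLE_RESPONSE = """
<ontologies>
  <ontology><id>1001</id><displayLabel>Example Anatomy Ontology</displayLabel></ontology>
  <ontology><id>1002</id><displayLabel>Example Disease Ontology</displayLabel></ontology>
</ontologies>
"""


def parse_ontology_list(xml_text):
    """Return (id, label) pairs from an XML ontology listing."""
    root = ET.fromstring(xml_text.strip())
    return [
        (node.findtext("id"), node.findtext("displayLabel"))
        for node in root.findall("ontology")
    ]


if __name__ == "__main__":
    # A user interface would render these rows; here they are just printed.
    for ontology_id, label in parse_ontology_list(SAMPLE_RESPONSE):
        print(ontology_id, label)
```

Because the client depends only on the returned representation, the same parsing step would serve a web UI, a widget, or a back-end workflow with no UI at all.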

Figure 6. Example of BioPortal User-Interface Consuming a REST Web Service. The diagram presents an example of how a BioPortal REST web service is requested and consumed by the BioPortal user interface to display a list of ontologies. The request begins with a standard HTTP URL request to the List Ontology service. XML is returned from the HTTP request and presented through the BioPortal user interface.

6.1.2 Consolidating the BioPortal back-end

While it is architecturally preferable to have a single ontology/terminology back-end repository to reduce complexity, no single repository meets all NCBO requirements. The NCBO must support the OWL language and the RDF/XML, OBO, RDF, and RRF formats. As shown in the diagram, Protégé 3.x has been incorporated into the business logic layer to enable full support of OWL ontologies and the capability to store ontology-based meta-data, mappings, and marginal notes. LexGrid continues to support the diverse terminology formats required by the NCBO (the RRF format used for UMLS is particularly critical). As shown in Figure 5, LexGrid and Protégé jointly provide sufficient support for the diverse standards and formats. All OWL ontology requests (e.g., get concept, find concept, and upload) will be routed to Protégé. All other ontology/terminology requests, such as those for RRF and OBO, will be routed to LexGrid. Long-term, the NCBO plans to re-architect the BioPortal to use both a normalized store and a native OWL/RDF triple store, in a collaboration between Stanford University and Mayo Clinic.

With this in mind, adding, removing, and changing back-ends to the BioPortal is customizable and adjustable by design. The server-side architecture leverages the principles of aspect-oriented programming. Thus, the majority of server-side components in the BioPortal can be modified more easily. For example, when new ontologies are loaded into BioPortal (either through the UI or through back-end scheduled jobs), they are loaded into one of two possible back-end repositories: Protégé or LexGrid. All OWL/Protégé ontologies are loaded into Protégé. All OBO/RRF ontologies (OBO and RRF are biomedical-specific formats) are loaded into LexGrid. Both can be reconfigured to point to different back-ends without rebuilding the entire application. Figure 7 illustrates the triaging that occurs in the BioPortal.
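The format-based routing described above, combined with injected back-ends, can be sketched as follows. This is a language-neutral illustration of the pattern written in Python for brevity, not the BioPortal's Java/Spring code; the class names (`OntologyBackend`, `ProtegeBackend`, `LexGridBackend`, `OntologyLoader`) and the `default_loader` helper are hypothetical.

```python
from abc import ABC, abstractmethod


class OntologyBackend(ABC):
    """Hypothetical interface that every back-end repository implements."""

    @abstractmethod
    def load(self, name):
        """Store the named ontology and return a short status string."""


class ProtegeBackend(OntologyBackend):
    def load(self, name):
        return f"{name} -> Protege"


class LexGridBackend(OntologyBackend):
    def load(self, name):
        return f"{name} -> LexGrid"


class OntologyLoader:
    """Routes each ontology to the back-end registered for its format."""

    def __init__(self, backends_by_format):
        # Back-ends are injected, so swapping one is a configuration change
        # rather than a change to this class.
        self._backends = dict(backends_by_format)

    def load(self, name, fmt):
        try:
            backend = self._backends[fmt.upper()]
        except KeyError:
            raise ValueError(f"no back-end configured for format {fmt!r}") from None
        return backend.load(name)


def default_loader():
    """Wire OWL to Protege and OBO/RRF to LexGrid, as the text describes."""
    protege, lexgrid = ProtegeBackend(), LexGridBackend()
    return OntologyLoader({"OWL": protege, "OBO": lexgrid, "RRF": lexgrid})


if __name__ == "__main__":
    loader = default_loader()
    print(loader.load("example-ontology", "owl"))
    print(loader.load("example-terminology", "rrf"))
```

A partner needing a different flavor of a back-end would supply another `OntologyBackend` implementation in the mapping passed to `OntologyLoader`, leaving the loader itself untouched.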

Figure 7. NCBO BioPortal Server-Side Customization. The majority of server-side components in the BioPortal can be modified more easily. In this drawing, when new ontologies are loaded into BioPortal (either through the UI or through back-end scheduled jobs), they are loaded into one of two possible back-end repositories: Protégé or LexGrid. Both can be reconfigured to point to different back-ends without rebuilding the entire application. For further documentation on customizing BioPortal, please see the NCBO wiki document named NCBO-OOR Server-Side Customization.

6.2 BioPortal Widgets

A by-product of the resource-oriented architecture used by the BioPortal is the ability to present NCBO ontologies, annotations, and related meta-data through a multitude of other user-interface channels. The idea of BioPortal Widgets came out of the NCBO Project Meeting in December 2008: NCBO developers would create simple UI components that people could embed in their web sites. For example, if an NCBO partner wanted to embed a BioPortal search box in their web site, doing so could be as easy as copying and pasting snippets of HTML and JavaScript that we have created.

An initial set of simple BioPortal widgets has been developed to demonstrate the value of this approach. They include a Simple Search Widget, an Ontology Feed Widget, and a FlexViz Widget. After the NCBO has had some initial success with simple BioPortal widgets, it will begin considering more sophisticated user customization capabilities. For example, the NCBO will investigate User-Interface Mashup possibilities, as elaborated in the Future State Architecture section on Visualization. This effort will be led by the University of Victoria.

7 Current State: Annotations and Ontology-Based Services

The creation of ontology-based annotations is still not as widespread as desired in the biomedical communities. Aside from the annotation of gene products for their molecular function, biological process, and cellular component, most experimental datasets, such as microarray data, tissue arrays, clinical trials, radiology images, and cellular microscopy images, are still annotated in free text. The primary reasons for this are: 1) the lack of a one-stop shop for bio-ontologies (addressed by creating OBO); 2) the lack of user-friendly tools for annotating experimental data with ontology terms at submission time; and 3) the lack of a scalable mechanism for creating and storing ontology annotations of the massive amounts of experimental data.

The NCBO addresses these gaps by providing a set of Ontology-Based Services, the Annotator (which automatically processes a piece of raw text, annotates it with relevant ontology concepts, and returns the annotations), and Resources (which annotates the content of biomedical resources to identify the biomedical concepts to which they relate and provides the annotations). The combination of these services can be used for concept recognition from text, for annotation, for accelerating curation, for programmatic access to the latest versions of ontologies, and for data summarization.

The Ontology-Based Services (OBS), shown in Figure 8, are a group of four basic services:

1. The Concept Recognizer Service takes as input a term and returns the ontology concept (or concepts) corresponding to the term. The concepts returned are from the ontologies in the BioPortal ontology repository [5].
2. The Ontology Traversal Service takes as input a concept and a relationship (or set of relationships) to traverse and returns the set of concepts related to the input concept through the input relationship [6]. For instance, this service can provide all parents of a concept or all of its children.
3. The Concept Similarity Service takes as input two concepts and returns a similarity metric between them. Similarity can be based on the results of the ontology traversal, on the content of annotations, or on other resources [7]. This capability is a high priority for the clinical trials workflows of the UCSF Driving Biological Project (DBP), which will be a first consumer of this service.
4. The Concept Expansion Service takes an ontology concept and returns related concepts based on an ontology traversal, on concept similarity, or on both.

[5] Bhatia N, Shah NH, Rubin DL, Chiang AP and Musen MA. Comparing Concept Recognizers for Ontology-Based Indexing: MGREP vs. MetaMap. Under review for AMIA annual symposium 2008.
[6] Shah NH and Musen MA. UMLS-Query: A Perl Module for Querying the UMLS. Under review for AMIA annual symposium 2008.
[7] Lee WN, Shah NH, Sundlass K and Musen MA. Comparison of Ontology-based Semantic Similarity Measures. Under review for AMIA annual symposium 2008.
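To make the traversal and similarity services more concrete, here is a minimal sketch over a toy is-a hierarchy. The hierarchy, the function names, and in particular the similarity measure (Jaccard overlap of ancestor sets) are assumptions chosen for illustration; the document does not say which metric the NCBO service uses.

```python
# Toy is-a hierarchy (child -> parents); the concept names are invented.
PARENTS = {
    "melanoma": ["skin cancer"],
    "skin cancer": ["cancer"],
    "leukemia": ["cancer"],
    "cancer": ["disease"],
    "disease": [],
}


def parents(concept):
    """Ontology traversal, one step up the is-a relationship."""
    return list(PARENTS.get(concept, []))


def ancestors(concept):
    """All concepts reachable by following is-a links upward."""
    seen, stack = set(), parents(concept)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents(current))
    return seen


def similarity(a, b):
    """Jaccard overlap of the two ancestor sets, each including the concept."""
    set_a = ancestors(a) | {a}
    set_b = ancestors(b) | {b}
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    print(sorted(ancestors("melanoma")))
    print(similarity("melanoma", "leukemia"))
```

In this toy data, "melanoma" and "leukemia" share the ancestors "cancer" and "disease" out of five distinct concepts in total, so the score is 2/5.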

Figure 8. Ontology-Based Services (OBS). The four basic ontology-based services: (1) a Concept-Recognizer service takes a term or a set of terms and returns ontology concepts; (2) an Ontology-Traversal service takes a concept and a relationship to traverse and returns a set of concepts; (3) a Concept-Similarity service takes two concepts and returns a similarity metric between them; (4) a Concept-Expansion service takes a concept and returns a set of related concepts. The dotted lines in the diagram indicate the use of another service or resource. For instance, the Concept-Similarity service uses the Ontology-Traversal service and the Ontology Repository in BioPortal to determine the similarity metric between two concepts.

The Annotator Service is a Java-based REST web service that takes as input a set of terms (or a text string, such as a journal abstract) and returns a set of concepts based on these terms. It uses the Concept-Recognizer service from OBS to find relevant ontology concepts and the Ontology-Traversal service to find the parents of these concepts. The NCBO invokes these REST web services to create automated annotations of resources. In other words, the Annotator is used in the workflow for annotating the biomedical resources (e.g., GEO, PubMed, ClinicalTrials.gov, etc.) that populate the Resources Index. The Annotator workflow is illustrated in Figure 9: free text is the input to the workflow, and a list of annotations is the output.
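The two stages of the Annotator workflow, direct annotation by concept recognition followed by expansion to parent concepts, can be sketched as below. The dictionary, the concept identifiers, and the `annotate` function are hypothetical, and the whole-word dictionary lookup merely stands in for the real Concept-Recognizer service.

```python
import re

# Hypothetical dictionary (term -> concept id) and is-a links; all invented.
DICTIONARY = {"melanoma": "C:melanoma", "skin cancer": "C:skin_cancer"}
PARENTS = {"C:melanoma": ["C:skin_cancer"], "C:skin_cancer": ["C:cancer"]}


def annotate(text):
    """Return annotations as (concept, kind) tuples, direct ones first."""
    lowered = text.lower()
    # Stage 1: direct annotations from whole-word dictionary matches.
    direct = [
        concept
        for term, concept in DICTIONARY.items()
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    ]
    annotations = [(concept, "direct") for concept in direct]
    # Stage 2: expand each direct annotation with its parent concepts.
    seen = set(direct)
    for concept in direct:
        for parent in PARENTS.get(concept, []):
            if parent not in seen:
                seen.add(parent)
                annotations.append((parent, "expanded"))
    return annotations


if __name__ == "__main__":
    for concept, kind in annotate("Melanoma in mouse models"):
        print(concept, kind)
```

Tagging each annotation as "direct" or "expanded" keeps the provenance visible, which a consumer could use to rank direct matches above inferred ones.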

Figure 9. Annotator Workflow. First, direct annotations are created from raw text according to a dictionary that uses terms from a set of ontologies. Second, different components expand the first set of annotations using ontology semantics.

Several groups have expressed interest in using the Annotator for purposes ranging from triaging new publications (based on the ontology terms recognized in the abstract) to identifying gene expression experiments on a condition of interest for further analysis. For example, Shai Shen-Orr, a postdoc with Atul Butte, has used our Annotator to analyze the textual metadata of individual experiments (GSMs) in GEO and has identified those that study a particular immunological disease in mouse models (Simon Twigger's DBP has a similar goal). Amit Sheth's group, a collaborating R01, is examining the use of the Annotator for semantically annotating web services.

Resources (OBR) uses the resources that the Annotator has annotated with ontology concepts to enable users to find resources relevant to their search terms. The user provides a set of search terms, and the Resources workflow uses the Concept-Recognizer Service to produce a set of ontology concepts relevant to the query. It then uses the Concept-Expansion Service to find an expanded set of concepts and queries the annotations with this set to find the relevant resources. The process of creating the Resources index and presenting it through the BioPortal is shown in Figure 10. The Resources workflow uses the Annotator to index biomedical resources with ontology concepts. The index can then be used to enhance search of the data and can be presented through the BioPortal user interface.
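The Resources query workflow described above (recognize concepts, expand them, then query the annotation index) can be sketched in a few lines. All data and names here (`TERM_TO_CONCEPT`, `RELATED`, `INDEX`, and the three functions) are invented for illustration and do not reflect real NCBO data or APIs.

```python
# Hypothetical data: query terms -> concepts, concept expansion, and an
# annotation index (concept -> resource ids). None of it is real NCBO data.
TERM_TO_CONCEPT = {"melanoma": "C:melanoma"}
RELATED = {"C:melanoma": ["C:skin_cancer"]}
INDEX = {
    "C:melanoma": {"GEO:sample-1"},
    "C:skin_cancer": {"PubMed:article-7", "GEO:sample-1"},
}


def recognize(terms):
    """Stand-in for the Concept-Recognizer: map known terms to concepts."""
    return {
        TERM_TO_CONCEPT[term.lower()]
        for term in terms
        if term.lower() in TERM_TO_CONCEPT
    }


def expand(concepts):
    """Stand-in for Concept-Expansion: add directly related concepts."""
    expanded = set(concepts)
    for concept in concepts:
        expanded.update(RELATED.get(concept, []))
    return expanded


def search(terms):
    """Recognize concepts, expand them, then look them up in the index."""
    hits = set()
    for concept in expand(recognize(terms)):
        hits |= INDEX.get(concept, set())
    return sorted(hits)


if __name__ == "__main__":
    print(search(["Melanoma"]))
```

The expansion step is what lets a query for "melanoma" also surface a resource annotated only with the broader concept.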

Figure 10. Resources Index and BioPortal User-Interface. The NCBO has used the Annotator to index biomedical data with ontology concepts. The index can then be used to enhance search of the data and can be presented through the BioPortal user interface.

The value of the Resources workflow is that it provides a one-stop shop for the biomedical community to search across high-value biomedical resources (e.g., PubMed) using direct and inferred ontology concepts stored in the NCBO ontology repository. Without such a service, end-users would have to visit multiple sites and use many different search mechanisms to find relevant biomedical content. With the Resources solution, an end-user simply uses the Resources web services or goes to the Resources user interface (either through the BioPortal or through the stand-alone Resources application) to query across numerous biomedical resources. The NCBO started with an initial set of biomedical resources (ClinicalTrials.gov, GEO, ArrayExpress, ARRS GoldMiner, and NextBIO) and has ramped up efforts to add several others (CDD, OMIM, PharmGKB, Reactome, ResearchCrossroads, UniProt, PubMed, and cananolab). Going forward, the NCBO will continue to add resources so that the biomedical community can perform even more powerful ontology-enabled searches across many high-value biomedical resources.

Finally, Figure 11 presents a high-level diagram that details how the Annotator Service contributes to the Resources Index and uses the Ontology-Based Services (OBS) described earlier. The Annotator service takes a set of terms, uses the Concept-Recognizer service to find relevant ontology concepts, and uses the Ontology-Traversal service to find the parents of these concepts.

Figure 11. Annotator Service and the Resources index and service. The Annotator Service and Resources use the Ontology-Based Services (OBS) described earlier. The Annotator service takes a set of terms, uses the Concept-Recognizer service to find relevant ontology concepts, and uses the Ontology-Traversal service to find the parents of these concepts. The NCBO uses these services to annotate resources in Resources and to create an index of these resources. The Resources service takes a user query in the form of a set of terms and returns the relevant resources from Resources. To find the relevant resources, it uses the Concept-Recognizer to go from the terms supplied by the user to ontology concepts and then uses the Concept-Expansion Service to expand this set with related concepts. It then uses the expanded set of concepts to query the Resources index to find relevant resources.

The NCBO currently implements these workflows through a simple Java module that explicitly invokes each step of the workflow. In the future, the NCBO will consider a much more flexible approach by leveraging the NCBO Fabric (described later in this document). For business process modeling of workflows, the NCBO will consider adopting BPMN. BPMN has been established by the OMG [8] and is the closest thing to a business process modeling standard in the IT industry. There are currently 44 BPMN product implementations [8].

In the biomedical informatics domain, Taverna [9] is usually associated with workflows. We note that Taverna workflows are analogous to experimental protocols and allow users to integrate many different software tools, including web services such as those provided by the National Center for Biotechnology Information, the European Bioinformatics Institute, the DNA Databank of Japan (DDBJ), SoapLab, BioMOBY, and EMBOSS.
The NCBO services, such as the Annotator service, should be easy to consume as an ingredient in tools such as the Taverna workbench. However, workflow orchestration, such as chaining the concept recognizer, concept expander, and Resources query services to create a functional service (the Resources search service that runs on the BioPortal server), is a very different task and cannot be accomplished with Taverna.

[8] OMG BPMN. January 16, 2008. <http://www.bpmn.org/>.
[9] http://taverna.sourceforge.net/

8 Current State: NCBO PURL Server

The NCBO PURL Server represents the first official dedicated biomedical PURL (Persistent URL) server, serving solely the needs of the biomedical community. The NCBO has partnered with OCLC to use the same PURL Server software that OCLC officially deployed and announced in March 2009 (i.e., http://www.purl.org). The PURL Server enables biomedical communities to identify and share persistent URLs for biomedical resources (e.g., ontologies, concepts, and mappings). It is accessible through http://purl.bioontology.org. The NCBO BioPortal and other NCBO services will be among the first consumers of the NCBO PURL Server in 2009.

Large biomedical organizations need to establish consistent URLs to identify their biomedical entities. PURLs provide a safe, consistent mechanism for keeping URLs to biomedical entities stable when those entities change (e.g., are moved to a different server or even a different domain space). A PURL is simply a URL that points to another web resource. If a web resource changes location (and hence URL), a PURL pointing to it can be updated. Thus, end-users of a PURL can be confident that they can always use the same Web address, even though the resource in question may have moved.

The hardest aspect of effective PURL adoption in a community is not technical but social. Thus, the NCBO plans to be the first adopter of the NCBO PURL Server, starting by investigating appropriate models, practices, and policies for NCBO PURLs. Where appropriate, the NCBO will re-use identifier practices and models established over the last 10 years, including DOI (Digital Object Identifier) and LSID (Life Science Identifier). This research will focus more on learning from the established models and less on the technologies coupled to them.
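The redirect behavior described above, a stable address whose target can be updated, can be sketched as follows. This is an in-memory illustration only and not the OCLC PURL Server software; the class name, the example paths, and the use of a 302 status code are assumptions.

```python
class PurlRegistry:
    """In-memory stand-in for a PURL server's redirect table (illustrative only)."""

    def __init__(self):
        self._targets = {}

    def register(self, purl_path, target_url):
        """Create a persistent path that points at the resource's current URL."""
        self._targets[purl_path] = target_url

    def update(self, purl_path, new_target_url):
        """Repoint an existing PURL after the resource has moved."""
        if purl_path not in self._targets:
            raise KeyError(purl_path)
        self._targets[purl_path] = new_target_url

    def resolve(self, purl_path):
        """Return (status, location) the way a redirecting server might.

        The 302 status is an assumption made for this sketch.
        """
        target = self._targets.get(purl_path)
        if target is None:
            return 404, None
        return 302, target
```

The point of the sketch is that clients keep using the same path while only the registry entry changes when a resource moves.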
In parallel, the NCBO is leading a Persistent Identifier Initiative to help drive the use of PURLs and appropriate naming standards for biomedical resources. Ultimately, the NCBO will support and serve the PURL server infrastructure (much as OCLC has done for the world) and provide a technology channel for biomedical communities to establish and execute these naming standards. On a parallel path, the NCBO is beginning to investigate distributed modes of NCBO work, starting with basic database replication of a PURL server back-end.

9 NCBO Fabric

The NCBO Fabric enables simple consumption of and integration with NCBO services (particularly REST web services). It is based on an open source technology named NetKernel. The NCBO Fabric enables composition of services and resources, and deployment of workflows. The workflows are architecturally a type of orchestration. Orchestration describes a process flow in which services are controlled and coordinated by a central coordinator. The web services (such as the NCBO Annotator service) do not need to know that they are involved in a composition or that they are part of a process flow. It is the responsibility of the central process to know the flow of service execution and the relationships between the services.

NetKernel provides a simple mechanism to orchestrate NCBO services. It delivers an approach called resource-oriented computing, which enables NCBO developers to create software at a logical level (represented by REST web services) and to process information with great flexibility, with minimal to no coupling to the physical level of software module APIs (which could be REST web services or language-specific APIs) and domain-specific object instances. An architectural benefit of this approach is that the web services participating in a workflow are simpler, are easier to manage, and can evolve without significant impact on the other participating web services.

As part of the evaluation of possible workflow/resource-based architecture solutions, the NCBO developed four functioning prototypes that validated that NetKernel could meet the needs of the NCBO. Each of the four prototypes represents a different strategy in which the NCBO Fabric will add value, flexibility, and productivity to the NCBO software development practice. Note that the section "Future State: NCBO Fabric - Web Service Harmonization" describes future work using the NCBO Fabric.

9.1 Protégé 4 Plugin for BioPortal

The Protégé 4 Plugin for BioPortal is an example of rapid integration with external products. Protégé 4 had an existing plugin (named TONES) that required a specific sequence and structure of REST web service calls to function. After a brief collaboration between the NCBO and the Protégé team, a functional Protégé 4 plugin was created by simply wrapping the BioPortal REST services with a REST web service API that the TONES plugin understood. The functional prototype was completed in less than a day without modifying one line of BioPortal code. Figure 12 presents an overview of how the plugin works.

Figure 12. Protégé 4 Plugin for BioPortal.
This represents an example of rapid integration with external products. Protégé 4 had an existing plugin (named TONES) that required a specific sequence and structure of REST web service calls to function.

9.2 NLM License Server Integration for UMLS

The NLM License Server Integration for UMLS vocabularies is an example of wrapping an external resource and making it better. In mid-2008, coordinators for the NLM UMLS Team were impressed with an NCBO UMLS prototype presentation by the Chief Software Architect and wanted to know how the NCBO could collaborate with NLM. In fall 2008, NLM delivered an early-stage implementation of the NLM License Server that provided the ability to validate any UMLS license key using SOAP web services. The NCBO developed a secure, cached license server wrapper around the NLM License Server that enabled UMLS license checks using REST web services. The UMLS License Services enable scalable and responsive validation of UMLS licenses. Figure 13 presents how the UMLS integration works.

Figure 13. NLM License Server Integration for UMLS. It represents an example of wrapping an external resource and making it better. In fall 2008, NLM delivered an early-stage implementation of the NLM License Server that provided the ability to validate any UMLS license key using SOAP web services. The NCBO developed a secure, cached license server wrapper around the NLM License Server that enabled UMLS license checks using REST web services. The UMLS License Services enable scalable and responsive validation of UMLS licenses.

9.3 OBO to OWL Converter Wrapper

The OBO to OWL Converter Wrapper is another NCBO Fabric example of wrapping an external resource and making it better. The NCBO developed a simple prototype wrapper around the OBO to OWL Converter, which was developed by Stanford and is hosted on a Berkeley server. One of the challenges was that the Berkeley server could not handle more than a few simultaneous requests before crashing, since the server was not geared for a production deployment (as is common for many research efforts). After the wrapper was developed, the NCBO Fabric made it possible to throttle calls to the Berkeley OBO to OWL Converter to one request at a time, queuing all subsequent calls. The NCBO is pleased that the prototype service has not experienced any downtime for the last year. Figure 14 illustrates the prototype integration.

Figure 14. OBO to OWL Converter Wrapper. It represents an example of wrapping an external resource and making it better. The NCBO developed a simple wrapper around the OBO to OWL Converter hosted on a Berkeley server. One of the challenges was that the Berkeley server could not handle more than a few simultaneous requests before crashing, since the server was not geared for a production deployment (as is common for many research efforts). After the wrapper was developed, the NCBO Fabric made it possible to throttle calls to the Berkeley OBO to OWL Converter to one request at a time, queuing all subsequent calls. The NCBO is pleased that the prototype wrapper service has not experienced any downtime for the last year.

9.4 BioPortal FOAF User

The BioPortal FOAF User is an example of wrapping an internal resource and changing the representation. This service wraps an existing BioPortal User service and translates the data into FOAF (Friend Of A Friend) format. FOAF is one of the most widely used Semantic Web vocabularies. This prototype demonstrates the ability to wrap an existing NCBO service for integration into other domains without changing a single line of BioPortal code. Figure 15 presents how this prototype works.

Figure 15. BioPortal FOAF User. It represents an example of wrapping an internal resource and changing the representation. This service wraps an existing BioPortal User service and translates the data into FOAF (Friend Of A Friend) format. FOAF is one of the most widely used Semantic Web vocabularies. This prototype demonstrates the ability to wrap an existing NCBO service for integration into other domains without changing a single line of BioPortal code.
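The translation step in the FOAF prototype can be illustrated with a small sketch that converts a user record into FOAF RDF/XML. The input field names (username, firstname, lastname, email) are hypothetical and do not reflect the actual BioPortal User service response; only the RDF and FOAF namespace URIs and the foaf:Person, foaf:nick, foaf:name, and foaf:mbox terms come from the published vocabularies.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF_NS = "http://xmlns.com/foaf/0.1/"


def user_to_foaf(user):
    """Translate a user record (hypothetical field names) into FOAF RDF/XML."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("foaf", FOAF_NS)

    root = ET.Element(f"{{{RDF_NS}}}RDF")
    person = ET.SubElement(root, f"{{{FOAF_NS}}}Person")

    # Map each source field onto the corresponding FOAF property.
    ET.SubElement(person, f"{{{FOAF_NS}}}nick").text = user["username"]
    full_name = f'{user["firstname"]} {user["lastname"]}'
    ET.SubElement(person, f"{{{FOAF_NS}}}name").text = full_name
    ET.SubElement(
        person,
        f"{{{FOAF_NS}}}mbox",
        {f"{{{RDF_NS}}}resource": "mailto:" + user["email"]},
    )
    return ET.tostring(root, encoding="unicode")
```

Because the translation lives entirely in the wrapper, the wrapped service itself does not need to change, which mirrors the approach the prototype takes.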

10 Future Technologies and Solutions

This section focuses on future NCBO technology and solutions we expect in both the short term (i.e., 2009) and the long term (i.e., for the next five-year NIH grant). Each section starts with the temporal context of its focus areas and then elaborates on these efforts.

10.1 Future State: Advancing Visualization

Over the past four years, the University of Victoria has contributed compelling user interface capabilities and strategies to the NCBO. These include the FlexViz tool for visualizing single ontologies, a new BioPortal Search interface for performing advanced queries across all ontologies, and a new Resource (OBR) Search interface that allows querying for annotations by concept and by resource element and supports annotating text directly. A significant amount of research in the clinical trials domain has also led to the development of two tools designed specifically for visualizing clinical trials data: CTSearch and CTeXplorer. This section presents how the University of Victoria intends to further improve interfaces for the NCBO technologies as we integrate the above-mentioned tools into BioPortal and develop a more complete Visualization Framework.

Figure 16. FlexViz showing the Cell Type ontology

Figure 17. The new Search Interface (with FlexViz integration)

10.1.1 History Rich Ontology Navigation

This project furthers our work using historical interaction and usage data to develop improved visual support for navigating and comprehending the structure of ontologies.

10.1.1.1 Determining Interest

The work described in this section will be completed over the next year. As users explore the ontologies on BioPortal, the Diamond degree of interest model being developed can associate a value with each term in the ontology. This value can then be used to support highlighting of important concepts or filtering of unimportant concepts. A naive approach is simply to equate the importance, or interest level, of a concept with the number of times users select it. Adding a distance metric to this calculation mitigates the tendency of the naive approach to "bump" up the interest value for terms nearest the root. In addition to a click-based approach, the number of NCBO annotations and mappings can contribute to the overall interest value.

10.1.1.2 Providing Recommendations

The work described in this section will be completed over the next three years (assuming renewal). The next step in this work is to combine the above approaches to provide visualizations based on ideas of collaborative filtering within BioPortal. The main idea behind collaborative filtering is the automation of the "word of mouth" process by which people recommend items to one another. When people need to choose between a variety of unfamiliar options, they will often rely on the recommendations of others. When there are many options, however, it becomes impractical to contact experts to obtain advice on each option. With collaborative filtering, recommendations are made collectively, making this problem more manageable. Our future work in this area will be to explore and develop "heat-map" style visualizations to draw users to more recommended, highly used, or preferred terms.

10.1.2 Ontology Mapping

In this project, we are adapting the Java-based CogZ project for exploring and navigating mapping correspondences on the Internet. This web-based version will be integrated into BioPortal to assist users with exploring an existing mapping between two ontologies. Currently, BioPortal only displays mappings as a list with links to the relevant concepts. The project will also investigate support for developing mappings collaboratively. The web-based CogZ will be developed using Adobe Flex, which provides a powerful platform for developing web-based applications. We have had previous success with developing graph-based visualizations for BioPortal using this technology. Communication with BioPortal occurs through web services, which provide access to the ontologies, concepts, and mappings stored in the repository.

We plan to begin development on this project in the summer of 2009 and have a prototype ready for deployment to BioPortal in the fall.

Figure 18. Screenshot of the current Mapping prototype web interface

10.1.3 User Interface Mashups

The UI mashups described in this section will likely occur over the next year. UI mashups will provide a user interface that allows users of BioPortal to assemble NCBO widgets and their interactions using a web interface. Such a mashup assembly approach goes beyond service orchestration and beyond current portals, where users can only place widgets next to each other. The difference from those approaches is that visual components are remixed and the interactions between them can be customized.

UI mashups will enable users to contribute functionality by assembling different widgets into mashups. They will be able to remix the functionality of BioPortal as well as integrate third-party resources seamlessly. This user-generated content (in our case, functionality) can contribute to the functionality and usability of BioPortal by providing user interfaces and adapted functionality for custom use cases.

As part of our research on mashups, we have already compared different mashup development environments and paradigms. We also reviewed the existing literature on end-user development and mashups. Based on the insights we gained, we started exploring widget-based mashups in depth. So far, we have created several widgets for the IBM Mashup Center that support the creation of highly interactive information mashups. The next step is understanding which mashup development paradigm (e.g., dataflow mashup, widget-based mashup, textual vs. visual assembly) suits the NCBO domain best. To answer this question, we will implement several prototype mashups with NCBO services using different approaches. These prototypes, and the knowledge gathered during their creation, will then inform the decision about which mashup development paradigm to choose for the NCBO domain. Once that decision is made, we will start implementing the corresponding framework and development support.

10.1.4 BioPortal User Studies

BioPortal user studies were conducted in the fall of 2008 with eight participants recruited from partnering research groups at two North American universities. A demographic survey was given to each participant to gather information about their experience with ontologies and their role in their organization. Each participant was given approximately 10 tasks to evaluate the following aspects of BioPortal's user interface: searching for specific terms, browsing for ontologies, exploring ontologies via the tree view, exploring ontologies via the visualization, and using BioPortal to research a topic in the user's area of specialization. The participants' roles ranged from professors and informaticians to graduate and undergraduate students. Their experience with ontologies varied widely, some having worked with ontologies containing only 10 classes and others having worked with ontologies containing over 2,000,000 classes. A summary of the findings from these studies was sent to the NCBO, and the recommendations provided are now reflected in the updated BioPortal interface.

In February 2009, another user study was conducted with seven additional participants who were present at an OBI Consortium workshop. The participants again had diverse backgrounds and expertise. The data from this last study is currently being analyzed, and interesting and informative patterns of usage are already emerging. A second report summarizing the findings from this study will be sent to the NCBO.

10.2 Future State: NCBO Fabric - Web Service Harmonization

The Web Service Harmonization effort will be tackled in 2009. A major challenge for NCBO software development is transitioning initial research efforts into production solutions. Thus, while some NCBO REST web services have been solidified through multiple production releases (e.g., NCBO BioPortal Services), others have evolved from multiple prototypes (e.g., OBS). The NCBO plans to leverage the NCBO Fabric (i.e., NetKernel) to help harmonize our REST web services so that they follow similar conventions, URL signatures, and XML response structures. The goal of web service harmonization is to make life easier for the end-user developers who consume our services.

There are four types of web service harmonization activities the NCBO will focus on:

1. URL Domain - The NCBO has decided to use the domain of rest.bioontology.org for all REST web services.
2. REST Service Names - The NCBO has decided to name the three types of services as follows:
   a. ./bioportal - All queries for ontology and concept data use this service (including OBS).
   b. ./annotator - All Annotator requests use this service.
   c. ./resources - All Resources (i.e., OBR) requests use this service.
3. XML Response Content - The NCBO will establish a consistent XML response structure for REST web service calls that return the same kind of data. For example, the XML for a concept description should look similar regardless of the service that is called. This spares end-users from developing different XML parsers to handle the same essential content. A good litmus test is whether an XML Schema can be defined to validate the content returned.
4. REST URL Signature - The NCBO will establish a consistent URL signature for how REST web services are called. For example, end-users often want the ability to page through responses from REST web service invocations that return large lists of XML content. The NCBO will standardize on a simple pattern for all appropriate REST web services using the URL parameters offset and limit. offset indicates the first index in the list of XML content to return, where 0 is the first item. limit indicates the number of items to return starting from that index. For example, a request for 20 records in a list starting with the 5th record would have an offset of 4 and a limit of 20.

The NCBO Fabric allows gluing together services from different web applications without coupling their implementations. For example, OBS provides a rich set of ontology services that include requests for content that is not in BioPortal (and vice versa). Instead of trying to harmonize these two sets of services by coupling the OBS and BioPortal server-side code bases, the NCBO Fabric can more easily provide the unified signature for BioPortal and wrap the OBS service. The NCBO Fabric can do the appropriate translations to ensure that the XML content returned matches, that URL signatures are the same (e.g., limit and offset parameters), and that REST service naming is consistent. If a request is made for an OBS concept that does not exist in the BioPortal repository, an intelligible error message will be generated that explains the failed response. Ultimately, though an ontology concept may be served by OBS, the REST response should appear to come from the BioPortal domain (i.e., rest.bioontology.org), have a similar URL signature, and return similarly structured XML.

Long-term, it might make sense to merge the back-end software modules of OBS and BioPortal. In the meantime, the NCBO Fabric provides a clean way, in the short term and mid term, to tightly integrate the two web applications while ensuring they remain decoupled.
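The offset and limit convention described above can be made concrete with a short sketch. The domain rest.bioontology.org and the service name bioportal come from the text; the resource path used in the usage example, the default values, and both function names are assumptions for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "http://rest.bioontology.org"  # domain named in the text


def paged_url(service, path, offset=0, limit=50):
    """Build a paged request URL; the `path` values are hypothetical examples."""
    query = urlencode({"offset": offset, "limit": limit})
    return f"{BASE_URL}/{service}/{path}?{query}"


def page(items, offset=0, limit=50):
    """Return `limit` items starting at the zero-based index `offset`."""
    if offset < 0 or limit < 0:
        raise ValueError("offset and limit must be non-negative")
    return items[offset:offset + limit]
```

For the example in the text, `page(records, offset=4, limit=20)` returns the 5th through 24th records, and `paged_url("bioportal", "concepts", offset=4, limit=20)` builds the matching request URL for a hypothetical `concepts` path.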

A by-product of this effort will be the establishment of a production server environment for NetKernel. At this time, there are only development and staging environments for the NCBO Fabric.

10.3 Future State: IT Operations Management

IT Operations Management is an ongoing, iterative effort that began in 2009 and continues to evolve as NCBO technologies grow in complexity and the number of end-users of our services grows. The NCBO will need to ensure that the necessary hardware and software infrastructure is in place to support the growing number of users of NCBO services and capabilities as part of an overall infrastructure architecture. This effort will ensure that all pieces are in place to accelerate adoption of NCBO services in subsequent years. For example, as more biomedical data sources (e.g., PubMed) are added to the Resources (OBR) stores and indexes, database capacity will need to grow dramatically (e.g., possibly to the scale of terabytes). As the number of end-users who consume the NCBO REST web services grows, management, redundancy, and scaling will be required. As the number of mission-critical end-user workflows that depend on NCBO web services grows, the NCBO will need to establish appropriate service-level agreements (SLAs) to ensure quality of service (QoS).

10.4 Future State: NCBO Node

The NCBO Node work is proposed for the next NIH grant renewal (i.e., the next 5 years). This section addresses only a single centralized node in the NCBO architecture and was driven heavily by close collaboration between Stanford University and Mayo Clinic. As we prepare the NIH renewal proposal, the NCBO will work on developing and understanding the implementation of a federated architecture for NCBO nodes. The key components of a single NCBO node are as follows:

1. A backend that provides storage of and access to ontologies and terminologies. The backend will have two major components:
   (a) A Native component for storing and accessing OWL and RDF ontologies in their native format (OWL or RDF)
   (b) A Normalized component that stores terminological information about all terminologies that are not available in OWL and RDF format and that provides terminological access (possibly just-in-time) to the OWL and RDF ontologies.
2. A set of Web services and APIs that provide access to the information stored in the backend:
   (a) Standard RDF and OWL APIs for OWL and RDF ontologies
   (b) Common Terminology Services (terminological access) to both OWL/RDF ontologies and to other terminologies that are stored in the backend.

NCBO products and other applications will use the Web services to access the ontologies and provide user interface components (e.g., BioPortal and WebProtégé), language bindings for other programming and scripting languages (e.g., Perl, Python, etc.), and other applications and services, including applications developed outside of NCBO.

10.4.1 The Layered Architecture of an NCBO node

The NCBO Node architecture uses a classic layered architecture approach, which decouples the logic and domain object models of each layer from the others. Optimally, this approach decouples the versioning and changes in one layer from another. Thus, as software modules evolve, this decoupling significantly reduces the impact on the rest of the software sub-systems.

The architecture of an NCBO node has two layers: the Business Logic/Storage Tier and the Services Tier (Figure 19). The Business Logic/Storage Tier will provide a representation of the ontologies and terminologies being stored, along with ontological and terminological APIs for querying and accessing these ontologies. The Services Tier exposes the APIs of the Business Logic Tier through REST and web services.

We did not include the Presentation Tier in an NCBO node. While we expect that many implementations of an NCBO node will include a user-interface component in the Presentation Tier, such as BioPortal, there can be nodes that have no Presentation Tier at all. Similarly, there can be NCBO nodes that have different user interfaces that use the NCBO Services Tier to render the information that is relevant for a particular domain. The NCBO's own user interface components and external user interfaces and applications use the same set of services in the Services Tier.

Consumers of the NCBO Node services and APIs are presented above the NCBO Node box. They use the services to provide access to the NCBO BioPortal user-interface, to enable remote editing services, to open avenues for SPARQL queries, to access REST web services (and SOAP web services where appropriate), and to enable the use of helper language bindings such as Perl, Python, and Java. The NCBO BioPortal user-interface will continue to use REST web services as the primary channel of integration.
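The decoupling between the two tiers can be illustrated with a minimal sketch in which the Services Tier depends only on an interface of the Business Logic/Storage Tier. All class and method names here are hypothetical, and the handler returns a status code and a dictionary rather than implementing a real REST endpoint.

```python
from typing import Optional, Protocol


class OntologyStore(Protocol):
    """Business Logic/Storage Tier contract (hypothetical method name)."""

    def get_label(self, concept_id: str) -> Optional[str]: ...


class InMemoryStore:
    """One possible storage implementation; others could replace it freely."""

    def __init__(self, labels):
        self._labels = dict(labels)

    def get_label(self, concept_id):
        return self._labels.get(concept_id)


class ConceptService:
    """Services Tier: exposes the store through a REST-like handler."""

    def __init__(self, store: OntologyStore):
        self._store = store

    def handle_get(self, concept_id):
        """Return (status, body) for a concept lookup."""
        label = self._store.get_label(concept_id)
        if label is None:
            return 404, {"error": f"unknown concept: {concept_id}"}
        return 200, {"id": concept_id, "label": label}
```

Any presentation component, or none at all, can sit on top of `ConceptService`, and the storage implementation can change without affecting callers as long as the contract holds.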

Figure 19. The layered architecture of an NCBO node. The Business Logic/Storage Tier provides storage for ontologies and terminologies. It has the normalized storage for all ontologies and terminologies and the native storage for OWL, RDF, and Protégé frames ontologies. It also contains the knowledge base with the metadata that supports the infrastructure of the node. The Services Tier provides various services that enable access to the ontologies, terminologies, and metadata stored in the Storage Tier.

10.4.2 Business Logic Tier and Storage Tier

The bottom layer of the NCBO architecture is the Business Logic and Storage layer. This tier is responsible for loading, storing, and serving the contents of the ontologies and terminologies in the NCBO repository, as well as the metadata about this content. There will be two interrelated ontology storage mechanisms:

- A native ontology storage mechanism (Section 10.4.2.1) that holds the original sources for OWL and RDF ontologies (and possibly Protégé frames ontologies). It will provide full RDF and OWL ontology APIs that faithfully represent all the distinctions present in the original RDF or OWL ontology.
- A normalized ontology storage mechanism (Section 10.4.2.2) that will provide a filtered view of the native OWL and RDF ontologies and will additionally provide the primary storage of biomedical terminologies, such as ICD-9, UMLS, SNOMED, and GO.

The Business Logic and Storage Tier will also contain an RDF knowledge base with all the NCBO repository metadata (Section 10.4.2.4), including salient information about the ontologies and terminologies (provenance, documentation, etc.), notes and reviews provided by users, mappings and metadata on the mappings, and other metadata.

10.4.2.1 The Native ontology storage

The native ontology storage is the primary mechanism for all RDF and OWL ontologies in the NCBO repository. The main requirement for the native backend is that the backend API provides a faithful representation of the contents of ontologies in these formats. For RDF ontologies, we will use an off-the-shelf triple store to store and access the ontologies. We will use the Sesame Sail API to provide a uniform access mechanism to the data. For the OWL ontologies, we will use the OWL API as the backend. We expect that there will be several open-source implementations of API access to OWL 2 ontologies, and we will explore which API provides the best scalability for NCBO purposes. For the moment, one such API is already available: the OWL API developed at the University of Manchester. This API supports the emerging OWL 2 standard and is currently the most widely used open-source OWL API in the world. The Native backend will provide a SPARQL endpoint for all RDF and RDF Schema ontologies and data.

10.4.2.2 The Normalized ontology storage

The new LexGrid model will provide a common normalized model for all ontologies and terminologies in the NCBO backend. This common normalized model will be exposed through a set of Common Terminology Services (CTS). We will use the normalized storage component as the primary storage for the biomedical ontologies and terminologies that do not use OWL or RDF as their primary format (these include key terminologies such as ICD-9, SNOMED, UMLS, GO, and others).

We will develop a common terminological model to implement the normalized view. This model will be the new LexGrid model. From the different terminologies, we will extract the information that the common terminological services require. We will not develop mechanisms for storing the complete terminologies beyond what the common terminological model requires.

The NCBO will develop the new LexGrid model as an extension of the SKOS model for representing vocabularies. The new LexGrid model will be expressed in terms of RDF and the SKOS vocabulary. As a result, if an OWL or RDF ontology uses SKOS to define such elements as preferred names, synonyms, definitions, etc. (even if it is not aware of the NCBO's LexGrid model), we will be able to extract the terminological information directly from the ontology. The Mayo team is also working closely with the Semantic Web Deployment Working Group at W3C to extend the next version of the SKOS Recommendation to incorporate the distinctions that are critical to biomedical terminologies.

For the moment, the NCBO plans to use the normalized storage for ontologies in the OBO format. If we must represent OBO ontologies in the native format, then we will use one of the available mechanisms for converting OBO to OWL. In particular, the OWL API has its own OBO parsers and renderers.
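The direct extraction of terminological information from an ontology that already uses SKOS, as described above, can be sketched as follows. This is an illustration over plain (subject, predicate, object) tuples rather than a real triple store API, and it does not represent the LexGrid model; the function name and the keys of the returned dictionary are assumptions. The SKOS namespace and the prefLabel, altLabel, and definition properties come from the SKOS vocabulary.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"


def terminological_view(triples, subject):
    """Collect SKOS labels and definitions for one subject from (s, p, o) triples."""
    view = {"preferred_name": None, "synonyms": [], "definitions": []}
    for s, p, o in triples:
        if s != subject:
            continue
        if p == SKOS + "prefLabel":
            view["preferred_name"] = o
        elif p == SKOS + "altLabel":
            view["synonyms"].append(o)
        elif p == SKOS + "definition":
            view["definitions"].append(o)
    return view
```

Properties outside the three SKOS terms are ignored, which reflects the idea of a filtered, terminological view over a richer native ontology.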

10.4.2.3 Populating the normalized model and the integration of the native and normalized storage components

We will use several approaches to extract the information required for normalization from the ontologies and terminologies that do not use the new LexGrid model directly:

- The Mayo team will use their expertise and experience with developing terminology access to biomedical terminologies to extract this information on a case-by-case basis from the prominent biomedical terminologies that are available in their own proprietary formats (e.g., UMLS, SNOMED).
- For the OWL and RDF ontologies that do not use standards such as SKOS, we will provide a way for the authors to specify some of the information required by the normalized model as part of their metadata. For example, we can ask the ontology authors which property contains the preferred name for their classes, which property contains the synonyms, where the definitions are contained, and so on.
- If an open-source SKOS API becomes available (there are already initial implementations of such an API), we will extend this API to provide additional terminological services rather than build our own.

We will provide a tight integration between the normalized and the native storage components. Specifically, the native storage component will provide access to the properties of the OWL and RDF classes that contain the terminological information required by the normalized component. For example, the NCBO will explore the feasibility of computing the information for the normalized model for the OWL and RDF ontologies in a just-in-time fashion, rather than materializing it beforehand. If such just-in-time access provides acceptable computation guarantees, we will prefer it, as it will allow us not to store the same information in multiple places. If just-in-time access is not feasible, we will develop mechanisms to synchronize the normalized information with the original ontology (in particular, if editing is enabled).
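The just-in-time option described above can be sketched as a normalized view that reads from the native store on every request, using a property mapping supplied by the ontology authors. Nothing is materialized, so there is no second copy to synchronize. Both classes, their methods, and the mapping keys are hypothetical and serve only to illustrate the trade-off.

```python
class NativeStore:
    """Toy native store: class IRI -> {property IRI: [values]}."""

    def __init__(self):
        self._data = {}

    def add(self, cls, prop, value):
        self._data.setdefault(cls, {}).setdefault(prop, []).append(value)

    def values(self, cls, prop):
        return list(self._data.get(cls, {}).get(prop, []))


class JustInTimeView:
    """Normalized view computed on each request; nothing is materialized."""

    def __init__(self, store, property_map):
        self._store = store
        # Author-supplied mapping, e.g. {"preferred_name": <property IRI>}.
        self._map = property_map

    def preferred_name(self, cls):
        names = self._store.values(cls, self._map["preferred_name"])
        return names[0] if names else None

    def synonyms(self, cls):
        return self._store.values(cls, self._map["synonyms"])
```

Because each call goes back to the native store, an edit to the ontology is visible immediately through the view; the cost is that every request pays for the lookup, which is the computation guarantee the text says must be evaluated.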
10.4.2.4 BioPortal Meta-Data
BioPortal must track a significant amount of data about the ontologies and terminologies in the repository, the notes and reviews that users provide, the metrics that we compute as part of the ontology evaluation, and the information on users and their projects. Furthermore, the NCBO repository will contain mappings between elements in different ontologies and metadata on these mappings, providing the provenance of the mapping itself, information on the application context where the mapping is valid, and so on. We will use an RDF knowledge base to store the BioPortal metadata. As with other RDF resources, we will use an off-the-shelf triple store with a Sesame Sail API to store and access the data.
10.4.3 Services Tier
The Services Tier is responsible for exposing selected aspects of the business logic tier through REST web services. The web services layer will use the information from both the native and normalized storage components. In particular, for OWL and RDF ontologies, the services layer will use the normalized storage component and its API to access the printable names, textual definitions, and synonyms of an RDF resource or an OWL entity, and other information represented in the LexGrid model. In the diagram in Figure 1 we refer to the terminological services as CTS services. The web services layer also exposes web-editing capabilities and a general-purpose set of ontology and terminology APIs. For RDF and OWL ontologies, the web services will provide native access to the ontology content ("Services for native OWL and RDF access" in Figure 16). The caGrid services will implement the SOAP interfaces that caGrid applications can use to access the NCBO content. The metadata services will provide access to BioPortal metadata and mappings. Other services can include, for example, web-service access to a SPARQL endpoint in an RDF triple store in our native storage component. We envision that this collection of services will grow as application developers start using NCBO services and start expressing new requirements for service-based access to NCBO content.
10.4.4 Consumers of NCBO Node Services
The first consumers of the NCBO node services will be the user-facing applications implemented by NCBO itself or its close collaborators. These are mainly the NCBO BioPortal user interface and the WebProtégé user interface. The BioPortal user interface currently uses Ruby on Rails. However, the architecture enables any number of user-interface technologies. For example, the current implementation of the WebProtégé user interface is built using the Google Web Toolkit (GWT). As new web editing services are added and new workflows emerge (e.g., workflows between the BioPortal UI and the WebProtégé UI), the Services Tier will need to request, aggregate, and compose responses based on data from both the normalized and native storage components. Consumers of the NCBO Node Services consist of both end-users and application developers.
The following list describes a few of these consumers:
- One important channel by which end-users will interact with the NCBO Node is a web browser. The web browser will run JavaScript code that allows the user to interact efficiently with the NCBO Node server. This interaction will include end-users of the BioPortal UI and the WebProtégé UI.
- The CTS models exposed through the normalized back-end will be directly consumed and used by developers in the caGrid community.
- Developers of NCBO applications and partner client applications will be able to use the Perl, Python, and Ruby languages to query and access ontologies residing on the NCBO Node.
- Developers of NCBO applications and partner client applications will be able to use REST services to query and access ontologies residing on the NCBO Node.
- Developers and ontologists who want direct access to an OWL or RDF API will be able to go directly to the native back-end to query and access those ontologies.
- Developers who want SPARQL access to RDF data will be able to use our web service interface to issue SPARQL queries.
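As an illustration of the last consumer in this list, a developer could issue a SPARQL query over HTTP along the following lines. This is a sketch only: the endpoint URL is a placeholder and not an actual NCBO address, the `build_sparql_request` helper is hypothetical, and the request follows the general SPARQL Protocol convention of passing the query text in a `query` parameter of a GET request.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint; the real service address is not specified here.
ENDPOINT = "http://example.org/sparql"

# List up to ten resources together with their SKOS preferred labels.
QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?concept ?label WHERE {
  ?concept skos:prefLabel ?label .
} LIMIT 10
""".strip()


def build_sparql_request(endpoint, query):
    """Build (but do not send) a GET request carrying a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON serialization of SPARQL query results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
```

Against a live endpoint, the request object could be passed to `urllib.request.urlopen` and the response body parsed with the `json` module.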

10.5 Future State: RDF Triple Store Back-end
The RDF triple store back-end represents work the NCBO may propose for the next NIH grant renewal (i.e., the next 5 years). This section presents the findings of the NCBO's analysis of RDF triple stores as a possible back-end for the NCBO BioPortal. The effort included analyzing the load times, response times, and inferencing capabilities of Jena SDB backed by MySQL, Sesame native, Mulgara, and Virtuoso. The evaluations were performed using the UniProt data and ontologies from BioPortal. The analysis suggested a hybrid approach that combines a native triple store with the fine-grained access of an API. In particular, the findings indicate superior performance from native stores such as Sesame native, Mulgara, and Virtuoso. This is consistent with the current emphasis on the development of native stores, since their performance can be optimized for RDF. However, these native triple stores, especially Mulgara and Virtuoso, are constrained by the absence of an API, and BioPortal requires one. Native triple stores provide superior performance but are hampered by the inability of third-party APIs to load large datasets into them. Sesame's native store offers performance comparable to the other native stores and has its own built-in API; however, there are doubts about its scalability beyond 50 million triples.
Keeping these issues in mind, we suggest a hybrid approach that combines the best of native triple stores with the fine-grained access provided by triple stores with built-in APIs. In the hybrid approach, the ontologies are first loaded and preprocessed in an in-memory RDF model using a popular API such as Jena or Sesame. In the preprocessing step, the namespaces are extracted and inferencing is performed using the API. These inferenced ontologies are then written out to disk.
The inferenced ontologies are then read from disk and loaded into a native triple store such as Mulgara using the load scripts provided by the triple store, without the use of any API. Figure 20 illustrates this process.
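The preprocessing half of the hybrid approach can be sketched in a few lines. This is a simplified stand-in for what Jena or Sesame would do, under stated assumptions: triples are plain tuples of IRIs with no literals, "inferencing" is reduced to the transitive closure of `rdfs:subClassOf`, the output format is N-Triples, and all function names and `example.org` IRIs are hypothetical. The final bulk load into the native store is left to that store's own load scripts and is not shown.

```python
import os
import tempfile

RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"


def extract_namespaces(triples):
    """Return each IRI's prefix up to and including its last '#' or '/'."""
    namespaces = set()
    for triple in triples:
        for iri in triple:
            cut = max(iri.rfind("#"), iri.rfind("/"))
            if cut >= 0:
                namespaces.add(iri[: cut + 1])
    return namespaces


def infer_subclass_closure(triples):
    """Add the subClassOf triples implied by transitivity until a fixpoint."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        pairs = [(s, o) for s, p, o in inferred if p == RDFS_SUBCLASS_OF]
        for a, b in pairs:
            for c, d in pairs:
                if b == c and (a, RDFS_SUBCLASS_OF, d) not in inferred:
                    inferred.add((a, RDFS_SUBCLASS_OF, d))
                    changed = True
    return inferred


def write_ntriples(triples, path):
    """Write IRI-only triples to disk, one N-Triples statement per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for s, p, o in sorted(triples):
            handle.write(f"<{s}> <{p}> <{o}> .\n")


# Hypothetical two-level class hierarchy.
EX = "http://example.org/onto#"
asserted = {
    (EX + "Melanoma", RDFS_SUBCLASS_OF, EX + "SkinNeoplasm"),
    (EX + "SkinNeoplasm", RDFS_SUBCLASS_OF, EX + "Neoplasm"),
}
namespaces = extract_namespaces(asserted)
inferred = infer_subclass_closure(asserted)
out_path = os.path.join(tempfile.mkdtemp(), "inferred.nt")
write_ntriples(inferred, out_path)
```

The file written in the last step is what the native store's loader would consume, so the store itself never needs to be driven through a third-party API.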

Figure 20. RDF Triple Store Hybrid Approach. The ontologies are first loaded and preprocessed in an in-memory RDF model using a popular API such as Jena or Sesame. In the preprocessing step, the namespaces are extracted and inferencing is performed using the API. The inferenced ontologies are then written out to disk, read back from disk, and loaded into a native triple store such as Mulgara using the load scripts provided by the triple store, without the use of any API.
11 Conclusion
This report presents a snapshot of the National Center for Biomedical Ontology (NCBO) current state and future state architecture. The document starts with the current state of the architecture and design of the NCBO products and technologies. It then continues with a presentation of future technology and solutions expected in the short term and for the NIH grant renewal. The underlying principle of the current and future state architecture is a web-based architecture that leverages principles of both service-oriented architecture (SOA) and resource-oriented architecture. The Chief Software Architect expects the deliverables completed by the NCBO to serve as a launching base for subsequent years to accelerate adoption and to continue the evolution of the NCBO software products, technologies, services, and overall infrastructure. The goal of the NCBO architecture is the delivery of a system of tightly integrated and loosely coupled cooperating components that include BioPortal, Open Biomedical Resources (OBR), the Annotator, OBS, the NCBO PURL Server, the NCBO Fabric, Protégé, LexGrid, and visualization technologies (e.g., FlexViz, Degree of Interest, and UI Mashups) while leveraging solid semantic web technologies (e.g., RDF and triple stores).
Such a system will enable seamless data flow between components, de-coupled evolution of components, flexibility in serving the many biomedical community workflows, improved scalability, industrial-strength software operations, and enhanced coordination across software development teams. The combination of all the components working together in a cohesive system delivers a far greater capability than the sum of the individual components.