The National Cancer Institute's Thésaurus and Ontology

Similar documents
NCI Thesaurus, managing towards an ontology

SEMANTIC SUPPORT FOR MEDICAL IMAGE SEARCH AND RETRIEVAL

Knowledge Representations. How else can we represent knowledge in addition to formal logic?

Languages and tools for building and using ontologies. Simon Jupp, James Malone

Interoperability and Semantics in Use- Application of UML, XMI and MDA to Precision Medicine and Cancer Research

warwick.ac.uk/lib-publications

Acquiring Experience with Ontology and Vocabularies

Comparing SNOMED CT and the NCI Thesaurus through Semantic Web Technologies

Powering Knowledge Discovery. Insights from big data with Linguamatics I2E

WHO ICD11 Wiki LexWiki, Semantic MediaWiki and the International Classification of Diseases

Table of Contents. iii

Coherence and Binding for Workflows and Service Compositions

Extracting Ontologies from Standards: Experiences and Issues

0.1 Knowledge Organization Systems for Semantic Web

OWL a glimpse. OWL a glimpse (2) requirements for ontology languages. requirements for ontology languages

Challenges in Deploying and Managing Large Terminologies: NCI Thesaurus

Ontology Summit2007 Survey Response Analysis. Ken Baclawski Northeastern University

OASIS Electronic Trial Master File Standard Technical Committee

A Semantic Web-Based Approach for Harvesting Multilingual Textual. definitions from Wikipedia to support ICD-11 revision

CEN/ISSS WS/eCAT. Terminology for ecatalogues and Product Description and Classification

Semantic Technologies and CDISC Standards. Frederik Malfait, Information Architect, IMOS Consulting Scott Bahlavooni, Independent

OWL DL / Full Compatability

Ontology Engineering. CSE 595 Semantic Web Instructor: Dr. Paul Fodor Stony Brook University

Semantic Web Technologies: Web Ontology Language

Balanced Large Scale Knowledge Matching Using LSH Forest

European Conference on Quality and Methodology in Official Statistics (Q2008), 8-11, July, 2008, Rome - Italy

OWL and tractability. Based on slides from Ian Horrocks and Franz Baader. Combining the strengths of UMIST and The Victoria University of Manchester

Smart Open Services for European Patients. Work Package 3.5 Semantic Services Definition Appendix E - Ontology Specifications

A method for recommending ontology alignment strategies

Main topics: Presenter: Introduction to OWL Protégé, an ontology editor OWL 2 Semantic reasoner Summary TDT OWL

An Introduction to the Semantic Web. Jeff Heflin Lehigh University

The Model-Driven Semantic Web Emerging Standards & Technologies

SKOS. COMP62342 Sean Bechhofer

Disease Information and Semantic Web

Ontologies SKOS. COMP62342 Sean Bechhofer

Pattern-based design, part I

Community-based ontology development, alignment, and evaluation. Natasha Noy Stanford Center for Biomedical Informatics Research Stanford University

Semantic MediaWiki A Tool for Collaborative Vocabulary Development Harold Solbrig Division of Biomedical Informatics Mayo Clinic

Contents. G52IWS: The Semantic Web. The Semantic Web. Semantic web elements. Semantic Web technologies. Semantic Web Services

A Knowledge-Based System for the Specification of Variables in Clinical Trials

Converting a thesaurus into an ontology: the use case of URBISOC

SELF-SERVICE SEMANTIC DATA FEDERATION

Today: RDF syntax. + conjunctive queries for OWL. KR4SW Winter 2010 Pascal Hitzler 3

New Approach to Graph Databases

Helmi Ben Hmida Hannover University, Germany

Short notes about OWL 1

Use of the CIM Ontology. Scott Neumann, UISOL Arnold DeVos, Langdale Steve Widergren, PNNL Jay Britton, Areva

GraphOnto: OWL-Based Ontology Management and Multimedia Annotation in the DS-MIRF Framework

Maximizing the Value of STM Content through Semantic Enrichment. Frank Stumpf December 1, 2009

Chapter 8: Enhanced ER Model

RDF Schema. Mario Arrigoni Neri

Tony Mallia Edmond Scientific

Simplified Approach for Representing Part-Whole Relations in OWL-DL Ontologies

H1 Spring B. Programmers need to learn the SOAP schema so as to offer and use Web services.

An Online Ontology: WiktionaryZ

Wondering about either OWL ontologies or SKOS vocabularies? You need both!

Lesson 5 Web Service Interface Definition (Part II)

Semantic Web. Ontology Pattern. Gerd Gröner, Matthias Thimm. Institute for Web Science and Technologies (WeST) University of Koblenz-Landau

SWAD-Europe Deliverable 8.1 Core RDF Vocabularies for Thesauri

A New Approach to Managing the Evolution of OWL Ontologies

Taking a view on bio-ontologies. Simon Jupp Functional Genomics Production Team ICBO, 2012 Graz, Austria

Semantic Knowledge Discovery OntoChem IT Solutions

Using Ontologies for Data and Semantic Integration

ICT for Health Care and Life Sciences

Natural Language Processing with PoolParty

Using DAML format for representation and integration of complex gene networks: implications in novel drug discovery

CHAPTER 1 INTRODUCTION

Study and guidelines on Geospatial Linked Data as part of ISA Action 1.17 Resource Description Framework

TIM: A Semantic Web Application for the Specification of Metadata Items in Clinical Research

Software Configuration Management Using Ontologies

Ontology Development. Qing He

Semantic Technology. Opportunities

Lessons Learned from a Greenhorn Ontologist

MeSH: A Thesaurus for PubMed

Enhancing information services using machine to machine terminology services

Mining the Biomedical Research Literature. Ken Baclawski

The Semantic Web DEFINITIONS & APPLICATIONS

THE GETTY VOCABULARIES TECHNICAL UPDATE

Utilizing NCBO Tools to Develop & Use an ECG Ontology

Copyright 2012 Taxonomy Strategies. All rights reserved. Semantic Metadata. A Tale of Two Types of Vocabularies

Position Paper W3C Workshop on RDF Next Steps: OMG Ontology PSIG

Data formats for exchanging classifications UNSD

Automation of Semantic Web based Digital Library using Unified Modeling Language Minal Bhise 1 1

The Semantic Planetary Data System

Semantic Technologies

WP7: Patents Case Study

Chapter 3 Research Method

Two Case Studies of Ontology Validation

Ontological Modeling: Part 11

Representing Multiple Standards in a Single DAM: Use of Atomic Classes

Blending FHIR RDF and OWL

Semantic Annotation for Semantic Social Networks. Using Community Resources

SEMANTIC WEB AND COMPARATIVE ANALYSIS OF INFERENCE ENGINES

MeSH : A Thesaurus for PubMed

NLM HL7 RFQ CHI mapping contract August/September, 2006

Harmonizing biocaddie Metadata Schemas for Indexing Clinical Research Datasets Using Semantic Web Technologies

Ontological Modeling: Part 7

Thanks to our Sponsors

Vocabulary Harvesting Using MatchIT. By Andrew W Krause, Chief Technology Officer

A Quality Assurance Framework for Ontology Construction and Refinement

Transcription:

The National Cancer Institute's Thésaurus and Ontology Jennifer Golbeck 1, Gilberto Fragoso 2, Frank Hartel 2, Jim Hendler 1, Jim Oberthaler 2, Bijan Parsia 1 1 University of Maryland, College Park 2 National Cancer Institute, Center for Bioinformatics golbeck@cs.umd.edu, fragosog@mail.nih.gov, hartel@mail.nih.gov, hendler@cs.umd.edu, bparsia@isr.umd.edu, oberthaj@mail.nih.gov Introduction The NCI Thésaurus is a public domain description logic-based terminology produced by the National Cancer Institute, and distributed as a component of the NCI Center for Bioinformatics cacore distribution [1]. It is deep and complex compared to most broad clinical vocabularies, implementing rich semantic interrelationships between the nodes of its taxonomies. The semantic relationships in the Thésaurus are intended to facilitate translational research and to support the bioinformatics infrastructure of the Institute. The NCI Thésaurus evolved from the NCI Metathesaurus. NCI Metathesaurus is based on the National Library of Medicine Unified Medical Language System (UMLS) Metathesaurus. The NCI Metathesaurus has been operational since 1999. A public version is available at http://ncimeta.nci.nih.gov. The need for a comprehensive NCI-wide terminology arose because NCI staff requires access to timely and accurate information about activities related to the scientific mission of the Institute. The collection, storage and retrieval of data related to NCI research programs is necessary to analyze, manage, and report about these activities. Though centralized coding of NCI-supported research-related activities met some of these needs, supplementary data coding had become common. This coding was assigned independently within various components of the Institute, and was frequently based on locally developed term lists or other informal vocabulary, making it difficult to find and combine information across programs. The NCI source vocabulary within the NCI Metathesaurus encompasses the terminology used by the various offices and divisions within the Institute, with the goal of providing a common vocabulary to increase the interoperability of information systems. The NCI vocabulary provided not only an initial Institute-wide integrated vocabulary, but also rich mappings of NCI terminology to the numerous other biomedical vocabularies. Useful as it is however, the NCI Metathesaurus is not well suited to serve as a coding nomenclature 1. The hierarchies of the NCI terminology in NCI Metathesaurus are not true IS_A hierarchies, for example. The NCI Thésaurus, which became operational in late 2001, is designed to address this need. It is intended for NCI offices and divisions to use the Thésaurus as a source of codes associated with terminology concepts to annotate data and other information artifacts and facilitate information retrieval. The NCI Thésaurus is not yet a true ontology 2, as it contains numerous primitive concepts. It might be fairly described as a nomenclature with ontologic features. It is strongly ontology-like across several taxonomies however; containing detailed semantic relationships among genes, diseases, drugs and chemicals, anatomy, organisms, and proteins. Among these categories, thousands of concepts are defined. Each concept's definition contains descriptive information, such as synonyms and English definitions, and explanations of how it relates to other concepts. The Thésaurus has been expanded to support bioinformatics-related research efforts, and in this role, it is a source of information on asserted relationships between entities. With reference to a concept representing an animal model of a human disease, for example, the Thésaurus might describe relationships such as genetic abnormality is associated with the disease and the organisms in which it occurs. Although editors 1 We define a nomenclature as addressing the symbol to object relation in the semiotic triangle. 2 We define an ontology as addressing the symbol to concept relation in the semiotic triangle.

have defined a number of named ontologic relations to support the description-logic based structure of the Thésaurus, additional relationships are considered for inclusion as required to support dependent applications. The detailed information provided about such a broad range of medical terms and topics is a rich and useful data source for the medical community. In an effort to make the knowledge in the Thésaurus more useful and accessible to the public, the National Cancer Institute and the University of Maryland, College Park have worked together to produce an OWL [2] ontology from the Thésaurus. In the following sections, this paper will describe the terminology development process at NCI, and the issues associated with converting a description logic based nomenclature to a semantically rich OWL ontology. Ontology Development Given the rapid pace of research findings and clinical refinement related to cancer prevention and treatment, the content of the NCI Thésaurus evolves rapidly. To keep the information as up to date as possible, a new release is made every month. With nearly a dozen full time subject matter experts working on different parts of the terminology, a well-structured and complex process is used to ensure accuracy and coherence. NCI uses the Apelon, Inc.'s Terminology Development Environment and Manager software tools[3] in the development process. There are eight main steps in terminology development. The full process is diagramed in Figure 2, and the steps described below reference the numbered arcs in the figure. At the beginning of each cycle, the Lead Modeler, working from a local database with the current version of the Thésaurus creates a series of worklists which detail the concepts that need to be edited by each modeler(1). The worklists are exported through the Apelon Manager and assignments are made to each Modeler or Expert Modeler (2). Each modeler has a local copy of the NCI Thésaurus. This prevents problems where one modeler may make a change that impacts changes made by another. Once each modeler makes changes to their local copies of the nomenclature (3), a Change Set that captures the editing performed in their local copy is sent back to the workflow manager (4). The changes are analyzed to identify any conflicts that may have arisen in the process. If any corrections are needed, new assignments are made for the expert modelers, and the cycle repeats until no conflicts are found (5). On a weekly basis, new baselines containing the consolidated changes are released to the modelers to update their local databases (6). To make a full export of the new nomenclature available to the public on the web, several steps are still required. All of the changes are imported into a trial database and classified using description logic rules. If there are any unexpected results, problematic changes are either eliminated or corrected. Once classification is successful, comparisons and reviews are made between the new classified nomenclature and the previous version. In addition, the release candidate is also made available to developers of dependent projects for regression testing (7). When the final version is approved, it is published on the web (8) at http://nciterms.nci.nih.gov. As part of the cacore distribution, it is periodically also made available in both a flat file and XML formats. The flat file, however, does not include all the elements of the nomenclature, such as role relations 3. The February, 2003 release of NCI Thésaurus contained about 26,000 concepts and about 71,000 terms divided among 24 taxonomies. The taxonomies cover administrative, applied and basic science and clinical terminology. Conversion To OWL Converting the NCI Thésaurus to an ontology in OWL Lite is a multi-step process. At the end of the production cycle, the Thésaurus is exported in Apelon's Ontylog XML format. Though the semantics 3 These files can be downloaded from ftp://ftp1.nci.nih.gov/pub/cacore/evs/.

intended by the modelers were included in the XML, and recoverable using the Apelon server, the complex relationships are not apparent to any third-party software. Manager Manager's Database 1 2 Modeler's Database 3 Modeler 4 Manager 5 6 7 External Validation 8 Publication Figure 2. diagram of the NCI Thésaurus editing and publication cycle. This is an abbreviated version of the editing workflow process. The numbers in the figure correspond to the numbers in the text above that describes the process. In creating an ontology that represented these semantics, the first choice was to decide among OWL's three sub-languages: OWL Lite, OWL DL, and OWL Full[2]. Though OWL Full is the most expressive of the three languages, the first two have several advantages. OWL DL and its formal subset, OWL Lite, are guaranteed to be both computationally complete and decidable. The NCI Thésaurus is created using a DL tool, so the restrictions of these two variants of the language were a reasonable fit. Since the elements required to express the relationships in the Thésaurus fell within the restrictions of OWL Lite, we chose to encode in that sublanguage. One of the most important steps in transforming the Thésaurus into an ontology is to represent the concepts and their connections in a machine processable way. In our ontology, each concept is given a formal designator and the relationships between them are formalized in the base ontology language. This overcomes any ambiguity between natural language based descriptive text, and formal concept names. In the NCI Thésaurus, names of entities are semantically rich. Some are simple one-word names, while others are long complex descriptions, e.g. Anti-human chorionic gonadotropin vaccine/gemcitabine and Transporter 1, ATP-Binding Cassette, Sub-Family B (MDR/TAP). To do this, the original concept names were converted to proper RDF identifiers by removing any spaces and substituting illegal characters with underscores. Names that began with numbers were also prefixed with underscores to make them legal names. The names in their original forms are preserved as rdfs:label. The four major types of definitions made in the Thésaurus are Kinds, Roles, Properties, and Concepts. Kinds are the top-level super classes for all of the concepts defined in the Thésaurus, and basically

enumerate the different categories of concepts. Kinds include such things as Anatomy, Biological Processes, Chemicals and Drugs, and Diagnostic and Prognostic Factors. Each Kind is converted to an owl:class. A Concept also translates to an owl:class, as it describes a specific concept under one of the Kind categories. The bulk of the Thésaurus comprises concept definitions, and this is also where the most complex semantics occur. As such, it was the most difficult part of the conversion. Consider the following abridged concept definition from the Thésaurus XML: <conceptdef> <name>chlorambucil</name> <code>c362</code> <id>362</id> <primitive/> <kind>chemicals_and_drugs_kind</kind> <definingconcepts> <concept> Nitrogen Mustard Compound </concept> </definingconcepts> <definingroles> <role> <some/> <name> Chemical_or_Drug_FDA_Approved_for_Disease </name> Chronic Lymphocytic Leukemia </value> </role> </definingroles> <properties> <property> <name>full_syn</name> <![CDATA[ <term-name> phenylbutyric acid nitrogen mustard </term-name> <term-group>sy</term-group> <term-source>nci</term-source>]]> </value> </property> <property> <name>semantic_type</name> Pharmacologic Substance</value> </property>... </properties> </conceptdef> As shown above, each Concept in the Thésaurus has three main types of associated data: defining concepts, defining roles, and properties. A "defining concept" is essentially a super class, so we used rdfs:subclassof relationships here. Roles translate to rdf:properties in OWL. Roles describe how concepts relate to one. Generally, Roles have domains and ranges of Kinds, and the values are further limited to specific Concepts within concept definitions. The "Defining roles" within a concept definition provide these local restrictions on the ranges of roles. For example, the class "Chlorambucil", a drug, has the defining role: <Chemical_or_Drug_FDA_Approved_for_Disease> Chronic Lymphocytic Leukemia </Chemical_or_Drug_FDA_Approved_for_Disease> There is a class "Chronic_Lymphocytic_Leukemia" defined in the ontology, as well as the role property "Chemical_or_Drug_FDA_Approved_for_Disease." In the OWL output, there is a local restriction on the range of that role to have owl:somevaluesfrom the Chronic_Lymphocytic_Leukemia class. <owl:restriction> <owl:onproperty rdf:resource="#rchemical_or_drug_fda_approved_for_disease"/> <owl:somevaluesfrom rdf:resource="#chronic_lymphocytic_leukemia"/> </owl:restriction> Properties in the Thésaurus, on the other hand, contain metadata that describes the class, but not its instantiations or subclasses. Properties are defined as owl:annotationproperties in the ontology. An owl:annotationproperty is a

subclass of an rdf:property, and, like rdfs:comment and rdfs:label 4, can be attached to any class, property or instance. This allows Properties from the Thésaurus to be associated directly to a Concept s corresponding class, without violating the rules of OWL Lite. In addition to explicitly named Properties, code and id attributes are special cases. Every entity in the Thésaurus Concepts, Kinds, Properties, and Roles has an associated code and id. These are used as unique identifiers in the Apelon development software, and, as such, are not defined explicitly as Roles or Properties. This required a hard-coded definition of both entities. They were defined as owl:annotationproperties, just like other Properties in the Thésaurus, and this allowed them to be attached to all of the definitions in the ontology. The example code above shows two Properties in addition to code and id. Each Property is defined to have a name and a value. The names of the properties correspond to actual property definitions included elsewhere in the files. Values needed to be attached, via the Properties, to the new class. Since the Properties are owl:annotationproperties, this is done in the same way a property would be associated with an instance. The final OWL ontology is made up of approximately 450,000 triples in a file that is over 33MB. A description and current OWL version is available for download at http://www.mindswap.org/2003/cancerontology. Currently, we are exploring the inclusion of the OWL properties for controlling versions (including identification of deprecated classes and properties). When this is completed, publication of the Thésaurus to OWL will be incorporated into the NCI monthly production cycle. Lessons Countless organizations have their own intricate and well-developed thesauri that have been refined over many years. With consummate converters, these thesauri can become a strong and useful element on the semantic web. Like the Thésaurus from the National Cancer Institute, many are available in an unusual form of XML, or not in XML at all. As in this case, simple translations using standard XML parsers often will not be a reasonable solution for making the transition to RDF or OWL. The most time consuming process in this conversion was a making careful analysis of the Thésaurus to understand the best way to translate it into OWL. In the terminology development process, Concepts are not strictly considered to be instances or classes. However, because of the hierarchical structure within the collection of Concepts, as well as the Roles that serve as local restrictions, they clearly translated best to classes in OWL. Discovering the distinction between Roles (local restrictions on properties), and Properties (annotations to a specific class) was another important step. At this point in the development, we were forced to make a choice in which species of OWL to use. Developers attached properties that are not inherited to concepts. If the translator directly attached Properties to classes, the ontology would be in OWL Full. For computational benefits, we decided to work in OWL Lite, and thus owl:annotationproperties were used. For other conversions, these same types of distinctions and decisions must be made. The expressive power of a proprietary encoding can vary widely from that in OWL or RDF. Understanding the original semantics and engineering a solution that most closely duplicates it is critical for creating a useful and accurate ontology. References [1] NCI Center for Bioinformatics cacore : http://ncicb.nci.nih.gov/ncicb/core [2] OWL Web Ontology Language: http://www.w3.org/2001/sw/webont [3] Apelon, Inc.: http://apelon.com 4 rdfs:comment and rdfs:label are actually two pre-defined annotation properties in OWL. There are three others: owl:versioninfo, rdfs:seealso, and rdfs:isdefinedby. More details are available in the OWL Reference at http://www.w3.org/tr/owl-ref/.