Ontologies Guidelines for Best Practice

To support practical application and mapping

Author(s): Pistoia Alliance Ontologies Mapping Project team
Version: 1.2
Date: 7th April 2016

Summary

These best practice guidelines are designed to check how suitable source ontologies are for mapping. They place emphasis on the application of ontologies in the Life Science industry to encourage best practice and to aid mapping of ontologies in a particular domain. This public resource was developed as part of the Pistoia Alliance Ontologies Mapping project.

Contents

Context and Purpose
  Background
  Motivation and purpose
Best practice use cases
  Use Case: Curation of disease annotation
  Use Case: Data harmonisation
  Use Case: Text Mining
  Use Case: Data integration
  Use Case: Experimental Investigation
Guidelines for Best Practice
  Format
  URIs
  Versioning
  Documentation
  Users
  Authority locus
  Maintenance
  License
  Content delineation
  Content coverage
  Content quality
  Textual definitions
  Naming conventions
  Relations
  Conserved URIs
Positive and negative aspects
Appendices
  Application of the Guidelines to a checklist
  Mapping of the Guidelines
References

Context and Purpose

Background

These guidelines for best practice support the application and mapping of ontologies in the life sciences as part of the Pistoia Alliance Ontologies Mapping project, which creates better tools and services for mapping ontologies to facilitate their exploitation. Ontologies can include hierarchical relationships, taxonomies, classifications and/or vocabularies, which are becoming increasingly important for the support of research and development. They have numerous applications, such as knowledge management, data integration and text mining, where researchers need to analyse large quantities of complex data as part of their daily work.

The Ontologies Mapping Project will give users access to standardised tools and methodologies to map and visualise ontologies and to understand ontology structure, potential overlaps and equivalence of meaning. The outcome of this project will be to help users integrate, understand and analyse their data more effectively.

Motivation and purpose

This document describes the best practice guidelines that are designed to check how suitable source ontologies are for mapping. It places emphasis on the application of ontologies in the Life Science industry to encourage best practice and to explain how this relates to the mapping of ontologies in a particular domain.

In some areas of Life Science, such as Clinical Sciences, best practice is mature enough to be governed by appropriate authorities (e.g. FDA, CDISC, EMA, IDMP etc.), whereas in Preclinical and Translational Research areas, best practices and data standards tend to be much less mature and can even be absent. These guidelines will identify and align with existing communities and authorities, especially those that are relevant to research and early development, for particular ontology domains regarded as critical to industry needs rather than for the vast field of Life Science as a whole.

Best practice use cases

The use cases below show typical applications for ontologies and mappings of disease, phenotype and experimental investigation, which serve as "test cases" for these guidelines.

Use Case: Curation of disease annotation

Publicly available datasets from the ArrayExpress database can be curated into data/metadata cataloguing platforms to create an exemplar resource. Ontologies and standards are used throughout the platform to ensure the standardisation of data and metadata. Ontology terms are chosen to represent certain diseases described within the ArrayExpress experiments. Examples are given below in which two reference disease ontologies are exploited: Human Disease Ontology (DOID) and Human Phenotype Ontology (HP), which bring complementary strengths. Terms for curation were found by text search using the Ontobee or NCBO BioPortal ontology browsers. The examples show how mapping between these ontologies enables harmonised application, although this process depends on the context of application, which here is annotation of data from a gene expression experiment.

The term uveal melanoma (Human Disease Ontology DOID_6039) was used to describe the disease in the ArrayExpress experiment E-GEOD "Mda-9/Syntenin-1 is expressed in uveal melanoma and correlates with metastatic progression".
  o Uveal melanoma is also present as an exact synonym in the Human Phenotype Ontology term Intraocular melanoma (HP).
  o The term uveal melanoma cell also appears in the results list and is present in the BRENDA tissue ontology. This was not the term we wanted, since we wanted to represent the disease under investigation and not the cell type.

The term Renal Clear Cell Carcinoma (Human Disease Ontology DOID_4467) was used to describe the disease in the ArrayExpress experiment E-GEOD "Digital gene expression (DGE) sequencing of 10 pairs samples between kidney normal tissue and cancer tissue".
  o However, the Experimental Factor Ontology (EFO) contains a similar term, clear cell renal carcinoma (EFO), which has Renal clear cell carcinoma listed as an alternative term.
  o The obsolete term renal clear cell carcinoma also appears in the results list following a text search. We would not use it, since it is obsolete.

Use Case: Data harmonisation

This use case shows harmonisation of data obtained from multiple ontological sources to improve standardisation of named entities with similar meaning. Diseases can be described by different acronyms and shorthand text. Assigning ontology terms can bring integrity to the data, which allows improved integration and querying across datasets. The examples below show that equivalent entities in three ontologies (EFO, DOID and HP) have different synonyms across these resources, which can be harmonised through mapping of equivalence.

myocardial infarction (EFO), myocardial infarction (DOID), Myocardial infarction (HP)
  o can be described as: MI, Myokardinfarkt
hypertension (EFO), hypertension (DOID), Hypertension (HP)
  o can be described as: Arthypertension, HTN
diabetes mellitus (EFO), diabetes mellitus (DOID), Diabetes mellitus (HP)
  o can be described as: Diab1, Diab2, Diabmellitus, diabetes, Diab_Mellitus, diabtype

Use Case: Text Mining

In general, ontologies are not designed to support text mining and the related semantic search. It is nevertheless good practice to use these resources for text mining and, in particular, for lexical extraction/named entity extraction. Frequently used resources are, for example, MeSH or MedDRA. Ontologies mapping can be very useful in supporting text mining because the synonym lists of ontologies (if there are any) are usually not comprehensive. Mapping ontologies would therefore allow for automatic synonym enrichment, where the synonym set would be the union of all synonyms provided by the mapped input ontologies. On the other hand, larger synonym sets should be created from ontologies with care: the sources differ in granularity, so chaining mappings transitively can pull terms that are not truly equivalent into one oversized synonym set. ICD-10, for example, clusters a lot of different concepts under a single code, whereas other resources such as SNOMED or MedDRA are more fine-grained.
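The synonym enrichment described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the way the synonyms are split across EFO, DOID and HP is invented for the example (the harmonisation use case lists only the combined values), and the mapping is assumed to contain exact equivalences only.

```python
# Synonyms per (ontology, label). The per-ontology split is hypothetical;
# only the combined values come from the data harmonisation use case.
SYNONYMS = {
    ("EFO", "myocardial infarction"): {"MI"},
    ("DOID", "myocardial infarction"): {"Myokardinfarkt"},
    ("HP", "Myocardial infarction"): {"MI"},
    ("EFO", "hypertension"): {"HTN"},
    ("DOID", "hypertension"): {"Arthypertension"},
    ("HP", "Hypertension"): set(),
}

# Groups of terms that a mapping asserts to be equivalent (assumed exact matches).
EQUIVALENCE_GROUPS = [
    [("EFO", "myocardial infarction"),
     ("DOID", "myocardial infarction"),
     ("HP", "Myocardial infarction")],
    [("EFO", "hypertension"),
     ("DOID", "hypertension"),
     ("HP", "Hypertension")],
]


def enrich_synonyms(groups, synonyms):
    """For each mapped term, return the union of labels and synonyms of its group."""
    enriched = {}
    for group in groups:
        merged = set()
        for term in group:
            merged.add(term[1])                    # the label itself
            merged |= synonyms.get(term, set())    # plus its own synonyms
        for term in group:
            enriched[term] = frozenset(merged)
    return enriched


enriched = enrich_synonyms(EQUIVALENCE_GROUPS, SYNONYMS)
print(sorted(enriched[("HP", "Hypertension")]))
```

Restricting the groups to exact equivalences is the cautious choice here. If broader or narrower matches were chained transitively, the sets would grow across levels of granularity, which is the risk the text warns about.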
Some ontologies are not even appropriate for text mining, as they contain mainly longer pre-coordinated phrases which do not occur as exact matches in free text. An example is the Gene Ontology: a concept like "apoptotic process involved in heart morphogenesis" will rarely be used as such in free text, but will rather be paraphrased across several phrases.

Another important parameter for the use of ontologies in text mining is the linguistic capability of the text mining tool itself. By and large there are two approaches. At one extreme, you could spell out all lexical variations of a term (a term can of course consist of multiple words). This inflates the dictionary for text mining but speeds up the indexing/annotation process. At the other extreme, you could leave the detection of variations to algorithms, including de-inflection, de-derivation and decomposition as well as permutations of words. The second scenario results in a lean terminology but a higher computational load at indexing/annotation time.

What is completely missing in ontologies is a linguistic description layer for word forms and their relationship to each other (e.g. normal form vs case variations, plural etc.). Another example of linguistic description would be ambiguity markers or confidence levels for a specific term in an annotation scenario. Please note that these may vary depending on the context or use case (e.g. indexing a canteen menu with a gene annotator in enterprise search).

In the context of Ontologies Mapping, a typical use case for text mining is the creation of an exhaustive synonym list from multiple input sources to serve lexical extraction or named entity extraction. Here is an example from the drug product domain. The Roche drug "tamiflu" has a lot of different synonym types depending on the status of the pipeline. We have compiled the synonyms from different input sources, including open access sources (ChEMBL, ChEBI, DrugBank etc.) and commercial ones (Pharmaprojects, Integrity, Pharma Partnering etc.). An extract of the synonym list:

tamiflu (trade name)
oseltamivir (generic name, INN)
GS-4104
HSDB-7433
RO (LabCode)
CCOC(=O)C1=C[C@@H](OC(CC)CC)[C@H](NC(=O)C)[C@@H](N)C1 (SMILES)
ethyl (3R,4R,5S)-4-acetamido-5-amino-3-(pentan-3-yloxy)cyclohex-1-ene-1-carboxylate (IUPAC name)
etc.

Variations such as GS4104, Ro are pre-calculated.

Use Case: Data integration

Data integration includes the data harmonization use case described above.
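The harmonisation step that data integration builds on can be sketched as a lookup from free-text labels to one preferred term. The sketch below is illustrative only: the normalisation rule (lower-casing and dropping spaces, hyphens and underscores) and the choice of preferred labels are assumptions, while the labels themselves are taken from the data harmonisation use case.

```python
import re

# Preferred term -> labels seen in source data (from the harmonisation use case).
SOURCE_LABELS = {
    "myocardial infarction": ["MI", "Myokardinfarkt"],
    "hypertension": ["Arthypertension", "HTN"],
    "diabetes mellitus": ["Diab1", "Diab2", "Diabmellitus", "diabetes",
                          "Diab_Mellitus", "diabtype"],
}


def normalise(label):
    """Assumed normalisation: lower-case, then drop spaces, hyphens and underscores."""
    return re.sub(r"[\s_\-]+", "", label.lower())


def build_lookup(source_labels):
    """Index every known label (and the preferred term itself) by its normalised form."""
    lookup = {}
    for preferred, labels in source_labels.items():
        for label in [preferred, *labels]:
            lookup[normalise(label)] = preferred
    return lookup


LOOKUP = build_lookup(SOURCE_LABELS)


def harmonise(label):
    """Return the preferred term for a label, or None when the label is not mapped."""
    return LOOKUP.get(normalise(label))


for raw in ["Diab_Mellitus", "htn", "MI", "asthma"]:
    print(raw, "->", harmonise(raw))
```

Normalising at lookup time corresponds to the "lean terminology" end of the trade-off described in the text mining use case; pre-calculating every variant would instead enlarge the dictionary.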
Commonly an ETL (Extract, Transform, Load) process is used to integrate data from multiple sources which have heterogeneous data schemes and formats and which often use different ontologies as a reference vocabulary, e.g. for diseases. The role of ontology mappings in data integration is therefore two-fold. Firstly, they support integration at schema level, using mappings to upper-level or mid-level ontologies, e.g. to BFO, OGMS, RO. Secondly, mappings are needed between the large reference ontologies which different sources use to express data, e.g. diseases may be expressed using DOID, parts of SNOMED CT, ICD-10, MedDRA etc. These mappings of reference ontologies are used in the transformation step to harmonize the reference vocabulary of different sources.

In the paper "From Symptoms to Diseases - Creating the Missing Link", we demonstrate how multiple sets of mappings can be used to integrate information on disease symptom relations from many different ontologies of the BioPortal. We show that mapping quality is essential to obtain valuable integration results.

Use Case: Experimental Investigation

Taxonomies and ontologies provide a vocabulary and semantic model for the representation of laboratory analytical processes. They are used for the standardized representation of laboratory analytical processes, the materials and devices involved, and the corresponding results, in order to overcome vendor-specific formats. To enhance interoperability, the Allotrope Foundation defines a set of mappings to other ontologies. For instance, entities of the processes domain are mapped to the following OBO ontologies:

Chemical Methods Ontology (CHMO), e.g., "atmospheric pressure chemical ionization": afp:afp_ skos:closematch obo:chmo_
Ontology for Biomedical Investigations (OBI), e.g., "planning": af-p:afp_ skos:closematch obo:obi_
Information Artifact Ontology (IAO), e.g., "plan": af-p:afp_ skos:closematch obo:iao_
Mass Spectrometry Ontology (MS), e.g., "mean of spectra": af-p:afp_ skos:closematch obo:ms_

Guidelines for Best Practice

These guidelines for best practice in ontologies support their application and mapping in the selected ontology domains (Disease and Phenotype for now). The Open Biological and Biomedical Ontologies (OBO) Foundry has developed numerous principles, which have been accepted by this open community. Many of these accepted principles can be regarded as guidelines for best practice, so we reuse much of the description of each relevant OBO principle below.

The following guidelines also align well with "Ten Simple Rules for Selecting a Bio-ontology" by Malone et al [PLOS Computational Biology 2016] DOI: /journal.pcbi. The only missing rule is Rule 10: Sometimes an Ontology is Not Needed at All. This rule points out that selection of an ontology should be driven by an understanding of the user requirements, which is crucial for deciding whether an ontology is really needed. Other forms of knowledge representation, such as vocabularies, are often much simpler to understand than ontologies and may be sufficient to meet the requirements of the user.

Format

The ontology is made available in a common formal language, in an accepted concrete syntax. The purpose of a common format is to allow the maximum number of people to access and reuse an ontology. Recommended implementations include OBO format, and OWL or OWL2 in a concrete syntax such as RDF/XML, OWL2-XML or OWL2-Manchester syntax. This means that achieving interoperability requires an accepted syntax implemented in one of the commonly accepted representational models (e.g. OBO, OWL, SKOS etc., possibly with defined restrictions). More details, including examples, can be found via the FP_002_format OBO wiki page (accepted).

The representation of the ontologies can also make use of vocabularies mainly implemented in RDF. These describe a set of standardized pre-defined concepts and predicates, such as SKOS, VoID, FOAF etc. In case of overlapping standards, resources like LOV (Linked Open Vocabularies) might help because they provide a comprehensive overview of which vocabularies are most widely applied.

Many ontologies in the Disease, Phenotype and Experimental Investigation domains are available in one or more of these common formats via the OBO Foundry, NCBO BioPortal or ontology home web sites. Ontologies that use a non-standard format are likely to impede interoperability, which will limit their application and mapping.

URIs

Each class and relation (property) in an ontology should have a Uniform Resource Identifier (URI) to address identifier space. The identifier should be constructed from a base URI, a prefix that is unique to the ontology (e.g. GO, CHEBI, HPO) and a local identifier. The local identifier should be a numeric string and should not consist of labels or mnemonics meaningful to humans. This means that ontology IDs will take the form <IDSPACE>:<NUMBER>. The ontology prefix (<IDSPACE>) must be registered with an appropriate authority, such as the OBO library, in advance. Although it is tempting to make a URI meaningful to humans, the primary purpose of URIs is machine readability, where the overriding consideration is stability (see "Cool URIs don't change"), which facilitates interoperability. More details, including examples, can be found via the FP_003_URIs OBO wiki page (accepted). This guideline aligns with Malone et al 2016 Rule 3: The Ontology Classes and Relationships Should Persist.

Most ontologies in the Disease, Phenotype and Experimental Investigation domains use URIs to address identifier space. They are available via the OBO Foundry, NCBO BioPortal or directly from home web sites. Ontologies that use non-standard or human readable identifiers are likely to impede interoperability, which will limit their application and mapping.

Versioning

The ontology must disclose versioning through metadata to reflect the history of change. The provider should show through this metadata that it has procedures for identifying distinct successive versions. This description summarises the FP_004_versioning OBO wiki page (accepted). This guideline aligns with Malone et al 2016 Rule 8: Previous Versions Should Be Available, and is extended to include access to previous versions.

Versioning need not be applied only to the entire ontology, as more fine-grained approaches exist. The RDF specification of the MetaData registry for CDISC based on ISO supports versioning at the concept level (mms:administereditem). Please note that validated environments require versioning at a term/concept level.

Most ontologies in the Disease, Phenotype and Experimental Investigation domains disclose versioning. Those that do not are likely to be of very limited value.
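Returning to the identifier form <IDSPACE>:<NUMBER> from the URIs guideline, the following is a minimal sketch of how such an identifier might be checked and expanded to a full URI. It assumes the OBO PURL convention (base URI, prefix, underscore, number) as the base; the function names are hypothetical and other ontologies may use a different base URI.

```python
import re

# Assumed base URI, following the OBO Foundry PURL convention.
OBO_BASE = "http://purl.obolibrary.org/obo/"

# <IDSPACE>:<NUMBER> with a letter-based prefix and a digits-only local identifier.
CURIE_PATTERN = re.compile(r"^(?P<idspace>[A-Za-z][A-Za-z_]*):(?P<number>\d+)$")


def is_opaque_identifier(curie):
    """True when the local identifier is numeric rather than a human-readable label."""
    return CURIE_PATTERN.match(curie) is not None


def curie_to_uri(curie, base=OBO_BASE):
    """Expand an identifier such as 'DOID:6039' into a full URI."""
    match = CURIE_PATTERN.match(curie)
    if match is None:
        raise ValueError(f"not of the form <IDSPACE>:<NUMBER>: {curie!r}")
    return f"{base}{match['idspace']}_{match['number']}"


print(curie_to_uri("DOID:6039"))
print(is_opaque_identifier("DOID:uveal_melanoma"))
```

The second call is rejected because its local part is a label. Labels change when terminology changes, which is why the guideline prefers numeric local identifiers for stable URIs.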

Documentation

The ontology must be documented in sufficient quality and detail. This documentation should be located on the ontology home website in the form of a published paper describing the ontology and manuals for developers and users. Essential aspects of the documentation should also be recorded as metadata, embedded within the ontology. This description summarises the FP_008_documented OBO wiki page (accepted).

Most ontologies in the Disease, Phenotype and Experimental Investigation domains provide documentation on the home website as links to publications and manuals. Absence of documentation makes the ontology less likely to be adopted for application.

Users

The ontology developers should document the evidence that the ontology is used by multiple independent people or organisations. This ensures that the ontology tackles a relevant scientific area and does so in a usable and sustainable fashion. It is important to be able to illustrate usage outside of the immediate circle of ontology developers and stakeholders. More details, including examples of evidence, can be found via the FP_009_users OBO wiki page (accepted). This guideline aligns with Malone et al 2016 Rule 6: The Ontology Should Be Developed by the Community but Not Incapacitated by It.

Many ontologies in the Disease, Phenotype and Experimental Investigation domains provide evidence of a substantial user community. They often include a documentation page with links to databases using the ontology for annotation. Good examples of evidence are usage in semantic web resources (e.g. ArrayExpress) and in diverse software applications, including text mining and analysis workflow pipelines, as well as publications showing that the ontology is being used in research. Such evidence of a substantial user community makes the ontology more likely to be a credible source.

Authority locus

There should be clear responsibility for the ontology, for ensuring continued maintenance in light of scientific advance and for prompt response to user feedback. A single point (mechanism) of contact for support and feedback should be provided on the ontology home website. This description has been adapted from the FP_011_locus_of_authority OBO wiki page (accepted). This guideline is also related to Malone et al 2016 Rule 6: The Ontology Should Be Developed by the Community but Not Incapacitated by It.

Most ontologies in the Disease, Phenotype and Experimental Investigation domains identify the responsible leader and development team, along with contact details, e.g. for queries and feedback from users. Provision of a simple mechanism for feedback from users to the provider is an important aspect of best practice, which could be regarded as a "process" feature.

Maintenance

Ontologies have to be maintained to reflect the continuous advance of science, otherwise they become stale and unable to represent the latest knowledge. The ontology provider must provide evidence that the ontology is being maintained with appropriate regularity, rigorous quality and a funding source. This evidence of maintenance and funding should be documented through the ontology home web site. This accepted OBO principle, included in the original 2006 principles, is noted as requiring an update on the FP_016_maintenance OBO wiki page (accepted). This guideline aligns with Malone et al 2016 Rule 7: The Ontology Should Be under Active Development.

Another important criterion with regard to maintenance is response time: how long does it take to process a request? Resources like the NCI Thesaurus or MedDRA offer means to place a request; however, update cycles can be very long.

Most ontologies in the Disease, Phenotype and Experimental Investigation domains provide documented evidence of maintenance and funding on their home web site. Those that do not are likely to be of very limited value.

License

Openly available ontologies can be used by all without any constraint other than that (a) their origin must be acknowledged and (b) they are not to be altered and subsequently redistributed in altered form under the original name or with the same identifiers. All ontologies available in the OBO Foundry are open, whereas license terms for ontologies available through the BioPortal can be much more restrictive. This is important to understand because it could impact on interoperability and on the freedom to undertake mapping. Further details about recommendations for open licenses, implementation and examples can be found via the FP_001_open OBO wiki page (accepted). This guideline aligns with Malone et al 2016 Rule 9: Open Data Requires Open Ontologies.

Many ontologies in the Disease, Phenotype and Experimental Investigation domains are available openly via the OBO Foundry, NCBO BioPortal or directly from ontology home web sites. Ontologies that have license restrictions are likely to impede interoperability, which can limit their application and mapping.

Content delineation

Each class and relation (property) in an ontology should have clearly delineated content of acceptable precision. The ontology should be orthogonal to other related ontologies which adhere to best practice. The major reason for this is to allow two different ontologies, for example anatomy and biological process, to be combined through additional relationships. These relationships could then be used to constrain when terms could be jointly applied to describe complementary (but distinguishable) perspectives on the same biological or medical entity. As a corollary to this, we would strive for community acceptance of a single ontology for one domain, rather than encouraging rivalry between ontologies. This description summarises the FP_005_delineated_content OBO wiki page (accepted). This guideline aligns with Malone et al 2016 Rule 1: The Ontology Should Be about a Specific Domain of Knowledge.

An important aspect of content delineation is to make it clear whether the ontology is designed to be a reference for a particular domain. Alternatively, it could be intended as an application ontology, which is designed to support a particular application, often in multiple domains. An example of this is the Experimental Factor Ontology (EFO), which is an application ontology designed to support experimental investigations. EFO includes the reuse of relevant reference ontologies such as Human Phenotype Ontology (HPO) and (Human) Disease Ontology (DO). An upper level ontology is another type of content delineation, which is designed to bridge across more specific ontologies in particular domains. An example of this in the Disease and Phenotype domain is Basic Formal Ontology (BFO) as a generic upper ontology, Ontology for Biomedical Investigations (OBI) as a disease-neutral ontology and numerous disease specific ontologies.

Many ontologies in the Disease, Phenotype and Experimental Investigation domains have clearly delineated content, which makes them more likely to bring unique value for application. As discussed already, delineated content also makes it more likely that interoperability can be achieved through mapping additional relationships between different ontologies, including equivalence.

Content coverage

Ontologies should include content of acceptable coverage, so that there is a sufficient number of concepts to cover an ontology domain and enough terms and associated metadata such as name, label, definition and synonyms. There also needs to be sufficient breadth and depth of coverage combined with organisational principles (e.g. taxonomies, partonomies) and granularity (detail and depth of modelling). Another obvious aspect of coverage is missing content. This guideline aligns with Malone et al 2016 Rule 2: The Ontology Should Reflect Current Understanding of Biological Systems. The number of instances of classes and relations can give an indication of coverage. This guideline can also be tested through sampling of instances in the ontology.

Many ontologies in the Disease, Phenotype and Experimental Investigation domains have sufficient coverage of content to represent knowledge in meaningful ways. Ontologies with inadequate coverage are likely to be of limited use.

Content quality

Ontology content should be of acceptable quality, which has two aspects. First, to what extent have these guidelines for best practice been respected (formal correctness)? Second, has the content of the domain been properly modelled (correctness of the content)? This guideline aligns with Malone et al 2016 Rule 2: The Ontology Should Reflect Current Understanding of Biological Systems. This guideline can be tested through sampling of instances in the ontology.

For example, in the Phenotype and Disease domain an ontology could contain two concepts called "Tooth Disease" and "Caries", where "Caries" is a sub-concept of "Tooth Disease". Suppose, however, that "Tooth Caries" is recorded as a synonym of "Tooth Disease". Formally, nothing prevents this synonym relationship from being added. From an engineering perspective, however, the domain is not properly captured, because the synonym names the narrower concept rather than the broader one. Poor representation of knowledge such as this will limit the usefulness of an ontology.

Textual definitions

The ontology needs to contain textual definitions for a substantial and representative fraction of terms, plus equivalent formal definitions for at least a substantial number of terms. For terms lacking textual definitions, there should be evidence of implementation of a strategy to provide definitions for all remaining undefined terms. Text definitions should be unique (i.e. no two terms should share a definition). This is the vocabulary or dictionary component of an ontology, which provides definitions for class terms and those with equivalent meaning (i.e. synonyms). This description summarises the FP_006_textual_definitions OBO wiki page (accepted). This guideline aligns with Malone et al 2016 Rule 4: Classes Should Contain Textual Definitions.
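The two checks named above, coverage of textual definitions and their uniqueness, lend themselves to a simple test on a sample of terms. The following is a rough sketch, assuming the sample is available as a mapping from term identifier to definition text; the identifiers and definitions are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample: term identifier -> textual definition (None when missing).
SAMPLE = {
    "EX:0000001": "A disease that affects the tooth.",
    "EX:0000002": "A tooth disease characterised by decay of the enamel.",
    "EX:0000003": "A disease that affects the tooth.",
    "EX:0000004": None,
}


def definition_coverage(terms):
    """Fraction of sampled terms that carry a non-empty textual definition."""
    if not terms:
        return 0.0
    defined = [text for text in terms.values() if text and text.strip()]
    return len(defined) / len(terms)


def duplicate_definitions(terms):
    """Groups of term identifiers that share a definition (case and spacing ignored)."""
    by_text = defaultdict(list)
    for term_id, text in terms.items():
        if text and text.strip():
            by_text[" ".join(text.lower().split())].append(term_id)
    return [sorted(ids) for ids in by_text.values() if len(ids) > 1]


print(definition_coverage(SAMPLE))
print(duplicate_definitions(SAMPLE))
```

What counts as a "substantial" fraction is left to the assessor; the sketch only reports the figure and the offending groups.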
This guideline can be tested through sampling of instances in the ontology.

Many ontologies in the Disease, Phenotype and Experimental Investigation domains have textual definitions (vocabulary) for class terms, which makes them more likely to bring unique value for application. A high proportion of quality textual definitions will facilitate interoperability through mapping the meaning (semantics) of equivalence in different ontologies.

Naming conventions

Naming conventions used by ontology providers tend to be heterogeneous and inconsistent. This is because names often emerge in an ad hoc manner rather than through an agreed nomenclature. Of course there are exceptions which are much more mature and consistent, such as the HUGO Gene Nomenclature Committee, which provides the authoritative source of human gene names. Another excellent example is Chemical Entities of Biological Interest (ChEBI), which started as a curated nomenclature for small molecules and has developed into a mature ontology in the OBO Foundry. The OBO principle wiki page, FP_012_naming_conventions, is under development and mostly mentions the publication entitled "Survey-based naming conventions for use in OBO Foundry ontology development" by Schober et al 2009. This guideline aligns with Malone et al 2016 Rule 5: Textual Definitions Should Be Written for Domain Experts. This guideline can be tested through sampling of instances in the ontology.

Some ontologies in the Disease, Phenotype and Experimental Investigation domains have naming conventions driven by support for applications, such as phenotype for an inherited disease or clinical terms for clinical investigations, which can result in naming conventions of mixed quality and form. This can hinder interoperability, making it difficult to map between different ontologies in this domain.

Relations

Best practice for the representation of relations in ontologies is still emerging. This is because the standard formats for ontologies such as OBO and OWL use instance level relations rather than type level relations. Here types are what is described in textbooks, whereas instances are what we observe, measure or perform experiments on. One approach to the representation of relations in an ontology is to make use of an upper level ontology, such as Basic Formal Ontology (BFO). This is described fully in the book by Arp, Smith and Spear, Building Ontologies with Basic Formal Ontology, published by MIT Press, August 17, 2015. The original formulation of the OBO principle for relations is noted as requiring some modifications on the FP_007_relations OBO wiki page (accepted).

Many ontologies in the Disease, Phenotype and Experimental Investigation domains include relations, usually at the instance level. This is an emerging area of best practice which may impact on the mapping of equivalence between ontologies in this domain.

Conserved URIs

Ontologies often overlap in a particular domain. This overlap is harmless and can be mapped readily when the URIs of the source ontologies are preserved. Interoperability is then guaranteed wherever the relevant terms and their URIs are conserved, e.g. Gene Ontology in the OBO Foundry, where reuse is evident in BioPortal search results. Similarly, upper level or "meta" ontologies, e.g. Uberon in the anatomy domain, contain cross references to source URIs, which makes overlap harmless by design.

Many ontologies in the Disease, Phenotype and Experimental Investigation domains reuse terms and include cross references to source URIs. This best practice makes mapping between overlapping ontologies straightforward. However, when source URIs are NOT conserved, overlap between ontologies in the same domain tends to be harmful, making mapping of equivalence much more difficult.
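Why conserved URIs make overlap harmless can be seen in a small sketch: when two ontologies reuse the same URI for a term, finding the equivalent terms is a set intersection, and only the remainder needs real mapping effort. The URIs and labels below are hypothetical placeholders, not taken from any real ontology.

```python
# Hypothetical ontologies as URI -> label. Ontology B reuses two URIs from A
# and mints its own URI for a third, overlapping term.
ONTOLOGY_A = {
    "http://example.org/onto/A_0000001": "heart",
    "http://example.org/onto/A_0000002": "lung",
    "http://example.org/onto/A_0000003": "kidney",
}
ONTOLOGY_B = {
    "http://example.org/onto/A_0000001": "heart",
    "http://example.org/onto/A_0000002": "lung",
    "http://example.org/onto/B_0000009": "Kidney",
}


def split_by_shared_uri(first, second):
    """Separate URIs shared by both ontologies from those unique to each."""
    shared = sorted(first.keys() & second.keys())
    only_first = sorted(first.keys() - second.keys())
    only_second = sorted(second.keys() - first.keys())
    return shared, only_first, only_second


shared, only_a, only_b = split_by_shared_uri(ONTOLOGY_A, ONTOLOGY_B)
print(len(shared), "terms map trivially by URI")
print("still need mapping:", only_a, only_b)
```

In this example the two "kidney" terms overlap in meaning but not in URI, so they fall into the group that still has to be mapped by label or definition.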

Positive and negative aspects

Positive and negative aspects of the above guidelines for best practice are listed in the table below. The positive aspects are encouraged, whereas the negative aspects can hinder application and mapping of ontologies and should be avoided or minimised.

Guideline                   Positive aspect                           Negative aspect
1. Format                   Open standard                             Non-standard
2. URIs                     Used and persistent                       Not used and not persistent
3. Versioning               Used with date                            Not used and no date
4. Documentation            High quality and coverage                 Poor or absent
5. Users                    Evidence beyond provider                  Poor or missing evidence
6. Authority                Clearly defined                           Unclear or missing
7. Maintenance              Evidence of currency and sustainability   Poor or missing evidence
8. License                  Clearly defined terms and conditions      License terms can restrict use
9. Content delineation      Clear                                     Unclear or no delineation
10. Content coverage *      Acceptable                                Inadequate or sparse or gaps
11. Content quality *       Acceptable                                Poor or inaccurate
12. Textual definitions *   Acceptable                                Insufficient or absent
13. Naming conventions *    Acceptable                                Insufficient or absent
14. Relations               Consistent, clear model                   Inconsistent
15. Conserved URIs          Cross reference to source URIs            Missing source URIs

* tested through relevant sampling
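As one possible way to apply the table, each guideline can be recorded as met (positive aspect) or not met (negative aspect) for a given ontology. The sketch below is an assumption about how such a checklist might be held in code; it is not the format of the project's spreadsheet, and the assessment values are invented.

```python
# The fifteen guidelines, in the order of the table above.
GUIDELINES = [
    "Format", "URIs", "Versioning", "Documentation", "Users", "Authority",
    "Maintenance", "License", "Content delineation", "Content coverage",
    "Content quality", "Textual definitions", "Naming conventions",
    "Relations", "Conserved URIs",
]


def review(assessment):
    """Return the guidelines for which an ontology shows the negative aspect.

    `assessment` maps every guideline name to True (positive) or False (negative).
    """
    missing = [name for name in GUIDELINES if name not in assessment]
    if missing:
        raise ValueError(f"guidelines not assessed: {missing}")
    return [name for name in GUIDELINES if not assessment[name]]


# Invented assessment of a hypothetical ontology.
example = {name: True for name in GUIDELINES}
example["Versioning"] = False
example["Relations"] = False

negatives = review(example)
print(negatives)
```

A plain list of unmet guidelines is deliberately used instead of a numeric score, since the table does not weight the guidelines against each other.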

Appendices

Application of the Guidelines to a checklist

Pistoia Alliance Guidelines checklist for disease phenotype experimental investigation ontologies.xlsx

The guidelines checklist has been populated with ontologies for disease, phenotype and experimental investigation to illustrate use of the guidelines. A local download of this sheet can serve as a template for further consideration of ontologies in other data domains.

Mapping of the Guidelines to OBO principles and the 10 rules of Malone et al 2016

Guideline                 OBO Principle                10 Rules of Malone et al 2016
1. Format                 FP_002_format                -
2. URIs                   FP_003_URIs                  Rule 3: The Ontology Classes and Relationships Should Persist
3. Versioning             FP_004_versioning            Rule 8: Previous Versions Should Be Available
4. Documentation          FP_008_documented            -
5. Users                  FP_009_users                 Rule 6: The Ontology Should Be Developed by the Community but Not Incapacitated by It
6. Authority              FP_011_locus_of_authority    Rule 6: The Ontology Should Be Developed by the Community but Not Incapacitated by It
7. Maintenance            FP_016_maintenance           Rule 7: The Ontology Should Be under Active Development
8. License                FP_001_open                  Rule 9: Open Data Requires Open Ontologies
9. Content delineation    FP_005_delineated_content    Rule 1: The Ontology Should Be about a Specific Domain of Knowledge
10. Content coverage      -                            Rule 2: The Ontology Should Reflect Current Understanding of Biological Systems
11. Content quality       -                            Rule 2: The Ontology Should Reflect Current Understanding of Biological Systems
12. Textual definitions   FP_006_textual_definitions   Rule 4: Classes Should Contain Textual Definitions
13. Naming conventions    FP_012_naming_conventions    Rule 5: Textual Definitions Should Be Written for Domain Experts
14. Relations             FP_007_relations             -
15. Conserved URIs        -                            -

References

The Open Biological and Biomedical Ontologies (OBO) Foundry
NCBO BioPortal
Schober et al 2009, "Survey-based naming conventions for use in OBO Foundry ontology development"
Arp, Smith and Spear, Building Ontologies with Basic Formal Ontology, MIT Press, August 17, 2015
Malone et al 2016, "Ten Simple Rules for Selecting a Bio-ontology", PLOS Computational Biology, DOI: /journal.pcbi
