Deliverable 4.4: Final specification of EHR4CR semantic interoperability solutions



Electronic Health Records for Clinical Research
Deliverable 4.4: Final specification of EHR4CR semantic interoperability solutions
Version 1.0 Final 29/02/2016
Project acronym: EHR4CR
Project full title: Electronic Health Records for Clinical Research
Grant agreement no.:
Budget: 16 million EURO
Start:
End:
Website:
The EHR4CR project is partially funded by the IMI JU programme
Coordinator:
Managing Entity:

Document description

Deliverable no: 4.4
Deliverable title: Final specification of EHR4CR semantic interoperability solutions
Description: This deliverable describes the final implementation of the EHR4CR semantic interoperability services provided to address the needs of any specific project intending to use any services defined as part of the three EHR4CR use cases (PFS, PRS or CTE). It describes the results of the activities executed during Task 4.3, Task 4.4 and Task 4.5:
- Extension of Task 4.3 (Terminologies Services and Tools) to the context of CTE.
- Task 4.4 (Knowledge Models Mapping and Management Services and Tools), consisting of the design and implementation of tools supporting structural and terminological mapping between the EHR4CR Common Information Model and the models used in EHR/CDW systems and clinical research systems.
- Task 4.5 (Knowledge Authoring), consisting of the design and implementation of tools supporting the creation and management of EHR4CR semantic resources.
The deliverable describes the governance of the semantic interoperability platform, the adopted approach and the overall specifications. It describes the EHR4CR standardization pipeline needed to fulfill the requirements of the EHR4CR use cases (PFS, PRS or CTE) and the EHR4CR semantic interoperability services (SIS) used during both the set-up and execution phases of the EHR4CR use cases (PFS, PRS or CTE). It also describes a first evaluation of the EHR4CR standardization pipeline and semantic interoperability services, and the design and implementation of the EHR4CR Clinical Data Warehouse and of the extract-transform-load (ETL) process adopted for CDW population.
Status: Final
Version: 1.0
Date: 29/02/2016
Deadline:
Editors: C. Daniel, S. Hussain, E. Sadou, D. Ouagne, K. Forsberg, E. Zapletal, Mark Mc Gilchrist
Outputs: Type Description Communication Vehicle Report Other

Document history

Date | Revision | Author(s) | Changes
30/09/ | | S. Hussain, C. Daniel | Table of content and first contributions
11/11/ | | C. Daniel, D. Kalra | Draft of section Governance, process, responsibilities and roles
09/12/ | | C. Daniel | Draft of sections informatics infrastructure
09/01/ | | C. Daniel | Draft of sections informatics infrastructure, evaluation, conclusion
06/04/ | | C. Daniel | Input from Sebastian Mate and Mark McGilchrist
16/04/ | | C. Daniel, S. Hussain, D. Ouagne, E. Sadou | Draft of glossary, specification of the semantic interoperability services
28/04/ | | C. Daniel | Improved draft of glossary, semantic interoperability specification
05/05/ | | C. Daniel, E. Zapletal | Draft section for structural mapping. Input from Eric Zapletal: EHR4CR Terminology Mapping Status Manager (TMSM) section
27/05/ | | C. Daniel, E. Zapletal | Input from E. Zapletal: Structural mapping section
19/06/ | | C. Daniel, S. Hussain, D. Ouagne, E. Sadou, M. McGilchrist | Section about the EHR4CR Clinical Data Warehouse
30/07/ | | C. Daniel, S. Hussain |
22/11/ | | C. Daniel, D. Ouagne, E. Sadou | Discussion & Conclusion
29/02/ | | C. Daniel | Final review

Table of Contents

1 Introduction
  EHR4CR platform & use cases overview
  EHR4CR Semantic Interoperability Services overview
  Objective of the deliverable
  Outline of the deliverable
  WP4 deliverables interdependencies
  Reference documents
  Reference Documents
  Developed Documents
  Definitions and acronyms
  Definitions
  Acronyms
2 Semantic interoperability overall specification
  Approach
  What is a mediation model?
  Why do we need a mediation model?
  How to build the EHR4CR Common Information Model as mediation model?
  Governance
  Overview of the specifications
  Semantic interoperability requirements for patient identification based on eligibility criteria (use case 1 & 2)
  Semantic interoperability requirements for data extraction and form prepopulation (use case 3)
3 Standardization pipeline
  Managing the EHR4CR Common Information Model
  Mapping local semantic resources to the EHR4CR Common Information Model
  Overview of the standardization process before the execution of any EHR4CR use case
  EHR4CR Common Information Model (CIM) management
  Central/local mapping management
4 Semantic interoperability resources
  EHR4CR Common Information Model (mediation model)
  What is the EHR4CR mediation model?
  How is the EHR4CR mediation model built and maintained?
  FHIR-based templates and data elements
  Terminologies/Ontologies
  Semantic Resource Repository
  Tools and services
  EHR4CR Common Information Model Editor (CIME)
  EHR4CR Terminology Mapping Suite (TMS)
  Structural mapping
5 EHR4CR semantic interoperability services (SIS)
  5.1 Introduction and Scope
  Service Definition Principles
  Comparison of the SIS/CTS2 Service Functional Models
  Structure of the SIS specification
  Implementation Considerations
6 EHR4CR Clinical Data Warehouse
  Introduction
  ETL process and guidance for user acceptance testing
  Mappings
  The Dundee 200 test data
  Future work
7 Evaluation of the semantic resources and services
  Evaluation framework
  The need of high quality query language and model
  The need of high quality mediation model (patient data model)
  The need of an efficient standardization pipeline within participants data providers
  Results
  Query model and language
  Mediation model
  Standardization pipeline for data providers
8 Conclusion
  The EHR4CR semantic interoperability platform
  Limits, related projects and perspectives
References
9 Appendix
  List of clinical trials
  Detailed Functional Model for each of Interface Semantic Interoperability Services (SIS)
  Business Scenarios
  Scenario A: cts2:codesystem
  Scenario B: Scenarios about cts2:valueset
  Scenario C.: Scenarios about hl7: Templates
  Detailed Functional Model for each Interface Semantic services used by SDM/ODM editor
  Introduction
  Usage of a SDM-ODM container
  SDM elements for patient recruitment
  SDM-ODM extension for third party SDM-ODM designer
  Global Definitions = protocol


1 Introduction

The EHR4CR (Electronic Health Records for Clinical Research) project aims to improve the efficiency and reduce the cost of conducting clinical trials by better leveraging routinely collected clinical data in electronic healthcare records (EHRs) and using it at key points in the trial design and execution life cycle. The EHR4CR platform automates the reuse of EHR data stored in existing EHR systems or Clinical Data Warehouses (CDWs) and implements three use cases: protocol feasibility testing, patient identification and recruitment for clinical trials, and support for clinical trial execution and adverse event reporting.

Figure 1. EHR4CR services for reusing EHR data during key points in trial design and execution life-cycle: protocol feasibility services (PFS), patient identification and recruitment (PIR) and clinical trial execution and serious adverse reporting (CTE).

The EHR4CR platform contributes to the automation of the clinical research process, from the design of the protocol to the submission of the data to the regulatory agencies (see Figure 1). During this process, the protocols and case report forms are key documents that are increasingly available in electronic formats compliant with the relevant CDISC standards, in both pharma companies and university research centers.

Figure 2. Automation and standardization of the clinical research process


EHR4CR services are demonstrated by 11 pilot hospitals in 5 European countries (see Figure 3).

Figure 3. EHR4CR services demonstrated by 10 clinical research (EFPIA) & 11 hospital pilot sites

1.1 EHR4CR platform & use cases overview

The EHR4CR platform is a loosely coupled service platform, which orchestrates independent services. The EHR4CR architecture designed in WP3 (Architecture and Integration) defines how the tools and services of WP4 (Semantic interoperability), WP5 (Data Protection, Privacy & Security) and WP6 (end-user Platform Services) integrate. Table 1 describes the WP6 services required to support the three pilot scenarios. The end-user WP6 services are built upon services defined in WP3-5 to properly access EHR/CDW systems.

Table 1. WP6 service tools support the three pilot scenarios: protocol feasibility, patient identification and recruitment, and clinical trial execution and serious adverse event reporting. These services are built upon the services defined in WP3-5 to properly access the existing EHR/CDW systems.

Use case | Description | Services WP6
1 - Protocol feasibility (PFS) | Leverage clinical data to design viable trial protocols and estimate recruitment | Distributed queries over heterogeneous EHRs or CDWs
2 - Patient identification & recruitment (PIR) | Detect patients eligible for trials and better utilize recruitment potential | Distributed queries over heterogeneous EHRs or CDWs; workflow execution
3 - Clinical trial execution and serious adverse event reporting (CTE) | Optimize clinical trial execution | Workflow execution; pre-population of forms (distributed queries over heterogeneous EHRs or CDWs)


Table 1 (continued)
Re-use of clinical data to pre-populate eCRFs and adverse event reporting forms
Services WP4: Semantic Interoperability Services (Resources & Terminology) (SIS)
Services WP5: Access policy, pseudonymization/de-identification, patient content services

1.2 EHR4CR Semantic Interoperability Services overview

In this context, the objective of the Semantic Interoperability Services provided by WP4 is to allow:
- Clinicians in hospitals (data providers of the EHR4CR network), while using their own words, to simultaneously utilize the most appropriate reference codes for meaningful re-use of routinely collected clinical data in electronic healthcare records (EHRs) in the context of clinical research conducted at an international level.
- Investigators of the EHR4CR network to use a semantically-enabled platform to efficiently perform sophisticated web searches across European hospitals and find clinically relevant results that can help improve clinical research.

The clinical terms normally used by clinicians are usually mapped to coding terminologies, often local ones, that are used for care coordination and secondary use of the clinical content. These local coding terminologies do not necessarily match the international administrative and clinical reference terminologies, such as ICD-10-CM, SNOMED CT, LOINC and ATC, used within the EHR4CR European network. The aim is that clinicians in hospitals can go on capturing, storing and searching their clinical content according to local terminologies, while EHR4CR users get cross-border access to this important clinical information according to international reference terminologies.

In addition to maintaining a wide range of curated semantic resources (healthcare templates, data elements, value sets and terminologies), the EHR4CR semantic interoperability platform also provides tools and services to support the mapping between the local terminologies used in the hospitals and the reference terminologies used in EHR4CR queries.
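The idea of keeping local codes in the hospital while answering queries expressed in reference codes can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Code` class, the `LOCAL-DX` code system and its codes, the mapping table and the helper functions are hypothetical and are not part of the EHR4CR tools or services. Only the ICD-10 codes E11 and I10 refer to a real terminology.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple


@dataclass(frozen=True)
class Code:
    system: str  # terminology name, e.g. "ICD-10" or a local code system
    value: str   # code within that terminology


# Hypothetical mapping table for one hospital: local code -> reference code.
# "LOCAL-DX" and its codes are invented for illustration.
LOCAL_TO_REFERENCE = {
    Code("LOCAL-DX", "DIAB2"): Code("ICD-10", "E11"),
    Code("LOCAL-DX", "DT2-INS"): Code("ICD-10", "E11"),
    Code("LOCAL-DX", "HTA"): Code("ICD-10", "I10"),
}


def local_codes_for(reference: Code) -> Set[Code]:
    """Return every local code mapped to the given reference code."""
    return {local for local, ref in LOCAL_TO_REFERENCE.items() if ref == reference}


def count_patients(records: Iterable[Tuple[str, Code]], reference: Code) -> int:
    """Count distinct patients with at least one record matching the reference code."""
    wanted = local_codes_for(reference)
    return len({patient_id for patient_id, code in records if code in wanted})


# Records stay coded in the local terminology: (patient id, local code).
RECORDS = [
    ("p1", Code("LOCAL-DX", "DIAB2")),
    ("p1", Code("LOCAL-DX", "DT2-INS")),
    ("p2", Code("LOCAL-DX", "HTA")),
    ("p3", Code("LOCAL-DX", "DT2-INS")),
]
```

With these sample records, `count_patients(RECORDS, Code("ICD-10", "E11"))` counts patients p1 and p3. The query is written only in terms of the reference code, and the translation to local codes happens through the mapping table, so the stored data never has to be recoded.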

1.3 Objective of the deliverable

The objective of D4.4 Final specification of EHR4CR semantic interoperability solutions (M48) is to describe the final implementation of the EHR4CR semantic interoperability services (SIS) provided to address the needs of any specific project intending to use any services defined as part of the three EHR4CR use cases (PFS, PRS or CTE). D4.4 describes the results of the activities executed during Task 4.3, Task 4.4 and Task 4.5:
- Extension of Task 4.3 (Terminologies Services and Tools) to the context of CTE.
- Task 4.4 (Knowledge Models Mapping and Management Services and Tools), consisting of the design and implementation of tools supporting structural and terminological mapping between the EHR4CR Common Information Model and the models used in EHR/CDW systems and clinical research systems.
- Task 4.5 (Knowledge Authoring), consisting of the design and implementation of tools supporting the creation and management of EHR4CR semantic resources.

Outline of the deliverable

- Chapter 2 describes the governance of the semantic interoperability platform, the adopted approach and the overall specifications.
- Chapter 3 describes the EHR4CR standardization pipeline needed to fulfill the requirements of the EHR4CR use cases (PFS, PRS or CTE). The process, responsibilities and user roles, on both the clinical research and the hospital side, are defined. The information technology infrastructure (semantic resources and tools) developed to support the different actors is presented.
- Chapter 4 describes the EHR4CR semantic interoperability services (SIS) used during both the set-up and execution phases of the EHR4CR use cases (PFS, PRS or CTE).
- Chapter 5 describes a first evaluation of the EHR4CR standardization pipeline and semantic interoperability services (SIS).
- Chapter 6 describes the design and implementation of the EHR4CR Clinical Data Warehouse and of the extract-transform-load (ETL) process adopted for CDW population.
- Chapter 7 provides the conclusive statements of the deliverable.

WP4 deliverables interdependencies

D4.1 Inventory of information and knowledge models and Definition of EHR4CR Information Models (M12) describes activities executed during:
- Task 4.1: Inventory of information and knowledge models. The inventory is based on the information systems at the pilot sites, EFPIA partner preferred solutions and products, and standards relevant to the domain.
- Task 4.2: Definition of EHR4CR Information Models. Specification of the EHR4CR standard representation for clinical data: knowledge models, core dataset, template and archetype registry/repository.

D4.2 Design and implementation of semantic interoperability tools for PFS and PRS (M24) describes activities executed during Task 4.3: Terminologies Services and Tools, that is, the specification of minimal services (with limited tools support) for working with multiple terminologies (including reference terminologies and relevant ontologies): services for managing a harmonized collection of terminologies in use across clinical trials and EHR systems, and terminology translation services (cross-mappings dealing with multi-lingual resources).

D4.3 Report on authored knowledge models (M36) results from activities executed during Task 4.5: Knowledge Authoring, consisting of authoring the specific clinical knowledge models required for all four EHR4CR scenarios across the selected disease areas.

D4.4 Final specification of EHR4CR semantic interoperability solutions (M48) is built on top of D4.1, D4.2 and D4.3, which describe models and services based on the requirements of the PFS, PRS and CTE use cases. D4.4 extends the scope of the previous deliverables by describing the final EHR4CR normalization and the specification of additional services addressing the requirements of the CTE use case (use case 3).

1.4 Reference documents

Reference Documents

No. | Name of Document | Author | Date
1 | EHR4CR_Protocol_Feasibility_SRS_v1.0 (freeze candidate) | T.Karakoyun, W.Kuchinke, C.Ohmann, C.Krauth | December 16, 2011
2 | EHR4CR Subject Recruitment SRS_v1.2 | T.Karakoyun, C.Krauth, M.Eckert, B.Braasch, B.Trinczek | November 16,
 | EHR4CR_Trial_Execution_SRS_v1.1 | T.Karakoyun, C.Krauth, M.Eckert | November 26,
 | CT2 | |
5 | ISO | |
 | ISO | |
 | OHDSI - OMOP | |

Developed Documents

No. | Name of Document | Author | Date
1 | Terminology Mapping Editor_SRS | E.Sadou, S.Hussain |
2 | Semantic Interoperability Services (SIS)_SRS | D.Ouagne, E.Sadou |
3 | Local Workbench SDM extension_srs | M.Neukum | July 22,
 | ODM-SDM Editor Extension_SRS | M.Neukum, D.Ouagne, E.Sadou | September 27,

Definitions and acronyms

Definitions

The following are definitions of key vocabulary terms used in the deliverable. Most of the definitions come from documents provided by organizations contributing to international efforts in the domain of semantic interoperability, such as the HL7 Version 3 Standard: Common Terminology Services HL7 (Draft Standard for Trial Use - DSTU Release 2, October 2009), Semantic Health Net (SHN) European Network of Excellence, ISO 21090, ISO

Table 2. List of Definitions

Term | Definition and source

Information model
[SHN] Semantic artifact providing information structures, relationships, and constraints to represent data. The meaning they convey relies on the intuitive and common-sense understanding of natural language labels and descriptions, not a priori referring to any ontological foundation.
Note: In the healthcare domain, several decade-long, large-scale efforts of different standard definition organizations (SDOs) have focused on specifying both the syntax and the semantics of patient clinical information.
- Information models in the domain of patient care. The HL7 or EN standards define the semantics of meta-structures and state the need for layers of semantic expressiveness including: i) generic reference information models (EN ISO, openEHR Reference Model, HL7 Reference Information Model (RIM) or FHIR resources); ii) more detailed meta-data models like CEN/ISO Archetypes/Templates*, HL7's Detailed Clinical Models (DCMs) or HL7 FHIR resources and profiles that instantiate generic reference models and are tailored to the needs of structured data acquisition. It is important to note that these models require an associated robust data element model*, such as that defined by ISO, and a data type model, such as that defined by ISO; iii) terminology models* such as ICD or SNOMED CT.
- Information models in the domain of clinical research. The Clinical Data Interchange Standards Consortium (CDISC), a non-profit organization, has developed a number of standards for study design (SDM), study data collection, study data analysis (ADaM), and submission to the regulatory bodies (SDTM).
In the EHR4CR project: The semantic interoperability platform mediates different particular information models:
- Source information models of information systems used in hospital sites to collect clinical data during patient care (Electronic Healthcare Records (EHRs)) or to process it for secondary use (Clinical Data Warehouses (CDWs)).
- Target information models of information systems (Clinical Trial Management Systems (CTMS), Clinical Data Management Systems (CDMS), Electronic Data Capture (EDC) systems) used in clinical research sites to collect information of clinical trials, including clinical data from participating hospital sites.

Terminology model
Semantic artifact providing the description of domain entities. Terminology models vary from simple lists, hierarchies and multi-axial systems to ontologies. In this document, consistently with CTS2, we use the term code system (or terminology or vocabulary) to designate any terminology model (including ontologies) and adopt the SHN definition of ontologies.

Code system (terminology, vocabulary)
[CTS2] Managed collection of concept identifiers, usually codes, but sometimes more complex sets of rules and references. They are often described as collections of uniquely identifiable concepts with associated representations, designations, associations, and meanings. Examples of Code Systems include ICD-10, SNOMED CT, LOINC, etc. To meet the requirements of a Code System as defined by HL7, a given concept representation must resolve to one and only one meaning within the Code System. In the CTS2 terminology model, a Code System is represented by the Code System class.

Ontology
[SHN] Semantic artifact formally describing properties and relations of types of entities. Domain-independent categories, relations and axioms are typically provided by top-level ontologies, whereas the types of things specific to a domain of interest are represented by domain ontologies.

Mediation Model
[SHN] Semantic artifact used to mediate different particular information models implemented in different settings. A mediation model can be formally described within information models (e.g. HL7 FHIR resources, HL7 RIM-based models, openEHR, EN ISO 13606, etc.) or within an ontological framework, which is mostly independent of the specific layout of information models.
In the EHR4CR project: The EHR4CR Common Information Model (CIM) is an integrative semantic abstraction providing a homogeneous view that makes it possible to mediate across the heterogeneous patient-centric information models implemented in both hospital sites (source information models) and clinical research sites (target information models) for conducting clinical research studies. The EHR4CR Common Information Model (CIM) consists of a set of semantic resources used for clinical data standardization and query specification.

Semantic resource
Any element of an information model or a terminology model used to represent the meaning of health information stored, shared, exchanged and/or processed by information systems.
In the EHR4CR project: The common EHR4CR semantic resources consist of a shared set of templates and data elements, with their associated value sets and concepts, that makes it possible to mediate across heterogeneous representations of patient-centric health information. The common EHR4CR semantic resources are stored and maintained in a metadata registry framework extending the ISO/IEC and are accessed through standardized interfaces: the EHR4CR semantic interoperability services (SIS).

Template
Any detailed meta-data model that instantiates a generic reference metadata model (such as EN ISO, openEHR Reference Model, HL7 RIM, HL7 FHIR resources) and is tailored to the needs of structured data acquisition in a specific context.
Note: HL7's Clinical Document Architecture (CDA) is a high-level template defining the structure and generic content for any type of clinical document. The derived Continuity of Care Document (CCD), applied to the CDA schema, is defined to produce a desired level of information structure and content for a particular purpose: a particular type of document.
In the EHR4CR project: EHR4CR Common Elementary Templates (CETs) consist of a set of FHIR-based computable elementary information models of elements used for query specification.

Data element
[ISO11179] Any atomic element of an information model (generic (meta data) information model or template) can be considered a data element.
Note: The NCI has developed the Cancer Data Standards Repository (caDSR) initiative to standardize common data elements used in cancer research. Similarly, CDISC has developed CDASH in order to represent a minimum set of core data elements defined across all research studies. The CDISC Shared Health and Research Electronic Library (CSHARE) aims at building a global, accessible electronic library, which enables data element definitions beyond the scope of CDASH. NCI caDSR and CSHARE utilize the ISO/IEC standard as the semantic basis for the metadata repository (MDR) of common data elements.
In the EHR4CR project: EHR4CR Common Data Elements (CDEs) consist of granular, computable elementary information models of elements used for query specification.

Data type
[ISO21090]

Concept
[CTS2] Unitary mental representation of a real or abstract thing; an atomic unit of thought. It should be unique in a given Code System. Concepts as abstract, designation-independent representations of meaning are important for the design and interpretation of information models. Terminology best practices dictate that concepts are not deleted from code systems, but are instead deprecated or retired from use, although nothing in the model prevents this. A concept may have synonyms in terms of textual representation (as Terms or Designations).
Note: Concepts may be simple or compositional in nature. A compositional concept is one that contains more than one concept concatenated within it. Example: in SNOMED CT, the concept of "Malignant tumor of breast" is a combination of "Malignant neoplasm of primary, secondary, or uncertain origin (morphologic abnormality)" and the Finding site (attribute) "Breast structure (body structure)".

Term (Designation)
[CTS2] Representations of concepts. The designation identifier must uniquely map to a given text string, bitmap, etc. within the context of the containing concept. In some terminologies, every unique text string will have exactly one identifier, which means that the same identifier may occur under more than one concept. In other terminologies, there may be more than one identifier for a given text string, meaning that the identifier is unique to the concept. Service software must not assume either model. Example: in SNOMED CT, the concept of fever has the fully specified name of "fever (finding)", a preferred name of "fever", and synonyms of "febrile" and "pyrexia". These are all designations for the concept of fever.

Concept domain
[CTS2] Named category of like concepts (a semantic type) that will be bound to one or more attributes in a static model whose data types are coded. Concept domains exist to constrain the intent of the coded element while deferring the association of the element to a specific coded terminology until later in the model development process. Thus, concept domains are independent of any specific vocabulary or code system. Example: Concept domains represent an abstract conceptual space such as "countries of the world", "the gender of a person used for administrative purposes", "languages of the world", etc.

Mapping (Association)
[CTS2] Binary relationships or linkages between concepts. An association links a source concept to a target concept, often implying a direction of the association from a source to a target. There is a separate concept version level representation to identify the associations supported within a specific version of a code system.

Value Set
[CTS2] Uniquely identifiable set of valid concept representations, where any concept representation can be tested to determine whether or not it is a member of the value set. Value set complexity may range from a simple flat list of concept codes drawn from a single code system, to an unbounded hierarchical set of possibly post-coordinated expressions drawn from multiple code systems. A value set has a definition, which describes a set of codes referencing a collection of unique concept identifiers, and can be resolved to an expansion, which is a set of concept designations defined by the concept identifier. The collection of unique identifiers referenced by a value set is drawn from one or more code systems. Each of these identifiers is represented by a code.

Binding realm
[CTS2] In HL7, all model instances must declare a particular binding realm based on the jurisdiction from which they originate, for which they are destined, or for some third jurisdiction by site-specific agreement. The declared binding realm applies to the entire model or specification artifact: it is not specific to individual elements of that model or artifact. A binding realm has a unique code and a steward. The name of the Binding Realm is carried in the model instance. In the interest of maximizing interoperability, interoperability spaces should be as large as possible: binding realms are preferred to be large-grained. A binding realm is used to provide and manage the bindings of value sets to reflect rules within a conformance space, e.g. a country.

Jurisdictional Domain
Any body that may define and manage its own code systems or concepts, including localization of a broader code system. It is specifically intended to allow for localization of certain concept elements. A Jurisdictional Domain could be a country, a group of countries, a territory (e.g. a state), an SDO, an individual organization or even a department within an organization. A Binding Realm is represented as a Jurisdictional Domain in the CTS2 model. Jurisdictional Domain is intended to encompass the HL7 concept of Realm, but is broader in scope than an HL7 Realm. HL7 rules prohibit new codes being added to a code system locally, but do allow for additional concept relationships, concept properties and designations. (However, any organization could use the same model, interfaces, etc. to define its own code systems for internal use.) This class provides the link to those classes to enable the localization to be recorded and managed.

Footnote 1: Home Page for ISO/IEC Information Technology -- Metadata registries. ISO/IEC 11179, Information Technology -- Metadata registries (MDR) [accessed 09/24/15]

Acronyms

Table 3. List of Abbreviations and Acronyms

Abbreviation/Acronym | Definition
 | CEN/ISO EHR-Communication standard
ADE | Adverse Drug Event
ATC | Anatomical Therapeutic Chemical
CCD | Continuity of Care Document
CDA | Clinical Document Architecture

CDE | Common Data Element
CDW | Clinical Data Warehouse
CEN | European Committee for Standardization
CET | Common Element Template
E2B (R2) | ICH message standard based on HL7 for Individual Case Safety Reports
EHR | Electronic Health Record
EMA | European Medicines Agency
FDA | Food and Drug Administration
FHIR | Fast Healthcare Interoperability Resources
HL7 | Health Level Seven
ICD | The International Classification of Diseases
ICSR | Individual Case Safety Report
IHE | Integrating the Healthcare Enterprise
MDR | Metadata Registry
OMOP | Observational Medical Outcomes Partnership

2 Semantic interoperability overall specification

Chapter 2 provides a summary of the semantic interoperability approach, describes the governance of the EHR4CR semantic interoperability platform and provides an overview of the specifications.

2.1 Approach

The EHR4CR project developed a semantic interoperability platform that provides a consistent, integrative semantic abstraction on top of existing application representations. This abstraction makes it possible to mediate across heterogeneous applications: Electronic Health Records (EHRs) and Clinical Data Warehouses (CDWs) storing routinely collected clinical data at hospital sites.

What is a mediation model?

A mediation model provides a homogeneous view of the clinical data contained within the disparate databases of data providers, so that data users can access these data using a library of standard queries written against the mediation model.

Why do we need a mediation model?

Electronic health records (EHRs) support insurance reimbursement processes and clinical practice at the point of care. Each has a different logical organization and physical format, and the terminologies used to describe clinical information vary from source to source. Clinical Data Warehouses (CDWs) support secondary use of clinical data; they allow users to generate evidence from a wide variety of sources and support collaborative research across data sources both within and outside the hospitals. CDWs also implement various information models and terminology models.

EHR4CR faces the challenge of improving the semantic interoperability of clinical information in order to better leverage routinely collected clinical data in electronic healthcare records (EHRs) during the execution of clinical trials. The EHR4CR Common Information Model (CIM) is a standard-based, expressive and scalable mediation model*, allowing dynamic mappings between data structures and semantics for consistent interpretation of clinical data accessed from varying sources.
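As an illustration of how a mediation model lets one standard query run over differently structured sources, here is a minimal Python sketch. It is not the EHR4CR CIM or its query services: the `CommonObservation` class, the two site layouts, the local test codes, the reference code `REF-HBA1C` and all function names are hypothetical and chosen only to show the principle.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set


@dataclass(frozen=True)
class CommonObservation:
    """Hypothetical common (mediated) shape that every site exposes."""
    patient_id: str
    code: str     # reference code shared by all sites
    value: float


REF_HBA1C = "REF-HBA1C"  # invented reference code, for illustration only

# Site A stores flat rows with string values and its own lab codes.
SITE_A_ROWS = [
    {"pid": "A1", "lab": "HBA1C_LOC", "val": "7.2"},
    {"pid": "A2", "lab": "HBA1C_LOC", "val": "5.4"},
]
# Site B stores nested documents with different field names and codes.
SITE_B_ROWS = [
    {"patient": {"id": "B7"}, "test": "GlycHb", "result": {"value": 8.0}},
    {"patient": {"id": "B8"}, "test": "GlycHb", "result": {"value": 9.1}},
    {"patient": {"id": "B9"}, "test": "Na", "result": {"value": 140.0}},
]

# Terminology mappings, maintained per site: local code -> reference code.
SITE_A_CODES = {"HBA1C_LOC": REF_HBA1C}
SITE_B_CODES = {"GlycHb": REF_HBA1C}


def site_a_view() -> List[CommonObservation]:
    """Structural + terminology mapping for site A; unmapped rows are skipped."""
    return [CommonObservation(r["pid"], SITE_A_CODES[r["lab"]], float(r["val"]))
            for r in SITE_A_ROWS if r["lab"] in SITE_A_CODES]


def site_b_view() -> List[CommonObservation]:
    """Structural + terminology mapping for site B; unmapped rows are skipped."""
    return [CommonObservation(r["patient"]["id"], SITE_B_CODES[r["test"]],
                              r["result"]["value"])
            for r in SITE_B_ROWS if r["test"] in SITE_B_CODES]


def hba1c_above(observations: Iterable[CommonObservation], threshold: float) -> Set[str]:
    """Standard query, written once against the common shape."""
    return {o.patient_id for o in observations
            if o.code == REF_HBA1C and o.value > threshold}


def run_everywhere(query: Callable[[Iterable[CommonObservation]], Set[str]],
                   sites: Dict[str, Callable[[], List[CommonObservation]]]) -> Dict[str, int]:
    """Run the same query at each site and return only per-site patient counts."""
    return {name: len(query(view())) for name, view in sites.items()}


counts = run_everywhere(lambda obs: hba1c_above(obs, 6.5),
                        {"site_a": site_a_view, "site_b": site_b_view})
```

In this sketch the query author never sees the site-specific field names or local codes. Each site only has to maintain its own view function and code mapping, which is the division of labour that a mediation model is meant to support.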

2.1.3 How to build the EHR4CR Common Information Model as mediation model?

Our approach is based on the realistic assumption that several standard semantic artefacts will continue to co-exist: information models (e.g. EN ISO information model and archetypes, openEHR, HL7 RIM, C-CDA and FHIR specifications, CDISC ODM, etc.) and terminologies/ontologies (e.g. LOINC, ATC, SNOMED CT, etc.), as well as proprietary implementations for representing the content of health information in systems (EHR systems, CDWs, CTMS, EDC systems, etc.). Therefore, achieving broad-based, scalable and computable semantic interoperability across multiple domains and systems requires a consistent use of multiple standards, clinical information models and terminology models.

The EHR4CR project provides a mediation model, the EHR4CR Common Information Model, consisting of a set of multilingual semantic resources based on multiple standards (see the section about the EHR4CR Common Information Model). The EHR4CR project also proposes a standardization process allowing the disparate information models and coding systems of participant sites to be harmonized to a standardized model and standard terminologies.

Once hospital CDWs/EHRs are connected to the EHR4CR platform and source information models are mapped to the EHR4CR Common Information Model, distributed queries can be specified based on the EHR4CR Common Information Model and executed over heterogeneous sources. Routinely collected clinical data can then be used at different key points in the trial design and execution life-cycle.

2.2 Governance

A governance body, the Semantic Interoperability Board, establishes the rules for the standardization process of health information within the jurisdictional domain of the EHR4CR network (described here as the Board [2]). The Board is in charge of the definition and efficient execution of the standardization process of health information in order to fulfil the requirements of the EHR4CR use cases [3].
The standardization process has to be useful for data users in addition to being manageable for data owners. Therefore the Board is in charge of:

- Providing a high quality mediation model (EHR4CR Common Information Model) used to mediate data integration from different sources (EHRs/CDWs), and ensuring that hospital sites (data providers) provide high quality structural and terminology mappings between local semantic resources (source information models of EHRs/CDWs and local terminologies) and this mediation model.
- Supporting data users in clinical research sites in setting up query specifications in the context of the EHR4CR use cases. Users need to represent eligibility criteria and/or clinical items of clinical trials based on the mediation model (EHR4CR Common Information Model).

The Board defined the standardization process as well as the responsibilities and roles of the participating actors.

2.3 Overview of the specifications

A first version of the EHR4CR semantic interoperability platform has been designed and implemented to support the different actors in accomplishing their tasks within the standardization process and EHR4CR use case execution.

[2] This board itself relies on the European Institute for Innovation through Health Data.
[3] The three EHR4CR use cases (PFS, PRS or CTE), and potentially other services defined in the future.

Tools and services are used for i) authoring and maintaining EHR4CR shared semantic resources and ii) supporting the definition of query specifications in the context of the EHR4CR use cases.

The various use cases addressed by the EHR4CR project can be grouped into two high-level functional categories (see Table 4): patient identification based on pre-defined eligibility criteria (use cases 1 & 2) and extraction of patient-specific data for pre-populating individual forms of a research protocol (use case 3).

Semantic interoperability requirements for patient identification based on eligibility criteria (use case 1 & 2)

An investigator wants to identify patients in different healthcare facilities based on a set of predefined inclusion/exclusion criteria.

Scenario 1: Protocol Feasibility Service (PFS)

In the context of feasibility studies, the investigator runs the queries on a central workbench. The EHR4CR queries return aggregated data (counts and percentages) that might be cross-tabulated by a number of key eligibility criteria. Data will be returned only if counts are sufficiently large to protect privacy (see Figure 4).

Scenario 2: Patient Recruitment Service (PRS)

In the context of patient recruitment, once the clinical trial is set up (approvals obtained, clinical investigators recruited and contract completed), the investigator runs the queries on a local workbench. The EHR4CR queries return only a pseudonymized list of eligible patients. Based on local knowledge, the investigator may delete individuals from the list. Only treating physicians have access to a re-identified list of patients and may, when appropriate, invite patients to participate in the trial. No individual patient level data would be returned to the organization conducting the clinical research prior to patient consent.
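The rule in Scenario 1 that aggregated data are returned only when counts are sufficiently large can be sketched as follows. This is an illustrative Python sketch, not the platform's code: the threshold value `MIN_CELL_COUNT = 10` is an assumption (the deliverable does not state the value used), and the function names are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical privacy threshold; the deliverable does not state the value used.
MIN_CELL_COUNT = 10


def suppress_small_count(count: int, threshold: int = MIN_CELL_COUNT) -> Optional[int]:
    """Return the count, or None when it is too small to be disclosed."""
    return count if count >= threshold else None


def feasibility_report(counts_by_site: Dict[str, int],
                       threshold: int = MIN_CELL_COUNT) -> Dict[str, dict]:
    """Build a per-site report of counts and percentages with small counts hidden."""
    total = sum(counts_by_site.values())
    report = {}
    for site, count in counts_by_site.items():
        visible = suppress_small_count(count, threshold)
        if visible is None or total == 0:
            percent = None
        else:
            percent = round(100 * visible / total, 1)
        report[site] = {"count": visible, "percent": percent}
    return report
```

A production service would also need to guard against recovering a suppressed count by subtraction from published totals; the sketch does not address that.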
Patient-centric data can be accessed through a number of disparate CDWs whose source information models* have to be aligned with the common templates* and data elements* of the EHR4CR Common Information Model*, in which the inclusion/exclusion criteria (target information models*) are also expressed.

During the set up phase:
o The user accesses the workbench of the EHR4CR platform to represent a list of free text eligibility criteria as query specifications based on the Common Information Model (CIM) and to execute the queries on the CDWs of the hospital sites. Semantic interoperability services (SIS) are used at the workbench, by the query builder of the EHR4CR platform, to access the common EHR4CR Semantic Resources (templates, data elements, value sets, terminologies) in order to represent eligibility criteria using common templates (e.g. observations, procedures, medication statement, etc.) combined with Boolean operators and temporal constraints according to the query model.
o If needed, the user may request the creation of missing semantic resources (missing relevant templates for representing eligibility criteria). The Common Information Model (CIM) Editor is used to update the mediation model and the Terminology Mapping Editor (TME) is used to map local terminologies to the central terminologies used during the authoring of the new templates and data elements.

During the execution phase, semantic interoperability services (SIS) are used at the endpoint of the EHR4CR platform to access terminology mappings.
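The idea of representing eligibility criteria as template-based conditions combined with Boolean operators and temporal constraints can be sketched in Python as below. This is only an illustration of the principle: the actual EHR4CR query model is not reproduced here, and the class `Fact`, the helper functions and the concept codes are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Fact:
    template: str            # e.g. "observation", "medication_statement"
    concept: str             # central concept code (hypothetical values below)
    when: date
    value: Optional[float] = None


# A criterion takes a patient's facts and a reference date, and returns True/False.
Criterion = Callable[[List[Fact], date], bool]


def has(template: str, concept: str, min_value: Optional[float] = None,
        within_days: Optional[int] = None) -> Criterion:
    """Atomic criterion: a matching fact exists, optionally above a value and recent."""
    def check(facts: List[Fact], ref_date: date) -> bool:
        for f in facts:
            if f.template != template or f.concept != concept:
                continue
            if min_value is not None and (f.value is None or f.value < min_value):
                continue
            if within_days is not None:
                age = ref_date - f.when
                if not (timedelta(0) <= age <= timedelta(days=within_days)):
                    continue
            return True
        return False
    return check


def all_of(*criteria: Criterion) -> Criterion:
    return lambda facts, ref: all(c(facts, ref) for c in criteria)


def any_of(*criteria: Criterion) -> Criterion:
    return lambda facts, ref: any(c(facts, ref) for c in criteria)


def none_of(*criteria: Criterion) -> Criterion:
    return lambda facts, ref: not any(c(facts, ref) for c in criteria)


# Example: HbA1c >= 7.0 within the last 365 days AND no insulin medication statement.
eligible = all_of(
    has("observation", "CIM:HbA1c", min_value=7.0, within_days=365),
    none_of(has("medication_statement", "CIM:insulin")),
)
```

Calling `eligible(facts, reference_date)` on the facts of one patient returns whether that patient satisfies the combined criterion.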


Figure 4. Need of semantic interoperability for use case 1 (Protocol Feasibility) & 2 (Patient Recruitment)

Semantic interoperability requirements for data extraction and form prepopulation (use case 3)

An investigator at a clinical research site wants to pre-populate clinical research or patient safety forms using patient data resident in a number of disparate EHRs or EHR extracts [4]. Query specifications are derived from the content of the form and executed against an EHR or EHR extract to pre-populate an instance of the defined form. The EHR4CR queries are run only if the patient has given full informed consent to participate in the clinical trial and to the extraction of data from his/her EHR.

Scenario 3: Clinical Trial execution (EHR data extraction and form pre-population)

In the context of electronic data capture, during a visit, the clinical investigator opens the eCRF, which is automatically pre-populated with extracted data. All data extracted from the EHR are human validated before the eCRF is completed and finally submitted to the Clinical Research Organization (CRO) managing the data collection of the clinical research (see Figure 5).

In the context of Adverse Drug Reaction (ADR) reporting, when a clinician documents symptoms, findings or results that are suggestive of a serious adverse drug reaction, the EHR4CR platform prompts the clinician to complete a patient safety form, which is automatically pre-populated with extracted data. All data automatically extracted from the EHR are human validated before the form is completed and finally submitted to the sponsor, CRO or regulatory agency (see Figure 5).

The IHE Structured Data Capture (SDC) [5] profile utilizes the IHE Retrieve Form for Data Capture (RFD) profile for retrieving and submitting forms in a standardized and structured format.
In traditional RFD, form pre-population is done by a Form Manager system, such as a research electronic data capture (EDC) system, using data exported from the EHR. The IHE SDC profile introduces a second mode, called auto-population, where the EHR applies data directly to the form. In this approach, the data element definitions within the form are interpreted by the EHR system, and corresponding instance data are retrieved from the EHR database and applied to the form. SDC adds the concepts of i) a forms repository and the option of persistent forms based on the emerging ISO/IEC Meta-model for Framework Interoperability (MFI) form compliance model and ii) an ISO/IEC Metadata Registry for the definition of Common Data Elements. This profile also supports optional use of the IHE Data Element Exchange (DEX) profile for auto-populating and pre-populating forms.

[4] EHR extracts can be considered as eSources.
[5] Based on the work of the US Office of the National Coordinator for Health Information Technology, Standards & Interoperability (S&I) Framework SDC Initiative. The IHE SDC profile consists of four new standards that enable EHRs to capture and store structured data: i) a standard for the Common Data Elements used to fill the specified forms or templates; ii) a standard for the structure or design of the form or template (container); iii) a standard for how EHRs interact with the form or template; iv) a standard to enable these forms or templates to auto-populate with data extracted from the existing EHR.

Figure 5. Different steps of the Clinical Trial execution. Actors of the IHE Structured Data Capture (SDC) profile (Form Filler, Form Manager, Form Receiver, and Form Archiver) exchange standard-based transactions. Semantic interoperability services (in green) support the process.

The different steps of the Clinical Trial execution use case are:
1. The Study Manager (SM) creates a study in CDISC SDM format.
2. The SM creates or refines queries.
3. The EDC Manager imports the CDISC ODM file created by the sponsor, uses annotation tools to annotate the eCRF template data elements with the Central Data Element Repository, and saves it in the central workbench database as a study attribute.
4. The SM publishes the study for interest to a list of hospitals.
5. The Data Relation Manager (DRM) analyses the eCRF template (with the eCRF template visualization tool) and the patient recruitment queries, then gives a participation status (accept or decline).


6. The DRM and the Investigator set up the local service repository and the EDC system that will be used for the eCRF pre-population. They update the mappings to match the local terminology (using the local mapping tool) and submit the new eCRF template to the central EDC Manager with a representative dataset and all relevant information (transformation, translation, transcoding, calculation, ...).
7. The EDC Manager checks the site-specific mapping and dataset. If the mapping is not satisfactory, the EDC Manager returns it with comments; in this case, the DRM and the Investigator redo step 6 until the mapping is approved.
8. When the mapping has been approved for a site, the SM can send "ready to go" to this site.
9. Patient recruitment can start.
10. CTE takes place. For each visit of a recruited patient, an ODM file is generated and imported using the eCRF import tool to pre-populate the eCRF. The result is checked by the investigator.
11. Using the dashboard on the central workbench, the SM can follow the course of the clinical trial.

Figure 6. The interaction diagram of the IHE Structured Data Capture (SDC) profile

Semantic interoperability requirements and overall specification

The model of the eCRF or adverse event reporting form (target Clinical Information Models*) has to be aligned with the mediation model (EHR4CR Common Information Model) so that query specifications based on the mediation model can be specified and run on disparate EHRs whose source Information Models* have also been aligned with the mediation model.

During the set up phase:
o The CDISC SDM-ODM file of the clinical trial, used to generate the specification of eCRFs or AE reporting forms, needs to be annotated with EHR4CR Common Data Elements using the SDM-ODM editor extension. Semantic interoperability services (SIS) are used during the annotation process to access the common EHR4CR Semantic Resources (templates, data elements, value sets, terminologies) in order to represent eligibility criteria.
o The user accesses the EHR4CR platform to upload the annotated CDISC SDM-ODM file. The annotated CDISC SDM-ODM file can then be used for query specification during the execution of the auto-population step of the IHE SDC profile.
o If needed, the user may request the creation of missing semantic resources (missing relevant templates for representing eligibility criteria). The Common Information Model (CIM) Editor is used to update the mediation model and the Terminology Mapping Editor (TME) is used to map local terminologies to the central terminologies used during the authoring of the new templates and data elements.

During the execution phase, semantic interoperability services (SIS) are used at the endpoint of the EHR4CR platform to access terminology mappings.

Figure 7. The need of semantic interoperability for use case 3 (clinical trial execution)

A set of tools (Common Information Model (CIM) Editor, Terminology Mapping Editor (TME), Local Workbench SDM extension, SDM-ODM editor extension) and semantic interoperability services (SIS) have been designed and developed for:

- Providing a set of shared EHR4CR semantic resources of the mediation model (EHR4CR Common Information Model) and ensuring that hospital pilot sites provide high quality structural and terminology mappings of the local semantic resources to the mediation model.
- Supporting users in clinical research sites in setting up query specifications in the context of the EHR4CR use cases.
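The pre-population principle used in the clinical trial execution use case (form items annotated with common data elements, values retrieved from the EHR, and human validation before submission) can be sketched as follows. This Python sketch is an assumption-laden illustration, not the IHE SDC transactions or the EHR4CR tools: `FormItem`, the item identifiers and the data element codes are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FormItem:
    item_oid: str                    # identifier of the eCRF item (hypothetical)
    data_element: Optional[str]      # annotated common data element, if any
    value: Optional[str] = None
    validated: bool = False          # set by the investigator, never automatically


def prepopulate(items: List[FormItem], ehr_values: Dict[str, str]) -> List[FormItem]:
    """Fill annotated items from EHR values keyed by data element; leave others empty."""
    for item in items:
        if item.data_element is not None and item.data_element in ehr_values:
            item.value = ehr_values[item.data_element]
            item.validated = False   # extracted data always await human validation
    return items


def ready_to_submit(items: List[FormItem]) -> bool:
    """A form may be submitted only when every filled item has been validated."""
    return all(item.validated for item in items if item.value is not None)
```

Items without an annotation stay empty and must be completed manually, which mirrors the need to annotate the form specification during the set up phase.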

Figure 8. EHR4CR Semantic Interoperability platform: a set of EHR4CR Semantic Resources and Semantic Interoperability Services (SIS) are used during EHR4CR use case execution.

Table 4 summarizes the use of tools and semantic interoperability services (SIS) of the Semantic Interoperability platform during the set up and execution phases of the three EHR4CR use cases.

Table 4. Use of tools and semantic interoperability services (SIS) of the Semantic Interoperability platform

Columns: Protocol Feasibility (PFS); Patient Recruitment (PRS); Clinical Trial Execution (CTE) / Adverse event Reporting (ADR)

Prerequisite
  PFS: Feasibility study protocol
  PRS and CTE/ADR: Clinical trial protocol (+/- CDISC SDM-ODM file)

Setup phase of the use case
  PFS: Manual pre-processing of free text eligibility criteria
  PRS: (optional step) Uploading the CDISC SDM file including free text eligibility criteria using the [Local Workbench SDM extension]; manual pre-processing of free text eligibility criteria
  CTE/ADR: Semantic annotation of CDISC SDM-ODM files using the [SDM-ODM editor extension] and semantic interoperability services [SIS]
  All use cases: Updating the EHR4CR resources to cover the scope of the clinical trial: data element creation/update using the [CIM Editor] + terminology mapping & validation using [TME]
  PFS and PRS: Query specification (workbench) using [SIS]
  CTE/ADR: Query specification using [SIS]

Execution phase of the use case
  PFS: Query execution (endpoint) using [SIS]
  PRS: Query refinement & execution (workbench & endpoint) using [SIS]
  CTE/ADR: Query execution (auto population transaction) using [SIS]

Outcome of the use case
  PFS: Get the potential number of patients per site. Identification of the potential sites to start a recruitment protocol. Contact sites; if agreed, start recruitment.
  PRS: Screening and update of the patient count (contact sites; if agreed, start recruitment).
  CTE/ADR: Real time execution of auto-population of the eCRF or AD reporting form per patient/per visit. Source Data Verification by investigator.

3 Standardization pipeline

Chapter 3 describes the EHR4CR standardization process needed to address the semantic interoperability requirements for EHR4CR use case execution. The aim of this process is the standardization of structured EHR data into a unified, concept-based model. The workflow of the standardization process has been defined, as well as the responsibilities and roles of the participating actors. This chapter also provides the use case diagram of the semantic interoperability platform supporting the different actors, on both the clinical research and the hospital side, in accomplishing their tasks during the standardization process.

The EHR4CR standardization process includes: i) the management of the mediation model (EHR4CR Common Information Model), ii) the structural and terminology mappings between local semantic resources (source information models of EHRs/CDWs and local terminologies) and this mediation model. The structural and terminology mappings is used in two

3.1 Managing the EHR4CR Common Information Model

The EHR4CR Common Information Model has been developed, and can be extended, through a global consensus-based development process [6] based upon both i) eligibility criteria and data items defined across a given set of specific clinical trials (bottom up approach) and ii) standard reference clinical information models (top down approach). The EHR4CR Common Information Model is developed and evolved through repeated cycles using a "Learning by Doing" approach.
The Board is in charge of the maintenance and quality assurance of the EHR4CR Common Information Model. Starting from an initial subset of semantic resources produced during the timeframe of the EHR4CR project, the model will be iterated to address the needs of any new specific research study or project intending to use any of the EHR4CR services (PFS, PRS or CTE), and any additionally-defined services.

The Board is in charge of managing and prioritizing the requests for authoring/curating semantic resources (templates, data elements, value sets, concepts and associations) that may be submitted to extend the existing set of resources. Requests require actions from various actors that hold roles and responsibilities based on the configured governance workflow, and these actions shall be executed within a predefined timeframe.

[6] Defined consistently with the governance principles defined by CDISC SHARE.


3.2 Mapping local semantic resources to the EHR4CR Common Information Model

The standardization pipeline addresses the organization of the mappings required as part of the set up of any specific project (clinical trial) intending to use any EHR4CR services (PFS, PRS or CTE), and potentially other services defined in the future. During the timeframe of the EHR4CR project (EHR4CR CIM version 0.1, version 0.2 & version 0.3), mappings were developed manually through repeated cycles using a "Learning by Doing" approach.

The Board will support the efficiency and quality of the mappings that are developed by hospital sites between their local semantic resources and the EHR4CR Common Information Model. It will provide some centralized resources to support the mapping process and its quality assurance, including standardized mappings, educational resources and data quality assessment tools. The Board will also advise on the necessary expertise and roles required within hospital sites to manage those mappings. The hospital itself will ultimately be responsible for the efficiency and quality of the mappings it defines and implements. The EHR4CR Service Provider and each hospital will negotiate and agree on the point at which the mappings and their implementation are sufficiently satisfactory to enable platform services to become operational.

3.3 Overview of the standardization process before the execution of any EHR4CR use case

Figure 9. Standardization process (figure label: Validate semantic resource, hospital perspective)

For each new study (clinical trial), authorized users of the EHR4CR platform (trial sponsors, hospitals, EHR4CR board), referred to as the requester, submit the feasibility study (PFS) or clinical trial (PRS, CTE) protocol and the corresponding list of pre-processed eligibility criteria and data items.
In collaboration with the requester [7], the Board identifies the need for the creation of new semantic resources (templates, value sets, concepts, and terms) in order to cover the scope of the use case, and assigns each request to one or more Reviewers. A Reviewer may add information to the request and can take action to approve or reject each of the new semantic resources created to fulfil the request. The resources are created under the responsibility of the Board, which assigns each request to Author/Curator(s) [8] responsible for developing the semantic resources corresponding to the request. A semantic resource is not published until the request has received the approval of the Reviewer. The Board is in charge of checking that the requests are addressed on time.

As soon as a request is fulfilled and the corresponding new semantic resources (templates, value sets, concepts, terms) are available centrally (as part of the EHR4CR Common Information Model), the study sponsor and the hospital sites of the network, especially those involved in the study, are notified in order to validate the resources and to check that the central terms used in the resources are mapped to their local terms. The Board is responsible for ensuring that each hospital site involved in the study has been notified of the mappings that need to be developed in the context of the clinical trial.

For each given project (clinical trial), the hospital site is in charge of identifying the mappings that need to be authored/curated and assigns the development of these mappings to Author/Curator(s). The mappings are not used until the request has received all necessary approvals.

[7] Ideally involving Investigators/Data managers in hospital sites.

EHR4CR Common Information Model (CIM) management

The Board defines the responsibilities and roles of the different actors using, or in charge of the maintenance and quality assurance of, the EHR4CR Common Information Model: semantic resource user, administrator, provider, author/curator, translator, reviewer, mapper and semantic enabled application developer.

A Resource User is an actor such as a subject matter expert or terminologist. Resource User activities include, but are not limited to, querying for specific resources and browsing or comparing resources. Specializations of the Resource User actors are defined in the figure and table below.
Depending on the kind of resources (terminology, value set, template/data element, mapping) or tasks (administration, authoring/curation, mapping, translation), one person can hold more than one role.

[8] Investigators/Data managers in hospitals are probably the best candidates for authoring data elements.

Figure 10. Use cases of the semantic resource management

Table columns: Actor (Roles); Description/Responsibilities; Tool; Organization

EHR4CR Semantic Resource Provider
  Description: The EHR4CR Semantic Resource Provider is the actor (individuals or organization) responsible for the development of the EHR4CR semantic resources, including external resources: templates (HL7 models, CEN archetypes, CDISC SDTM, CDISC CDASH, NCI CDE (caDSR), UCUM), value sets (provided by HL7/IHE) and code systems (ICD-10, ATC, SNOMED CT, LOINC, PathLex, etc.).
  Organization: EHR4CR Service Provider

EHR4CR Semantic Resource Administrator
  Description: The EHR4CR Resource Administrator is an actor responsible for ensuring the availability and overall maintenance of the EHR4CR semantic services. This includes, but is not limited to, loading content into the server and making available the required functionality to address the specific needs of users.
  Organization: EHR4CR Service Provider

EHR4CR Semantic Resource Author / Curator
  Description: The EHR4CR Resource Author / Curator is an actor, such as a subject matter expert or terminologist, who is responsible for maintaining EHR4CR semantic resources including, but not limited to, the development of new resources (templates, data elements, value sets, concepts and associations) that may be submitted to the Resource Provider as an extension of an existing set of resources. According to the type of resource that is authored/curated, we distinguish:
  o A Terminology Author / Curator is an actor who is responsible for maintaining new terminology content or the extension of an existing terminology with local concepts. Terminology Authors / Curators may not necessarily belong to the Terminology Provider's organization.
  o A Value set Author / Curator is an actor with specific domain knowledge, as well as expertise in controlled terminologies, who develops and maintains domain- or application-specific terminology value sets.
  o A Template/Data Element Author / Curator is an actor with specific domain knowledge, as well as expertise in information models and controlled terminologies, who develops and maintains domain- or application-specific templates and data elements.
  Tool: CIM Editor (used manually through the user interface or import of txt or csv files)
  Organization: EHR4CR Service Provider or Hospital sites

EHR4CR Semantic Resource Reviewer/Validator
  Description: The EHR4CR Semantic Resource Reviewer/Validator is an actor who is responsible for investigating requests and for validating new semantic resources. A reviewer may add information to the request and can take action to approve or reject the request.
  Tool: CIM Editor
  Organization: EHR4CR Service Provider or Hospital sites

EHR4CR Semantic Resource Human Language Translator
  Description: A Terminology Human Language Translator is an actor with domain knowledge who is also familiar with the languages and dialects which they are responsible for translating.
  Tool: CIM Editor
  Organization: EHR4CR Service Provider or Hospital sites

EHR4CR Terminology Mapper
  Description: An EHR4CR Terminology Mapper is an actor (human or system) that is responsible for creating or maintaining specialized associations, or mappings, between EHR4CR terminology and concepts from different code systems. An EHR4CR Terminology Mapper is in charge of validating and/or importing mappings provided by external providers (e.g. UMLS: mappings between SNOMED CT/MedDRA or SNOMED CT/ICD-10 or SNOMED CT/NCI thesaurus, etc.).
  Tool: CIM Editor
  Organization: EHR4CR Service Provider

Semantic Enabled Application Developer
  Description: A Semantic Enabled Application Developer is an actor who is responsible for the development of software applications that make explicit use of different types of semantic resources: templates, data elements, value sets, concepts. Semantic resources are used through standard semantic services specified in the dedicated section of the deliverable.
  Tool: Semantic interoperability services at EHR4CR workbench
  Organization: EHR4CR workbench at pharma company/clinical research units

All EHR4CR semantic resources may have different statuses: draft, cancelled, active, deprecated. At creation, semantic resources are automatically published in Draft status. A resource in Draft status may be cancelled, or activated if the resource gets validation. A resource in Active status may be deprecated.

Central/local mapping management

The Board also defines responsibilities and roles within hospital pilot sites of the EHR4CR network in the process of managing the mappings between local terminologies and the central terminologies used to mediate data integration in the EHR4CR project.

Table columns: Actor (Roles); Description/Responsibilities; Tool; Organization

Hospital Semantic Resource Provider
  Description: The Hospital Provider is the actor (individuals or organization) responsible for the development of the semantic resources (including imported external resources) at the hospital.

Hospital Semantic Resource Administrator
  Description: The Hospital Resource Administrator is an actor responsible for ensuring the availability and overall maintenance of the resources of the hospital repository. This includes, but is not limited to, loading content into the server and making available the required functionality to address the specific needs of users.

Hospital Semantic Resource Author / Curator
  Description: The Hospital Resource Author / Curator is an actor, such as a subject matter expert or terminologist, who is responsible for maintaining semantic resources including, but not limited to, the development of new resources (templates, data elements, value sets, concepts and associations) that may be submitted to the Resource Provider as an extension of an existing set of resources.

Hospital Terminology Human Language Translator
  Description: A Terminology Human Language Translator is an actor with domain knowledge who is also familiar with the languages and dialects which they are responsible for translating.

  Tool listed for the four roles above: (local terminology server)
  Organization listed for the four roles above: Hospital site

Hospital Terminology Mapper
  Description: The Hospital Terminology Mapper is an actor (human or system) that is responsible for creating or maintaining specialized associations, or "mappings", between concepts from different code systems.
  Tool: Terminology Mapping Suite (TMS)
  Organization: Hospital site

Semantic Resource Reviewer/Validator
  Description: The Semantic Resource Reviewer/Validator is an actor who is responsible for investigating requests and for validating new semantic resources (including mappings). A reviewer may add information to the request and can take action to approve or reject the request.
  Tool: Terminology Mapping Suite (TMS)
  Organization: Hospital site

Semantic Enabled Application Developer
  Description: A Semantic Enabled Application Developer is an actor who is responsible for the development of software applications that make explicit use of different types of semantic resources: templates, data elements, value sets, concepts. Semantic resources are used through standard semantic services specified in the dedicated section of the deliverable.
  Tool: Semantic interoperability services at Endpoint
  Organization: EHR4CR endpoint at hospital site

4 Semantic interoperability resources

A first version of the semantic interoperability platform supporting the different actors in accomplishing their tasks within the standardization process has been developed. This platform consists of a mediation model and a set of tools. This section presents the EHR4CR Common Information Model (CIM) (mediation model).

4.1 EHR4CR Common Information Model (mediation model)

What is the EHR4CR mediation model?

The EHR4CR Common Information Model (CIM) mediation model is an integrative semantic abstraction providing a homogeneous view. It makes it possible to mediate across the heterogeneous information models implemented in both hospital sites (source information models) and clinical research sites (target information models) for conducting clinical research studies.
This mediation model consists of a set of semantic resources 9 used for clinical data standardization and query specification. The common EHR4CR semantic resources consist in a shared set of templates and data elements with their associated value sets and concepts that enables to mediate across heterogeneous representations of patient-centric health information. The common EHR4CR semantic resources are stored and maintained in a metadata registry framework and are accessed through standardized interfaces: EHR4CR semantic interoperability services (SIS). 9 Any element of an information model or a terminology model used to represent the meaning of health information stored, shared, exchanged and/or processed by information systems. 29

4.1.2 How is the EHR4CR mediation model built and maintained?

The EHR4CR Common Information Model (mediation model) has been developed, and can be extended, through a global consensus-based development process [10] in order to cover the scope of both (i) eligibility criteria and data items identified from a given set of specific clinical trials (bottom-up approach) and (ii) standard reference clinical information models (top-down approach).

The EHR4CR Common Information Model is developed and evolves through repeated cycles using a "Learning by Doing" approach. A first iteration (version 0), based on a bottom-up approach, covered the scope of 14 clinical trials selected to demonstrate the "Protocol Feasibility Services" (PFS) use case (EHR4CR CIM version 0.1). A second iteration covered the scope of 17 additional clinical trials selected to demonstrate the "Patient Recruitment Services" (PRS) use case (EHR4CR CIM version 0.2). The third iteration covered the scope of 28 additional clinical trials selected to demonstrate the "Clinical Trial Execution" (CTE) use case (EHR4CR CIM version 0.3). Each new version of the EHR4CR Common Information Model has an extended scope and improved quality.

Table 5. List of versions of the EHR4CR Common Information Model (mediation model)

Approach
  Versions 0.1 to 0.3: BOTTOM UP (use case and clinical trials driven; see the list in the Appendix)
  Version 1.0: BOTTOM UP + TOP DOWN (cross-project harmonization) [11]
Scope (content)
  Version 0.1: PFS (n=14)
  Version 0.2: PFS+PRS (n=14+17=31)
  Version 0.3: PFS+PRS+CTE (n=14+17+28=59)
  Version 1.0: data elements from additional pharma/hospital clinical trials and from models (OMOP, CDISC SHARE)
Semantic classes/categories
  Versions 0.1 to 0.3: Demographics, Conditions, Diagnosis, Medications, Vital Signs, Results (lab, anatomic pathology), Procedures; later based on HL7 CCD sections & UMLS semantic types
  Version 1.0: cross-project harmonization
Common Element Templates
  Version 0.1: Observation, Substance administration, Procedure
  Version 0.2: Patient, Encounter, Observation, Medication statement, Procedure
  Version 0.3: FHIR resources (Patient, Encounter, Condition, Observation, Medication statement, Procedure)
  Version 1.0: cross-project harmonization
Terminologies/Ontologies
  Versions 0.1 to 0.3: ICD, SNOMED, LOINC, PathLex, ATC; later extended to ICD, SNOMED, LOINC, ATC, ICD-O, Pubcan, TNM, PathLex
  Version 1.0: cross-project harmonization
Mappings
  Mappings between reference terminologies from UMLS (CUI)
  Mapping between SNOMED CT and MedDRA, NCI-T, ICD-9, ICD-10, ICD-O

[10] Defined consistently with the governance principles defined by CDISC SHARE.
[11] A cross-project harmonization is being initiated as part of the Semantic Health Net initiative. The Semantic Interoperability task consists of identifying common fragments from models used in different European projects, e.g. EHR4CR, epSOS, EXPAND.

FHIR-based templates and data elements

The EHR4CR Common Information Model (CIM) consists of a set of multilingual semantic resources based on multiple standards (see Figures 11 and 12). The EHR4CR templates are based on FHIR resources (Patient, Encounter, Condition, Observation, Procedure and MedicationStatement) (see Table 6). FHIR-based resources were organized into categories based on HL7 CCD sections and UMLS semantic types: Demographics, Encounters, Advance directives, Problems, Family History, Social History, Alerts, Medications, Immunizations, Vital Signs, Results (lab, anatomic pathology), Procedures, Plan of Care, Lifestyle Choice, Ethical consideration. FHIR resources were enriched in order to fulfil the requirements of the project and represent the required semantic content. Specific value sets were defined for some data elements of the FHIR templates.

Figure 11. FHIR-based resources were organized into categories based on HL7 CCD sections and UMLS semantic types.

For example, the clinical observable entity "Eastern Cooperative Oncology Group (ECOG) performance status" is defined using the template designed for clinical observations.

EHR4CR templates are composed of data elements that are bound to a set of international reference terminologies selected by the project: ICD, SNOMED-CT, LOINC, ATC, ICD-O, Pubcan, TNM, PathLex. Whenever possible, these terminologies are imported into the collaborative editor from the official source of the terminology provider, so that the EHR4CR resources are bound to up-to-date terminologies. The terminology binding is done by defining the value sets that correspond to the data elements of each template. Figure 12 illustrates the terminology binding done for the observable entity "ECOG performance status". The EHR4CR editing tool supports faceted templates.
We defined a limited set of generic templates (e.g. Observation) with facets, so that each code of the template (e.g. Observable entity SCT/ /ECOG performance status) can define its own corresponding value set (e.g. SCT/ /ECOG performance status finding).
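The faceted-template idea can be illustrated with a minimal Python sketch. It is not the EHR4CR editor's data model: the class names, the `is_valid` helper and the concept identifiers (deliberately written as `PLACEHOLDER-...`, since the real SNOMED CT codes are not given here) are assumptions made for illustration only. The sketch shows one generic Observation template in which each observable-entity code carries its own value set.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    system: str  # terminology, e.g. "SCT"
    code: str    # placeholder identifier, NOT a real terminology code
    label: str


@dataclass
class FacetedTemplate:
    """Generic template: each code (facet) is bound to its own value set."""
    name: str
    facets: dict = field(default_factory=dict)  # Concept -> list of Concept

    def add_facet(self, code, value_set):
        self.facets[code] = list(value_set)

    def is_valid(self, code, value):
        # A value is acceptable only if it belongs to the value set
        # bound to this particular code.
        return value in self.facets.get(code, [])


# Hypothetical binding for the ECOG example (grades 0 to 5).
ecog = Concept("SCT", "PLACEHOLDER-ECOG", "ECOG performance status")
ecog_values = [
    Concept("SCT", f"PLACEHOLDER-ECOG-{grade}", f"ECOG {grade}")
    for grade in range(6)
]

observation = FacetedTemplate("clinicalObservation")
observation.add_facet(ecog, ecog_values)

print(observation.is_valid(ecog, ecog_values[2]))
```

Keeping the value set attached to the code, rather than to the template, is what lets a small number of generic templates cover many observable entities.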


As much as possible, we enriched and/or merged reference terminologies in order to build multilingual terminologies and value sets (at least in English and French and, when possible, in the four languages of the EHR4CR partners: English, French, German and Polish).

Figure 12. Screenshot of the EHR4CR collaborative editing tool. The clinical observable entity "Eastern Cooperative Oncology Group (ECOG) performance status" is defined using the template designed for clinical observations (see Table 6).

Terminology binding. The data element "code" (DataType=ConceptDescriptor (CD)) is associated with a value set defined as a set of top SNOMED CT or LOINC codes, e.g. SCT/ /ECOG performance status. The data element "value" (DataType=ConceptDescriptor (CD)) is associated with a value set defined as a set of concepts (ordered children of SCT/ /ECOG performance status finding: 0/SCT/ ECOG 0; 1/SCT/ ECOG 1; 2/SCT/ ECOG 2; 3/SCT/ ECOG 3; 4/SCT/ ECOG 4; 5/SCT/ ECOG 5).

The current version of the EHR4CR CIM includes 6 FHIR-based templates (and 6 additional specialized templates) and a subset of 15 corresponding data elements. Table 6 describes the content scope of the templates. Four patient demographic data elements (gender, birth time, deceased indicator, and deceased time) are part of the Patient template. Four data elements (code, discharge disposition code, effective time, and length of stay) are part of the Encounter template. We distinguished two types of Conditions: diseases on the one hand, and signs and symptoms on the other. We defined 25 categories of diagnoses (including discharge diagnosis, primary diagnosis, secondary diagnosis, admitting diagnosis, etc.). Diseases are encoded using codes from a value set combining ICD-10 (n=12,318 codes) and a subset of SNOMED CT codes.

In the current version we defined four specialized Observation templates and defined clinical observable entities (n=26), vital signs (n=5), laboratory observable entities (n=2000) and anatomic pathology observable entities (n=80). Value sets corresponding to categorical observable entities were defined and populated with more than 1000 codes from SNOMED CT, ICD-O (Pubcan), TNM, PathLex and EHR4CR-T. As part of the Procedure template, we defined a small value set of SNOMED CT procedures (n=57). As part of the MedicationStatement template, we selected ATC (n=5,655 codes) as the value set attached to the data element consumablecode. The terminology binding of the EHR4CR CIM involves more than concepts from internationally used reference terminologies. All the concepts are at least bilingual (English and French).

Table 6: Description and structure of the six core FHIR-based templates of the EHR4CR mediation model.

Columns: Template (nb. of data elements); Template scope; Specialized template scope; Data element; Terminology binding (value set); Nb. of concepts.

Patient (n=4)
  Template scope: A Patient is a uniquely identified person. Clinical statements attached to this Patient may be recorded within the source systems.
  Data elements:
    administrativegendercode: SCT gender types (4 concepts)
    birthtime
    deceasedind
    deceasedtime

Encounter (n=4)
  Template scope: An Encounter occurrence corresponds to a period of time during which a Patient continuously receives medical services from one or more providers at a care site in a given setting within the health care system.
  Data elements:
    Code: SCT encounter types (6 concepts)
    dischargedispositioncode
    effectivetime
    lengthofstayquantity

Condition (n=2)
  Template scope: Conditions state the presence of a clinical disease, sign or symptom, etc.
  Specialized template nondiseasecondition: corresponds to symptoms (observed by the patient) or signs (observed by a care provider).
    Category: SCT condition types (4 concepts)
    Code: subset of SCT findings (16 concepts)
  Specialized template diseasecondition: inferred from medical claims data, textual clinical documents, data collected via forms (e.g. from a problem list), etc.
    Category: SCT diagnostic types (25 concepts)
    Code: diseases (ICD10 + subset of SCT diseases)

clinicalobservation (n=2)
  Template scope: A (numerical or categorical) Observation is a sign or a symptom or the result of any procedure which is either observed by a Provider or reported by the Patient.
  Specialized template clinicalobservation: records of measurements performed by a clinician at the bedside (including scores, grades, stages, etc.).
    Name: subset of SCT observable entities (26 concepts)
    Value: value sets specific to each categorical observable entity (95 concepts)
  Specialized template vitalsignobservation: refers to blood pressure, body temperature, pulse rate and respiratory rate.
    Name: subset of SCT vital signs (5 concepts)
    Value
  Specialized template laboratoryobservation: refers to laboratory tests.
    Name: subset of LOINC codes (Top 2000) (2000 concepts)
    Value: value sets specific to each categorical observable entity (>500 concepts)
  Specialized template anatomicpathologyobservation: records of measurements performed by a pathologist analyzing tissues/cells with a microscope (including scores, grades, stages, etc.).
    Name: subset of LOINC codes (Top 80) (80 concepts)
    Value: value sets specific to each categorical observable entity (e.g. ICD-O, TNM, etc.)

Procedure (n=1)
  Template scope: A Procedure occurrence corresponds to the record of an activity or process ordered by, or carried out by, a healthcare provider on the patient with a diagnostic or therapeutic purpose. Procedures are inferred from medical claims, computerized orders in EHRs, etc.
  Data elements:
    Code: subset of SCT procedures

Medication Statement (n=2)
  Template scope: A medication statement is inferred from clinical events associated with orders, prescriptions written, pharmacy dispensing, procedural administrations, and other patient-reported information. Medication includes medicines, vaccines, and large-molecule biologic therapies.
  Data elements:
    administrationunitcode
    consumablecode: ATC codes (6000 concepts)

The current limited set of FHIR-based templates allows the representation of the main textual clinical data (signs, symptoms, diseases, outcome, procedures, care plans, etc.). We defined context-dependent value sets for representing multiple views or contextual information (e.g. organ-specific scores or histologic types, etc.).

Terminologies/Ontologies

As much as possible, existing resources are imported. Some of the external resources overlap (e.g. ICD-10 and SNOMED CT; MedDRA and SNOMED CT; NCI-T and SNOMED CT). Associations between these reference terminologies are available in UMLS. Some of the external resources need to be translated and/or extended, and the EHR4CR translations/extensions need to be captured and managed. Finally, some specific resources need to be created. An EHR4CR terminology was created to hold concepts that are in the scope of the project but do not exist in the selected reference terminologies. We integrated the UMLS CUI in order to allow multi-terminology binding.

Table 7. Summary of selected reference terminologies used in EHR4CR

Columns: Terminology/Ontology (Name, Provider, Availability, Steward/Custodian); Description (General information, Technology, Use); Conclusion (Use in the EHR4CR project & issues).

ICD-10
  Terminology/Ontology: /classifications/icd/en/ Developed by WHO, managed by a Revision Steering Group. Available to download free of charge by license for non-commercial research purposes.
  Description:
    Terminology model: A multilingual first-generation coded classification system, using a fixed subsumption hierarchy with a simple semantic list, alphanumerically referenced.
    Number of concepts: 14,400 concepts.
    Format: CSV database files.
    Languages: 6 official WHO languages (Arabic, Chinese, English, French, Russian and Spanish) and a total of 42 languages.
    Use & scope: International standard diagnostic classification for all general epidemiological purposes, many health management purposes and clinical use. Covers all diseases, morbidity associated with pregnancy, childbirth and the puerperium, congenital malformations and abnormalities, a wide variety of signs, symptoms, abnormal findings and health complaints, factors influencing health status (e.g. social circumstances) and categories for external causes of injury or disease (e.g. poisoning, transport accidents).
  Conclusion:
    Rationale: Selected as reference terminology due to broad international use.
    Issues: Licensing issue for commercial use. Non-standard ICD-10 extensions. Classification with a single-axis subsumption hierarchy, hence of limited value for expanding higher-level concepts when searching or querying within the EHR4CR applications.

ICD-O-3
  Terminology/Ontology: /classifications/icd/adaptations/oncology/en/ Developed by WHO, managed by the Secretariat / WHO International Association of Cancer Registries c/o International Agency for Research on Cancer (Lyon, France). Available to download free of charge.
  Description:
    Terminology model: A multi-axial classification of the site, morphology, behavior, and grading of neoplasms. The topography axis uses the ICD-10 classification of malignant neoplasms (except those categories which relate to secondary neoplasms and to specified morphological types of tumors) for all types of tumors, thereby providing greater site detail for non-malignant tumors than is provided in ICD-10. In contrast to ICD-10, ICD-O includes topography for sites of hematopoietic and reticuloendothelial tumors. The morphology axis provides five-digit codes ranging from M-8000/0 to M-9989/3. The first four digits indicate the specific histological term. The fifth digit, after the slash (/), is the behavior code, which indicates whether a tumor is malignant, benign, in situ, or uncertain (whether benign or malignant). A separate one-digit code is also provided for histologic grading (differentiation).
    Number of concepts:
    Format: CSV database files.
    Languages: Chinese, Czech, English, Finnish, Flemish/Dutch, French, German, Japanese, Korean, Portuguese, Spanish, Romanian, Turkish.
    Use & scope: Used principally in tumor or cancer registries for coding the site (topography) and the histology (morphology) of neoplasms, usually obtained from a pathology report.
    Creation date: Last date change: 2000
  Conclusion:
    Rationale: Selected as reference terminology due to broad international use.
    Issues: Possible licensing issue for commercial use. Non-standard ICD-O extensions.

Pubcan
  Terminology/Ontology: Developed by WHO, managed by the Secretariat / WHO International Association of Cancer Registries c/o International Agency for Research on Cancer (Lyon, France). Not available to download.
  Description:
    Terminology model: A classification providing a pre-coordinated representation of the site, morphology, behavior, and grading of neoplasms, based on ICD-O-3.
    Number of concepts:
    Format: no available export format.
    Languages: English.
    Use & scope: same as ICD-O-3.
    Creation date & last date change:
  Conclusion:
    Rationale: Selected as reference terminology, since it is built from the broadly internationally used ICD-O and is useful for pre-coordinated representation of the site, morphology, behavior, and grading of neoplasms.
    Issues: Possible licensing issue for commercial use. Classification with a single-axis subsumption hierarchy, hence of limited value for expanding higher-level concepts when searching or querying within the EHR4CR applications.

LOINC (Logical Observation Identifiers Names and Codes)
  Terminology/Ontology: onal Developed by Regenstrief Institute, Inc., Indianapolis, USA. Free of charge to all users.
  Description:
    Terminology model: A multilingual second-generation vocabulary and coding system.
    Number of concepts: > 72,000 (including > 50,000 laboratory terms).
    Format: CSV format text file, Access database and release-to-release files (Change File and Change Report).
    Tooling and support:
    Languages: English, German, Spanish, French, Chinese.
    Use & scope: A set of universal names and ID codes for identifying laboratory test results or clinical observations. Covers the usual categories of chemistry, hematology, serology, microbiology and toxicology; concepts for vital signs, hemodynamics, intake/output, EKG, obstetric ultrasound, cardiac echo, urologic imaging, gastroendoscopic procedures, pulmonary ventilator management, selected survey instruments (e.g. Glasgow Coma Score, PHQ-9 depression scale, CMS-required patient assessment instruments), and other clinical observations.
    Creation date: Last date change: (LOINC 2.50)
  Conclusion:
    Rationale: Subsets of LOINC selected as reference terminology due to broad international use (more than 35,000 people in 163 countries).
    Issues: Formal representation of observable entities, but non-standard data types and missing formal representation of value sets. The broad scope of LOINC requires a step-by-step import and mapping strategy based on subsets (e.g. LOINC Top 2000 results, LOINC Top 300 orders).

ATC (Anatomical Therapeutic Chemical Classification System)
  Terminology/Ontology: c_ddd_index/ Developed by the WHO Collaborating Centre for Drug Statistics Methodology (Norwegian Institute of Public Health). Cost = 200 (no formal license needed when the ATC system is an integrated part of a database).
  Description:
    Terminology model: Five-level classification.
    Number of concepts: 5717 concepts (4464 concepts are ATC 5th level, i.e. substance level).
    Format: no available export format.
    Languages: English, Spanish and German.
    Use & scope: Classification of medicines according to their active substance(s). Medicines are divided into different groups according to the organ or system on which they act and their therapeutic, pharmacological and chemical properties. Covers all medicinal substances that are active ingredients in licensed medicinal products internationally.
    Creation date & last date change:
  Conclusion:
    Rationale: Selected as reference terminology due to broad international use. Useful when eligibility criteria reference the use of medicines by therapeutic category.
    Issues: Primarily designed for pharmacoepidemiology. Does not contain the lower-level medicinal product concepts (for example, it contains "atenolol" but not "atenolol 50mg tablets"). Does not support any dosage representation or calculation. Substances formulated into a number of different types of medicinal products (for example hydrocortisone) may have several different ATC codes. Some national extensions to ATC include lower-level concepts.

SNOMED CT
  Terminology/Ontology: Owned since April 2007 by the International Health Terminology Standards Development Organization (IHTSDO). Requires a license (national membership OR affiliate license).
  Description:
    Terminology model: A multilingual thesaurus with an ontological foundation. Concepts are organized into acyclic taxonomic (is-a) hierarchies. Concepts may have multiple parents. Created by the merger, expansion, and restructuring of the College of American Pathologists (CAP) SNOMED RT (Reference Terminology) and the UK National Health Service (NHS) Clinical Terms (also known as the Read codes).
    Number of concepts: 311,000 concepts linked by approximately 1,360,000 links, called relationships (2011).
    Format: EL++ formalism, incorporated into OWL 2 as the OWL 2 EL profile.
    Tooling and support: Numerous online and offline browsers are available.
    Languages: American English, British English and Spanish, with other translations under way or nearly completed in French, Dutch, Danish, and Swedish.
    Use & scope: Body structure, Clinical finding, Environment or geographical location, Event, Observable entity, Organism, Pharmaceutical / biologic product, Physical force, Physical object, Procedure, Qualifier value, Record artifact, Situation with explicit context, Social context, Specimen, Staging and scales, Substance.
    Creation date: January Last date change:
  Conclusion:
    Rationale: Subsets of SNOMED CT are candidates for selection as reference terminology due to international use and the ontological foundation of SNOMED CT. License cost is a limitation in non SNOMED CT member countries.
    Issues: Limited reasoning capabilities due to omissions and redundancies of semantic content (duplicate primitive and defined concepts). Cost of the affiliate license for non SNOMED CT member countries.

PathLex
  Terminology/Ontology: ( ) nical_framework/upload/IHE_PAT_Suppl_APSR_Appendix_value_sets_2011_03_31.xls Developed by IHE and HL7 Anatomic Pathology in collaboration with the College of American Pathologists (CAP) and ADICAP. Started in 2010.
  Description:
    Terminology model: A lexicon unifying and supplementing other terminologies, such as SNOMED CT, LOINC and ICD-O, to ensure semantic consistency within and across standards (HL7 v2.5, HL7 v3, DICOM).
    Format: CSV files.
    Number of concepts: 1700 terms or expressions.
    Use & scope: Published as part of the IHE content profile Anatomic Pathology Structured Report (APSR): 21 HL7 CDA templates (a generic template for APSR and 20 organ-specific templates, including templates for cancer-specific organizers).
  Conclusion:
    Rationale: Subsets of PathLex are candidates for selection as reference terminology to represent grades and scores in anatomic pathology (when LOINC or SNOMED CT codes are not available).
    Issues: Limited scope (20 IHE CDA templates of structured reports).

TNM (UICC and AJCC staging systems)
  Terminology/Ontology: ources/tnm Maintained by the Union for International Cancer Control (UICC) & American Joint Committee on Cancer (AJCC). Purchased at Wiley Online Library tnms04.pub3/full
  Description:
    Terminology model: multi-axial classification.
    Format: PDF.
    Number of concepts:
    Languages: Arabic, Chinese, Czech, French, German, Italian, Japanese, Latvian, Polish, Portuguese, Russian, Turkish.
    Use & scope: Assessing the diagnosis, treatment, and prognosis of patients with cancer. Internationally agreed-upon standards to describe and categorize cancer stages and progression. It contains important updated organ-specific classifications that oncologists and other professionals who manage patients with cancer can use to accurately classify tumors for staging, prognosis and treatment. The UICC TNM staging system is the common language in which oncology health professionals can communicate on the cancer extent for individual patients as a basis for decision making on treatment management and individual prognosis, but it can also be used to inform and evaluate treatment guidelines, national cancer planning and research.
    Creation date: Last date change: Edition 7 published
  Conclusion:
    Rationale: Selected as reference classification due to broad international use to aid treatment planning, provide an indication of prognosis, assist in the evaluation of treatment results, facilitate the exchange of information between treatment centers, contribute to continuing investigations of human malignancies, and support cancer control activities, including through cancer registries.
    Issues: Cost of the classification.

Semantic Resource Repository

The semantic resources are stored in a semantic metadata repository (MDR). We use the term metadata (literally "data about data") to distinguish data collection structures from the patient data that populate those structures, i.e. instance-level data. Metadata should be described using a well-defined metadata schema so as to represent the semantics of the instance data, and will include concepts and relationships as well as bindings to terminologies. A metadata schema may be expressed in a number of different languages, e.g. HTML, XML, UML, RDF, etc.

We used the international standard ISO/IEC to define metadata. This standard provides the definition of a "data element" registry, describing disembodied data elements. It is important to note that ISO/IEC covers just the definition of elements and does not dictate the persistence structures or retrieval strategies. In the healthcare domain, another ISO standard, ISO , plays a key role in the ISO/IEC based data element definitions, since it provides the appropriate formal representation of the data type for the Data Element Concept and of any Value Domain data type. In particular, ISO provides a formal representation of the coded data types and addresses the binding with terminologies.

4.2 Tools and services

Tools and services have been developed to support the different actors in accomplishing their tasks within the standardization pipeline. These tools support:
- the management of the mediation model (CIM Editor)
- the mapping between hospital local sources and the mediation model, which is used for clinical data transformation (standardization) during the ETL processes and/or for query transformation.
We distinguish:
  o terminology mapping between local terminologies used in hospital CDWs/EHRs and reference terminologies used in the mediation model (central EHR4CR Common Information Model), supported by the Terminology Mapping Suite (TMS)
  o structural mapping between information models of CDWs (i2b2 and EHR4CR CDW) and the mediation model (done manually).
- the terminology mapping between electronic Case Report Forms (eCRFs in CDISC ODM format) and the mediation model, supported by the SDM-ODM CDISC Editor extension.

Table 8 provides an overview of the tooling developed to support the different actors in accomplishing their tasks within the standardization pipeline.

Table 8. List and description of the supportive tools and services

Columns: Tool; Type of tool/service; Users & role; Description; Availability (Yr 2, Yr 3, Yr 4, Yr 5).

Common Information Model Editor (CIME)
  Type: Web-based editor
  Users & role: Service provider (Board) / Hospital sites
  Description: Authoring/curating EHR4CR semantic resources
  Availability: V0, V1, V1, V2

Terminology Mapping Editor (TME) and Terminology Mapping Services
  Type: Web-based editor
  Users & role: Service provider (Board): authoring/curating and validating terminology mapping (reference terminologies). Hospitals (mapping edition, mapping validation): authoring/curating and validating terminology mapping (local terminologies / EHR4CR terminologies)
  Availability: V0, V1

EHR4CR Terminology Mapping Status Manager (TMSM)
  Users & role: Hospitals (mapping status)
  Description: Management of the mapping tasks (local terminologies / EHR4CR terminologies): access to current status and worklists

SDM-ODM editor extension
  Type: Software
  Users & role: Users (sponsors, investigators)
  Description: CDISC ODM editor (semantic-enabled application using semantic interoperability services for annotation of CDISC ODM files). Search/Access: search for a concept or browse concept hierarchies; get the associated data type, value set and unit
  Availability: V0 V1 V

EHR4CR Common Information Model Editor (CIME)

Functional scope

The EHR4CR CIM Editor allows the EHR4CR Semantic Resource Author / Curator to:
- Browse the repository of EHR4CR semantic resources (Common Element Templates (e.g. observations, procedures, substance administrations, etc.), Common Data Elements, Value Sets and Terminologies)
- Search for any type of EHR4CR semantic resource
- Edit new Common Element Templates (e.g. observations, procedures, substance administrations, etc.), Data Elements and Value Sets, and link them to reference terminologies (terminology binding)
- Import semantic resources (Clinical Elementary Templates, Data Elements, Value Sets and terminologies) from external providers (e.g. UMLS, BioPortal, HL7, etc.)
- Export any type of EHR4CR semantic resource (Clinical Elementary Templates, Data Elements, Value Sets, terminology binding and query specification) in standard formats (e.g. SKOS)
- Create/modify the model of the EHR4CR semantic resources (Common Element Templates (e.g. observations, procedures, substance administrations, etc.), Common Data Elements, Value Sets and Terminologies)


Figure 13. Common Information Model (CIM) Editor: A collaborative editor for authoring/curation of various semantic resources: Templates, Data Elements, Value Sets, Terminologies.

To manage the terminology binding during the creation of new data elements or value sets, the CIM Editor provides a search interface for selecting standard concepts from reference terminologies. Standard terminologies currently supported include the HL7 vocabulary, ICD-10, LOINC, ATC, SNOMED CT, PathLex, ICD-O, and UCUM.

EHR4CR Terminology Mapping Suite (TMS)

Functional scope

The terminology mapping process, supported by TMS, is guided by two main workflows: (i) the Terminology Loading workflow and (ii) the Terminology Mapping workflow.

Terminology Loading Workflow

The Terminology Mapper uploads the local value sets corresponding to the scope of the EHR4CR mediation model (i.e. a set of central value sets such as diagnosis codes, clinical findings or vital signs codes, procedure codes, etc. (see 4.1.4)) into the Terminology Mapping Editor (TME) using a predefined loading format.

Figure 14. Terminology loading workflow

4.2.2.1.2 Terminology Mapping Workflow

Terminology Mapping Workflow

Figure 15. Terminology mapping workflow

The Terminology Mapper uses the Terminology Mapping Editor (TME) GUI to set up the scope of the mapping task by selecting the local and central value sets to be mapped. Within the defined scope, the Terminology Mapper uses the TME GUI to perform manual mappings. In addition, he/she uses the Terminology Mapping Services to run automatic processes that find new mappings and validate existing ones. At the end of the automatic process he/she displays the results (new mappings and/or erroneous mappings) and completes the mapping.

Before validation, the mapping has draft status. Once validated by the Terminology Mapper, the mapping is frozen and the system prevents any changes to it. When the mapping is available for review, the Mapping Reviewer reviews the mappings corresponding to the request assigned to him/her. Once validated by the Mapping Reviewer, the mapping is in the production state.

Evaluating local mapping coverage is very important for data providers because mapping coverage has a direct impact on query performance: the wider the coverage, the more accurate and sensitive the queries. The Terminology Mapper uses the EHR4CR Terminology Mapping Status Manager (TMSM) to visualize the mapping coverage in his/her hospital. The Terminology Mapper identifies the value sets of the mediation model (e.g. diagnosis codes, clinical findings or vital signs codes, procedure codes) for which the mapping is missing or incomplete, and the list of central concepts of each value set that still need to be mapped.

If any update occurs in a local value set, the new version has to be uploaded into TME. Using the TMSM, the Terminology Mapper identifies any need to update existing mappings between local and central value sets after a change has occurred in either a local or a central value set. The Terminology Mapper can use the Terminology Mapping Services to run automatic processes that update any existing mapping between a local and a central value set after such a change. Already validated mappings are preserved during the execution of the automatic processes.

Terminology Mapping Editor (TME) and Terminology Mapping Services

Tool description

The Terminology Mapping Editor (TME) provides terminology experts with an interactive and collaborative interface for defining mappings between terms from two different value sets and/or terminologies. TME presents the content of the two user-selected value sets in a hierarchical view and allows the user to (i) browse and search clinical terms in each of the two value sets, (ii) define 1-1 or 1-many mapping relations between terms across the two value sets and (iii) define the mapping type (exact match, narrow match, broad match, close match) (see Figure 16).
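The mapping types and the draft, frozen and production states described above can be sketched as a small state machine. This is only an illustrative sketch: the class and method names (`Mapping`, `validate_by_mapper`, `validate_by_reviewer`) and the example codes are assumptions made here, not part of the TME implementation.

```python
from dataclasses import dataclass
from enum import Enum


class MatchType(Enum):
    EXACT = "exact match"
    NARROW = "narrow match"
    BROAD = "broad match"
    CLOSE = "close match"


class Status(Enum):
    DRAFT = "draft"
    FROZEN = "frozen"
    PRODUCTION = "production"


@dataclass
class Mapping:
    """Hypothetical record linking one local term to one central term."""
    local_code: str
    central_code: str
    match_type: MatchType
    status: Status = Status.DRAFT

    def edit(self, match_type: MatchType) -> None:
        # Changes are only allowed while the mapping is still a draft.
        if self.status is not Status.DRAFT:
            raise PermissionError(f"mapping is {self.status.value}; edits are blocked")
        self.match_type = match_type

    def validate_by_mapper(self) -> None:
        # Validation by the Terminology Mapper freezes the mapping.
        if self.status is not Status.DRAFT:
            raise ValueError("only a draft mapping can be frozen")
        self.status = Status.FROZEN

    def validate_by_reviewer(self) -> None:
        # Validation by the Mapping Reviewer moves the mapping to production.
        if self.status is not Status.FROZEN:
            raise ValueError("only a frozen mapping can go to production")
        self.status = Status.PRODUCTION
```

A mapping created with `Mapping("LOCAL:123", "ATC:J07BB", MatchType.CLOSE)` can be edited until `validate_by_mapper()` is called; any later call to `edit` raises an error, mirroring the rule that a frozen mapping cannot be changed.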

Figure 16. EHR4CR Terminology Mapping Editor

A set of Terminology Mapping Services is used to provide (i) an initial version of the mapping between a local and a central value set, (ii) a validation of any mapping between a local and a central value set, and (iii) an update of any existing mapping between a local and a central value set after a change has occurred in either the local or the central value set.

EHR4CR Terminology Mapping Status Manager (TMSM)

Tool description

The GraphicalMappingValidator (GMV) is a first version (v0) of the EHR4CR Terminology Mapping Status Manager (TMSM). It fetches data from three different sources (the central terminology server, the local mapping server and the CDW), then aligns and formats the data into a FreeMind-compatible file that can be read and displayed by the FreeMind tool.

Figure 17. EHR4CR main categories displayed in the GraphicalMappingValidator tool

The GMV script retrieves the whole EHR4CR central terminology hierarchy and then adds the local items wherever an alignment is found in the local mapping server. To explore the mapping, the user navigates through the hierarchy by opening the intermediate levels.

Figure 18. Mapped central items (in green) and the corresponding mapped local items (in blue) for the "Drug or medicament" category

Mapped central items are displayed with a green background and local items with a blue background. A central item that is not yet mapped to a local item is displayed with a white background. The GMV can also display the mapping coverage (number of central items mapped / total number of central items) for each branch and for the overall hierarchy (Figures 19 and 20).

Figure 19. Mapping coverage statistics (nb_concepts, nb_mapped_concepts and mapping_coverage) displayed in the GMV tool

Figure 20. Mapping coverage statistics (details) for the "Drug or medicament" category

The GMV can also display the number of patients and the number of observations in the CDW for each local item (Figure 21).


Figure 21. CDW counts for each local item {n = number of patients; p = number of observations} mapped to the central item [ATC:J07BB Influenza vaccines]

The CDW counts enable an additional mapping validation because they show how many patients and observations can theoretically be addressed by the EHR4CR central items.

Technical implementation

The GraphicalMappingValidator (GMV) is based on the FreeMind software (http://freemind.sourceforge.net), which is known for organizing and managing any kind of data structure in a very simple manner. The main development effort in setting up this tool was generating a FreeMind-compatible file that contains all the information needed to audit and validate the local mapping. From a technical point of view, the GMV script merges data extracted from three different EHR4CR components in order to generate the GMV FreeMind file:
- The EHR4CR central terminology server, to fetch the central items hierarchy
- The local mapping server, to fetch the existing local items
- The Clinical Data Warehouse (CDW), to fetch the patient and observation counts

Figure 22. The GMV architecture
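The merge of the three sources and the coverage statistics can be illustrated with a short sketch. It is not the GMV Groovy script: the data structures, the function names and the sample local codes and counts are assumptions, and it assumes that every node of the central hierarchy counts as one concept.

```python
from dataclasses import dataclass, field


@dataclass
class CentralItem:
    """One node of the central terminology hierarchy (hypothetical structure)."""
    code: str
    children: list = field(default_factory=list)
    local_items: list = field(default_factory=list)  # [(local_code, (n, p)), ...]


def attach_local_items(item, local_mapping, cdw_counts):
    # Merge the local mapping and the CDW counts into the central hierarchy.
    item.local_items = [
        (local_code, cdw_counts.get(local_code, (0, 0)))
        for local_code in local_mapping.get(item.code, [])
    ]
    for child in item.children:
        attach_local_items(child, local_mapping, cdw_counts)


def coverage(item):
    # Return (nb_concepts, nb_mapped_concepts) for the branch rooted at item.
    nb_concepts = 1
    nb_mapped = 1 if item.local_items else 0
    for child in item.children:
        child_total, child_mapped = coverage(child)
        nb_concepts += child_total
        nb_mapped += child_mapped
    return nb_concepts, nb_mapped


# Invented sample data standing in for the three sources.
hierarchy = CentralItem(
    "Drug or medicament",
    children=[CentralItem("ATC:J07BB"), CentralItem("ATC:A10AB")],
)
local_mapping = {"ATC:J07BB": ["LOCAL:FLU1", "LOCAL:FLU2"]}
cdw_counts = {"LOCAL:FLU1": (120, 150), "LOCAL:FLU2": (8, 8)}  # (n patients, p observations)

attach_local_items(hierarchy, local_mapping, cdw_counts)
nb_concepts, nb_mapped_concepts = coverage(hierarchy)
mapping_coverage = nb_mapped_concepts / nb_concepts
```

In this sample one of the three central items is mapped, so the branch coverage is one third; a real export would additionally write the merged tree to a FreeMind file with the colour attributes described above.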

The GMV script is implemented as a Groovy script that is accessible to the EHR4CR developer community in the project subversion repository. Groovy is based on the Java language (the GMV script can be run on any platform that already supports Java) and offers scripting syntax shortcuts that speed up application development. GMV, as a temporary solution for evaluating and auditing the mapping, has benefited greatly from this feature.

Tool limitations and future developments

The GMV tool is only a visualization tool: the mapping cannot be changed or exported to the local mapping server. However, thanks to 1) the navigation features of FreeMind, 2) the colour attributes (green/blue for mapped central/local items) and 3) the content statistics (mapping coverage, patient counts, observation counts), the user can very easily get an overview of the mapping coverage.

Structural mapping

The EHR4CR structural mapping layer addresses the connection of the EHR4CR local components to different local clinical data warehouses. Clinical data warehouses may differ from one site to another in numerous respects, among which:
- Data storage model (HL7 based, star-schema based, snowflake based, ...)
- Database engine: SQL based (Oracle, MSSQL, PostgreSQL, ...), NoSQL based (Hadoop, Neo4J, ...) or object oriented (Caché, ...)

The structural mapping layer makes it possible to take all these configurations into account by explicitly defining, for a predefined set of ECLECTIC templates (i.e. the high-level medical objects), the database statements that the platform will use to retrieve the corresponding data from the clinical data warehouse. Each template (GENERAL, PROCEDURE, MEDICATION, ...) requires a predefined set of data elements that must be returned by the database statement (SQL-based most of the time). In some contexts, the template needs parameters (for example, to get the patient data for a given set of medication codes). The structural mapping layer therefore uses a temporary table (Q_CD) in which these parameters are stored. Some database statements use this table to fetch the template dataset. The database statements for all ECLECTIC templates are gathered in an XML-based configuration file.

During the project two different clinical data warehouse engines have been used:
1. The native EHR4CR-based CDW
2. The i2b2-based CDW

A different structural mapping configuration file has been designed for each of these two CDW flavours.
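The role of the Q_CD parameter table can be shown with a small, self-contained sketch that uses Python's built-in sqlite3 module in place of a site database. The table layouts are reduced to a few columns, the sample rows are invented, and the `run_template` helper is an assumption; only the join pattern on Q_CD follows the i2b2 statements of this section.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient_dimension (patient_num INTEGER, birth_date TEXT, sex_cd TEXT);
CREATE TABLE observation_fact (patient_num INTEGER, concept_cd TEXT, start_date TEXT);
CREATE TEMP TABLE q_cd (num INTEGER, codesystem TEXT, code TEXT);
INSERT INTO patient_dimension VALUES (1, '1950-01-01', 'M'), (2, '1962-05-17', 'F');
INSERT INTO observation_fact VALUES
  (1, 'ATC:A10AB', '2014-03-02'),
  (2, 'ATC:J07BB', '2014-10-11'),
  (2, 'ICD10:I21', '2015-01-20');
""")

# Simplified medication template statement: the codes are not embedded in the
# SQL text, they are read from the Q_CD parameter table.
MEDICATION_QUERY = """
SELECT s.patient_num AS idpatient,
       a.start_date AS effectivetime,
       q.num AS num,
       substr(a.concept_cd, instr(a.concept_cd, ':') + 1) AS localcode
FROM observation_fact a
JOIN patient_dimension s ON a.patient_num = s.patient_num
JOIN q_cd q ON a.concept_cd = q.codesystem || ':' || q.code
ORDER BY s.patient_num, a.start_date
"""


def run_template(connection, statement, codes):
    # Load the query parameters into Q_CD, then execute the template statement.
    connection.execute("DELETE FROM q_cd")
    connection.executemany(
        "INSERT INTO q_cd VALUES (?, ?, ?)",
        [(num, system, code) for num, (system, code) in enumerate(codes)],
    )
    return connection.execute(statement).fetchall()


rows = run_template(conn, MEDICATION_QUERY, [("ATC", "A10AB"), ("ATC", "J07BB")])
```

Because the parameters live in a table, the same statement text can be stored once in the configuration file and reused for any list of codes.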

ECLECTIC query templates and corresponding SQL queries

Category: Demographics
Query template / parameters: Patient / Attribute Gender
Data element and example: SCT/ /Gender, DataType=CD, value set e.g. SCT/ /Male
SQL statement in i2b2: Select sex_cd from patient_dimension

Category: Demographics
Query template / parameters: Patient / Birth date
Data element and example: Birth date
SQL statement in i2b2: Select birth_date from patient_dimension

Category: Procedures
Query template / parameters: Procedure / Attribute code
Data element and example: Attribute code, DataType=CD, value set (extension) e.g. SCT/ /Prostatectomy
SQL statement in i2b2: Select SUBSTR(CONCEPT_CD, INSTR(CONCEPT_CD, ':') + 1) from observation_fact

Category: Medication administered
Query template / parameters: Medication / List of medication codes
EHR4CR template: Substance Administration
Data element and example: Attribute code, DataType=CD, value set (extension) e.g. ATC/A10AB/Insulins and analogues for injection, fast-acting
SQL statement in i2b2: Select SUBSTR(CONCEPT_CD, INSTR(CONCEPT_CD, ':') + 1) from observation_fact

Category: Condition
Query template / parameters: Existential Observation / List of observation codes
EHR4CR template: Observation
Data element and example: Attribute code, DataType=CD, value set SCT/ /Ability to comply with treatment; Attribute value, DataType=BL, Yes
SQL statement in i2b2: Select SUBSTR(CONCEPT_CD, INSTR(CONCEPT_CD, ':') + 1) from observation_fact

Category: Observations and measurements with a physical value and unit
Query template / parameters: Numeric Observation / List of observation codes
EHR4CR template: Observation
Data element and example: Attribute code, DataType=CD, value set (intension) e.g. LOINC/ /Hemoglobin A1c/Hemoglobin.total in Blood; Attribute value, DataType=PQ, e.g. UCUM/%/percent
SQL statement in i2b2: Select SUBSTR(CONCEPT_CD, INSTR(CONCEPT_CD, ':') + 1), nval_num, units_cd from observation_fact

Category: Observations and measurements with a categorical value
Query template / parameters: Coded Value Observation / List of observation codes
EHR4CR template: Observation
Data element and example: Attribute code, DataType=CD, value set (intension) e.g. SCT/ /ECOG performance status finding; Attribute value, DataType=CD, value set (extension) e.g. SCT/ /ECOG 1
SQL statement in i2b2: Select SUBSTR(CONCEPT_CD, INSTR(CONCEPT_CD, ':') + 1), tval_char from observation_fact

Category: Observations and measurements with an ordinal value
Query template / parameters: Coded Ordinal Observation / List of observation codes
EHR4CR template: Observation
Data element and example: Attribute code, DataType=CD, value set (intension) e.g. SCT/ /ECOG performance status finding; Attribute value, DataType=CO, value set (extension) e.g. SCT/ /ECOG 1/1
SQL statement in i2b2: Select SUBSTR(CONCEPT_CD, INSTR(CONCEPT_CD, ':') + 1), concept_cd = tval_char from observation_fact

Category: Diagnoses
Query template / parameters: Diagnosis / List of diagnosis codes
EHR4CR template: Observation
Data element and example: Attribute code, DataType=CD, value set (intension) e.g. SCT/ Final diagnosis (discharge); Attribute value, DataType=CD, value set (extension) e.g. ICD10/I21/Acute myocardial infarction
SQL statement in i2b2: Select SUBSTR(CONCEPT_CD, INSTR(CONCEPT_CD, ':') + 1) from observation_fact

Structural mapping in i2b2 sites

Table 9. ECLECTIC query templates and corresponding SQL queries based on i2b2 CDW

generalquery (parameters: none)
SELECT S.PATIENT_NUM IDPATIENT, S.BIRTH_DATE DOB, S.SEX_CD GENDERCODE, ' ' AS GENDERCODESYSTEM
FROM PATIENT_DIMENSION S
WHERE S.PATIENT_NUM IS NOT NULL

deadquery (parameters: none)
SELECT S.PATIENT_NUM IDPATIENT, S.BIRTH_DATE DOB, S.DEATH_DATE EFFECTIVETIME, S.SEX_CD GENDERCODE, ' ' GENDERCODESYSTEM
FROM PATIENT_DIMENSION S
WHERE VITAL_STATUS_CD = 'N' AND S.PATIENT_NUM IS NOT NULL

medicationquery (parameters: list of CD)
SELECT S.PATIENT_NUM IDPATIENT, S.BIRTH_DATE DOB, S.SEX_CD GENDERCODE, ' ' GENDERCODESYSTEM, A.START_DATE EFFECTIVETIME, Q.NUM NUM
FROM OBSERVATION_FACT A, PATIENT_DIMENSION S, Q_CD Q
WHERE A.PATIENT_NUM = S.PATIENT_NUM AND A.CONCEPT_CD = Q.CODESYSTEM || ':' || Q.CODE
ORDER BY S.PATIENT_NUM, A.START_DATE

procedurequery (parameters: list of CD)
SELECT S.PATIENT_NUM IDPATIENT, S.BIRTH_DATE DOB, S.SEX_CD GENDERCODE, ' ' GENDERCODESYSTEM, A.START_DATE EFFECTIVETIME, Q.NUM NUM
FROM OBSERVATION_FACT A, PATIENT_DIMENSION S, Q_CD Q
WHERE A.PATIENT_NUM = S.PATIENT_NUM AND A.CONCEPT_CD = Q.CODESYSTEM || ':' || Q.CODE
ORDER BY S.PATIENT_NUM, A.START_DATE

existenceobservationquery (parameters: list of CD)
SELECT S.PATIENT_NUM IDPATIENT, S.BIRTH_DATE DOB, S.SEX_CD GENDERCODE, ' ' GENDERCODESYSTEM, A.START_DATE EFFECTIVETIME, Q.NUM NUM, SUBSTR(A.CONCEPT_CD, INSTR(A.CONCEPT_CD, ':') + 1) AS LOCALCODE
FROM OBSERVATION_FACT A, PATIENT_DIMENSION S, Q_CD Q
WHERE A.PATIENT_NUM = S.PATIENT_NUM AND A.CONCEPT_CD = Q.CODESYSTEM || ':' || Q.CODE
ORDER BY S.PATIENT_NUM, A.START_DATE

diagnosisquery (parameters: list of CD)
SELECT S.PATIENT_NUM IDPATIENT, S.BIRTH_DATE DOB, S.SEX_CD GENDERCODE, ' ' AS GENDERCODESYSTEM, SUBSTR(A.CONCEPT_CD, INSTR(A.CONCEPT_CD, ':') + 1) LOCALCODE, A.START_DATE EFFECTIVETIME, Q.NUM NUM
FROM OBSERVATION_FACT A, PATIENT_DIMENSION S, Q_CD Q
WHERE A.PATIENT_NUM = S.PATIENT_NUM AND A.CONCEPT_CD = Q.CODESYSTEM || ':' || Q.CODE
ORDER BY S.PATIENT_NUM, A.START_DATE

numericobservationquery (parameters: list of CD)
SELECT S.PATIENT_NUM IDPATIENT, S.BIRTH_DATE DOB, S.SEX_CD GENDERCODE, ' ' GENDERCODESYSTEM, A.START_DATE EFFECTIVETIME, SUBSTR(A.CONCEPT_CD, INSTR(A.CONCEPT_CD, ':') + 1) EVENTCODE, Q.CODESYSTEM EVENTCODESYSTEM,
  CASE WHEN A.VALTYPE_CD = 'N' THEN A.NVAL_NUM WHEN A.VALTYPE_CD = 'T' THEN TO_NUMBER(A.TVAL_CHAR) END AS PHYSVALUE,
  A.UNITS_CD PHYSUNITCODE, Q.NUM NUM, ' ' AS physunitcodesystem
FROM OBSERVATION_FACT A, PATIENT_DIMENSION S, Q_CD Q
WHERE A.PATIENT_NUM = S.PATIENT_NUM AND A.CONCEPT_CD = Q.CODESYSTEM || ':' || Q.CODE
ORDER BY S.PATIENT_NUM, A.START_DATE

codedobservationquery (parameters: list of CD)
SELECT S.PATIENT_NUM IDPATIENT, S.BIRTH_DATE DOB, S.SEX_CD GENDERCODE, ' ' GENDERCODESYSTEM, A.START_DATE EFFECTIVETIME, A.TVAL_CHAR VALUECODE, Q.CODESYSTEM VALUECODESYSTEM, Q.NUM NUM
FROM OBSERVATION_FACT A, PATIENT_DIMENSION S, Q_CD Q
WHERE A.PATIENT_NUM = S.PATIENT_NUM AND A.CONCEPT_CD = Q.CODESYSTEM || ':' || Q.CODE
ORDER BY S.PATIENT_NUM, A.START_DATE

codedordinalobservationquery (parameters: list of CD)
SELECT S.PATIENT_NUM IDPATIENT, S.BIRTH_DATE DOB, S.SEX_CD GENDERCODE, ' ' GENDERCODESYSTEM, A.START_DATE EFFECTIVETIME, SUBSTR(A.CONCEPT_CD, INSTR(A.CONCEPT_CD, ':') + 1) LOCALCODE,
  A.CONCEPT_CD || (CASE WHEN A.VALTYPE_CD = 'T' THEN '=' || A.TVAL_CHAR WHEN A.VALTYPE_CD = 'N' THEN '=' || A.NVAL_NUM ELSE '' END) AS VALUECODE,
  Q.CODESYSTEM AS VALUECODESYSTEM, NULL AS ORDINALVALUE, A.CONCEPT_CD AS ROOTCODE, Q.CODESYSTEM AS ROOTCODESYSTEM, Q.NUM AS NUM
FROM OBSERVATION_FACT A, PATIENT_DIMENSION S, Q_CD Q
WHERE A.PATIENT_NUM = S.PATIENT_NUM AND A.CONCEPT_CD = Q.CODESYSTEM || ':' || Q.CODE
ORDER BY S.PATIENT_NUM, A.START_DATE

Structural mapping in EHR4CR-CDW

Table 10. ECLECTIC query templates and corresponding SQL queries based on EHR4CR-CDW

generalquery (parameters: none)
SELECT s.id AS idpatient, s.birthtime AS DoB, s.administrativegendercode AS gendercode, s.administrativegendercodesystem AS gendercodesystem
FROM Subject AS s
WHERE s.id IS NOT NULL;

deadquery (parameters: none)
No SQL statement is given.

medicationquery (parameters: list of CD)
SELECT s.id AS idpatient, s.birthtime AS DoB, s.administrativegendercode AS gendercode, s.administrativegendercodesystem AS gendercodesystem, a.id AS idsbam, a.effectivetime AS effectivetime, a.effectivetimelow AS effectivetimelow, a.effectivetimehigh AS effectivetimehigh, m.code AS materialcode, r.code AS routecode
FROM ((Administration AS a LEFT JOIN Subject AS s ON a.idsubject=s.id) LEFT JOIN V_CD AS m ON m.id=a.materialcode) LEFT JOIN V_CD AS r ON r.id=a.routecode
WHERE a.materialcode in (SELECT DISTINCT c.id FROM V_CD as c WHERE CONCAT(c.codeSystem, ':', c.code) in ('OID:ATC_Code', ...))
ORDER BY s.id, effectivetime;

procedurequery (parameters: list of CD)
SELECT s.id AS idpatient, s.birthtime AS DoB, s.administrativegendercode AS gendercode, s.administrativegendercodesystem AS gendercodesystem, p.id AS idproc, p.effectivetime AS effectivetime
FROM (Procedures AS p LEFT JOIN Subject AS s ON p.idsubject=s.id)
WHERE p.code in (SELECT DISTINCT c.id FROM V_CD as c WHERE CONCAT(c.codeSystem, ':', c.code) in ('OID:SNOMED-CT_Code', ...))
ORDER BY s.id, effectivetime;

existenceobservationquery (parameters: list of CD)
No SQL statement is given.

diagnosisquery (parameters: list of CD)
SELECT s.id AS idpatient, s.birthtime AS DoB, s.administrativegendercode AS gendercode, s.administrativegendercodesystem AS gendercodesystem, o.id AS idobs, o.effectivetime AS effectivetime
FROM (Observation AS o LEFT JOIN Subject AS s ON o.idsubject=s.id)
WHERE CONCAT(o.valueCodeSystem, ':', o.codevalue) in ('OID:ICD_code', ...) and o.valuereftype='cd'
ORDER BY s.id, effectivetime;

numericobservationquery (parameters: list of CD)
SELECT s.id AS idpatient, s.birthtime AS DoB, s.administrativegendercode AS gendercode, s.administrativegendercodesystem AS gendercodesystem, o.id AS idobs, o.effectivetime AS effectivetime, d.code AS eventcode, d.codesystem AS eventcodesystem, o.physvalue, o.physvaluelow, o.physvaluehigh, o.physunit as physunitcode
FROM (Observation AS o LEFT JOIN Subject AS s ON o.idsubject=s.id) LEFT JOIN V_CD AS d ON o.code = d.id
WHERE o.code in (SELECT DISTINCT c.id FROM V_CD as c WHERE CONCAT(c.codeSystem, ':', c.code) in ('OID:LOINC_Code', ...)) and o.valuereftype='pq'
ORDER BY s.id, effectivetime;

codedobservationquery (parameters: list of CD)
SELECT s.id AS idpatient, s.birthtime AS DoB, s.administrativegendercode AS gendercode, s.administrativegendercodesystem AS gendercodesystem, o.id AS idobs, o.effectivetime AS effectivetime, v.code AS valuecode, v.codesystem AS valuecodesystem
FROM (Observation AS o LEFT JOIN Subject AS s ON o.idsubject=s.id) LEFT JOIN V_CD as v ON v.id = o.codevalue
WHERE o.code in (SELECT DISTINCT c.id FROM V_CD as c WHERE CONCAT(c.codeSystem, ':', c.code) in ('OID:SNOMED-CT_Code')) and o.valuereftype='cd'
ORDER BY s.id, effectivetime;

codedordinalobservationquery (parameters: list of CD)
Under development.

5 EHR4CR semantic interoperability services (SIS)

Chapter 5 describes the EHR4CR semantic interoperability services (SIS) developed for semantically enabled applications. The SIS are developed to support (i) the standardization process during the set-up phase, where SIS are available for the ETL tools used during EHR4CR CDW population (see section 6) and for the CDISC SDM-ODM editor extension used for the structural mapping between CDISC SDM-ODM and the mediation model, and (ii) the execution phase of the EHR4CR use cases (PFS, PRS or CTE), where SIS are available for the EHR4CR query builder (query specification) and for the EHR4CR endpoint (query transformation).

The SIS provide a standardized interface for the usage and management of semantic resources (terminologies, ontologies, value sets, data elements, templates) chosen by the users of the service in their deployment environment. These services consist of a modular, common and universally deployable set of behaviors which can be used to deal with a shared set of semantic resources. The services contribute to interoperability by fostering the authoring of semantic resources via their authoring profile and by supporting easy access to the foundational elements of shared semantics. This goal is realized by extending the original functionality outlined in HL7's Common Terminology Service Release 2 (CTS2) specification.

5.1 Introduction and Scope

The SIS have been developed following the Service Specification Development Framework Methodology. This methodology was proposed by HL7 and OMG for defining the Healthcare Services Specification Project specifications. The methodology sets out an overall process, and also defines the responsibilities of the Service Functional Model.
The business context for this particular specification is set out separately, but first it is important to understand the overall context within which this specification is written, i.e. its purpose from a methodology standpoint.

5.2 Service Definition Principles

The high-level principles regarding service definition that have been adopted by the Services Specification Project are as follows:
- Service specifications shall be well defined and clearly scoped, with well understood requirements and responsibilities.
- Services should have a unity of purpose (e.g. fulfilling one domain or area), but services themselves may be composable.
- Services will be specified sufficiently to address functional, semantic and structural interoperability.
- It must be possible to replace one conformant service implementation with another meeting the same service specification while maintaining the functionality of the system.

A service at the Service Functional Model (SFM) level is regarded as a system component; the meaning of the term (system) component in this context is consistent with UML usage. A component is a modular unit with well-defined interfaces that is replaceable within its environment. A component can always be considered an autonomous unit within a system or subsystem. It has one or more provided and/or required interfaces, and its internals are hidden and inaccessible other than as provided by its interfaces.

Each service's Functional Model defines the interfaces that the service exposes to its environment, and the service's dependencies on services provided by other components in its environment. Dependencies in the Functional Model relate to services that have, or may in future have, a Functional Model at a similar level; detailed dependencies on low-level utility services should not be included, as that level of design is not in scope for the Functional Model. The manner in which services and interfaces are deployed, discovered and so forth is outside the scope of the Functional Model.

5.3 Comparison of the SIS/CTS2 Service Functional Models

The goal of the SIS is to provide a standardized interface for the usage and management of semantic resources. The services contribute to interoperability by supporting easy access to the foundational elements of shared semantics. They also foster the authoring of semantic resources via their authoring profile. This goal is realized by extending the original functionality outlined in HL7's Common Terminology Service Release 2 (CTS2) specification, which defines the functional requirements of a set of service interfaces to allow the representation, access and maintenance of semantic content either locally or across a federation of terminology service nodes.

Similarly to the CTS2 service, the SIS define both the expected behaviors of a semantic resource service and a standardized method of accessing semantic content. Terminologies provide the atomic building blocks of shared semantics: concepts. Other building blocks in the scope of SIS are value sets, data elements, templates and associations.
The functional scope of SIS is broader than that of CTS2: SIS provide a modular, common and universally deployable set of behaviors which can be used to deal not only with a set of terminologies, value sets and associations but also with data elements and templates chosen by the users of the service in their deployment environment. This consistent approach to semantic content interaction benefits other business context services by providing a level of semantic interoperability that currently exists only in a limited form. Although specified in this section to provide standalone capabilities for semantic content access and management, SIS are used in conjunction with other services in EHR4CR use cases. This overview section reuses and extends parts of the CTS2 specifications.

5.4 Structure of the SIS specification

In order to provide maximum implementation flexibility, this functional model defines several enumerated functional profiles for SIS. These profiles serve to subset and focus the functionality of a SIS implementation to accomplish a targeted set of tasks. The functional profiles include:
- SIS code system profile: outlines the functional coverage necessary for a service to declare itself a conformant SIS service. It includes capabilities for searching and querying code system content.
- SIS value set profile: outlines the functional coverage necessary for a service to declare itself a conformant SIS service. It includes capabilities for searching and querying value set content.
- SIS template profile: outlines the functional coverage necessary for a service to declare itself a conformant SIS service. It includes capabilities for searching and querying template content.
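The idea of a functional profile, and of replacing one conformant implementation with another, can be illustrated with a minimal sketch. The interface below is purely hypothetical: the method names `search_value_sets` and `get_members` and the sample value set are assumptions and are not taken from the SIS or CTS2 specifications.

```python
from typing import Protocol


class ValueSetProfile(Protocol):
    """Hypothetical interface for a value set profile (search and query)."""

    def search_value_sets(self, text: str) -> list: ...

    def get_members(self, value_set_id: str) -> list: ...


class InMemoryValueSetService:
    """One possible implementation; any class with the same methods could replace it."""

    def __init__(self, value_sets):
        # value_sets maps an identifier to (name, [(code_system, code), ...])
        self._value_sets = value_sets

    def search_value_sets(self, text):
        needle = text.lower()
        return sorted(
            vs_id
            for vs_id, (name, _members) in self._value_sets.items()
            if needle in name.lower()
        )

    def get_members(self, value_set_id):
        return list(self._value_sets[value_set_id][1])


def codes_matching(service: ValueSetProfile, text: str):
    # Client code depends only on the profile, not on a concrete implementation.
    codes = []
    for vs_id in service.search_value_sets(text):
        codes.extend(service.get_members(vs_id))
    return codes


service = InMemoryValueSetService(
    {"vs-medication": ("Medication codes", [("ATC", "A10AB"), ("ATC", "J07BB")])}
)
found = codes_matching(service, "medication")
```

The client function `codes_matching` would work unchanged with a remote implementation exposing the same two operations, which is the substitution property stated in the service definition principles.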

5.5 Implementation Considerations

The SIS specification is an interface specification, not an implementation specification. As such, it is intended to facilitate the development of an implementable interoperability mechanism for terminology resources. SIS are intended to expose one or more semantic sources for use by various applications that may or may not be within the same organization, providing a standardized method of access to semantic resources.

SIS provide for semantic interoperability between organizations. While coded concepts from a structured terminology can unambiguously identify the concept(s) being communicated, a standard way of structuring and communicating those coded entries is required. SIS can be used in an inter-organizational setting where each organization maintains its own security and application-specific provisions. SIS enable access to a high-availability or international standard terminology resource, made available to subscribers via a SIS interface.

Since semantic content is not static, SIS provide functionality to maintain and update semantic resources. Updates and update requests to semantic sources need to be reviewable and traceable over time. Often, a terminology source provider will want to maintain the gold standard or master release of a code system, so as to maintain a consistent standard terminology that can be used across multiple organizations and realms. Nevertheless, users of any given source terminology may wish to extend that terminology for their own use, and may even wish to recommend that the provider of the semantic resources repository include those local extensions as part of the release.
SIS provide a mechanism that allows users to extend a given semantic resources repository, share those extensions with others, or feed those extensions back to the source provider in a structured format, to be reviewed, modified as necessary and fed into a SIS server as input to update the source terminology with the content of the change request.

6 EHR4CR Clinical Data Warehouse

Chapter 6 describes the design and implementation of the EHR4CR Clinical Data Warehouse and of the extract-transform-load (ETL) of EHR data into the EHR4CR CDW.

6.1 Introduction

At the start of the project in 2011 only two sites (FAU and APHP) had existing warehouses that could connect directly to the platform, both of these being i2b2 warehouses (which are not HL7 compliant). The other sites therefore needed to deploy a warehouse and perform regular extraction-transformation-load (ETL) operations. EHR4CR chose not to nominate an existing warehouse (such as i2b2) for use with the EHR4CR platform; as part of its commitment to these and future sites, a project-specific warehouse was developed for installation by the sites, along with associated ETL guidance. (Existing i2b2 sites already had ETL in place.)

The technology chosen for the EHR4CR project warehouse is relational, and its design complies with the HL7-compliant EHR4CR Information Model (IM, v2-r3 24/04/2012, represented in UML in Figure A). (The workbench Eligibility Criteria Model is translated directly to SQL by the endpoint software for both the EHR4CR and i2b2 warehouses using their respective query templates, see Tables 9 and 10.)


Figure A - EHR4CR Information Model represented in UML

Derivation of a Physical Model

The first step towards a project-specific relational warehouse was the generation of a physical model (PM) and associated database schema from the common information model (IM), or mediation model. This was achieved through the following series of decisions:
1. Unwanted attributes in the IM were dropped
2. The structure of the information model was used as the initial physical model
3. Foreign keys were introduced for inter-class relationships and embedded classes
4. The physical model was de-normalized to a degree to improve the efficiency of expected queries
5. The ClinicalStatementRelationship class was dropped
6. Mandatory columns were established
7. Support for specific relational technologies was introduced, e.g. MySQL, Oracle, MS-SQL
8. The final model was expressed as a database schema and provided as DDL
9. The final model was expressed as a UML diagram
10. Guidance was prepared and highlighted unresolved issues

Unwanted attributes in the IM were dropped

Since the IM is derived from existing HL7 classes, some attributes of these classes and of other referenced classes are not required for our purposes. For example, HL7 mood, status and priority codes were not required, nor were activity and availability time. In addition it was agreed that negation would not be represented within the model and that we would rely solely on terminology for this. Other attributes dropped from the (abstract) ClinicalStatement class included: interruptibleind, independentind, derivationexpr and repeatnumber.

The structure of the information model was used as the initial physical model

Each major class of the information model (Observation, Procedure, SubstanceAdministration) and the associated classes (Subject, Encounter, Participation) were introduced as tables into the physical model. In addition, attributes of these classes that are classes in themselves were also introduced as tables, e.g. attributes of class type CD.

Foreign keys were introduced for inter-class relationships and embedded classes

All inter-class and embedded class relationships were represented by foreign keys within the appropriate table. For example:
- a foreign key integer field idencounter was inserted into the Observation table to establish the related Encounter table record
- a foreign key integer field code was inserted into the Observation table to establish the related CD table record holding details of the code, including value, code system, rubric, etc.

The physical model was de-normalized to a degree to improve the efficiency of expected queries

The resulting physical model had many related tables and foreign keys, and queries invariably required a number of joins, which would undermine the performance of the warehouse. WPG2 made specific requests for some inter-table relationships to be removed and the child table field structure to be brought within the parent table.
For example:
- the PQ table storing physical quantities was removed and its fields brought within the Observation table, these being defined as physicalvalue, physicalvaluelow, physicalvaluehigh and physicalunit
- a table for effectivetime was removed and its fields brought within various tables, these being defined as effectivetime, effectivetimelow and effectivetimehigh with type DATETIME (in MySQL for example).

With the exception of administrativegendercode in the Subject table, all codes are placed in the V_CD table, where all codes and their rubrics reside. One advantage of this is the ability to map site codes to central codes using this table alone while retaining a chain from the central to the original codes.

The ClinicalStatementRelationship class was dropped

The ClinicalStatementRelationship class allows relationships of varying complexity to be established between records of the three main tables. This complexity is not generally useful for a warehouse given the nature of the likely queries, and it was therefore decided to drop this relationship. Since the Eligibility Criteria Model does not have computational capabilities, this has implications. For example, if we wish to query against a BMI derived from coincident height and weight observations, then the BMI record must be created by the ETL process; no other means are possible.

Mandatory columns were established

A crucial element of the physical model is to establish those fields which must be present for platform queries to execute meaningfully. For example, the model requires that subject birthtime is always available, as many queries will not execute without this variable. Sites with only age available will have to compute an estimated birthtime during ETL. In addition, effective time (all tables), subject ID (subject table) and organization ID (organization table, derived from the HL7 participation class) are mandatory and must be made available or generated during ETL. Nearly all sites can furnish this data. Where data is generally not available from sites, the relevant fields are defined as nullable; this covers fields that are optional, conditional or not required at this time, e.g. administration.dosequantity (not required), observation.physvalue (conditional) or observation.idencounter (optional).

Support for specific relational technologies was introduced

Most sites will have procured ICT products from their preferred vendor, such as Microsoft or Oracle, or perhaps open source solutions. It is very unlikely that sites will wish to introduce an alternative vendor for EHR4CR; it is therefore important that the physical model can be implemented in a number of vendor technologies (and that the end-point software can deal with this). Vendor implementations introduce variations in SQL syntax which must be catered for. End-point software will have to handle variations in data manipulation language (DML), while the physical model must handle variations in data definition language (DDL). The DDL uses SQL-92 syntax almost exclusively. DDL variations include (but are not limited to):
- Date and time syntax
- Sequencing of Create Table statements
- Optimization statements

Rather than provide separate DDL for each vendor, the DDL script has embedded guidance highlighting the substitutions to apply before creating the database. For example, the timestamp data type is specified for dates and times, but this should be substituted by DATETIME or DATE depending on the vendor platform used at a site.
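Two of the ETL-time derivations mentioned above, an estimated birthtime for sites that only hold age and a BMI record built from coincident height and weight observations, can be sketched as follows. This is an illustration only: the record layout, the placeholder codes `HEIGHT_M`, `WEIGHT_KG` and `BMI`, and the policy of estimating the birth date as 1 January of the inferred year are assumptions, not project guidance.

```python
from datetime import date


def estimated_birthtime(age_years, reference):
    # Assumed policy: place the birth date on 1 January of the inferred birth year.
    return date(reference.year - age_years, 1, 1)


def derive_bmi(observations):
    """Create BMI records from height and weight observations that share the
    same subject and effective time (all codes are placeholders)."""
    grouped = {}
    for obs in observations:
        key = (obs["idsubject"], obs["effectivetime"])
        grouped.setdefault(key, {})[obs["code"]] = obs["physvalue"]

    derived = []
    for (subject, when), values in sorted(grouped.items()):
        if "HEIGHT_M" in values and "WEIGHT_KG" in values:
            bmi = values["WEIGHT_KG"] / values["HEIGHT_M"] ** 2
            derived.append(
                {"idsubject": subject, "effectivetime": when,
                 "code": "BMI", "physvalue": round(bmi, 1)}
            )
    return derived


observations = [
    {"idsubject": 1, "effectivetime": "2015-03-10", "code": "HEIGHT_M", "physvalue": 1.75},
    {"idsubject": 1, "effectivetime": "2015-03-10", "code": "WEIGHT_KG", "physvalue": 70.0},
    {"idsubject": 2, "effectivetime": "2015-04-01", "code": "WEIGHT_KG", "physvalue": 82.0},
]
bmi_records = derive_bmi(observations)
birthtime = estimated_birthtime(64, date(2015, 3, 10))
```

Only the first subject has both measurements at the same effective time, so a single BMI record is produced; the second subject has no height observation and yields no derived record.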
The final model was expressed as a database schema and provided as DDL

The relational DDL for scripting the EHR4CR warehouse can be found in Appendix X. This file provides definitions and guidance for each table and field and their relationship to the Information Model (IM) classes. Fields are normally given names from the IM, and their type within the IM is also indicated. For example, within the Subject table we find the field birthtime, which is of type TimeStamp in the physical model and class TS in the IM. Guidance is provided for obfuscation of this value where sites feel the need to do this, or where this is already the case within their source systems. For some fields mandatory coding is specified, e.g. for administrativegender within the Subject table. General guidance on code mapping is provided by WP4 and the services available there.

The final model was expressed as a UML diagram

The UML diagram for the final data model is given in Figure B as output from Enterprise Architect version 9 (EA v9). This file can be found on the project SharePoint.

Figure B - Enterprise Architect diagram for EHR4CR physical model.

6.2 ETL process and guidance for user acceptance testing

The ETL process to prepare site data for the EHR4CR warehouse for user acceptance testing (UAT) of the platform is shown in Figure C.


Figure C - ETL for EHR4CR native warehouse used in user acceptance testing. Note the operations at top to obfuscate the data within the resulting warehouse.

The project sites represent three different kinds of data source: hospital systems (6); individual clinic systems (2); and regional repositories of hospital data (3). Therefore, the domain coverage and size of the warehouses vary significantly.

The majority of sites obfuscated their data because there was insufficient protection from the platform end-point software. End-point features thought absolutely necessary for release of pseudonymised, real data to the warehouse included: local query audit; local control of query execution; certification/validation of the software; and fuzzing/blocking of returned counts. This self-imposed obfuscation included the scrambling or shifting of event times and k-anonymization, sometimes making the data internally inconsistent from a domain perspective. In addition, sometimes for performance reasons, many sites did not provide complete data but relatively small subsets, either random subsets or subsets tailored to one or more of the studies contained within the UAT. Also, some sites had data that was not coded and so did not pass through ETL into the warehouse.

Each site must design an ETL suitable for its data. What is important is that the semantics of the target warehouse structure and terminology are clearly understood by all sites using that warehouse, and that the ETL operations they perform yield the desired data records and site coding. ETL consists of three phases:

Extraction

During the extraction phase, data is located and converted into a single format that is appropriate for further transformation. At this point, if necessary, the semantics of the data can be explored by profiling the data, i.e. enumerating the unique data values and parsing them into patterns that make the field readily comprehensible.
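The profiling step described above can be sketched as follows. This is a minimal illustration, assuming the extracted field is available as a list of raw strings; the pattern convention (digits become 9, letters become A) and the function names are assumptions made for the example, not a method prescribed by the project.

```python
import re
from collections import Counter


def value_pattern(value: str) -> str:
    """Collapse a raw value into a coarse pattern: digits -> 9, letters -> A."""
    pattern = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", pattern)


def profile(values):
    """Return counts of unique raw values and counts of their patterns."""
    unique_values = Counter(values)
    patterns = Counter(value_pattern(v) for v in values)
    return unique_values, patterns


# Hypothetical content of one extracted source field.
raw_values = ["143", "138", "mmol/L", "14.2", "143", "N/A"]
uniques, patterns = profile(raw_values)

if __name__ == "__main__":
    for pattern, count in patterns.most_common():
        print(pattern, count)
```

A field dominated by one pattern with a few outliers (here a unit string and a missing-value marker among numeric results) is a natural candidate for the rejection rules defined in the same phase.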
Any documentation that offers definition or provenance of the data should then be consulted and any inter-field correlations checked or noted. Rejection criteria/rules should then be defined and any rejections logged. Such criteria should include quality assessment as a minimum. Rejection rules should also cover data values that are considered confidential and whose release is not permitted by local information governance (IG). Where IG does permit release, this should be indicated using the confidentialitycode in the main tables of the warehouse. Finally, further rejections should take place if the data is to be added incrementally to the warehouse. The precise details of the latter will be site specific.

Transformation

During the transformation phase, a structural transformation from the source data model to the warehouse model is performed. Any code mappings from source system to central system take place at query runtime. Structural transformation is usually straightforward, but can be more complex with one-to-many and many-to-one field mappings. An example of one-to-many would be a single field for a named laboratory result, such as a serum sodium value, transforming to multiple fields in the warehouse for code, value and unit (see Figure D). An example of a many-to-one field mapping might involve a field whose semantics depend on the value of another field.

Missing data also presents a choice for site ETL. For example, some sites do not have date-of-birth available and provide age instead. It is important that uniform guidance is given for calculating an approximation and ensuring that all event times in the longitudinal patient record are consistent with this.

An important part of the ETL is the generation of a pseudo-identifier for each patient within the warehouse. This must be defined in the Subject table and carefully applied to the 3 main fact tables. This subject id will not be exposed to the orchestrator or workbench, but will be processed by the end-point software. There must be support for de-identification of possible recruits.
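For the missing date-of-birth case mentioned above, one possible approximation rule is sketched below. The deliverable calls for uniform guidance but does not prescribe a formula, so the mid-window convention, the clamping to the earliest event and the function name estimate_birthtime are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def estimate_birthtime(age_years, age_recorded_at, event_times):
    """Approximate a birthtime from an age in whole years.

    Assumed convention: a subject aged N on the recording date was born
    between N and N+1 years earlier, so the middle of that window is used.
    The estimate is then moved back, if needed, so that no event in the
    patient record precedes it.
    """
    days_back = round((age_years + 0.5) * 365.25)
    estimate = age_recorded_at - timedelta(days=days_back)
    earliest_event = min(event_times, default=estimate)
    return min(estimate, earliest_event)


if __name__ == "__main__":
    recorded = datetime(2015, 3, 10)
    events = [datetime(2010, 1, 1), datetime(2014, 6, 30)]
    print(estimate_birthtime(60, recorded, events))
```

Whatever rule is adopted, applying the same one at every site keeps age-based eligibility criteria comparable across warehouses.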
The EHR4CR warehouse uses a 4-byte integer to reference a subject. It is expected that the relationship of this subject id to the local hospital identifier is kept in a local mapping table that is not exposed to the central EHR4CR platform and may reside in a separate local component of the platform. The definition and naming of this table is under local control. Sources that implement k-anonymization will not provide the ability to identify patients for recruitment using the warehouse and must re-query the source EHRs.

Sites may record participations in clinical statements by providing site-unique organization IDs. It is not anticipated that the warehouse will store details of individual medical practitioners, but only the organizations they work for. Organization size may vary from individual hospital clinics or wards to entire hospital groups. Organization IDs must be defined in the Organization table and carefully applied to the 3 main fact tables.

It is important that the detail of the ETL processes which gave rise to a clinical statement record is available to end-users so that any subtleties can be considered. The warehouse model offers a field (ETLSpec) that indexes information of this kind in the table ETLSpec. This table is at present a placeholder for a fuller specification which will be developed in due course. In addition, the context in which data is collected can have a profound effect on interpretation. As well as ETLSpec, a further field may be made available to cover information of this kind. Both these developments would involve additional tables being defined within the warehouse.

Comprehensive guidance has yet to be developed and is the subject of future work. It will comprise both general and site-specific guidance, will evolve over time, and will require continuous coordination and online support. Figure D shows an example transformation.
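The one-to-many structural mapping of a named laboratory field can be sketched as follows. This is an illustration only: the source row layout, the local code LOCAL-NA and the column names code and idsubject are hypothetical, while physicalvalue, physicalunit and effectivetime follow the field names used in the physical model. The site code is kept as-is, since code mapping to central terminologies takes place at query runtime.

```python
from datetime import datetime

# Hypothetical site-side description of a named laboratory field: the local
# code identifying the test and the unit implied by the source field.
SOURCE_FIELDS = {
    "serum_sodium": {"code": "LOCAL-NA", "unit": "mmol/L"},
}


def to_observation(source_row: dict, field: str) -> dict:
    """Split one named source field into code, value and unit columns."""
    spec = SOURCE_FIELDS[field]
    return {
        "idsubject": source_row["pseudo_id"],      # assumed column name
        "code": spec["code"],                      # site code, mapped later
        "physicalvalue": float(source_row[field]),
        "physicalunit": spec["unit"],
        "effectivetime": source_row["sampled_at"],
    }


# Hypothetical source record holding a serum sodium result of 143 mmol/L.
source_row = {
    "pseudo_id": 1042,
    "serum_sodium": "143",
    "sampled_at": datetime(2014, 5, 2, 9, 30),
}
observation = to_observation(source_row, "serum_sodium")

if __name__ == "__main__":
    print(observation)
```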


Figure D - Example of structural and terminology mapping of Serum sodium 143mmol/L

Load

The complexity of the final (bulk) load is determined by whether the load is de novo or permits incremental additions of records and substitutions of corrected records. The details of this are left to the sites. Whatever method is used, the load method and timeliness of the source data should be known to the end-user.

Data quality

It is desirable to develop some queries, executed through the platform, that compute measures of quality within particular disease areas. This will be the subject of future work.

Automation

The overall ETL should contain features necessary for continuous or frequent operation:

- a workflow with selectable processing components
- processes and components that are efficient, scalable, and maintainable
- scheduling
- monitoring and alerting
- a life cycle with audit and compliance
- bulk extract and load

The features listed above are common in ETL tools available today, which comprise both open source and proprietary offerings:

Open Source

- Pentaho
- Talend Open Studio
- Scriptella
- JasperETL from JasperSoft
- CloverETL
- Apatar

Proprietary

- Pervasive Software
- Astera Centerprise
- Expressor
- SAS Data Integration Server
- SAP BusinessObjects Data Integrator
- Integration Services of Microsoft SQL Server
- IBM InfoSphere DataStage
- Informatica PowerCenter
- Oracle Data Integrator
- Stone Bond Technologies
- SnapLogic

A number of EHR4CR sites have experience of Talend Open Studio and have used this for their ETL.

6.3 Mappings

It should be noted that source-target terminology mapping services are not required at ETL for the EHR4CR warehouse. However, the necessary mappings must be developed in parallel using the services from WP4. In practice, sites used their own tools to generate terminology mappings since the relevant tools were not yet available from WP4. For example, Dundee used a relatively simple mapping editor that generated mapping files that could be imported into the local terminology service (see Figure F).
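A mapping file of the kind described above, and its use to resolve a site code to a central code, might look as follows. The CSV layout, the local code LOCAL-NA and the function names are hypothetical and do not describe the format produced by the Dundee editor; the LOINC code shown is used only as an illustrative central code for serum sodium.

```python
import csv
import io

# Hypothetical mapping file content: one row per mapped site code.
MAPPING_FILE = """local_code,central_system,central_code
LOCAL-NA,LOINC,2951-2
"""


def load_mappings(text: str) -> dict:
    """Index mapping rows by their local (site) code."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["local_code"]: row for row in reader}


def map_to_central(local_code: str, mappings: dict):
    """Return the central code while retaining the original site code."""
    row = mappings.get(local_code)
    if row is None:
        return None  # unmapped site codes are reported, not guessed
    return {
        "central_system": row["central_system"],
        "central_code": row["central_code"],
        "original_code": local_code,
    }


mappings = load_mappings(MAPPING_FILE)

if __name__ == "__main__":
    print(map_to_central("LOCAL-NA", mappings))
```

Returning the original code alongside the central one preserves the chain between central and site codes, so a query result can always be traced back to the source coding.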

Figure F: Terminology mapping editor used by Dundee to map site codes to central codes.

6.4 The Dundee 200 test data

To facilitate the deployment and testing of end-point software at sites, a test dataset consisting of Dundee data relating to 200 diabetic patients is made available to other sites. This data is pseudonymised and partially obfuscated. The dataset comprises a total of 100k records and includes the data listed in Table A.

Sample of 200 diabetics                     Local coding                Central coding
Birth, death, gender                        Local                       S-CT
Height, weight, BMI                         Local                       S-CT
BP: systolic, diastolic                     Local                       S-CT
Laboratory (clinical chemistry)             Local                       LOINC
Medication                                  BNF                         ATC
Hospital episodes (diagnosis, procedure)    ICD10, OPCS4.x, national    ICD10, S-CT

Table A - Data and coding used within the Dundee-200 test data. S-CT: SNOMED-CT; BNF: British National Formulary (drugs); OPCS 4.x: Office of Population Censuses and Surveys, Classification of Interventions and Procedures, v4.x; National: e.g. dischargedispositioncode

6.5 Future work

While there is agreement on the structure of the EHR4CR CDW, performance measures, and conformance testing, there is still work to be done on:

- Support for ETL provenance
- Support for data context
- Data quality measures
- Site and platform resource requirements
- Site skill sets

7 Evaluation of the semantic resources and services

Chapter 7 describes an evaluation of the EHR4CR standardization pipeline and semantic interoperability services (SIS). Our goal was twofold: first, to define a set of desiderata for developing a Common Information Model and computable eligibility queries; second, to use this conceptual framework to describe the strengths and limitations of the EHR4CR semantic platform.

Our approach consisted first of extending the desiderata for computable representations of electronic health records-driven phenotype algorithms proposed by Mo et al [Mo15], in order to propose a conceptual framework for comparing mediation approaches and semantic interoperability solutions developed by platforms supporting cross border research. Second, we instantiated the conceptual framework in the context of the EHR4CR project in order to evaluate how far the development of the mediation model and the standardization efforts met the expected requirements of the project.

The conceptual framework of computable representations of phenotype algorithms consists of a set of requirements related to the three main components of a semantic interoperability platform: 1) query language and model, 2) patient data model (mediation model) and 3) standardization pipeline for data providers.

7.1 Evaluation framework

7.1.1 The need of high quality query language and model

There is a need to manage eligibility criteria in order to accelerate the development of new clinical research protocols and related clinical research documents (e.g., case report forms, data collection forms, training materials, etc.). Related efforts include EligWriter [Gennari01] and Design-a-Trial [Nammuni02], which supported the re-use of eligibility criteria during clinical trial protocol authoring, as well as ERGO [ERGO15] and ASPIRE [Niland07], which support eligibility criteria annotation. The definition of computable phenotype algorithms requires interoperability with patient data.
The knowledge representation requirements for eligibility criteria in this context are more stringent, including highly expressive language(s) to achieve executable eligibility rules, a patient information model, and an appropriate clinical terminology to facilitate mapping from eligibility concepts to patient data.

7.1.2 The need of high quality mediation model (patient data model)

Since data sources are not designed with a primary focus on cross-domain integration, initiatives for integrating clinical care and clinical research data have often been limited to non-scalable, system (or vendor)-specific efforts [Cuggia11, El Fadly11, Schreiweis14]. In an expanding research landscape, cooperation infrastructures are now being built to allow research projects to reuse patient data from federated systems at many different sites in different countries, and therefore in multilingual settings. Non-standard, and often conflicting, vendor approaches to representing data pose challenges to infrastructure developers, who must build solutions to work with clinical data across multiple formats.

Systems developed during the last decade in order to compute eligibility criteria - including GUIDE, GLIF3, SAGE, ERGO, CRFQ - largely adopted some form of Virtual Medical Records (VMR) [81Weng] based on the HL7 Reference Information Model (RIM) [Jenders97], which provides an abstraction layer on top of a real EHR. HL7 FHIR specifications are now gaining interest. Although there is no consensus in the medical informatics community regarding a standard patient information model, the development of HL7 FHIR shows promise to mitigate the classic site-specific data mapping problem. A controlled mediation model is required to support federated access to heterogeneous data sources.

7.1.3 The need of an efficient standardization pipeline within participant data providers

Beyond the creation and continuous extension of the standard-based mediation model, the process of harmonizing heterogeneous data sources, called data standardization in this paper, also relies on the capability of different actors in hospital sites to align the local structures and content of their EHR systems or Clinical Data Repositories to the mediation model. Few EHR systems or Clinical Data Repositories in hospitals implement standard reference models such as HL7 RIM, EN ISO or openEHR. Most of them rely on proprietary models. Furthermore, although the need for controlled vocabularies in EHR systems is widely recognized, system developers have often dealt with this need by creating ad hoc sets of controlled terms for use in their applications, so that information in one system cannot be recognized and used by other systems. Differences between the controlled vocabularies of two systems exist even when both systems were created by the same developers. Therefore, mapping local models and/or controlled vocabularies is a challenging and time-consuming task for terminologists in participant hospitals. Efficient supportive mapping tools are required to enable terminologists to develop and maintain semantic mappings between the proprietary models and the mediation model.

7.2 Results

Table 11 provides the list of 23 requirements defined for the three main components involved in the definition of phenotype algorithms: 1) query language and model, 2) patient data model (mediation model) and 3) standardization pipeline for data providers. Table 11 also provides a qualitative evaluation of the strengths and limitations of the EHR4CR platform in the implementation of phenotype algorithms and its capacity to support the different actors in accomplishing their tasks during the data standardization process, at both the setup and execution phases of the EHR4CR use cases.

Table 11. Results of the evaluation of the EHR4CR semantic interoperability platform
Column headings: Desiderata proposed by Mo et al. [Mo15]; EHR4CR use case

A - The need of high quality query language

Req A.1: Implement set operations and relational algebra for modeling phenotype algorithms, represent phenotype criteria with structured rules (Mo 2015; Req.4&5) ++
Req A.2: Support both human readable and computable representations of phenotype algorithms (Mo 2015; Req.3) ++
Req A.3: Support defining temporal relations between events (Mo 2015; Req.6) ++
Req A.4: Provide representations for text searching and natural language processing (Mo 2015; Req.8)
Req A.5: Query language shall be generic and standard based
Req A.6: Query builder shall be intuitive ++
Req A.7: Provide interfaces for external software algorithms (Mo 2015; Req.9)
Req A.8: Maintain backward compatibility (Mo 2015; Req.10)

B - The need of high quality patient data model (mediation model)

Req B.1: The mediation model shall be based on standard domain knowledge and reference models provided by standard development organizations that are and will be used by EHR vendors, clinicians, and government mandates (e.g. Meaningful Use Stage 3 in US).
Req B.2: The mediation model shall use standard terminologies, ontologies and value sets that are multilingual and internationally used (Mo 2015; Req)
Req B.3: Support customization for the variability and availability of EHR data among sites. Possible use of internally defined extensions of existing standard terminologies (in order to add any missing concept or any missing description in any specific language) (Mo 2015; Req.2) ++
Req B.4: The mediation model shall use mappings between reference terminologies (e.g. SNOMED-MedDRA, SNOMED CT-NCI Thesaurus) in order to allow end users to access semantically equivalent content through different terminologies +
Req B.5: The mediation model shall be expressive enough to represent i) multimodal (signs, symptoms, diseases, outcomes, procedures, care plans, etc. as well as images, signals, etc.) and multi-scale clinical data including molecular findings such as genomics information; ii) specimen related information, family related information, etc.; iii) multiple granularities, multiple consistent views, context representation
Req B.6: The mediation model shall be scoped to the needs of the users of the research network in the context of dedicated use cases but scalable and sustainable (designed to be rapidly and efficiently scoped to cover any new requirement, extensible in terms of structure and content) ++
Req B.7: The mediation model shall be represented using standard formal languages allowing semantic reasoning (e.g. semantic web languages) in order to recognize redundancy or inconsistency
Req B.8: A robust version management process shall be provided for any type of semantic resource of the mediation model ++
Req B.9: A dedicated tool is required for supporting the authors of the mediation model to efficiently create/update the semantic resources of the model. The editor shall support a collaborative editing process. The creation and update process shall be user-friendly and adapted to medical experts (through user interface, but also through import of simple csv files used to capture medical knowledge in a format that is understandable for medical experts). The editor shall allow the authors to create new semantic resources from standard terminologies (e.g. SNOMED CT, LOINC, ATC, ICD-O) or value sets. The standard resources are imported from the official terminology providers and up-to-date. ++
Req B.10: The semantic resources shall be accessible to any application through standardized semantic services based on new web technologies, such as Representational State Transfer (REST)-based APIs/web services, recently adopted by HL7.

C - The need of efficient standardization pipeline within data providers

Req C.1: Automatic mapping algorithms supporting terminologists in identifying corresponding concepts in the mediation model on one side and local models on the other side. These algorithms shall i) use the descriptions and synonyms of the concepts; ii) address multilingual issues; iii) use existing mappings between reference terminologies (e.g. when local sources are mapped to a standard terminology which is not used in the mediation model (e.g. NCI Thesaurus), using the mapping between SNOMED CT and NCI Thesaurus to propose automatic mappings between local concepts and SNOMED CT concepts in the mediation model) (Mo 2015; Req.1)
Req C.2: Natural language processing for semantic annotation of text (Mo 2015; Req.8)
Req C.3: Formal representation of mappings and version management
Req C.4: Use case driven support for prioritizing the mapping effort. The terminologist needs to know, within the list of the data elements of the mediation model that are not yet mapped to local data elements, the ones that need to be mapped in priority according to different criteria (e.g. data elements that are the most frequently used in distributed queries, data elements corresponding to a specific phenotype algorithm, etc.)
Req C.5: Mappings shall be accessible to any application through standardized semantic services based on new web technologies, such as Representational State Transfer (REST)-based APIs/web services, recently adopted by HL7.

7.2.1 Query model and language

In this section, we describe the characteristics of the EHR4CR Eligibility Criteria Model (EC Model) and ECLECTIC language regarding the 8 requirements stated in the A-Query model and language section of the conceptual framework.
Req A.1: Implement set operations and relational algebra for modeling phenotype algorithms, represent phenotype criteria with structured rules

The EHR4CR Eligibility Criteria Model (EC Model) is an extensible query model representing eligibility criteria in UML to meet the expressivity needs of computationally viable eligibility criteria. An ad-hoc language, ECLECTIC (Eligibility Criteria Language for European Clinical Trial Investigation and Construction), has been developed in order to ensure that it can express only queries that the object model can represent. The UML class diagram and language grammar are two alternative representations of the same model. The resultant object model, although hidden away from the user's eyes, lies at the heart of the query engine, and is key for model transformation and query serialization in different forms.

Req A.2: Support both human readable and computable representations of phenotype algorithms

ECLECTIC is also a human-readable serialization of the object hierarchy, which allows us to reason about the model and perform validation prior to implementation.

Req A.3: Support defining temporal relations between events

Basic temporal relationships are provided.

Req A.4: Provide representations for text searching and natural language processing

None.

Req A.5: Query language shall be generic and standard based

ECLECTIC is an ad-hoc query language.

Req A.6: Query builder shall be intuitive

Using the EHR4CR query builder (see Figure 24), a study manager can drag and drop data elements stored in the mediation model (marked as 1 in Figure 24) and logical and temporal operators (marked as 2 in Figure 24) in order to populate query templates designed for formally representing the eligibility criteria of the clinical trial (marked as 3 in Figure 24).

Figure 24. EHR4CR query builder demonstrating Protocol Feasibility Study module

Req A.7: Provide interfaces for external software algorithms

None.

Req A.8: Maintain backward compatibility

Not addressed.

7.2.2 Mediation model

In this section, we describe the characteristics of the EHR4CR Common Information Model (CIM) regarding the 10 desiderata stated in the previous section.

Req B.1: The mediation model shall be based on standard domain knowledge and reference models

The EHR4CR Common Information Model (CIM) consists of a set of multilingual semantic resources based on multiple standards (FHIR resources organized into categories based on HL7 CCD sections and UMLS semantic types).

Req B.2: Use of standard terminologies, ontologies and value sets that are multilingual and internationally used

EHR4CR templates are composed of data elements that are bound to a set of international reference terminologies selected by the project: ICD, SNOMED-CT, LOINC, ATC, ICD-O, Pubcan, TNM, PathLex. As much as possible, we enriched and/or merged reference terminologies in order to build multilingual terminologies and value sets (in English and French at least, and when possible in the four languages of the EHR4CR partners: English, French, German, and Polish).

Req B.3: Possible use of internally defined extensions of existing standard terminologies (in order to add any missing concept or any missing description in any specific language)

An EHR4CR terminology was created in order to create concepts that are in the scope of the project but do not exist in the selected reference terminologies.

Req B.4: Mappings between reference terminologies (e.g. SNOMED-MedDRA, SNOMED CT-NCI Thesaurus)

We integrated the UMLS CUI in order to allow multi-terminology binding.

Req B.5: Expressiveness

The current limited set of FHIR-based templates allows the representation of the main textual clinical data (signs, symptoms, diseases, outcomes, procedures, care plans, etc.). We defined context-dependent value sets for representing multiple views or contextual information (e.g. organ specific scores or histologic types, etc.).

Req B.6: Scoped to the needs of the users of the research network in the context of dedicated use cases but scalable and sustainable (designed to be rapidly and efficiently scoped to cover any new requirement, extensible in terms of structure and content)

The EHR4CR mediation model (EHR4CR CIM) has been developed, and can be extended, through a global consensus-based development process in order to cover the scope of both i) eligibility criteria and data items identified from a given set of specific clinical trials (bottom up approach resulting in the creation of useful data elements) and ii) standard reference clinical information models or data elements (e.g. CDISC SHARE) (top down approach). Although scoped to the needs of the users of the EHR4CR platform in the context of the three use cases of the project (PFS, PRS or CTE), its structure ensures its scalability, so that it can be extended in terms of both structure and content to cover any new need. The EHR4CR CIM was developed and evolved through repeated cycles using a "Learning by Doing" approach in order to cover the scope of the first 14 clinical trials selected to demonstrate the PFS use case, then of 17 additional clinical trials (PRS use case) and finally of 28 additional clinical trials (CTE use case). Each new version of the EHR4CR CIM has an extended scope and improved quality.

Req B.7: Standard formal languages allowing semantic reasoning

The semantic resources are stored in a semantic metadata repository (MDR). The metadata scheme is expressed in different languages, including RDF.

Req B.8: Version management

Version management is provided for any type of semantic resource (terminologies, value sets, data elements, templates).

Req B.9: Collaborative authoring tool

A tool was developed for authoring and maintaining the shared semantic resources of the mediation model.

Req B.10: Standard semantic services based on new web technologies

The semantic interoperability services (SIS) are developed to enable EHR4CR end-user services to access and consume the semantic resources of the mediation model (terminologies, value sets, data elements, templates) and the mappings. SIS are used at the workbench by the EHR4CR query builder for query specification (representation of free text eligibility criteria using the data elements of the mediation model) and at the EHR4CR endpoints for query transformation. This goal was realized via the expansion of the original functionality outlined in HL7's Common Terminology Service Release 2 (CTS2) Specification. The functional profiles of the SIS include capabilities for searching and querying code system content, value set content and template content. The technical specifications of the EHR4CR SIS rely on Representational State Transfer (REST)-based APIs/web services, recently adopted by HL7.

7.2.3 Standardization pipeline for data providers

In this section, we describe the characteristics of the EHR4CR standardization pipeline regarding the 5 requirements stated in the Standardization pipeline section of the conceptual framework.

Req C.1: Automatic mapping algorithms

Once hospital clinical data repositories (CDRs) are connected to the EHR4CR platform, source information models need to be mapped to the EHR4CR CIM. In the current state, the concepts used in the definitions of the central data elements were manually mapped to corresponding local terms used in pilot sites. Supporting tools are still under development. The current version of the Terminology Mapping Editor (TME) has limited functionality: it allows the Terminology Mapper to upload subsets of local value sets and to create their mappings to central value sets defined within the EHR4CR CIM.

Req C.2: Natural language processing for semantic annotation of text

None.

Req C.3: Formal representation of mappings and version management

Mappings are available in SKOS format. Version management is provided.

Req C.4: Use case driven support for prioritizing the mapping effort

None.

Req C.5: Standard semantic services

Mappings are available through REST-based APIs/web services.

8 Conclusion

With the development of platforms enabling the use of routinely collected clinical data in the context of international research, scalable solutions for cross border and cross domain semantic interoperability need to be developed. Expression language, underlying model of patient data and codification of eligibility concepts are essential constructs of a formal knowledge representation for eligibility criteria. There is currently an intense focus on developing and maintaining shareable, multipurpose, high-quality computable phenotype algorithms in order to mediate between different EHR products and clinical research systems.
8.1 The EHR4CR semantic interoperability platform

The EHR4CR semantic interoperability platform fulfills, at least partially, most of the 23 requirements of the proposed conceptual framework. The mediation model is based on multiple standards: standard models (HL7 FHIR templates, ISO 21090, ISO 11179), standard value sets and terminologies. Integrating these different multi-level standards is challenging, and terminology binding is an especially difficult issue, while contextual and versioning issues also need to be addressed. We developed specific data structures, faceted templates, to get a good balance between complexity (a limited set of generic templates) and expressiveness (major scalability in terms of structure and content thanks to the facets). As much as possible, we enriched and/or merged reference terminologies in order to build multilingual terminologies and define multilingual value sets (at least in the four languages spoken by the EHR4CR partners: English, French, German, and Polish). An EHR4CR terminology was created in order to create concepts that are in the scope of the project but do not exist in the selected reference terminologies.

We developed a collaborative editing tool handling the management of any type of the EHR4CR complex semantic resources (faceted templates, data elements, value sets, concepts from huge and complex terminologies, e.g. SNOMED CT) and of their relationships. We addressed the versioning issues for every type of resource, deriving CTS2 approaches for vocabulary updates. A Terminology Mapping Editor (TME), under development, enables participant EHRs to develop and maintain semantic mappings between their proprietary models and the mediation model. This tool is still in its infancy and does not yet fulfil the expected requirements (such as use case driven support for prioritizing the mapping effort, contextual terminology mapping, automatic mapping algorithms addressing multilingual issues). The semantic resources (mediation models and mappings) are accessible to any component of the EHR4CR platform through standardized semantic services based on new web technologies, such as Representational State Transfer (REST)-based APIs/web services, recently adopted by HL7.

Our current mediation model does not fully fulfil some of the ten requirements. We are considering integrating, in the future, terminology mappings between reference terminologies (e.g. mappings between SNOMED CT and MedDRA, NCI-T, ICD-9, ICD-10, ICD-O) in order to fully support multi-terminology binding. We are still working on representing multiple granularities, multiple consistent views and context. We plan to evaluate the FHIR resources currently being developed in order to represent multi-scale clinical data including molecular findings such as genomics information. We still need to define complex templates allowing the combination of basic templates. Developing a smart user interface for searching and/or browsing within complex semantic resources remains problematic. We also plan to improve the collaborative editing of these resources by medical experts using the GUI and/or CSV files. We are also working on an improved distribution model (with three modes: full, snapshots and/or deltas).
Regarding the data standardization process in hospitals, the Terminology Mapping Editor is still in its infancy and does not yet fulfil the expected requirements (such as use case driven support for prioritizing the mapping effort, contextual terminology mapping, and automatic mapping algorithms addressing multilingual issues).

Within the EHR4CR project, we identified the need for a governance body and process for ensuring the quality of the data standardization pipeline within the network. Since a set of complex and sometimes time-consuming activities is required at the hospital side at the connection phase (initial mapping to a core of semantic resources) and at the set-up phase of each new study (update of the mappings in the specific context of the study), it is important that those activities are well organized and properly synchronized with central efforts. Thus, it is not just a matter of the content scope of the semantic resources but also a matter of reaching agreements on how they are represented and accessed. The governance body and process will be especially important in the context of any operational use of the EHR4CR platform at a broader scale within an extended network.

8.2 Limits, related projects and perspectives

Expression languages employed to represent eligibility logic include ad hoc expressions (with or without the use of templates), the Arden Syntax, logic-based languages (i.e., PAL, SQL, and DL), object-oriented languages (i.e., GELLO), and temporal query languages (e.g., Asbru and Chronus II) [Weng10]. Ad hoc formalisms are functional in many existing systems and can provide interesting features regarding expressiveness. SQL-based queries on a clinical database are expressive but not extensible for knowledge re-use or inference. These mechanisms all suffer from a lack of scalability. Multiple query languages were used for different types of logic within the same model or system.
Ontologies are increasingly being used as common terminological resources to automatically reconcile data heterogeneity and implement large-scale, distributed data management systems. Ontology-aware query interfaces that are integrated into EHR systems can subsequently leverage the ontology annotations to support extensive query answering functionalities [Sahoo14].

Over the past decade, medical informatics researchers have been studying issues related to clinical information models associated with terminologies and have begun to articulate some requirements for high-quality models [Ahn13, Weng10, Mo15]. Several efforts try to address the interoperability between the clinical research and patient care domains by building a common data model, where the interoperating systems are required to interact through this well-defined mediation model. In this top-down approach, agreement on a top-level knowledge model is imposed on the underlying data models of the interoperating parties for successful data exchange. Some projects adopting this top-down strategy proposed solutions that have been carried forward into practice, and new experience has been gained: OMOP CDM [Reisinger10], FDA Mini-Sentinel [Curtis12], I2B2-SHRINE [Kohane12, McMurry13], STRIDE [Lowe09], eMERGE [Pathak11, Newton13, Herr15], SHARPn [Rea12, Pathak13], the TRANSFoRm project [Delaney15] and other initiatives [Weng10, Sinaci13, Shivade14, Jiang15].

CDISC SHARE is an important initiative addressing the interoperability between care and research domains by maintaining common data elements built upon the BRIDG DAM, where they are annotated with CDISC data sets like CDASH and SDTM, and other CDISC terminologies [SHARE15]. CDISC SHARE CDEs will be considered for enriching the EHR4CR mediation model. In the SALUS project, Sinaci et al. also applied a comprehensive set of semantic web technologies with the commonly adopted MDR standard ISO/IEC 11179. In addition, they built a federated semantic MDR framework and demonstrated that it was possible to semantically link disparate CDE definition efforts by different organizations [Sinaci13].
The EHR4CR project developed an instance of a platform providing communication, security and semantic interoperability services to the eleven participating hospitals located in five European countries and ten pharmaceutical companies [Coorevits13, De Moor14]. According to an evaluation framework covering the query languages, the mediation model and the standardization pipeline, the EHR4CR semantic interoperability platform fulfills most of the requirements.

Regarding the mediation model, some requirements remain problematic. The scope of the EHR4CR mediation model needs to be continuously adapted to the users' needs. Since the update can hardly be fully automated (e.g. through automatic coding of free text clinical trial protocols), a collaborative editor needs to efficiently support the creation of new semantic resources scoped to any additional use case. Despite recent efforts, formal representation of multimodal and multi-level data supporting data interoperability across clinical research and care domains is still challenging.

Terminology mapping in hospital sites is the major bottleneck of the data standardization pipeline. Supportive tools are still in their infancy. Semantic interoperability within a broad international research network reusing clinical data from EHRs requires a rigorous governance process to ensure the quality of the data standardization process.

8.3 References

1. Ahn S, Huff SM, Kim Y, Kalra D. Quality metrics for detailed clinical models. Int J Med Inform May;82(5):
2. Coorevits P, Sundgren M, Klein GO, Bahr A, Claerhout B, Daniel C, et al. Electronic health records: new opportunities for clinical research. J Intern Med. Dec 2013;274(6):
3. Cuggia M, Besana P, Glasspool D. Comparing semi-automatic systems for recruitment of patients to clinical trials. Int J Med Inform 2011;80:
4. Curtis LH et al. Design considerations, architecture, and use of the Mini-Sentinel distributed data system. Pharmacoepidem Drug Saf 2012;21:
5. Delaney BC, Curcin V, Andreasson A, Arvanitis TN, Bastiaens H, Corrigan D, Ethier JF, Kostopoulou O, Kuchinke W, McGilchrist M, van Royen P, Wagner P. Translational Medicine and Patient Safety in Europe: TRANSFoRm-Architecture for the Learning Health System in Europe. Biomed Res Int. 2015;2015:
6. De Moor GD, Sundgren M, Kalra D, Schmidt A, Dugas M, Claerhout B, Karakoyun T, Ohmann C, Lastic PY, Ammour N, Kush R, Dupont D, Cuggia M, Daniel C, Thienpont G, Coorevits P. Using Electronic Health Records for Clinical Research: the Case of the EHR4CR Project. J Biomed Inform Oct
7. El Fadly A, Rance B, Lucas N, et al. Integrating clinical research with the Healthcare Enterprise: from the RE-USE project to the EHR4CR platform. J Biomed Inform 2011;44 Suppl 1:S
8. ERGO: a template-based expression language for encoding eligibility criteria, < ERGO_Technical_Documentation.pdf/>; 2009 [accessed ].
9. Gennari J, Sklar D, Silva J. Cross-tool communication: from protocol authoring to eligibility determination. In: Proceedings of the AMIA symposium; p
10. Herr TM et al. Practical considerations in genomic decision support: The eMERGE experience. J Pathol Inform Sep 28;6:
11. Jenders R, Sujansky W, Broverman C, Chadwick M. Towards improved knowledge sharing: assessment of the HL7 reference information model to support medical logic module queries. In: Proceedings of the AMIA annual fall symposium; p
12. Jiang G, Evans J, Oniki TA, Coyle JF, Bain L, Huff SM, Kush RD, Chute CG. Harmonization of detailed clinical models with clinical study data standards. Methods Inf Med Jan 12;54(1):
13. Kohane IS, Churchill SE, Murphy SN. A translational engine at the national scale: informatics for integrating biology and the bedside. J Am Med Inform Assoc 2012;19:
14. Lowe HJ et al. STRIDE: an integrated standards-based translational research informatics platform. AMIA Annu Symp Proc 2009;2009:
15. McMurry AJ, Murphy SN, MacFadden D, Weber G, Simons WW, Orechia J, et al. SHRINE: enabling nationally scalable multi-site disease studies. PLoS ONE. 2013;8(3):e
16. Mo et al. Desiderata for computable representations of electronic health records-driven phenotype algorithms. J Am Med Inform Assoc Nov;22(6):
17. Nammuni K, Pickering C, Modgil S, Montgomery A, Hammond P, Wyatt JC, et al. Design-a-trial: a rule-based decision support system for clinical trial design. Knowl-Based Syst 2004;17:
18. Newton KM, Peissig PL, Kho AN, Bielinski SJ, Berg RL, Choudhary V, et al. Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. J Am Med Inform Assoc. Jun 2013;20(e1):e
19. Niland J. ASPIRE: agreement on standardized protocol inclusion requirements for eligibility. In: An unpublished web resource;
20. Pathak J et al. Mapping clinical phenotype data elements to standardized metadata repositories and controlled terminologies: the eMERGE Network experience. J Am Med Inform Assoc 2011;18:
21. Pathak J, Bailey KR, Beebe CE, Bethard S, Carrell DC, Chen PJ, et al. Normalization and standardization of electronic health records for high-throughput phenotyping: the SHARPn consortium. J Am Med Inform Assoc. Dec 2013;20(e2):e
22. Rea S, Pathak J, Savova G, Oniki TA, Westberg L, Beebe CE, et al. Building a robust, scalable and standards-driven infrastructure for secondary use of EHR data: the SHARPn project. J Biomed Inform Aug;45(4):
23. Reisinger SJ et al. Development and evaluation of a common data model enabling active drug safety surveillance using disparate healthcare databases. J Am Med Inform Assoc 2010;17:
24. Sahoo SS, Lhatoo SD, Gupta DK, Cui L, Zhao M, Jayapandian C, Bozorgi A, Zhang GQ. Epilepsy and seizure ontology: towards an epilepsy informatics infrastructure for clinical research and patient care. J Am Med Inform Assoc Jan-Feb;21(1):
25. Schreiweis B, Trinczek B, Köpcke F, Leusch T, Majeed RW, Wenk J, Bergh B, Ohmann C, Röhrig R, Dugas M, Prokosch HU. Comparison of electronic health record system functionalities to support the patient recruitment process in clinical trials. Int J Med Inform Nov;83(11):
26. Shivade C, Raghavan P, Fosler-Lussier E, Embi PJ, Elhadad N, Johnson SB, Lai AM. A review of approaches to identifying patient phenotype cohorts using electronic health records. J Am Med Inform Assoc Mar-Apr;21(2):
27. Sinaci AA, Laleci Erturkmen GB. A federated semantic metadata registry framework for enabling interoperability across clinical research and care domains. J Biomed Inform. Oct 2013;46(5):
28. Weng C, Tu SW, Sim I, et al. Formal representation of eligibility criteria: a literature review. J Biomed Inform 2010;43:

Appendix

9.1 List of clinical trials

CT_Id (CT_IdPharma) | CT_Pharma | CT_Description | CT_EHR4CRUseCase | CT_Hospitals | CT_StartDate | CT_Name | CT_diseaseArea
NCT | AstraZeneca | Evaluation of the Effect of Dapagliflozin in Combination With Metformin on Body Weight in Subjects With Type 2 Diabetes | CTE | | | | Diabetes
NCT | Janssen | CANVAS - CANagliflozin cardiovascular Assessment Study | CTE | | | CANVAS | Diabetes
NCT | GSK | An Exercise Endurance Study to Evaluate the Effects of Treatment of Chronic Obstructive Pulmonary Disease (COPD) Patients With a Dual Bronchodilator: GSK573719/GW Study B | CTE | | | | Respiratory
NCT | Amgen | Denosumab Compared to Zoledronic Acid in the Treatment of Bone Disease in Subjects With Multiple Myeloma | CTE | | | | Oncology
NCT | GSK | A Randomized, Double Blind, Placebo Controlled, Incomplete Block, Crossover, Dose Ranging Study to Evaluate the Dose Response of GSK Administered Once or Twice Daily Over 7 Days in Patients With Chronic Obstructive Pulmonary Disease (COPD) (AC ) | CTE | | | | Respiratory
NCT | Janssen | A Study of Doripenem in Infants Less Than 12 Weeks of Age | CTE | | | | Infections
NCT | GSK | Clinical Study Evaluating Safety and Efficacy of Fluticasone Furoate and Fluticasone Propionate in People With Asthma | CTE | | | | Respiratory
NCT | Janssen | Study of Paliperidone Palmitate 3 Month and 1 Month Formulations for the Treatment of Patients With Schizophrenia | CTE | | | | Psychiatric

NCT | GSK | Evaluate the Safety, Efficacy and Dose Response of GSK in Combination With Fluticasone Furoate in Subjects With Asthma (ILA115938) | CTE | | | | Respiratory
NCT | Bayer | Radium(223) Dichloride (Alpharadin) in Castration-Resistant (Hormone-Refractory) Prostate Cancer Patients With Bone Metastases | CTE | | | | Oncology
NCT | Novartis | Exploring the Efficacy and Safety of Siponimod in Patients With Secondary Progressive Multiple Sclerosis (EXPAND) | CTE | | | | Neuroscience
NCT | Bayer | Oral Rivaroxaban in Children With Venous Thrombosis (EINSTEINJunior) | CTE | | | | Cardiovascular
NCT | GSK | Efficacy and Safety Study of Mepolizumab Adjunctive Therapy in Subjects With Severe Uncontrolled Refractory Asthma | CTE | | | | Respiratory
NCT | Novartis | Study of the Safety and Efficacy of LCZ696 on Arterial Stiffness in Elderly Patients With Hypertension | CTE | | | | Cardiovascular
NCT | GSK | A Study to Assess the Efficacy of Fluticasone Furoate/Vilanterol (FF/VI) Inhalation Powder 100/25 mcg Once Daily Compared With Fluticasone Propionate/Salmeterol Inhalation Powder 250/50 mcg Twice Daily in Subjects With Chronic Obstructive Pulmonary Disease (COPD) | CTE | | | | Respiratory
NCT | Novartis | Efficacy and Safety of QGE031 versus Placebo and Omalizumab in Patients Aged Years With Asthma | CTE | | | | Respiratory
NCT | AstraZeneca | A Study Comparing Cardiovascular Effects of Ticagrelor and Clopidogrel in Patients With Peripheral Artery Disease (EUCLID) | CTE | | | | Diabetes
NCT | GSK | The Purpose of This Study is to Evaluate the Spirometric Effect (Trough FEV1) of Umeclidinium/Vilanterol 62.5/25 mcg Once Daily Compared With Tiotropium 18 mcg Once Daily Over a 24-week Treatment Period in Subjects With COPD | CTE | | | | Respiratory

NCT | Novartis | Efficacy and Safety of Two Treatment Regimens of 0.5 mg Ranibizumab Intravitreal Injections Guided by Functional and/or Anatomical Criteria, in Patients With Neovascular Age-related Macular Degeneration (OCTAVE) | CTE | | | | Ophthalmology
NCT | Novartis | QVA vs. Salmeterol/Fluticasone, 52-week Exacerbation Study | CTE | | | | Respiratory
NCT | Janssen | A Study Exploring Two Strategies of Rivaroxaban (JNJ ; BAY ) and One of Oral Vitamin K Antagonist in Patients With Atrial Fibrillation Who Undergo Percutaneous Coronary Intervention (PIONEER AF-PCI) | CTE | | | | Cardiovascular
NCT | Janssen | An Effectiveness and Safety Study of Inhaled JNJ (RV568) in Patients With Moderate to Severe Chronic Obstructive Pulmonary Disease | CTE | | | | Respiratory
NCT | Janssen | Telaprevir With Peginterferon Alfa & Ribavirin in Ex-People Who INject Drugs Infected by Genotype 1 Chronic Hepatitis C (INTEGRATE) | CTE | | | | Infections
NCT | Bayer | Fixed Dose Correction / naïve and Pre Dialysis (Europe and Asia Pacific) (DIALOGUE 1) | CTE | | | | Renal
NCT | Bayer | Multiple Dose Study in Heart Failure of BAY (PARSiFAL) | CTE | | | | Cardiovascular
 | Amgen | Data Element Standards | CTE | | | | All
 | Lilly | ODM Library v2011 | CTE | | | | All
 | Novartis | Data Element Frequency Count | CTE | | | | All
NCT | Amgen | | PFS | | | EVOLVE | Cardiovascular (Secondary Hyperparathy
NCT | Bayer | | PFS | | | | Cardiovascular

NCT | GSK | | PFS | | | |
NCT | Novartis | | PFS | | | |
NCT | AstraZeneca | | PFS | | | |
NCT | Merck | | PFS | | | |
NCT | Janssen | | PFS | | | |
NCT | Sanofi | | PFS | | | |
NCT | GSK | | PFS | | | |
NCT | Novartis | | PFS | | | |
NCT | Roche | | PFS | | | |
NCT | GSK | | PFS | | | |
NCT | Eli Lilly | | PFS | | | |
NCT | Sanofi | Cabazitaxel at 20 mg/m² Compared to 25 mg/m² With Prednisone for the Treatment of Metastatic Castration Resistant Prostate Cancer (PROSELICA) | PFS, CTE | | | |
 | Amgen | | PRS | | | |
CT_diseaseArea values listed for these rows: Oncology; Cardiovascular and Metabolic; Cardiovascular (Acute Decompensat; Oncology; Prostate cancer; Neurology (Parkinson); Oncology; Neurology

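 | AstraZeneca | | PRS | AP-HP, UNIVDUN | | |
 | Bayer | | PRS | | | |
 | Bayer | | PRS | KCL | | |
 | Bayer | | PRS | | | |
 | JandJ | | PRS | | | |
 | Novartis | | PRS | | | |
 | Novartis | | PRS | WWU | | |
 | Novartis | | PRS | | | |
 | Roche | | PRS | FAU, WWU | | |
 | Sanofi | | PRS | | | |
 | Bayer | | PRS | | | |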

 | Roche | | PRS | | | |
 | Sanofi | | PRS | FAU | | |
 | Bayer | | PRS | FAU | | |
 | Sanofi | | PRS | AP-HP, UNIVDUN | | |
 | Novartis | | PRS | WWU, HUG | | |
CT_Name values listed for these rows: OCTAVE; GetGoal Duo-2; Proselica

9.2 Detailed Functional Model for each Interface of the Semantic Interoperability Services (SIS)

9.3 Business Scenarios

Scenario A: cts2:codesystem

Scenario A: cts2:codesystem Model

Figure a.1: UML Model


For a data repository, a cts2:codesystem (CS) is not directly a reference. As a CS changes over time (updates and changes), the cts2:codesystemversion is used as the reference inside the Data Repository (see next figure). A cts2:codesystemconceptversion is an entity that is part of a cts2:codesystemversion. The IRI should keep the version information for a concept. The same concept present in two versions may carry the same official code (e.g. code = A03), but it must have two different IRIs, one per version (umls2013aa:atc#a03 and umls2014aa:atc#a03). The basic organization (hierarchy) is illustrated in the next figure.

Scenario B: Scenarios about cts2:valueset

Figure b.1: UML Model
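Returning to the versioning rule of Scenario A, the following minimal Python sketch shows one way to derive a version-specific IRI from a code system version and an official code. The class name `ConceptVersion` and its fields are hypothetical and chosen for illustration only; the IRI layout simply mirrors the `umls2013aa:atc#a03` example given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptVersion:
    """Hypothetical stand-in for a cts2:codesystemconceptversion."""
    code_system_version: str  # e.g. "umls2013aa:atc" (version is part of the prefix)
    code: str                 # official code, stable across versions, e.g. "a03"

    @property
    def iri(self) -> str:
        # The version information is kept inside the IRI itself.
        return f"{self.code_system_version}#{self.code}"


# Same official code in two code system versions, hence two distinct IRIs.
old = ConceptVersion("umls2013aa:atc", "a03")
new = ConceptVersion("umls2014aa:atc", "a03")
```

With this layout, a data repository can store `old.iri` as its reference and remain unaffected when a later version of the code system is loaded.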

9.3.3 Scenario C: Scenarios about hl7:template


9.3.3.1 TemplateVersion Organization

TemplateVersions are organized (as ordered members) inside Categories. Categories can themselves be organized (as ordered members) inside other Categories. Figure A illustrates such an organization, used to reach the different TemplateVersions inside an instance.

Reach a template

Template Illustration: General View
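The ordered nesting of Categories and TemplateVersions described above can be sketched as a small tree walk. This is an illustrative Python model only: the class names, fields and sample IRIs are assumptions, not the EHR4CR data model.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class TemplateVersion:
    iri: str          # hypothetical identifier
    pref_label: str


@dataclass
class Category:
    iri: str
    pref_label: str
    # Ordered members: either sub-level Categories or TemplateVersions.
    members: List[Union["Category", TemplateVersion]] = field(default_factory=list)


def walk_templates(category: Category) -> Iterator[TemplateVersion]:
    """Yield every TemplateVersion reachable from a Category, in member order."""
    for member in category.members:
        if isinstance(member, Category):
            yield from walk_templates(member)
        else:
            yield member


# Illustrative instance: one root Category with two sub-level Categories.
root = Category("ex:cat/root", "All templates", [
    Category("ex:cat/findings", "Findings", [
        TemplateVersion("ex:tpl/diagnosis/v1", "Diagnosis"),
        TemplateVersion("ex:tpl/observation/v1", "Observation"),
    ]),
    Category("ex:cat/acts", "Acts", [
        TemplateVersion("ex:tpl/procedure/v1", "Procedure"),
    ]),
])
```

Because members are kept in lists, the order in which templates are reached is the order in which they were registered in each Category.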

Template Implementation Examples

Diagnosis Template Example

Observation Template Example

Procedure Template Example

9.3.4 Detailed Functional Model for each Interface

Scenario A: Search/query scenarios for cts2:codesystem

Service A.1: Get a cts2:codesystemversion
Description: From a unique id, the service returns the details of a cts2:codesystemversion.
Input: IRI, code or acronym of a cts2:codesystemversion.
Output: 1. Details (preflabel, oid, iri) of a cts2:codesystemversion. 2. Link to the first-level cts2:codesystemconceptversion (= skos:topconcept).
Precondition: An id (IRI, code, acronym) is known by the Client.
Postconditions: Details of a cts2:codesystemversion and its first-level hierarchy.
Exception Conditions: The id is unknown by the System.
Aspects left to Technical Specification: none
Relationship to Levels of Conformance: none
Miscellaneous Notes: none
Other relevant content: none
Associated Scenario: A, C

Service A.2: Get a cts2:codesystemconceptversion
cts2:codesystemconceptversion entities are the organized concepts that compose a cts2:codesystemversion.
Description: From a unique id, the service returns the details of a cts2:codesystemconceptversion.
Input: IRI or code of a cts2:codesystemconceptversion.
Output: 1. Details (preflabel, oid, iri) of a cts2:codesystemconceptversion. 2. Links to potential narrower cts2:codesystemconceptversion [IF EXISTS].
Precondition: An id (IRI, code) is known by the Client.
Postconditions: A detailed cts2:codesystemconceptversion.
Exception Conditions: The id is unknown by the System.
Aspects left to Technical Specification: none
Relationship to Levels of Conformance: none
Miscellaneous Notes: none
Other relevant content: none
Associated Scenario: A, C

Scenario B: Search/query scenarios for cts2:valueset

cts2:valuesetversion entities are ordered Collections. A Value Set contains elements. These elements are named fhir:concept.

Service B.1: Get a cts2:valuesetversion
Description: From a unique id, the service returns the details of a cts2:valuesetversion.
Input: IRI or OID of a cts2:valuesetversion.
Output: Details and content (prefLabel, OID, IRI) of a cts2:valuesetversion [see Notes for details].
Precondition: An id (IRI, OID) is known by the Client.
Postconditions: A cts2:valuesetversion and its members as a Collection (List).
Exception Conditions: The id is unknown by the System.
Aspects left to Technical Specification: none
Relationship to Levels of Conformance: none
Miscellaneous Notes: The members of the Value Set are typed as fhir:concept entities.
Other relevant content: none
Associated Scenario: B, C

Service B.2: Get a fhir:concept
fhir:concept entities are the values contained in a cts2:valuesetversion. A fhir:concept can be a link towards a cts2:codesystemconceptversion OR towards a cts2:codesystemversion. A fhir:concept can contain a numeric VALUE representing the rank of the element inside the Value Set. A fhir:concept can contain a link towards a cts2:dataelementversion.
Description: From a unique id, the service returns the details of a fhir:concept.
Input: IRI or OID of a fhir:concept.
Output: 1. Details (preflabel, oid, iri) of a fhir:concept. 2. Link to a CodeSystem element (a cts2:codesystemconceptversion OR a cts2:codesystemversion). 3. Link to a cts2:dataelementversion [IF EXISTS]. 4. Rank inside the collection [IF EXISTS].
Precondition: An id (IRI, OID) is known by the Client.
Postconditions: A fhir:concept and its relations.
Exception Conditions: The id is unknown by the System.
Aspects left to Technical Specification: none
Relationship to Levels of Conformance: none
Miscellaneous Notes: none
Other relevant content: none
Associated Scenario: B, C

Scenario C: Search/query scenarios for hl7:template

Service C.1: Get a Category
A Category is an entity that organizes and classifies (in order) templates or other levels of Category. A Category is an Ordered Collection that can contain either an ordered list of sub-level Categories or an ordered list of TemplateVersions.
Description: From a unique id, the service returns the details of a Category and its potential ordered members, that is, Categories OR TemplateVersions.
Input: IRI, OID or ACRONYM of a Category.
Output: 1. Details (preflabel, oid, iri, acronym) of a Category. 2. Ordered members = Categories (IF EXIST) OR ordered members = TemplateVersions (IF EXIST).
Precondition: An id (IRI, OID or ACRONYM) is known by the Client.
Postconditions: A Category and its potential ordered members are known by the Client.
Exception Conditions: The id is unknown by the System.
Aspects left to Technical Specification: none
Relationship to Levels of Conformance: none
Miscellaneous Notes: With this service, the client is able to know whether the Category contains a sub-level of other Categories OR contains TemplateVersions.
Other relevant content: none
Associated Scenario: C

Service C.2: Get a TemplateVersion
A TemplateVersion corresponds to one unique version of an HL7:Template. An HL7:Template corresponds to a pattern of constraints defining a context that should be used to express some information. Typically a template is used to create form fields with a boolean, a list, a set of data, etc. A template always contains a link to one DataElementVersion.
Description: From a unique id, the service returns the details of a TemplateVersion.
Input: IRI or OID of a TemplateVersion.
Output: 1. Details (preflabel, oid, iri) of a TemplateVersion. 2. Link to a DataElementVersion.
Precondition: An id (IRI, OID) is known by the Client.
Postconditions: A TemplateVersion and its relationship to a DataElementVersion.
Exception Conditions: The id is unknown by the System.
Aspects left to Technical Specification: none
Relationship to Levels of Conformance: none
Miscellaneous Notes: none
Other relevant content: none
Associated Scenario: C

Service C.4: Get a DataElementVersion
Description: From a unique id, the service returns the details of a DataElementVersion.
Input: IRI or OID of a DataElementVersion.
Output: 1. Details (preflabel, oid, iri) of a DataElementVersion [see Notes for details]. 2. Link to a ValueSetVersion [IF EXISTS].
Precondition: An id (IRI, OID) is known by the Client.
Postconditions: A DataElementVersion and its relation to a potential ValueSetVersion.
Exception Conditions: The id is unknown by the System.
Aspects left to Technical Specification: none
Relationship to Levels of Conformance: none
Miscellaneous Notes: Details of a Data Element should express the Conceptual Space Property (CODE, VALUE) and the Data Type (CD, CO, BOOLEAN, etc.; ISO Data Types).
Other relevant content: none

Associated Scenario: C

Service C.5: Get a ValueSetVersion: cf. service B
Service C.6: Get a fhir:concept: cf. service B
Service C.7: Get a cts2:codesystemversion: cf. service A
Service C.8: Get a cts2:codesystemconceptversion: cf. service A

Semantic services used by SDM/ODM editor

Introduction

The CDISC SDM (Study/Trial Design Model) 1.0 standard is based on the CDISC ODM (Operational Data Model), which it extends. Both are XML standards which allow machine-readable, interchangeable descriptions of the study design and the data collection. In the following sections the abbreviation SDM-ODM may occur to indicate this extension mechanism. The ODM elements can be used within one of the three SDM constructs: Structure (Arms, Activities, etc.), Workflow (decision points, branches, etc.) and Timing. Some SDM elements can be used as standalone definitions and re-used as ODM annotations. These are, for example, the Summary Parameters or the Inclusion/Exclusion criteria definitions. The following chapters describe the use of the CDISC SDM-ODM standard within the EHR4CR scenario 2 and scenario 3 contexts and the integration into third party SDM-ODM editors.

Usage of a SDM-ODM container

The SDM-ODM standard holds the information required to electronically describe a study protocol. The standard allows partial definitions, which lets the EHR4CR project include only the elements needed to fulfil the scope of scenario 2 and scenario 3. Nevertheless, the Study or Data Manager should consider having a mechanism to complete the SDM-ODM representation created by third party SDM-ODM tools. The idea of having a minimal SDM-ODM container is based on the CDISC example 2.3- ODMShell.xml, which can be downloaded from the CDISC website. The container structure can be represented with only ODM element definitions.

<ODM>
  <Study OID="SAMPLE_STUDY">
    <GlobalVariables>
      <StudyName>CDISC Study Design Prototype</StudyName>
      <StudyDescription>A sample study</StudyDescription>
      <ProtocolName>SDM (Prototype)</ProtocolName>
    </GlobalVariables>
    <MetaDataVersion>
      <Protocol> </Protocol>
    </MetaDataVersion>
  </Study>
</ODM>

Table 3 SDM-ODM container

The SDM element definitions are placed in the <Protocol> tag and refer to the ODM definitions.

SDM elements for patient recruitment

Reflecting the patient recruitment in scenario 2, the eligibility criteria are mandatory elements within the EHR4CR scope. These criteria are defined in ODM as free text conditions which return true or false. Each free text criterion should have a unique OID within the SDM-ODM definition file. In combination with the ODM Study OID, a globally unique OID can be created. The use of multiple languages is possible and should be addressed.

<MetaDataVersion>
  <ConditionDef Name="Informed consent obtained" OID="co_ic">
    <Description>
      <TranslatedText xml:lang="en">Written informed consent obtained.</TranslatedText>
    </Description>
  </ConditionDef>
</MetaDataVersion>

Table 4 Free text eligibility criteria

These ODM conditions can now be referenced by Inclusion/Exclusion SDM elements.

<sdm:InclusionExclusionCriteria>
  <Description>
    <TranslatedText xml:lang="en">Eligibility criteria</TranslatedText>
  </Description>
  <sdm:InclusionCriteria>
    <sdm:Criterion OID="crit_ic" Name="Written informed..." ConditionOID="co_ic" />
  </sdm:InclusionCriteria>
  <sdm:ExclusionCriteria>
    <sdm:Criterion OID="crit_age" Name="Patient &lt;18" ConditionOID="co_age" />
  </sdm:ExclusionCriteria>
</sdm:InclusionExclusionCriteria>

Table 5 SDM Inclusion/Exclusion criteria representation

SDM elements like Structure, Workflow and Timing are of limited use in the EHR4CR scope. As XML-based standards are also human-readable, the additional elements can be added using a normal text editor or an advanced XML editor, but more specialized SDM-ODM editors are advised.

SDM-ODM extension for third party SDM-ODM designer

A third party SDM-ODM editor should be able to create an SDM-ODM container and to load a valid SDM-ODM file in order to complete values such as the eligibility criteria before the file is uploaded into the EHR4CR platform. The tool therefore needs a menu bar with entries to create, open and save an SDM-ODM file. A file dialog box should make it possible to browse the file system. The first time the file is created by the local workbench, the CreationDateTime attribute of the ODM tag should be completed in the ISO 8601 format, like T11:07:23+01:00. The FileType and Granularity as well as the XML namespaces are fixed attributes in this case.

<ODM xmlns= xmlns:sdm= FileType="Transactional" Granularity="All" CreationDateTime=" T11:07:23+01:00">
</ODM>

Table 6 First SDM-ODM output

Global Definitions = protocol

The local workbench should be able to add and edit an SDM-ODM container (Table 3) with the following elements and attributes:

File (OID, Description)
Study (OID, Name, Description) and Protocol Name
MetaDataVersion (OID, Name, Description)
Supported Languages
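As an illustration of how a local workbench might produce the minimal container of Table 3 together with the fixed header attributes of Table 6, here is a hedged Python sketch using only the standard library. The function name `build_container` and the sample values are hypothetical, and the XML namespace declarations are deliberately left out because their values are not given in this section.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone


def build_container(study_oid, study_name, description, protocol_name, created):
    """Build a minimal SDM-ODM container (namespaces omitted in this sketch)."""
    odm = ET.Element("ODM", {
        "FileType": "Transactional",   # fixed attribute per the text above
        "Granularity": "All",          # fixed attribute per the text above
        # ISO 8601 timestamp with UTC offset, set once at file creation.
        "CreationDateTime": created.isoformat(timespec="seconds"),
    })
    study = ET.SubElement(odm, "Study", OID=study_oid)
    global_vars = ET.SubElement(study, "GlobalVariables")
    ET.SubElement(global_vars, "StudyName").text = study_name
    ET.SubElement(global_vars, "StudyDescription").text = description
    ET.SubElement(global_vars, "ProtocolName").text = protocol_name
    metadata = ET.SubElement(study, "MetaDataVersion")
    ET.SubElement(metadata, "Protocol")  # SDM definitions would be placed here
    return odm


# Arbitrary example timestamp with a +01:00 offset (an assumption for the demo).
created = datetime(2016, 2, 29, 11, 7, 23, tzinfo=timezone(timedelta(hours=1)))
container = build_container(
    "SAMPLE_STUDY", "CDISC Study Design Prototype",
    "A sample study", "SDM (Prototype)", created,
)
xml_text = ET.tostring(container, encoding="unicode")
```

Passing a timezone-aware `datetime` is what makes `isoformat` emit the `+01:00` offset; a naive `datetime` would produce a timestamp without one.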


Figure 1 SDM-ODM Global Definitions

9.4.5.1 Eligibility criteria and conditions

The CDISC SDM standard uses ODM ConditionDefs to describe the eligibility of a patient. These ODM ConditionDefs can also be referenced by ODM ItemRefs as CollectionExceptionConditionOIDs. The conditions should therefore be shown as separate elements in a list of their own. Via a context menu, the elements can be added, changed and deleted. The dialog box shown in Figure 2 helps the user to enter the free text eligibility criteria for the languages defined in Figure 1.

Figure 2 Handling of ODM ConditionDefs

The last step needed to fulfil the SDM-ODM requirements for the eligibility criteria is the relation between the Inclusion/Exclusion criteria and the ConditionDefs. A table representation of the criteria could be filled with the condition OIDs via Drag & Drop.
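The relation between criteria and ConditionDefs can also be checked mechanically. The sketch below, in Python with the standard library only, lists every criterion whose ConditionOID does not resolve to a ConditionDef in the same metadata. The namespace URI `urn:example:sdm` is a hypothetical placeholder (the real SDM namespace is not given in this section), and the function name is illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholder; replace with the SDM namespace of the actual file.
SDM_NS = "urn:example:sdm"

SAMPLE = """
<MetaDataVersion xmlns:sdm="urn:example:sdm">
  <ConditionDef OID="co_ic" Name="Informed consent obtained"/>
  <sdm:InclusionExclusionCriteria>
    <sdm:InclusionCriteria>
      <sdm:Criterion OID="crit_ic" Name="Written informed..." ConditionOID="co_ic"/>
    </sdm:InclusionCriteria>
    <sdm:ExclusionCriteria>
      <sdm:Criterion OID="crit_age" Name="Patient &lt;18" ConditionOID="co_age"/>
    </sdm:ExclusionCriteria>
  </sdm:InclusionExclusionCriteria>
</MetaDataVersion>
"""


def unresolved_criteria(metadata_xml: str) -> list:
    """Return the OIDs of criteria that point to an undefined ConditionDef."""
    root = ET.fromstring(metadata_xml.strip())
    defined = {cond.get("OID") for cond in root.iter("ConditionDef")}
    missing = []
    for criterion in root.iter(f"{{{SDM_NS}}}Criterion"):
        if criterion.get("ConditionOID") not in defined:
            missing.append(criterion.get("OID"))
    return missing


dangling = unresolved_criteria(SAMPLE)
```

In the sample, `co_age` is referenced but never defined, so the exclusion criterion is reported; an editor could run such a check before saving the file.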

In two separate groups, the attributes of the Inclusion or Exclusion Criteria, such as Criterion OID or Name, can be entered. In the background, the XML structure of Table 5 (SDM Inclusion/Exclusion criteria representation) should be created. The created structure must be saved via File / Save File in the menu bar.

Figure 3 Eligibility criteria in SDM-ODM

Queries

Query element exists

If a query element exists in the Query Builder, the final ECLECTIC query statement can be saved within the ODM metadata file using the FormalExpression XML tag.

Missing data elements

In order to create new elements, the study designer has to switch to the Data Element editor to update the EHR4CR central terminology. The FormalExpression element should get the context attribute EHR4CR and the query as text inside the node.

<ConditionDef Name="Age" OID="co_age">
  <Description>
    <TranslatedText xml:lang="en">Age</TranslatedText>
  </Description>
  <FormalExpression Context="EHR4CR">QUERY</FormalExpression>
</ConditionDef>

Table 7 Add ECLECTIC to ODM conditions

EHR4CR Data Element annotation in ODM

In order to prepopulate the ODM ClinicalData, the ODM MetaData file should be enriched with the EHR4CR Data Element codes taken from the termapp. The mechanism used for this annotation would be the CDISC ODM Alias XML tag.
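For the FormalExpression step shown in Table 7, a small Python sketch (standard library only) may help. It attaches a query string to a ConditionDef under the EHR4CR context and replaces an existing expression for that context instead of adding a second one. The function name `attach_query` is hypothetical, and the query text is treated as an opaque string.

```python
import xml.etree.ElementTree as ET


def attach_query(condition_def, query_text, context="EHR4CR"):
    """Set the FormalExpression of a ConditionDef for one context."""
    for expression in condition_def.findall("FormalExpression"):
        if expression.get("Context") == context:
            expression.text = query_text  # update the existing expression
            return expression
    expression = ET.SubElement(condition_def, "FormalExpression", Context=context)
    expression.text = query_text
    return expression


condition = ET.fromstring(
    '<ConditionDef Name="Age" OID="co_age">'
    '<Description><TranslatedText xml:lang="en">Age</TranslatedText></Description>'
    '</ConditionDef>'
)
attach_query(condition, "QUERY")
```

Keeping one expression per context leaves room for the same ConditionDef to carry expressions for other contexts alongside the EHR4CR one.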

The third-party SDM-ODM editor should be able to add Alias XML elements to an ODM ItemDef (ODM 1.2) and to CodeListItems (ODM 1.3.1). The Alias tag carries a context attribute and a value attribute. The editor should allow a globally defined context EHR4CR for the working study definition. The context has to be added manually.

Figure 4 Global definition of the EHR4CR context

Adding this special context establishes a link to the EHR4CR Meta Data repository. The editor should indicate whether the connection was successful, for example by setting the text color of the global Alias definition to green. Once the connection is established, the attribute value of an ItemGroupDef can be set to the classification by running a full text search over all classifications available in the EHR4CR repository. The text field for the EHR4CR context updates whenever its content changes and shows the newly matching labels in a drop-down.

Figure 5 Example of full text search

The selected label is set on ItemGroup level as the Alias value attribute for the EHR4CR context.

<ItemGroupDef OID="g_vs" Name="Vital Signs">
  <Alias Context="EHR4CR" Value="12-Vital Signs"/>
</ItemGroupDef>

The annotation of ItemDefs and codelists is described in the next chapter.

Annotation of quantitative elements

The attribute value of an ItemDef is likewise completed by a full text search of the EHR4CR Data Element labels. In this case the value is the code and not the label. The EHR4CR service has to provide either the code, if applicable, or the label. The classification is used as a filter argument if it is set on the corresponding ItemGroupDef. This means that the service should return only the labels for the values within the classification when the search is used for ItemDefs or codelists.

<ItemDef OID="i_diabp" Name="Diastolic" DataType=" " Length="5">
  <Alias Context="EHR4CR" Value="<Code of Diastolic blood pressure>"/>
</ItemDef>

Table 8 EHR Data Element annotation in ODM

A link to the corresponding measurement units is not possible with ODM 1.3.1. This leads to the limitation of having multiple measurement units referenced to one ItemDef in the EHR4CR study setup context.

Figure 6 ODM annotation of CD datatype

Annotation of qualitative elements

In the ODM standard, the Alias tag can be attached to CodeListItems as well as to an ItemDef element. This can be used for the EHR4CR data element type CO, as shown in Figure 7 (ODM annotation of CO data type). It is advisable to also annotate the ItemDef with the code of the parent element.
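The annotation flow described above (a classification set on the ItemGroupDef, then a filtered full text search for the ItemDef) could be sketched as follows. This is an illustration under stated assumptions: the in-memory list `DATA_ELEMENTS` stands in for the EHR4CR repository service, the codes `DE-001` to `DE-003` are invented placeholders, and the helper names `search_elements` and `set_alias` are hypothetical. The Alias attribute name `Value` mirrors the examples in this section and should be checked against the ODM schema version in use.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the EHR4CR data element repository (invented codes).
DATA_ELEMENTS = [
    {"code": "DE-001", "label": "Diastolic blood pressure", "classification": "12-Vital Signs"},
    {"code": "DE-002", "label": "Systolic blood pressure", "classification": "12-Vital Signs"},
    {"code": "DE-003", "label": "Blood glucose", "classification": "Laboratory"},
]


def search_elements(text, classification=None):
    """Case-insensitive substring search, optionally restricted to one classification."""
    needle = text.lower()
    return [
        element for element in DATA_ELEMENTS
        if needle in element["label"].lower()
        and (classification is None or element["classification"] == classification)
    ]


def set_alias(element, value, context="EHR4CR"):
    """Add the Alias child for `context`, or update it if one already exists."""
    for alias in element.findall("Alias"):
        if alias.get("Context") == context:
            alias.set("Value", value)
            return alias
    return ET.SubElement(element, "Alias", {"Context": context, "Value": value})


# ItemGroup level: the Alias value is the classification label.
group = ET.Element("ItemGroupDef", {"OID": "g_vs", "Name": "Vital Signs"})
set_alias(group, "12-Vital Signs")

# ItemDef level: the group's classification filters the search,
# and the Alias value is the code rather than the label.
item = ET.Element("ItemDef", {"OID": "i_diabp", "Name": "Diastolic"})
classification = group.find("Alias").get("Value")
matches = search_elements("diastolic", classification=classification)
set_alias(item, matches[0]["code"])

print(ET.tostring(item, encoding="unicode"))
```

The same `set_alias` helper could be applied to a CodeListItem element for qualitative (CO) data elements, with the parent element's code set on the ItemDef.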

Figure 7 ODM annotation of CO data type

9.4.5. Prepopulation of eligibility criteria

Because the CRF pre-population of EHR4CR scenario 3 uses the CDISC ODM standard, re-using the ODM ConditionDefs and relating the SDM InclusionExclusionCriteria to ODM ItemDefs seem to be within reach. Unfortunately, the standard does not provide a reference mechanism that scenario 3 could use to pre-populate the eligibility criteria as ODM elements. The existing link between ODM ItemDefs and ConditionDefs has a different meaning: if the condition referenced by the ItemRef attribute CollectionExceptionConditionOID is true, the EDC system does NOT collect this questionnaire. In the SDM context, a true condition means that the Inclusion/Exclusion criteria are met and the patient is eligible.

After the upload of the ODM Metadata, a CDISC-compliant EDC system would display the ItemDefs as questionnaires on the screen and wait for user interaction. After the data collection, the values are stored in the so-called ItemData tags. If the InclusionExclusion criteria were linked to the ItemDefs, the EHR4CR platform could automatically create the ItemData ODM elements from the Metadata already processed in scenario 2, as there is a 1:1 relation between ODM ItemDefs and SDM InclusionExclusion criteria. As long as the CDISC SDM-ODM standard does not provide such a link, the mapping between the ODM ItemData and the EHR record entries will remain a manual process for the eligibility criteria. The third-party SDM-ODM editor must be able to load the SDM-ODM container in order to complete the ODM definitions.
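To make the missing link concrete, the sketch below shows what the automatic creation of ItemData could look like if such a link existed. Everything here is hypothetical: the dictionary `criterion_to_item` plays the role of the link that the SDM-ODM standard does not define, the OIDs are invented, the boolean results stand in for the outcome of the scenario 2 eligibility evaluation, and the ODM namespace is omitted.

```python
import xml.etree.ElementTree as ET


def build_eligibility_item_data(criterion_to_item, criterion_results, item_group_oid):
    """Create ItemGroupData with one ItemData per criterion that has a result.

    Returns the element and the list of criterion OIDs without a result,
    which would still have to be entered manually.
    """
    group = ET.Element("ItemGroupData", {"ItemGroupOID": item_group_oid})
    unresolved = []
    for criterion_oid, item_oid in criterion_to_item.items():
        if criterion_oid not in criterion_results:
            unresolved.append(criterion_oid)  # left for manual entry
            continue
        value = "true" if criterion_results[criterion_oid] else "false"
        ET.SubElement(group, "ItemData", {"ItemOID": item_oid, "Value": value})
    return group, unresolved


# Hypothetical link table: the SDM-ODM standard does not define it.
criterion_to_item = {"cr_age": "i_elig_age", "cr_diabetes": "i_elig_diabetes"}
# Hypothetical outcome of the eligibility evaluation for one patient.
criterion_results = {"cr_age": True}

group, unresolved = build_eligibility_item_data(
    criterion_to_item, criterion_results, "g_elig"
)
print(ET.tostring(group, encoding="unicode"))
print("Manual entry still needed for:", unresolved)
```

Until the standard offers a reference of this kind, the content of `criterion_to_item` has to be maintained by hand, which is the manual mapping step described above.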
