Relational Database Query Languages for Instances of Medical Information Repositories. Aastha Madaan


Relational Database Query Languages for Instances of Medical Information Repositories

Aastha Madaan

DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE AND ENGINEERING

Graduate Department of Computer and Information Systems
University of Aizu
2014

I hereby declare that this dissertation is entirely the result of my own work except where otherwise indicated. I have used only the resources given in the list of references.

Date                Candidate's Signature

Aizu-Wakamatsu

© Copyright by Aastha Madaan. All rights reserved.

The dissertation titled "Relational Database Query Languages for Instances of Medical Information Repositories" by Aastha Madaan is reviewed and approved. I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Main Referee:
Professor Subhash Bhalla    Date
Professor Zixue Cheng       Date
Professor Vitaly Klyuev     Date
Professor Incheon Paik      Date

University of Aizu
March, 2014

This dissertation is dedicated to my mother.

Acknowledgements

I would like to acknowledge the support of the University of Aizu for providing an excellent educational environment during my study. The library and computer facilities of the University have been exceptional in their level of support. I would also like to thank my database laboratory colleagues for providing a supportive environment. I am also thankful to Dr. Shinji Kikuchi of NEC Research for the technical discussions and feedback that improved this research. I am thankful to the many friends in Aizu who made my stay enjoyable. I am grateful to Prof. Wanming Chu for her help during the research work and for sharing valuable insights. Foremost, I would like to thankfully acknowledge the encouragement, technical support and guidance of my university research advisor, Prof. Subhash Bhalla, throughout my studies.

Abstract

Medical information includes both knowledge-based information (scientific papers and other literature) and patient-specific information (electronic health records, EHRs). The former is available on the Web through Web document repositories (such as MedlinePlus) and medical literature resources (such as PubMed and Medline). A variety of end-users need to query these resources for improved quality of patient-care. These users include practitioners, specialists and researchers who are well versed in medical knowledge and terminologies. They vary in background and experience, and interact with the system in various contexts. They pose precise queries and expect complete results within time limits (almost real-time). Querying these resources is therefore a complex task. A search engine returns a large list of documents for a user search, and a physician has to go over each of these documents to find the one that (exactly) matches his or her expectations. This task is time-consuming and may be abandoned. The study considers the medical domain experts in contrast with database users, Web users, IR users and others. The medical experts are well versed in the schema (terminologies and medical processes) in comparison to the other users, both in schema knowledge and in the ability to query. The study considers that enabling a database-style query language over these resources may allow the domain experts to formulate exact queries and receive precise results within time constraints.

The patient-specific resources are available as a standardized electronic health records repository. Patient records have evolved from the earlier paper-based records to Electronic Medical Records (EMRs), then Personal Health Records (PHRs), and now EHRs. The EHRs interact with different departments and subjects in a health-care organization. As a result, a huge amount of heterogeneous data is generated. Recently, several standards such as HL7, CEN and openEHR have been proposed. The aim of these standards is to achieve semantic interoperability. The study considers the electronic health records based on the openEHR standard. Due to the huge amount of user-data and usage-data generated in a standardized EHRs database system, there is a need to support querying and to enhance its usability. The standardized health records have a complex structure and may be queried at the single-patient level or the population level by a variety of medical domain-experts during the process of patient-care and for research purposes. This makes querying the EHRs databases complex. Moreover, a variety of end-users, varying in demographics and characteristics, interact with the standardized EHRs systems for everyday tasks. Hence, the interaction data generated is large in volume and diverse in nature. Pattern mining tools can help in discovering knowledge from this data for usability and learnability studies. Using the discovered knowledge, the standardized EHRs database system can be improved on a continuous basis over its lifetime.

The goal of this study is to provide the medical domain experts with the ability to query. For this, it provides database support to the medical domain experts for web-user-level activity over the Web document repositories. It aims to support their everyday querying needs (over the Web document repositories and the EHRs repository) through relational query languages. It also aims to enhance the usability of the standardized EHRs database system. The main contributions of the study are:

1. The study models a Web-based document repository as a three-dimensional cube that can be mapped to an XML-based hierarchical structure (XML schema). It proposes a two-step framework to create a user-level schema corresponding to the Web document repository using the concept of Web document segmentation (as an off-line process), and enables the existing XQuery query language (for XML) over it. This transformation makes the document repository query-able. Further, a new high-level query language interface, QBT (Query-by-Segment Tag), is developed over it (an on-line process).

2. Next, the study adapts the existing graphical, high-level query language XQBE (XQuery-By-Example) to the transformed schema for enhanced querying over the on-line medical document repositories. It provides a drag-and-drop query interface over an on-line medical document repository.

3. The study proposes a NoSQL, cloud-based persistence mechanism for the archetype-based EHRs. It further proposes a high-level, graphical, relational-like query language, Archetype Query By Example (AQBE), similar to the existing Query-By-Example (QBE) query language, to support user-queries over the standardized EHRs repository. It enables medical concepts such as blood-pressure or heart-rate to be queried independent of the system implementation and application environment.

4. To address the usability concerns for these databases and for system enhancement purposes, the study proposes an automated usability enhancement framework. It makes use of pattern mining tools and user-centered design (UCD) guidelines to understand the heterogeneous end-users and the user-system interactions. Further, the framework is adapted to provide an e-learning framework for the end-users for improved usability, learnability and ease of adoption of the system. Such a system is capable of providing lifetime continuous feedback to the EHR system designer for system enhancement.

Through the experiments and user-studies, it can be concluded that the study provides easy-to-use, high-level query languages for the domain-experts as well as novice medical users to query the knowledge-based and patient-specific medical information repositories. The use of query languages allows the users to frame exact queries and receive results within the desired context. The experiments exhibit the strengths of the query languages compared to the existing methods of search/query. Limitations of the study include the evaluation of the proposed systems (QBT, and MXQBE, i.e., XQBE on the MedlinePlus medical encyclopedia) over similar medical document repositories. For the proposed AQBE query language, more query functions (for population and epidemiological queries) need to be implemented to make it a complete database query language for the standardized EHRs database (using a NoSQL store). Implementation in a real-world setting and a usability study of the proposed query interfaces with actual physicians and clinicians will help to improve the system.

Contents

List of Figures    xii
List of Tables    xv
Abbreviations    xviii

1 Introduction
    The Medical Information Repositories
    Knowledge-based Information Repositories
    Patient Specific Information Repositories
    Model of the Environment
    Understanding the Medical Document Repositories
    Complex Medical Information Needs
    The End-Users
    Existing Solutions and Issues
    Thesis Contribution
    Query Support for Knowledge Based Information Repositories
    Multi-Stage Visual Query Language
    User-Level Schema
    Query Support for Patient-Specific Information Repositories
    Archetype QBE Query Language
    Automated Usability Framework for Standardized EHRs databases
    Structure of the thesis
    Overview of Chapters

I Web Document Segmentation for MedlinePlus Medical Encyclopedia    13

2 Literature Review and Background Studies
    The Web Documents
    Need for Web Document Segmentation
    Web Document Segmentation: Tasks and definition
    Evolution of Web Page Segmentation Approaches
    Web Document Segmentation: Approaches
    Classification Based on Partitioning
    Top-Down Partitioning
    Bottom-Up Partitioning

    Classification based on the Underlying Methodology
    Related Work
    Application domains
    Database Usage and Web Page Segmentation
    Segmentation Accuracy Evaluation
    Shortcomings of the methods of evaluation
    Summary and Conclusions

3 In-depth Querying over Domain Specific Document Repositories
    Introduction
    The medical documents on the Web
    Structure of Health Document Repositories
    Users of Health Information
    An Example
    Problem Statement
    Background and Motivation
    Hierarchical Query Interfaces
    Query-Interface for the Web Document Collections
    Query Methods based on Web Document Segmentation
    Queries based on Dynamic Forms
    Integrated Querying of the Web Databases and the documents
    Proposed Approach
    Segmentation Framework
    Querying Framework
    Data Model
    Resource Tree
    Segment Tree
    Query Result
    The Algorithm
    Preprocessing the Web documents
    Segmentation of the Web documents
    Creation of a Concept-based Database
    High-level Query-Interface Query-by-Segment Tag
    Experiments
    Performance of the QBT Query-Interface
    Interface Support for User Input
    Usability Studies with actual End-users
    Discussions
    Summary and Conclusions

4 Semantic Granularity Model for Domain Specific Queries
    Domain-Specific Document Repositories
    Keyword Search vs. Semantic Granular Query Model
    Shortcomings of Existing Search Engines w.r.t Medical Information
    Problem Statement
    Background and Motivation

    Information Quality of Retrieval from the Web-based Medical Document Repositories
    Nature of Medical Documents
    The Notion of Web Document Segmentation
    History of Information through Granular Search and Database Query
    Profile of Skilled and Semi-Skilled Users
    Medical Users vs. Other Users
    High-level Database query languages
    Proposed Framework
    Structure of Semantically Coherent Segments in the Web Documents
    The Terminology-enriched Schema
    Query Classes
    Offline Process
    The Structural Analysis
    The Semantic Analysis
    Tree-structured Database Model
    Online Process
    Improved Structured Querying
    Experiments
    Experimental Settings
    Platform
    Dataset
    Document Preprocessing
    Web-document Segmentation Quality
    Enhanced Query Capability
    XQBE system architecture
    Query Formulation
    Query Functions
    Discussions
    Consideration of Fuzzy User Inputs
    Usability and Acceptance
    Future Work
    Summary and Conclusions

II Standardized Electronic Health Records: Database Query Languages and Usability Studies    85

5 Literature Review and Background Studies
    Information Interchange Services and Distributed Architecture
    Business processes in a Health Organization
    Standardized Electronic Health Records (EHRs)
    Semantic Interoperability
    Levels of interoperability
    Dual Level Modeling Approach
    The openEHR Architecture
    The openEHR Archetypes
    Archetypes and Semantic Interoperability
    Archetype Definition Language (ADL)

5.3 Information Retrieval in EHR Systems
    Interoperability and levels of Query Interfaces
    Challenges in Querying Archetype based EHR
    Archetype Query Language (AQL)
    Querying the EHR Systems
    Bottom-Up Approach
    XQBE (XQuery by Example)
    Mapping ADL to XQBE for EHR Data
    XGI: XQuery Graphical Interface
    Top-Down Approach
    XQBO (Query by Object using DTD)
    Retro Guide
    Keyword Search
    Human-System Interactions and Usability within EHR Systems
    Pattern Discovery: Mining End-User Needs
    Discussions
    Summary and Conclusions

6 Quasi-relational Query Language and Standardized EHRs
    Persistence and Querying for Standardized Electronic Health Records
    Querying in Archetypal EHRs
    Methods for Persisting the Archetypal EHRs
    NoSQL based persistence for the Standardized EHRs
    Research Questions and Problem Statement
    Problem Statement
    Proposed Approach
    Standardized EHRs Database System Architecture
    AQBE Query Generator
    System Specifications
    AQBE Runtime
    Experiments
    Discussions
    Summary and Conclusions

7 Automated Usability Framework for the Standardized EHRs
    Usability Support for Standardized Electronic Health Records
    End-to-end Workflow Management
    Context of the Study
    Problem Statement
    Background and Motivation
    Usability Issues in the EHRs databases
    User-Centered Design for the EHR databases
    Standardized Electronic Health Record Databases
    Building blocks of the openEHR model
    Evolution of Standardized EHRs
    User Interface Generation
    An Example
    Pattern Discovery and the Standardized EHRs Databases
    Data and Information Quality Issues

7.3 Proposed Framework
    User Classification
    Understanding User-work-flow Patterns
    The Knowledge Repository
    Mathematical Formulation
    The Algorithm
    Experiments
    Pre-Study and End-User Responses
    Post-Study Experimental Evaluation
    Dataset Preparation
    Experimental Method
    Performance Evaluation
    Quantitative Evaluation
    Qualitative Evaluation
    Discussions
    Applicability of the Framework
    Automated Usability Enhancement Framework and e-learning
    Objective
    e-learning Goals: Various Dimensions of Learnability
    Knowledge Archive (Repository)
    Discussion: Qualitative Performance of e-learning scheme
    Summary and Conclusions

8 Summary and Conclusions
    Limitations of the Study
    Future Work

Appendix A    162
    A.1 Query by Segment Tag: Usability Studies with End-users
    A.2 Pre-Study with the Clinicians
    A.3 Prototype EHR system based on openEHR standard and Usability Concerns

References    169

List of Figures

1.1 A snippet of the Web document "Heart Attack" from the MedlinePlus medical encyclopedia document repository
An overview of the problem area addressed by the study, existing approaches and the proposed methods
Different levels of queries performed by a clinician during health-care delivery
(a) Query formulation through the proposed multi-stage query language on the user-level schema, (b) A sample query formulated using the multi-stage query language
Webpage segmentation algorithms on a Timeline
Approach of visual segmentation based algorithms
Traditional keyword search (above) and the proposed high-level query-interface over Web health documents
Spatial DOM (SDOM) corresponding to the DOM of a Web page (Oro et al., 2010)
An example of content structure for the "Heart Attack" Web document in the MedlinePlus medical encyclopedia (med, 2012a)
Characteristics of the Web document captured by the VisHue algorithm (Section 2.6)
Screenshot of the Query-by-Segment Tag (QBT) query-interface over the MedlinePlus medical encyclopedia (med, 2012a)
Auto-complete prompts in the QBT interface
User-response about use of on-line medical information
User-response about use of on-line medical information
User-response about the usefulness of the training session
Query Results (total 141) using the existing advanced keyword search on the MedlinePlus medical encyclopedia (med, 2012b) with keywords Tuberculosis and Pneumonia
Various common components of a Web document in the MedlinePlus Medical Encyclopedia document repository
Hierarchy of semantically coherent segments of a Web document from a healthcare document repository (med, 2012b)
Snapshot of Microsoft Academic Search that employs object-level vertical search
An example of common search methodology followed by a health expert (Flu Shot query), adapted from (White et al., 2008)

4.6 Web Document Segmentation framework for database creation and traditional search methods employed for the domain experts
Steps of the Web Document Segmentation Model (off-line process)
Structure points (Tree of Contents algorithm) on a MedlinePlus document
(a) Tree of Headings similar to a DOM tree (adapted from (htm, 2011)), (b) Tree of Contents similar to VIPS (adapted from (Cai et al., 2003)), (c) The Semantic Schema Tree corresponding to an example medical encyclopedia Web document
A snippet of a document (from the MedlinePlus repository (med, 2012b)) in the semantic XML Database
Representation of a query (Q1) as a sub-tree of the schema tree (adapted from (Abiteboul et al., 2000))
Representation of a Web document as a two-dimensional array of segments and contents
Representation of the Web document repository as a three-dimensional database
System architecture of the XQBE query language for the MedlinePlus medical encyclopedia (adapted from (Braga et al., 2005))
Query in the XQBE query language for finding the cases where the patient is showing symptoms of peptic ulcer due to the Helicobacter pylori bacterium
Proposed improved query and search interfaces (according to expert and naive users)
Querying by the domain experts on the transformed database (containing data objects of the Web document) and representation of the various transformations required to create query-able entities
An overview of different inter- and intra-organizational interactions
Semantic Interoperability
Two-Level Modeling Approach
DBMS architecture compared to the openEHR Architecture
The Blood Pressure concept represented as an openEHR archetype
Query Support at different Levels
Syntax of AQL
Mapping process to present XQBE for EHRs (Sachdeva, 2009)
A sample of BP.dtd
BP.xqbe - an XQBE template for query (see online version for colours)
Human-System Interactions in the EHR domain
Human-System Interaction Monitoring Model for EHR Systems as per the TDQM framework
The system architecture of the proposed Standardized EHRs database system
Screenshot of the interface of the AQBE query generator of the proposed database system
Steps for generating the forms on the AQBE editor
Steps performed for patient-data insertion
Steps performed for the querying process on the AQBE editor
Usability and standardization support infrastructure for the standardized EHRs databases
Support studies for evolution and enhancements in standardized EHR databases
User-Centered Design (UCD) steps for an interactive health-care application using standardized EHR databases (adapted from ISO) (Sachdeva et al., 2012a)

7.4 Blood pressure concept represented as an archetype (mind-map) in the openEHR archetype repository (CKM, 2012)
Flow of openEHR artefacts during the process of patient-care (T. et al., 2008)
Increasing complexity of the standardized EHRs w.r.t. time and patient encounters. Changes in templates and archetypes (considering life-long representation of a single patient's EHR) (HL7, 2011), (T. et al., 2008)
The relationship between templates and the archetypes for the openEHR standard (T. et al., 2008)
Distinct Users of standardized EHR databases (Schabetsberger et al., 2005)
Characteristics and demographics identified for any user of the standardized EHR databases
Participating attributes and their values for decision tree classification
Decision-support tasks and requirements of a clinician in day-to-day activities
Decision Tree corresponding to the Users dataset with user-labels as the class attribute
Variation in the accuracy (%) of the user-categorization based on the attributes chosen as the final attribute
Variation in the use of standardized EHR databases features (used by a clinician) as a function of frequency of use
A comparison between the length of the interesting patterns discovered and the frequency of their access (Work-flows dataset)
Enhanced decision making with the support of the Usability Support Study (Framework)
User-centered design guidelines and the proposed e-learning scheme to enhance usability
Steps of the proposed e-learning scheme and flow of continuous feedback to the system designer through the KArch for user-learnability enhancement
Internal structure of the proposed Knowledge Archive (KArch)
A.1 AQBE prototype system representing the usability concerns: (i) flow of contents on the user-interface, (ii) mandatory and optional user-attributes, (iii) archetypes required based on the needs of the (specialized) health-care environment, (iv) when and where the template needs to be generated, (v) distinguishing the mandatory and optional user data
A.2 Archetype list in the AQBE system, used for selection of archetype(s) for dynamic form generation
A.3 Example form corresponding to the blood pressure concept in the AQBE system representing the various attributes in the concept
A.4 Example form corresponding to the blood pressure concept (Figure A.3), continued
A.5 The participating archetypes of the heart failure summary of the EU-SHN Project

List of Tables

2.1 The Various Terminologies Used in the Literature for describing Segments
Classification of Algorithms on the basis of Partitioning approach
Summary Performance metrics
Various medical resources on the Web
Approaches for querying Web data and databases
VisHue Web page segmentation algorithm
Algorithm for Query-by-Segment
Feature Comparison: QBT query-interface and existing keyword search
Comparison for QBT query-interface and existing keyword search w.r.t search query formulation
Summary of Information of the Participants
Various types of query classes supported by the proposed framework
(a) Count of the original Document Corpus
Top-30 frequently occurring segments (attributes) in the transformed database
Evaluation of segmentation quality on the basis of users' rating
Evaluation of segmentation quality on the basis of user ratings
Sample set of the representative queries
XQuery expression of Query 1 in (Table 6)
Query-capability of the XQBE and XQuery for the MedlinePlus Medical Encyclopedia
User Intent as captured by the MedlinePlus document corpus and the transformed database corresponding to the corpus
Qualitative comparison based on query features among the considered query methods for searching and querying medical information
The three levels of interoperability in the Standardized EHRs
Comparison of various querying approaches for the EHR users
Example snippet of the JSON equivalent of the Blood pressure concept (archetype)
Comparison between databases for persistence of the archetypal EHRs
Sample set of 23 queries prepared for evaluation of the AQBE Query Language
Comparison of query-capability between the AQL and AQBE interfaces for different types of queries
Sample set of the work-flow steps captured in the database for a general physician's common tasks in a clinical setting

7.2 Hypotheses for the UCD guidelines to address the Usability Issues in the Standardized EHRs database
Day-to-day clinical tasks for the surveyed categories of clinicians
Summary of key responses of the pre-study with clinicians
Abbreviations of the common features of a clinical application (adapted from (Zheng et al., 2009))
Accuracy of User-classification using the ID3 algorithm (Users dataset)
Results of the maximal sequences of varying length w.r.t the work-flows of the clinicians
From the Work-flows dataset: most frequent consecutive feature accesses (forming the higher-order maximal patterns for knowledge discovery) and their level-of-support
Qualitative overview of the influence of distribution of EHRs features in the user-work-flows in the analytical studies
From the Work-flows dataset: most frequent consecutive feature accesses (forming the higher-order maximal patterns for knowledge discovery) and their level-of-support
Relative comparison of the improvement in task performance using qualitative measures
A snippet of the Users dataset
The Usability Challenges and e-learning Support Scheme
Traditional methods vs. the proposed e-learning scheme w.r.t. the qualitative measures of usability enhancement
A snippet of the Work-flows dataset
A.1 Part I: User Information
A.6 Questionnaire for the pre-study performed with the end-users (clinicians)
A.2 Part II: Questionnaire about use of on-line medical information
A.3 Part III (a): Queries for study with the end-users (clinicians and other health experts)
A.4 Part III (b): End-user Questionnaire for the QBT Interface
A.5 Part III (b): End-user Questionnaire for the QBT Interface (contd.)

Abbreviations

ADL     Archetype Definition Language
AQBE    Archetype Query By Example
AQL     Archetype Query Language
CDSS    Clinical decision-support systems
CKM     Clinical Knowledge Manager
CSS     Cascading style sheets
DICOM   Digital Imaging and Communications in Medicine
DoC     Degree-of-Coherence
DOM     Document Object Model
DTD     Document type definition
EDT     Extended DOM-Tree
EHR     Electronic Health Record
EUP     EHR Usability Protocol
GUI     Graphical user-interface
HL7     Health Level 7
HTML    Hypertext Markup Language
ICD     International Classification of Diseases
IHE     Integrating Healthcare Enterprise
IR      Information retrieval
ISO     International standards organization
LOINC   Logical Observation Identifiers Names and Codes
MeSH    Medical Subject Headings
NIST    National Institute of Standards and Technology
NLM     National Library of Medicine

PHR        Personal Health Record
QBE        Query-by-example
QBT        Query-by-Segment Tag
RM         Reference Model
SIOp       Semantic Interoperability
SNOMED CT  SNOMED Clinical Terms
SPA        Sequential Pattern Analysis
UCD        User-centered Design
VIPS       Visual Page Segmentation
WWW        World Wide Web
XGI        XQuery Graphical Interface
XQBE       XQuery by Example

Chapter 1

Introduction

1.1 The Medical Information Repositories

The medical domain is complex. Therefore, limited access to target documents by indexing through a standard search engine is not sufficient. Medical information includes both patient-specific information (EHRs) and knowledge-based information (scientific papers and other literature) (Hanbury, 2012). Medical knowledge (terminologies and concepts) has evolved over decades. It is available on the Web through Web document repositories (such as MedlinePlus (med, 2012b)), popular medical literature publications (PubMed (pub, 2011), Medline (med, 2012a)), other primary and secondary resources, and EHRs (Freire et al., 2012). Querying these resources is required by secondary applications such as evidence-based medicine and the secondary use of EHRs to improve the quality of care (Hanbury, 2012).

Medical information is utilized by a variety of end-users with complex requirements. Practitioners, specialists and researchers are well versed in medical knowledge and terminologies. These users have precise queries and expect complete results within time limits (almost real-time). According to the Khresmoi survey (Gschwandtner et al., 2011), the end-user requirements vary on the basis of the level of specialty. Moreover, the trustworthiness of information and the authentication of a resource are of major concern to the users.

The widespread use of the WWW has given rise to a range of simple query processors, the search engines. These query a database of semi-structured data (the HTML pages). For example, one can use a search engine to find pages containing the word "Villain". However, it is difficult to obtain only those pages in which "villain" appears in the context of a character in a wild-west movie (Cohen et al., 2000). In the health-care domain, the end-users vary in their background and experience and have variable needs. These users interact in various contexts; for example, a clinician interacts with the patients during clinical care. He (or she) may need to query the medical literature and other document repositories to assess the plan for treatment or patient-diagnosis. For such queries, the users need a query language and a schema. The goal of this study is to assist the domain experts who are equipped with domain knowledge but are not well versed in using a query language such as SQL or XQuery to query the Web document repositories.

This thesis illustrates the need for in-depth (and granular) querying of medical information on the Web. For example, a physician may have an exploratory evidence-based query¹, "cases where Helicobacter pylori bacteria cause peptic ulcer", or a hypothesis-directed query², "Treatment in case of high fever and dizziness". A conventional search (Yahoo, Google or a localized search menu) returns all the Web documents that match any of the keywords, ranked by a set of criteria defined by the search engine. This may return a large number of documents. The physician has to go over each of the documents to find the one that (exactly) matches his expectations. This task is time-consuming, and it may be abandoned by the physician. Both types of queries may involve multiple filtering conditions (attributes), which make the query complex. To answer such end-user queries there is a need to facilitate DB-style queries, where a user can query "Helicobacter pylori bacteria" in the context of symptoms and "peptic ulcer" in the context of causes, and can append these filtering conditions (attributes) in multiple interactive stages to receive results (in this case, the name of a disease).

Research publication(s): Aastha Madaan, "Domain specific multistage query language for medical document repositories," Proc. VLDB Endowment, Vol. 6, No. 12, August 2013, ISSN (online), Very Large Data Base Endowment Inc.

¹ The evidence-based queries are raised during patient-care, where the clinicians wish to query the knowledge archives to determine the relevance of signs and symptoms to the potential existence of one or more medical disorders (Cartright et al., 2011).

Several recent studies have made an attempt to address the information needs of the end-users within various domains. These include work on granular-level domain-specific search, estimating the granularity of information in a document (Yan et al., 2011), estimating the difficulty of a document, the quality of documents, document summarization, and the use of terminology resources for query refinement (Hanbury, 2012). Cross-lingual search is of importance for end-users at all levels (Hanbury, 2012). However, several challenges remain for efficient health-care delivery.

Figure 1.1: A snippet of the Web document "Heart Attack" from the MedlinePlus medical encyclopedia document repository.

1.1.1 Knowledge-based Information Repositories

A major category of information used in health-care is expert or knowledge-based information. This information is generally created or collected by experts who may or may not be part of the health-care organization.
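The database-style, multi-stage querying motivated in Section 1.1 (a term matched within the context of a specific segment such as "causes" or "symptoms", with filters appended stage by stage) can be illustrated with a minimal sketch over a toy XML repository. All element names and the sample data below are illustrative assumptions, not the thesis's actual user-level schema:

```python
import xml.etree.ElementTree as ET

# Toy user-level schema: each encyclopedia document becomes an XML record
# whose children are the segments (causes, symptoms, ...) produced by
# Web document segmentation. Element names here are illustrative only.
REPOSITORY = """
<repository>
  <document name="Peptic Ulcer">
    <causes>Most ulcers are caused by the Helicobacter pylori bacterium.</causes>
    <symptoms>Abdominal pain, bloating and nausea.</symptoms>
  </document>
  <document name="Heart Attack">
    <causes>A blood clot blocks blood flow to the heart muscle.</causes>
    <symptoms>Chest pain, shortness of breath.</symptoms>
  </document>
</repository>
"""

def query(root, filters):
    """Return names of documents matching every (segment, term) filter.
    Filters can be appended one stage at a time, mimicking the
    multi-stage, database-style refinement described in the text."""
    results = []
    for doc in root.iter("document"):
        ok = all(
            (seg := doc.find(segment)) is not None
            and term.lower() in seg.text.lower()
            for segment, term in filters
        )
        if ok:
            results.append(doc.get("name"))
    return results

root = ET.fromstring(REPOSITORY)
# Stage 1: the term must occur in the context of the "causes" segment.
print(query(root, [("causes", "Helicobacter pylori")]))   # ['Peptic Ulcer']
# Stage 2: append a second filtering condition on another segment.
print(query(root, [("causes", "Helicobacter pylori"),
                   ("symptoms", "abdominal pain")]))      # ['Peptic Ulcer']
```

A plain keyword search over the same two documents would match "pain" in both; restricting each term to a segment is what narrows the answer to the intended disease.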
This kind of information is used by health-care workers during the clinical decision-making process. Some examples of the knowledge-based information resources are the on-line medical document repositories, such as MedlinePlus (med, 2012b), the A.D.A.M. medical encyclopedia (ada, 2011) and WebMD (Web, 0 30).

² The hypothesis-directed queries represent the non-diagnostic intent of information about conditions, seeking details on potential hypotheses including treatments, cures and outcomes (Cartright et al., 2011).

It may also include the health-care journals

and the medical literature resources such as PubMed (pub, 2011) and MEDLINE (med, 2012a). Patients, consumers, and health-care providers need easy access to the latest, high-quality information in these resources to make critical health-status and other health-care decisions.

Patient-Specific Information Repositories

Patient-specific information is a result of patient encounters within a health-care organization. The paper records used earlier were replaced by Electronic Medical Records (EMRs). The EMRs are a digital version of the paper charts and contain the medical and treatment history of the patients. They allow clinicians to track data over time, identify patients for screenings/checkups, and maintain recordings of parameters such as blood pressure and vaccination. However, the information in EMRs cannot be exchanged with other health-care organizations and specialists due to the lack of a standard format for patient data. The EHR evolution has led to the evolution of Personal Health Records (PHRs). In the case of PHRs, a software application is used by individuals to record their personal knowledge about their health and the health of their dependents. There are Web-based PHR portals, namely Google Health (Google, 2011) and Microsoft HealthVault (Microsoft, 2011). These aim to store and allow access to the EHR data from a variety of perspectives. A hospital maintains an archive of the EHRs of a patient as a part of its day-to-day activity. In comparison to PHRs, the EHR repositories at hospitals provide multiple additional functions. These facilitate epidemic-prevention studies and national planning using a life-long record for a population. Furthermore, they support the shift towards integrated care across health-care, medical-care and social-welfare activities.
Hospitals and health agencies exchange EHR data through standards such as HL7 (Health Level 7) (HL7, 2011) and DICOM (Digital Imaging and Communications in Medicine) (Hussein et al., 2004). Standards such as CEN EN (cen, 2011) and openehr (cen, 2011) define the unit of exchange (documents) as an EHR Extract (expressed in XML). In order to have interoperable exchanges, the EHRs use industry standards promoted by Integrating the Healthcare Enterprise (IHE) (IHE, 2013). The openehr standard is becoming a widely used EHR standard (cen, 2011). In this study we consider EHRs based on the openehr standard.

1.2 Model of the Environment

In this section we describe the model of the environment. The complexity of medical information, the various end-users, and the recent approaches for querying data are given. Considering that user preferences facilitate information retrieval and querying from domain-specific resources, the need for granular query results is emphasized.

Understanding the Medical Document Repositories

The documents on the Web are well-structured for human readability and comprehension, so that a program can reliably extract the categorized components of information. The medical Web resources often comprise the electronic form of former paper-based document collections. These are referred to by clinicians for in-patient diagnosis and by general users for preliminary symptom recognition. As in the case of a text document, headings are employed to organize the content of the documents. Headings are located at the top of sections and subsections. They preview and succinctly summarize upcoming content and show subordination. This leads to a hierarchical structure for a document, which plays a role in understanding the relationships between its contents. The same structure becomes useful for segmenting a Web document.

Hence, the logical (semantic) layout of the Web document, along with the domain knowledge and the structural tags for content organization, forms a logical hierarchy with the domain-specific terms. This can be transformed into an XML data model that can be further utilized for database-like queries on the Web document collection. The contents of these resources do not change (much) over a period of time. The schema and these resources have evolved from common-practice terms over tens of years.

Typically, a Web document represents a single theme or topic (Figure 1.1). There may be multiple subtopics within a given topic, and each of the documents may contain distinct sub-topics depending upon the topic it represents. Each of the topic and sub-topic labels can be represented as query-able attributes to the end-users. These labels are the elements of the medical terms (disease, condition, process) that are semantically grouped in a Web document. These are arranged for efficient browsing of a repository. The header and footer on any Web document give the general Web site information and other meta-data. The free text comprises paragraphs and lists. Figure 1.1 shows a sample Web document containing the various components given above, which form a complete Web document. The main content of (similarly organized topics on) the Web documents can be logically divided into several information segments. These can be based on common Web document labels, such as causes, symptoms, home care, alternatives and references for medical encyclopedia documents. This can significantly ease the task of querying by health-care experts.

Complex Medical Information Needs

The major barriers to efficient response to querying of medical information include perceived lack of quality, relevance of the content, inaccessibility, and trustworthiness of resources (Gschwandtner et al., 2011).
Inadequate quality assessment of on-line medical information may lead to misleading information. Therefore, the validity and reliability of Internet resources is often questioned for medical information (Gschwandtner et al., 2011). Moreover, there is no criterion as to when a search or query should be terminated. The clinicians often find that the results returned to them are not appropriate to their requirement (Gschwandtner et al., 2011). Querying medical information may vary depending on the role of the clinician and the way information is presented. While computer-based knowledge resources are useful in addressing these needs, ineffective search skills and lack of time are common barriers to information seeking (Sevgin et al., 2013). The major challenges that arise due to the complexity of medical information are:

1. Identification of suitable query methods, and the results searched.
2. Identification of resources to be queried for the results (knowledge of schema).
3. Identification of results presented to the end-users.

The End-Users

The end-users in the medical domain can be classified on the basis of their knowledge and expertise in the field. The first set of users can be termed the novice users (or casual users). These users include the patients and their relatives. They are not well-versed in medical terminologies and concepts and may use keywords which may not be (much) related to the actual terms. They possess a low ability to pose reasonable queries. These users may (or may not) have computer expertise. Hence, they utilize the results that are available from general-purpose search engines.

On the other hand, the doctors and clinicians are the medical domain-knowledge experts and frequently need to access the knowledge archives (Web document repositories). These users must use reliable sources and require the complete context of the results. They may (or may not) have computer expertise, but can pose exact and precise queries. Unlike searchers on the Web, the domain-experts in the medical domain have in-depth knowledge of the particular information to be located, the details of the resource type and its reliability. They make use of the complex structural information to extract the relevant semantic information. Therefore, a query language is developed to support the needs of the specialists (domain experts). It can subsequently reduce the learning curve of the novice users. There is a knowledge gap about what is exactly required by the user, which is often figured out by an intermediary agent (Li and Jagadish, 2012). It can be eliminated by allowing the user to construct queries interactively on a user-level schema (with query-able attributes).

Existing Solutions and Issues

The generic search engines may be time-efficient and free, but often lack the specificity to provide relevant and high-quality information for professional use in medical care (Gschwandtner et al., 2011). These existing (keyword) searches consider the Web documents as a bag of words and do not exploit the rich structural information available for on-line medical information (Marian and Wang, 2009). Physicians who search using the keyword-search approach, say, for a problem from its conditions, may find the most relevant link on the second or third page, whereas they wish to find the relevant results for their queries quickly and efficiently. As a result of such difficulties, the clinicians tend to believe that the answers to their (complex, specific and specialized) queries do not exist or exist in a fuzzy state (Gschwandtner et al., 2011).
Thus, the results returned by the conventional search engines for medical information suffer in quality for the following reasons:

1. A query with only one or two terms may not contain enough terms for the search engine to retrieve the desired information for the user.
2. In the document repository of the search engine, there might exist thousands of articles matching the requested query. This makes it impossible to locate the desired information by simply browsing through the contents of the returned results.
3. Conventional search engines focus on generic information search; domain-specific results are usually not taken into consideration during the search. Thus a simple keyword-based search does not produce relevant search results in specific domains such as the medical domain.

Recently, form-based interfaces have been proposed to allow naive users to query a database. These work well with the query logic of the end-users. For complex queries, however, the number of forms might increase. Considering the approach proposed by Jayapandian (Jayapandian and Jagadish, 2006), where the forms are clustered based on the query needs of the users, the forms need to be modified to address the queries which cannot be performed otherwise.

1.3 Thesis Contribution

In this thesis, we consider two instances of medical information (Figure 1.2): the knowledge-based resources (medical encyclopedias), and the patient-specific resources (Standardized Electronic Health Records (EHRs)). The thesis primarily focuses on the former information resources; the latter set of medical information has been taken up as part of the ongoing research activities at the laboratory. Here, the area of focus is querying the medical information resources which form the expert-users' personal and external knowledge base. A multi-stage query language is proposed. It provides a user-level query calculator to formulate a query using domain concepts. Overcoming the above shortcomings will simplify the querying tasks for the expert and novice domain users and enable them to get the desired results.

For the case of the standardized EHRs, we consider the challenges of usability and querying of these resources and outline possible solutions to them. A database system is proposed to store these records, and a relational-like query language is enabled over it. In a complex system such as the standardized EHRs database system, a huge amount of user data and usage data needs to be analyzed, due to which several usability and learnability issues arise. To address this, an automated usability framework based on data-analytical tools is proposed as part of the thesis.

Figure 1.2 gives an overview of the thesis contribution. The main contribution of the thesis is to provide the ability to query to the domain-experts of the medical domain. It considers the two instances (mentioned above). The figure describes the problem area, the existing approaches used to query medical information, and the proposed approaches for the considered instances. It also gives the details of the contributions with respect to chapters in the thesis.
Figure 1.2: An overview of the problem area addressed by the study, existing approaches and the proposed methods.

1.3.1 Query Support for Knowledge Based Information Repositories

The general practitioners need a query tool which gives 5-10 authentic results per page with a simple layout. These users query the health Web document repositories. Although physicians are time-constrained, they are prepared to devote time to complex queries. The immediate need at the point-of-care is critical for physicians. Information on drugs, disease descriptions, treatment and clinical-trial information are dominant needs (Gschwandtner et al., 2011). We describe two key features of the proposed approach to address this problem statement. The proposed query language attempts to overcome the following shortcomings:

1. Implement the query methodology to seek diagnostic or hypothesis-directed information (followed by a medical domain-expert) and,
2. Present the relevant areas (granular results) of the Web document that match the user's query criteria.

Multi-Stage Visual Query Language

Querying in the medical domain has a natural multi-stage and transitional progression from symptoms, to causes, to remedy (and so on) during patient diagnosis. The medical domain users wish to express these complex semantics of patient-care and receive precise and complete answers. Database queries can easily fulfill this requirement as compared to keyword-based Web search. Hence, in this thesis we aim to model the multi-stage process of patient-care as a query language over a database of Web documents. At each stage the user can choose an attribute (filter-condition) that he or she wants to append to the query. The system provides flexibility to the users to form the order of accessing these features and to add conditions on these attributes. Figure 1.3 describes the various levels of query complexity that occur as a result of querying by a domain-expert on his knowledge base during patient diagnosis.
He may perform a simple query involving a single medical concept (symptoms/causes), say, "Find diseases where fever is a cause". For a query "Find medication when a patient has vomiting due to fever", a user may query two concepts, symptoms and causes. Such queries can be termed medium-level complex queries. Further, a clinician may have a more complex query. He or she may query the knowledge base after observing fever as a symptom, but on further observing the patient may discover high blood pressure also as a symptom. In such a case, he may wish to query the symptoms (concept) incrementally with two values. In more complex scenarios, a user may recursively query multiple concepts with different values. For example, in Figure 1.3, for the query "find remedies when a patient is having fever because of vomiting", after reading the remedies a clinician may further wish to query the causes with a value, over-eating.

User-Level Schema

The concepts and terminologies in the medical domain have evolved from the domain knowledge and experience of the domain experts (clinicians and researchers) over several years. This information does not change frequently and is enriched over a period of time. With the emergence of standards for medical terminologies, for example LOINC (LOINC, 2013) for laboratory tests, and ICD (9, 10) (ICD(9, 2013) and SNOMED-CT ((IHTSDO), 2013) for disease codes, medical information is increasingly becoming interoperable across geographically distributed health-care systems.

The Web document repository creates a semantic object-level universal schema which can be queried by domain experts and other users. The attributes and data stored in this schema are largely understandable by the users and easy to query. Such a schema can simplify and

Figure 1.3: Different levels of queries performed by a clinician during health-care delivery.

enrich the querying experience of the end-users. This facilitates a query language for these users to easily formulate their queries interactively in a multi-stage manner. We propose to capture the hierarchical schema (mapped to XML form) for the Web-based document repositories. The multi-stage querying and the XML-based schema, in terms of the domain-level knowledge of the users, eliminate the need to write complex code for the queries (or complex SQL or XQuery expressions). This approach has been proposed earlier in (Sachdeva and Bhalla, 2009) for granular and precise querying of archetype-based EHRs. Therefore, the integration of the available domain knowledge and programming methodology is required to surmount the difficulty of using the query languages.

For the purpose of creating the user-level schema (previous section), the proposed approach maps the domain knowledge of the experts and the concepts represented in a Web document. It extracts the syntactic structure of the Web document considering the meta-data (in the form of headings and sub-headings). Further, these concepts are used to label the nodes of the hierarchical structure to form a concept-based structure. These node labels define the attributes for querying. We aim to develop a segmentation algorithm which segments the Web document considering the layout features and the semantic organization of the domain-level concepts.
It makes use of the concept of Web page segmentation, but considers the visual, layout and semantic features as compared to the earlier visual-based approach (Cai et al., 2003). As part of this thesis, we aim to develop the domain-specific multi-stage query language described in the previous section. Figure 1.4(a) represents the query formulation process for clinical queries through the proposed multi-stage query language. At each stage of query formulation the user can dynamically select a medical concept to query, assign a value for it, and then either execute the query or refine it further by adding other attribute(s) and view the results. The query is executed on the user-level schema. It provides the users with segment-level results. Hence, the proposed query language can allow the user to formulate complex DB-style queries using a simple interface and understandable attributes. Figure 1.4(b) shows the query formulation for the query "Cases where a patient has fever due to affliction of pneumonia and tuberculosis" using the proposed multi-stage query language. The query fetches precise results by querying only specific contexts or segments (Causes or Symptoms). This thesis will propose query methods (by using a user-level schema) for the medical domain users. The following steps will be needed to complete this thesis.

1. Multi-stage Query-by-Segment (Tag) Query language for on-line medical documents. In this thesis, we consider relating the user-level schema with a multi-stage query-by-segment query language. The term segment can be defined as a unique

Figure 1.4: (a) Query formulation through the proposed multi-stage query language on the user-level schema; (b) a sample query formulated using the multi-stage query language.

identifiable or query-able entity in documents belonging to specialized domains. It may be defined as a medical concept, topic or subtopic label in a medical document repository. The user can dynamically select an attribute to query on the user interface (UI) and subsequently enter the value for that attribute. Further, the user can append more attributes to formulate the desired query. The interface allows the user to query the knowledge archives and knowledge base without the need to understand the structure of the underlying schema or to have any technical expertise.

2. XQBE for on-line medical document repositories (MXQBE). We attempt to use the XQBE graphical query language (Braga et al., 2005) on the user-level schema generated by the proposed approach. In this approach, the topic and subtopic labels (query-able attributes) form the intermediate nodes of the XQBE structure. The values of the attributes are added as leaf nodes. Since the user interface is graphical, the users can easily drag and drop the nodes, adding the filtering conditions without any prior knowledge of the schema structure. They can specify the attributes (segments) they wish to receive as results.
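To make the idea concrete, the following is a minimal sketch of segment-level querying over such a user-level schema. The XML layout, the element names (`document`, `causes`, `symptoms`), the sample content, and the case-insensitive substring matching rule are illustrative assumptions for this sketch, not the thesis implementation, which derives the schema from real MedlinePlus documents.

```python
import xml.etree.ElementTree as ET

# A toy user-level schema: each document covers one topic (disease), and
# the segment labels (causes, symptoms, ...) appear as element names.
REPOSITORY = ET.fromstring("""
<repository>
  <document title="Pneumonia">
    <causes>bacterial or viral infection of the lungs</causes>
    <symptoms>fever, cough, chest pain</symptoms>
  </document>
  <document title="Migraine">
    <causes>stress, certain foods</causes>
    <symptoms>headache, nausea</symptoms>
  </document>
</repository>
""")

def multi_stage_query(conditions, result_segment):
    """Return (title, segment text) pairs for every document whose
    segments satisfy ALL of the (segment, keyword) conditions."""
    matches = []
    for doc in REPOSITORY.findall("document"):
        if all(keyword.lower() in (doc.findtext(segment) or "").lower()
               for segment, keyword in conditions):
            matches.append((doc.get("title"), doc.findtext(result_segment)))
    return matches

# Stage 1: a condition on Symptoms; Stage 2: AND a condition on Causes.
answers = multi_stage_query(
    [("symptoms", "fever"), ("causes", "infection")], "causes")
```

Each additional (segment, keyword) pair corresponds to one further stage of query refinement; OR/NOT connectives would be handled analogously, and only the requested result segment is returned rather than the whole document.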
Usability studies are planned to show that this kind of query language is easier to use and returns relevant results to the users.

Query Support for Patient-Specific Information Repositories

Standardized Electronic Health Records (EHRs) (based on the openehr standard (T. et al., 2008)) make use of archetypes 1 for the representation of data. In combination with terminologies, the archetypes enable powerful possibilities for semantic querying of repository data. Such querying enables longitudinal processing of health data, regardless of the originating system. The semantics of data are better understood by viewing the data in the context of the user interface (UI). Query-by-Example (QBE) is among the earliest graphical query languages for relational databases and uses a table-like interface (Zloof, 1977). The approach simplifies querying over a relational database for novice users by presenting the query-able attributes with the

1 An openehr archetype is a computable expression of a domain content model in the form of structured constraint statements, based on the openehr reference model (art, 2003). Archetypes are defined by clinicians. In general, they are defined for re-use, and can be specialized to include local particularities. They can accommodate any number of natural languages and terminologies.

relations in table format; the user simply has to fill in the values for the attributes on which he or she wishes to condition the query.

Archetype QBE Query Language

The proposed AQBE (Archetype Query-By-Example) approach (Sachdeva et al., 2012b) has been developed along similar lines to the QBE interface. This study adopts the AQBE approach with the use of the MongoDB (NoSQL) database. The target audiences of the interface are the skilled and semi-skilled users in the health-care domain. The interface is expected to eliminate the need to learn the AQL (Archetype Query Language) syntax (AQL, 2011) (and the ADL (Archetype Definition Language) (Beale and Heard, 2008a)) for querying purposes. The proposed AQBE query generator provides a quasi-relational query language interface which is independent of the need to generate corresponding AQL queries (for example, with the query builder developed by Ocean Informatics (aql, 0 30)). This is in contrast to the work proposed in (Sachdeva et al., 2012b), which depends on AQL query generation before the queries are executed on the database layer.

Automated Usability Framework for Standardized EHRs Databases

The proposed off-line usability support framework (Figure 7.1) aims to capture the usage patterns of the various user groups instead of large volumes of usage logs. These patterns are further used to enhance system usability. Changing a software system such as a standardized EHRs database, being a life-long health-record system, is enormously difficult and expensive (WALKER, 2008). Usability considerations are required throughout the design life-cycle, including a complete consideration of end-to-end work-flows (Constance M. et al., 2005). There is a need to automate the process of pattern discovery from usage logs and to store these patterns for system re-engineering.
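The pattern-discovery step can be pictured with a small sketch: given per-session action logs, contiguous action sub-sequences that recur across sessions become candidate usage patterns. The action names and the simple n-gram counting below are assumptions made for illustration; the framework described in Chapter 7 applies data-analytical tools to real usage logs.

```python
from collections import Counter

def frequent_patterns(sessions, length=2, min_support=2):
    """Count contiguous action sub-sequences of a given length across
    user sessions and keep those meeting a minimum support threshold."""
    counts = Counter()
    for actions in sessions:
        for i in range(len(actions) - length + 1):
            counts[tuple(actions[i:i + length])] += 1
    return {pattern: n for pattern, n in counts.items() if n >= min_support}

# Hypothetical usage logs: one action sequence per user session.
sessions = [
    ["open_form", "select_archetype", "fill_field", "run_query"],
    ["open_form", "select_archetype", "run_query"],
    ["open_form", "help", "select_archetype", "fill_field", "run_query"],
]
patterns = frequent_patterns(sessions, length=2, min_support=2)
```

Patterns that hold for one user group but not another (for example, frequent detours through a help action) would then point at the usability and learnability issues the framework aims to surface.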
1.4 Structure of the thesis

The thesis is organized in two main parts corresponding to the two instances of medical information considered. The first part gives the details of the literature review undertaken and the research study done for the knowledge-based medical information resources on the Web. The second part gives the literature review undertaken and the research study done for the standardized Electronic Health Record databases (patient-specific information resources).

In Part I, Chapter 2 gives the literature review pursued in the area of Web document segmentation. The concept is further used in the methodology for creating a user-level schema corresponding to the on-line Web document repository. This helps in utilizing the headings/subheadings as query-able attributes for relational query language support for the health-care (expert) workers using the on-line repositories. Chapters 3 and 4 describe the in-depth and high-level query languages proposed in the study. These query languages are capable of providing granular results for the user queries. The need for a multi-stage query language and the features of the user-level schema corresponding to the on-line medical document repositories are given.

In Part II, Chapter 5 describes the work done to support the health-care workers in adopting and querying the standardized electronic health records (EHRs), and gives the literature review done in the area of standardized EHRs. Chapter 6 describes the AQBE query language (a high-level relational-like query language) that is supported on a NoSQL-based persistence of these EHRs. Chapter 7 describes off-line usability support studies for the standardized EHRs-based database systems.

Chapter 8 summarizes the contributions of the thesis and gives the limitations and future work required for the study. Chapters 2 and 5 give the overview of the literature review undertaken for each instance of medical information. Each of the chapters is broadly structured into sections describing the research area, problem statement, background and motivation for the research problem, and the proposed approach. These are followed by the experimental evaluation, discussion and conclusion sections.

Overview of Chapters

Chapter 2 examines the existing approaches for Web document segmentation. It classifies the approaches on the basis of their partitioning method and underlying approach. Further, it compares these approaches on the basis of the key features of Web document segmentation (semantics, precision, authenticity, resultant structure, performance with respect to time and space, and dependency on a Webpage programming standard or language). It studies the various application domains which use Web document segmentation as an intermediate step. It describes the various evaluation measures used and discusses their shortcomings. It also relates the database support required for Web document segmentation.

Chapter 3 highlights the challenges faced by the health-care end-users in in-depth querying of Web-based health-care information resources. It compares the existing works for in-depth querying and highlights the concerns about their applicability to health-care documents. It considers the need for segment-level search/query. The details of the proposed Query-By-Segment-Tag (QBT) graphical query-language interface are given. Qualitative and quantitative evaluations show the effectiveness and efficiency of the proposed query language. A user study is done to evaluate the acceptability of the QBT interface by the users.

Chapter 4 describes the need for and the process of adapting the existing XQBE query language to the on-line medical document repositories.
To meet these objectives, the study proposes a two-stage framework. The chapter details the model of the document repository (the MedlinePlus medical encyclopedia), the creation of the user-level schema as an off-line process, and then describes the on-line querying process. It evaluates the framework on the real-world on-line document repository dataset (MedlinePlus Medical Encyclopedia). Further, the results obtained are analyzed, and the study discusses the feasibility and ease-of-use of the proposed work over the existing domain-specific search tools and states the strengths and limitations of the study.

Chapter 5 details the role of semantic interoperability in various cross-organizational business processes in the health-care domain. It highlights the dual-level modeling approach of the openehr standard and describes the role of archetypes (artefacts). It compares the existing information search and querying methods for EHRs systems and analyses their applicability to the dual-level standardized EHRs. The later part of the chapter discusses the various user-system interactions and the role of humans in the design of EHRs systems. It describes the need for pattern-mining techniques for enhancing the human-system interaction model for the standardized EHRs.

Chapter 6 considers the openehr-standard-based Standardized Electronic Health Records (EHRs) schema using dual-level modeling for information exchange. It describes a new persistence mechanism using a NoSQL database for storing the standardized EHRs. Further, a high-level QBE-like query language, AQBE (Archetype-based Query-By-Example), has been evolved for the EHRs data repository. It also explains the complexity in the structured EHRs and the archetypes. The chapter highlights the need for an efficient and scalable persistence mechanism for these standardized EHRs. The last part of the chapter details the AQBE database system as a support for in-depth query-ability and the different types of user queries supported.

Chapter 7 explains the need to reduce the dependency on post-release user feedback and surveys and to facilitate the task of system redesign (and re-engineering). The study considers the correlation between the socio-technical features of the users and their usage patterns over the standardized EHRs database. It describes the proposed automated usability framework for these database systems. The framework is detailed in the chapter, and its features, user-centric design and automated usability support are described. Experimental evaluation and user studies support the aim of the work to reduce the gap between the designed application flow and the user work-flows (anticipated by them) within a standardized EHRs database system. It discusses the applicability of the framework as an e-learning scheme.

Part I

Web Document Segmentation for MedlinePlus Medical Encyclopedia

Chapter 2

Literature Review and Background Studies

2.1 The Web Documents

The World Wide Web has become a repository of many important documents (Alcic and Conrad, 2011). Consequently, Web-related applications have become significant, and the acquisition, detection and analysis of Web contents is receiving more and more attention. Information extraction algorithms have emerged in a big way to improve the efficiency of information search (Chen et al., 2010). Most of the data on Webpages is encoded using markup languages such as HTML (Jinlin et al., 2001). The existing information retrieval systems consider Webpages as the smallest, indivisible units. A Webpage, however, contains multiple topics that are often not related to each other. In this context, detecting the content structure of a Webpage could potentially improve the performance of Web information retrieval. Webpages are normally composed for viewing; they lack information on semantic structures (Jinlin et al., 2001). Over a period of time, the amount of information and services available on the Web has increased and, simultaneously, Web-centric business models have evolved. Thus, identifying and retrieving distinct information elements from the Web is becoming increasingly difficult. Separating (segmenting) the distinct elements and accurately classifying them into relevant and non-relevant parts is essential for high-quality results (Kohlschütter and Nejdl, 2008). Applications such as Web automation need to identify user-interaction (UI) components, such as buttons and links (Ayelet and Ruth, 2009). The Webpage structure (layout) varies with the content types, and many Web designers prefer to use their own styles. The HTML headings (tags) convey a document's logical structure (El-Shayeb et al., 2009). HTML retains its integrity by relegating most of the physical formatting of Webpages to Cascading Style Sheets (CSS), which are separate from HTML itself. At the same time, advanced programming for interactivity on the Web has been taken up by applications and languages that are not part of HTML (e.g., Java, JavaScript, CGI and Perl). Furthermore, because of the flexibility of HTML syntax, many Webpages do not follow the W3C HTML specifications. This leads the HTML tags to poorly reflect the actual semantic structure of a page (Azmy, 2005). Consequently, the main tag containing the main content differs across Websites (Rahmani et al., 2010). Within a single Webpage, it is therefore important to distinguish valuable information from noisy content (Song et al., 2004) and to detect its logical structure.
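The point that markup says little about semantics can be illustrated with a toy example. The sketch below (illustrative only; the sample markup and class names are assumptions, not taken from any corpus discussed here) parses a small page with Python's standard `html.parser` and tallies the tags: generic layout tags such as `div` dominate, and nothing in the tag names distinguishes the main content from navigation or advertisements.

```python
from collections import Counter
from html.parser import HTMLParser

# A tiny page: the markup (generic divs) says nothing about which
# region is the main content and which is navigation or an ad.
SAMPLE = """
<html><body>
  <div class="nav"><a href="/a">Home</a><a href="/b">News</a></div>
  <div class="main"><h1>Story</h1><p>Main article text...</p></div>
  <div class="ads"><p>Sponsored link</p></div>
</body></html>
"""

class TagCounter(HTMLParser):
    """Count start tags while walking the page."""
    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

parser = TagCounter()
parser.feed(SAMPLE)
print(parser.tags.most_common(3))  # generic layout tags dominate
```

A segmentation algorithm therefore has to bring in extra evidence, visual rendering, entropy, or clustering, rather than trusting the tag vocabulary alone.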

2.1.1 Need for Web Document Segmentation

Breaking a Webpage into its principal parts and indexing only the main content can improve indexing performance. Improved retrieval performance can be achieved by treating the page as a divisible unit with an underlying semantic structure, in which coherent segments (topics and subtopics) are the atoms. For example, annotations for Web images become more precise when they are extracted from the connected paragraph. Similarly, duplicate content appearing within a document or among different documents can be detected. Webpage accessibility can be improved by a clear document structure and duplicate elimination. The above-mentioned applications underline the need for Webpage segmentation in Web-related applications. Segmentation algorithms can be used to improve the quality of results in many Web mining applications (Cao et al., 2008), (Li et al., 2007). Segmentation techniques are also useful for the contextual advertisement problem, where advertisers often wish to place (or avoid placing) distinct advertisements on specific regions of a specific Webpage. Traditional information retrieval assumes a document is the basic information unit and extracts keywords from the document to create an index. Web documents, however, consist of differing parts (blocks), for example a main content area, navigation bars, headers, footers and advertisements, each of which might additionally include multiple sub-divisions. This results in inaccurate information retrieval, so that unwanted documents receive higher rankings. An application that intends to re-use Web content, such as a search engine or a Web-to-print application, needs to identify the regions of the page that contain distinct information (Ayelet and Ruth, 2009). Segments demarcate informative and non-informative content on a Webpage. Most Web documents contain explicit structure descriptions. It is necessary to develop methods for extracting useful information from the documents in order to allow their integration into existing databases or information systems (Burget, 2007). The page segments enable streamlined content management and allow pages to be composed in terms of reusable components.

2.1.2 Web Document Segmentation: Tasks and Definition

Each region within the document plays a particular role within its page; such a region is referred to as a segment. A recent study defines a segment of a Webpage as a self-contained logical region (Fernandes et al., 2011). A Webpage segment: 1. is not nested within any other segment, and 2. is represented by a pair (l, c), where l is the label of the segment, represented as the root-to-segment path in the Webpage DOM tree, and c is the portion of text of the segment. The study defines the segment mathematically: a Webpage ρ_n may be represented as a finite set of non-overlapping logical segments ρ_n = {s_1, ..., s_k}, with k varying according to the page structure. Similarly, another study defines a Webpage segment as a region that a user would identify as distinct from the rest of the page in some way (Ayelet and Ruth, 2009). Chakrabarti et al. define a Webpage segment as a fragment of HTML which, when rendered, produces a visually continuous and cohesive region in the browser window and has a unified theme in its content and purpose (Chakrabarti et al., 2008). Table 2.1 gives the various terminologies used for segments in a Web document and their definitions.

Table 2.1: The Various Terminologies Used in the Literature for Describing Segments

S.No. | Terminology   | Definition                                                                 | Representative work
1     | Objects       | Basic objects of a Webpage are indivisible; combinations of them form composite objects | (Jinlin et al., 2001)
2     | Pagelets      | Can be defined both semantically and syntactically                         | (Bar-Yossef and Rajagopalan, 2002)
3     | Segments      | A fragment of HTML which, when rendered, produces a visually continuous and cohesive region with a unified theme in its content and purpose | (Mukherjee et al., 2003), (Chakrabarti et al., 2008), (Vineel, 2009)
4     | Web Elements  | The basic elements which play different roles within a Webpage             | (Yu et al., 2003)
5     | Fragments     | A portion of the Webpage which has a distinct theme or function and is distinguishable from the other parts of the page | (Ramaswamy et al., 2004)
6     | Area          | Blocks of interest for the users on the Webpage                            | (Burget, 2007)
7     | Blocks        | Information blocks formed by closely related content, forming a topic within the page | (Cao et al., 2010), (Song et al., 2004), (Cai et al., 2003), (Lin and Ho, 2002), (Kohlschütter and Nejdl, 2008), (Alcic and Conrad, 2011)

2.1.3 Evolution of Web Page Segmentation Approaches

The first algorithms for extracting the structure of a Webpage were introduced around 2001. These focused on HTML page analysis based on visual cues (Jinlin et al., 2001). Another algorithm discovered the functions of various objects within the Webpage (Jinlin et al., 2001); it attempts to detect the underlying template of the Webpages. These methods work well for Webpages that strictly follow the HTML DOM structure. They were followed by algorithms based on the visual representation of the Webpages (Gu et al., 2002), (Jinlin et al., 2001). In 2003, the most widely recognized Webpage segmentation algorithm, the VIPS (VIsion-based Page Segmentation) algorithm, was proposed (Cai et al., 2003).
Also, an ontology-based algorithm was proposed by (Mukherjee et al., 2003). Later, an algorithm based on the repetitive structures within the Webpage was proposed (Nanno et al., 2004). Similarly, Ramaswamy et al. proposed an algorithm to automatically generate fragments (Ramaswamy et al., 2004). Subsequently, Krüpl and Herzog proposed a visually guided bottom-up table detection and segmentation approach for Web documents (Krüpl and Herzog, 2006). El-Shayeb et al. proposed an approach based on heading detection to discover the latent structure of the documents (El-Shayeb et al., 2009). (Burget, 2007) proposed a counter-algorithm to (Cai et al., 2003) based on document layout features, and (Chakrabarti et al., 2007) used an isotonic smoothing technique for page-level template detection, whereas (Guo et al., 2007) proposed a Webpage segmentation technique based on geometric and style information. (Chibane and Doan, 2007) proposed a topic segmentation approach based on visual criteria and content layout. In 2008, Webpage segmentation was addressed as a graph-theoretic problem (Chakrabarti et al., 2008) and as a densitometric problem (Kohlschütter and Nejdl, 2008). Pnueli et al. proposed a visual segmentation method (Ayelet and Ruth, 2009). Vineel proposed a Webpage DOM node characterization algorithm for Webpage segmentation (Vineel, 2009). (Cao et al., 2010) proposed a segmentation method for Webpage analysis using shrinking and dividing. In 2011, (Fernandes et al., 2011) utilized the VIPS algorithm and the DOM structure of the Webpages within a Website to create clusters of segments, and (Alcic and Conrad, 2011) used Web content clustering techniques for Webpage segmentation. Figure 2.1 represents the work done in the area along a timeline.

Figure 2.1: Webpage segmentation algorithms on a timeline.

2.2 Web Document Segmentation: Approaches

A Webpage may be partitioned in a top-down or a bottom-up manner. Moreover, the Webpage segmentation algorithms can be classified on the basis of their underlying methodology, considering image processing, entropy or graph-based methods.

2.2.1 Classification Based on Partitioning

Webpages are segmented in a top-down manner by iteratively partitioning a complete Webpage until a pre-defined granularity level is achieved. These approaches consider the Webpage as a single entity. Most of them employ heuristic rules to estimate appropriate block structures.
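A top-down pass of this kind can be sketched as follows. This is a minimal illustration, not any specific published algorithm: the block structure, labels and size threshold are assumptions, and real methods such as VIPS rely on visual separators and heuristic rules rather than a plain size test.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    label: str                     # root-to-block path, e.g. "body/div[1]"
    text: str = ""                 # text directly inside this block
    children: list = field(default_factory=list)

def total_text(block):
    """All text contributed by a block's subtree."""
    return block.text + "".join(total_text(c) for c in block.children)

def segment_top_down(block, max_chars=40):
    """Recursively split a block until it is small enough (or a leaf)."""
    if not block.children or len(total_text(block)) <= max_chars:
        return [block]
    segments = []
    for child in block.children:
        segments.extend(segment_top_down(child, max_chars))
    return segments

# A toy page: a short header block and a large content block.
page = Block("body", children=[
    Block("body/div[0]", "short header"),
    Block("body/div[1]", children=[
        Block("body/div[1]/p[0]", "a long paragraph " * 5),
        Block("body/div[1]/p[1]", "another long paragraph " * 5),
    ]),
])
print([s.label for s in segment_top_down(page)])
```

The header survives as one segment, while the oversized content block is split into its paragraphs, which is the essence of iterative top-down partitioning toward a pre-defined granularity.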
Webpages may also be segmented in a bottom-up manner, considering how the individual elements combine to form the components of a Webpage. This approach may consider the (leaf) nodes of the DOM representation, the text elements of the Webpage, or the Webpage elements as nodes of a Webpage graph. It assumes each of these elements to be an atomic unit of content. In a subsequent step, these units are grouped into segments.

Top-Down Partitioning

The first of the top-down approaches, such as the algorithm proposed by Chen et al., distinguish five high-level blocks, namely header, footer, left sidebar, right sidebar and main content (Jinlin et al., 2001). It proposes to understand the author's intention by identifying the objects of the

Table 2.2: Classification of Algorithms on the Basis of Partitioning Approach

Partitioning Method    | Representative work
Top-Down Partitioning  | (Gu et al., 2002), (Lin and Ho, 2002), (Cai et al., 2003), (Juan et al., 2005), (El-Shayeb et al., 2009), (Guo et al., 2007), (Vineel, 2009), (Cao et al., 2010), (Sano et al., 2011)
Bottom-Up Partitioning | (Li et al., 2002), (Krüpl and Herzog, 2006), (Burget, 2007), (Chakrabarti et al., 2008), (Kohlschütter and Nejdl, 2008), (Alcic and Conrad, 2011)

Webpage, their functions and category. After the identification step, it creates a function-object model of the Webpages and defines a set of general adaptation rules. These rules are based on the function-objects of the Webpage and drive the adaptation process over the Wireless Application Protocol (WAP). Similarly, an automatic top-down, tag-tree-independent approach to detecting Web content structure has been proposed (Gu et al., 2002). In contrast to the function-object model, it generates the Web content structure, which represents the layout of a Webpage. The structure captures the user's understanding of the contents of the Webpage while browsing, based on cues such as color, position and size. It determines the horizontal visual separators distinguishing the content, then divides and merges blocks and applies a projection to segment the Webpage. The blocks are divided and merged on the basis of their visual similarity, which retains the logical chunks in the Webpage. For table-based HTML documents, Lin and Ho propose a simple partitioning method based on HTML table structures (Lin and Ho, 2002). It calculates an entropy value for each feature of the table and, according to entropy thresholds, classifies the blocks of the table as redundant or informative. Along similar lines as (Gu et al., 2002), the Vision-based Page Segmentation algorithm proposed by Cai et al.
computes (for each block) a Degree-of-Coherence (DoC) utilizing heuristic rules based on the DOM as well as visual features (Cai et al., 2003). It detects the horizontal and vertical separators visually separating the content of the Webpage. Once the separators are detected, it generates the content structure by dividing and merging the visual blocks based on a set of heuristic rules and a predefined partitioning threshold. Kao et al. define an approach where blocks of DOM subtrees are separated based on the entropies of the contained terms (Kao et al., 2005). It generates an intra-page informative structure by connecting the information blocks on a Webpage, where an information block is a region of user interest within a Webpage. It performs block condensation and expansion by assembling the sub-trees with similar features. (El-Shayeb et al., 2009) infers the hierarchical structure of a document by identifying headings and their relationships with each other. Two phases are employed for this task: (i) the heading detection phase and (ii) the identification of relationships between headings. In the first phase, text portions of the document are identified as headings and their features are extracted and stored. In the second phase, the heading levels are assigned. In (Juan et al., 2005), the proposed algorithm DeSeA (Delimiter-based Segmentation Algorithm) divides the Webpage

into coherent and relevant blocks using predefined delimiters. It uses an EDT (Extended DOM-Tree) from which the relevant blocks are extracted by recursive splitting and merging of the blocks. Each time segmentation is performed, the EDT is split based on a specified level of delimiters. The algorithm of (Guo et al., 2007) works in three steps: (i) it inputs the HTML Webpage to the Mozilla engine, which outputs a frame tree comprising the visually rendered features of the Webpage; (ii) the output frame tree is passed through a block finder, which extracts frame subtrees as blocks; (iii) finally, a block partition tree is produced as output. (Vineel, 2009) defines and assigns values for content size and entropy to measure the strength of local patterns within the subtree of a node in order to perform page segmentation. The algorithm proposed by Cao et al. processes the image of a Webpage using an edge-detection algorithm. It segments the Webpage by iteratively shrinking and dividing the generated image: it detects the zones of the image that can be divided and then repeatedly segments them until all blocks are indivisible (Cao et al., 2010). Sano et al. propose a 3-step approach to extract the semantic structure of a Webpage. It first detects the layout template of a Webpage and then divides the page into minimum blocks (a block may represent the title of the content on the Webpage) (Sano et al., 2011). Further, it assembles groups of these blocks into Web contents, using decision-tree learning with nine parameters to extract the title block from each minimum block.

Bottom-Up Partitioning

Li et al.'s algorithm (Li et al., 2002) introduced the concept of micro information units (MIUs). It considers that a Webpage is often populated with a number of small information units, and segments a Webpage by identifying the MIUs in the page through an off-line process. It creates an HTML tag-tree of the basic tags which comprise a particular Webpage.
It then extracts the MIUs from the Webpage using the tag-tree, merging adjacent tags to form coherent content units. The algorithm proposed by (Krüpl and Herzog, 2006) works bottom-up by grouping word-bounding boxes into larger groups, using a set of heuristics. It works by extracting the visually salient entities (the formatting and visual presentation represent the semantics of the content). It uses simple word tokenization in the first step and returns the bounding boxes of all the words for a given URL. It defines heuristics for simple table-column identification and extraction based on the alignment of cells and contents within a table. Kohlschütter et al. define an abstract block-level page segmentation model which focuses on the low-level properties of text instead of DOM-structural information. The key observation is that the number of tokens in a text fragment (more precisely, its token density) is a valuable feature for segmentation decisions. This reduces page segmentation to a 1D-partitioning problem: for example, the task of detecting block-separating gaps on a Webpage is reduced to finding neighboring text portions with a significant change in the slope of the block-by-block text density (Kohlschütter and Nejdl, 2008). Alcic et al. define Web contents as parts of a complex DOM tree that can be directly derived from the HTML of their hosting page. The representation of Web contents is based on the included contents and their semantics. Once the contents are rendered in a browser, they possess geometrical properties such as an exact position in the output panel and horizontal and vertical dimensions.
The approach further uses geometric, semantic and DOM-based distance measures with common clustering techniques to find appropriate page blocks (Alcic and Conrad, 2011).

2.2.2 Classification Based on the Underlying Methodology

Here, the segmentation algorithms are classified on the basis of the underlying approach used for partitioning. The algorithms can be broadly classified into eight categories:

(i) DOM-based algorithms: these construct and analyze the DOM tree corresponding to the Webpage, utilizing the DOM nodes and tags as delimiters within the segmentation process.

(ii) Layout-based methods: these use the layout information after rendering the content blocks, assuming that similar content blocks are located close to each other and have similar shapes. Cai et al. use layout information such as font, color and size to restructure a Webpage into a content block tree; Webpages have various parts with different layouts (associated pages, headers and footers).

(iii) Methods based on visual cues: these utilize the visual features of the Webpage contents, such as the rendered coordinates, text color, font and size, along with other features such as the DOM-tree structure.

(iv) Mining-based algorithms: these focus on the smallest content units of a Webpage, reducing page segmentation to clustering of Web contents into structurally and semantically cohesive groups. They define distance measures for content units based on DOM, geometric and semantic properties.

(v) Text-density-based methods: these treat segmentation as a linguistic problem and analyze the text density of the content groups in order to discover the Webpage segments.

(vi) Graph-based algorithms: these construct a graph of the Webpage and then segment the graph on the basis of the weights of the nodes and edges, representing the semantic associations and content groups respectively.

(vii) Image-processing-based algorithms: these generate an image of the Webpage and then extract the segments by iteratively shrinking and dividing the image until indivisible blocks are formed.

(viii) Miscellaneous approaches: these include the other algorithms.
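To make one of these categories concrete, the text-density idea (category v) can be sketched as a block-fusion pass over a linear sequence of text blocks that merges neighbors whose token densities are close. This is only a rough sketch after the densitometric idea of (Kohlschütter and Nejdl, 2008); the density measure, line width and threshold below are illustrative assumptions, not the published algorithm.

```python
def token_density(text, line_width=80):
    """Tokens per wrapped line -- a crude stand-in for text density."""
    lines = max(1, -(-len(text) // line_width))   # ceil(len / width)
    return len(text.split()) / lines

def fuse_blocks(blocks, max_delta=0.5):
    """Merge adjacent blocks whose densities differ by at most max_delta."""
    fused = [blocks[0]]
    for block in blocks[1:]:
        if abs(token_density(fused[-1]) - token_density(block)) <= max_delta:
            fused[-1] = fused[-1] + " " + block    # fuse into previous block
        else:
            fused.append(block)                    # density jump: new segment
    return fused

paragraph = "word " * 60                  # dense article text
blocks = ["Home News About Contact",      # sparse navigation links
          paragraph, paragraph,           # two article paragraphs
          "Copyright 2014 Example Corp"]  # sparse footer
print(len(fuse_blocks(blocks)))           # prints 3
```

The two dense paragraphs fuse into a single main-content segment, while the sparse navigation and footer blocks remain separate, mirroring how a density jump marks a block-separating gap.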
In the following sub-sections, each of these approaches is discussed in detail in terms of (i) the basic approach, (ii) the resultant structure and (iii) precision.

1. DOM-Based Algorithms: These algorithms render the DOM tree associated with a Webpage and perform a syntactic analysis of the page. Under this category we analyze the following algorithms: (Jinlin et al., 2001), (Chung et al., 2002) and (Juan et al., 2005).

(Jinlin et al., 2001): This work detects visual similarities of content objects on Webpages. The approach defines four kinds of objects, simple, container, group and list, representing, respectively, the unbreakable visual HTML objects, ordered sets of objects, special container objects rendered on the same text line, and objects satisfying some consistency constraint. It then uses the visual similarity of simple objects to compare the visual similarities of more complex (container) objects. Text attributes such as font face, style, size and color are considered. Starting from the root node, the label of the path to a node is called a pattern, and the leaves under the node are the positions of the pattern in the string. From the results of frequency counting, the best patterns are chosen based on heuristics. Structured documents are constructed in a recursive manner: the pattern detection algorithm is applied to the elements of the initial container objects, and detected patterns are converted to list objects. The final container object is the hierarchical structured document, which is in effect a tree representation of the original page.

(Chung et al., 2002): The approach performs a document transformation and semantic element tagging process. It utilizes document restructuring rules and information about the topic (in the form of concepts). For the resulting XML documents, a majority schema is derived that describes common structures among the documents in the form of a DTD. These XML documents embed not only logical schematic structures but also semantic information in the form of element names.
(Juan et al., 2005): The DeSeA algorithm is implemented in two steps: a Webpage is divided into coherent blocks, and then relevant blocks are detected from them. The block extraction process is divided into a splitting process and a merging process. In the splitting process,

a Webpage is first segmented into blocks using level-1 delimiters, and the hierarchical structure is recorded in a block tree. For each segmentation round, the EDT (Extended DOM Tree) is segmented using delimiters of a specific level, starting from the root node; the whole Webpage is put into the block tree as the root node first. The blocks produced by DOM-based methods partition pages using the HTML tags. However, many Webpages do not obey the W3C HTML specifications, which causes mistakes in the DOM tree structure. Moreover, two nodes of the DOM tree that appear to be semantically related may actually not be related. Hence, DOM-based segmentation may be too fine-grained: after partitioning, each block represents some information but usually does not provide complete information about a single semantic entity.

2. Template Detection Algorithms: Here, three main template detection algorithms are described.

(Bar-Yossef and Rajagopalan, 2002): The algorithm assumes that several Websites comprise templatized pages. A templatized page shares a common administrative authority and look and feel. It assumes that, in most cases, pages sharing a template also share a large number of common links; hence templates play a role in template-based navigation. It considers Webpages as items and sets of Webpages as itemsets, and discovers the frequent itemsets corresponding to a template. Hence, templates are discovered as an instance of the frequent itemset counting paradigm.

(Chakrabarti et al., 2007): The proposed framework has two main ideas. The first is the automatic generation of training data for a classifier: given a Webpage, a templateness score is assigned to every corresponding DOM node. The second is the global smoothing of these per-node classifier scores by solving a regularized isotonic regression problem.

(Sano et al., 2011): The method comprises three steps.
First, it determines the layout template of a Webpage by template matching. Second, it divides the page into minimum blocks. Third, it assembles groups of these blocks into Web content blocks. It focuses on the blocks which are titles of the Web content within a Webpage.

3. Visual-Cues-Based Methods: People view a Webpage as a collection of different semantic objects. When a Webpage is presented to the user, spatial and visual cues help the user to unconsciously divide the Webpage into several semantic parts. Therefore, it might be possible to automatically segment Webpages using these spatial and visual cues, which are very helpful for detecting the semantic regions in Webpages. Webpages can be partitioned into a 2-D logical structure. The VIPS algorithm (Cai et al., 2003) extracts the semantic structure of a Webpage. This semantic structure is hierarchical: each node corresponds to a block, and each node is assigned a DoC (Degree of Coherence) value to indicate the coherency of the content in the block. The algorithm makes use of page layout features by first extracting all the suitable blocks from the HTML DOM tree, and then tries to find the horizontal or vertical separators between the extracted visual blocks. Finally, based on these separators, the semantic structure of the Webpage is constructed. VIPS tries to fill the gap between the DOM structure and the conceptual structure, simulating how an actual user finds the main content based on structural and visual delimiters. The DOM structure and visual information are used iteratively for visual block extraction, separator detection and content structure construction; finally, a vision-based content structure is extracted. In the VIPS method, a visual block is in fact an aggregation of DOM nodes. Figure 2.2 gives a snippet

of the VIPS algorithm process. The right part of the figure displays the hierarchical structure corresponding to each of the visual blocks (VB) on the Webpage.

Figure 2.2: Approach of visual-segmentation-based algorithms. (Panels: the visual blocks and the tree of visual blocks corresponding to a Webpage [Cai, 2003]; the tree structures corresponding to two Webpages [Fernandes, 2011].)

(Gu et al., 2002): Gu et al. simulate how a user understands the Web layout structure when he or she browses a page, using cues such as object size, position, color and background. A projection-based algorithm is applied to segment a Webpage into blocks. Blocks are further divided into sub-blocks, or merged if they are visually similar; in this way the approach avoids breaking logical chunks. The approach is independent of the physical realization and performs well even when the physical structure differs from the visual layout structure. To construct the Web content structure, basic objects are first extracted from the physical structure (tag-tree). The basic objects are preprocessed to find decoration objects, and similar ones are grouped to reduce complexity. Then, based on the visual representation, the whole page is divided into blocks through projection. Adjacent blocks are merged if they are visually similar. This dividing and merging process continues recursively until the layout structure of the whole page is constructed.

(Fernandes et al., 2011): Fernandes et al. propose an algorithm for segmenting a collection of Webpages from a Website. It assumes that pages belonging to the same Website share a common structure, and hypothesizes that a more accurate segmentation can be achieved by considering all pages of a Website. The method adopts a DOM tree alignment strategy (Vieira et al., 2006). Besides segmenting the Webpages, the algorithm clusters the segments into segment classes, each of which is a set of segments that play the same role in a group of pages.
It takes the set of pages found in the Website as input and produces a set of segment classes as output. The segmentation algorithm is divided into three phases. In phase 1, the DOM trees of the Webpages are pre-processed to obtain a structural representation and the corresponding segment classes; whenever internal nodes with textual content are found, their children in the DOM tree are not represented and they become leaf nodes. The second phase creates an auxiliary hierarchical structure named the SOM tree (SOM is an acronym for Site Object Model) that summarizes the DOM trees of all pages of a Website. The third phase removes the nodes of the SOM tree that a human would consider internal parts of one or more segments. By removing these noisy nodes, the SOM tree becomes a structure formed only by nodes that belong to the label (root-to-segment path) of one or more segments; each leaf node of this new structure refers to a distinct segment class. To get the layout of the Webpage, the algorithm proposed by (Ayelet and Ruth, 2009) uses edge analysis on the GUI image (or a transformation of it), looks for long edges in the horizontal and vertical directions, and then selects the rectangles that do not lie within any other rectangle in the GUI image. After this stage, the algorithm seeks areas containing information and groups them into distinct layout elements. This technique gives the high-level layout. Next, the algorithm finds the next level of layout by recursively computing the outermost rectangles within each layout element. This recursive process continues to the level of individual elements, or until terminated by the user. These elements may be text areas, images, videos, buttons and edit boxes.

4. Page-Layout-Based: The InfoDiscoverer system partitions a Webpage into content blocks using the HTML TABLE tag (Lin and Ho, 2002). Based on the occurrence of features (terms) in the set of pages, it calculates the entropy value of each feature. The entropy value of a content block is then defined from the entropy values of the features it contains. By analyzing this information measure, the method dynamically selects an entropy threshold that partitions blocks into either informative or redundant: informative content blocks are distinct parts of the page, whereas redundant content blocks are common parts.

5. Web-Mining-Based: Several mining techniques can be used for Webpage segmentation.
(Burget, 2007): It proposes a segmentation algorithm based on two levels of box clustering (a bottom-up approach). The input of the algorithm is the set of boxes produced by the rendering engine, each characterized by its position and size on the resulting page. An abstract model of the document is obtained by segmentation; it can be based on general rules that consider the document content, its layout and visual presentation. This makes the extraction method independent of the document format and more robust in comparison to methods based on direct analysis of the document code.

(Vineel, 2009): It presents a DOM tree mining approach for page segmentation. It first characterizes the nodes of the DOM tree based on their content size and entropy: while the content size of a node indicates the amount of textual content contributed by its subtree, the entropy measures the strength of the local patterns exhibited. Based on this characterization of DOM nodes, an unsupervised algorithm automatically identifies the Webpage segments. The aim of the segmentation algorithm is to prune the filtered DOM tree so that the leaf nodes in the resultant tree represent page segments. The algorithm traverses down the tree until it encounters a node which satisfies pre-specified constraints on node size and entropy; the tree is then pruned at this node, and the subtree under the node is not traversed.

(Rahmani et al., 2010): The approach simulates a user's visit to a Webpage and how the user finds the position of the main content (block) on the page. It transforms the input DOM tree (from the input HTML) into a block tree based on visual representation and DOM structure, such that every node has a specification vector; the obtained small block tree is then traversed to find the main block, the one whose computed value is dominant among the block nodes based on its specification vector values. The

introduced method does not have a learning phase and can find informative content on any random detailed input Webpage. (Alcic and Conrad, 2011): The algorithm assumes that a Webpage consists of many atomic content units (Web contents) which can be grouped to form page blocks. Thus the Webpage segmentation problem is reduced to clustering of Web contents. The approach uses three different distance measures (DOM-based, geometric and semantic-based) to compute the dissimilarity of Web contents. Cluster analysis is used for discovering groups of similar objects in various data repositories. Three different clustering methods (partition-based, hierarchical agglomerative and density-based) are utilized. 6. Text Density Based: (Kohlschütter and Nejdl, 2008) utilizes text density as a measure to identify the individual text segments of a Webpage. It builds on methods from quantitative linguistics and strategies from the area of Computer Vision. The distribution of segment-level text density follows a negative hypergeometric distribution, described by Frumkina's law. The approach defines an abstract block-level page segmentation model which focuses on the low-level properties of text instead of DOM-structured information. The number of tokens in a text fragment (token density) is a valuable feature for segmentation decisions. The strength of this approach lies in the fact that it reduces the segmentation problem to a 1D-partitioning problem. It proposes a block fusion algorithm for identifying segments using the text density metric; the algorithm uses text density to merge blocks or segments within the Webpage. 7. Web Page as a Graph: The approach in (Chakrabarti et al., 2008) formulates the segmentation problem in a combinatorial optimization framework.
It casts it as a minimization problem on a suitably defined weighted graph, whose nodes are the DOM tree nodes and whose edges carry weights expressing the cost of placing the end points in the same or different segments. It takes this abstract formulation and produces two concrete instantiations, one based on correlation clustering and another based on energy-minimizing cuts in graphs. Both problems have good approximation algorithms. The quality of segmentation in this algorithm depends heavily on the edge weights. 8. Image Processing on a Web Page: (Cao et al., 2010) proposes an algorithm which segments the Webpage by considering the image corresponding to the Webpage. A layout for segmentation of the Webpage is generated by performing edge analysis on the GUI image (or its transformation). It assumes that the main objects are outlined so that there is a border between them. It searches for areas containing information and groups them into distinct layout elements. This technique gives a high-level layout, thereby segmenting the page into its main components. The approach uses only visual information and does not apply any semantic analysis to group or un-group the elements. It finds the layouts recursively, going deeper into the page. 9. Miscellaneous: (a) Web Elements as Function Objects (FOM): Every object in a Website serves certain basic and specific functions which reflect the author's intention and the purpose of the object. Based on this, (Jinlin et al., 2001) proposed the FOM (Function Object Model) for Website understanding. FOM includes two complementary parts: (i) Basic FOM, based on the basic functional properties of an object, and (ii) Specific FOM, based on the category of an object. Basic FOM represents an object by its basic functional properties and Specific FOM represents an object by its category. Combining Basic FOM and Specific FOM together, an understanding of

the Website is obtained. For segmentation, FOM can be combined with a semantic representation model (such as XML) and a user model. (b) Webpage containing template-based content: Another study seeks to bridge the semantic gap between HTML documents. It considers the fundamental problem of automatically annotating HTML documents with semantic labels. It exploits the fact that semantically related items exhibit consistency in presentation style as well as spatial locality in template-based content-rich HTML documents. The authors have developed a novel framework for automatically partitioning such documents into semantic structures. It converts a semantic-concept-rich document into a semantic partition tree, with each partition (subtree) consisting of items related to a semantic concept (Mukherjee et al., 2003). (c) Web Elements as Fragments: Ramaswamy et al. propose an approach comprising three unique features for fragment detection from the Webpage. It proposes a hierarchical and fragment-aware model of dynamic Webpages and presents an efficient algorithm to detect maximal fragments that are shared among multiple documents. It further develops a practical algorithm that effectively detects fragments based on their lifetime and personalization characteristics. There are two independent fragment detection algorithms: one for detecting shared fragments and one for detecting lifetime-personalization-based (L-P) fragments. Both algorithms can be collocated with a server-side cache or an edge cache, and work on dynamic Webpage dumps (from the Website) (Ramaswamy et al., 2004). 10. El-Shayeb et al. present an algorithm for heading detection and heading-level detection which makes use of the visual characteristics of the Webpage content (El-Shayeb et al., 2009). It uses the identification of headings and their relationships with each other to infer the hierarchical structure of a document.
Two phases are employed for carrying out this task: (i) identification of headings, and (ii) identification of relationships between headings. In the first phase, all document text portions that are recognized as headings are collected and stored along with their features (font size, font weight, color, etc.). In the second phase, a level is assigned to each of the identified headings. 11. Web Page Elements as States of a Hidden Markov Model: A recent study proposes Webpage segmentation based on a generalized Hidden Markov Model (HMM), according to the page content as well as the structural configuration. It uses multiple emission features (term, layout, and formatting) instead of a single emission feature. Regions such as theme text, user interaction, copyright and the Website's label regions may differ according to each Website's design style, so the study establishes an HMM of five states. In this model, the five states themselves are not directly visible, but each observation vector belongs to a certain state. The model reflects the record of changes (individual differences) between different regions.

2.3 Related Work

2.3.1 Application domains

Various applications are studied to utilize segments of the Webpages for user services. 1. Enhanced Information Extraction. Webpage segmentation plays a significant role in automating the existing techniques for Webpage information extraction. The approach proposed in (Liu et al., 2010) utilizes the visual features of deep Webpages to implement

deep Web data extraction. It includes data record extraction and data item extraction. It creates a Visual Block tree from the visual representation of the Webpage. Then, data records are extracted from the Visual Block tree and partitioned into data items. Further, the data items with the same semantics are aligned together. A set of visual extraction rules is generated for the Web database based on sample deep Webpages. Song et al. define block importance estimation as a learning problem (Song et al., 2004). First, the approach uses the VIPS (VIsion-based Page Segmentation) algorithm (Cai et al., 2003) to partition a Webpage into semantic blocks with a hierarchical structure. Then spatial features (such as position and size) and content features (such as images and links) are extracted to construct a feature vector for each block. Based on these features, learning algorithms such as SVM and neural networks are applied to train various block importance models. Better performance is achieved by integrating content features. 2. Improved Information Retrieval. Information retrieval (IR) is the area of study concerned with searching for documents, for information within them, and for their metadata. It also concerns searching structured storage, relational databases, and the World Wide Web. Segmentation can significantly improve the performance of information retrieval on Web documents. The VIPS algorithm groups semantically related content into a segment. The term correlations within a segment are much higher than those in other parts of a Webpage. The most relevant segments from a candidate set of segments are used to select expansion terms for queries (Cai et al., 2003). These selected terms are used to construct a new expanded query to retrieve the final results, yielding better and more accurate information from the Webpage. (Yu et al., 2003) proposes a Webpage segmentation solution for pseudo-relevance feedback.
It is a technique commonly used to improve retrieval performance. Its basic idea is to extract expansion terms from the top-ranked documents to formulate a new query for retrieval. Through query expansion, some relevant documents missed in the initial round can be retrieved to improve the overall performance. If the words added to the original query are unrelated to the topic, the quality of the retrieval is likely to be degraded. Webpage segmentation is applied to pseudo-relevance feedback as follows. The VIPS algorithm is applied to divide retrieved Webpages into segments. After the vision-based content structure is obtained, all the leaf nodes are extracted as segments. Only a few (e.g. 80) top pages are selected for segmentation. The candidate segment set is made up of these resulting segments. A ranking method (such as BM2500) is used to sort the candidate segments, and the top (e.g. 20) segments are selected for expansion term selection. The approach selects expansion terms in a manner similar to the traditional pseudo-relevance feedback algorithm. The expanded query is used to retrieve the data set again for the final results. Hence, by partitioning a Webpage into semantically related units, better query expansion terms can be selected to improve the overall retrieval performance. 3. Extraction of Academic Papers from the Web. Many academic and research papers are embedded in Webpages as HTML. Webpage segmentation can aid significantly in extracting these from the Webpages. The approach in (Liu and Zeng, 2011) views this problem as assigning semantic labels to the text blocks in Webpages, with each semantic label representing one academic paper property. It employs the graphical model called the Conditional Random Field (CRF) model to jointly optimize property detection and labeling, using the approach of the VIPS algorithm (Cai et al., 2003). The unified approach achieves better performance in academic paper extraction than the separated methods, because

the approach can take advantage of the interdependencies across the different properties. The leaf blocks are atomic units, so the extraction task is transformed into assigning possible semantic tags to each leaf block. The semantic tags correspond to the 15 properties of academic papers plus a noise tag. 4. Phishing Detection. Phishing is a criminal activity using social engineering techniques. Phishers try to fraudulently acquire sensitive information (e-banking passwords, social security numbers, credit card numbers and so on) by constructing counterfeit Websites resembling original ones and deceiving users into believing that they are legitimate. According to (Cao et al., 2010), the whole phishing page detection method contains two parts. First, the suspicious Webpages, which may come from spam testing, are saved as images and sent to a phishing detection center for further analysis. Then, the detection center segments the incoming page image into blocks. Finally, the features of the blocks, which include size, color information and the relations between blocks, are extracted. These features compose an attribute relation graph (ARG) of the Webpage image. The method detects the similarity of the suspicious Webpage with protected ones (such as pages of banks and Internet Service Providers) by the nested earth mover's distance (NEMD) of their ARGs. Based on Webpage features, this algorithm is composed of three steps: Web to image, image pre-processing, and dividing. (Kohlschütter and Nejdl, 2008) quantifies the usefulness of their segmentation for the purpose of near-duplicate detection. According to (Chakrabarti et al., 2008), most duplicate detection algorithms rely on the concept of shingles, which are extracted by moving a fixed-size window over the text of a Webpage. The shingles with the smallest N hash values are then stored as the signature of the Webpage. Two Webpages are considered to be near-duplicates if their signatures share shingles.
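The shingle-based signature scheme just described can be sketched as follows; the window size k, signature size n, the choice of MD5 and the overlap threshold are illustrative assumptions rather than values prescribed by (Chakrabarti et al., 2008).

```python
import hashlib

def shingles(text, k=4):
    """All k-word shingles obtained by sliding a fixed-size window over the text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def signature(text, k=4, n=8):
    """Keep the n shingles with the smallest hash values as the page signature."""
    hashes = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16)
                    for s in shingles(text, k))
    return set(hashes[:n])

def near_duplicates(text_a, text_b, threshold=0.5):
    """Two pages are near-duplicates if their signatures share enough shingles."""
    sig_a, sig_b = signature(text_a), signature(text_b)
    if not sig_a or not sig_b:
        return False
    return len(sig_a & sig_b) / len(sig_a | sig_b) >= threshold
```

Keeping only the smallest hashes makes the signature a fixed-size, order-independent sample of the page, so two pages can be compared without storing all of their shingles.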
Moreover, typical shingling algorithms consider all content on a Webpage to be equally important. Some shingles do not represent the core functionality of the Webpage and should not be used to compare whether two Webpages are duplicates. This might cause a shingling approach to consider two distinct Webpages from the same Website to be duplicates in case the shingles hit the noisy content. Similarly, false negatives might occur if two true duplicate Webpages exist on different Websites with different noisy content. 5. Archiving of Webpages. Nowadays, many applications are interested in detecting and discovering changes on the Web to help users understand page updates and, more generally, Web dynamics. Web archiving is one of the fields where detecting changes on Webpages is important. A major problem encountered by archiving systems is to understand what happened between two versions of a Webpage. Vi-DIFF (Pehlivan et al., 2010) compares two Webpages in three steps: (i) segmentation, (ii) change detection and (iii) delta file generation. The segmentation step consists in partitioning Webpages into visual semantic blocks. Then, in the change detection step, the two restructured versions of the Webpages are compared and, finally, a delta file describing the visual changes is produced. This solution for detecting changes between two Webpage versions can serve various applications. Vi-DIFF can also be used after maintenance operations to verify structural and content changes within the Webpage. 6. Beyond Page Level Search. (Nie et al., 2009) proposes that the Webpage understanding problem can be viewed as three subtasks: Webpage segmentation, Webpage structure labeling, and Webpage text segmentation and labeling. Using the Vision-based Page Segmentation (VIPS) algorithm and data record extraction technologies, one can automatically

detect these object blocks, which are further segmented into atomic extraction units (i.e., HTML elements) called object elements. Each object element provides (partial) information about a single attribute of the Web object. The Web object extraction problem can be solved as a Webpage structure labeling problem, assuming there is no need to further segment the text content within the HTML elements. With more semantic understanding of the text tokens one could perform better structure labeling, and with better structure labeling one can perform better page segmentation, and vice versa. Cai et al. (Cai et al., 2004a) explore the use of page segmentation algorithms to partition Webpages into blocks and improve retrieval performance in the Web context. The work compares four types of methods: fixed-length page segmentation, DOM-based page segmentation, vision-based page segmentation, and a combined method which integrates both semantic and fixed-length properties. In this method a Webpage is first passed to VIPS for segmentation, and then to a normalization procedure. The segments generated by the approach proposed by (Fernandes et al., 2011) are applied as input to a segment-aware Web search method. Besides segmenting the Webpages, the algorithm is able to cluster the segments into segment classes, which are sets of segments that play the same role in a group of pages of a site. Their method is particularly useful for producing input to a previously proposed segment-aware ranking method, since it not only segments the Webpages but also clusters the segments into classes. Hence, the results obtained by their method in segment-aware Web search are close to those obtained when using a segmentation approach based on manual intervention.
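Of the four method types compared by (Cai et al., 2004a), fixed-length page segmentation is the simplest to state; a minimal sketch (the window length here is an arbitrary choice, not the value used in that study):

```python
def fixed_length_segments(text, length=200):
    """Partition the page text into consecutive windows of `length` words;
    the last segment may be shorter."""
    words = text.split()
    return [" ".join(words[i:i + length])
            for i in range(0, len(words), length)]
```

Such segments ignore both DOM structure and visual layout, which is why the combined method of that study re-introduces semantic properties on top of the fixed-length split.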
According to (Kuppusamy and Aghila, 2012), the results rendered by search engines are mostly a linear snippet list. The approach proposes a model for dynamic page construction from search engine results using a Webpage segmentation based approach. The algorithm fetches the corresponding Webpages for the first N search results and applies the segmentation algorithm to each of them. It then creates a segmentation matrix from the search result array. The page segmentor fetches the pages from the World Wide Web and applies the segmentation algorithm to them. Hence, the resultant Webpage is generated from the user's perspective.

2.3.2 Database Usage and Web Page Segmentation

(Zeleny, 2011) states that, considering principles like templates, it is possible to reuse Webpage segmentation results for several pages. Segmenting just one page and storing the result for other pages based on the same template can improve segmentation performance significantly, especially for methods based on the visual appearance of the page. Storing the template along with the original page structure (and reusing it afterwards) can be thought of as a cache for segmentation algorithms. The output of every visual segmentation algorithm can be termed a Tree of Visual Areas. Different algorithms have different output formats, but they all have the characteristics of a tree structure. Both the original DOM tree and the Tree of Visual Areas need to be stored. It is also important to know which nodes of the DOM tree are represented by a particular visual area; this is achieved by mapping the Tree of Visual Areas to the DOM tree. The whole segmenting chain is analyzed in order to have complete information for the database design. These input and output structures are then stored in the database, which maintains the segmentation cache. Hence, database usage is an important consideration for a Webpage segmentation algorithm.
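The segmentation cache suggested by (Zeleny, 2011) can be sketched as follows. The class and parameter names are hypothetical; the template key function (for example, a hash of the page's tag sequence) is left to the caller.

```python
class SegmentationCache:
    """Segment one page per template and reuse the result for all pages
    that are built on the same template."""

    def __init__(self, segment_fn, template_key_fn):
        self._segment = segment_fn            # the expensive segmentation algorithm
        self._template_key = template_key_fn  # maps a page to its template identity
        self._cache = {}                      # template key -> stored segmentation

    def segments(self, page):
        key = self._template_key(page)
        if key not in self._cache:
            self._cache[key] = self._segment(page)  # run only once per template
        return self._cache[key]
```

In the database-backed variant described above, the in-memory dictionary would be replaced by tables storing the DOM tree, the Tree of Visual Areas and the mapping between them.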

2.3.3 Segmentation Accuracy Evaluation

The evaluation methods have not yet received sufficient attention for establishing which algorithm has superior segmentation performance. Most of the evaluation methods focus on the recall and precision of the segmentation. There is no effective measure to evaluate the semantic accuracy of the segmentation.

Table 2.3: Summary of performance metrics

S.No.  Evaluation Metric               Definition
1      Precision                       #correct Web segmentations / #all Web segmentations
2      Recall                          #correct Web segmentations / #possible Web segmentations
3      F-Measure                       (2 x Recall x Precision) / (Recall + Precision)
4      Normalized Mutual Information   NMI(X, Y) = I(X, Y) / sqrt(H(X) H(Y))

The choice of the measures used to evaluate and quantify segmentation accuracy is influenced by the approach adopted by different algorithms. These include precision, recall and F-measure. Table 2.3 gives the mathematical formulas for these measures. Recall, in the context of Webpage segmentation algorithms, can be understood as the percentage of available descriptions detected, and precision as the percentage of correct descriptions, in the context of description extraction (Burget, 2007). (Sano et al., 2011) describes these measures in terms of title and non-title blocks. (El-Shayeb et al., 2009) uses the standard information retrieval measures of precision and recall to evaluate the algorithm. Other proposals, such as (Fernandes et al., 2011), (Chakrabarti et al., 2008) and (Alcic and Conrad, 2011), use the Adjusted Rand Index, normalized and average mutual information as measures for evaluating segmentation accuracy. The template-detection-based algorithms, on the other hand, depend on the technique used for template detection; techniques such as isomorphism detection and mining have varying performance. The algorithms which treat Webpage segmentation as an edge analysis problem in the image processing domain, and those which treat the Webpage as an image, are independent of the technology used to generate the Webpage.
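The first three measures of Table 2.3 translate directly into code; a minimal sketch, assuming the segment counts have already been obtained by comparison against a ground truth:

```python
def precision(correct, detected):
    """#correct Web segmentations / #all detected Web segmentations."""
    return correct / detected if detected else 0.0

def recall(correct, possible):
    """#correct Web segmentations / #possible (ground-truth) segmentations."""
    return correct / possible if possible else 0.0

def f_measure(p, r):
    """F-Measure: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

For example, an algorithm that detects 10 segments of which 8 match a ground truth of 16 segments has precision 0.8 but recall only 0.5.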
The time and space requirements of the DOM-based, template detection and visual-cues-based methods are high compared to those of the linguistic approach to Webpage segmentation. The authenticity and precision of the segments resulting from applying these methods to various Webpages is also an important consideration in choosing a suitable segmentation algorithm. Different applications may require varying degrees of precision; an algorithm which is capable of understanding the needs of the users and adjusting the precision accordingly is considered a good algorithm. The resultant structure after the

application of the segmentation algorithm to the Webpage impacts the memory requirements. It is also significant for the application domains that a particular algorithm can address, as Webpage segmentation is mostly an intermediate step of applications like Webpage archiving, phishing detection and information extraction.

2.3.4 Shortcomings of the methods of evaluation

In the proposed evaluations, the context of search or segmentation is not considered. People may read different parts of the page in different contexts. Without knowing the user's intention, it is very difficult to say which part of the page users may read. In some cases, the Webpages used are very specific (news sites), as they have a very specific structure. Considering any of the evaluation methods, it is not clear what the evaluation addresses: does it address the success of the partitioning, or does it address the success of the interaction model? It is very hard to observe the effect of both on the user evaluation. The focus is more on the different presentation techniques and their effectiveness in terms of search time and user experience. When processing time is investigated, the number of pages is very limited and it is not clear what kind of pages are used to test the processing time, for example whether simple or complex pages were used. Most of the algorithms receive the data and metadata of a single Webpage as input and output a content structure, usually in the form of a semantic tree. The advantage of this family of methods is that they are designed to work on a single page (unlike template detection methods). That also implies the main disadvantage: scaling. The contents within each of the Webpages may not be visually distinct in the same manner; there may be differences in the way they are arranged on the Webpage as the users see them (visually).
Especially for visual methods, processing a large number of Webpages can become a serious problem (Zeleny, 2011). Hence, it is not clear how well the visual methods and the other methods can be applied successfully to a large number of Webpages.

2.4 Summary and Conclusions

Web document segmentation is an intermediate step in many Web data application domains such as phishing detection, information extraction, information retrieval, beyond-page-level search, Web archiving and temporal querying. It also plays a role in reverse engineering of Web documents. We carried out a comparative study of various Web document segmentation methodologies. 1. In particular, we classified these methodologies based on the underlying technique and compiled the various application domains where these algorithms are required. 2. We performed an in-depth comparison between the various categories of algorithms to highlight the significant dimensions, such as the precision of the segmentation and the resulting structure corresponding to the Web document. The chapter considers the classification of the existing approaches for Webpage segmentation, relying on how they partition the Webpage or the technique they use to perform segmentation. It points out new research trends that are currently under investigation. The study summarizes research issues that still deserve attention in devising new methods for segmentation that are independent of any Webpage programming language or Web standard. Such methods should have reasonable space and time requirements and be sensitive to the application context and users' requirements. Another interesting research direction is the employment of segmentation in new Web-based applications which are schema dependent and have critical user interface requirements.

Here, eight categories of Webpage segmentation algorithms have been identified on the basis of their underlying techniques. For each of the approaches, their advantages and disadvantages have been listed. This study is helpful for implementing a new segmentation algorithm or for applying an existing method to a new domain. It provides an overview of the existing approaches for solving Web-related problems such as Webpage summarization, information extraction and generating a semantics-based schema close to the user's perception of the Web objects (segments).

Chapter 3

In-depth Querying over Domain Specific Document Repositories

Introduction

The World Wide Web (WWW) is an important source of information. However, it is an inadequate source for serving the information needs of health-care providers. A recent study indicates that only 46 per cent of physicians agree that the Internet is a source of accurate, relevant, and objective content (Malet et al., 1999). A Web document usually contains multiple contents for navigation, decoration, interaction, and contact information; these may or may not be relevant to user queries. Information retrieval from the medical resources accessible over the Web presents a number of challenges. Existing tools and utilities for this purpose include reliable navigation tools, search utilities, and filters for content and quality. The health and medicine domain has a well-developed, controlled vocabulary and domain knowledge to aid information retrieval systems. For example, NLM's MeSH thesaurus (MeS, 0 11), well-defined diagnostic procedures (MED, 9 17), health topics (Hea, 9 18), dictionaries (MED, 9 18) and encyclopedias such as ADAM (ada, 2011), (MaryLand, 2011), (med, 2012a) and GEORGIA (geo, 2011) have evolved over a hundred years of medical expertise of the domain experts (researchers and clinicians). Most information retrieval systems on the Web consider Web documents as the smallest, indivisible units. A Web document as a whole may not represent a single semantic concept. Some basic understanding of the semantic (conceptual) content structure and the various conceptual groups represented within Web documents could improve people's browsing and searching experience (Cai et al., 2003), (Ayelet and Ruth, 2009). Facilitating in-depth querying through a high-level query-interface requires an understanding of the types of medical information resources, the structure of information within them, and the different users that need to be catered to by these resources.
The proposed study aims to generate a hierarchical structure of the Web documents in a medical document repository. It proposes an approach for Web document segmentation, and the resultant structure is utilized to improve querying over medical document repositories. 1 Research Publication(s) - (1) Aastha Madaan and Wanming Chu. In-depth querying of web-based medical documents: beyond single page results. International Journal of Computational Science and Engineering (IJCSE), ( ) [To APPEAR, accepted July, 2013], Inderscience Publishers, ISSN (E): , ISSN (P): and, (2) Aastha Madaan, Wanming Chu, and Subhash Bhalla. VisHue: web page segmentation for an improved query interface for medlineplus medical encyclopedia. LNCS Vol. 7108, Springer-Verlag, Berlin, Heidelberg, pp , ISSN: (Print) (Online).

Table 3.1: Various medical resources on the Web.

S.No.  Resource Type         Details
1      Meeting               Meeting announcements and reports
2      Directory             A list of items from other sources
3      Abstract              Introduction to the content of full text articles or resources
4      Homepage              Institutional or personal resource starting points
5      News                  Releases, newsletters, and updates
6      Cases                 Case presentations
7      Images                Pathology, radiographic, and clinical images
8      Review                Analysis of research reports or synopses of referenced materials
9      Study                 Formal, peer-reviewed, structured, referenced research reports and clinical trials
10     Procedure             Interventions, techniques, surgeries, instrumentation, technical manuals
11     Educational Material  Learning modules, lectures, forms, continuing medical education materials, brief items, tables, charts, tracings, and algorithms
12     Video                 Video transmissions or clips
13     Audio                 Sound clips, radio programs
14     Database              Searchable collection of items or documents

The medical documents on the Web

Medical documents on the Web generally comprise Question and Answer portals (MED, 2012), medical Weblogs (AVV, 9 17), medical reviews (AMR, 9 17) and Wikis (Wik, 9 17). Weblogs and answer portals mainly deal with diseases and medications. The Wikis and encyclopedias provide more information on anatomy and procedures (Wik, 9 17). While patients and nurses describe personal aspects of their lives, doctors present health-related information in their blog posts. The Web-based medical resources consist of the resource types given in Table 3.1. These resources are used by consumers (patients) and health-care providers (researchers and clinicians). There are two major channels for information searching on the Web: using expert Websites such as MedlinePlus (med, 2012a) and WebMD (Web, 0 30), and using Google Search. Expert health Websites provide health information consumers with a rich taxonomy supporting their browsing interface.
This type of information searching is adequate for discovery purposes (Mahoui et al., 2011). On the other hand, searching using Google allows Web users to target a larger search information space and is more suited for targeted queries (Mahoui et al., 2011). The naive users of the health-care domain, i.e., patients and their relatives, use these resources to understand the preliminary symptoms of a disease or to gain knowledge about a health condition. Such users lack domain knowledge and are generally satisfied using the existing keyword search. The medical domain experts are medical researchers, clinicians, and specialists. These users are well versed in medical terminology and are aware of the exact query terms to be used within a specified context. Hence, they need to use an in-depth query-interface.

Figure 3.1: Traditional keyword search (above) and the proposed high-level query-interface over Web health documents.

Figure 3.1 shows the comparison between querying and searching medical Web resources using the existing keyword search and the proposed approach (segment-level results). The former method returns the URLs of the Web documents (indexed with keywords), whereas in the latter, specific and precise segments are returned as query results from the transformed concept database (Web document collection).

Structure of Health Document Repositories

The Web contains a broad array of medical documents containing the traditional medical literature. The breadth and granularity of a resource type is decided by the domain experts (Scott-Wright et al., 2006), (Zhang et al., 2012). A number of medical Web crawlers have become available. One of these is Medical World Search (mws, 2011), which uses the Harvest program and Perl scripts to allow selection of individual fields from remote Web documents and the presentation of search results in various formats. The Web-based medical resources have a well-defined structure. For example, in a medical encyclopedia, a health topic may contain attributes such as the definition of the topic, symptoms, external references, and home care. Each of the subtopics defines the sub-concepts related to the disease or laboratory test concerned.

Users of Health Information

Consumer health information is an important research area in the health-care domain.
It involves the studies, technologies, and communications for bridging the gap between Web document resources and consumers (patients, the general public, and domain experts). To design effective query-interface systems for such on-line resources, it is critically important to gain an in-depth understanding of how consumers search for health information in these systems. Users of medical documents know the subjects covered in the document and the type of resource. For example, a pathologist will obtain more precise search results from a heterogeneous database using a resource type such as pathology images rather than a more general one

such as images. More and more users are searching for health information on-line, alongside the domain experts (clinicians and researchers). Researchers and practitioners study how users choose topics, carry out search processes, formulate and reformulate queries, access the search results, and evaluate health information. Expert users require precise and exact results in real-time during patient care. In-depth query formulation is a major aspect of health information searching behaviour that requires a substantial number of studies (Jenkins et al., 2003).

An Example

A health expert may want to query on-line health resources with complex queries. He or she may require Web documents describing the diseases that do not have a symptom Sy. A document can either mention "without Sy" explicitly, in many different ways, or not mention Sy at all. For a Web document P describing a disease D, the presence or absence of the keywords for Sy on P therefore cannot directly be used as the criterion for deciding whether D has Sy. Hence, it is difficult for a medical information searcher to obtain useful search results purely through traditional keyword search. Moreover, considering only one of the attributes, such as the symptoms, is not enough; the physician or medical expert may have further considerations w.r.t. the symptoms. For example, a physician examining a patient for headache may, after observing the patient, have a query such as, "Find the case where the patient has no eye pain but swelling under the eye." If the symptom (sub-topic) is missing in the returned Web document, that document should not be considered. In-depth querying is defined as enabling the user to specify the contents he or she wishes to query within a context. For a query such as "cases where helicobacter pylori bacteria causes peptic ulcer", the query is formulated in terms of the occurrence of peptic ulcer within the concept symptoms and helicobacter pylori bacteria within the concept causes of the medical resource, say, an encyclopedia.
The user is taken to a more semantic and granular level. If, instead, the query is performed using the existing tools (med, 2012a), they return all the results containing the keywords, or any combination of the keywords, irrespective of the user intent. The aim of in-depth querying is to provide the user a specific set of results (say, where the match occurs as a topic, or where it occurs within a sub-topic).

3.2 Problem Statement

Health-care documents are increasingly becoming available on the Web. These documents are part of various document repositories such as medical dictionaries, encyclopedias, and health topic collections. These resources cater to the information needs of the expert and the novice users in the health-care domain. The existing keyword searches, like those of google.com or yahoo.com, work for some users. At the same time, the large number of results returned by these searches leaves expert users such as clinicians and researchers largely dissatisfied. There is an increasing need to develop a high-level query-interface that serves the querying needs of users without requiring them to understand the structure of the underlying data or complex query syntaxes. The expert users may then receive precise results with high accuracy, while the novice users continue to receive simple results for their queries. Hence, we address this problem in this work by segmenting the Web documents (using concepts based on medical domain knowledge) and then enabling a high-level dynamic query-interface over the segmented Web document database.
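To make the distinction concrete, the following minimal sketch contrasts whole-document keyword matching with segment-scoped matching over pre-segmented documents. The sample documents, segment labels, and function names are hypothetical illustrations, not part of the proposed system.

```python
# Minimal sketch contrasting whole-document keyword search with
# segment-scoped search over pre-segmented documents.
# The documents and segment labels below are hypothetical examples.

docs = [
    {"title": "Migraine",
     "segments": {"symptoms": "eye pain, nausea",
                  "causes": "stress, bright light"}},
    {"title": "Sinusitis",
     "segments": {"symptoms": "swelling under the eye, headache",
                  "causes": "infection"}},
]

def keyword_search(docs, term):
    """Whole-document search: matches if the term occurs anywhere."""
    return [d["title"] for d in docs
            if any(term in text for text in d["segments"].values())]

def segment_search(docs, segment, term, exclude=None):
    """Segment-scoped search: match (and optionally negate) within one concept."""
    hits = []
    for d in docs:
        text = d["segments"].get(segment, "")
        if term in text and (exclude is None or exclude not in text):
            hits.append(d["title"])
    return hits

# "swelling under the eye but no eye pain", restricted to the symptoms segment
print(segment_search(docs, "symptoms", "swelling under the eye", exclude="eye pain"))
```

The keyword variant would return both documents for a term like "eye", while the segment-scoped variant can express the physician's negated-symptom query directly.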

3.3 Background and Motivation

Existing search engines such as Yahoo and Google service millions of queries a day. This mechanism is generic and offers limited capabilities for retrieving the information of interest, burying the user under a heap of irrelevant information (Chakrabarti et al., 2007). Since the documents on the Web are not well structured, a program cannot reliably extract the routine information that a human wishes to discover. For example, if a user wishes to find an article authored by a person X, the query results will show all articles with an occurrence of X. Such results are not relevant for the end-users. Hence, we conclude that there is a need for in-depth querying of the Web documents, which may provide the users results from specific segments of the Web document. In this work, we discuss the querying requirements and existing methods for the health-care domain. We propose a new high-level incremental query-interface to enhance the query capability of the end-users. A typical Web document displays a number of visually distinct messages to the user. It might contain advertisements and links to other relevant pages in addition to the main content of the page. Thus, an application that intends to re-use content on the Web (such as a search engine or a Web-to-print application) needs to identify the regions of the Web document that are relevant (Chakrabarti et al., 2007). These content-groups have carefully placed visual and other cues using which most users can subconsciously segment the browser-rendered page into semantic regions (Chakrabarti et al., 2007). Various studies have considered these characteristics of the Web pages, and aim to return semantically coherent regions for user queries or segment the Web page to perform a block-based search (Cai et al., 2004b). Some of the works utilize the visual and layout characteristics to segment a given Web page (Cai et al., 2003).
Figure 3.2: Spatial DOM (SDOM) corresponding to the DOM of a Web page (Oro et al., 2010).

Hierarchical Query Interfaces

A query interface that uses the hierarchical structure of the Web page exploits the relationships amongst the various content groups. The hierarchical structures of similarly structured Web documents allow the attributes to be mapped easily and a unified, integrated query interface to be generated. This provides a query interface matching the understanding of the user (Dragut et al., 2009). They

are qualitatively better than the ones generated by sources that have a flat representation. The fields of this interface are organized in a hierarchical manner with appropriate labels. Due to this inherent advantage, we propose to utilize the hierarchical structure of the Web documents for query-interface generation. Such interfaces are more suitable for similarly structured Web document repositories (Dragut et al., 2009).

Query-Interface for the Web Document Collections

Querying data from presentation formats (HTML) for purposes such as information extraction requires the consideration of tree structures and of spatial relationships between the laid-out elements. The rendering of tree structures is frequently connected with updates over the resulting layout structure. The work in (Oro et al., 2010) proposes the SXPath language, based on a combination of a spatial algebra with formal descriptions of XPath navigation. Querying data from such a format requires knowledge of the presentation structure and its conceptual relationships. SXPath extends XPath 1.0 (XPA, 9 17) by spatial navigation primitives of spatial axes, based on topological and rectangular cardinal relations, that allow selecting document nodes with a specific spatial relation w.r.t. the context node (Oro et al., 2010). In addition, the spatial position functions exploit spatial orderings among document nodes and allow selection of nodes that are in a given spatial position w.r.t. the context node. Figure 3.2 represents the spatial DOM, a structure transformed from the DOM of a Web document using the proposed algorithm; it is used for querying the Web documents.

Query Methods based on Web Document Segmentation

Web document segmentation demarcates informative and non-informative content on a Web page. It discriminates between different types of information.
In a multi-word query where terms match across different segments in a page, this information is useful in adjusting the relevance of the page to the query (Cai et al., 2004b). In addition, the user can query the concept-groups individually with an improved query-interface. The earlier work on block-level search (Cai et al., 2004b) segments a Web page into visually semantic blocks, and computes the importance values of the blocks using a block importance model (Song et al., 2004). The block-based Web search-engines use these blocks and their importance values (Cai et al., 2004b). This approach improves the relevance of search results for record-based or data-intensive websites, such as yahoo.com and amazon.com. Another approach, Object-Level Vertical Search, extracts all the Web information about a real-world object or entity and integrates it to generate a pseudo page for the object that answers the user queries. Microsoft Academic Search and Windows Live Product Search use object-level vertical search. The disadvantage of the approach is that if the search terms are scattered across various segments of a Web document with different concepts, it leads to low precision. In the case of queries on medical Web repository documents (for example, MedlinePlus (med, 2012a)), such a search may not be useful.

Queries based on Dynamic Forms

One of the simplest ways to query a database is through a form where a user can fill in relevant information and obtain desired results by submitting the form. Designing good forms is a nontrivial manual task, and the designer needs to understand the underlying data organization and the querying needs of the users. Furthermore, each form should be simple and easy to

understand, while collectively, the interface must support as many queries as possible. The approach proposed in (Jayapandian and Jagadish, 2006) focuses on a tunable clustering algorithm for establishing form structure based on multiple similar queries. The algorithm is adaptive and can incrementally adjust forms to reflect recent querying trends. A form-based query interface, which only requires filling in blanks to specify query parameters, helps make data accessible to users with no knowledge of formal query-languages or the database schema. Generating a good form-based interface requires considering how the interface can answer the diverse queries posed to it with minimal effort on the part of the user. However, the user queries are limited to the fields in the form, and the user cannot reformulate his or her queries beyond these fields.

Integrated Querying of the Web Databases and Documents

Information retrieval (IR) and databases (DB) are considered the two major technologies for managing and searching information. Any application that manages information requires both IR and database systems in order to provide its users with a complete application (Garcia-Alvarado and Ordonez, 2011). The existing interrelated information sources on the Internet can be categorized into structured (database) and semi-structured (documents). The approach proposed in (Garcia-Alvarado and Ordonez, 2011) assumes that the heterogeneous relational databases can be integrated to serve as references for external data. The set of links between database meta-data, content, and documents is managed and queried in a DBMS. This can replace searching the document collection even though it lacks information links to the database.
The user can get insightful answers to complex questions which involve searches such as: database tables that are highly related to arsenic measurement, information in the database related to a specific section of a document, or researchers who queried a particular table. All documents are assumed to have some structure that helps focus on a set of keywords in a particular portion of the document. The approach proposed in (Garcia-Alvarado and Ordonez, 2011) uses high-level meta-data links, which describe a link between a table and a document. Specific links relate sections or subsections in the document to columns in a table schema. The content links represent complex atomic granularity; they are the result of matching the content of structured data records and document keywords. The querying is performed on these smaller indexed tables by using selection predicates p to find matching keywords given in a query, and the ranking is obtained by computing the corresponding weights for every concept.
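The link-table idea can be sketched as follows. The table layout, column names, and sample rows below are illustrative assumptions for exposition, not the schema of the cited system.

```python
# Sketch of the document-database link idea: a meta-data table records links
# between database tables/columns and document sections, and a keyword
# selection predicate retrieves the matching links. All names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE doc_links (
                   db_table TEXT, db_column TEXT,
                   doc_id TEXT, section TEXT, keyword TEXT)""")
con.executemany("INSERT INTO doc_links VALUES (?,?,?,?,?)", [
    ("measurements", "arsenic_ppm", "report42", "3.1 Results", "arsenic"),
    ("patients", "diagnosis", "case7", "2 History", "ulcer"),
])

# Selection predicate: which tables and document sections are linked by 'arsenic'?
rows = con.execute(
    "SELECT db_table, doc_id, section FROM doc_links WHERE keyword = ?",
    ("arsenic",)).fetchall()
print(rows)
```

Queries like "database tables highly related to arsenic measurement" then reduce to selections over this small indexed link table rather than a full-text scan of the document collection.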

Table 3.2: Approaches for querying Web data and databases.

- Existing Keyword Search and Advanced Search (med, 2012a). Suitability to skilled users: few query options, large number of results, no precise results for skilled users. Data model: based on the existing Web search engine's data model. Querying functions: advanced keyword search with AND, OR operations. Interface capability: limited. Ease of use: simple keyword search. Query formulation: keywords, easy.
- DOM and XPath based Query Language (Oro et al., 2010). Suitability to skilled users: users require knowledge of the XPath query-language. Data model: based on the DOM along with the spatial characteristics of contents. Querying functions: same as available for XPath (2 operands). Interface capability: no user interface. Ease of use: complex for both skilled and novice users. Query formulation: based on the syntax of XPath.
- Web page segmentation based block-level search (Cai et al., 2004b). Suitability: good for entity-based searches, where each entity is independent. Data model: based on visual cues and the DOM. Querying functions: simple keyword search within the blocks. Interface capability: limited. Ease of use: simple keyword search style. Query formulation: keyword based, easy.
- Dynamic Forms (Jayapandian and Jagadish, 2006). Suitability: easy to use for the novice users. Data model: dynamic forms generated based on the needs of a particular user-cluster. Querying functions: same as SQL. Interface capability: form-based UI. Ease of use: simple, input form field values. Query formulation: limited to the available form fields, otherwise not query-able.
- Integrated Querying of the Web databases and documents (Garcia-Alvarado and Ordonez, 2011). Suitability: user is required to have knowledge of the SQL query-language. Data model: Web documents and databases linked, meta-data created with the heterogeneous database. Querying functions: same as SQL. Interface capability: no user interface. Ease of use: complex for users with no prior knowledge of SQL. Query formulation: relational queries are to be written.
- Proposed form-based query language interface. Suitability: addresses the query needs of both skilled and novice users, with precise and accurate results. Data model: based on query-language operations; the interface resembles forms. Querying functions: same as SQL, with the exception of aggregate queries. Interface capability: dynamic form-like interface. Ease of use: query attributes represented explicitly, easy for users. Query formulation: any number of operators and operands can be combined.

3.4 Proposed Approach

The proposed approach has two parts: the Web document segmentation, and the development of the query-interface over the segmented database. The following subsections present the two steps of the approach.

Figure 3.3: An example of content structure for the heart attack Web document in the MedlinePlus medical encyclopedia (med, 2012a).

Table 3.2 gives a detailed comparison of the above approaches w.r.t. the various dimensions associated with enhanced querying for the skilled users of the medical domain. It also compares the proposed approach along these dimensions with the other approaches.

Segmentation Framework

The initial step of the approach is to segment the Web documents based on the health-care concepts and the layout of the contents on the Web document. This is an off-line process used to create a concept-enriched XML database. The design of a Web document has the following key features:
1. The segments are non-overlapping and are hierarchically structured.
2. The segment labels and contents can be transformed to a concept-based database.
3. The segments are independent of any underlying source code, Web standard, or Web page generation language.
4. Content groups are distinguishable using visual cues in the Web document.
5. Segmentation granularity can be controlled and used for developing a focused query-interface.

The underlying assumptions for segmentation are: (i) most of the Web documents of a document repository have a similar structure, and (ii) the rendered geometric patterns can derive the inter-content relationship. The segmentation of the Web documents is performed by combining the

HTML DOM (corresponding to a Web document) with the following layout rules:
1. Web documents are organized top to bottom and left to right.
2. Contents within Web documents are organized in conceptual units of information, best suited to the user's understanding.
3. Headings or subheadings of a Web document represent the concept expressed by the content enclosed within them.
4. Considering the entire collection of documents, headings and subheadings of the same group have the same text font, size, and background color.
5. Each subheading has a different text style (size, font) than the heading within which it is enclosed.
6. Considering the entire collection of documents, all headings and subheadings have the same orientation.
7. The components in the Web document (text/anchors/forms/lists/tables or images) are distinguishable.
8. Considering the entire collection of documents, the sub-topic labels vary across the Web documents of a repository.

A syntactic hierarchical tree is generated using these rules and DOM transformation for the entire collection. The proposed algorithm creates a (semantically labeled) concept-based hierarchical structure of the Web document. The approach constructs two structures. One is the domain-specific tree, where the domain refers to a document repository or a group of similar document repositories (in this case, the health-care domain). The set of candidate labels belongs to the domain knowledge. The tree is rendered semantically (based on the distinct concepts represented) and labels from the candidate set are assigned to the nodes of the tree to generate a labeled tree, the tree of concepts. The hierarchical structure is reproduced by integrating the two trees, with subheadings and headings of the Web document as the labels of nodes in the tree.
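A simplified sketch of this rule-driven tree construction is given below. It models only the heading-nesting behaviour (headings open nested levels; text belongs to the innermost open heading) and ignores the font, color, and orientation cues that the full rules use; the input format and names are illustrative.

```python
# Sketch of rule-based hierarchy construction: headings (h1/h2/...) open
# nested levels, and text between headings becomes the content of the
# innermost open heading. A simplification of the layout rules above.

def build_tree(elements):
    """elements: flat list of ('h1'|'h2'|...|'text', value) from a rendered page."""
    root = {"label": "document", "content": "", "children": []}
    stack = [(0, root)]                        # (heading level, node)
    for kind, value in elements:
        if kind.startswith("h"):
            level = int(kind[1])
            while stack[-1][0] >= level:       # close deeper/equal levels
                stack.pop()
            node = {"label": value, "content": "", "children": []}
            stack[-1][1]["children"].append(node)
            stack.append((level, node))
        else:                                  # text belongs to the innermost
            stack[-1][1]["content"] += value   # enclosing heading
    return root

page = [("h1", "Heart Attack"), ("h2", "Symptoms"), ("text", "chest pain"),
        ("h2", "Causes"), ("text", "atherosclerosis")]
tree = build_tree(page)
print([c["label"] for c in tree["children"][0]["children"]])
```

Each node of the resulting tree carries a label (heading) and content, which is the shape needed for assigning concept labels from the candidate set.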
Figure 3.3 shows the hierarchical structure, the tree of concepts, for the MedlinePlus encyclopedia (med, 2012a) topic Heart Attack. The dashed lines indicate that the node has more siblings (not represented due to space constraints). The schema of the entire collection (repository) of the Web documents is mapped to this structure. Each node of the tree is stored as a query-able attribute, and the content description of the node is stored as the value of the attribute. An XML database can best represent such data, as it can match the hierarchical nested structure of the contents and labels of a Web document. Figure 3.4 describes the various characteristics that the VisHue algorithm (Section 2.6) can capture for a given Web document.

Querying Framework

Before describing the underlying algorithm for the query-interface, we describe the concepts that build up the framework for in-depth querying of the on-line medical documents using a high-level query-interface.

Figure 3.4: Characteristics of the Web document captured by the VisHue algorithm (Section 2.6): user behavior view, Web document characteristics, informative blocks, reasonable memory usage, and a query interface yielding query results in fewer iterations.

Data Model

Let D be a set of Web documents d1, d2, d3, ..., dn. Each di contains a set of headings H = {h1, h2, ..., hn} and a set of subheadings S = {s1, s2, s3, ..., sn}. Each heading hi encloses one or more subheadings si. Any consecutive pair of subheadings (si, sj), or a heading and subheading pair (hi, si), encloses some textual content which explains the concept represented by the headings or subheadings discovered using the structural analysis. Any Web page di transformed into a hierarchical structure contains the headings hi and subheadings si as its nodes. This tree further creates an XML database with conceptual tags corresponding to the nodes of the hierarchical tree.

Resource Tree

A complete collection of Web documents is represented as a forest F with n unbounded trees. The number of trees is proportional to the number of Web documents within the resource. For example, the MedlinePlus medical encyclopedia (med, 2012a) is a collection of around 4000 documents. Each tree is termed a segment tree and is optimized by merging the semantically similar trees.

Segment Tree

A segment tree can be defined as an unbounded tree T = (V, E) with a root r, where the root r represents the Web page topic or main heading. The number of child nodes of the root varies for each Web page and is directly proportional to the number of subheadings that occur within it. Each subheading becomes a node in the tree, and the edges between parent and children capture the heading-subheading relationships within a Web page.
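A minimal sketch of this data model, with illustrative class and field names:

```python
# Minimal sketch of the data model: a segment tree per Web document, with the
# topic as root and subheadings as children; a repository is a forest of such
# trees. Class and field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class SegmentNode:
    label: str                      # heading or subheading text
    content: str = ""               # text enclosed under this (sub)heading
    children: list = field(default_factory=list)

    def add(self, child):
        self.children.append(child)
        return child

@dataclass
class Forest:
    trees: list = field(default_factory=list)   # one segment tree per document

doc = SegmentNode("Heart Attack")               # root r = page topic
doc.add(SegmentNode("Symptoms", "chest pain, nausea"))
doc.add(SegmentNode("Causes", "atherosclerosis"))
repo = Forest([doc])

print(len(repo.trees), len(doc.children))
```

The parent-child edges encode the heading-subheading relationships, and each node's (label, content) pair becomes a query-able attribute-value pair in the XML database.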

Query Result

For each user query Q consisting of n query-able attributes, the result is a set of segments seg1, seg2, ..., segn. Each of the segments may belong to the same or different Web documents di. The result segments segi contain the queried keywords combined with the operators (AND, OR, NOT, and multiple combinations of these) as desired by the user.

3.5 The Algorithm

This section describes the algorithm for the segmentation and query-interface generation (Madaan et al., 2011).

Preprocessing the Web documents

As an initial step, the Web documents from an on-line medical document repository are preprocessed to extract the corresponding HTML DOM. This is the first step in processing the documents to create a hierarchical structure. The XML files are created corresponding to the HTML DOM.

Segmentation of the Web documents

As described in Section 2.5, the segmentation process comprises two steps. First, an HTML DOM tree corresponding to the Web document is generated using on-line tools. Then, the rules mentioned above are applied to derive the tree of domain. This tree is based on the similarity in structure of the Web documents belonging to a particular document repository. Table 3.3 gives the steps of the Web page segmentation process of the VisHue algorithm.
Table 3.3: VisHue Web page segmentation algorithm

Algorithm: VisHue
Inputs: Preprocessed Web documents WD of the document repository DR; set of geometric rules GR
Output: Tree of concepts
Steps:
- Create the DOM tree corresponding to each WDi
- Render each WDi using the pre-defined set of rules GR
- Generate the hierarchical structure, the tree of domain
- Create the label-set using the domain knowledge and the concepts represented by the document repository
- Traverse the tree of domain and, at each node, assign a label from the label-set
- Create the tree of concepts

Creation of a Concept-based Database

The health-concepts based XML database is created using the hierarchical structure generated above. The headings and sub-heading nodes are transformed into attributes. The content

enclosed within them is stored as the value for these attributes. Figure 3.3 represents a sample content structure for the MedlinePlus medical encyclopedia heart attack Web document (med, 2012a). In the description of this heading, a subheading symptoms is encountered, below which a leaf node represents the content describing the symptoms of heart attack. This Web document, when converted to a concept-database, has tags such as <heart attack></heart attack>, <description></description>, and <symptoms>contents</symptoms>.

High-level Query-Interface

Table 3.4: Algorithm for Query-by-Segment

Algorithm: CREATE QBT QUERY LANGUAGE
Inputs: Concept-rich DB; set of areas for search AR in Web documents WD
Output: Query-by-Segment Tag (QBT) query-interface
Steps:
- Present the query-able attributes (segment tags)
- Present the areas within which the user may limit the scope of search
- Allow the user to enter the values for the selected query-able attributes
- With the entered values and the query-able attributes, formulate the corresponding XQuery
- Execute the query to retrieve answers for the specific attributes

Query-by-Segment Tag

Query-by-Segment Tag (QBT) is a high-level query-interface for the on-line medical document repositories. The query-interface uses a concept-based database. The database creation utilizes the content (hierarchical) structure generated by the VisHue algorithm. The interface is a high-level dynamic form-based interface where the query fields can be automatically changed and customized by the end-user. The conceptual labels of the database represent the query-able attributes on the user-interface. These are part of the dynamic forms generated as per the user selections. Moreover, the interface allows the users to confine their search to specific regions within a Web page. Suppose he or she wishes to perform a query with, say, Nausea as a symptom.
The standard search available on MedlinePlus (med, 2012a) displays search results where nausea occurs in the segment side effect of medication besides the segment symptoms. On the other hand, in the proposed query-interface, the search for nausea can be limited to only the symptoms segment. QBT allows incremental query creation: the user can create and modify queries incrementally. Figure 3.5 gives the screens of the query-interface. Once the user selects the concept to query, he or she can choose the search scope within a Web document repository. In the second step, he or she can input the keywords for the concepts selected in the first step. Figure 3.5 shows the selected user attributes causes and title. The user has the provision to delete an attribute, and he or she can append multiple predicates (OR and AND operations). The user clicks search once his or her query is formulated. Using the query-interface, the users can perform queries such as, "Cases where atherosclerosis causes heart attack." At any point during query formulation, the user can modify the query as per his or her requirement. The query-interface is composed of three steps, namely: concept name selection; entering attribute values to query; and modifying the query (add/delete/change attributes). The initial

step displays a virtual table divided in two parts. One describes the concept names from the different segments within the document repository, and the other defines the scope of search within which the user wishes to query. The user may select a single attribute or multiple attributes to query. As soon as the user selects an attribute, the other possible co-occurring segments (concepts) are highlighted in the table. This gives the user knowledge of the co-occurring attributes in a Web document without requiring prior understanding of the database structure. The user can further select the scope of search from the second half of the table. Figure 3.5 represents the screenshot of our prototype system built over the MedlinePlus medical encyclopedia document repository (med, 2012a). In Figure 3.5, he or she selects the title and causes concepts and the default scope of search (complete document). Next, he or she is presented with the query formulation form. Here, the user enters the values for the selected attributes. The fields have a hint-based auto-completion facility, which prompts the possible keywords that a user may enter as the attribute value to query. This reduces the possible errors by the novice users. The QBT query-interface maps the user inputs and query parameters to an XQuery. This query is then executed on the underlying database, and the results are presented to the user, with a link to the particular Web document.

Figure 3.5: Screenshot of the Query-by-Segment (QBT) query-interface over the MedlinePlus medical encyclopedia (med, 2012a).
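The mapping from selected attribute-value pairs to an XQuery can be sketched as below. The exact FLWOR shape generated by the prototype is not given in the text, so this rendering, along with the collection and element names, is an assumption.

```python
# Sketch of the QBT mapping step: selected (concept, value) pairs are
# translated into an XQuery FLWOR expression over the concept database.
# The query shape, collection name, and element names are illustrative.

def build_xquery(collection, predicates):
    """predicates: list of (concept, value) pairs, implicitly AND-ed."""
    conds = " and ".join(
        f'contains($d/{concept}, "{value}")' for concept, value in predicates)
    return (f'for $d in collection("{collection}")/topic\n'
            f'where {conds}\n'
            f'return $d/title')

# "Cases where atherosclerosis causes heart attack"
q = build_xquery("encyclopedia", [("causes", "atherosclerosis"),
                                  ("title", "heart attack")])
print(q)
```

Because each selected concept becomes a predicate on its own element, the match is confined to that segment, which is what distinguishes the generated query from a whole-document keyword search.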

3.6 Experiments

Here, the performance of the prototype QBT query-interface is evaluated and compared with the traditional keyword search available for the MedlinePlus medical encyclopedia (med, 2012a). The results of the qualitative and quantitative analysis of the performance of the query-interface exhibit its efficiency.

Implementation: The prototype query-interface is developed for the MedlinePlus medical encyclopedia (med, 2012a). It is a document repository used by millions of health-care professionals and other users due to the quality and authenticity of its content. The system is implemented on Windows 7, 64-bit OS. An Apache Tomcat 6.0 server is used as the platform to run the application. The client- and server-side languages/tools are:
1. Server-side tools: PHP, jQuery, DB2 database
2. Client-side tools: HTML, AJAX, JavaScript

A set of 16 queries was considered for evaluation, based on the most common diagnostic references for which the various health-experts query the medical encyclopedia. Four of these queries are listed here.

Performance of the QBT Query-Interface

Enabling a high-level query language interface such as the QBT query-interface allows the users to receive direct answers to their queries, and allows them to perform complex queries. The users receive useful results in the form of only the relevant documents in the search/query context defined by the user. Table 3.5 gives the key features of comparison between the existing keyword search based tools and the proposed QBT query-interface.
Table 3.5: Feature comparison of the QBT query-interface and the existing keyword search.

- Direct answers — QBT: segment-based (precise) answers. Keyword search: a set of articles with an occurrence of the keyword(s).
- Query capability — QBT: querying operations such as sum, average, and multiple AND, OR, and NOT combinations are possible. Keyword search: limited; keyword search only.
- Retrieval units — QBT: text snippets along with article URLs. Keyword search: article URLs.
- Complex queries — QBT: supported. Keyword search: not available.
- Usefulness — QBT: an easy-to-use interface with focused and relevant results for accurate processing; the labels are self-explanatory, and the scope of the query can be defined for a given combination. Keyword search: a large number of results are presented, which need to be sorted by the user; simple and advanced options for entering the keyword(s) combinations.

Table 3.6 compares the query formulation and interpretation by the existing keyword search and the QBT query-interface over the MedlinePlus encyclopedia (med, 2012a). It shows that the QBT query-interface has the capability to support complex queries where a user can find articles with the occurrence of one keyword and the negation of another keyword within the same document. The

attributes of the database, such as symptoms, causes, and treatment, define the context within which a user can search for the desired medical conditions.

Table 3.6: Comparison of the QBT query-interface and the existing keyword search w.r.t. search query formulation.

- Q1: Cases where the patient has hypertension but not high blood pressure. QBT query: Symptom = Hypertension, Symptom = NOT High blood pressure. Keyword search: has no provision for negating one of the keywords.
- Q2: Cases where the patient must stop certain activities before a test (and can resume normal activities after it). QBT query: Before Procedure = Stop, After Procedure = Normal. Keyword search: keyword search with the "stop" and "normal" keywords.
- Q3: Heart attack caused by high blood pressure. QBT query: Cause = High blood pressure, Symptom = heart attack. Keyword search: search for the keywords "heart attack" and "high blood pressure".
- Q4: Poisoning caused by eating fish. QBT query: Food Source = Fish, Side Effect = Poisoning. Keyword search: search all articles for the keywords "fish" and "poisoning".

Figure 3.6: Auto-complete prompts in the QBT interface.

Interface Support for User Input

The QBT interface provides certain features that allow the user to input his or her queries accurately. This helps in avoiding fuzzy inputs for the queries to be executed on the transformed user-level schema.
1. On selection of an attribute for query, say causes, the interface highlights all the other attributes that co-occur with the causes attribute. This helps the user define his or her condition/query in an easier way.

2. The user can select the scope of the search: whether the entered term should be searched within the topics of the document repository or within the supplementary content of the documents.

3. When the user enters the values for the selected attributes, the interface provides prompts to auto-complete the user input. Figure 3.6 shows the prompt for all the possible terms starting with head. This avoids spelling mistakes that may occur for novice or expert users.

Usability Studies with Actual End-users

A small-scale usability study was conducted to test the interface with undergraduate and graduate students at the laboratory. These 20 students are well versed with query and search methods and with using computers and the Internet, but are not involved in the project or any other study related to the QBT interface. Each student was given a questionnaire comprising three sets of questions. In the first part, students were asked about their age, education, and demographical location (Table A.1). The second set of questions concerned their use of on-line medical information (Table A.2). In the third part of the study, the users were expected to use the interface for the sample set of queries and answer questions related to the relevancy of the obtained results, the time taken, and the number of clicks required to formulate and execute each query (Table A.4 and Table A.5). A set of 10 queries was given to the students for the study (Table A.3). The students were given a 15-minute introduction to the QBT interface and the sample set of queries.

Table 3.7: Summary of Information of the Participants

S.No. | Features | Observation
1 | Number of Participants | 20
2 | Demographical Location | University Laboratory
3 | Educational Qualifications | Undergraduate - 15, Graduate - 5
4 | Avg. age group |

Table 3.7 summarizes the information about the characteristics of the subjects of the usability study.
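The auto-complete prompt described in point 3 above can be sketched as a simple prefix match over an indexed term list. This is a minimal illustration only; the term list here is hypothetical, whereas a real deployment would draw the candidates from the indexed segment values of the encyclopedia database.

```python
# Minimal sketch of prefix-based auto-complete over a sorted term list.
# The terms below are hypothetical stand-ins for indexed encyclopedia values.
from bisect import bisect_left

def autocomplete(terms, prefix, limit=10):
    """Return up to `limit` indexed terms that start with `prefix`."""
    terms = sorted(t.lower() for t in terms)
    prefix = prefix.lower()
    i = bisect_left(terms, prefix)  # first term >= prefix in sorted order
    out = []
    while i < len(terms) and terms[i].startswith(prefix) and len(out) < limit:
        out.append(terms[i])
        i += 1
    return out

terms = ["Headache", "Head injury", "Hearing loss", "Heart attack", "Fever"]
print(autocomplete(terms, "head"))  # ['head injury', 'headache']
```

Because the candidate list is kept sorted, the binary search locates the first match in O(log n), after which only the matching prefix run is scanned.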
Most of the participants were undergraduates. However, because graduate students of varying age groups were also considered, the average age group of the subjects lies in a wider range. The students belong to the database and software engineering laboratories of the university. None of them had previously used the QBT interface. Figure 3.7 describes the user responses for the use of on-line medical information. On average, most users rarely or sometimes access on-line medical resources such as a medical encyclopedia. Based on the user responses, this can be attributed to the fact that novice users are largely unaware of the existence of authenticated medical resources such as medical encyclopedias (A.D.A.M.S, MedlinePlus). Also, most of them stated that they receive too many results for their queries when using general-purpose search engines to search medical information. Hence, it can be concluded that there is a need for an easy-to-use query tool that can support the skilled as well as the novice users of the medical domain. After using the QBT interface, the novice or general users were largely satisfied and preferred to use the interface over search engines such as Google and Yahoo, although most of them requested some more modifications to make the user interface more usable.

Figure 3.7: User-response about use of on-line medical information.

The figure depicts the user responses for questions related to this aspect of the study. The users agreed that the training session helped them understand the interface features and learn the process of query formulation using the interface. Figure 3.9 shows the user responses w.r.t. these two aspects. The users also stated that the queries in the sample set were often relevant to them; 12 of the 20 users found most queries relevant. During the analysis of query responses, the users were asked to write down the average time taken, the number of clicks used, and the number of results obtained. They were also asked to mark whether the obtained results were relevant to their information needs. Most of the users could not correctly formulate the queries Q5, Q7, and Q8, and hence did not receive any results for them. All these queries were indirect in stating the causes and symptoms to be searched. The users successfully formulated the queries Q1, Q2, Q4, and Q9 in less than a minute on the interface. However, for queries Q3 and Q10 they took around 3 minutes to formulate the query. On average, 6-10 clicks were used to formulate the queries. For each of the queries Q1, Q2, Q3, Q6, Q9, and Q10, four to ten results were obtained; the exception was query Q4, which may be because Q4 describes a generic situation of finding cases where fever is caused by a virus. The user study explored different aspects of the use of on-line medical information, of training users to use new query interfaces, and of the users' views on using the interface for a sample set of queries. The results show that if the novice and expert users of the medical domain are supported by easy-to-use query languages, these users can use on-line medical information

more frequently. The process of health-care delivery can be made less prone to errors caused by complex query needs and insufficient interfaces. Increasing the length of the training session and explaining more examples can help the users better understand the QBT interface. Overall, such an interface can be accepted and used by both medical experts and novice users for their everyday querying tasks on on-line medical document repositories.

Figure 3.8: User-response about use of on-line medical information.

3.7 Discussions

In this study, we combine visual heuristics and the domain-knowledge concepts of medical web document repositories to segment the documents and generate a hierarchical structure. The VisHue algorithm segments the documents semantically and generates a nested hierarchical structure corresponding to the web documents. From this structure, an XML database based on health-care concepts is created; on this database, a high-level, incremental, query-by-segment-tag query-interface for enhanced querying is enabled. The qualitative analysis of the QBT query-interface demonstrates its capability to capture the user intent for precise query results.

3.8 Summary and Conclusions

The study emphasizes the need for better query methods for on-line medical document repositories, considering the various users and the various types of medical document repositories. It compares the existing approaches for querying data on the Web and in databases, and proposes the VisHue algorithm for Web document segmentation. The segmented hierarchical structures corresponding to the on-line medical documents are stored as a concept-based database. A high-level query-interface, QBT (Query-by-segment tag), over this database is proposed. The comparisons of the performance of the Query-by-segment tag query-interface over the concept-based database exhibit the strengths of the query-interface for (the novice and) the expert users of the health-care domain.
The query-interface enables the users to discover precise responses to their queries. It is an important approach to enhance the search and

query capabilities of the users seeking on-line health information. Subsequently, it is intended to strengthen the features and query operations of the QBT (Query-by-segment tag) query-interface.

Figure 3.9: User-response about the usefulness of the training session.
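The query-by-segment-tag idea summarized in this chapter can be illustrated with a small sketch that evaluates a QBT-style predicate, including a negated term, over a toy concept-based XML database. The XML layout, element names, and sample articles below are hypothetical, not the actual schema generated by VisHue:

```python
# Hedged sketch: evaluating a QBT-style predicate such as
#   Causes = "High blood pressure" AND Symptoms = NOT "chest pain"
# over an illustrative segment-tagged XML database.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<encyclopedia>
  <article title="Heart attack">
    <segment label="Causes">High blood pressure, smoking</segment>
    <segment label="Symptoms">Chest pain, sweating</segment>
  </article>
  <article title="Stroke">
    <segment label="Causes">High blood pressure</segment>
    <segment label="Symptoms">Sudden numbness</segment>
  </article>
</encyclopedia>""")

def segment_text(article, label):
    """Text of the segment with the given label, lower-cased; '' if absent."""
    node = article.find(f"segment[@label='{label}']")
    return (node.text or "").lower() if node is not None else ""

def qbt_query(root, must=(), must_not=()):
    """Titles of articles whose named segments contain (or omit) the given terms."""
    hits = []
    for art in root.findall("article"):
        ok = all(term.lower() in segment_text(art, lbl) for lbl, term in must)
        bad = any(term.lower() in segment_text(art, lbl) for lbl, term in must_not)
        if ok and not bad:
            hits.append(art.get("title"))
    return hits

print(qbt_query(doc, must=[("Causes", "high blood pressure")],
                must_not=[("Symptoms", "chest pain")]))  # ['Stroke']
```

The key point of the sketch is that each term is matched only inside the segment named in the predicate, which is what distinguishes a segment-tag query from a whole-document keyword match.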

Chapter 4

Semantic Granularity Model for Domain Specific Queries

Domain-Specific Document Repositories

The World Wide Web is rapidly becoming a vast repository of information. The paper-based specialized documents in domains such as health-care are now available in the form of Web document repositories. These repositories provide health information. The availability of distributed repositories of quality medical information on the Internet has placed access methods at the center of research (Chen et al., 2003). Currently, information on the Web is discovered primarily using browsers and search engines. Existing search engines, such as Yahoo and Google, service millions of searches a day using Web documents as units. This poses the problem of information overload. Such retrieval is prevalent across all disciplines. The Web archives and the specialized document repositories occupy a small portion of the Web. These do not have specialized (human or technical) querying resources available; users of these resources need to depend on the commercial search engines (McCown and Nelson, 2009). Moreover, the similarity- and popularity-based models alone are not effective for domain-specific documents, especially for domain experts (Yan et al., 2011). The terms information granulation and structural granularity have already been defined (Yan et al., 2011). Therefore, in this work, we focus on semantic granulation, where a user can query concepts and sub-concepts of a term to find answers to his queries from the Web documents. Because of the sheer volume of documents archived on the Web and the growing number of domain-specific document repositories, it is difficult to manually label general or specific documents (Yan et al., 2011). Hence, there is a need for semantic-granularity-based query tools for domain-specific documents. The Web-based medical repositories are in the form of well-authenticated governmental and commercial resources, which include encyclopedias and dictionaries.
The ability to accurately search for, access, and process information is particularly pressing in the field of medicine (Chen et al., 2003). Medical professionals and researchers need information from Web sources during health-care delivery (Jenkins et al., 2003). For the top-down approach of keyword-based searching, the domain experts need to specify search conditions and identify content related to their queries, whereas database querying is based on the logic of predicate

1 Research Publication(s) - (1) Aastha Madaan and Subhash Bhalla. Adoption of a Semantic Granularity based Model for Domain-specific Web Document Queries. [Under preparation] and (2) Aastha Madaan, Wanming Chu, Handling Domain Specific Document Repositories for Application of Query Languages, March 2014, 9th International Workshop on Databases in Networked Information Systems, Lecture Notes in Computer Science (LNCS), Vol. 8381, pp 1-16 (approx.), Springer Berlin Heidelberg, ISSN: (Print) (Online). [To appear]

evaluation with a precisely defined answer set for a given query. In the domain of database querying, the high-level graphical query languages are XQBE (Braga et al., 2005) and QBE (Zloof, 1977). By using these query languages for the Web document repositories, we can simplify the querying tasks of the skilled users. These interfaces reduce the need to understand complex schema details (and the details of the query syntax). Furthermore, the schema headings are obtainable from the expert's knowledge base (from Web documents). Considering the on-line document repositories in the health-care domain, we propose intelligent querying methods. A Web page segmentation technique based on layout analysis matched with domain knowledge has been proposed. A database considering the user's view (from the Web repository) has been created. Such terminology-enriched databases allow the users to query the Web document repositories. These enable structured queries using easy-to-use graphical query languages (as high-level query languages) (Yan et al., 2011). The problem of semantic granularization is highly simplified by the existence of the same vocabulary within the user's domain knowledge and within the semantic structure of the Web documents.

Keyword Search vs. Semantic Granular Query Model

Health information (on-line) is in demand; nearly 80 percent of American Internet users seek health information on-line (Jones, 2012). There is a need to match the information needs with relevant content to improve the information-acquisition task of the clinicians (Jones, 2012). Keyword-only search systems consider the underlying resources as bags of words and do not exploit the rich structural information that is available in Web document repositories (especially in the health-care domain).
For example, during an analysis of the queries submitted to MedlinePlus, the actual questions that led the users to submit the queries could not be examined (med, 2012b). The users submit only one or more keywords; because of this, the wide range of information sought cannot be discerned (Scott-Wright et al., 2006). We present to the user the segment labels (granular level) of a document that can be queried by him or her. All these terms of the specialized Web documents in any domain are semantically related to the user's knowledge base as an expert. Additionally, these appear on the repository pages as headings that the user, being an expert, utilizes and is familiar with. For example, a physician examining a patient for pneumonia after observing the patient's history with tuberculosis may have the query, Find the cases where the patient has pneumonia and tuberculosis symptoms together. Using the existing keyword search, a list of ranked documents is returned to the user which contain one or both of the keywords, irrespective of the context they are searched in. Figure 4.1 displays the results for the above query from the MedlinePlus medical encyclopedia. As shown in the figure, the health-care expert needs to explore each of the Web pages to determine the exact results that he or she may desire.

Shortcomings of Existing Search Engines w.r.t. Medical Information

Domain experts search differently than people with little or no domain knowledge. They differ in their search strategies and often search for information more effectively (White et al., 2009). Domain expertise differs from search expertise: it refers to knowledge of the subject or topic of the information need rather than knowledge of the search process (White et al., 2009). This difference is evident in specialized domains such as medicine, law, finance, and computer science.
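The contrast drawn above between bag-of-words keyword matching and segment-scoped querying can be made concrete with a small sketch. The documents below are hypothetical stand-ins for encyclopedia articles, not actual MedlinePlus content:

```python
# Illustrative contrast: bag-of-words keyword search vs. a segment-scoped
# query for "pneumonia AND tuberculosis occurring as symptoms".
docs = [
    {"title": "Pleural effusion",
     "segments": {"Causes": "pneumonia, tuberculosis, cancer",
                  "Symptoms": "chest pain, cough"}},
    {"title": "Miliary tuberculosis",
     "segments": {"Causes": "Mycobacterium tuberculosis",
                  "Symptoms": "fever, cough, pneumonia-like infiltrates"}},
]

def keyword_search(docs, *terms):
    """Match any occurrence of all terms, anywhere in the document."""
    return [d["title"] for d in docs
            if all(t in " ".join(d["segments"].values()).lower() for t in terms)]

def segment_query(docs, label, *terms):
    """Match only when all terms occur inside the named segment."""
    return [d["title"] for d in docs
            if all(t in d["segments"].get(label, "").lower() for t in terms)]

print(keyword_search(docs, "pneumonia", "tuberculosis"))   # both titles match
print(segment_query(docs, "Symptoms", "pneumonia", "tuberculosis"))  # []
```

The bag-of-words search returns both documents because each mentions both terms somewhere, while the segment-scoped query correctly reports that neither document lists both conditions as symptoms, which is the context the physician actually asked about.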
In such specialized domains, apart from general (novice) users, a large number of professional groups search and query on-line information resources. The search engines often do not consider the user's expertise when displaying the results of a query, nor do they consider the user's level of capability to understand the results or the technical vocabulary. The general-purpose search engines perform textual search w.r.t. the keywords input by the users, irrespective of

the context of the search (Luo, 2009). Moreover, concerns such as the quality of the information provided on a website are not considered by the search engines when displaying the results.

Figure 4.1: Query results (total 141) using the existing advanced keyword search on the MedlinePlus medical encyclopedia (med, 2012b) with the keywords Tuberculosis and Pneumonia.

The existing on-line medical document repositories, such as MedlinePlus, provide an advanced keyword-search interface for the users to search health information (Figure 4.1). The interface comprises a search box where the user can enter the keywords he wishes to search. The keywords can be appended with operators such as OR, NOT, -, + and *. The users also have the option to refine the search results based on the type of resource they wish to access; for example, the keywords may occur within the Health Topics, Drugs and Supplements, or the Encyclopedia. The users can also filter the results based on the set of indexed keywords listed by the search. The larger the page, the harder information localization becomes while browsing the page. The existing search interface does not provide any means for the user to perform database queries that can return results from the particular segments of relevance. Unlike general searches over the Web, medical domain experts have in-depth knowledge of the domain terminologies they wish to query. They know the exact terms and contents they are looking for within a document repository.

4.2 Problem Statement

The proposal presents a multi-step framework that comprises a query part and a semantic granularization part, respectively. These include: (1) the design of a semantic-granularity Web

page segmentation technique (by matching, at the 1-1 level, the domain knowledge with the structural granules of the Web documents), (2) transforming the Web documents into a terminology-enriched XML database, (3) enabling a high-level graphical query language for in-depth (structured) queries, and (4) qualitative and quantitative studies that show the efficiency of points (1) and (3). Our research work is aimed at allowing in-depth querying. We focus on the document repositories in the health-care domain, as this domain has critical needs for improved querying. The terms medical and health-care are used interchangeably. The underlying challenges are to capture the Web documents as a medical document database and to determine the semantics represented by these documents (by mapping the Web document headings to the structural elements of the database). Therefore, we transform the content labels in the Web documents into attributes in a database. This facilitates the adoption of query tools to support general-purpose queries over common (well-known) Web document labels.

4.3 Background and Motivation

The existing research on domain-specific Web documents focuses on improving the search methods rather than the query methods. There have been efforts in the area to improve existing keyword-search interfaces (Oyama et al., 2004). There have also been related efforts in the area of domain-specific information-retrieval (IR) techniques that provide granular search (Yan et al., 2011). To the best of our knowledge, improved query methods over Web document repositories are still urgently required. This work is the first approach in the direction of using a database query language on Web document repositories in order to retrieve precise (database-like) query results in minimal time for the domain experts.
In this section, we discuss the web document repositories in the health-care/medical domain and the studies related to the various components of our proposed framework.

Information Quality of Retrieval from the Web-based Medical Document Repositories

The users expect the right content and high credibility when accessing on-line medical document repositories (Jones, 2012). The content becomes useful for the end-users by means of data analytics and querying. Thus, facilitating easy and in-depth query methods can allow a user to extract the content that suffices their needs. A health-care Web-document repository such as an encyclopedia contains documents that represent a large number of common medical conditions, symptoms, diagnostic tests, anatomical and physiological terms, treatments, procedures, prevention topics, and other medical terms (Laurent and Vickers, 2009). There are document collections that are governed by the state and are authenticated to contain and offer credible content to the users. MedlinePlus is a free-access Web document collection maintained by the U.S. National Library of Medicine (NLM) (nlm, 2011). Over 150 million people from around the world use MedlinePlus each year (Zou et al., 2006). Web documents of the most commonly cited medical Web resources, such as the encyclopedias (hie, 2011), (ada, 2011), (geo, 2011), (med, 2012b), are similarly structured and static in nature (Figure 4.2). The existing keyword search and menu-based (interface) searches on MedlinePlus and other document collections often give redundant results that are of little use to the medical domain experts. There is another NLM-owned Web repository for information on clinical trials (McCray and Tse, 2003). It provides consumer health information for patients, their families, doctors, and health-care providers. It brings together information

from the United States National Library of Medicine, the National Institutes of Health (NIH), other U.S. government agencies, and other health organizations. Health on the Net (Europe), MedlinePlus and Mayo Clinic (North America), and Better Health and HealthInsite (Australia) are other commonly used, credible Web-based health document repositories, identified by the work of (Fisher et al., 2012). The HON Survey (2006) identified that 79% of health consumers prefer a government agency to be responsible for on-line health information provision (Fisher et al., 2012).

Figure 4.2: Various common components (header, free text, related content, hyperlinks, main content) of a Web document in the MedlinePlus Medical Encyclopedia document repository.

Nature of Medical Documents

The documents on the Web are well structured for human readability and comprehension. A program can reliably extract the categorized component information. The medical Web resources often comprise the electronic form of former paper-based document collections. These are referred to by clinicians for in-patient diagnosis and by general users for preliminary symptom recognition. As in the case of a text document, headings are employed to organize the content of the documents. Headings are located at the top of sections and subsections. They preview and succinctly summarize the upcoming content and show subordination. This leads to a hierarchical structure for a document, which plays a role in understanding the relationships between its contents. The same structure becomes useful for segmenting a Web document. Hence, the logical (semantic) layout of the Web document, along with the domain knowledge and the structural tags for content organization, forms a logical hierarchy with the domain-specific terms. This can be transformed into an XML data model that can be further utilized for

database-like queries on the Web document collection. The contents of these resources do not change (much) over a period of time. The schema and these resources have evolved from common practice terms over tens of years. Typically, a Web document represents a single theme or topic (Figure 4.2). Elements of the medical terms (disease, condition, process) are semantically grouped in a Web document. These elements are arranged for efficient browsing of a repository. The header and footer on any Web document give the general Web site information and other meta-data. The free text comprises paragraphs and lists. Figure 4.2 shows a sample Web document with the various components given above, which form a complete Web document. The main content of the (similarly organized topics on) the Web documents can be logically divided into several information segments. These can be based on common Web document labels, such as causes, symptoms, home care, alternatives, and references for medical encyclopedia documents (Figure 4.3 and Figure 4.9(c)).

Figure 4.3: Hierarchy of semantically coherent segments (topic of the document, subtopics, content topics, and miscellaneous/related content) of a Web document from a health-care document repository (med, 2012b).

A Web document can be visualized in the form of an ordered tree where the leaves correspond to text elements and the internal nodes correspond to the labels of the semantic groups within the Web document. The root represents the label of the Web document. Thus, for subsequent queries, the ordered tree follows a pre-order traversal. The root and the internal nodes of the tree expose the labels (or attributes) which a user may query. Figure 4.3 represents the organization of contents on a Web document from a health-care document repository (med, 2012b).
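The ordered-tree view and the pre-order traversal described above can be sketched as follows. The node labels are illustrative stand-ins for a real encyclopedia topic, not content from the actual repository:

```python
# Sketch of the ordered-tree view of an encyclopedia document.
# Internal nodes carry query-able labels; leaves carry the text values.
class Node:
    def __init__(self, label, text="", children=None):
        self.label, self.text = label, text
        self.children = children or []

def preorder(node, depth=0, out=None):
    """Return (depth, label) pairs in pre-order: root first, then each subtree."""
    out = [] if out is None else out
    out.append((depth, node.label))
    for child in node.children:
        preorder(child, depth + 1, out)
    return out

doc = Node("Asthma", children=[
    Node("Causes", text="allergens, cold air"),
    Node("Symptoms", text="wheezing, cough"),
    Node("Treatment", text="inhaled corticosteroids"),
])
print(preorder(doc))
# [(0, 'Asthma'), (1, 'Causes'), (1, 'Symptoms'), (1, 'Treatment')]
```

The traversal exposes exactly the root and internal-node labels that a user may query, in document order, which is the order a query processor would visit them.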
The colored nodes represent the subtopics and related content topics, or, say, the headings and subheadings. These nodes can be transformed into query-able attributes, and the content they enclose contains the values queried in the predicates.

The Notion of Web Document Segmentation

The positions of the HTML tags containing the main content differ across web sites. Finding the main content in such an unstructured set of Web documents needs consideration and special algorithms. The Document Object Model (DOM) provides each Web document a fine-grained structure, illustrating the content and the presentation of the document. The HTML code and the corresponding DOM tree are often very similar to the semantic structure of the document perceived by the user, in the case of medical document archives (Zelen, 2011). Thus, the Web

document needs to be constructed and then mapped to the corresponding HTML DOM of the Web document. A segment is usually defined as non-overlapping content on a Web document which has a label and encloses some content. So far, the Web document segmentation approach has been utilized in applications such as Web document dynamics, information retrieval, phishing detection, and the generation of result snippets. Among these, the VIPS (Vision-based Page Segmentation) algorithm (Cai et al., 2003) is the most popular. In the VIPS algorithm, a tree structure is used to model the document. Each node corresponds to a visual block in the document and has a Degree of Coherence (DoC) associated with it. The shortcoming of this approach is that it creates a hierarchy of blocks which are segmented visually; hence, the tree structure represents the visual semantics rather than the semantics of the contents. Some algorithms consider the content or link information besides the tag tree. Others use heuristic rules to discover record boundaries within a document (Liu et al., 2010). Some approaches make use of image analysis of the Web documents (Cao et al., 2010) or perform graph-based segmentation (Chakrabarti et al., 2008). In (Ramaswamy et al., 2004) and (El-Shayeb et al., 2009), visual cues are used along with DOM analysis. These try to identify the logical relationships within the Web content based on the visual layout information. These approaches rely mostly on the DOM structure. Such algorithms are time-consuming and inflexible. However, a collection of contents can form semantically coherent groups within a Web document. These allow the users to query them individually. The health-care documents and the other specialized-domain documents are similarly structured.
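The core intuition behind these segmentation approaches, that headings delimit semantically coherent segments, can be illustrated with a deliberately minimal sketch. This is not the VisHue algorithm (or VIPS); it only shows how heading tags in a DOM stream naturally partition the content into labelled segments. The sample HTML is hypothetical:

```python
# Minimal heading-based segmenter: collect the text following each <h2>
# heading as one labelled segment. Real segmenters (VIPS, VisHue) use far
# richer layout and domain cues; this only illustrates the basic idea.
from html.parser import HTMLParser

class HeadingSegmenter(HTMLParser):
    """Group text under each <h2> heading into a labelled segment."""
    def __init__(self):
        super().__init__()
        self.segments, self.current = {}, None
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_heading:
            self.current = text          # heading text becomes the segment label
            self.segments[text] = []
        elif self.current:
            self.segments[self.current].append(text)

p = HeadingSegmenter()
p.feed("<h2>Causes</h2><p>High blood pressure.</p>"
       "<h2>Symptoms</h2><p>Chest pain.</p>")
print(p.segments)
# {'Causes': ['High blood pressure.'], 'Symptoms': ['Chest pain.']}
```

The resulting label-to-content mapping is exactly the shape needed to populate the segment-tagged database that the later query interface operates on.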
A clear understanding of the layout features of these documents can allow querying on them as segments of the database schema.

History of Information through Granular Search and Database Query

Consumers often encounter barriers in information seeking; there are obstacles to effective retrieval, some of which are a result of keyword search. These include the submission of ill-formed query strings, mismatches in terminology, and terms that are too broad, too narrow, or out of touch. Such queries lead to incomplete retrieval and irrelevant or no results (McCray and Tse, 2003). Current medical search engines, such as the OmniMedicaSearch medical Web (hmf, 2011), HealthLine (hea, 2011), Medical World Search (mws, 2011), and PogoFrog (pog, 2011), for physicians, health-care professionals, and novice users, are unsupervised and untrained. The users of such systems have to decide the relevance of the results in their own terms from all the returned results. There have been several efforts to improve the search methodology on various document repositories, but few efforts have been made to provide interfaces for the querying needs of the domain experts. For example, initially, the MedlinePlus repository used ht://dig (Song et al., 2004), a free-ware utility, which was configured to retrieve documents. This enabled the users to start searching through an index so that they could see the variety of information and links related to a health topic. Subsequently, an A-Z list of links at the top of the home document encouraged users to choose a topic by letter. ht://dig displayed a list of links regardless of content source, such as encyclopedia or drug information. In the medical encyclopedia section, some users expected to search only the medical encyclopedia documents, but the search link remained largely invisible to users. In 2000, MedlinePlus introduced a search box on every document.
The requirement for a new search engine allowed NLM to revisit the display of search results as well. Users liked the display, but many of them were often not aware of its function.

Figure 4.4: Snapshot of Microsoft Academic Search, which employs object-level vertical search.

Similarly, in the earlier work in the area of block-level searches, the blocks, along with their importance values, have been used to build customized Web search engines (Cai et al., 2004b). This approach improves the relevance of search results for record-based or data-intensive Web sites such as yahoo.com and amazon.com. In the case of medical repositories and domain-specific Web documents, such searches may not be useful, as the context (topic/sub-topic) is equally important for the user searching the keywords. In Object-Level Vertical Search (Nie et al., 2009), all the Web information about a real-world object or entity is extracted and integrated to generate a pseudo-document for the object, which is indexed to answer user queries. This search technology has been used in Microsoft Academic Search and Windows Live Product Search. Another type of search, Entity Relationship Search (Nie et al., 2009), deploys an entity-relationship search engine in the China search market called Renlifang. Figure 4.4 presents a screenshot of Microsoft Academic Search, which uses the concept of object-level search. The disadvantage of the approach is that if the search terms are scattered across various segments with different semantics, it leads to low precision. None of the mentioned studies captures the query requirements of the specialized Web documents. In the light of these existing issues with the searches, we define In-depth Querying as enabling the user to specify the contents he (or she) wishes to query. For instance, in the case of traditional keyword search, a disease name in the user's search criteria will return all the possible results with any occurrence of the disease name within the Web documents. However, the proposed framework enables the user to perform DB-style queries on an originally static

source of information. For example, suppose he or she wishes to query cases where the helicobacter pylori bacterium causes a peptic ulcer. The query is formulated in terms of the occurrence of peptic ulcer within the segment symptom within the medical resource (say, the encyclopedia). The consideration is given at a more semantic and granular level. Its aim is to provide him (or her) with a specific set of results (say, where the match occurs as a topic or where it occurs within a sub-topic). In deciding which disease a patient has, both the presence and the absence of certain symptoms provide clues (e.g., whether sputum is accompanied by coughing). For example, Web documents describing the diseases that do not have the symptom Sy can either mention without Sy explicitly in many different ways or not mention Sy at all. For a Web document P describing a disease D, the presence or absence of the keywords for Sy on P cannot be directly used as the criterion for deciding whether D has Sy. Otherwise, it is difficult for a medical information seeker to obtain useful search results ((purely) through traditional keyword search). Moreover, considering only one of the attributes (symptoms) is not enough; the physician or medical expert may require multiple considerations using other semantic segments (other than the symptoms).

Profile of Skilled and Semi-Skilled Users

One of the most common tasks on the Web today is searching for health-care information (White et al., 2008), (Laurent and Vickers, 2009). Several studies on domain expertise have highlighted the differences between experts and novices, including vocabulary and search expression (White et al., 2008). The amount of knowledge the domain experts have about a domain is an important determinant of their search behavior. Medical domain experts have complex query requirements over the data and have a well-evolved meta-language.
They often use well-formed query expressions to seek the answer set, expressing several terms using medical terminology. Medical experts carry out depth-first searches, following deep trails of information, and evaluate the information based on the most sophisticated and varied criteria. Novice users, in contrast, concentrate on breadth-first searches and evaluate through overview knowledge (Jenkins et al., 2003). The first document of the general search-engine results is significantly more likely to be accessed by (inexperienced) health information seekers, with an exponential decline thereafter (Laurent and Vickers, 2009).

Medical Users vs. Other Users

Medical users differ from general Web users, database users, IR users and other similar users. Web users are generally satisfied by browsing the Internet for their information needs; they have neither tight time constraints nor in-depth queries. Database users, on the other hand, have complex queries and detailed schema knowledge, and are served by complex query languages such as SQL and XQuery. The information needs of IR users are satisfied by general-purpose search engines such as Google, Yahoo, and Bing. They do not mind browsing multiple Web pages for the information they need; moreover, they do not have the schema knowledge to formulate precise queries. In contrast to these users, health-care experts such as medical researchers and practitioners are well versed with the medical terminologies and the processes (schema) of the medical information resources. These users have precise and complex queries, and they expect detailed and specific answers for their queries. Hence, they need easy-to-use query tools and languages for their information needs. To highlight the case, consider medical personnel accessing the MedlinePlus encyclopedia. A nurse understands, if systolic is high, (i) that systolic is a part of blood pressure, and (ii) whether it appears as a symptom or as a cause.
He (or she) can query, for example, by using the meta-language, with an option such as symptom = "systolic". The work of White et al. (White et al., 2008)

discusses that expert users have queries such as "find the categories of people who should or should not get a flu shot". For such a query, the normal path followed by an end-user (in this case a health expert) is to first explore a government Web site such as MedlinePlus and then access the related resources. This Web site is his first choice because of his domain-specific search knowledge in health-care and his expertise.

Figure 4.5: An example of the common search methodology followed by a health expert (flu-shot query), adapted from (White et al., 2008).

Figure 4.5 represents the query methodology pursued by the health experts (H1, H2, ..., H5) and the Web sites they choose to visit in order to find the solution for the above query (described in (White et al., 2008)). Therefore, there is a need to facilitate querying by providing query languages over the on-line health-care repositories at a fine granularity. For the skilled and semi-skilled users the following assumptions can be stated:

1. The user is well-versed with the data (user view of the database) and the associated meta-language. This is especially true for users of organized document repositories;

2. The users have highly focused query needs, and therefore expect precise and complete answers; and

3. Their vocabulary is the same as that of the repository.

High-level Database Query Languages

Database querying is based on the logic of predicate evaluation, with a precisely defined answer set for a given query; in an information retrieval approach, on the other hand, ranked results are accepted (DBL, 2009). Database query languages such as SQL for relational databases and XQuery for XML databases have been developed for skilled database users. These require complete knowledge of the query-language syntax for query formulation.
In contrast, high-level graphical query languages such as QBE (Query-by-Example) for relational databases and XQBE (XQuery By Example) for XML databases target both unskilled and skilled users. These facilitate query construction, and users are able to form their queries quickly (Braga et al., 2005). In this work we use the XQBE query language for in-depth querying of the Web documents. XQBE allows deep nesting of the XQuery FLWOR expressions (xqu, 2011). These efficiently meet the query needs of the domain experts because the schema is well known.
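As a concrete sketch of the kind of DB-style predicate such languages support, the following Python fragment evaluates, over a toy terminology-enriched XML database, the query an XQBE diagram would express in XQuery roughly as `for $a in //Article where contains($a/Symptoms, "headache") return $a/Title`. The XML fragment and tag names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Toy fragment of a terminology-enriched XML database (hypothetical data).
xml_data = """
<Database>
  <Article>
    <Title>Fever</Title>
    <Symptoms>high temperature, chills</Symptoms>
  </Article>
  <Article>
    <Title>Hypertension</Title>
    <Symptoms>headache, dizziness</Symptoms>
  </Article>
</Database>
"""

root = ET.fromstring(xml_data)

# Titles of articles whose Symptoms segment mentions "headache".
titles = [article.findtext("Title")
          for article in root.iter("Article")
          if "headache" in (article.findtext("Symptoms") or "")]
# -> ["Hypertension"]
```

Because the predicate is evaluated against a named segment rather than the whole document text, the answer set is precisely defined, in the database sense discussed above.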

Figure 4.6: Web document segmentation framework for database creation, and the traditional search methods employed by the domain experts.

4.4 Proposed Framework

There is an absence of well-formed query languages for Web documents in domains such as life sciences, biomedicine and health-care. The proposed technique has two steps: (1) the semantic granularization step, which is an off-line step, and (2) the in-depth query step, which is an on-line step. The main focus of the proposed framework is to segment the Web document using the underlying structure and pre-existing domain knowledge for successful application of a query language. Web document segmentation expands a user's querying and searching horizons by enabling him (or her) to query within the headings and subheadings of the document. The off-line step segments the Web document into a segment hierarchy. The segments are visually and structurally distinguishable and semantically coherent in nature. Further, the hierarchical structure is represented using an XML database, which preserves the hierarchical representation of the relationships between the various segments. This database contains attributes which are the nodes of the hierarchical structure and represent key entities of the particular domain.
Thus, the Web documents are transformed into a terminology-rich database, and the on-line step enables the application of database tools and high-level database query languages (such as XQBE or QBE) on the data repository.

4.4.1 Structure of Semantically Coherent Segments in the Web Documents

The proposed approach aims at understanding a Web document and its semantic structure (as per the user's view), considering that there is a 1-1 match with the Web document vocabulary. Figure 4.6 displays the outline of the proposed framework. It represents the components of the Web document segmentation approach and how the skilled and semi-skilled users have traditionally accessed the medical domain knowledge stored in a medical encyclopedia (and other Web document repositories) directly. The right side of the figure represents a Web document (in this case, a medical encyclopedia document) belonging to a document repository. The domain knowledge of the experts is retained; it aids them in query formulation with the exact medical vocabulary.

The center part of the diagram describes the Web document segmentation framework that maps the HTML documents into a terminology-enriched database that can be queried. Here, we identify a set of 8 layout (design) rules for the recognition of segments on a given Web document, similar to the set of rules defined by the VIPS algorithm (Cai et al., 2003).

Rule 1: Web documents are mostly organized top-down and left-right.
Rule 2: Contents within Web documents are organized in semantic units of information, best suited to the user's understanding.
Rule 3: Headings or subheadings within the Web document denote the semantics of the content enclosed within them.
Rule 4: Headings and subheadings of the same group have the same text font, size and background color.
Rule 5: Each subheading has a different text style (size, font) from the heading within which it is enclosed.
Rule 6: All headings and subheadings have the same orientation.
Rule 7: The components in the Web document may be text/anchors/forms/lists/tables or images, which are recognizable units.
Rule 8: Not all the sub-topic labels are present in every Web document of a repository.

Hence, the segments from the user's point of view can be extracted using these rules, and can further be mapped to the segments extracted using the HTML tags.

4.4.2 The Terminology-enriched Schema

The terminology-enriched schema in this study refers to the forest of trees where each tree is a hierarchical structure corresponding to a Web document from the Web document repository. We assume that the Web documents share segment (node) labels in their hierarchical structures and that the collection contains unique nodes. To represent this forest, the XML database is used. Each of the attributes in the XML carries a terminology-based tag which is easy to comprehend and self-explanatory. Each tag is formed from the label l_i of the segment s_i defined above.
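The tag-formation step just described, turning each segment label l_i into a terminology-based XML tag, can be sketched as below. The helper name and the underscore convention for multi-word labels are assumptions modelled on the tag names visible in the database snippets later in the chapter.

```python
import xml.etree.ElementTree as ET

def to_xml(title, segments):
    """Serialize a labelled segment tree into terminology-based XML tags;
    spaces in segment labels become underscores (e.g. Exams_and_Tests)."""
    article = ET.Element("Article")
    ET.SubElement(article, "Title").text = title
    for label, content in segments.items():
        ET.SubElement(article, label.replace(" ", "_")).text = content
    return article

article = to_xml("Aarskog syndrome",
                 {"Causes": "genetic disorder",
                  "Exams and Tests": "genetic testing"})
xml_str = ET.tostring(article, encoding="unicode")
```

Each tag is self-explanatory to a domain expert, which is what makes the resulting schema queryable without separate documentation.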
In the following sections the details of these terminologies are given.

Table 4.1: Query classes supported by the proposed framework.

1. Inter-topic: queries with two or more attributes which are topic labels (main headings) in different documents.
2. Inter-subtopic: queries with two or more attributes which are subtopic labels in different documents.
3. Intra-subtopic: queries with two or more attributes which are subtopic labels in the same document.
4. Topic and subtopic: queries with two or more attributes, of which at least one is a topic label and at least one is a subtopic label.
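The four classes in Table 4.1 can be told apart purely from the documents and label levels a query touches. A hedged sketch, assuming a query is represented as a list of (document, label, level) attributes, which is an illustrative encoding rather than the framework's internal one:

```python
def classify_query(attrs):
    """Classify a query per Table 4.1. `attrs` is a list of
    (document, label, level) tuples, level being 'topic' or 'subtopic'."""
    levels = {level for _, _, level in attrs}
    docs = {doc for doc, _, _ in attrs}
    if levels == {"topic"}:
        return "inter-topic"
    if levels == {"subtopic"}:
        return "intra-subtopic" if len(docs) == 1 else "inter-subtopic"
    return "topic and subtopic"   # mixed topic and subtopic labels

q = [("headache", "Causes", "subtopic"), ("fever", "Symptoms", "subtopic")]
# classify_query(q) -> "inter-subtopic"
```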

4.4.3 Query Classes

HTML pages are frequently constructed on the fly in response to a query generated at the user interface. Moreover, HTML documents have a fixed level of granularity, while database queries can group or divide data to an arbitrary level of granularity (Abiteboul et al., 2000). Information granulation refers to the computational processes of generating and presenting levels of abstraction to facilitate problem solving (Yao, 2005), (Zadeh, 1996). The existing information granulation mechanisms do not effectively support Web document searching and querying, as they fail to accurately estimate the semantic details carried by such documents (Yan et al., 2011). An IR system can automatically estimate the granularity requirements of a query using the same approach as for document granularity computation (Yan et al., 2011). The granularity requirements of an information seeker can also be determined manually: an information seeker can explicitly specify the granularity by labeling a query as general or specific (w.r.t. the documents required), or may use a set of predefined words, such as review, introduction, in-depth and specialized, to specify granularity preferences. In this approach, if all the terms of a document are semantically related, the document is considered specific to a particular topic. In the medical domain, the name of a specific medicine or virus is often related to the name of a specific disease; hence the two are represented together under various topics and sub-topics within a Web document repository, say an encyclopedia. To achieve information granulation through domain-specific queries in this study, we propose to address various types of queries through the proposed framework. Here, a topic label refers to the main heading of a document within the document repository, and a sub-topic label refers to a sub-topic heading within each document of the document repository.
A document may contain multiple sub-topic labels. In the medical domain, each document mainly describes a single topic. Table 4.1 describes the various categories of the queries considered in this study.

Figure 4.7: Steps of the Web document segmentation model (off-line process).

Offline Process

In this section, we present the structural and semantic analysis performed for the construction of the terminology-based schema using the knowledge of pre-established medical processes. This allows users to follow a time-tested work-flow cycle for data-quality enhancement procedures. Using the rules stated in Section 3.5.1, the following assumptions are made about the Web documents in the specialized domains: (i) the HTML parser can successfully extract the headings and subheadings (tags) from the Web documents, (ii) the main focus of the end-users on the Web documents is the main content rather than the related content comprising the meta-data and images, and (iii) the content groups (segments) are assumed to be non-overlapping. The outcome of such a process is a schema tree. Figure 4.7 depicts the transformation. Three main transformations are performed in order to generate the final hierarchical structure. First, the input HTML is transformed into a tree of headings by extraction of the HTML heading tags. Next, the layout rules defined above are applied to extract a tree termed the tree of contents. The resultant hierarchical schema tree can directly be mapped to the database. The first step is performed at the pre-processing stage. Algorithm 1 gives the outline of the proposed segmentation algorithm, highlighting the two data structures that are constructed: (i) the structural tree and (ii) the semantic tree.

Algorithm 1: Schema Tree
  Input: TH: tree of headings; CS: candidate label set
  Output: terminology-enriched schema
  CreateTreeOfContents(TH)
  CreateTreeOfSemantics(TC, CS)

First, the method CreateTreeOfContents() is invoked; it associates the nodes of the tree of headings with the contents enclosed within each of the topics and sub-topics. Second, the CreateTreeOfSemantics() method is invoked.
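The two calls of Algorithm 1 can be sketched as plain functions. The input shapes (a flat list of headings, a heading-to-content map, and a candidate label set) are simplifying assumptions for illustration; the actual algorithms work over trees.

```python
def create_tree_of_contents(headings, contents):
    """Content association, flattened: attach to each heading
    the content that follows it in the rendered document."""
    return [{"heading": h, "content": contents.get(h, "")} for h in headings]

def create_tree_of_semantics(tree_of_contents, candidate_labels):
    """Semantic labelling, flattened: keep nodes whose heading
    matches a domain terminology label from the candidate set."""
    return {node["heading"]: node["content"] for node in tree_of_contents
            if node["heading"] in candidate_labels}

headings = ["Causes", "Symptoms", "See also"]
contents = {"Causes": "H. pylori infection", "Symptoms": "abdominal pain"}
candidate_labels = {"Causes", "Symptoms", "Treatment"}

schema = create_tree_of_semantics(
    create_tree_of_contents(headings, contents), candidate_labels)
# {"Causes": "H. pylori infection", "Symptoms": "abdominal pain"}
```

Note how the non-terminological heading ("See also") is dropped, leaving only attributes a domain expert would query against.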
CreateTreeOfSemantics() semantically labels the tree of contents with labels representative of the domain knowledge and of the predicates or attributes the user can query against.

The Structural Analysis

The tree of headings and subheadings generated at the preprocessing stage provides the place-holders for the semantic labels. The content-group association is performed by associating the text under a heading or subheading with the corresponding node of the tree of headings. This is important because the rendered Web document contents may differ from how the tags are organized in the DOM of the Web document. Layout rules 1-2 (above) are used to identify the content organization on the Web document, whereas rules 3-5 help in heading recognition. We define the term structural curve as a curve drawn across the document segregating the content groups. As shown in Figure 4.8, any change observed with respect to rules 3-5 is marked as a structure point (SP). On any Web document, the content groups are contained between these structure points. The algorithm captures any change in the font, text size and color that distinguishes a heading from the rest of the text. The formal algorithm is given below (Algorithm 2).

Figure 4.8: Structure points (tree-of-contents algorithm) on a MedlinePlus document.

The Semantic Analysis

The domain knowledge of any specialized domain is enriched by resources such as dictionaries, thesauri and the concerned databases. These resources form the domain terminologies, which in turn formulate the set of candidate labels for the headings (topics) and subheadings (sub-topics) of a Web document. These form the meta-language components and attributes for the queries. For example, in the health-care or medical domain the candidate label set for the root level is represented as CL_root = {medical terminologies}. Here, a medical terminology may refer to a disease name. Candidate labels for the intermediate nodes may refer to terminologies describing sub-concepts associated with a disease. For instance, a disease (fever) is a medical term which may have several sub-concepts describing it, such as CL_fever = {symptoms, causes, exams and tests, treatment}. The tree is obtained by a preorder traversal of the tree of contents, mapping each node to the corresponding label from the label set sequentially. The formal algorithm is given below (Algorithm 3). The procedure determine_level determines whether a node is the root node or one of the intermediate nodes of the tree of contents. Figure 4.9 summarizes the three transformed data structures in the form of block hierarchies: the first hierarchy corresponds to the tree of headings, the second to the tree of contents, and the third represents the semantic (terminology-enriched) schema tree, which is directly transformed into XML. Figure 4.10 gives an XML database snippet of a MedlinePlus article on Aarskog syndrome generated by the above algorithm. As depicted in the figure, each XML tag corresponds to a terminology relevant to the health-care domain. A similar transformation is performed on each of the articles.
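The structure-point idea of the structural analysis can be sketched as follows: any block whose rendered style differs from the running body style is marked as an SP, and the text between consecutive SPs is assigned to the SP above it. The block tuples and style attributes below are hypothetical stand-ins for rendered DOM nodes.

```python
# (text, style) pairs standing in for rendered DOM blocks (sample data).
blocks = [
    ("Causes", {"font": "Arial", "size": 18}),
    ("Most ulcers are caused by H. pylori.", {"font": "Arial", "size": 12}),
    ("Symptoms", {"font": "Arial", "size": 18}),
    ("Abdominal pain and nausea.", {"font": "Arial", "size": 12}),
]
body_style = {"font": "Arial", "size": 12}

def find_structure_points(blocks, body_style):
    """Indices of blocks whose style deviates from the body style
    (a change in text size, font or color; cf. rules 3-5)."""
    return [i for i, (_, style) in enumerate(blocks) if style != body_style]

def group_segments(blocks, sps):
    """Assign the content between two consecutive SPs to the SP above it."""
    segments = {}
    for k, i in enumerate(sps):
        end = sps[k + 1] if k + 1 < len(sps) else len(blocks)
        segments[blocks[i][0]] = " ".join(text for text, _ in blocks[i + 1:end])
    return segments

sps = find_structure_points(blocks, body_style)   # [0, 2]
segments = group_segments(blocks, sps)
```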

Algorithm 2: Create Tree of Contents
  Input: TH: tree of headings; LR: the 8 layout rules
  Output: tree of contents
  Perform a pre-order traversal of TH
  Draw a virtual curve in a top-down, left-right manner over the rendered Web document
  for each node N_i in TH do
    Apply rules 3-5
    Mark any change in layout (text size, font or color) as an SP
    Assign the content between two consecutive SPs to the SP above
    Map this node to the corresponding node of TH
  end

Algorithm 3: Constructing Tree of Semantics
  Input: TC: tree of contents; CL: candidate label sets per node
  Output: tree of semantics
  Perform a pre-order traversal of TC
  for each node n_i in TC do
    determine_level(n_i)
    Determine the label set corresponding to the level
    AssignLabel(l)
  end

4.5 Tree-structured Database Model

Online Process

The XQuery query language for XML databases is the counterpart of the SQL query language for relational databases. These are powerful languages which allow database users to query the underlying databases in an in-depth manner. They require the users to know the schema, so that they can precisely specify both the location of the entities and attributes they are searching for and the relationships among those entities and attributes. However, they are often difficult to understand and learn for users without a programming background. Hence, the XQBE query language follows a visual query paradigm (Braga et al., 2005) and is directly mappable to XQuery. The user is presented with a GUI (graphical user interface) where the source part describes the XML data to be matched against the set of input documents, while the construct part specifies which parts will be retained in the result, together with (optional) newly generated XML items. On this basis, the XQBE query-language tool is applied to the transformed terminology-enriched database.
In addition, our proposal presents the XML tags based on the Web document headings. Each of the considered Web documents is a two-dimensional array of segments and their contents (Figure 4.12). The similarly structured documents together form the third dimension of the database, corresponding to the Web document repository; Figure 4.13 describes this three-dimensional cube. The results of a user query are found across all documents of the Web document repository containing the queried attributes (segment headings and sub-headings). The proposed approach aims to discover the conceptual inter- and intra-topic and segment labels for the user to query across the documents of the Web document repository. It makes use of

the conceptual hierarchical structure.

Figure 4.9: (a) Tree of headings, similar to a DOM tree (adapted from (htm, 2011)); (b) tree of contents, similar to VIPS (adapted from (Cai et al., 2003)); (c) the semantic schema tree corresponding to an example medical encyclopedia Web document.

Mathematically, the Web document repository can be defined as

WD_Rep = ∪_{i=1..n} WD_i

where n represents the total number of documents in the Web document repository. Each of these Web documents is a two-dimensional array of segments and their contents, which can be represented mathematically as

WD = ∪_{j=1..m} seg_j

where m represents the total number of segments per document of the repository; the number of segments per document is variable. Each segment of the Web document is a pair of the label and the contents enclosed by the segment, as explained earlier. Mathematically, it can be represented as

seg_j = (l, c)_j

where each element (l, c)_j belongs to the set (L, C), the cumulative set of all the distinct labels in the document repository and the contents they enclose. The document repository is then transformed into a database using the different data structures defined in Section 3.6. Mathematically, given a Web document W, the tree of contents (TC) is defined as

TC: (hs, c)_i, i = 1, ..., (number of segments of the Web document W)

where hs_i is a node of the DOM tree (a place-holder for the label l_i) representing a heading or subheading in the Web document, and c_i represents the contents under the i-th label l_i. The result of this step is a tree, with each node of the tree acting as a place-holder for a concept-based label and each leaf node associated with the content segments.
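The set definitions above translate directly into plain data structures: a repository maps document names to sequences of (label, content) segments, and the cumulative label set L is the union of the labels across all documents. The sample data is illustrative.

```python
# WD_Rep: the repository as {document: [(label, content), ...]} (sample data).
wd_rep = {
    "headache": [("Causes", "stress, dehydration"), ("Treatment", "rest")],
    "fever": [("Causes", "infection"), ("Symptoms", "high temperature")],
}

# L: the cumulative set of distinct segment labels in the repository.
labels = {label for segments in wd_rep.values() for label, _ in segments}
# -> {"Causes", "Treatment", "Symptoms"}
```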
The tree of semantics (TS) can be defined as follows: given the tree of contents (TC) and the set of candidate labels (terminologies)

CL from the domain,

TS = (n_i, L)

where n_i is a node in TC (Section 3.6), and L: TC → l maps each node n_i to a label l, where l ∈ CL, based on the medical terminologies.

Figure 4.10: A snippet of a document (from the MedlinePlus repository (med, 2012b)) in the semantic XML database.

Improved Structured Querying

Each of the Web documents can be represented as a hierarchy of segments (blocks). The labels of the nodes of these trees represent domain-specific terms and concepts. The complex task of in-depth querying is thus reduced to the simpler task of applying the XQuery query language (or the high-level XQBE query language (Braga et al., 2005)) over the specialized medical document repository. The schema tree may be visualized as a data graph (Abiteboul et al., 2000), where the edges represent the attributes and the nodes represent the data, and the notion of a path expression may be associated with it. Any query can be mapped to a sub-graph or sub-tree of this data graph. One of the main distinguishing features of a semi-structured query language is its ability to reach arbitrary depths in the data graph. The semi-structured data can be viewed as an edge-labeled graph with a function

F_E: E → L

where L represents the candidate label set, which forms the domain of edge labels. Figure 4.11 represents the data graph corresponding to a medical encyclopedia document repository, where the database is the root of the graph and its children are the documents (headache, fever, hypertension). The child nodes show the segments within these documents. A user query follows a path from root to leaf, i.e. it forms a sub-tree of the schema tree. The complete path

from the root to the node provides the user with the in-depth information as per the query. Each of these nodes is represented by an XPath expression in the query.

Figure 4.11: Representation of a query (Q1) as a sub-tree of the schema tree (adapted from (Abiteboul et al., 2000)).

According to the proposed approach, the query submitted by a user can be viewed as a sub-tree in the forest generated for the document repository. The result of any user query is the sub-tree from the queried node along the entire path to the leaf nodes beneath it. If the result of the user query is contained in multiple sub-trees under the root node, then multiple sub-trees are presented to the user.

4.6 Experiments

The aim of the experimental evaluation is two-fold. First, the quality and efficiency of the Web document segmentation approach (the off-line component) is evaluated. The quality of segmentation is directly related to the quality of query results: non-overlapping and distinct segments can be mapped directly to the attributes of the database. Second, the efficiency of the application of the XQBE query language for querying the sample dataset is evaluated. The latter evaluation is performed in two parts. The first part evaluates the query operations feasible for the sample dataset; for this, quantitative experiments are performed to evaluate the search space and the number of query steps required.
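One simple way to quantify the search-space difference is to count the segments touched: a flat keyword scan visits every segment of every document, while a segment-scoped query visits only the segments carrying the queried label. The toy repository and the counting scheme below are assumptions for illustration, not the measurements reported in this chapter.

```python
# Toy repository: {document: {segment label: content}} (hypothetical data).
repo = {
    "headache": {"Causes": "stress", "Treatment": "rest", "Outlook": "good"},
    "fever": {"Causes": "infection", "Symptoms": "chills"},
}

# A flat keyword scan must inspect every segment of every document.
full_scan = sum(len(segments) for segments in repo.values())  # 5 segments

# A query scoped to the Causes attribute inspects one segment per
# document that carries that label.
scoped_scan = sum(1 for segments in repo.values() if "Causes" in segments)  # 2
```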
The qualitative evaluation considers the query intent captured by the XQBE language (DB-style queries) as compared with existing domain-specific tools.

Experimental Settings

Platform

We evaluated the performance of the Web document segmentation approach and the application of the XQBE query language on the transformed database (MXQBE) on a Windows 7 PC with 12 GB RAM. The XML database has been created using the DB2 database (db2, 2011). The XQBE tool v2.0.0 (xqb, 0 30) has been installed with jre1.6.0 for testing the applicability of the XQBE

query language. The queries have also been tested using the XQuery query language with Altova XMLSpy 2012 (free trial version) (xml, 2011).

Figure 4.12: Representation of a Web document as a two-dimensional array of segments and contents.

Figure 4.13: Representation of the Web document repository as a three-dimensional database.

Dataset

The well-known Web-based health document repository, the MedlinePlus medical encyclopedia (med, 2012b), has been used for the evaluation.

Document Preprocessing

MedlinePlus (med, 2012b) received a total volume of 12,234,737 English queries over a period of 315 days (Scott-Wright et al., 2006). All 3886 Web documents from the MedlinePlus medical encyclopedia have been obtained for evaluation (as of March, 2009); currently, the repository comprises about 4000 pages. For the purpose of the Web document segmentation evaluation, 50 randomly sampled Web documents of this repository have been considered. The complete document corpus of 3886 documents has been converted to the XML database form for the evaluation

of the second stage of enhanced querying.

Table 4.2: (a) Counts for the original document corpus; (b) the XML database (transformed dataset).

(a) Original document corpus
  Number of documents: 3886
(b) XML database (transformed dataset)
  Number of root elements: 3886
  Number of distinct child nodes: 413

These documents cover various categories, such as medical conditions, treatments and laboratory tests. The links of the original documents connect back to the Web site to provide the results of user queries. These documents are processed in the form of XML elements. As shown in Table 4.2 (b), the document corpus is converted to a database which contains 413 distinct segment labels and 446 distinct combinations of segment labels within the articles. Further, Table 4.3 lists the top-30 segment labels in terms of frequency of occurrence in the XML database. For the evaluation of querying efficiency, a set of 20 queries (Table 4.6) was formulated from the perspective of medical domain experts after a survey of the on-line literature. The queries are formed considering common medical and diagnostic procedures, and each query is categorized as inter-topic, inter-subtopic, intra-subtopic, or topic-and-subtopic (Section 3.5.2).

Web-document Segmentation Quality

Here, the quality of Web document segmentation in the health-care domain is evaluated. The quality of the semantic objects created from the Web documents is evaluated, and a comparison with the existing VIPS algorithm (Cai et al., 2003) is performed. The second step evaluates the enhanced querying capability: the applicability of the XQuery query language, and then of the XQBE query language, on the terminology-enriched database is evaluated. The query results obtained by the proposed framework are compared with those obtained by the existing tools (advanced keyword search on the MedlinePlus repository).
For accurate extraction of semantically coherent and logical segments, the user objects (the user's view of the objects on the Web document), the layout-based objects (the objects extracted from the rendered Web document), and the objects stored in the database should be identical. The accuracy of the segmentation process is directly proportional to the accuracy of the determined user objects. The aim of this study is to enable a high-level query language on a database corresponding to a Web document repository. The experiments for determining the quality of Web document segmentation were performed in a laboratory setup with a total of 5 graduate and undergraduate students (subjects). These students are not connected with the proposed work or any related study and do not have a medical background. Background about the MedlinePlus medical encyclopedia was given to them as a tutorial, displaying the on-line Web documents and explaining the usage and needs of clinicians and practitioners using existing literature (Jenkins et al., 2003), (Currie et al., 2003). For the segmentation quality evaluation, these subjects were asked to rate the segments of the Web documents generated by our approach relative to their perception-based segments for the corresponding documents (on the on-line MedlinePlus repository). The ratings

are categorized as perfect, satisfactory, fair and bad. The subjects were also asked to rate the segments based on the coherency of the content related to the topic or subtopic label (heading), and on whether the label clearly summarizes the content description. Table 4.4 shows the results of the evaluation: each cell represents the number of documents with a particular rating, and the last column gives the average number of documents for each rating.

Table 4.3: Top-30 most frequently occurring segments (attributes) in the transformed database (segment label, frequency).

  References 2972 | Risks 760 | Poison control 285
  Alternative Names 2966 | Considerations 614 | Poisonous Ingredient 278
  Outlook 1897 | How the test is performed 607 | What to expect at your office visit 278
  Causes 1857 | Why is the test performed 603 | Where found 277
  Symptoms 1843 | What abnormal results mean 593 | Support groups 233
  When to contact a medical professional 1752 | How to prepare for the test 591 | Description 185
  Treatment 1487 | How the test will feel — | Why the procedure is performed —
  Exams and Tests 1470 | Normal results 556 | After the procedure 176
  Possible Complications 1393 | Home care 489 | Information 173
  Prevention 1235 | Before calling Emergency 285 | When to call the doctor 163

Table 4.4: Evaluation of segmentation quality on the basis of the users' ratings.

  Rating       | User 1 | User 2 | User 3 | User 4 | User 5 | Avg. no. of documents
  Perfect      |   —    |   —    |   —    |   —    |   —    |  39.8
  Satisfactory |   —    |   —    |   —    |   —    |   —    |   9.5
  Fair         |   —    |   —    |   —    |   —    |   —    |   —
  Bad          |   —    |   —    |   —    |   —    |   —    |   —

The results show that each user on average rated 39.8 of the 50 documents (79.6%) as perfectly segmented, while another 19% of the documents were segmented satisfactorily. The results therefore indicate a good quality of segmentation from the user's perspective of the segments. From these results, it can be concluded that the off-line component segments the Web documents with high accuracy with respect to the user's view, along with the

structural and semantic meta-data of the document. The proposed segmentation approach is compared with the VIPS algorithm (Cai et al., 2003). The demonstration version of the VIPS tool (Cai Deng and Wei-Ying, 2003) was downloaded and executed with PDoC (pre-defined degree of coherence) values of 8, 9, and 10 for extraction of segments on the sample dataset. The results are ranked as bad, no change, and satisfactory. A result is ranked as bad if it is far from the visibly distinguished segments, and as satisfactory if each of the obtained segments reflects a single semantic; otherwise it is marked as no change. The process of adjusting the PDoC value for the VIPS algorithm is essentially trial and error and thus time consuming: the value can range from 0 to 10, and it is difficult to determine the best value at once. Hence the values are adjusted randomly until well-defined segments are obtained. Table 4.5 gives the results obtained.

Table 4.5: Evaluation of segmentation quality on the basis of user ratings.

PDoC Value   VIPS Algorithm   Proposed Algorithm
8            Bad              Satisfactory
9            Satisfactory     Satisfactory

A PDoC value of 8 gives clearly (visually) distinguishable segments, but these are not coherent in nature: the label of one segment is associated with the content of another. In the subsequent iterations, the PDoC value is increased to 9 and coherent segments are obtained. On further increasing the coherency value, no change in the segments is observed. In comparison to the VIPS algorithm, which considers only the visual separators on the Web page to determine the segments, the proposed approach depends on the logical arrangement of segments on a document. It considers the structural, semantic, and layout properties of the segments on the Web document, and it does not depend on a pre-defined coherency ratio to achieve satisfactory results.

Enhanced Query Capability

In this section, the experimental evaluation of the XQBE query interface over the MedlinePlus medical encyclopedia is given.
For these experiments, a total of 20 queries (Table 4.6) are considered. All the experiments were conducted in the laboratory on the system specified in the previous section. An experimental comparison is performed between the proposed methodologies and the existing advanced keyword search on the MedlinePlus repository. The majority of the query operations required by the end-users (clinicians and medical practitioners) are tested.

XQBE system architecture

Figure 4.14 represents the system architecture of the XQBE query language. The schema information corresponding to the generated user-level schema is given to the system. The system itself determines the sub-elements that a user may add to an attribute (node) up to level 1; sub-elements can then be added recursively for querying. The client side of the query interface provides a syntax checker in addition to the editor, for accurate query formulation. A query can be translated into the corresponding XQuery on the client side itself. The server side of the interface is implemented as a Web service and executes the query after translating it into XQuery. The XQBE tree or graph (for a query) is translated into an intermediate XML representation before conversion to the corresponding XQuery.
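The schema-driven expansion described above (sub-elements offered one level at a time, then recursively) can be sketched as follows; the schema fragment and helper names are illustrative assumptions, not the actual system code.

```python
# Sketch of the interface's schema-driven expansion: given a user-level
# schema, list the sub-elements a user may attach to a node one level at a
# time; deeper levels are expanded on demand (recursively).
# The schema below is a hypothetical fragment, not the actual system schema.
SCHEMA = {
    "Article": ["Title", "Causes", "Symptoms", "Treatment"],
    "Title": ["name"],
    "Causes": [],
    "Symptoms": [],
    "Treatment": ["Home care"],
}

def expandable(node, schema=SCHEMA):
    """Sub-elements that may be added under `node` (level 1 only)."""
    return schema.get(node, [])

def expand_all(node, schema=SCHEMA):
    """Recursively expand every level below `node`."""
    return {child: expand_all(child, schema) for child in expandable(node, schema)}

level1 = expandable("Article")   # what the editor offers for the root node
full = expand_all("Article")     # the fully expanded query tree
```

Only the first level is fetched eagerly; the recursive form is what the editor effectively builds as the user keeps expanding nodes.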

Figure 4.14: System architecture of the XQBE query language for the MedlinePlus medical Encyclopedia (adapted from (Braga et al., 2005)). The XQBE client (visual editor, syntax checker, result displayer) exchanges SOAP requests and responses with the XQBE server, which translates XQBEML into XQuery and executes it through an XQuery engine driver against the local or server-side MedlinePlus medical encyclopedia data, guided by the schema specification.

System Configuration: XQBE client and JRE version 1.5 (or above).

S.No. | Query | Type of Query
1 | Find cases where fever is caused by affliction of pneumonia and tuberculosis. | Topic-segment
2 | Find cases where patient has hypertension but has NO high blood pressure. | Intra-segment
3 | Find cases where patient has abdominal pain because of gastritis. | Inter-segment
4 | Find if lung needle biopsy can help to diagnose pneumonia. | Inter-segment
5 | Find other symptoms where Chronic Kidney Failure is caused by Anemia. | Inter-segment
6 | Find examinations and test results which help in prevention of Renal Failure by treatment of Hepatitis A or Hepatitis B or Hepatitis C. | Topic-segment

7 | Find cases where Fever is caused by virus. | Inter-segment
8 | Find if Oxygen therapy works for the treatment of Chronic Respiratory Failure and symptoms are Lethargy OR Shortness of breath. | Topic-multiple segments
9 | Find the advisory for the ability to do daily tasks after a Stroke. | Topic-segment
10 | Find the chances of occurrence of Acoustic Neuroma due to Mobile phone usage. | Inter-segment
11 | Find cases where a patient has Anaphylaxis allergy from shrimp, but tested NEGATIVE for allergy. | Inter-segment
12 | Find cases where a patient can have Eczema symptoms caused by Allergy to Salt. | Inter-segment
13 | Find cases where patients have a Bulging Disk and Pinched Nerve. | Intra-segment
14 | Find if a Bipolar Communication patient should be advised Eye Exam OR Thyroid Exam. | Inter-segment
15 | Find treatment options for patients with Osteoporosis and fewer side effects. | Inter-segment
16 | Find chances of Cancer Risk in patients showing symptom of Sleep deprivation who have been exposed to Radiation (but not Environmental Toxins and do not have Genetic Disorder). | Intra-segment
17 | Find cases where Tumor may be caused by Pacemakers. | Topic-segment
18 | Find symptoms when usage of Obesity Drugs should be terminated. | Inter-segment
19 | Find cases where Anxiety leads to drinking Alcohol. | Inter-segment
20 | Find if Cardiovascular Disease occurs due to high amount of Triglycerides intake. | Inter-segment

Table 4.6: Sample set of the representative queries.

Query Formulation

Once the user-level schema is input to the XQBE query language, the following steps are required for formulating a query:

1. Step 1: Choose the root element of the query (disease name or laboratory test).
2. Step 2: Draw any sub-elements of the root (causes, symptoms, and home remedies).
3. Step 3: Enter the operator or choose the condition with the element.
4. Step 4: Enter the value of the element to be queried.
5. Step 5: On the output part, select the elements or attributes required in the query result.
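The five formulation steps above can be emulated over the transformed database in a few lines; the XML fragment, tag names, and values below are illustrative, not actual encyclopedia content.

```python
# The five query-formulation steps, emulated over a toy XML fragment shaped
# like the transformed encyclopedia (all tags and content are illustrative).
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<Articles>
  <Article>
    <Title><name>Pneumonia</name></Title>
    <Causes>bacteria virus fungi</Causes>
    <Symptoms>fever cough</Symptoms>
  </Article>
  <Article>
    <Title><name>Gastritis</name></Title>
    <Causes>infection alcohol</Causes>
    <Symptoms>abdominal pain</Symptoms>
  </Article>
</Articles>
""")

root = "Article"                            # Step 1: root element
sub = "Causes"                              # Step 2: sub-element of the root
predicate = lambda text: "virus" in text    # Steps 3-4: operator and value
output = "Title/name"                       # Step 5: element wanted in result

results = [a.findtext(output)
           for a in doc.iter(root)
           if predicate(a.findtext(sub) or "")]
# -> ['Pneumonia']
```

Only the segments named in the predicate are inspected, which is the "focused querying" behaviour discussed in the comparison with keyword search.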

Figure 4.15: Query in the XQBE query language for finding the cases where a patient is showing symptoms of Peptic Ulcer due to Helicobacter pylori. The snapshot shows the query predicates and the Execute Query control.

A user configures the source query using the graphical interface of the XQBE query language. On execution, the corresponding XQuery is generated using Algorithm 4. Figure 4.15 gives a snapshot of a user query, find a case where the infliction of Helicobacter pylori causes peptic ulcer, using the XQBE query language. The graphical query language of XQBE requires the user to simply configure the query on the left-hand side and specify the granularity of the desired results on the right. The user submits the input values under the node causes, and for the output specifies the title (case) as a requirement. On execution, the user is returned the documents that satisfy the given conditions in the causes segment; the number of results returned depends on the number of documents satisfying the given conditions. Each of the queries is executed using both the XQuery and XQBE query languages. Table 4.7 shows the declarative query in XQuery corresponding to the first query; it is expected to return precise results.

Query Functions

Developer-level query languages such as SQL and XQuery provide all the basic functions of a query language given in Table 4.8. However, the graphical or visual query languages provide limited functionality; in this case, the XQBE interface implements the negation operator partially. For queries on the MedlinePlus medical encyclopedia, considering these functions, a comparison is

Algorithm 4: Query transformation from the MXQBE interface to XQuery

input : XQBE Query, MXQBE
output: XQuery for the source tree in MXQBE

foreach element of the MXQBE source tree do
    Parse the MXQBE source tree (source part of the interface)
    Map the graphical configuration to a variable definition (using the XQBE rule definition)
    Map the variable definition to predicative terms
    Use the node (with binding edge) to instantiate variables
    Associate an XPath with each of them
    Map leaf nodes to selection criteria
end
Generate a FLWR expression for nodes with binding edges
return XQuery (nested FLWR expression)

Table 4.7: XQuery expression of Query 1 in (Table 4.6).

for $b in doc("AllArticles NEW.xml")//Article
where ($b/[functx:contains-word($b//causes, "pneumonia")]
   and $b/[functx:contains-word($b//causes, "tuberculosis")]
   and $b/[functx:contains-word($b//symptoms, "fever")])
return <Title name="{$b//title/name}">{$b//title}</Title>

shown in the table. The results show that the search space is significantly reduced by using the concept-enriched database compared to the keyword search in the existing MedlinePlus interface. This is because the existing interface performs a search across all documents, whereas the DB-style queries search only the attributes (segments) which are specified in the query conditions. As evident from the table results, the search space for the DB-style queries is more localized and narrow, which reduces the cost of querying (reduced time of execution). On the other hand, the proposed granular search allows the user to select a specific context (segment label). The performance of the XQBE and XQuery tools is evaluated with respect to qualitative dimensions important for a user. For obtaining high-accuracy results, it is also important to understand the intention of the users through the queries posed by them. For this, a qualitative comparison based on the technique or methodology of the two methods is drawn.
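A minimal sketch of the translation in Algorithm 4, mapping the leaf-node predicates of the source tree into a FLWR expression, might look like this; the helper and its signature are hypothetical, not the tool's actual API (the document name is taken from Table 4.7).

```python
# Minimal sketch of Algorithm 4: collect (path, keyword) predicates from the
# leaf nodes of the graphical query's source tree and emit a FLWR expression.
# The helper name and signature are illustrative, not the tool's actual API.
def to_xquery(root, predicates, output):
    """predicates: list of (path, keyword) pairs taken from the leaf nodes."""
    where = " and ".join(
        f"functx:contains-word($b//{path}, '{kw}')" for path, kw in predicates
    )
    return (f"for $b in doc('AllArticles NEW.xml')//{root}\n"
            f"where {where}\n"
            f"return $b//{output}")

# Query 1 from Table 4.6, expressed as a source-tree configuration:
q1 = to_xquery(
    root="Article",
    predicates=[("causes", "pneumonia"),
                ("causes", "tuberculosis"),
                ("symptoms", "fever")],
    output="title",
)
```

The generated string corresponds to the FLWR expression of Table 4.7; the real algorithm additionally nests FLWR expressions for nodes with binding edges.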
Table 4.9 gives the baseline comparison between the interpretations of the user queries by the two methods. The table gives the actual intent and the interpreted intent for a given query. Using the DB-style queries, the exact intent of the queries submitted by the user is determined. This is because the user has the flexibility to select the segment labels which he or she wishes to include in the query criteria; the skilled and semi-skilled users are well versed in the terminologies and concepts they need. Table 4.10 gives a qualitative comparison between the query methods listed in this study w.r.t. the query features. These features are critical for any efficient query language or query tool. The results indicate that the MXQBE query language

Table 4.8: Query capability of XQBE and XQuery for the MedlinePlus Medical Encyclopedia.

Query Types                        XQBE on MedlinePlus   XQuery on MedlinePlus
Project
Rename
Existential Quantification         Not yet Required      Not yet Required
Nesting                            Partially             √
Negation                           Partially             √
Join
Relational and Boolean operators   √                     √
TOP operator
Union                              No                    √
Sorting                            Not yet Required      Not yet Required
Difference                         Partially             √
Filtering                          Not yet Required      Not yet Required
Grouping                           No                    No
Cartesian Product                  Not yet Required      Not yet Required
Universal Qualification            To be explored        To be explored
Querying Schema Order              Not yet Required      Not yet Required
Querying Instance Order            Not yet Required      Not yet Required
Flattening                         Yes                   Yes

covers most of the query functionalities of XQuery and those of the existing MedlinePlus tools, and is easy to use.

4.7 Discussions

Consideration of Fuzzy User Inputs

In the case of the MedlinePlus medical encyclopedia, a given segment label (causes or symptoms) may exist across multiple documents and contain the same keywords. For example, fever can be a cause of multiple diseases, such as asthma or abdominal pain. If the user is not aware of other specific causes or symptoms he or she needs to add to the query, the query may return a large number of results. For a given user query, all the attributes (sub-tree labels) corresponding to the user input are queried; hence, multiple sub-trees may satisfy the user criterion. In the case of large repositories, this is even more critical, with a large number of results. Therefore, a ranking mechanism is required to obtain high-quality, relevant results. A possible ranking method that makes use of existing ranking methods in IR has been proposed by Schlieder et al. (Schlieder and Meuss, 2000). The method uses the measures of term frequency (TF) and inverse document frequency (IDF) for ranking the results of XML queries.
In the proposed model for the medical document repository, each document forms a sub-tree of the tree corresponding to the complete repository, and each of the intermediate nodes represents a logical segment of the document. The TF of a queried attribute (A_q) can be defined as the number of its occurrences in a sub-tree normalized by the frequency of

Table 4.9: User intent as captured by the MedlinePlus document corpus and the transformed database corresponding to the corpus.

Queries | Intention | MedlinePlus Corpus (Representative Keyword Search) | Terminology-enriched Database
Q1 | Disease = ?? when Causes = Pneumonia and tuberculosis, Symptoms = Fever | Pneumonia and fever and tuberculosis | Causes = Pneumonia and tuberculosis and Symptoms = Fever
Q2 | Is it possible (to show)? Symptoms: Hypertension but not high blood pressure | Hypertension not high blood pressure | Symptom = Hypertension and not (high blood pressure)
Q3 | Cases where Symptoms = Abdominal Pain, Causes = Gastritis | Abdominal Pain and Gastritis | Symptoms = Abdominal Pain and Causes = Gastritis
Q4 | Can the test be referred for? Diagnosis = Pneumonia, Test Referred = Lung Needle Biopsy | Pneumonia and Lung Needle Biopsy | Test Name = Lung Needle Biopsy and Abnormal Results Mean = Pneumonia

the most frequent term in the sub-tree with the same segment label. The IDF can be defined as the ratio of the total number of sub-trees to the number of sub-trees containing the attribute A_q. The weight of the queried attribute A_q can then be calculated using the formula w_Aq = tf_Aq × idf_Aq. Using this formula, the weights of the result segments can be calculated and arranged in decreasing order; the most relevant segment is the first result presented to the user. This methodology is based on the Vector Space Model (VSM) in IR (Schlieder and Meuss, 2002). The method assumes that the most relevant segments will have the highest frequency of the value of the predicate (attribute value) selected by the user. Another possible approach is to list the result segments (document segments) in depth-first order, which may also present the documents in alphabetical order. However, it does not necessarily present the most relevant result first.
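The ranking scheme w_Aq = tf_Aq × idf_Aq described above can be made concrete with a small worked example; the segment corpus below is invented for illustration, and the tf/idf definitions follow the text (idf as a plain ratio, without the logarithm common in IR).

```python
# Worked example of the proposed ranking, w_Aq = tf_Aq * idf_Aq:
# tf = count of the queried value in a sub-tree, normalized by the most
# frequent term in that sub-tree; idf = (total number of sub-trees) /
# (number of sub-trees containing the attribute value).
# The tiny corpus of "causes" segments below is invented for illustration.
segments = [
    ["fever", "fever", "virus"],
    ["fever", "bacteria", "bacteria", "bacteria"],
    ["alcohol"],
]

def weight(term, seg, segments):
    tf = seg.count(term) / max(seg.count(t) for t in set(seg))
    idf = len(segments) / sum(term in s for s in segments)
    return tf * idf

weights = [weight("fever", seg, segments) for seg in segments]
ranked = sorted(range(len(segments)), key=lambda i: weights[i], reverse=True)
```

The segment in which "fever" is the dominant term receives the highest weight and is presented first, matching the stated assumption that the most relevant segments have the highest frequency of the selected predicate value.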
Also, a possible approach is to perform relevance ranking between the query sub-tree and the sub-trees of the repository tree that best match it, similar to the approach given in (Aghili et al., 2006). As described, each of the queries forms a sub-tree of the document repository tree. The resulting sub-trees can be ranked by matching both structure and content: first, the structure of the query is matched with the structure of the document sub-tree, and then the contents of each of the nodes are matched between the two sub-trees. This technique returns the exact query results as well as ranking the other matching results.
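A simplified sketch of this structure-then-content matching is shown below; the documents, query, and scoring function are illustrative assumptions in the spirit of the cited approach, not its actual algorithm.

```python
# Sketch of structure-then-content sub-tree matching: a query sub-tree
# matches a document sub-tree only if all its segment labels exist there;
# matched sub-trees are then scored on how many content conditions hold.
# All data and the scoring rule are illustrative assumptions.
def match_score(query, subtree):
    """query/subtree: dicts mapping segment label -> text."""
    structure = [label for label in query if label in subtree]
    if len(structure) < len(query):        # structure must match first
        return 0.0
    hits = sum(query[l].lower() in subtree[l].lower() for l in structure)
    return hits / len(query)               # fraction of content conditions met

docs = [
    {"Causes": "Helicobacter pylori infection", "Symptoms": "abdominal pain"},
    {"Causes": "stress", "Symptoms": "abdominal pain"},
    {"Symptoms": "fever"},                  # structure mismatch: no Causes node
]
query = {"Causes": "helicobacter", "Symptoms": "abdominal pain"}
scores = [match_score(query, d) for d in docs]
```

Exact matches score highest, partially matching sub-trees are still ranked, and structurally incompatible sub-trees are excluded.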

Table 4.10: Qualitative comparison based on query features among the considered query methods for searching and querying medical information.

Query Features | Existing MedlinePlus Search | XQuery | MXQBE
Direct Answers | Set of documents containing all or some of the keywords | Precise and direct answers | Precise and direct answers
Retrieval Unit | Entire document containing the search keywords | Result embedded in XML tags | Result embedded in XML tags
Querying Capability | Advanced keyword search | Focused querying | Focused querying
Aggregate Queries | Only simple AND, OR and NOT | Yes | Yes
Usefulness | Results may or may not be relevant | Relevant and exact results | Relevant and exact results
Easy to use and Interpretable Interface | The user browses through all results to find the ones relevant to him or her | The users need support from technical people and proper training in the syntaxes of the XQuery language | The novice users need to be trained, but the semi-skilled and skilled users can interpret the result within the XML tags

Usability and Acceptance

Preliminary results indicate that medical experts (skilled and semi-skilled users of the health-care domain) need powerful query languages. They do not trust the general-purpose commercial search engines, the advanced keyword searches, or the alphabetic browsing over these document repositories. They are not confident about the outcome of the computation or ranking methods used to fetch the results, nor are they sure about the procedures implemented for fetching them. Moreover, the sources used by the commercial Web search engines for providing the results may or may not be credible. This is critical in specialized domains such as the life sciences, biomedicine, and health care. Such general-purpose interfaces are suitable for novice users who do not know the exact terms for querying and do not mind browsing the list of results to locate the relevant ones.
With the high-level database query tools, the expert medical practitioners (skilled or semi-skilled users) can formulate queries using a simple drag-and-drop interface (the MXQBE query language). He or she is aware of the relationships among the various medical terminologies and can connect the nodes and the corresponding desired output. This eliminates the need to learn complex query language syntaxes and provides the users with the ability to perform in-depth querying. Previous studies point out that QBE is suited for simple (single-relation) queries (Saito and Morishita, 2008). The XQBE query language has been developed along similar lines as QBE, for a hierarchical XML-like structured database. The visual programming environment of these two graphical query languages gives the non-programming user a single view of the data. It requires transforming the on-line document corpus to a coherent user's view based on Web documents

Figure 4.16: Proposed improved query and search interfaces (according to expert and naive users): the tree-structured database (using meta-data) derived from the document repository is accessed through a query-language interface by the domain expert, and through keyword search (with semantic granularity) by novice users.

rather than the database normalization theory. This facilitates querying with MXQBE (a QBE-like interface) for on-line specialized document repositories. Figure 4.16 shows the improved interactions of a domain expert and a novice user with the transformed XML database proposed in this work. The domain experts are equipped to use the high-level query language (MXQBE). The search method for the novice users can also be improved by using keyword search and returning specific segments from the Web documents as results, rather than the complete ranked list of documents containing the keywords. Figure 4.17 shows the transformations that are performed by the proposed approach for converting the HTML document tags into terminology-based query attributes. The Web objects are identified in the HTML Web document from the user's perspective to create a user-level schema, which the user can query using query languages such as SQL and XQuery. As evident in the figure, the XML and the DTD enable a high-level query language such as XQBE or XQuery; the query tags within the XML and DTD define the query schema.

Future Work

For simplification and initial application of the proposed framework, we do not take into consideration Web pages with hyperlinks. Also, the scalability of the approach needs to be assessed for large-scale document repositories. This is essential to compare the overhead of converting the Web documents to the database against the enhanced query capability of its users.
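The heading-to-attribute transformation described above can be illustrated as follows; the helper, the tag-naming rule, and the sample content are assumptions for illustration only.

```python
# Sketch of the off-line transformation: section headings of a Web document
# become query-able tags in the tree-structured database. The helper, the
# tag-naming rule, and the sample content are illustrative assumptions.
import xml.etree.ElementTree as ET

def to_query_tree(title, sections):
    """sections: list of (heading, content) pairs extracted as segments."""
    article = ET.Element("Article")
    ET.SubElement(article, "Title").text = title
    for heading, content in sections:
        tag = heading.strip().replace(" ", "_")   # heading -> attribute name
        ET.SubElement(article, tag).text = content
    return article

article = to_query_tree("Peptic Ulcer", [
    ("Causes", "Helicobacter pylori infection"),
    ("When to contact medical professional", "Severe abdominal pain"),
])
# The document heading now serves as a query attribute:
causes = article.findtext("Causes")
```

Once the headings are tags, standard XPath/XQuery expressions (and hence XQBE) can address individual segments instead of whole documents.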
At present, the usability studies of the proposed querying method have been done with university students as subjects rather than medical practitioners. As future work, the proposed approach can be generalized to other similarly structured specialized Web document repositories (legacy documents), such as other medical Web document repositories and repositories in finance and law. Further, a usability study with medical practitioners can evaluate the dimensions of ease of use, learnability, and understandability of the MXQBE interface.

Figure 4.17: Querying by the domain experts on the transformed database (containing data objects of the Web document) and representation of the various transformations required to create query-able entities: the Web document repository is transformed by system developers into a tree-structured database (XML with a DTD) built around the user's objects of interest, over which a domain-specific high-level interface (XQuery/XQBE/XGI) serves the domain experts.

4.8 Summary and Conclusions

The Web document repositories on the Internet provide form-based searches and keyword-based searches. In-depth querying is required by the domain experts to make use of these resources. This chapter demonstrated the application of a two-step framework over specialized Web document repositories. The framework transforms the previously static Web documents into semantic/terminology-based query-able attributes by an off-line process. It also transforms the document structure and headings into labels of a database schema, providing ease of query to the domain experts. The work presents the successful application of the XQuery and MXQBE (Braga et al., 2005) query languages to the Web document repository as an on-line resource. Identifying and understanding the query needs of the domain experts makes it possible to provide an expert-level query language; it will also improve the search interface for the non-experts. Our attempt is to provide precise results to the skilled users, with ease of use. In this study, user-level objects, or logically coherent segments of the Web document, are extracted to generate a terminology-enriched database. Over this database, the existing easy-to-use high-level graphical query languages are enabled to perform in-depth querying. The notion of logical and semantically coherent segments of specialized-domain Web documents is used for the construction of a hierarchical structure whose orientation is equivalent to the user's view of the document.
The experiments and evaluation of the framework exhibit its efficiency. The representative set of queries is exhaustive in terms of coverage of the most commonly submitted user queries on these document repositories. The domain experts need to be trained to use these new query languages, which are fairly simple. This work attempts to advance the state of the art in enhancing the query abilities of domain experts in specialized domains (for example, the health-care domain).

Part II

Standardized Electronic Health Records: Database Query Languages and Usability Studies

Chapter 5

Literature Review and Background Studies

Information Interchange Services and Distributed Architecture

Clinical processes are generally characterized by a high degree of communication and cooperation among physicians, nurses, and other groups of personnel. An information system should support these processes by enabling a seamless information flow between different participants and different locations. Since processes are fundamental building blocks of an organization's success, health-care providers have to redefine their boundaries and make strategic alliances to cope with the changes in the health-care provision market. Also, the information required to make clinical decisions may exist in many different forms and with different organizations (say, hospitals and laboratories). Assembling this information is difficult and takes significant and precious time. In such situations, the business processing rules may not be applied consistently because of inexperience or poor training, and these rules might vary from one organization to another. Other reasons may include (Alexandrou and Mentzas, 2009):

1. Transition of work from one stage of the business process to another may be done slowly, incorrectly, or not at all, causing delays, errors, and possible mistreatment of the patient.
2. Difficulties arise in determining the status of tasks in the business process.
3. Changing business processes may require major retraining and lengthy learning curves.
4. The use of inconsistent and multiple terminologies in the description of rules or tasks.

Thus, the fragmented and distributed organizations worldwide require exchanging information for patient care and clinical decision making. There is a need for standardization of this information, with the semantics of the information exchange preserved. The standards should be fully able to address semantic and organizational-level interoperability.
Many standards are being developed for health care, for example, HL7-CDA (HL7, 2011), CEN (cen, 06 3) and

1. Research Publication(s): (1) Shelly Sachdeva, Aastha Madaan, Subhash Bhalla, "Discovery of Patterns to Improve Usability of Electronic Health Record Systems," Journal of Information Processing, Vol. 20, No. 1, 2012, Information Processing Society of Japan; and (2) Shelly Sachdeva, Aastha Madaan, Wanming Chu, "Information interchange services for electronic health record databases," International Journal of Computational Science and Engineering (IJCSE), Vol. 7, No. 1, 2012, Inderscience Publishers.

OpenEHR (T. et al., 2008). These standards aim towards a dual-level modeling approach, which clearly distinguishes the expert's domain-level knowledge from the developer's implementation-level knowledge.

Business Processes in a Health Organization

In the health-care context, a patient visits various organizations, or units within organizations, to get proper treatment. The role of IT is to coordinate all the business interactions between the various entities such as hospitals, patients, laboratories, insurers, and administrative staff (Alexandrou and Mentzas, 2009). An Institute of Medicine 2001 report (IOM 2001) identified four stages of health organization evolution in the progression from autonomous modes of working to fully coordinated shared care with a high degree of specialization and expertise, as follows:

1. Stage 1: Highly fragmented practice with individuals functioning autonomously and little specialization;
2. Stage 2: Referral networks of loosely structured multidisciplinary teams;
3. Stage 3: More patient-centered, team-based care, but focusing primarily on the needs and intentions of health-care professionals, with some decision support but little integration of health information;
4. Stage 4: Shared multidisciplinary care, evidence-based and patient-centric practice with strong service coordination between practices and with good quality practices and performance measures [34].

Figure 5.1: An overview of different inter- and intra-organizational interactions: hospitals A and B, a diagnostic center, and medical insurers exchange lab-test samples and results, patient summaries, diagnoses, invoices, and reimbursement summaries over a cross-organizational business process bus.

Each of the stages mentioned above requires semantic interoperability to be capable of information exchange and to achieve the goal of shared health care.
These processes have requirements of reliability and standardization, and must retain the semantics of the information exchanged.

A patient who needs medical attention may undergo a complete cycle of different interactions with the system. At each point, information is exchanged among the different entities involved in providing shared care. Figure 5.1 describes examples of different user-business-process and inter-organizational business process interactions during the health-care cycle. In a simple patient-hospital scenario, the following processes and user-system interactions are involved:

1. A patient P goes for a laboratory examination before a medical treatment.
2. Patient P receives the medical treatment from Hospital A:
   (a) The health-care expert defines the treatment procedure;
   (b) The health-care expert requires all information about the patient;
   (c) The expert advises consulting a health-care expert at Hospital B.
3. Hospital A creates a summary of patient P for exchange with Hospital B.
4. Patient P undergoes tests at Hospital B.
5. Hospital B sends the report back to Hospital A.
6. Finally, patient P applies for reimbursement with the insurance consultant.

All these processes need business process interoperability, and for each of the business processes to inter-operate, the semantics of the information must be retained. The degree of semantic interoperability might vary at each of the steps. In the above example, a health-care worker needs to query the EHRs of the patient to determine the test to be conducted; he further needs to retrieve information about the patient history and medical observations to determine the medical procedure. The health information needs to be exchanged between different organizations, so the semantics of the information need to be retained in order to retrieve information from the system.
Therefore, we observe that there are two critical needs of a business process operating within an organization: semantic interoperability at all levels of information exchange (standardization of information exchange) and, secondly, a user-friendly approach to query the EHR system. The subsequent sections focus on these two needs.

Standardized Electronic Health Records (EHRs)

The digitized form of an individual patient's medical record is referred to as an electronic health record (EHR). These records can be stored, retrieved, and shared over a network through enhancements in information technology. An Integrated Care EHR (ISO13606, 2008) is defined as: a repository of information regarding the health of a subject of care in computer-processable form, stored and transmitted securely, and accessible by multiple authorized users (ISO/TC 215, 2003) to support continual, efficient and high-quality integrated health-care. Many efforts are being made for the development, implementation, maintenance, and utilization of new patient-centered collaborative health care (Kalra et al., 2009). Electronic health record (EHR) data are used by clinicians, patients, health-care organizations, and decision makers for a variety of purposes. The OpenEHR standard for EHRs supports version-controlled health records (T. et al., 2008). This enables all past states of a health record to be investigated in the event of clinical or medico-legal investigations, forming a longitudinal record of patient health. The most frequently used information is stored in a separate record for fast lookup and querying. The Institute of Electrical and Electronics Engineers (IEEE) defines interoperability as the ability of two or more components to exchange information and to use the information that

has been exchanged (IEEE, 1990). Semantic interoperability is a big challenge in the health-care industry. Semantic interoperability requires that the meaning of information be preserved from the user level through the logical level to the physical level. To achieve interoperability, the standard proposes the use of two-level modeling for the separation of information and knowledge. It helps to improve the quality of the information exchanged by sharing archetypes via a repository with versioning, by assigning unique archetype identifiers, and by the use of the underlying reference model (RM) proposed in the open electronic health record architecture (EHRA) (Beale, 2008), (ISO13606, 2008), and CEN/TC251 CEN13606 (cen, 06 3) (developed by the European Committee for Standardization).

5.2 Semantic Interoperability

The most complex level of interoperability recognizes that the context of the information being shared is important when business processes, work-flows, and data transcend organizational boundaries. A fundamental goal of interoperating systems is not only to share and understand data but also to perform actions on it. Actionable data allows systems to make changes to data that are consistent and understood by all other interoperating systems. Additionally, these changes should not produce any undesired consequences on the other systems. If systems are to successfully inter-operate, they must have consensus and reach agreements at these levels; agreements on more levels mean higher interoperability (Lewis et al., 2008). Semantic interoperability concerns the actual meaning of the data being exchanged. It is not possible to exchange data completely unless the interacting systems agree on and conform to a common semantics for all shared data. The complete realization of this shared health-care scenario clearly requires integrating all the relevant information, potentially from several independent institutions.
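The two-level separation of knowledge (archetypes) from information (data instances) can be sketched in miniature; the constraint structure, field names, and values below are toy illustrations, not an actual openEHR archetype or reference model.

```python
# Sketch of dual-level modeling: domain knowledge is captured as
# archetype-like constraints (knowledge level), kept separate from the data
# instances (information level) that are validated against them.
# The archetype below is a toy illustration, not an actual openEHR archetype.
ARCHETYPE_BP = {  # expert-defined constraints on a blood-pressure reading
    "systolic":  {"unit": "mmHg", "min": 0, "max": 1000},
    "diastolic": {"unit": "mmHg", "min": 0, "max": 1000},
}

def validate(instance, archetype):
    """Check a data instance against the archetype's constraints."""
    for field, rule in archetype.items():
        value, unit = instance[field]
        if unit != rule["unit"] or not rule["min"] <= value <= rule["max"]:
            return False
    return True

reading = {"systolic": (120, "mmHg"), "diastolic": (80, "mmHg")}
ok = validate(reading, ARCHETYPE_BP)
```

Because the constraints live outside the software, domain experts can revise and version archetypes without changing the implementation, which is the separation the cited standards aim for.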
Thus, even if different organizations agree on the structure of the information exchanged, functional interoperability alone will not be enough to realize shared health-care. Any two systems that need to exchange information must agree not just on the structure but also on the meaning of the information exchanged. Taking into consideration the rapidly changing health-care environment and requirements, an ideal information system should be capable of incrementally evolving according to the users' needs. Supporting clinical processes with information technology requires work-flow specification (i.e., the identification of tasks, procedural steps, input and output information, people and departments involved, and the management of information flow according to this specification). Semantic interoperability is most needed when EHR data are to be shared and combined from different systems (or across diverse modules within a large system). Full semantic interoperability is required across heterogeneous EHR systems in order to gain the benefits of computerized support for reminders, alerts, decision support, work-flow management and evidence-based health-care, i.e., to improve effectiveness and reduce clinical risks. Semantic interoperability (SIOp) has numerous facets (Kalra et al., 2009):

For individual patients - SIOp-relevant tasks comprise assisted clinical data capture and quick access to the patient record as well as to pertinent background knowledge. They also include quality assurance, clinical decision support, monitoring and alerts, as well as feedback regarding quality and costs.

For aggregated population data - SIOp-relevant tasks include reporting, health economics, surveillance, quality assurance, epidemiology (hypothesis formulation), bio- and tissue-banking.
SIOp enables the meaningful linkage of research findings and knowledge to patient information, and the discovery of new knowledge from semantically coherent EHR repositories. When users enter information, it should safely reach the designated part of the system, and allow it to be

sharable with other users and systems. Since different systems come from different vendors, semantic interoperability must be taken into account during the exchange of data, information and knowledge. Figure 5.2 indicates that the meaning of information is preserved across various applications, systems and enterprises. The health-care domain is evolving at a fast rate, and health knowledge is becoming broader, deeper and richer with time. Often, different clinics and hospitals have their own information systems to maintain patient data. These distributed and heterogeneous data resources may hinder the exchange of data among systems and organizations. There is a need for legacy migration of data to a standard form for the purposes of exchange. The EHRs should be standardized and should incorporate semantic interoperability. The World Health Organization (WHO) has a strong desire to develop and implement semantically interoperable health information systems and EHRs.

Figure 5.2: Semantic Interoperability (lossless information exchange across applications A..N, systems A..N and enterprises A..N)

Levels of interoperability

There are three levels of interoperability (Garde et al., 2007a):

1. Syntactic (data) interoperability
2. Structural interoperability/semantic interpretability
3. Semantic interoperability

These are described in Table 5.1.

Dual Level Modeling Approach

In essence, Electronic Health Records (EHRs) have a complex structure that may include data on about 100 to 200 parameters, such as temperature, blood pressure and body mass index.

Table 5.1: The three levels of interoperability in standardized EHRs.

Syntactic interoperability (main mechanism: the openEHR Reference Model (RM)) - The openEHR reference model ensures syntactic interoperability independent of any defined archetypes. The RM does not define clinical knowledge; that is defined and communicated by archetypes, separately from the reference model. Hence, data items are communicated between systems only in terms of clearly defined, generic reference model instances. As the reference model is stable, achieving syntactic interoperability between systems is a simple task.

Structural interoperability (main mechanism: archetypes) - Structural interoperability is achieved by the definition and use of archetypes. As agreed models of clinical or other domain-specific concepts, archetypes are clinically meaningful entities. An EHR entry (or a part of one) which has been archetyped will have the same meaning no matter where it appears. Thus, archetypes can be shared by multiple health systems and authorities, enabling information to be shared between different systems and types of health-care professionals. Clinical knowledge can be shared and safely interpreted by exchanging archetypes.

Semantic interoperability (main mechanism: domain knowledge governance) - The use of archetypes and the reference model alone does not guarantee that different EHR systems and vendors will construct equivalent EHR extracts, and use the record hierarchy and terminology in consistent ways. For semantically interoperable systems, archetype development must be coordinated through a systematic domain knowledge governance tool, which, for example, avoids incompatible, overlapping archetypes for essentially the same concept.

Individual parameters have their own contents. In order to serve as an information interchange platform, EHRs use archetypes to accommodate various forms of contents (Beale and Heard, 2008b).
EHR data has a multitude of representations. The contents can be structured, semi-structured or unstructured, or a mixture of all three. These can be plain text, coded text, paragraphs, measured quantities with values and units, date, time, date-time and partial date/time, encapsulated data (multimedia, parse-able content), basic types (such as boolean, state variable), container types (list, set) and uniform resource identifiers (URI). To accommodate this, the two-level modeling approach has been proposed. The dual-model approach for EHR architecture is defined by ISO (cen, 06 3) for a single subject of care (the patient). The emphasis is on achieving interoperability of systems and components during the communication of a part of, or a complete, EHR. Examples of dual-model EHR architectures are the CEN/TC251 EN13606 standard (CEN, 2011) (developed by the European Committee for Standardization) and the openEHR standard. CEN/TC251 is a regional Standards Development Organization. The openEHR foundation (T. et al., 2008) was established by University College London and Ocean Informatics. It is an

international foundation working towards semantic interoperability of EHRs and the improvement of health care.

Figure 5.3: Two-Level Modeling Approach (the clinical user interacts with the clinical model, i.e., archetypes and templates, maintained by the clinical domain expert; the database schema, i.e., the reference model, is maintained by the software expert)

The OpenEHR Architecture

The openEHR foundation is pioneering the field of maintaining semantically interoperable EHRs. It has launched the implementation of the specification project, and aims at a new business model for electronic medical records. The latest edition of Microsoft's Connected Health Framework includes openEHR (ISO) archetypes as part of its domain knowledge architecture. The openEHR Reference Model is based on the ISO and CEN EHR standards, and is interoperable with the HL7 (HL7, 2011) and EDIFACT (Electronic Data Interchange for Administration, Commerce and Transport) message standards. This enables openEHR-based software to be integrated with other software and systems. Figure 5.4 shows how the DBMS architecture (Silberschatz et al., 2010) can be compared to the openEHR architecture (Beale and Heard, 2008b). In the two-level modeling approach, the lower level consists of the reference model and the upper level consists of domain-level definitions in the form of archetypes and templates (Figure 5.3). The Reference Model (RM) (cen, 06 3) is an object-oriented model that contains the basic entities for representing any entry in an EHR. Software and data can be built from the RM. Concepts in the openEHR RM are invariant. It comprises a small set of classes that define the generic building blocks to construct EHRs. This information model ensures syntactic (data) interoperability. The second level is based on archetypes (cen, 06 3), (T. and S., 2005). These are formal definitions of clinical concepts in the form of structured and constrained combinations of the entities of the RM.
A definition of data as archetypes can be developed in terms of constraints on the structure, types, values, and behaviors of RM classes for each concept, in the domain in which we want to use it. Archetypes are flexible: they are general-purpose, reusable and composable. They provide knowledge-level interoperability, i.e., the ability of systems to reliably communicate with

each other at the level of knowledge concepts. Thus, the meaning of clinical data is preserved in archetype-based systems. Standardization can be achieved so that, whenever there is a change in the clinical knowledge (or requirements), the software need not be changed; only the archetypes need to be modified. The clinical user can enter and access information through a clinical application. The clinical domain expert can record and maintain the clinical model through a modeler, the software needed to manage the archetypes. Patients can have complete control over access to, and distribution of, their health records.

Figure 5.4: DBMS architecture compared to the openEHR architecture (user views onto health integration, application development and knowledge management platforms; the service model with templates, queries and a terminology interface at the conceptual/logical level; concepts as archetypes in the archetype model; basic entities as objects and versions in the reference model at the physical/internal level)

The openEHR architecture is analogous to the three levels of the DBMS architecture. The following points illustrate this similarity.

Physical level: The lowest level of abstraction describes the details of the reference model. These include identification, access to knowledge resources, data types and structures, versioning semantics, support for archetyping, and the semantics of enterprise-level health information types.

Logical level: The next higher level of abstraction describes the clinical concepts that are to be stored in the system. They can be represented in the form of archetypes and templates. The implementation of clinical concepts may involve physical-level structures; users of the logical level do not need to be aware of this complexity. Clinical domain experts use the logical level.

View level: The highest level of abstraction describes only part of the entire EHR architecture. This corresponds to the service model.
Several views are defined, and users see these views. In addition to hiding details of the logical level, the views also provide a security mechanism to prevent users from accessing certain parts of the EHR contents.
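The separation of a stable reference model from archetype-level constraints can be illustrated with a small sketch. This is illustrative only: the class and field names below are simplified stand-ins invented for this example, not the actual openEHR RM classes or a real archetype.

```python
# Illustrative sketch of two-level modeling: a tiny generic "reference model"
# node plus archetype-style constraints. Names are simplified stand-ins,
# not the real openEHR classes.

class Element:
    """Generic RM building block: a named node holding a value and a unit."""
    def __init__(self, name, value, unit=None):
        self.name, self.value, self.unit = name, value, unit

# An "archetype" at the second level: constraints on RM instances for one
# clinical concept (here, a much-simplified blood-pressure observation).
BLOOD_PRESSURE_ARCHETYPE = {
    "systolic":  {"range": (0, 1000), "unit": "mm[Hg]"},
    "diastolic": {"range": (0, 1000), "unit": "mm[Hg]"},
}

def validate(elements, archetype):
    """Check RM instances against archetype constraints; return error list."""
    errors = []
    for e in elements:
        c = archetype.get(e.name)
        if c is None:
            errors.append(f"{e.name}: not defined by archetype")
            continue
        lo, hi = c["range"]
        if not (lo <= e.value <= hi):
            errors.append(f"{e.name}: value {e.value} outside {c['range']}")
        if e.unit != c["unit"]:
            errors.append(f"{e.name}: unit {e.unit!r} != {c['unit']!r}")
    return errors

reading = [Element("systolic", 142, "mm[Hg]"), Element("diastolic", 88, "mm[Hg]")]
print(validate(reading, BLOOD_PRESSURE_ARCHETYPE))  # [] -> conforms
```

Changing what counts as a valid blood-pressure reading only requires editing the constraint table, not the `Element` class; this mirrors the claim above that clinical knowledge changes touch archetypes, not software.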

The openEHR Archetypes

EHRs based on the openEHR standard use archetypes to accommodate various forms of content (T. et al., 2008), [15]. The EHRs have a complex structure that includes data from about 100 to 200 parameters, such as temperature, blood pressure, and body mass index. Individual parameters have their own contents. Each contains an item such as data (e.g., captured for a blood pressure observation). It offers complete knowledge about a clinical context, i.e., attributes of data; state (context for the interpretation of data); and protocol (information regarding the gathering of data), as shown in Figure 7.4. The contents within the EHR data have a multitude of representations. The contents may be structured, semi-structured or unstructured, or a mixture of all three. These may be plain text, coded text, paragraphs, measured quantities (with values and units); date, time, date-time (and partial date/time); encapsulated data (multimedia, parse-able content); basic types (such as Boolean, state variable); container types (list, set); and uniform resource identifiers (URI) (Sachdeva and Bhalla, 2012). Archetypes facilitate data integration, cohort formation, variable selection (including context-related variables), time-course analysis, and query formulation (Gall et al., 2008). Sundvall et al. (E. et al., 2007) described an information visualization tool for time-varying health data based on archetypes. Electronic decision support systems (DSS) are another driver for EHR use. Archetype systems support DSS because the data are well-structured. Also, the archetypes' standard format, across a variety of settings, makes DSS tools more widely available. The openEHR foundation is developing archetypes which will ensure semantic interoperability. The openEHR archetypes are developed systematically through domain knowledge governance tools.
According to the statistics provided by the Clinical Knowledge Manager (CKM), a domain knowledge governance tool, there are 279 archetypes (CKM, 2012). Domain knowledge governance will ensure that archetypes meet the information needs of the various areas. With CKM, users interested in modeling clinical content can participate in the creation and/or enhancement of an international set of archetypes. These provide the foundation for interoperable Electronic Health Records. CKM is a framework for managing archetypes. It helps in identifying which archetypes need to be standardized and which are domain-specific. It establishes a frame of reference and helps to train multidisciplinary teams for archetype development. The coordinated editorial effort informs and supports domain knowledge governance. To support this, openEHR has employed the Web Ontology Language (OWL) and the Protégé OWL plug-in to develop and maintain an Archetype Ontology, which captures the meta-information about archetypes needed to support domain knowledge governance (Garde et al., 2007b).

Figure 5.5: The Blood Pressure concept represented as an openEHR archetype

Archetypes and Semantic Interoperability

Archetypes specify the design of the clinical data that a health-care professional needs to store in a computer system. They enable the formal definition of clinical content by domain experts without the need for technical understanding. They conserve the meaning of data by maintaining explicitly specified and well-structured clinical content for semantic interpretability. They can safely evolve, and thus deal with ever-changing health knowledge, using the two-level approach. Hence, data stored in archetypes at one hospital is easily exchanged with another hospital, and the semantics of the data are retained. Archetypes are the basis for knowledge-based systems, as they are a means to define clinical knowledge and are language-neutral. They should be governed with an international scope and should be developed by clinical experts in an interdisciplinary, cooperative way (Garde et al., 2007a). The developed archetypes need to be reviewed by other clinical experts (e.g., a clinical review board) to ensure their completeness and relevance to evidence-based clinical practice. The archetype repository is the place of development and governance, and the primary source, of archetypes. High-quality archetypes with high-quality clinical content are the key to semantic interoperability of clinical systems (Garde et al., 2007a). According to the archetype editorial group and the Clinical Knowledge Manager (CKM) (CKM, 2012), the information should be sharable and computable. Terminologies also help in achieving semantic interoperability; terms within archetypes are linked (bound) to external terminologies like SNOMED-CT (IHTSDO, 2013).
With the use of the reference model, archetypes, and a companion domain knowledge governance tool, the semantic interoperability of EHR systems becomes a reachable goal.

Archetype Definition Language (ADL)

The Archetype Definition Language (ADL) (Beale and Heard, 2008a) is considered a language for system-level interactions (Figure 5.2). Archetypes for any domain are described using this formal language. ADL is path-addressable, like XML. The openEHR Archetype Object Model (AOM) describes the definitive semantic model of archetypes, in the form of an object model (Beale and Heard, 2008b). The AOM defines relationships which must hold true between the parts of an archetype for it to be valid as a whole; in simpler terms, all archetypes should conform to the AOM. Since the EHR has a hierarchical structure, ADL syntax is one possible serialization of an archetype. ADL uses three other syntaxes - cADL (constraint form of ADL), dADL (data definition form of ADL), and a version of first-order predicate logic (FOPL) - to describe constraints on data which are instances of the RM (Beale and Heard, 2008a).

5.3 Information Retrieval in EHR Systems

EHR systems are end-user systems used by health-care workers, such as the nursing staff, administrative departments, clinicians, and patients. As the use of EHRs becomes more widespread, so does the need to search them and to provide effective information discovery within them. There are two approaches to exploring the data within them: searching the EHR collection given a user question (query) and returning relevant fragments from the EHRs, or mining the EHR collection to extract interesting patterns, group entities into various classes, or decide whether an EHR satisfies some given property. Mining differs from searching: mining techniques look for interesting patterns or trends in the data, whereas search algorithms look for relevant discrete pieces of data in a collection, given a query. We focus on providing an end-user with the information.

Improving the usability of EHRs will support care of the whole patient and improve the quality, safety, efficiency, and effectiveness of care delivered in the primary care setting (Hristidis, 2009). This potential is presently limited by many factors, one of which is the challenge of extracting information from EHRs in order to execute a research query (Hristidis, 2009). There is limited pre-existing work on the generic specification of clinical queries. The EHR system users are not IT experts who understand the implementation details and terminologies used within the system. Hence, the provision of query systems which are intuitive for non-experts has been recognized as an important informatics challenge (Huser et al., 2010). A query language is essential for EHR systems. The easiest way users can access the system is by using natural language forms to provide keywords to find relevant information. A high-level query interface which is user-friendly and system-efficient is a critical requirement for any EHR system. With such a query language, end-users need to be presented with a view-level interface to the system. These end-users are mostly non-expert users such as clinicians, patients, and other health-care workers. They may issue a user query to search the EHR collection, such as "Find information related to asthma in the collection", or a query like "Find all blood pressure values, where the systolic value is greater than (or equal to) 140, or the diastolic value is greater than (or equal to) 90, within a specified EHR". These types of query are very similar to the queries used in Web search engines, and are typically expressed as a list of keywords; hence they are called keyword queries (Hristidis, 2009). In addition to keyword queries, a user may input a search query.
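The basic behavior of such keyword queries can be shown with a very small retrieval loop. The records, field names, and scoring below are invented for illustration; real EHR search systems use inverted indexes and far richer ranking.

```python
# Toy keyword search over EHR text fields. Records and field names are
# invented for illustration only.

RECORDS = [
    {"id": 1, "history": "childhood asthma, seasonal allergies",
     "diagnosis": "acute asthma exacerbation"},
    {"id": 2, "history": "hypertension", "diagnosis": "migraine"},
    {"id": 3, "history": "no known conditions",
     "diagnosis": "respiration obstruction"},
]

def keyword_search(records, keyword):
    """Rank record ids by the number of fields mentioning the keyword."""
    kw = keyword.lower()
    scored = [(sum(kw in v.lower() for k, v in r.items() if k != "id"), r["id"])
              for r in records]
    return [rid for score, rid in sorted(scored, reverse=True) if score > 0]

print(keyword_search(RECORDS, "asthma"))  # only record 1 mentions the keyword
```

Note that record 3 ("respiration obstruction") is never returned for the keyword "asthma", even though it may be clinically related; this is exactly the related-concept limitation of plain keyword matching discussed below.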
Ideally, the user should be able to write a natural language query and receive a relevant answer. For instance, a natural language query is "Find patient records with frequent low blood pressure and family history of ventricular septal defect". Given the complexity of natural languages, there are no effective systems that accurately answer natural language queries. Between the two extremes of plain keyword queries and natural language queries, a large number of query languages have been proposed. For instance, some languages specify attribute name-value pairs, such as blood pressure = "low" AND history CONTAINS "ventricular septal defect". Answering such queries is challenging because an EHR may relate to asthma in many different ways; for example, asthma may be part of the patient history field or part of the patient's diagnosis field, or may not even be present in an EHR except as a related concept (respiration obstruction). Furthermore, assuming that there are EHRs for all of the above types of relationships to the query keyword, which one should be displayed first (Hristidis, 2009)?

Interoperability and Levels of Query Interfaces

Considering Figure 5.2, the health-care worker will need an additional support layer. The existing support can be compared as follows (Figure 5.6):

System programmers' level, for the development of the EHR system, using ADL and the Archetype Query Language (AQL).

Application programmer level, for the development of system applications, using XQuery, OQL (object query language) and SQL-type interfaces (assuming the RDBMSs may support ADL in future).

Health-care worker level interfaces: this is an active research area and no easy-to-use interfaces exist to date.

At the system level, the AQL language is supported for the development of the initial support infrastructure. The application level can benefit from XML conversions and support of a query

language at the application level, in the form of XQuery. Higher-level support is an active area of research. Many research efforts aim to improve user interaction facilities (Braga et al., 2005), (Jayapandian, 2009).

Figure 5.6: Query support at different levels (user interactions: XQBE/query forms/QBE; application development: XQuery/SQL/OQL; system development: AQL)

Challenges in Querying Archetype-based EHRs

EHRs allow multiple representations (Ma et al., 2007). In principle, EHRs can be represented as relational structures (governed by an object/relational mapping layer), and in various XML storage representations. There are many properties and classes in the reference model, but the archetypes constrain only those parts of the model which are meaningful to constrain. These constraints cannot be weaker than those in the reference model. For example, if an attribute is mandatory in the RM, it is not valid to express a constraint allowing the attribute to be optional in the archetype (ADL). A single ADL file is therefore not sufficient for querying: the user may want to query some properties or attributes from the RM, along with properties in the archetypes. In order to create a data instance of a parameter of an EHR, we need different archetypes in ADL, and these archetypes may belong to different categories of archetypes. At the user level, querying data regarding BP must be very simple: the user only knows BP as a parameter and will query that parameter only. For example, to create a data instance for blood pressure, we need two different archetypes, namely encounter and blood pressure. These archetypes belong to the different categories COMPOSITION and OBSERVATION. At query time, the system faces the problem of which archetypes must be included in the query.
For example, querying on BP requires the use of two archetypes: the encounter archetype (belonging to the COMPOSITION category of the RM) and the blood pressure archetype (belonging to the OBSERVATION category of the RM). This problem can be addressed by the use of templates. Archetypes are encapsulated by templates for the purpose of intelligent querying (T. and S., 2005). The templates are used for archetype composition, or chaining. Archetypes provide the pattern for data rather than an exact template. The result of the use of archetypes to create

data in the EHR is that the structure of data in any top-level object conforms to the constraints defined in a composition of archetypes chosen by a template. The EHR system must have an appealing and responsive query interface that provides a rich overview of the data and an effective query mechanism for patient data. The overall solution should be designed with an end-to-end perspective in mind. A query interface is required that will support users at varying levels of query skill, including semi-skilled users at clinics or hospitals.

Archetype Query Language (AQL)

To query EHRs, a query language, the Archetype Query Language (AQL), has been developed (AQL, 2011). It is neutral with respect to EHR systems, programming languages and system environments, and depends on the openEHR archetype model and its semantics and syntax. AQL is able to express queries on an archetype-based EHR system. The use of AQL is confined to the skilled programmers' level. It was first named EQL (EHR Query Language), which has been enhanced with the following two innovations (Ma et al., 2007): utilizing the openEHR path mechanism to represent the query criteria and the response or results; and using a containment mechanism to indicate the data hierarchy and to constrain the source data to which the query is applied. The syntax of AQL (Figure 5.7) is illustrated with the help of an example. Query: Find all blood pressure values, where the systolic value is greater than (or equal to) 140, or the diastolic value is greater than (or equal to) 90, within a specified EHR.
SELECT
    obs/data[at0001]/events[at0006]/data[at0003]/items[at0004]/value/magnitude AS Systolic,
    obs/data[at0001]/events[at0006]/data[at0003]/items[at0005]/value/magnitude AS Diastolic
FROM
    EHR [ehr_id/value=$ehrUid]
        CONTAINS COMPOSITION [openEHR-EHR-COMPOSITION.encounter.v1]
        CONTAINS OBSERVATION obs [openEHR-EHR-OBSERVATION.blood_pressure.v1]
WHERE
    obs/data[at0001]/events[at0006]/data[at0003]/items[at0004]/value/magnitude >= 140
    OR obs/data[at0001]/events[at0006]/data[at0003]/items[at0005]/value/magnitude >= 90

Figure 5.7: Syntax of AQL. The SELECT clause names the retrieved results using identified paths (e.g., the systolic path); the FROM clause is a class expression with archetype predicates; the CONTAINS mechanism constrains the source data.
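The path mechanism used above can be mimicked over an XML serialization of the data. In the sketch below, the XML layout (node identifiers kept in `id` attributes) is an assumption made for illustration, not an official openEHR wire format; the conversion function simply rewrites AQL-style path segments into XPath predicates.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical XML serialization of one blood-pressure observation, with
# archetype node identifiers stored in "id" attributes (assumed layout).
DOC = """
<observation>
  <data id="at0001">
    <events id="at0006">
      <data id="at0003">
        <items id="at0004"><value><magnitude>145</magnitude></value></items>
        <items id="at0005"><value><magnitude>85</magnitude></value></items>
      </data>
    </events>
  </data>
</observation>
"""

def openehr_path_to_xpath(path):
    """Rewrite segments such as data[at0001] into XPath id predicates."""
    return re.sub(r"\[(at\d+)\]", r"[@id='\1']", path)

root = ET.fromstring(DOC)
systolic_path = "data[at0001]/events[at0006]/data[at0003]/items[at0004]/value/magnitude"
diastolic_path = "data[at0001]/events[at0006]/data[at0003]/items[at0005]/value/magnitude"

systolic = float(root.findtext(openehr_path_to_xpath(systolic_path)))
diastolic = float(root.findtext(openehr_path_to_xpath(diastolic_path)))

# Mirror of the WHERE clause: systolic >= 140 OR diastolic >= 90
print((systolic, diastolic), systolic >= 140 or diastolic >= 90)
```

Here the reading (145/85) satisfies the disjunctive WHERE condition through its systolic value alone, which matches the intent of the example query.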

5.4 Querying the EHR Systems

Huser et al. (Huser et al., 2010) describe two fundamental ways of querying Enterprise Data Warehouse (EDW) data: direct authorship of the query code (the user constructs the query logic in a low-level query language) or the use of a query-building tool (a specific query application assists the user in the query composition). Direct authorship of the query code is very similar to conventional programming and requires substantial expertise in a given query language, plus substantial knowledge of the underlying database schema. Direct code authorship is often used for complex queries, the only restriction being the query language syntax. A non-expert EDW user usually collaborates with an expert analyst who is knowledgeable about the EDW data structures and query technologies. In this section, we focus on the latter querying methodology. We divide the various available methods into two categories, the top-down and bottom-up approaches, distinguished by how a query interface is developed.

Bottom-Up Approach

In the bottom-up approach, we consider how the database can be queried using a low-level representation of data. In this category, we consider the XQBE (XQuery By Example) and XGI (XQuery Graphical Interface) interfaces.

XQBE (XQuery By Example)

AQL is difficult for semi-skilled users: it requires knowledge of archetypes and of languages such as ADL, XML and SQL. For the convenience of health-care professionals, we propose to study the querying of archetype-based EHR systems through the XQBE query interface (Braga et al., 2005). An alternative approach proposed by Ocean Informatics (aql, 0 30) suggests using a query-builder tool to construct an AQL query; it requires form-related inputs and more skill on the part of the user. Our goal is easier to achieve with the help of XQBE, a user-friendly, visual query language for expressing a large subset of XQuery in a visual form.
Its simplicity, expressive power and direct mapping to XQuery are some of the highlighting features for its use. Like XQuery, XQBE relies on underlying expressions in XML; it requires all data to be converted to XML form, and presents the user with XML sub-tree expressions for the items of user interest. XQBE's main graphical elements are trees. There are two parts: the source part, which describes the XML data to be matched against the set of input documents, and the construct part, which specifies which parts will be retained in the result, together with (optional) newly generated XML items. In order to adopt an XQBE-like interface at the user level, we propose to convert ADL into XML. ADL can be mapped to an equivalent XML instance; thus, paths are directly convertible to XPath expressions. According to Beale and Heard (Beale and Heard, 2008b), the particular mapping chosen may be designed to be a faithful reflection of the semantics of object-oriented data. Some additional tags may be needed when mapping nested container attributes, since XML does not have systematic object-oriented semantics. Thus, single attribute nodes can be mapped to tagged nodes of the same name. Container attribute nodes map to a series of tagged nodes of the same name, each with the XML attribute id set to the node identifier. Type names map to XML type attributes. In the present proposal, the patient data description is converted to XML form (Sachdeva and Bhalla, 2009) and suitably reformed for the adoption of the XQBE interface. Thus, users can directly use the XQBE query interface to access patient data. This process eliminates the need to

learn and use the AQL language on the part of the users. The XQBE skills can be learnt with ease (Braga et al., 2005).

Mapping ADL to XQBE for EHR Data

Database queries are usually dependent on local database schemas, but the archetype systems being proposed aim to have portable queries. The queries play a crucial role in decision support and epidemiological situations. The XQBE approach for archetype-based EHRs is proposed for semi-skilled users (such as doctors, physicians, nurses). The mapping process to create XQBE consists of the following steps (Figure 5.8):

(i) Conversion of the ADL file into an XML file.
(ii) Generation of a DTD for the XML file.
(iii) Generation of the XQBE interface structure.

Subsequently, for the semi-skilled user, this three-step process will facilitate querying archetype-based EHRs. Step (ii) in the above process aids in the guided construction of the query provided by XQBE (XQBE 2.0.0).

Figure 5.8: Mapping process to present XQBE for EHRs: (i) ADL to XML, (ii) XML to DTD, (iii) DTD to XQBE (Sachdeva, 2009).

Query Scenario 1: Find all blood pressure values, having the systolic BP and diastolic BP (where systolic BP >= 140 and diastolic BP >= 90). The AQL syntax for the above query is shown in Figure 5.7. Using the XQBE approach for querying, we perform steps (i) to (iii), as explained, on the BP parameter. For each query, and for querying different parameters of the EHR, we need to convert each parameter (in ADL form) to a corresponding XML form for the demonstration. We propose to develop an automated tool in the subsequent phase. The clinical user is provided with a substituted XQBE interface (Figure 5.10) in place of AQL. XQBE is a visual interface: the user is presented with a graphical image of EHR components, for example BP in this case. Based on the selected source data, the user defines a target sub-tree

(in XML form) to express the query (and its outcome). The query is expressed by using the graphical elements of XQBE. The source part of the query is built using the DTD (Figure 5.9). A guided construction is provided to the user to add predicates for the query. The construct (or result) part of the query is built by the user using the graphical elements of XQBE, by dragging and dropping them.

<!ELEMENT adl_version ( #PCDATA ) >
<!ELEMENT archetype ( original_language, description, archetype_id, adl_version, concept, definition, ontology ) >
<!ELEMENT archetype_id ( value ) >
<!ELEMENT assumed_value ( terminology_id, code_string, magnitude?, units?, precision? ) >
<!ELEMENT attributes ( rm_attribute_name, existence, children+, cardinality? ) >
<!ATTLIST attributes xsi:type ( C_MULTIPLE_ATTRIBUTE | C_SINGLE_ATTRIBUTE ) #REQUIRED >
<!ELEMENT cardinality ( is_ordered, is_unique, interval ) >
<!ELEMENT children ( assumed_value | attributes | code_list | includes | item_list | node_id | occurrences | property | rm_type_name | target_path | terminology_id )* >
<!ATTLIST children xsi:type ( ARCHETYPE_INTERNAL_REF | ARCHETYPE_SLOT | C_CODE_PHRASE | C_COMPLEX_OBJECT | C_DV_QUANTITY | C_PRIMITIVE_OBJECT ) #REQUIRED >

Figure 5.9: A sample of BP.dtd.

Figure 5.10 shows the element nodes and sub-element nodes in the source part. The sub-elements (systolic and diastolic) of the BP element, one systolic and one diastolic satisfying condition 1 (systolic >= 140) AND condition 2 (diastolic >= 90), are described with the help of the XQBE convention. As per the convention, an arc containing "+" indicates that a child element node may exist at any level of nesting (as "//" in XPath). The construct part consists of an element node for BP (set tag T-node), and also element nodes for systolic and diastolic, which relate the projected BP element node to its systolic and diastolic sub-elements.
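As an illustration of Query Scenario 1, the predicate part of the XQBE/XQuery query can be sketched in a few lines of Python over a hypothetical XML serialization of blood-pressure readings. The element names below are illustrative assumptions, not the exact openEHR-to-XML mapping.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML serialization of blood-pressure observations
# (element names are illustrative, not the exact archetype mapping).
BP_XML = """
<observations>
  <blood_pressure><systolic>150</systolic><diastolic>95</diastolic></blood_pressure>
  <blood_pressure><systolic>120</systolic><diastolic>80</diastolic></blood_pressure>
  <blood_pressure><systolic>145</systolic><diastolic>92</diastolic></blood_pressure>
</observations>
"""

def high_bp_readings(xml_text):
    """Select readings where systolic >= 140 and diastolic >= 90,
    mirroring the predicate part of the XQBE query."""
    root = ET.fromstring(xml_text)
    results = []
    for bp in root.iter("blood_pressure"):
        systolic = int(bp.findtext("systolic"))
        diastolic = int(bp.findtext("diastolic"))
        if systolic >= 140 and diastolic >= 90:
            results.append((systolic, diastolic))
    return results

print(high_bp_readings(BP_XML))  # [(150, 95), (145, 92)]
```

In XQBE, the same predicates would be attached to the systolic and diastolic nodes of the source tree rather than written as code.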
The fragment node (shown by a filled triangle) indicates that the entire fragment of systolic and diastolic must be retained in the result.

XGI: XQuery Graphical Interface

XGI is a visual interface for graphically generating XQuery, proposed in (Li, 2007). XGI uses a web-based architecture, which makes the software easy to maintain and install for biomedical researchers. XGI provides users with a navigable source tree that assists users in understanding the source schema and gives them the ability to graphically choose elements from the source schema to be included in the query schema. XGI has a robust query creation process that can help users create expressive XQuery statements. Appendix B displays the XGI interface. To construct a query, one has to first select elements from the source tree, which is a representation of the complete schema of the XML database. These elements can be added to the constructed query tree. Once a source tree node is added to the query tree, it forms an implicit mapping edge between the source and query tree node. Users can also add a user-defined node

anywhere on the query tree to arbitrarily structure the query result. The user-defined node does not contain any mapping edge to a node in the source tree. Searching can eliminate the tedious and time-consuming task of finding the desired node in a large source schema tree. Nodes whose names begin with the search string are automatically returned. To distinguish result nodes with the same name, XGI displays each result node's path information in the information panel when users position the mouse over the node. Users can change the name of a query node so that the same query result is returned under a different tag name. The input to the XGI interface is a serialized instance of an XML file, from which a source tree is generated with the root element of that file. A small limitation of the XGI tool is that it is not an open-source tool. XGI can also be a candidate query interface for the archetype-based EHR system. Its implementation would be similar to that of the XQBE interface.

[Figure 5.10: BP.xqbe - an XQBE template for query (see online version for colours).]

Top-Down Approach

In the top-down approach, we consider how the inputs collected through a query form or a user-friendly interface can be mapped to the underlying query language. In this subsection, we describe XQBO (a combination of Query-by-Object using the corresponding .dtd of the XML file), RetroGuide (a flowchart-based approach), and the keyword search interfaces.

XQBO (Query by Object using DTD)

This is an enhancement of the previously proposed Query-by-Object (QBO) approach (Rahman et al., 2006), which faces the limitation of being generated from the underlying XML file. The enhancement is instead generated from the corresponding .dtd of the XML file to be queried upon. This technique is in the initial phases of development.
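The idea of driving a query form from the DTD rather than from the XML instance can be sketched as follows. The abridged DTD fragment and the parsing approach are simplifications for illustration, not the actual XQBO implementation.

```python
import re

# A fragment of the BP DTD from Figure 5.9 (abridged).
BP_DTD = """
<!ELEMENT archetype ( original_language, description, archetype_id,
                      adl_version, concept, definition, ontology ) >
<!ELEMENT archetype_id ( value ) >
<!ELEMENT adl_version ( #PCDATA ) >
"""

def dtd_elements(dtd_text):
    """Extract element names and their content models from a DTD,
    e.g. to drive a guided query-construction form (QBO-style)."""
    pattern = re.compile(r"<!\s*ELEMENT\s+(\w+)\s*\(([^)]*)\)")
    return {name: [c.strip() for c in model.split(",")]
            for name, model in pattern.findall(dtd_text)}

elements = dtd_elements(BP_DTD)
print(sorted(elements))  # ['adl_version', 'archetype', 'archetype_id']
```

Each extracted element and its children could then be rendered as selectable query-form fields without parsing any XML instance documents.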

RetroGuide

RetroGuide is a suite of applications which enables a more user-friendly analysis of Enterprise Data Warehouse (EDW) data. RetroGuide uses, as its graphical query metaphor, step-based flowchart models (called scenarios) to represent queries. A scenario has two layers: a graphical flowchart layer and a hidden code layer. The flowchart layer can be created and reviewed by users with limited expertise in database and query technology. The code layer is hidden behind the nodes and arrows of the flowchart and contains references to modular applications which can obtain EHR data or provide various analytical functions (Huser et al., 2010). The framework is extensible: new modular applications can be added, and the user can use scenario variables to combine related query criteria. There is a very close relationship between the scenario flowchart and the query execution engine. A query is viewed as a patient-centered, step-based process. The flowchart layer of a RetroGuide scenario represents these steps graphically, and the execution of the flowchart is enabled by references to modular RetroGuide external applications (RGEAs) that provide specific data-processing tasks, such as obtaining patient data, or performing data transformation and analysis. The XML Process Definition Language (XPDL) is the language behind RetroGuide. An open-source workflow editor called JaWE is used to model RetroGuide scenarios. Each node in a scenario may contain the execution of one or more external RGEAs. Arcs connecting the nodes represent the flow of logic and may contain a transition condition which further restricts the scenario logic, or implements branching or repetition logic. A detailed analysis of this approach is required to determine its suitability for an archetype-based EHR system. For this, we should be able to map the XML generated from the ADL file in our archetype-based EHR system to the XPDL format for RetroGuide.
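A toy emulation of the scenario idea can make the step/arc structure concrete: step functions stand in for RGEAs, and an if-condition stands in for an arc's transition condition. The patient fields and step names below are invented for illustration; a real scenario would be modeled in XPDL.

```python
# Minimal emulation of a RetroGuide-style scenario: each step is a
# small function over a patient record, and arcs carry transition
# conditions. Field and step names are invented.
patients = [
    {"id": 1, "systolic": [150, 155, 148]},
    {"id": 2, "systolic": [120, 118, 125]},
]

def step_fetch_bp(patient):
    """RGEA-like step: obtain patient data (here, systolic readings)."""
    return patient["systolic"]

def scenario_hypertension(patient):
    """Flowchart: fetch BP, then branch on 'all readings >= 140'."""
    readings = step_fetch_bp(patient)
    if all(r >= 140 for r in readings):   # transition condition on the arc
        return "hypertensive"
    return "normal"

print([scenario_hypertension(p) for p in patients])  # ['hypertensive', 'normal']
```

The patient-centered nature of the query is visible here: the scenario is executed once per patient record, rather than once per table.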
Another major challenge would be relating multiple concepts in the case of inter-related, dependent archetype concepts participating in a clinical decision query. It can also be interesting to view archetypes as components participating in a business process (patient care). An advantage of investigating this possibility is that RetroGuide business processes also handle the temporal nature of queries, which can be a key advantage for clinical diagnostic queries.

Keyword Search

Although the availability of electronic administrative and clinical data has facilitated many types of automated searches, the searching generally requires technical expertise, is bounded by slow batch processes, and requires tolerance for low sensitivity of results. Querying by keyword has emerged as one of the most effective paradigms for searching (Hristidis et al., 2010). The key focus of this technique is to determine how to rank the documents of a collection according to their goodness with respect to a query. EHRs are complex structured documents containing several associated clinical entities (e.g., physicians, medications, patients, and events). The common query is usually expressed as a list of keywords, similar to Web search. Factors like relevance to the query, specificity, and importance influence the goodness of the result. Relevance is a subjective judgment and may include being about the intended subject, being timely (recent information), being authoritative (from a trusted source), and satisfying the users' goals and their intended uses of the information (information need) (Hristidis et al., 2010). It can be observed whether the words in the query appear frequently in the document, in any order ("bag of words"). The importance of a result is determined by its authoritativeness (it should be from a trusted source). The specificity of a result is determined by its relevance (being on the proper subject and satisfying user goals) and conciseness (Hristidis et al., 2010).
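A minimal sketch of such "bag of words" ranking follows, using term frequency weighted by an inverse-document-frequency factor over invented clinical notes. This is a generic illustration of keyword ranking, not the specific scheme of Hristidis et al.

```python
from collections import Counter
import math

# Toy relevance ranking over EHR text fragments (invented examples).
docs = {
    "note1": "patient reports high blood pressure and headache",
    "note2": "blood test results normal no pressure complaints",
    "note3": "follow up for hypertension high blood pressure readings remain high",
}

def rank(query, docs):
    """Score each document by TF * log(1 + N/df) summed over query terms."""
    terms = query.lower().split()
    n = len(docs)
    df = {t: sum(1 for d in docs.values() if t in d.split()) for t in terms}
    scores = {}
    for name, text in docs.items():
        tf = Counter(text.split())
        scores[name] = sum(tf[t] * math.log(1 + n / df[t])
                           for t in terms if df[t])
    return sorted(scores, key=scores.get, reverse=True)

print(rank("high blood pressure", docs))  # ['note3', 'note1', 'note2']
```

Terms common to every note ("blood", "pressure") contribute less than the rarer "high", so the note mentioning "high" twice ranks first.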
EHRs are generally semi-structured (XML format) or completely structured (relational format). Health standards organizations like HL7 have been designing XML-based formats to represent

EHRs. When using keyword search for archetype-based EHR systems, the challenge will be to determine how to retrieve the data from the database, in the form of XML documents or a simple set of relational attributes. If the data is retrieved from the database in the form of XML documents, then keyword search can be applied, as mentioned in the above approach. The proper application of the methodology remains a topic of research.

[Figure 5.11: Human-System Interactions in the EHR domain. Interaction paradigms include telemedicine, hospitals, EHR-based applications, PHR portals, web services, a repository of EHRs, healthcare applications, and wireless interfaces.]

5.5 Human-System Interactions and Usability within EHR Systems

Users of EHR systems complain about their difficulties, in terms of the non-friendliness of systems and the non-intuitive support for data input. Ideally, the user should not need to scan or search through the entire health record. The user must be able to look at an application layer that accesses and presents a focused amount of information from a given health record or set of records. Unfortunately, current EHR systems are not optimized for this purpose. Instead, many of them function as something closer to an electronic version of the paper record. Figure 5.11 presents the human-system interaction in the health-care domain with respect to an EHR system. Currently, the health-care user may interact through various paradigms such as telemedicine, hospitals, and wireless interfaces. Skinput, gaze-based, and multitouch technologies may also be used. These interact with the system through various web services, as in the case of Personal Health Record (PHR) portals such as Google Health (Google, 2011) and Microsoft HealthVault (Microsoft, 2011). According to human-computer interaction principles, the user needs to play a major role in the design of such systems, whose ultimate purpose is to be useful for the end-users.
Also, professionals and patients should be treated at the same level, and these interfaces should provide the relevant information and screens to help them. The user must be able to perform tasks without having to understand the complete functionality of the system, and with a minimal number of clicks, to save time. In summary, a user-centric EHR system must be created and personalized according to the end-users, with the goal of reducing errors and improving the learnability, memorability, user satisfaction, and effectiveness of the EHR system.

[Figure 5.12: Human-System Interaction Monitoring Model for EHR Systems as per the TDQM framework. The cycle covers segmentation of end-users, understanding usability concerns, forecasting and prediction, and applying data mining techniques.]

Pattern Discovery: Mining End-User Needs

Considering an EHR system's design at the initial level, pattern discovery techniques can target the following areas or functions:

1. Segmenting end-users into groups with similar age, gender, demographic variables, user groups, purpose, work-flows and task-flows;
2. Understanding the lack of interest among users while using an EHR;
3. Anticipating end-users' future actions, browsing patterns, and reports demanded, based on their history and characteristics as per the above segmentation; and
4. Predicting operations that are needed.

Existing research studies use manual methods, such as observing the end-users on site and recording their behavior and choices through questionnaires and surveys, as a means to capture data for end-user involvement in design. Most often, the hidden values and data mining of resources have the potential to predict trends and user behavior. The use of the above techniques has been emphasized in the proposed human-system interaction model. Figure 5.12 presents a human-system interaction model within the EHR domain. Its aim is to refine the design ideas and test the best options through multiple iterations. It contains a cycle for the various steps involved in the interaction model. The model will take various user inputs and extract knowledge that will help the designer evolve a more efficient EHR system. The performance of the resultant EHR system will be more efficient in terms of time, with: a reduced number of clicks, prior knowledge of complex sections of the EHR system, customized navigation, a user-friendly interface, and a reduced cost of improving the EHR system developed.
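Point 3 above (anticipating end-user actions) can be sketched as a first-order transition model learned from navigation logs. The screen names and logs below are invented for illustration.

```python
from collections import Counter, defaultdict

# Predicting a user's next screen from past navigation logs: a tiny
# first-order transition model. Screen names are invented.
logs = [
    ["login", "patient_list", "bp_chart", "logout"],
    ["login", "patient_list", "lab_results", "logout"],
    ["login", "patient_list", "bp_chart", "prescriptions"],
]

# Count observed screen-to-screen transitions.
transitions = defaultdict(Counter)
for session in logs:
    for current, nxt in zip(session, session[1:]):
        transitions[current][nxt] += 1

def predict_next(screen):
    """Most frequent screen observed after `screen`, or None."""
    follow = transitions[screen]
    return follow.most_common(1)[0][0] if follow else None

print(predict_next("patient_list"))  # 'bp_chart'
```

A prediction like this could pre-load the likely next screen, contributing to the reduced click counts and customized navigation mentioned above.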
5.6 Discussions

It is methodologically difficult to evaluate advanced data query systems, and only a subset of previous publications about query systems includes an evaluation section (Huser et al., 2010). The spectrum reflecting the degree of formal evaluation in query systems publications would be:

1. No formal evaluation method presented (system features or methodology are descriptive only).
2. Partial presentation of several example queries, with or without comparison to other query technologies.
3. Complete single or multiple case studies (query and results) where the system is used to solve a concrete analytical problem.
4. Presentation of system usage statistics demonstrating technology adoption by users (Huser et al., 2010).

Formerly, query-building tools have been specifically designed for non-expert users and offer a set of pre-designed features which are easier to use. A query-building tool provides an additional query modelling layer which eventually generates query code in one, or a combination, of several query languages, with the aim of simplifying the query task for the end-user. Although a query-building tool enables non-experts to execute queries, it often limits query expressiveness. A common challenge for many query-building tools is the case of a complex query, which cannot be authored within the tool (Huser et al., 2010). This limitation can arise from several factors: a limited user interface, where the chosen graphical metaphor or the tool's native modeling paradigm cannot support all necessary query criteria; limited capability to combine interim solution layers within the tool (e.g., the output of one query criterion is the input for another criterion); or an underlying low-level query language that is too restrictive and cannot be extended with user-defined functions or combined with additional technologies within the tool.

XQBE consists of two components: the XQBE client and the XQBE server. The server translates the query result tree to the XQuery statement, executes the query over any arbitrary XML data source, and returns the result to the XQBE client.
The client is a stand-alone, Java-based graphical query editor that allows users to construct the query result tree from any arbitrary XML source schema by explicitly defining both the source tree and the query result tree. The XQBE interface proposed for an archetype-based EHR system takes the XML document as input from the user and generates the source tree, on which a domain expert who is completely aware of the archetype structure can choose the attributes to be queried. The attributes selected for the result are shown in the construct part as a tree. Such an interface, though it represents the underlying structure of the archetypes being queried, is not convenient for naive end-users such as nurses or administrative staff who are not clinical domain experts. XGI can be used to browse the source schema, define the query schema, and visualize the query output graphically. It is likely that the queries generated by biomedical researchers will often become more complex because of the interrelated ontologies among every single concept or element. XGI cannot handle complex scenarios like nesting (hierarchical binding), aggregates, sorting, negation, filtering, arithmetic computations, and distributed query generation (Bales et al., 2005). Some interface functionalities need to be improved in XGI, such as support for adding multiple nodes to the query tree simultaneously, and implementing a query-in-place feature that can automatically validate the user's query schema. The keyword search may be more user-friendly, but when implemented in a complex system like an EHR system, it faces issues like ranking of the results according to their

relevance. For instance, if a user query specifies "Find high blood pressure patients", either the systolic or the diastolic BP may be high, so the system fetches all such results. There should be a mechanism to rank these results. It will be a challenge to match the criteria of relevance between the end-user's perspective and the system's perspective; in other words, how successful a given system is at ranking the results. Relevance is a subjective judgment and may include being about the intended subject, being timely (recent information), being authoritative (from a trusted source), and satisfying the users' goals and their intended uses of the information (Huser et al., 2010).

A qualitative comparison of the above-mentioned approaches from an end-user perspective is shown in Table 5.2. For the XQBE and XGI interfaces, the users need to be aware of the underlying knowledge and structure of the archetype; therefore, ease of use is termed partial. RetroGuide, in contrast, is based on the clinical workflow process and so models the query according to the end-user. The keyword search likewise takes keywords as input from the user. The next parameter for evaluation is display of result: from an end-user perspective, whether the output displayed is comprehensible or not. The time-efficiency parameter indicates whether fetching the results involves intermediate, time-consuming tasks. The ability to handle complex queries is an important evaluation criterion, as the structure of EHR data is complex and the concepts are interrelated. An interface with the capability to handle complex query scenarios is an asset for any system. The main goal in distributed health-care is to maintain the semantics across levels and organizations; the approaches in Table 5.2 promise or exhibit SIOp.

5.7 Summary and Conclusions

In traditional RDBMSs, the top-down and bottom-up approaches for query interface development cannot be distinguished significantly.
They are almost similar because of the relational nature of the tables involved. The available and most popular interface, QBE (Zloof, 1977), presents the user with an exact perception of the underlying database tables and gives them the sense of manually manipulating these tables. The user formulates queries by filling the appropriate table rows with an example of the possible answers. The user only needs to distinguish between the variable and the constant parts of the query. Since there is a direct mapping between the physical-level view and the high-level view, the user's view is readily mapped to the underlying database. In the case of archetype-based EHR systems, the querying tasks are more complex, with a number of archetypes required to be referred to together. The hierarchical structure of the archetype adds to the complexity of information retrieval. If a bottom-up approach is used for query interface development, then the database can be queried using AQL (which combines SQL syntax with traceable paths). This AQL query needs to be transformed to an XQuery, and then a high-level interface like XQBE can be developed based on the XQuery. The resultant interfaces require the user to be trained in the complete underlying structure of the archetype files. All these transformations are complex, and expensive in terms of effort, space, and time. If a top-down approach is used for query interface development, a natural-language form may be presented to the user. A form-based input empowers the user with the complete functionality of the system without knowledge of the underlying structure. In this form, the end-user can provide inputs, which may be magnitudes for certain attributes, or keyword (string) inputs. The challenge is to map this user input onto the underlying archetype attributes and the combination of ADL files that are involved (case of complex

input). This mapping needs to be resolved, and the business logic is mapped to the underlying archetypes by means of the ADL files. Exploring the use of RetroGuide (Huser et al., 2010) for this purpose will be an important research aspect, as it will allow modeling the clinical queries in the form of business processes, which may be temporal in nature. The external libraries of RetroGuide need to be explored to determine whether they are capable of fetching the data from archetype-based systems. The study provides an analysis of the various facets and levels of semantic interoperability required in the health-care domain. It presents a comparison of the approaches from an end-user's perspective. It concludes the discussion by categorizing the approaches into either the bottom-up or the top-down method, and further discusses which will be more appropriate for a dual-level standardized EHR system. It considers that querying will be a major issue in the future, as EHR systems are moving towards standardization and growing exponentially.

Table 5.2: Comparison of various querying approaches for the EHR users.

Ease of Use:
  XQBE: Partial (training required). XGI: Partial (training required). QBO+DTD: Partial (training required). RetroGuide: Good (forms a process). Keyword Search: Good (enter a keyword).
Ability to handle complex queries:
  XQBE: Not very good. XGI: Not very good. QBO+DTD: Needs to be explored. RetroGuide: Needs to be explored. Keyword Search: Needs to be evaluated.
Display of user-interpretable result:
  XQBE: Hierarchical tree (part of the source tree). XGI: Graphical query output. QBO+DTD: Detailed output, with each row further expandable. RetroGuide: All occurrences, with ratios/percentages and an option to expand a particular result. Keyword Search: Displays top results based on rank.
Time efficiency:
  XQBE: No evaluation available. XGI: No evaluation available. QBO+DTD: No evaluation available. RetroGuide: Assumed to be quick, as the data is fetched using the already-created hidden Java libraries. Keyword Search: No evaluation available.
Semantic Interoperability:
  XQBE: Yes. XGI: Yes. QBO+DTD: Yes. RetroGuide: Yes. Keyword Search: Yes.
These systems allow complete access to the information for an end-user of the future EHR systems. The evolution of EHRs can promote ease of use of health-care systems. At the same time, the internal complexities of the systems have increased. As the amount of data becomes voluminous, there is an increasing need for interoperability, for epidemic studies, and for a user-friendly system. The traditional approach should be replaced for EHR systems by

involving the principles of the TDQM framework. In the health-care domain this is essential to reduce health risks with each change and within the process of evolution. For this purpose, an understanding of user needs, their inhibitions in using the system, and prediction of their navigation patterns are required. Traditional methods, such as observing the user or filling out surveys, can aid this requirement, but these efforts are not efficient and have a limited scope. Thus, a framework targeting the design phase of an EHR system is required.

Chapter 6

Quasi-relational Query Language and Standardized EHRs

6.1 Persistence and Querying for Standardized Electronic Health Records

In this section we discuss the various existing approaches for persistence and in-depth querying of archetypal EHRs.

6.1.1 Querying in Archetypal EHRs

Querying a system with the dual-model architecture is different from querying a relational or an XML database system. Here, the user is only aware of concepts such as blood pressure or heart rate, and intends to query them. Hence, there is a need for query support independent of the system implementation, application environment, and programming language. Domain professionals and software developers should both be able to use the query language or query-language interface. The user may query some properties or attributes from the RM and archetypes. The different categories of archetypes have different structures. They are encapsulated by templates for the purpose of intelligent querying (Beale, 2008). As a result, the structure of data in any top-level object conforms to the constraints defined in a composition of archetypes chosen by a template. Archetype paths connect the classes and attributes of the reference model (RM). They also form the basis of reusable semantic queries on archetypal data. Queries can be constructed using these paths, which specify data items at the domain level. For example, paths from a blood pressure measurement archetype may identify the systolic blood pressure (baseline), systolic pressures for other time offsets, patient position, and numerous other data items (T. et al., 2008), (Beale, 2008). OpenEHR proposes the Archetype Query Language (AQL) (AQL, 2011) to support querying of the archetypes based on ADL (Archetype Definition Language) syntax (archetype paths). The paths for querying are incorporated within a familiar SQL-style syntax. However, AQL is not very useful for end-users (clinicians) due to its complex syntax.
The inability to perform aggregate queries (population-based queries) is a critical barrier to its adoption for quality health-care delivery.

1 Research Publication(s): Aastha Madaan, Wanming Chu, Daigo Yaginuma, Subhash Bhalla, "Quasi-Relational Query Language Interface for Persistent Standardized EHRs: Using NoSQL Databases," LNCS, Vol. 7813, Springer Berlin Heidelberg, pp , ISSN: (Print) (Online).
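The path-based addressing described above can be illustrated with a small sketch that walks simplified archetype-style paths through nested data. The path strings and field names below are illustrative assumptions, not actual openEHR archetype paths.

```python
# Archetype-style paths resolved against nested data: a sketch of how
# path-based queries address items inside a composition. Paths and
# structure are simplified for illustration.
record = {
    "blood_pressure": {
        "data": {
            "systolic": {"magnitude": 142, "units": "mm[Hg]"},
            "diastolic": {"magnitude": 91, "units": "mm[Hg]"},
        }
    }
}

def resolve(path, node):
    """Walk a '/'-separated archetype-style path through nested dicts."""
    for segment in path.strip("/").split("/"):
        node = node[segment]
    return node

print(resolve("/blood_pressure/data/systolic/magnitude", record))  # 142
```

An AQL-like query engine would evaluate such paths against every stored composition and apply the WHERE predicates to the resolved values.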

6.1.2 Methods for Persisting the Archetypal EHRs

With the widespread adoption of EHRs by various health organizations across the globe, a large amount of health data is readily available (Sachdeva and Bhalla, 2012). As a result, there is an increasing need to utilize and manage this data to deliver quality health-care. The OpenEHR artefacts (archetypes and templates) have a deeply nested structure, and each concept has its own data nodes. The persistence layer for these EHRs needs to be capable of handling such a structure. The EHR data belonging to the OpenEHR standard's reference model (RM) can be serialized into several formats, such as JSON, XML, and others (Freire et al., 2012). At present, the OpenEHR forum (T. et al., 2008) does not define any persistence method for the archetypal EHRs. The Opereffa prototype system (Opereffa, 2011) makes use of relational-model-based persistence for storing the EHR data. It uses the PostgreSQL database (PostgreSQL, 2011) with a single relation, archetype data. This methodology may not be very suitable for in-depth querying of the complex-structured archetypal data, because each path value (data value) is stored as a value in the relation and cannot easily be presented as a queryable attribute to the users. Moreover, the volume of EHR data is increasing exponentially, due to which a single table may give rise to issues of scalability and performance. The research work in (Freire et al., 2012) discusses the applicability of XML-based persistence for the archetypal EHRs. It concludes through experimentation that XML-based persistence does not satisfy the needs of storing the standardized EHR data. The way the archetypes are designed, and the nature of the data values stored in the database, make the automatically generated indexes in the XML databases inefficient. Moreover, the tree-structured archetypes are relatively deep and comprise repeated path-segment identifiers.
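The JSON serialization option mentioned above can be sketched directly: an RM-style entry becomes a nested dictionary that round-trips losslessly through `json.dumps`/`json.loads`. The field names follow the concept/sub-concept nesting described for the blood pressure example, but are illustrative only.

```python
import json

# Serializing a hypothetical RM-based blood-pressure entry to JSON,
# one of the serialization formats mentioned above. Field names are
# illustrative, not the exact RM class attributes.
bp_entry = {
    "name": "blood_pressure",
    "items": [
        {"name": "systolic", "magnitude": 142, "units": "mm[Hg]"},
        {"name": "diastolic", "magnitude": 91, "units": "mm[Hg]"},
    ],
}

serialized = json.dumps(bp_entry)        # store as a document
restored = json.loads(serialized)        # round-trips losslessly
print(restored["items"][0]["magnitude"]) # 142
```

Because the nesting survives serialization, the sub-concepts remain individually addressable after storage, unlike an opaque blob.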
This requires the persistence layer to facilitate easy querying of these structures, along with being capable of performing in-depth querying. The openEHR databases are based on the dual-level modeling approach, and their main purpose is to allow users to query the archetypal data at the patient and population levels. Hence, the choice of persistence mechanism is based on the requirements for query. Since the RM (openEHR reference model) has a large number of classes with deep hierarchies, a pure object-relational mapping is not sufficient (Freire et al., 2012). The EHR data generated according to the RM can be serialized into several formats, such as JSON, XML, and others. The work given in (Freire et al., 2012) analyzes the performance of various XML-based databases, Berkeley DB XML, Sedna, MySQL, BaseX, and eXist, with respect to clinical and epidemiological queries. The results show that the XML-based databases are not sufficient for epidemiological queries, as they result in large response times, and the number of records fetched was incorrect. On the other hand, the NoSQL databases are increasingly being used for large-scale and distributed heterogeneous data. MongoDB is ideal for knowledge bases, on-line journals, survey repositories, or for storage of externally generated clinical information that cannot be ported easily to an RDBMS EHR. Such information can be linked to a patient's active EHR profile and remain searchable, but stay separate, for legal and clinical reasons, from more trusted in-house generated data (ehr, 2013). Moreover, MongoDB uses JSON documents that can contain openEHR-based data. Hence, we chose a NoSQL database for experimenting with storage and querying of standardized EHR data.

6.1.3 NoSQL-based Persistence for the Standardized EHRs

The above requirements necessitate moving from the traditional relational and XML databases to highly scalable, high-performance, schema-less databases termed NoSQL databases (Redmond and Wilson, 2012).
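The style of querying a document store offers over nested JSON EHR entries can be emulated in pure Python, as a sketch of MongoDB-like dot-notation field matching. This is an illustration of the query model only, not the pymongo API.

```python
# MongoDB-style dot-notation matching emulated in pure Python, to show
# how a document store can query sub-fields of nested JSON EHR entries.
# Documents and field names are invented; this is not the pymongo API.
docs = [
    {"patient": 1, "bp": {"systolic": 150, "diastolic": 95}},
    {"patient": 2, "bp": {"systolic": 120, "diastolic": 80}},
]

def get_path(doc, dotted):
    """Follow a dotted field path such as 'bp.systolic' into the document."""
    node = doc
    for key in dotted.split("."):
        node = node.get(key) if isinstance(node, dict) else None
        if node is None:
            return None
    return node

def find(docs, field, op, value):
    """find(docs, 'bp.systolic', '$gte', 140) ~ {'bp.systolic': {'$gte': 140}}"""
    ops = {"$gte": lambda a, b: a is not None and a >= b,
           "$lt": lambda a, b: a is not None and a < b}
    return [d for d in docs if ops[op](get_path(d, field), value)]

print([d["patient"] for d in find(docs, "bp.systolic", "$gte", 140)])  # [1]
```

In MongoDB itself, such a dotted field can also carry a B-tree index, which is what makes sub-field queries over large EHR collections practical.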
The NoSQL databases belong to different categories, such as

column stores, key-value stores, document-oriented databases, and graph databases (Redmond and Wilson, 2012). MongoDB is a JSON-based, document-oriented NoSQL database which maintains high queryability, similar to the relational databases, while at the same time allowing high scalability (Redmond and Wilson, 2012). It supports B-tree indexes similar to the indexes in other database systems. Moreover, any field or sub-field contained in documents within a collection (similar to a relation) can be indexed. Using queries with good index coverage can reduce the number of full documents that MongoDB needs to store in memory, thus maximizing database performance and throughput. In this work, it is used for the persistence of the EHRs and to support relational-like queries for the EHRs using the proposed database system. An openEHR-based EHR is a collection of templates, and each of these templates in turn is formed by single or multiple archetypes. Table 6.1 presents a snippet for the blood pressure concept in the JSON format. The nested levels of the archetype structure are preserved within the document. The nodes of the blood pressure archetype are captured along with the unique paths, and the field values are used in querying. The name attribute represents the blood pressure concept, within which the systolic and diastolic sub-concepts are nested using the name attribute. Hence, an EHR can be mapped to a collection of JSON documents, and the granularity of querying is available at the EHR level, the individual archetype level, and the level of archetype fields. This eliminates the need for the copyrighted archetype parsing tools available at openEHR (Ocean Informatics) (T. et al., 2008). Also, the structure of each of the archetypes is preserved and can easily be flattened into a form on user selection.

6.1.4 Research Questions and Problem Statement

According to the OpenEHR forum (T.
et al., 2008), the persistence design for an OpenEHR system should provide good performance and query-ability. Although serialization of the RM is possible (previous subsection) but the application of serialization-into-blobs approach directly onto archetypal EHRs may not be very successful (Freire et al., 2012). In such a case, the sub-trees will be serialized as blobs and indexing will be applied on fields within the blob. These are stored in a one-column relational database. On the other hand, the user-queries can vary in granularity and if the queried field is not indexed entire blobs may need to be searched and de-serialized. Moreover, any change (updating/addition) in a query-able attribute will need change in table design and migrate the data. In the OpenEHR case, an Entry-level item exists inside Compositions distinguishing the concept which is represented by the particular archetype (T. et al., 2008). If Entries are serialized into blobs and stored opaquely, containing sections and compositions are stored transparently (using object-relational mapping), then queries on these transparently stored items will work. However data stored below the Entry-level (e.g. time in Observations, diagnoses in Evaluations) will be opaque, and require some kind of special indexing columns (Beale, 2005). This reduces the granularity for querying. The semantic paths in OpenEHR data provide a generic serialized-blob design. All data nodes will be serialized, and the path of each blob recorded next to it in a two-column table of node path, serialized node value, with an index on the path column. The paths need to be unique and sufficient for reconstituting the data on retrieval. For fine-grained objects hybrid indexing can be applied: fine-grained sub-trees will be serialized to single blobs, with their entire path structure being recorded; higher nodes are serialized singly with single paths being recorded against them (Beale, 2005). 
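The two-column (path, serialized node) design described above can be sketched in a few lines. The `PathStore` class and its method names below are illustrative, not part of the openEHR specification or of the Beale (2005) design; the prefix scan stands in for the indexed lookup on the path column.

```javascript
// Minimal sketch of path + node persistence: every node is stored against
// its unique archetype path, and a sub-tree is reconstituted by collecting
// all paths under a prefix. Names (PathStore, put, findByPrefix) are
// illustrative only.
class PathStore {
  constructor() {
    this.rows = new Map(); // node path -> serialized node value
  }
  put(path, serializedNode) {
    this.rows.set(path, serializedNode);
  }
  get(path) {
    return this.rows.get(path);
  }
  // Collect all nodes under a path prefix (stand-in for the path index).
  findByPrefix(prefix) {
    const result = {};
    for (const [path, value] of this.rows) {
      if (path.startsWith(prefix)) result[path] = value;
    }
    return result;
  }
}

const store = new PathStore();
store.put('/data[at0001]/events[at0006]/data[at0003]/items[at0004]/value',
          '{"magnitude":120}');
store.put('/data[at0001]/events[at0006]/data[at0003]/items[at0005]/value',
          '{"magnitude":80}');

// Reconstitute the whole measurement sub-tree (systolic and diastolic).
const subtree = store.findByPrefix('/data[at0001]/events[at0006]/data[at0003]');
```

A real store would also key rows by the version id discussed in the next subsection; the sketch omits that to keep the path mechanism visible.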
The path + node approach is independent of the object model. Simple tabular data can be stored efficiently in a relational database, and fine-grained nodes can be queried directly using paths extracted from templates and archetypes.

Figure 6.1: The system architecture of the proposed standardized EHRs database system.

However, the following research questions remain open with respect to querying the EHR data:

1. Ensuring uniqueness across all data, as archetype-based paths only provide uniqueness on a combination-of-archetypes basis. The full primary key for any given node is the tuple (version id, path), where the version id includes the GUID (globally unique identifier) of the top-level object. Moreover, the version is unknown (Beale, 2005).

2. The paths of the nodes need to be parsed and compared quickly to answer user queries (Beale, 2005).

3. Determining the right granularity for grouping in the hybrid variant depends on the context of the application and the granularity desired by the user (Beale, 2005).

4. For the delivery of quality health care, the EHR persistence mechanism should support complex querying of data about an individual (clinical queries) as well as about a whole population (epidemiological queries) (Freire et al., 2012).

Problem Statement

This study is aimed at persisting the complex, hierarchically structured archetypal EHRs in a NoSQL database and enabling a high-level query language interface over them, to simplify decision making for clinicians and other health-care workers. The AQBE query generator is proposed in place of AQL support for the openEHR-based standardized EHRs, using a NoSQL database.

6.2 Proposed Approach

Standardized EHRs Database System Architecture

The architecture of the proposed system is divided into a client side and a server side (Figure 6.1). The client-side component comprises the AQBE editor. Its user interface, the AQBE query generator, serves two purposes: (i) it allows the user to insert EHR data, and (ii) it enables the user to query the EHR data. The server side consists of two data repositories: (i) the local archetype repository and (ii) the standardized EHR data repository. It also contains the Archetype Parser and DbAccessAPI modules. The local archetype repository contains the XML files corresponding to the archetypes downloaded from the openEHR Clinical Knowledge Manager (CKM) (CKM, 2012). It is created as an offline support process (as shown in Figure 6.1), and the archetype definitions within it are updated only if the corresponding archetypes change in the CKM. It is interfaced with the Archetype Parser module, which parses the XML of an archetype to extract its data nodes. These data nodes are then used for input/query form generation, which is independent of the underlying persistence layer. The EHR database stores the EHR data and is used to retrieve user-query results. The DbAccessAPI is a set of driver classes which interact with the database.

AQBE Query Generator

We propose a new AQBE query generator which provides a quasi-relational query language interface. The evolved interface comprises two interfaces, for data input and for querying respectively. The input interface allows health-care workers to save patient data through archetypes; it provides various templates (using archetypes) for data-entry forms. The second interface is for querying, where the users can formulate their queries.
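The translation the AQBE generator performs, from filled form fields to a database filter, can be approximated as follows. The `buildQuery` function and its input shape are hypothetical simplifications; they only mirror the key convention (concept name joined to the archetype path) visible in the sample queries later in this chapter.

```javascript
// Hypothetical sketch: turn user-entered form fields into a MongoDB-style
// filter document. Each field carries the archetype path its form node maps
// to, plus an operator and value chosen by the user. buildQuery and the
// field shape {path, op, value} are illustrative, not the system's API.
function buildQuery(concept, fields) {
  const clauses = [
    { ehr: { $exists: true } },       // scope to EHR documents
    { [concept]: { $exists: true } }, // restrict to the chosen concept
  ];
  for (const f of fields) {
    // Key convention: "<concept>.<archetype path>", as in the sample queries.
    clauses.push({ [`${concept}.${f.path}`]: { [f.op]: f.value } });
  }
  return { $and: clauses };
}

// Example: the "systolic >= 140" restriction from a filled form.
const q = buildQuery('blood pressure', [
  { path: '/data[at0001]/events[at0006]/data[at0003]/items[at0004]/value/magnitude',
    op: '$gte', value: 140 },
]);
```

The point of the sketch is that the end user never writes this filter: the form supplies the concept name, paths and operators, and the generator assembles the document.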
This quasi-relational, high-level query language interface does not require end users to have any prior knowledge of the database structure or of query-language syntax. Figure 6.2 depicts the user interface of the database system. The left-hand side depicts the query interface, while the right-hand side represents the input interface. The user selects either of them from the top menu bar, based on the function he or she wishes to perform. Further, the user selects the concept to input or query using the drop-down menu provided under the main menu bar. This generates the form for the concept, in which the user enters values for the attributes to be inserted or queried.

System Specifications

The query editor is implemented in the Eclipse IDE (ecl, 5 20) and is hosted on the Play framework (with the JVM). The prototype is available online (aqb, 1 30). The tools used in system development are given below.

1. Server-Side Tools. On the server side, MongoDB (Mon, 8 30) is used. The Play framework (PLA, 8 30) is used along with the Scala language (Sca, 8 30). Play offers a lightweight, stateless and web-friendly architecture with minimal resource consumption, while Scala integrates the features of object-oriented and functional languages; it replaces Java. Casbah 2.3 (cas, 9 30) is a MongoDB driver which supports Scala. JSON strings are parsed using the Scala plugin Lift JSON (jso, 8 30).

Figure 6.2: Screenshot of the interface of the AQBE query generator of the proposed database system.

2. Client-Side Tools. For the front end, HTML5 [13] and JavaScript (Jav, 2 30) are used. jQuery (jqu, 2 30) and jQuery UI (jqu, 8 30) are used to capture the query fields. Any+Time DatePicker 4.11 (dat, 0 30) is used for capturing times in the form fields, and the Twitter Bootstrap (twi, 8 30) toolkit is used for designing the front end of the AQBE editor.

For the prototype system, 40 archetypes were downloaded from the Clinical Knowledge Manager (CKM, 2012). The archetypes are of composition type. For front-end interface generation, each archetype is parsed and mapped to a corresponding form. Each form field corresponds to a node in the hierarchical structure of the archetype, and the archetype is flattened for form generation.

AQBE Runtime

The AQBE system performs three main functions; the process flow for each is given in this subsection.

1. Form Generation. The form-generation process is independent of the underlying persistence layer. A form corresponding to each concept (archetype) can be generated by accessing the local archetype repository. Figure 6.3 shows the steps and mappings for form generation: the user accesses the interface and selects a concept; the corresponding XML is retrieved from the archetype repository; and a series of internal format transformations generates the form on the user interface of the AQBE editor.

Figure 6.3: Steps for generating the forms on the AQBE editor.

Figure 6.4: Steps performed for patient-data insertion.

2. Input Patient Data. To insert the patient data entered by the user into the database, the system captures the data fields from the form. Each field corresponds to a node of the archetype. Each concept (archetype) on the form is stored, after conversion to internal formats, as a document in the MongoDB-based EHR data repository, and each of these documents has a unique id (Figure 6.4).

3. Query Patient Data. The user can query the EHR data using the query module of the editor. He or she selects a concept to query, and the corresponding form is generated for input; this is similar to the QBE interface for relational databases. The user enters the conditions and attribute (field) values on which the data are to be queried. In response, the result is fetched from the underlying EHR database and presented to the user. The steps are shown in Figure 6.5.

As mentioned earlier, the AQBE system serializes the RM and the archetypes into JSON format. Table 6.1 presents a snippet of the blood pressure concept in JSON. The nodes of the blood pressure archetype are captured along with their unique paths, and the field values are used in querying through the user forms. The name attribute represents the blood pressure concept, within which the systolic and diastolic sub-concepts are nested using their own name attributes. This approach is similar to the hybrid persistence approach discussed earlier in this chapter.
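The flattening step of form generation can be sketched directly against the JSON shape shown in Table 6.1. The helper name `toFormFields` and the field descriptor shape are illustrative, not the system's actual code.

```javascript
// Flatten an archetype (in the JSON shape of Table 6.1) into flat form
// fields: one input per leaf node, labelled with the node name and keyed
// by the unique archetype path. toFormFields is an illustrative helper.
function toFormFields(archetype) {
  return archetype.datalist.map(node => ({
    label: node.name,                      // shown next to the input box
    key: `${archetype.name}.${node.path}`, // unique key used in querying
    type: node.datatype,                   // e.g. DvQuantity
    unit: node.unit ? node.unit[0] : null, // display unit, if any
  }));
}

// Abbreviated blood pressure archetype, following Table 6.1.
const bloodPressure = {
  name: 'blood pressure',
  datalist: [
    { name: 'Systolic',
      path: '/data[at0001]/events[at0006]/data[at0003]/items[at0004]/value',
      datatype: 'DvQuantity', min: [0.0], max: [1000.0], unit: ['mm[Hg]'] },
    { name: 'Diastolic',
      path: '/data[at0001]/events[at0006]/data[at0003]/items[at0005]/value',
      datatype: 'DvQuantity', min: [0.0], max: [1000.0], unit: ['mm[Hg]'] },
  ],
};

const fields = toFormFields(bloodPressure);
```

Each resulting descriptor corresponds to one input on the generated form, and its `key` is exactly the path used later when the entered value is stored or queried.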

Figure 6.5: Steps performed for the querying process on the AQBE editor.

6.3 Experiments

It is important to consider the queries that can be performed through the proposed AQBE interface. At present, most single-patient queries can be executed successfully. So far, a set of 16 queries has been executed to exhibit the strengths and explore the weaknesses of the proposed system. For reference, a sample of 4 queries and their JavaScript equivalents, as executed on the prototype system, are given below.

1. Query: Return the value of laboratory-glucose for a specific patient.
JS equivalent:
  db.docs.find(
    {$and: [{}, {"ehr": {$exists: true}}, {"encounter": {$exists: true}},
            {"laboratory-glucose": {$exists: true}}]},
    {"_id": 0, "ehr./name/value": 1})

2. Query: Find all blood pressure values where the systolic value is greater than or equal to 140, or the diastolic value is greater than or equal to 90, within a specified EHR.
JS equivalent:
  db.docs.find(
    {$and: [
      {$or: [
        {$and: [
          {"blood pressure./data[at0001]/events[at0006]/data[at0003]/items[at0004]/value/magnitude": {$gte: 140}},
          {"blood pressure./data[at0001]/events[at0006]/data[at0003]/items[at0004]/value/units": "0"}]},
        {$and: [{$and: [
          {"blood pressure./data[at0001]/events[at0006]/data[at0003]/items[at0005]/value/magnitude": {$gte: 90}},
          {"blood pressure./data[at0001]/events[at0006]/data[at0003]/items[at0005]/value/units": "0"}]}]}]},
      {"ehr": {$exists: true}},
      {"encounter": {$exists: true}},
      {"blood pressure": {$exists: true}}]},
    {"_id": 0,
     "blood pressure./data[at0001]/events[at0006]/data[at0003]/items[at0004]/value/magnitude": 1,
     "blood pressure./data[at0001]/events[at0006]/data[at0003]/items[at0005]/value/magnitude": 1})

3. Query: Get all HbA1c observations that have been done in the last 12 months for a specific patient.
JS equivalent:
  db.docs.find(
    {$and: [
      {$and: [{"ehr./context/start time": {$gt: ...}}]},
      {"ehr": {$exists: true}},
      {"report": {$exists: true}},
      {"findings": {$exists: true}},
      {"lab test-hba1c": {$exists: true}}]},
    {"_id": 0,
     "report./context/other context[at0001]/items[at0002]/items[at0005]/value": 1,
     "lab test-hba1c": 1})

4. Query: Return all BP elements having a position in which BP was recorded.
JS equivalent:
  db.docs.find(
    {$and: [
      {$and: [{"blood pressure./data[at0001]/events[at0006]/state[at0007]/items[at0008]/value": {$exists: true}}]},
      {"ehr": {$exists: true}},
      {"encounter": {$exists: true}},
      {"blood pressure": {$exists: true}}]},
    {"_id": 0, "blood pressure": 1})
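The operator semantics these queries rely on ($and, $exists, $gte) can be mimicked by a small in-memory matcher. This is only a teaching sketch of how such filters select documents over path-keyed fields, not MongoDB's actual implementation.

```javascript
// Teaching sketch of filter evaluation: supports only $and, $exists and
// $gte over flat, path-keyed documents. Function name and document shape
// are illustrative.
function matches(doc, filter) {
  if (filter.$and) return filter.$and.every(f => matches(doc, f));
  return Object.entries(filter).every(([key, cond]) => {
    const value = doc[key];
    if ('$exists' in cond) return (value !== undefined) === cond.$exists;
    if ('$gte' in cond) return value !== undefined && value >= cond.$gte;
    return false; // unsupported operator in this sketch
  });
}

// A simplified document and a filter shaped like sample query 2.
const doc = {
  'ehr': {},
  'blood pressure': {},
  'blood pressure./data[at0001]/events[at0006]/data[at0003]/items[at0004]/value/magnitude': 150,
};
const filter = {
  $and: [
    { 'ehr': { $exists: true } },
    { 'blood pressure': { $exists: true } },
    { 'blood pressure./data[at0001]/events[at0006]/data[at0003]/items[at0004]/value/magnitude': { $gte: 140 } },
  ],
};
// matches(doc, filter) is true: the concept is present and systolic >= 140.
```

Dropping any clause's target field from the document makes the $exists test fail and excludes the document, which is exactly how concept-level restriction works in the sample queries.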

Table 6.1: Example snippet of the JSON equivalent of the blood pressure concept (archetype).

  {"name": "blood pressure",
   "datalist": [
     {"name": "Systolic",
      "path": "/data[at0001]/events[at0006]/data[at0003]/items[at0004]/value",
      "datatype": "DvQuantity", "min": [0.0], "max": [1000.0], "unit": ["mm[Hg]"]},
     {"name": "Diastolic",
      "path": "/data[at0001]/events[at0006]/data[at0003]/items[at0005]/value",
      "datatype": "DvQuantity", "min": [0.0], "max": [1000.0], "unit": ["mm[Hg]"]},
     {"name": "Mean Arterial Pressure",
      "path": "/data[at0001]/events[at0006]/data[at0003]/items[at1006]/value",
      "datatype": "DvQuantity", "min": [0.0], "max": [1000.0], "unit": ["mm[Hg]"]}, ...]}

Note: The prototype system has a few shortcomings: at present it cannot perform multi-patient or multi-concept queries, and it does not support division or nested (in/not in) types of queries. We wish to overcome these weaknesses in future improvements of the system.

6.4 Discussions

The proposed approach has two main purposes. First, it makes use of a NoSQL, document-oriented database for the persistence of the standardized EHRs. Second, it enables a relational-like, high-level query interface over the persisted EHRs. Table 6.2 presents a detailed comparison between a relational database (PostgreSQL), tested using the open-source prototype system Opereffa (Opereffa, 2011); the NoSQL database (MongoDB) proposed here; and the XML database (DB XML) proposed in (Freire et al., 2012). The comparison shows that the NoSQL database scores over the other two on the compared dimensions, which are significant for the standardized EHRs. Table 6.4 gives a detailed comparison between the various query methods for the standardized EHRs. It compares the query-ability of the AQL proposed by Ocean Informatics (AQL, 2011), the AQBE interface using a relational persistence layer proposed in (Sachdeva et al., 2012b), and the improved AQBE system proposed in this work. As evident from the table, each of the methods faces some challenges and has some shortcomings. The NoSQL, cloud-based databases provide the advantages mentioned in Table 6.2, and the queries preserve the desired semantics of the concepts. A sample set of queries tested on the prototype system is given in Table 6.3.

Table 6.2: Comparison between databases for persistence of the archetypal EHRs.

Scalability:
  PostgreSQL (relational DB): a single large relation is defined; versioning may be expensive.
  MongoDB (NoSQL DB): each concept is stored as a JSON document with a unique id and version id.
  DB XML (Berkeley XML DB): limited scalability, due to the nested structure of the archetypes and templates.
Performance:
  PostgreSQL: the relational form of the queries is slow (Jacobs, 2009).
  MongoDB: lightweight application, fast query response.
  DB XML: limited; each of the nodes needs to be traversed for a query response.
Queryability:
  PostgreSQL: SQL-like AQL queries can be performed.
  MongoDB: the QBE-like AQBE interface provides powerful querying.
  DB XML: performs epidemiological queries, with low performance.
Indexing:
  PostgreSQL: automatic; composite/secondary indexing possible.
  MongoDB: automatic; composite/secondary indexing possible.
  DB XML: database pre-defined; may not be suitable for EHR data.

These queries are formulated with the help of existing literature and user studies with clinicians.

1. Get a patient's current medication list. [single-(concept/patient), projection]
2. Find high blood pressure values (systolic >= 140 or diastolic >= 90) within a specified EHR. [single-(concept/patient), restrict and project]
3. Find high blood pressure values (systolic >= 140 and diastolic >= 90) within a specified EHR. [single-(concept/patient), restrict and project]
4. Find blood pressure values where systolic/diastolic value > 0.2 within a specified EHR. [single-(concept/patient), divide]
5. Get BMI values > 30 kg/m2 for a specific patient. [single-(concept/patient), restrict and project]
6. Get all HbA1c observations that have been done in the last 12 months for a specific patient. [single-(concept/patient), restrict and project]
7. Find all blood pressure (BP) values for a specific patient, showing their systolic and diastolic values; also rename the tag of systolic BP to Sys and of diastolic BP to Dias. [single-(concept/patient), rename]
8. Return all blood pressure (BP) elements having a position in which BP was recorded. [single-(concept/patient), exists]
9. Get the blood pressure (BP) values where the position is not standing. [single-(concept/patient), negation]
10. Find all the patients who have the same admitting doctor as A001. [single-concept, multi-patient, restrict and project]
11. Find all the patients who have diabetes but no record of hypertension diagnosis.

12. Get the number of patients admitted on 9 October, 2012. [single-concept, multi-patient, aggregate]
13. Get the number of all the patients with diabetes.
14. Retrieve all patients who have not been discharged.
15. Get all patients who are suffering from the same problem as a specific patient (e.g., diagnosis is diabetes). [single-concept, multi-patient, nested]
16. The children of women who had medication XYZ during their first pregnancy. [complex query: multiple patients/concepts]
17. Find the number of patients who were given medications during the hospital course that have caused an allergy in 1 or more patients. [complex query: multiple patients/concepts, aggregate, epidemiological]
18. How many patients have had a past medical history of anemia. [complex query: multiple patients/concepts, aggregate, epidemiological]
19. How many patients developed alopecia as a side effect of chemotherapy in the target population. [complex query: multiple patients/concepts, aggregate, epidemiological]
20. How many cases of small cell lung cancer are noted among smoking females in the target population. [complex query: multiple patients/concepts, aggregate, epidemiological]
21. Retrieve results containing 3 concepts (fever, sore throat, and cough), with 1 concept having 2 sub-keys with numerical values (temp > 38.2 deg and duration > 1 day). [complex query: multiple patients/concepts]
22. Retrieve results containing 5 concepts (fever, sore throat, cough, no vomiting and sputum); 2 concepts having 1 sub-key with a numerical value (fever temp > 38.2 deg and duration > 1 day) and 1 concept having 1 sub-key with a textual value (i.e. sputum of yellow color). [complex query: multiple patients/concepts]
23. Retrieve results containing 3 clinical concepts (cough, no sore throat, and no sterol injection), with 1 concept having 1 sub-key with a textual value (i.e. non-sterol injection at the left side). [complex query: multiple patients/concepts]

Table 6.3: Sample set of 23 queries prepared for evaluation of the AQBE query language.

Table 6.4: Comparison of query capability between the AQL and AQBE interfaces for different types of queries. Columns: AQL (openEHR) | AQBE (relational DB) | AQBE (NoSQL DB).

- Simple query (Select)
- Filtered query (Where clause)
- Sorted query (Order By) (except Distinct)
- Grouping, summary and analysis (Group By, Having, grouping/aggregation/analytical functions): to be explored; to be explored
- Joins and intersection (outer/inner/natural/range/equi/self): to be explored
- Sub-query (in/not in/nested/parallel/multi(row/column)/single row): to be explored; to be explored
- Hierarchical query: to be explored
- Composite query (Union, Union All, Intersect, Minus)
- Top-N query: to be explored; to be explored; to be explored

6.5 Summary and Conclusions

The dual-level modeling approach of the openEHR standard for EHR interoperability provides a universal schema for storing EHRs. The standardized EHRs are the building blocks of these systems and have a complex structure which needs to be queried by the target users. Moreover, the volume of data collected from hospitals and health organizations keeps increasing. Considering these large volumes, this study has explored the implementation of a highly scalable database, namely a NoSQL-based database.

This study considers a persistence-level storage system for the archetypal EHRs using a NoSQL database. This eliminates object-relational mapping and maintains a node- and path-based persistence. MongoDB is a document-oriented NoSQL database that can form a cloud-based data store, which is important with respect to applying cloud computing to maintain the voluminous data archive. The node- and path-based persistence allows highly granular queries over the nested archetypal EHRs. This makes these databases usable for extensive querying of standardized EHRs through the easy-to-use, relational-like query interface (AQBE).

Chapter 7

Automated Usability Framework for the Standardized EHRs

Usability Support for Standardized Electronic Health Records

The practice of medicine requires complex processing of large amounts of data. Patient-related data are needed by several occupational and health-care institutions. As part of IT in health care, standardized EHR databases provide an advantage for storing and retrieving patient data (Zheng et al., 2009). Many government agencies are taking steps to encourage the electronic exchange of information between hospitals and health agencies through standards such as HL7 (Health Level 7) (HL7, 2011), DICOM (Digital Imaging and Communications in Medicine) (Hussein et al., 2004), CEN EN (cen, 06 3) and openEHR (T. et al., 2008). Standard-compliant documents form a useful long-term storage representation for clinical data. These are a longitudinal collection of the health information of patients and provide immediate electronic access at the patient and population levels. Thus, standardized EHR databases capture patient-related medical activities and can facilitate knowledge discovery and decision support for health-care delivery (art, 2003), (Sachdeva et al., 2012a).

Typically, information exchange occurs between laboratories; among clinicians and patients; and between order-management systems such as care planning, order entry, pharmacy-order processing, and documentation of medication administration (Walker, 2008). Standards-based health information makes it easier to combine data from heterogeneous sources where individual feeder systems differ in functionality, presentation, terminology, data representation and semantics (Sachdeva et al., 2012a), (Hui et al., 2011). For improved accuracy, a standardized EHRs database is connected to standardized terminology systems such as SNOMED-CT ((IHTSDO), 2013), ICD (ICD(9, 2013)) and LOINC (LOINC, 2013).

In a recent study, the American Medical Informatics Association (AMIA) cites the need to address usability concerns for patient-sensitive functions related to controlled medical terminologies and application functions in the case of standards-based, interoperable EHRs (Middleton et al., 2011). This concern is addressed by interfacing standardized EHRs with web services such as MedlinePlus Connect (Burgess et al., 2012), to automatically retrieve information and perform problem-code lookups during patient care (Jin et al., 2011).

1 Research Publication(s): (1) Aastha Madaan and Subhash Bhalla. Usability Measures for Large Scale Adoption of the Standardized Electronic Health Record Databases. Journal of Information Processing (JIP), Vol. 22, No. 3, Information Processing Society of Japan [to appear]; and (2) Aastha Madaan and Subhash Bhalla. Automated E-Learning Support for Healthcare Workers using Large Standardized Electronic Health Records Databases. Journal of Information Processing (JIP), Information Processing Society of Japan [under first review].

Figure 7.1: Usability and standardization support infrastructure for the standardized EHRs databases.

Figure 7.1 depicts the components associated with a standardized EHRs database. The major goal of the users of these databases is ease of use. Health-care workers face many difficulties, such as inefficient workflows that fail to match clinical processes. User interfaces are often poorly designed and overloaded with data. This may present imperfect data, leading to a strong negative effect on the data and information quality of results (Schumacher and Jerch, 2012). Usability in health care is challenging, since the system is designed to meet the needs of multiple user types with varying requirements who work across geographic, temporal, organizational, and cultural boundaries (Constance M. et al., 2005). The ability to obtain meaningful, reproducible and objective usability metrics for EHR systems is limited by the socio-technical nature of these systems (Middleton et al., 2011). Numerous health-care systems are designed without consideration of user-centered design guidelines; consequently, these systems become ad hoc and are gradually abandoned (Constance M. et al., 2005).

End-to-end Workflow Management

Recently, NIST (NIST, 2012) proposed the EHR Usability Protocol (EUP) (Lowry et al., 2012). The protocol highlights that usability is a critical factor affecting the adoption and use of EHR database systems (Schumacher and Jerch, 2012).
It states several challenges in the usability-evaluation process for EHR systems. It recommends that usability tests be performed in a clinically relevant environment, because the majority of EHR users are clinicians: they have precise expectations, complete knowledge of the patterns they are looking for in the data, and are constrained by time. A diverse set of users and tasks needs to be considered, with a suitable level of granularity, in the usability-evaluation process for these databases. These granular details directly impact the user workflows for performing a task; the way a general physician prescribes medication for a patient differs from the process followed by a heart specialist. Moreover, a broad spectrum of socio-technical factors of the users needs to be considered to evaluate the impact on workflows and, further, on usability. This requires real and complete datasets. Further, a task may be triggered by an external system; hence, a task definition that is consistent across applications (taking into account all the external interfaces) is required.

Figure 7.2: Support studies for evolution and enhancements in standardized EHR databases.

Context of the Study

The traditional methods of usability evaluation and improvement rely on post-release user feedback and video-taping of user sessions, which are expensive, delayed and lack accuracy (Zheng et al., 2009). Currently, pattern-discovery techniques are used in research areas such as clinical decision-support systems (CDSS), analysis of temporal patterns, chronic-disease treatment and prevention, and epidemic-tracking studies. These are secondary applications in the traditional context of a clinical application. We propose to use pattern-discovery techniques for the primary concern of making a standardized EHRs database usable and learnable for its users. To the best of our knowledge, the proposed offline support study is the first comprehensive study that incorporates the UCD guidelines for the standardized EHRs databases. It addresses the usability concerns (above) for the standardized EHRs databases. Figure 7.2 depicts the proposed knowledge repository (KRep), which stores the various users (their characteristics) and their preferred workflows.
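As an illustration of the kind of pattern discovery such a knowledge repository could rely on, the sketch below ranks screen-to-screen transitions from interaction logs by frequency. The function and log format are hypothetical, not the framework's actual mining algorithm.

```javascript
// Hypothetical sketch: derive the most common screen-to-screen transitions
// from user-interaction logs, as a first approximation of the "preferred
// workflows" a knowledge repository (KRep) might store per user group.
function frequentTransitions(sessions) {
  const counts = new Map();
  for (const screens of sessions) {
    for (let i = 0; i + 1 < screens.length; i++) {
      const key = `${screens[i]} -> ${screens[i + 1]}`;
      counts.set(key, (counts.get(key) || 0) + 1);
    }
  }
  // Sort descending by frequency; the top entries suggest which workflow
  // steps should be reachable with the fewest clicks.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Three example sessions, each a sequence of screens visited by a user.
const logs = [
  ['patient-search', 'vitals', 'blood-pressure-form'],
  ['patient-search', 'vitals', 'medication-list'],
  ['patient-search', 'vitals', 'blood-pressure-form'],
];
const ranked = frequentTransitions(logs);
// ranked[0] is ['patient-search -> vitals', 3]
```

Real workflow mining would use richer sequence-pattern techniques and user-group segmentation, but even this counting step already exposes the dominant navigation paths that a user-centered redesign should shorten.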
This helps to bridge the gap between the users' intended workflows and the supported application flow, thereby improving the usability of this lifelong and evolving system. In this study, the usability concerns of EHR database systems based on the openEHR standard (Eichelberg et al., 2005) are addressed.

Problem Statement

Due to the continuous, evolving nature of EHR databases and the varying user profiles, measuring user performance in a valid and repeatable way is challenging (Constance M. et al., 2005), (Schumacher and Jerch, 2012). Clinicians need concise conceptualization and representation of complex clinical data for accurate problem solving and decision making. The following key features need to be considered, in light of the UCD guidelines, for clinical applications based on the standardized EHRs databases.

1. Need to understand the diverse user groups and their environments. Users can be categorized into groups considering their demographics, technical characteristics and environmental factors (Swanson and Lind, 2011). This reduces the gap between system capabilities and user abilities.

2. Need to understand the tasks and workflow goals. The need is to overcome the limited scope and generalized methods of field studies, observation, interviews, questionnaires and surveys (Constance M. et al., 2005). In the complex environment of standardized EHRs databases, usability support systems should be capable of discovering regularities and outliers in user behavior by mining the user-system interactions.

3. Need to design an effective and learnable clinical application. Accessing a large number of screens to reach the relevant one increases the click-through burden and disrupts the users' workflows. Each part (archetype) of the information presented to users through templates on the user interface needs to be analyzed.

4. Need to create an automated, up-to-date knowledge repository. The usage data extracted from the transactional EHRs database can be mined to discover realistic user behavior in actual patient settings. This analysis can capture evolving user needs and expectations, as well as change requirements in the medical concepts stored in the database.

There is a need for a support system (studies) based on the above features to support clinical applications built on the standardized EHRs databases. This can help to provide an easy-to-use and learnable system to the users.

1 User-centered design (UCD) focuses on the end users, their needs and the context in which a system will be used. It is an iterative process (explained in Section 2.2).
As a result, user-retention and satisfaction are increased, whereas errors, development time and cost are reduced (Zhang and Muhammad F., 2011).

7. AUTOMATED USABILITY FRAMEWORK FOR THE STANDARDIZED EHRS

7.2 Background and Motivation

Health information evolves over time as new knowledge becomes available. Further, the population size and the amount of electronic data gathered, along with the impact of globalization and the speed of disease outbreaks, pose new usability challenges (Canlas, 2009). Therefore, usability considerations of the standardized EHRs databases need to be addressed.

Usability Issues in the EHRs Databases

The International Standards Organization (ISO, 2012) defines usability as: the effectiveness, efficiency, and satisfaction with which the intended users can achieve their tasks in the context of product use (NIST, 2012). Former studies (Zhang and Muhammad F., 2011), (Zheng et al., 2009) for the usability evaluation of previously used EHR systems consider manual methods of surveys and feedback (Section 1.2). In contrast, we focus on a usability evaluation and enhancement process specifically suited to the standardized EHRs databases, in response to their life-long and evolving nature. Based on the NIST report, we consider the elements in the context of use - users, their tasks, equipment, and their demographic and social environments (Lowry et al., 2012). For quality health-care delivery, clinicians need an efficient user-interface: they have limited time to access a lot of information. This requires the application-flows to be easy-to-use. Hence, execution time and delays in response are part of the usability concerns.

User-Centered Design for the EHR Databases

Existing clinical applications based on the standardized EHRs databases remain difficult to use due to the absence of human-factor design principles (Schumacher and Jerch, 2012). A user-centered design process is driven by the users and involves them in feedback sessions for the usability evaluation. It is based on systematic analysis of work-flow development and the application of design standards, and aims to provide ease-of-use to the users.

Figure 7.3: User-Centered Design (UCD) steps for an interactive health-care application using standardized EHR databases (adapted from ISO) (Sachdeva et al., 2012a). [Figure shows four iterative steps around the evolving standardized EHR databases: (1) understand users (demographics, characteristics); (2) understand user-system interactions (user needs, work-flows, setting and environment) and introduce usage enhancements; (3) user-interface specification (known human-behavior principles, familiar interface models); (4) understand change in requirements (evolving requirements and new users, modules to be redesigned or improved), leading to enhancements in application development.]

As shown in Figure 7.3, for an interactive software system (such as the standardized EHRs database system and the health-care system) there is a need to use the user-centered design (UCD) principles for enhanced usability and quality delivery to the end-users. The UCD guidelines given in the ISO standard consist of 4 iterative steps: (i) identification of users, (ii) understanding their interactions with the system, (iii) user-interface design, and (iv) iterative enhancement of the system (TIGER, 2012). The first step involves understanding the target user-groups of the system and their context-of-use (needs, work-flows, and environments).
For understanding the interactions of the users with the system, their critical and frequent tasks need to be identified. In a health-care setup, the user-system interactions are mostly sequential in nature for tasks such as patient diagnosis, preparation of the assessment plan and assignment of medication (Ramakrishnan et al., 2010). Hence, user-interaction in this case refers to the sequential UI accesses made by the users to accomplish a given task. A usable health-care system (or standardized EHRs database system) needs to be periodically customized and enhanced to adapt to the evolving needs of the end-users and the dynamically varying, complex human-system interactions. Another existing work implements the UCD guidelines for usability improvement through task analysis, user analysis, and environment analysis, but uses manual, time-consuming and inefficient methods such as surveys, questionnaires, and field studies (Constance M. et al., 2005). These become difficult to implement in the case of EHR systems using standard-based EHRs, due to their complex structure and temporally evolving nature.

Figure 7.4: Blood pressure concept represented as an archetype (mind-map) in the openEHR archetype repository (CKM, 2012).

Standardized Electronic Health Record Databases

In this subsection, the key features of the openEHR model for the standardized EHRs databases are discussed. The various artefacts of the openEHR model (archetypes and templates) are explained, and the evolving nature of these databases is described.

Figure 7.5: Flow of openEHR artefacts during the process of patient-care (T. et al., 2008). [Figure shows four artefact categories - (1) Observations (measurable or observable; LOINC, ICD, SNOMED-CT), (2) Evaluation (clinically interpreted findings: assessment, opinion, goals that define intervention), (3) Instructions and (4) Actions recording clinical activities - exchanged among the patient system, investigator system and investigator agents, drawing on a published evidence base (web-based health portals, medical handbooks) and the investigator's personal knowledge base.]

Building Blocks of the openEHR Model

The EHRs use industry standards promoted by the Integrating Healthcare Enterprise (IHE) (IHE, 2013) and other standardization organizations such as openEHR (Eichelberg et al., 2005). The openEHR model uses a two-level methodology that decouples the knowledge model from the system design. This allows the knowledge model to be integrated with clinical applications independently of the system design. The standard proposed by openEHR (Opereffa, 2011) accommodates new medical concepts (conceptual model) without the need for redevelopment, through the use of archetypes. (An openEHR archetype is a computable expression of the domain content model in the form of structured constraint statements, based on the openEHR reference model (art, 2003). Archetypes are defined by clinicians, in general for re-use, and can be specialized to include local particularities. They can accommodate any number of natural languages and terminologies.) The purpose of archetypes is to express new information structures as a combination of predefined classes. The reference model (RM) represents the semantics of storing and processing the EHR data. It contains the generic data structures to model the logical structures in the clinical records (Sachdeva and Bhalla, 2012). The archetypes define the structure of the user-interface fields that capture the clinical data, and how the information is stored in the underlying EHRs databases. In these databases, it is possible to add and retrieve new patient information for which the component structure was previously unknown. Figure 7.4 depicts the complex structure of the blood pressure archetype (concept) and its sub-concepts. Each medical concept may contain attributes, and each of these attributes defines constraints on the contained data. Hence, a complex structure is formed.

Figure 7.6: Increasing complexity of the standardized EHRs w.r.t. time and patient encounters. Changes in templates and archetypes, considering life-long representation of a single patient's EHR (HL7, 2011), (T. et al., 2008).

At present the standard defines 352 archetypes under the categories of observation, evaluation, instruction and action. These categories cover the complete spectrum of the process of health-care delivery and represent the major clinical steps of patient-care (Helma van der et al., 2005). Figure 7.5 illustrates the various clinical interactions among these archetypes. The user work-flows include interactions between the patient system, the investigator system and the investigator agents. During patient evaluation, the investigator (clinician) uses his (or her) personal knowledge base, which includes health portals, published terminologies and medical concepts for decision making. Therefore, besides the type and the number of attributes (items in archetypes), the interactions with external systems add to the complexity of creating a usable user-interface for EHRs databases (Helma van der et al., 2005).

The standardized EHRs databases incorporate UCD modeling up to a certain degree, through the use of archetypes (Kashfi, 2010). The openEHR foundation proposes that each screen of a medical application may be generated from several archetypes bundled together as a template. If the original archetype offers different terminologies, selections can be made within a template to reflect the local conditions. Furthermore, optional sections in an archetype can be omitted or made mandatory, and default values can be set using templates (Sachdeva and Bhalla, 2012). Hence, the openEHR model facilitates communication with the end-users. Moreover, it eliminates the need for the functional analysis and cognitive analysis proposed by previous approaches for usability enhancement and support studies (Constance M. et al., 2005).

Evolution of Standardized EHRs

Figure 7.7: The relationship between templates and the archetypes for the openEHR standard (T. et al., 2008). [Figure shows templates (openEHR template) composed of organizational archetypes (openEHR organizer archetype), which in turn contain primary archetypes (openEHR entry archetype).]

The unstructured nature of clinical processes adds to the complexity of clinical applications. The clinical domain concepts are therefore not easily understood by IT specialists, who lack the domain knowledge to model the user interfaces (Kashfi, 2010). Figure 7.6 represents the increase in the complexity of an EHR over time. The participating templates and archetypes evolve: the definition of an archetype may be modified, or the archetypes participating in a template may change (with changing user expectations).
The complexity also increases when new versions of an EHR are created with every patient encounter (single-patient EHR). Hence, a complex and temporal standardized EHRs database is generated. Considering such systems without addressing the usability issues may result in adaptable systems with comprehensive information models that are nevertheless not usable (Kashfi, 2010). The dynamic nature of the EHRs creates temporal inconsistency in reports. This may require a complete rollback to a previous version of the EHR and identification of the cause of the inconsistency. Hence, integrating the domain knowledge of the openEHR standard with the UCD guidelines can generate standardized EHRs databases with high usability. Pattern mining algorithms can capture the evolving nature of user needs and system features. (Templates are used to create definitions of content such as a particular document or message. They are required for specific use-cases, such as specific screen forms or reports (T. et al., 2008).)
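The two-level composition and per-encounter versioning described above can be sketched as a toy data model. The class names and fields below are illustrative only and are not part of the openEHR specification:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Archetype:
    # Constraint model for one clinical concept (e.g. blood pressure).
    name: str
    attributes: List[str] = field(default_factory=list)

@dataclass
class Template:
    # Use-case-specific bundle of archetypes; typically backs one screen.
    name: str
    archetypes: List[Archetype] = field(default_factory=list)

@dataclass
class EHRVersion:
    # Snapshot of the record created at one patient encounter.
    timestamp: str
    templates: List[Template] = field(default_factory=list)

bp = Archetype("openEHR-EHR-OBSERVATION.blood_pressure.v1",
               ["systolic", "diastolic", "position"])
# Each encounter appends a new version; later versions may add or
# replace templates, so the structural complexity grows over time.
ehr = [
    EHRVersion("t1", [Template("vital_signs", [bp])]),
    EHRVersion("t2", [Template("vital_signs", [bp]),
                      Template("medication_list", [])]),
]
print(len(ehr), [t.name for t in ehr[-1].templates])
```

The append-only list of versions mirrors the life-long record of Figure 7.6: earlier snapshots are never overwritten, which is what makes a rollback to a previous version possible.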

User Interface Generation

Figure 7.7 represents the hierarchical relationship among the openEHR archetypes and templates. In the openEHR standard, values are recorded in the primary archetypes, represented as the entry archetypes. The second-level archetypes, the organizational archetypes, are shared (document-level) models which are applicable across different settings (use-cases). They are used to record organizational activities and to constrain the contained primary archetypes. For example, recording a clinician-patient interaction in a traditional manner includes tasks such as history, physical examination, diagnosis and management. The openEHR template, or constraint specification, shows the contained primary archetypes, the organizational models used, and their order. These templates together form the complete EHR of a patient.

An Example

The Opereffa prototype system (Opereffa, 2011) was developed using feedback from, and considering the needs of, the various organizations and personnel using the openEHR standard (T. et al., 2008). It is an initial attempt to develop a clinical application based on the standardized EHRs. A high level of granularity in the archetype-based data is required at the persistence layer of the system. Opereffa uses the PostgreSQL (PostgreSQL, 2011) database to store the patient data in a single relation with few columns. The attributes are stored with their complete hierarchical path and archetype name; the path is extracted using the corresponding ADL. The templates used for the UI (forms) are developed with the aim of covering the basic spectrum of medications and allergies, and include the organizational archetypes for medications, results and investigations (Opereffa, 2013).
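The path-keyed, single-relation storage scheme described for Opereffa can be sketched with SQLite standing in for PostgreSQL. The table layout, column names and paths below are assumptions for illustration, not Opereffa's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ehr_data (
    patient_id TEXT,
    archetype  TEXT,   -- archetype name taken from the ADL definition
    path       TEXT,   -- complete hierarchical path of the attribute
    value      TEXT)""")

# Each leaf attribute is stored with its full path, so a new archetype
# needs no schema change: new rows, never new columns.
rows = [
    ("p001", "openEHR-EHR-OBSERVATION.blood_pressure.v1",
     "/data/events/items[systolic]/value", "120"),
    ("p001", "openEHR-EHR-OBSERVATION.blood_pressure.v1",
     "/data/events/items[diastolic]/value", "80"),
]
conn.executemany("INSERT INTO ehr_data VALUES (?,?,?,?)", rows)

# Retrieval filters on the archetype name and, if needed, path prefixes.
cur = conn.execute(
    "SELECT path, value FROM ehr_data "
    "WHERE patient_id=? AND archetype LIKE ?",
    ("p001", "%blood_pressure%"))
print(cur.fetchall())
```

The trade-off is the one the text implies: previously unknown structures can be stored immediately, but every query pays for path matching instead of using typed columns.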
Such a system, when implemented in a clinical scenario, can be supported by a knowledge repository (KRep, Section 4.3) to adapt to a large number of target users and to match the users' perceived application flow.

(ADL is a formal language for expressing archetypes. It provides a formal, textual syntax for describing constraints on any domain entity whose data is described by an information model, the openEHR reference model (T. et al., 2008).)

Pattern Discovery and the Standardized EHRs Databases

Pattern discovery is an important component of biomedical informatics for discovering patterns and irregularities in data (Sachdeva et al., 2012a). It finds application in various types of prediction and epidemiological analysis (Zhang and Muhammad F., 2011). As EHR systems grow in application and size, a huge volume of usage data and demographic data accumulates. The accuracy of the temporal data has profound medical, medico-legal, and research consequences. There is an increasing need to transform this information into knowledge for usability improvement, and pattern mining finds its application in such transformations. Other studies emphasize the possibility of integrating clinical support systems with decision support (Hui et al., 2011). Data mining tools can be used to analyze patient behavior and to determine the key features of the most appreciated application flows.

Data and Information Quality Issues

The openEHR archetypes have inbuilt constraints on the data items. This improves the quality of the data captured from templates (constituted by the archetypes) on the user interface. Such data is more complete in nature, and the number of errors is reduced. Simplified features (archetypes included in the template) and an optimal application flow (multiple templates presented to the users) aid in the improvement of data and information quality.

Figure 7.8: Distinct users of standardized EHR databases (Schabetsberger et al., 2005). [Figure shows the user groups around the shared electronic health record databases: medical professionals (general practitioner, established specialist, physician, rescue service), patient, public authority, researcher, pharmacy and health insurance company.]

7.3 Proposed Framework

The aim of this study is to reduce the gap between the state-of-the-art of clinical applications and the future-proof standardized EHRs databases. For this, a conceptual framework is proposed along the UCD guidelines which utilizes the conventional pattern discovery techniques of classification, sequential pattern analysis and temporal mining to maintain a knowledge repository. Next, the steps of the framework are described.

User Classification

The complexity of EHR interactions increases when they are considered in the full socio-technical context of use. In the health-care domain, multiple user types and different geographical, cultural, temporal, and organizational factors need to be considered. Hence, the first step toward making the EHRs databases usable is to understand the end-users of the system. Figure 7.8 represents the various health-care domain users. The EHRs database caters to the needs of the medical professionals (physicians, specialists, practitioners) for patient-care. It is also used for administrative and patient-related information by external agencies such as pharmacies and health insurance companies. Medical researchers use the EHRs data for research analysis in epidemic studies and other analyses. On the other hand, the consumers of health-care information (patients and their relatives) use it for checking preliminary symptoms and medications.
The administrative staff uses the EHRs database to store billing information and patient details. Highly usable and easy-to-use EHRs databases are required to address the focused and time-constrained needs of the medical professionals. The user characteristics, such as age, role and gender, and the demographics, such as education and computer awareness, influence a user's expectations and usage of a clinical application. For understanding the various end-users accurately, all these factors need to be considered. Figure 7.9 depicts the various user characteristics and demographic features associated with a health-care user.

Figure 7.9: Characteristics and demographics identified for any user of the standardized EHR databases. [Figure lists: age, education/training, qualification, sex, computer efficiency, address (city, state, country), profession, years of experience (profile), environment, speciality and work-flow(s)/tasks performed.]

We propose to categorize the users along a vertical dimension, according to their specialties and tasks. Further, they can be categorized along a horizontal dimension based on the different levels of expertise, specialization and the distinct values of the attribute considered. Adhering to the NIST specification (NIST, 2012), user-groups are generated such that their work-flows are clearly distinguishable. For the user-categorization, among the various techniques for pattern discovery, decision tree classification is applied, considering each user attribute for classification. The user attributes are chosen for maximum information gain. Figure 7.10 gives an example of the decision tree with its splitting attributes (vertical dimension) and their possible values (horizontal dimension). The user characteristics define the test attributes (at each of the nodes) of the decision tree which classifies the users. These attributes, along with their values, define the classification rules for end-user categorization.

Understanding User Work-flow Patterns

All accesses to the clinical application are temporal in nature and can be mined for frequent patterns to understand user behavior. These patterns can further be analyzed and stored as knowledge for reference. Considering the huge volume of logs that will be generated due to the complex structure of the templates (multiple participating archetypes) and the large number of varying users, this is a superior method to manually analyzing the usage logs for system improvement.
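The information-gain criterion used to choose the splitting attributes can be sketched in pure Python. The toy user records and attribute values below are invented for illustration; a real run would use the full attribute set of Figure 7.9:

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels.
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(records, attr, label="category"):
    # Gain(attr) = H(labels) - sum over values v of |S_v|/|S| * H(S_v)
    labels = [r[label] for r in records]
    gain = entropy(labels)
    for v in {r[attr] for r in records}:
        subset = [r[label] for r in records if r[attr] == v]
        gain -= len(subset) / len(records) * entropy(subset)
    return gain

# Toy user records (attribute names follow Figure 7.9; values invented).
users = [
    {"speciality": "surgery", "setting": "hospital", "category": "surgeon"},
    {"speciality": "surgery", "setting": "hospital", "category": "surgeon"},
    {"speciality": "general", "setting": "clinic",   "category": "gp"},
    {"speciality": "general", "setting": "hospital", "category": "gp"},
]
# ID3 splits first on the attribute with the highest information gain.
best = max(["speciality", "setting"],
           key=lambda a: information_gain(users, a))
print(best)  # speciality: it separates the two classes perfectly
```

In this toy set, splitting on speciality yields pure subsets (gain 1.0 bit), whereas setting leaves the hospital branch mixed, so ID3 would place speciality at the root, exactly the "most discriminating attribute" behavior the text describes.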
An example of the daily activities of a general physician is given in Figure 7.11, which also shows the various features accessed by the user and the role of the standardized EHR database in their context. Since the features for a task have a well-defined chronological order, to analyze the collected EHR interaction data we propose: (1) sequential pattern analysis (SPA), similar to (Zheng et al., 2009), which searches for recurring patterns in a series of EHR (feature) accesses that occur chronologically; (2) over a time period, SPA (Agrawal and Srikant, 1995) computes the probability of reusing certain EHR features or combinations of features, determining the persistent, interesting feature-sequences. The usage behavior can highlight the cognitive, behavioral and organizational roots that lead to sub-optimal behavior in the system (Zheng et al., 2009). (Information gain is an entropy-based statistical measure which identifies the relevant attributes. The attribute with the highest information gain is considered the most discriminating attribute of the given set of attributes (Quinlan, 1986).)

Figure 7.10: Participating attributes and their values for decision tree classification. [Figure shows the splitting attributes and their values: computer efficiency (yes/no), operational setting (home visit/clinical/hospital), education/training (bachelors/masters/doctorate/diploma), experience (training/professional/practicing/student), age (<=25, 26-50, >=50), sex (male/female), leading to the distinct user categories: researchers, patients, pharmacists, health insurers, public authority, medical professionals.]

In an actual EHR system implementation, an analysis framework such as the proposed one is required to analyze the temporal or recurring patterns occurring over a period per user-category. Discovering the sequences of various lengths, and finally the maximal patterns, gives the optimal path that has a high probability of being followed by the users of a particular category. Table 7.1 gives a set of the work-flows of a physician w.r.t. the performed tasks. As evident from the table, to accomplish a given task the users (of the same category) may follow different work-flows. Therefore, a huge volume of usage logs is generated for each user-category. A complete sequence of features accessed to perform a given task is captured in the database as a transaction. SPA (Agrawal and Srikant, 1995) searches for patterns within a large number of access sequences, where each sequence is composed of a series of time-stamped accesses. It can uncover the frequent EHR features that tend to be accessed sequentially and consistently over a period of time. If a combination of consecutive accesses, s, appears in X access sequences out of a space of Y sequences, then s receives a support of X/Y. The sequences whose support is greater than the pre-defined support threshold further determine the maximal patterns in the EHR transactional database. For example, a hypothetical pattern abc may be a sub-sequence contained in abcd; in such a case, abcd is considered the maximal pattern.
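The support measure and the maximal-pattern notion above can be sketched as follows. The feature names and the toy log are invented, and this brute-force miner only illustrates the definitions; a real SPA implementation (Agrawal and Srikant, 1995) prunes candidates far more efficiently:

```python
def support(pattern, sequences):
    # support = X / Y: fraction of the Y sequences in which `pattern`
    # occurs as a run of consecutive accesses.
    k = len(pattern)
    hits = sum(
        any(tuple(s[i:i + k]) == pattern for i in range(len(s) - k + 1))
        for s in sequences)
    return hits / len(sequences)

def frequent_consecutive(sequences, min_support):
    # Enumerate every run of consecutive accesses, keep the frequent ones.
    candidates = {tuple(s[i:i + k])
                  for s in sequences
                  for k in range(1, len(s) + 1)
                  for i in range(len(s) - k + 1)}
    return {p for p in candidates if support(p, sequences) >= min_support}

def maximal(patterns):
    # A frequent pattern is maximal if no other frequent pattern
    # contains it (e.g. abc is dropped when abcd is also frequent).
    def contains(big, small):
        k = len(small)
        return any(big[i:i + k] == small for i in range(len(big) - k + 1))
    return {p for p in patterns
            if not any(q != p and contains(q, p) for q in patterns)}

# Toy, time-ordered access logs for one user-category (names invented).
logs = [["history", "exam", "diagnosis", "medication"],
        ["history", "exam", "diagnosis", "medication"],
        ["history", "exam", "labs"]]
fp = frequent_consecutive(logs, min_support=0.6)
print(sorted(maximal(fp)))
```

With a threshold of 0.6, the four-step sequence appearing in two of the three logs survives, and all of its sub-sequences are discarded as non-maximal, which is exactly the abc/abcd example in the text.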
Hence, the complete user work-flows which have the frequently accessed consecutive EHR feature sequences as their sub-sequences can be discovered using SPA.

Table 7.1: Sample set of the work-flow steps captured in the database for a general physician's common tasks in a clinical setting.

S. No. | Work-flow | Task
1 | History of Present Illness, Assessment and Plan, Physical Examination, Diagnosis, Assessment and Plan, Medication, Medication Side Effects, Appointment Scheduling, Assessment Plan | General Patient-Checkup
2 | History of Present Illness, Assessment and Plan, Physical Examination, Laboratory Tests, Review Systems, Handbook Lookup, Diagnosis, Medication, Assessment Plan, Appointment Scheduling, Assessment Plan | Identify and Investigate Medical Problem
3 | History of Present Illness, Assessment and Plan, Physical Examination, Family History, Social History, Review Systems, Handbook Lookup, Diagnosis, Medication, Assessment Plan, Appointment Scheduling, Assessment Plan | Identify and Investigate Medical Problem
4 | History of Present Illness, Assessment and Plan, Physical Examination, Laboratory Tests, Review Systems, Handbook Lookup, Diagnosis, Medication, Vaccination, Assessment Plan, Appointment Scheduling, Assessment Plan | Identify and Investigate Medical Problem
5 | History of Present Illness, Assessment and Plan, Physical Examination, Laboratory Tests, Review Systems, Handbook Lookup, Medication, Assessment Plan, Appointment Scheduling, Assessment Plan | Identify and Investigate Medical Problem
6 | History of Present Illness, Assessment and Plan, Physical Examination, Diagnosis, Medication Side Effects, Medication, Social History, Assessment Plan, Appointment Scheduling, Assessment Plan | Discuss Treatment with Patients

The Knowledge Repository

For the standardized EHRs database, a knowledge repository can be generated by storing the significant patterns in the work-flows per user category over an interval, together with the preferred EHR features (based on the users' specialty). This knowledge repository contains the frequent patterns discovered over a number of sessions for a particular user-category. The repository is temporal in nature: incoming rules are clustered with existing rules and their support is incremented. If the user-category does not exist, a new rule is added with the corresponding support and user-category. These rules can help the application designer understand which features are obsolete and which are frequently required by the various categories of users. Therefore, the repository can provide data, information and knowledge to the appropriate users, in an understandable format and in the desired sequence (optimal flow). A row in the KRep may be represented as <Id, UserId, CharacteristicsId, Freqworkflows, Outlierworkflows, Timestamp>, where UserId, CharacteristicsId, Freqworkflows and Outlierworkflows in turn refer to the details of each of them. The EHR system designer can use the rules for the following purposes:

1. Improve the layout of the user-forms.
2. Improve the functionalities (archetypes/templates) provided to support users according to their attributes and needs.
3. Design an intuitive application flow (sequence of presentation of templates) according to the users' preferred work-flows and purpose.
4. Capture continuous temporal requirements w.r.t. the above without much overhead.
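The update rule described above (merge an incoming rule with an existing one and increment its support, otherwise insert a new row) might be sketched as below. The field names follow the row format given in the text; the class itself is an illustrative assumption:

```python
import time

class KnowledgeRepository:
    """One rule per (user-category, work-flow); a re-observed rule is
    merged with the existing entry and its support is incremented."""

    def __init__(self):
        self.rules = {}   # (category, pattern) -> rule row
        self._next_id = 1

    def add(self, category, pattern, characteristics=None):
        key = (category, tuple(pattern))
        rule = self.rules.get(key)
        if rule is not None:            # existing rule: bump support
            rule["support"] += 1
            rule["timestamp"] = time.time()
        else:                           # new category/pattern: new row
            self.rules[key] = {
                "id": self._next_id,
                "user_category": category,
                "characteristics": characteristics,
                "freq_workflow": tuple(pattern),
                "support": 1,
                "timestamp": time.time(),
            }
            self._next_id += 1

krep = KnowledgeRepository()
flow = ["History of Present Illness", "Physical Examination",
        "Diagnosis", "Medication"]
krep.add("general_physician", flow, {"setting": "hospital"})
krep.add("general_physician", flow)   # same rule observed again
print(krep.rules[("general_physician", tuple(flow))]["support"])
```

The timestamp on each row is what makes the repository temporal: stale rules (features no longer accessed) can be recognized by support that stops growing, matching purpose 4 above.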

Figure 7.11: Decision-support tasks and requirements of a clinician in day-to-day activities. [Figure maps the activities performed (patient consultation: information to patients, updating the EHR record, pre-consultation preparation, review of the EHRs) to EHR application features (EHR interface accesses, medication, inter-departmental treatment plan, appointment scheduling), backed by storage/retrieval operations on the standardized EHR database.]

Mathematical Formulation

The proposed framework can be modeled mathematically as three functions. First, the end-users of a clinical application can be modeled as U = {u_i}, i = 1..n. Each user u_i is characterized by a set of socio-technical features <e_i, a_i, sett_i, s_i, q_i, ck_i, p_i, d_i, c_i>, which represent expertise, age, setting, sex, qualification, computer knowledge, purpose, department and user-class respectively. Each of these characteristics has a distinct but finite set of categorical values. Each user u_i belongs to only one user-category C_i. The characteristics associated with a user influence his (or her) choice of work-flow to perform a given task. A work-flow pursued by a user is a chronological sequence of the features of the clinical application (based on the standardized EHRs database) accessed by him or her. For example, in an access sequence F_1, F_2, F_3, F_4, F_5, F_6, each F_i may represent an EHR feature such as Patient History, Diagnosis, Medication List, Laboratory Tests or Appointment Scheduling. Such a sequence may be followed by a physician for the simple task of a patient encounter. Sequential pattern analysis (SPA) analyses these sequences to discover the maximal patterns for each user category, [C_i, F_1, F_2, ..., F_m]. These patterns represent the preferred work-flows for the user. Let D be the database of the access sequences (transactions) pursued by the clinicians or other users belonging to a particular user-category, and let the minimum support value be represented as σ. Then the sequences S are determined such that support(S) >= σ in D. For each user-category C_i, these sequences are considered and the maximal patterns are discovered. The knowledge repository represents the correlations between these optimal work-flows and the user-characteristics of the associated user-categories. A rule r_i in the KRep may be depicted as: IF <e_i = Surgery AND a_i >= 40 AND sett_i = Hospital> THEN u_i is expert surgeon = yes, features accessed = <F_1, F_2, ..., F_10> and support = x%. Such a rule defines the classification rule for the user category "expert surgeon": a user-class with age greater than 40, a setting of hospital and expertise in surgery. It further gives the optimal and frequently accessed patterns associated with that user, with a support of x%. A standardized EHRs database contains the user data and their work-flow details as attributes, but this information cannot be used effectively by an EHR system designer to improve an EHR system directly. Hence, using the proposed approach, a knowledge repository is created correlating the users, their environments and their work-flows.

The Algorithm

The steps of the framework are given formally in Algorithm 5. First, the ID3 algorithm performs user classification based on the user labels and the user records input to it. Next, the maximal patterns are discovered from the input usage data for each user-category. Finally, the user categories are correlated with the maximal patterns for each task and stored in a knowledge repository.

Algorithm 5: Algorithmic steps for the proposed usability framework
  input : user characteristics (UC), user-EHR system interactions (UI), user labels (UL)
  output: user category tree (CT), maximal sequential patterns (MSP)

  check input file F_input containing UC in either .arff or Excel format
  if file compatible then
      input file to WEKA tool
      select ID3 classifier
      classifyUsers(ID3 algorithm, 40, 60, UL)
  end
  repeat
      classifyUsers(ID3 algorithm, 40, 60, UL)
  until at least 3 times
  return determined user categories

  check input file F_input containing UI in either .arff or Excel format
  if file compatible then
      input file to WEKA tool
      select SPA algorithm
      maximalPatterns(0.15, UI)
  end
  repeat
      maximalPatterns(0.15, UI)
  until at least 3 times
  return determined maximal patterns (all lengths)

  add to KRep(determined user categories, determined maximal patterns (all lengths))

7.4 Experiments

The aim of the experimental evaluation is to evaluate the effectiveness and accuracy of the pattern-mining techniques for each step of the framework. Table 7.2 gives the set of hypotheses based on the UCD guidelines associated with the framework, together with the experiments performed to test these hypotheses.

Pre-Study and End-User Responses

Before evaluating the proposed framework, a study with the actual users of a clinical application system was undertaken. The aim is to critically analyze the strengths and shortcomings expected from the proposed approach.
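The control flow of Algorithm 5 can be sketched abstractly. The actual framework drives WEKA's ID3 and an SPA miner; both are replaced here by trivial stand-ins, so this only illustrates how the three stages feed the KRep:

```python
from collections import Counter

def classify_users(user_records):
    # Stand-in for the WEKA ID3 step: map each user id to a category.
    return {u["id"]: u["label"] for u in user_records}

def maximal_patterns(sequences, min_support=0.15):
    # Stand-in for the SPA step: keep whole sequences whose relative
    # frequency reaches the support threshold used in Algorithm 5.
    counts = Counter(tuple(s) for s in sequences)
    n = len(sequences)
    return [p for p, c in counts.items() if c / n >= min_support]

def run_framework(user_records, interactions):
    # interactions: list of (user_id, access_sequence) pairs.
    krep = []
    categories = classify_users(user_records)                 # step 1
    for cat in sorted(set(categories.values())):              # step 2
        seqs = [s for uid, s in interactions if categories[uid] == cat]
        for pattern in maximal_patterns(seqs):                # step 3
            krep.append((cat, pattern))                       # correlate
    return krep

user_recs = [{"id": 1, "label": "gp"}, {"id": 2, "label": "gp"}]
access_logs = [(1, ["history", "exam"]), (2, ["history", "exam"])]
print(run_framework(user_recs, access_logs))
```

Mining per category rather than over the pooled logs is the point of step 2: a pattern frequent among general physicians should not be diluted by the accesses of, say, pharmacists.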
As part of the study, the needs of the actual users were identified, and they were consulted for suggestions about the proposed framework. A group of 15 clinicians working in city hospitals in Delhi and Bangalore (India) and in New Jersey (USA) were invited for an on-line study.

Table 7.2: Hypotheses for the UCD guidelines to address the usability issues in the standardized EHRs database.

S.No. | UCD Guideline | Hypothesis | Experiment
H1 | Understand the end-users | The user attributes impact user-classification and preferred work-flows. | Find key attributes to classify the users accurately.
H2 | Understand the user-system interactions | Frequent and outlier patterns in the user work-flows indicate the useful and obsolete features of the application flow. | Find (1) frequently accessed EHR features, (2) consecutive feature sequences, (3) maximal patterns in the frequent access sequences.
H3 | Understand user preference for the UI and application flow | A clinical application based on standardized EHRs should be easy-to-use for the end-users. | (1) Correlate maximal patterns and user categories, (2) store rules in the knowledge repository with their associated support for generating optimal application flows.
H4 | Temporal updation of evolving user needs and work-flows | Standardized EHRs represent life-long records which keep evolving with time. | Periodic updation of the knowledge repository (KRep).

A questionnaire (A.6) related to (a) the need for EHR systems in the hospitals, (b) the features of the existing system utilized, and (c) manual usability studies was presented. The participating clinicians belong to the age group of years and work in the roles of general practitioner, internal medicine specialist and dentist. Table 7.3 presents the characteristics of the users, their demographics and their everyday tasks. Table 7.4 summarizes the key responses of the users important for the evaluation of the study. The responses highlight usability barriers such as errors and problems in using clinical applications, manual methods of error-reporting, and the gap between the existing system's application flow and the complexity of practicing work-flows.
Further, the need for the involvement of clinicians in the design and system enhancement process was supported. The clinicians preferred an automated system capable of understanding their expectations and work-flows over manual studies. The study also highlighted the lack of awareness among users about health information standards.

Post-Study Experimental Evaluation

A small-scale usability study is performed to demonstrate the usefulness of the proposed framework. The experiments are performed keeping in view the goals stated in Table 7.2. The datasets and tools used for evaluation are described in the following subsections.

7. AUTOMATED USABILITY FRAMEWORK FOR THE STANDARDIZED EHRS

Table 7.3: Day-to-day clinical tasks for the surveyed categories of clinicians.

Specialization | Qualification | Organizational Setting | Geographical Location | Everyday Clinical Tasks
Internal Medicine | MD, MHA | City Hospital | India (Delhi, Bangalore) | Patient progress, clinical decision-support, discharge summaries and patient summaries
Dentist | BDS, MDS (Cons and Endo) | City Hospital | India (Delhi, Bangalore) | Diagnosis, treatment planning, patient medication, procedures performed and patient follow-up
General Physician | MBBS | City Hospital | New Jersey (USA) | Patient progress, medication, patient summary and follow-up

Dataset Preparation

The dataset creation is constrained by the non-availability of EHRs database systems based on the openEHR standard. For the evaluation, two datasets are generated. The first dataset describes the end-users, their characteristics and demographic attributes. The second dataset describes the EHR database features. The tasks considered in this dataset are comparable to the responses of the surveyed clinicians and the existing literature (OpenVista, 2012), (OpenEMR, 2012), (Opereffa, 2011), (Zhang and Muhammad F., 2011), (Zheng et al., 2009).

The Users dataset consists of 125 randomly ordered records with 10 attributes (Table 7.12 gives a snippet of the dataset). The first attribute represents the expertise of the user (whether the user is an expert or a trainee); the second represents the age. The location attribute gives the location of the health-care organization (a city or a rural area). The setting attribute describes the work environment of the user, such as a college, hospital, university, home or office. The subsequent attributes describe the sex, qualification, computer literacy, purpose, department and user-label (class) of the user. The departments considered include admin, front office, primary care, nursing, pharmacy, education, procedure-based care and non-procedure-based care.
The class labels Physician, Specialist, Student, Researcher, Billing Agent, Front Desk, Pharmacist and Nurse are assigned. The Work-flows dataset is indicative of the nature of the access sequences pursued by a clinician using a standardized EHRs database system. Table 7.5 lists the common EHR features and their abbreviations (used in the datasets). The features in Table 7.5 are adapted from non-standard, widely used EHR systems (OpenVista, 2012), (OpenEMR, 2012), (Zhang and Muhammad F., 2011), (Zheng et al., 2009). The corresponding archetypes are downloaded from the Clinical Knowledge Manager, analyzed and adopted for the development of the AQBE EHR system (CKM, 2012), (Madaan et al., 2013). Each record of the dataset represents a patient encounter. The work-flows are representative of the system access-sequences of a general physician, an internal medicine specialist and a dentist working in a local city hospital. The dataset plays a key role in determining the application flow of the EHR database system. It is generated qualitatively, by summarizing and combining the data collected from the analysis of the features of an EHR system (Zhang and Muhammad F., 2011), (Zheng et al., 2009).

Table 7.4: Summary of the key responses of the pre-study with clinicians.

S.No. | Key Point | User Response
1 | Use of a clinical application for medical activities | 60% (Yes)
2 | Awareness of the standards for the EHRs (HL7, CEN 13606 and openEHR) | 20% (Yes)
3 | Errors encountered (very frequently) | 100% (Yes)
4 | Possibility to recall errors encountered during manual usability studies | 100% (No)
5 | Agreement that the user characteristics and demographics affect their work-flows | Nearly all agree
6 | Agreement on the need for alignment of the application flow with user work-flows | All agree
7 | Agreement on the need to involve end-users during the design and improvement of clinical applications | All agree
8 | Agreement that the existing systems are easy-to-use | Nearly all disagree
9 | Agreement on the need for efficient systems to automatically capture user-needs | All agree

The relevant features and clinical tasks collected from the non-standard, open-source EHR systems (OpenEMR, 2012), (OpenVista, 2012) are used and arranged as transactions (or access sequences) per task. The transactions are further modified to use the available archetypes and to formulate the fabricated dataset. The distribution of the features of the EHRs system is similar to the actual usage patterns pursued by the clinicians. This was verified by the clinicians during the pre-study, by confirming the user-category, the task aimed at (general clinical tasks of patient visit, medication and diagnosis) and the work-flow pursued to achieve it. The original character of the data has been preserved. The user-ids are assigned randomly to each transaction, assuming the users belong to three categories: general physician, internal medicine specialist and dentist. The analysis assumes the transactions belong to distinct sets of events at different time instances. Each transaction identifier (1000, 1001, ...) represents an occurrence of a user-system interaction and is assigned sequentially to the rows in the dataset.
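The fabrication procedure described above (per-category template work-flows, random replication and shuffling, and sequential transaction identifiers starting at 1000) can be sketched as follows. The template sequences and category names are illustrative assumptions, not the study's actual data:

```python
import random

# Feature abbreviations from Table 7.5.
FEATURES = ["HPI", "AP", "FH", "SH", "DIAG", "PE", "LT",
            "PROC", "VACC", "MED", "MSE", "RS", "OT"]

# Illustrative template access sequences per user category.
TEMPLATES = {
    "general_physician": ["HPI", "AP", "DIAG", "AP", "PE", "RS"],
    "internal_medicine": ["HPI", "DIAG", "AP", "LT", "PE", "RS"],
    "dentist":           ["PE", "DIAG", "AP", "PROC", "MED", "RS"],
}

def fabricate_workflows(n_records, seed=42):
    """Replicate and shuffle template work-flows into a transaction
    dataset; each row is (transaction_id, user_category, sequence)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_records):
        category = rng.choice(list(TEMPLATES))
        sequence = list(TEMPLATES[category])
        # Small random perturbation to mimic per-encounter variation.
        if rng.random() < 0.3:
            sequence.insert(rng.randrange(len(sequence)),
                            rng.choice(FEATURES))
        rows.append((1000 + i, category, sequence))
    return rows

dataset = fabricate_workflows(425)
```

Replicating to 425 rows mirrors the dataset size used in the experiments; the perturbation step stands in for the qualitative variation introduced by hand in the study.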
A set of 210 records is created using extrapolation and the incorporation of the responses of the clinical experts and end-users. A dataset of 425 records is formulated by replicating these records and shuffling them randomly. The records are further re-arranged chronologically based on the corresponding timestamps.

Experimental Method

The evaluation is performed using the WEKA 3.6 (stable version) data-mining software, written in Java, installed on a Windows 7, 64-bit machine. WEKA requires Java 1.4 or later. The RunWEKA.ini file is modified to set the CLASSPATH to configure WEKA on the system. WEKA is a collection of machine-learning algorithms for data-mining tasks; the algorithms can be applied directly to a dataset. For the evaluation, the datasets described above are used. The datasets are formatted in the .arff format. A set of statistical experiments is performed on both of the datasets to prove the hypotheses stated in Table 7.2. All the experiments and analyses can be recreated using the data files as input to the WEKA tool. The algorithms and input parameters used for the experiments are described in the following sections.
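For reference, ARFF files of the kind WEKA consumes can be produced with a few lines of code. This is a minimal sketch of the format (the relation, attribute and value names below are illustrative, not the study's actual schema):

```python
def to_arff(relation, attributes, rows):
    """Serialize nominal-attribute records into WEKA's ARFF format.
    `attributes` maps attribute name -> list of nominal values."""
    lines = ["@relation " + relation, ""]
    for name, values in attributes.items():
        lines.append("@attribute %s {%s}" % (name, ",".join(values)))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(row))
    return "\n".join(lines)

arff = to_arff(
    "users",
    {"expertise": ["expert", "trainee"],
     "location":  ["local", "cosmopolitan"],
     "UserLabel": ["Physician", "Specialist", "Nurse"]},
    [("expert", "local", "Physician"),
     ("trainee", "cosmopolitan", "Nurse")],
)
print(arff)
```

The `@relation`, `@attribute` and `@data` sections are the core of the ARFF header; nominal attributes list their domain in braces, which matches how the Users snippet in Table 7.12 is presented.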

Table 7.5: Abbreviations of the common features of a clinical application (adapted from (Zheng et al., 2009)).

S.No. | EHR System Feature | Abbreviation
1 | History of Present Illness | HPI
2 | Assessment and Plan | AP
3 | Family History | FH
4 | Social History | SH
5 | Diagnosis (Problem List) | DIAG
6 | Physical Examination | PE
7 | Laboratory Test | LT
8 | Procedure | PROC
9 | Vaccination | VACC
10 | Medication | MED
11 | Medication Side Effects | MSE
12 | Review of Systems | RS
13 | Office Test | OT

Table 7.6: Accuracy of user-classification using the ID3 algorithm (Users dataset).

Result | Instance Count | Percentage
Correctly Classified Instances | 120 | 96%
Incorrectly Classified Instances | 5 | 4%

The Users dataset is split into 66% training records and 34% test data. The Work-flows dataset is split into 50% training records and 50% test data. This partition ensures that the classifier is well-trained to capture the variation in the attribute values.

Performance Evaluation

The performance of the framework is evaluated by quantitative and qualitative studies.

Quantitative Evaluation

For optimum decision-tree creation for the various users, the ID3 decision-tree algorithm (Quinlan, 1986) is used. The Users dataset is used for training the classifier and for the subsequent classification. The Users dataset is input into the WEKA tool (University of Waikato, 2011). Eight test runs are performed on the classifier using different attributes, and the best results are chosen. A decision tree for user-classification is constructed based on seven attributes, namely department, location, sex, computer knowledge, purpose, expertise and qualification. We assume that the variation in age does not cause a significant deviation in classification; hence the age attribute is skipped, as the ID3 algorithm works only on nominal attributes. Figure 7.12 represents a snapshot of the decision tree constructed by the algorithm.
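At each node, ID3 selects the nominal attribute with the highest information gain. A minimal sketch of that selection criterion (the toy records are illustrative, not rows of the actual Users dataset):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(rows, attr, target):
    """Reduction in entropy of `target` after splitting on `attr`;
    ID3 picks the attribute with the highest gain at each node."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Toy records in the spirit of the Users dataset (values illustrative).
rows = [
    {"department": "primarycare", "location": "local",        "UserLabel": "Specialist"},
    {"department": "primarycare", "location": "cosmopolitan", "UserLabel": "Physician"},
    {"department": "pharmacy",    "location": "cosmopolitan", "UserLabel": "Pharmacist"},
    {"department": "pharmacy",    "location": "local",        "UserLabel": "Pharmacist"},
]
gains = {a: information_gain(rows, a, "UserLabel")
         for a in ("department", "location")}
```

On these four records, splitting on `department` separates the pharmacists perfectly and therefore yields a higher gain than splitting on `location`, which is the kind of comparison ID3 makes when growing the tree in Figure 7.12.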
From the figure, it can be interpreted that for the Users dataset, the internal medicine (preventive-care) specialists working in a local hospital are well-versed with the use of computers and are mostly men. They are associated with the primary-care division of their organizations.

Figure 7.12: Decision tree corresponding to the Users dataset with user-labels as the class attribute.

Figure 7.13: Variation in the accuracy (%) of the user-categorization based on the attribute chosen as the classifying attribute.

A physician, in contrast, is associated with a cosmopolitan setting and the primary-care division of the health-care organization. These results are in accordance with the expectations obtained from the surveyed clinicians. The ID3 algorithm exhibits high accuracy: of the total 125 records, only 5 instances are wrongly classified (Table 7.6). Figure 7.13 displays the variation in the percentage accuracy of classification with various attributes chosen as the classifying attribute. It depicts how the various user-characteristics participate in user-classification and influence the categorization of the users into distinct class labels. From the experiments conducted, the UserLabel attribute (as the classifying attribute) gives the best results for the user-classification considering granular-level details (Hypothesis 1, Table 7.2).

The frequency of use of the features of the EHRs system is discovered using the Work-flows dataset. Figure 7.14 represents the frequency of use of the sample EHR features. It depicts that the EHR features procedure (PROC), vaccination (VACC) and quality report (QR) are the least accessed, with 3% frequency, whereas the assessment plan (AP) feature is the most frequently accessed, with 25% frequency. Hence, the AP feature should be included in the application-flow of the associated user-categories (Hypothesis 2, Table 7.2).

Figure 7.14: Variation in the use of the standardized EHR database features (used by a clinician) as a function of frequency of use.

Table 7.7: Results of the maximal sequences of varying length w.r.t. the work-flows of the clinicians.

5-sequences: <AP,DIAG,DIAG,DIAG,RS>; <HPI,AP,DIAG,AP,RS>; <HPI,AP,DIAG,PE,RS>
6-sequences: <HPI,DIAG,AP,DIAG,PE,RS>; <HPI,DIAG,AP,AP,PE,RS>; <AP,DIAG,AP,DIAG,AP,AP>
7-sequences: <AP,DIAG,DIAG,DIAG,AP,PE,RS>; <AP,DIAG,AP,DIAG,AP,PE,RS>; <HPI,AP,DIAG,AP,AP,PE,RS>
8-sequences: <AP,DIAG,AP,DIAG,DIAG,AP,PE,RS>; <AP,DIAG,AP,AP,DIAG,AP,PE,RS>; <DIAG,AP,DIAG,AP,DIAG,AP,PE,RS>
9-sequences: <HPI,AP,DIAG,AP,AP,DIAG,AP,PE,RS>; <AP,DIAG,AP,DIAG,AP,DIAG,AP,PE,RS>
10-sequences: <HPI,AP,DIAG,AP,DIAG,AP,DIAG,AP,PE,RS>

To determine high-frequency patterns, a support of 0.15 is set. (The support for a sequence is defined as the fraction of total customers (clinicians) who support this sequence in their work-flow to perform a clinical task (Agrawal and Srikant, 1995).) The 0.15 support threshold is used to capture the frequent as well as the outlier patterns in the Work-flows dataset: in a dataset of 425 records, any sequence which occurs in at least 63 records is considered frequent. The thresholds 0.05, 0.15 and 0.25 were considered; test runs were performed with each, and the number of itemsets and the frequency of patterns of each length were recorded, with similar results for all. With a threshold of 0.05, some of the possible outlier patterns are missed, whereas with a minimum threshold of 0.25, some of the reasonably frequent sequences are treated as outliers; hence the choice of 0.15 was made for the experiments of the study.

Clinicians have different work-flows for tasks (represented as access-sequences in the Work-flows dataset). The system providers, on the other hand, need to align the application flow to the most optimal pattern (the maximal pattern discovered) to make the EHR system usable. Table 7.7 displays the frequent patterns in the user work-flows grouped by their length (patterns of length 5 to 10 are given). The maximum length of the sequential patterns is 10 for the Work-flows dataset. The table demonstrates that the pattern <HPI, AP, DIAG> (<History of Present Illness, Assessment Plan, Diagnosis>) is the most frequent pattern across the varying pattern lengths. This pattern occurs with 80% frequency. The work-flows containing this subsequence are further analyzed to obtain the maximal patterns. The work-flows with the features QR, PROC and VACC may be considered obsolete and de-emphasized in the application flow presented to the users (Hypothesis 2, Table 7.2).

Table 7.8: From the Work-flows dataset: most frequent consecutive feature accesses (forming the higher-order maximal patterns for knowledge discovery) and their level of support.

2-sequences: <DIAG,AP>; <AP,DIAG>; <AP,PE>; <DIAG,DIAG>; <PE,RS>; <HPI,AP>
3-sequences: <AP,DIAG,AP>; <DIAG,AP,DIAG>; <DIAG,AP,PE>; <HPI,AP,DIAG>; <AP,DIAG,DIAG>; <AP,PE,RS>; <DIAG,DIAG,AP>

Table 7.8 displays the frequently consecutively accessed features of the EHRs database along with their recurrence rate and support. These are the subsequences considered to obtain the maximal (optimal) patterns. It is evident from the table that a user accessing the diagnosis (DIAG) feature uses the assessment plan (AP) with 21.4% frequency. The components <AP, DIAG> and <HPI, AP> of the frequent subsequence <HPI, AP, DIAG> occur with 19% and 4.5% frequency, respectively.
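The per-clinician support of a consecutive feature subsequence, as used above, can be computed as follows (a minimal sketch; the sample work-flows are illustrative):

```python
def support(pattern, workflows):
    """Fraction of work-flows containing `pattern` as a consecutive
    subsequence (the per-clinician support used in the study)."""
    k = len(pattern)
    hits = sum(
        any(seq[i:i + k] == pattern for i in range(len(seq) - k + 1))
        for seq in workflows
    )
    return hits / len(workflows)

workflows = [
    ["HPI", "AP", "DIAG", "AP", "PE", "RS"],
    ["AP", "DIAG", "AP", "DIAG", "AP", "PE"],
    ["PE", "PROC", "VACC", "LT", "DIAG", "AP"],
    ["AP", "MED", "SH", "FH", "LT", "DIAG"],
]
print(support(["DIAG", "AP"], workflows))
```

On the four sample flows, <DIAG, AP> appears in three, giving a support of 0.75; scaling the same computation to the 425-record dataset yields support values comparable to those discussed in the text.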
Hence, they form a maximal sequence <HPI, AP, DIAG> of length 3 (Table 7.8). (An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (Bayardo Jr., 1998).) The frequent maximal and obsolete patterns of the work-flows are determined at this step by the above analysis (Hypothesis 2, Table 7.2). Figure 7.15 compares the lengths of the interesting patterns discovered. It shows the number of subsequences considered at each length as a candidate set for the higher-order subsequences. The length-2 and length-3 patterns in Table 7.8 represent the consecutive patterns with their respective support thresholds. These are analyzed further to obtain frequent patterns of each length, until a pattern of length 10 is discovered. A full-fledged implementation of an EHR system will support a large number of tasks; as a result, a larger number of longer access sequences

Figure 7.15: A comparison between the length of the interesting patterns discovered and the frequency of their access (Work-flows dataset).

might be pursued by the users, as compared to the considered Work-flows dataset. Figure A.5 (Appendix) shows an example list of archetypes required by a single EHR feature, heart failure summary, in the European Union Semantic Health Net project (EU-SHN) (CKM, 2012). As shown in the figure, a single UI form has around 15 participating archetypes (features). Hence, the number of features in the complete system will be very large.

Table 7.9: Qualitative overview of the influence of the distribution of EHR features in the user work-flows in the analytical studies.

S.No. | Tasks | Sequential EHR Features | Support Threshold | Expected Frequent Patterns | Number of Rules in Knowledge Repository
1 | Variable | Single | Low | Few patterns, high frequency | Few
2 | Similar | Single | High | More patterns, low frequency | More
3 | Similar | Distinct | Low | More patterns, low frequency | More
4 | Variable | Distinct | High | More patterns, low frequency | More

Table 7.9 gives a qualitative overview of the possible variation in the resulting maximal frequent patterns due to the variation in the EHR features accessed w.r.t. the tasks. The table shows that, by altering the minimum support threshold, the proposed framework can handle various distributions of access sequences (EHR features) to obtain maximal patterns. The expected numbers of rules and frequent patterns are given based on the execution of the sequential

pattern mining on snippets of the Work-flows dataset (extracting transactions based on tasks and sequences) using the WEKA tool (University of Waikato, 2011). The key feature that reduces overhead is that the knowledge repository (KRep) of the usability support infrastructure contains only maximal sequences (rather than huge volumes of usage logs). These maximal patterns and the associated users form the rules in the knowledge repository. Considering the consecutive feature accesses <DIAG, AP> (Table 7.8), with a frequency of 91 occurrences, the support of the pattern is approximately 0.21 (for a total of 425 sequences of feature accesses), which is greater than the support threshold of 0.15. This means that 21% of the time, the DIAG feature leads the assessment plan (AP) feature in the given sample dataset, whereas the assessment plan (AP) feature leads the DIAG feature in approximately 19% of the user-system interactions. Features with support greater than the predefined threshold represent the features preferred by the physicians while performing various tasks. On the other hand, the pair <HPI, AP> has a frequency of 19, which represents a support of 0.04, below the predefined support threshold; hence, these consecutive features are not frequently preferred by the users (physicians). Using this analysis, the system designer can customize the application flow to match the application flow perceived (preferred) by the users (Hypothesis 3, Table 7.2). A sample task, General Patient Checkup, its corresponding work-flows and maximal patterns of distinct lengths are given in Table 7.10. A subset of 10 work-flows is considered for the task as an example to show the effectiveness of step two of the framework. The table shows the run information from the WEKA tool. A support threshold of 0.15 was used for the analysis.
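The level-wise search for frequent consecutive access patterns, followed by the extraction of the maximal ones, can be sketched as follows. This is a simplified, GSP-style sketch rather than WEKA's implementation, and the sample flows are illustrative:

```python
def frequent_consecutive_patterns(workflows, min_support):
    """Level-wise (GSP-style) search for consecutive access patterns
    whose support meets `min_support`; returns only the maximal ones
    (those with no frequent super-pattern containing them)."""
    def sup(p):
        k = len(p)
        hits = sum(any(tuple(s[i:i + k]) == p for i in range(len(s) - k + 1))
                   for s in workflows)
        return hits / len(workflows)

    items = {(f,) for s in workflows for f in s}
    frequent = []
    level = {p for p in items if sup(p) >= min_support}
    while level:
        frequent.extend(sorted(level))
        # Extend each frequent pattern by one feature on the right;
        # support is monotone for consecutive subsequences.
        level = {p + (f,) for p in level for (f,) in items
                 if sup(p + (f,)) >= min_support}

    def contained(a, b):
        return len(a) < len(b) and any(
            b[i:i + len(a)] == a for i in range(len(b) - len(a) + 1))

    return [p for p in frequent
            if not any(contained(p, q) for q in frequent)]

flows = [["HPI", "AP", "DIAG", "AP", "PE", "RS"]] * 8 + \
        [["AP", "MED", "SH", "FH", "LT", "DIAG"]] * 2
maximal = frequent_consecutive_patterns(flows, min_support=0.5)
```

With eight identical checkup flows and two pharmacy-style flows, only the full checkup sequence survives the maximality filter; every shorter frequent pattern is contained in it, which is exactly why storing only maximal patterns keeps the knowledge repository small.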
Since a large number of itemsets of different lengths are generated, varying from length 1 to 10 (considering a 10-feature EHR system), only a few of the frequent patterns, of lengths 2, 3, 9 and 10, are given. Based on the sample, optimal work-flows (of length 10) are obtained; these are stored in the knowledge repository (KRep). An EHR system designer can now refer to this knowledge (the optimal paths) to understand the desired work-flows and realign the application flow accordingly, rather than analyzing the raw usage logs, which may vary from user to user even for a single task. The analysis considers all the combinations of (consecutive) features to obtain the maximal patterns. Hence, in a real-world setting, a far larger number of logs would need to be analyzed by a system designer; the proposed framework can reduce this overhead significantly and provide effective results.

Task: General Patient Checkup. Work-flows (sample set = 11 rows):
1,HPI,AP,DIAG,AP,DIAG,AP,DIAG,AP,PE,RS,1000
1,HPI,AP,DIAG,LT,AP,MSE,DIAG,AP,PE,AP,1001
1,HPI,AP,DIAG,AP,DIAG,AP,DIAG,AP,PE,RS,1002
1,PE,PROC,VACC,LT,DIAG,AP,MSE,QR,AP,MSE,1003
1,PE,PROC,VACC,LT,DIAG,AP,MSE,QR,AP,MSE,1004
1,AP,MED,SH,FH,LT,DIAG,AP,MSE,RS,MED,1005
1,AP,MSE,SH,FH,LT,DIAG,AP,MSE,RS,MED,1006
1,AP,MED,SH,FH,LT,DIAG,AP,MSE,RS,MED,1007
1,AP,MSE,SH,FH,LT,DIAG,AP,MSE,RS,MED,1008
1,HPI,AP,DIAG,AP,DIAG,AP,DIAG,AP,PE,RS,1009
1,HPI,AP,DIAG,LT,AP,MSE,DIAG,AP,PE,AP,1010

Run information (WEKA tool):
Scheme: weka.associations.GeneralizedSequentialPatterns -S 0.15 -I 0 -F -1
Relation: sequential test set

Instances: 11
Attributes: 12 (user, Feature1-Feature10, Timestamp)
Associator model (full training set): GeneralizedSequentialPatterns
Number of cycles performed: 10
Frequent sequence details (filtered), grouped by length, are given below.

Length 2, frequent patterns (90 patterns in total): HPI,AP; AP,MED; AP,SH; MED,SH; HPI,DIAG; AP,DIAG; AP,AP; DIAG,AP; AP,FH
Length 3, frequent patterns (240 patterns in total): HPI,AP,DIAG; HPI,AP,AP; HPI,AP,PE; HPI,AP,RS; AP,MED,SH; AP,MED,FH
Length 9, frequent patterns (20 patterns in total): HPI,AP,DIAG,AP,DIAG,AP,DIAG,AP,PE; HPI,AP,DIAG,AP,DIAG,AP,DIAG,AP,RS; HPI,AP,DIAG,AP,DIAG,AP,DIAG,PE,RS

Length 9, frequent patterns (continued): HPI,AP,DIAG,AP,DIAG,AP,AP,PE,RS; HPI,AP,DIAG,AP,DIAG,DIAG,AP,PE,RS; HPI,AP,DIAG,AP,AP,DIAG,AP,PE,RS; HPI,AP,DIAG,DIAG,AP,DIAG,AP,PE,RS; HPI,AP,AP,DIAG,AP,DIAG,AP,PE,RS; AP,MED,SH,FH,LT,DIAG,AP,MSE,RS; AP,MED,SH,FH,LT,DIAG,AP,MSE,MED

Length 10, optimal paths: HPI,AP,DIAG,AP,DIAG,AP,DIAG,AP,PE,RS; AP,MED,SH,FH,LT,DIAG,AP,MSE,RS,MED

Table 7.10: Sample task (General Patient Checkup): work-flows, run information, and the frequent and maximal patterns discovered.

The knowledge repository is an online support base for the standardized EHRs database; it can be updated periodically (weekly or fortnightly), depending upon the usage of the system and the system designer. The rules associated with a user-class, and their support, are updated depending on the user-categorization and the work-flow analysis (an example is given in Section 4.3). For the considered datasets, the work-flow subsequence <DIAG, AP> is stored with a support of 0.21 for the user-class physician, and the values (set) of the socio-technical features used for user-classification are also stored. The periodic analysis to maintain an up-to-date KRep may be performed fortnightly. It requires updating the support of existing work-flow subsequences and adding previously absent subsequences. These rules are easily comprehensible by the system designer for system redesign and enhancement, as compared to performing usage-log analysis (Hypothesis 4, Table 7.2).

Qualitative Evaluation

Table 7.11 presents a qualitative study of off-line usability support studies using the measures of effectiveness, efficiency and satisfaction. For each task, a set of expert and trainee users is considered (assuming that the expertise of the user directly impacts his or her task performance). For the trainee users, the effectiveness with which a task is performed is defined as the goal of accomplishing the task, irrespective of the sequence chosen or the number of clicks required.
On the other hand, for the expert users it is defined as whether the task is accomplished using the optimal path and fewer clicks. For the tasks in the procedural and non-procedural settings, it is defined as whether the task is accomplished or not. The efficiency of the system is compared w.r.t. the tasks, as the successful completion of the task in a minimal amount of time. The satisfaction is the rating given by the user to the system while performing the task ("Easy", "Very Easy" or "Difficult").

S.No. | Task | Satisfaction
1 (a) | Record Patient Demographics (Trainee) | Easy-to-Very-Easy
1 (b) | Record Patient Demographics (Expert) | Very-Easy

2 (a) | General Patient Checkup (Trainee) | Easy-to-Very-Easy
2 (b) | General Patient Checkup (Expert) | Very-Easy
3 (a) | Identify and investigate medical problem (Procedural Setting) | Easy-to-Very-Easy
3 (b) | Identify and investigate medical problem (Non-procedural Setting) | Easy-to-Very-Easy
4 (a) | Discuss treatment with patient (Trainee) | Easy-to-Very-Easy
4 (b) | Discuss treatment with patient (Expert) | Very-Easy
5 (a) | Provide clinical summary of visit (Procedural Setting) | Easy-to-Very-Easy
5 (b) | Provide clinical summary of visit (Non-procedural Setting) | Easy-to-Very-Easy
6 (a) | Prescribe medication (Procedural Setting) | Easy-to-Very-Easy
6 (b) | Prescribe medication (Non-procedural Setting) | Easy-to-Very-Easy

Table 7.11: Relative comparison of the improvement in task performance using qualitative measures.

A heuristic comparison is done based on the responses of the clinical users invited for the pre-study. It is evident from the comparison (Table 7.11) that these qualitative measures improve significantly with the application of the proposed framework.

Note. The requirements and assumptions considered in the proposed framework are in agreement with the responses of the surveyed clinicians. Hence, the framework is expected to show minimal or no deviation in actual environments (larger and broader application scenarios).

7.5 Discussions

The given experimental results highlight the achievements of the proposed framework according to the hypotheses stated in Table 7.2. They evaluate the application of the pattern-discovery

techniques for the primary purpose of improving the usability of the EHRs databases. Figure 7.16 depicts the process for improved interactions between a user and the standardized EHRs database through the use of on-line feedback for usability enhancement (available through the knowledge repository).

Figure 7.16: Enhanced decision making with the support of the usability support study (framework).

The rules captured in the knowledge repository represent the correlation between the user-categories and their work-flows. The EHR designer can use these rules to anticipate and design the user-preferred application flow. The templates can be added, modified or removed accordingly. Hence, the interface can be redesigned as per the end-user expectations. For example, new archetypes or EHR extracts may be included in the template of a user-interface form for different specializations.

Applicability of the Framework

The framework is generic in nature and is used to automate usability analysis in large EHR systems. It is applicable to any standard or non-standard EHR system. The major standards for the EHRs (HL7, CEN 13606 and openEHR) follow a dual-level modeling approach; therefore, the framework can readily be applied to EHR systems based on any of these standards. In the case of the openEHR standard, the archetypes can be designed from scratch or adapted from preexisting ones. However, the micro-details of the application, such as the features provided by the system and the usability concerns, need to be understood and implemented. Primarily, the framework needs complete details of the end-users: their demographic data, other characteristic data, their usage logs and the application flow provided by the system.
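A minimal sketch of such a knowledge repository of rules follows (the class and method names are illustrative assumptions; the support values echo the <DIAG, AP> example from the experiments):

```python
from dataclasses import dataclass

@dataclass
class UsabilityRule:
    user_class: str   # e.g. "Physician"
    pattern: tuple    # consecutive feature accesses, e.g. ("DIAG", "AP")
    support: float    # fraction of work-flows containing the pattern

class KnowledgeRepository:
    """Sketch of the KRep: rules keyed by (user-class, pattern),
    refreshed periodically from a new round of pattern mining."""
    def __init__(self):
        self.rules = {}

    def update(self, user_class, pattern, support):
        # Add a new rule or refresh the support of an existing one.
        key = (user_class, tuple(pattern))
        self.rules[key] = UsabilityRule(user_class, tuple(pattern), support)

    def preferred_patterns(self, user_class, min_support):
        # What a system designer consults instead of raw usage logs.
        return [r for r in self.rules.values()
                if r.user_class == user_class and r.support >= min_support]

krep = KnowledgeRepository()
krep.update("Physician", ("DIAG", "AP"), 0.21)  # values from the study's example
krep.update("Physician", ("HPI", "AP"), 0.04)
top = krep.preferred_patterns("Physician", min_support=0.15)
```

Because `update` overwrites the entry for an existing (user-class, pattern) key, the fortnightly refresh described in the text amounts to re-running the miner and calling `update` for each resulting rule.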
The framework performs analysis on the user-data and the usage logs w.r.t. the features provided by the EHR system. The rules in the knowledge repository (KRep) are coupled with the results of the pattern mining and depend on the system to which the framework is applied. The EHR system designer does not need to query the end-users or analyze system logs; rather, he or she only needs to refer to the rules stored in the knowledge repository. Different archetypes are aggregated by means of archetype templates, which also support the semi-automatic derivation of user interfaces. As explained in Sections 1 and 2, the archetypes and templates have a complex structure and store large amounts of data. The complexity is intensified further as they evolve over time w.r.t. the changing or new needs of the users. Hence, due to the volume of data and its complex structure, traditional usability methods

cannot sufficiently address the large-scale usability concerns in a temporally evolving environment. To address these concerns, the proposed automated framework facilitates EHR system enhancements.

Limitations of the Study. The study aims to improve the usability of the EHRs database system by automating the iterative process of understanding the end-users, their tasks and their work-flows using analytical tools, thus eliminating the need for surveys, questionnaires and field studies for improving system usability. However, the actual use of the system's features can only be analyzed in a real-world setting. The participating archetypes and templates in an EHR database system vary according to the environments in which they are deployed. The framework may show some deviations in a real-world setting, but these are not expected to deflect the purpose of automating usability improvement. The standardized EHRs create difficult problems as a result of the continuity of template revisions and the evolution of archetypes, and these occur at the cost of user ignorance: for example, when an archetype or a template is revised, the end-users are not aware of it. Making such revisions available for use by the end-users is a challenge for the usability studies, and new ways of capturing work-flows and usage are required in such complex scenarios. A large portion of the usability concerns has been addressed in this study, and further investigation is ongoing. For the case of deviating work-flows, the size of the problem may be as large as the one the proposed framework is trying to cope with. Understanding the deviating cases may involve algorithms to analyze whether the deviation in the flow is temporary or long-term, whether the features involved have become obsolete, and whether a feature should be removed or whether modifying it will be sufficient.
It may also be necessary to find the reasons for deviation and the impact on other features of the EHRs system.

An Example. Considering the results of the experiments performed (Table 7.7, Table 7.8 and Figure 7.12), the framework successfully identifies the class of users "internal medicine specialists" with the characteristics <Male, local-hospital, preventive-care, high computer-literacy, primary-care>. This category of users is associated with the frequent work-flow pattern <HPI, AP, DIAG>, which implies that the features History of Present Illness (HPI), Assessment Plan (AP) and Diagnosis (DIAG) are preferred consecutively by these users. The pattern, the user-information and the associated support are stored as a rule in the knowledge repository (KRep). Each of these features can be customized by updating, adding or deleting archetypes from the corresponding templates to improve data quality. The least accessed (low frequency of use) features, such as Quality Reports (QR) (Figure 7.14), can be removed from the application flow.

7.6 Automated Usability Enhancement Framework and e-Learning

Considering the complex structure and evolving nature of the standardized electronic health records (EHRs), there is a need to make the standardized EHR database systems adoptable and usable. The proposed framework can improve learnability for the end-users in a stepwise manner for the gradual adoption of these databases. The knowledge from the records of users, their characteristics, demographics and their interactions with the standardized EHRs database system can be recorded in an archive. The knowledge archive and the EHR repository can be considered Big Data archives. They can provide real-time insight and granular-level details for e-learning support. These help in removing commonly occurring work-flow errors, identifying the obsolete features of the EHR system, and providing an optimal (preferred) application flow.
Usage patterns adopted by the skilled users can be utilized to create e-learning

prompts for on-line training of semi-skilled and novice users of the health-care domain.

Table 7.12: A snippet of Users relation

Attribute domains:
- expertise {expert, ...}
- age
- location {local, ...}
- setting {clinical, hospital, university, home, ...}
- sex {M, ...}
- qualification {Bachelors, Masters, ...}
- computerknowledge {WebUser, Programmer, ...}
- purpose {preventive, diagnostic, research, ...}
- department {admin, frontoffice, primarycare, nursing, pharmacy, education, procedurebasedcare, ...}
- UserLabel {Physician, Specialist, Student, Researcher, BillingAgent, frontDesk, pharmacist, ...}

Sample tuples:
expert, 65, local, clinical, M, Masters, computerliterate, diagnostic, primarycare, Physician
expert, 60, local, hospital, F, Masters, computerliterate, diagnostic, primarycare, Physician
expert, 63, cosmopolitan, hospital, M, Masters, computerliterate, diagnostic, primarycare, Physician
expert, 50, cosmopolitan, hospital, M, more, computerliterate, diagnostic, primarycare, Physician
expert, 48, cosmopolitan, hospital, M, more, computerliterate, preventive, pharmacy, pharmacist

Objective

The international standards organization document ISO-9241 (ISO, 2012) defines usability as the effectiveness, efficiency, and satisfaction with which the intended users achieve their tasks in the context of product-use. Learnability is a sub-characteristic of usability. It can be defined as the extent to which a system can be used by end-users with effectiveness, efficiency, and satisfaction in a specified context-of-use (ISO, 2012). Hence, for a user-oriented system such as an EHRs database system, learnability and usability are key requirements (Zhang and Muhammad F., 2011). These can be addressed by considering the user-centered design (UCD) guidelines. The UCD principles include consideration of the users, tasks, context of use and usability goals (Kashfi, 2010). Understanding the correlation among the users, their characteristics and preferred workflows in a continuous manner can improve system learnability and usability.
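The classification step that groups user tuples such as those in Table 7.12 into categories can be sketched as a small rule-based classifier. The branching conditions and category labels below are illustrative stand-ins for the rules a decision-tree classifier would learn from such data; they are not the study's actual rules.

```python
def classify_user(u):
    """Assign a user category from demographic attributes.

    The thresholds and category labels are invented for illustration,
    mimicking the kind of rules a decision-tree classifier produces.
    """
    if u["expertise"] == "expert" and u["department"] == "primarycare":
        if u["setting"] == "hospital" and u["location"] == "cosmopolitan":
            return "city physician"
        return "local physician"
    if u["department"] == "pharmacy":
        return "pharmacist"
    return "other"

# Rows mirroring the Users-relation snippet of Table 7.12
users = [
    {"expertise": "expert", "age": 65, "location": "local",
     "setting": "clinical", "sex": "M", "department": "primarycare"},
    {"expertise": "expert", "age": 63, "location": "cosmopolitan",
     "setting": "hospital", "sex": "M", "department": "primarycare"},
    {"expertise": "expert", "age": 48, "location": "cosmopolitan",
     "setting": "hospital", "sex": "M", "department": "pharmacy"},
]
print([classify_user(u) for u in users])
# ['local physician', 'city physician', 'pharmacist']
```

Each resulting category can then be correlated with the frequent work-flow patterns mined for its members.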
Figure 7.17 describes the UCD guidelines w.r.t. the learnability and usability enhancements. The inner circle defines the UCD principles for system enhancements, which include design and implementation of the system based on user-feedback analysis. The outer circle describes the analogous steps from the proposed e-learning scheme. The scheme is expected to reduce the overhead of long training and e-learning sessions.

Figure 7.17: User-centered design guidelines and the proposed e-learning scheme to enhance usability.

Figure 7.18: Steps of the proposed e-learning scheme and the flow of continuous feedback to the system designer through the KArch for user-learnability enhancement.

e-Learning Goals: Various Dimensions of Learnability

The following learnability goals need to be considered for a standardized EHRs database system (Rafique et al., 2012).

1. Task-compliant and understandable interface - The UI features need to be understandable for data entry and query. In the case of the standardized EHRs database systems, archetype fields should be well-organized and the templates (user-forms) should be optimally organized for frequent user-tasks (with minimal errors and well-defined steps).
For example, to perform the task of preparing an assessment plan for a patient, the templates corresponding to view history of patient illness, examine patient, view medication side effects and prescribe medication should be organized sequentially for timely task completion.

2. User-Prompts - Any change in the participating templates or archetypes should be made available to users as notifications or prompts,

3. Predictability - An EHR database system should enable its users to know the details of its interactions, functionality and content. The system should be able to predict user-needs dynamically based on previous interactions, specialization and other characteristics of the user, and

4. Feedback Suitability - The degree to which information regarding the success, failure or awareness of actions is provided to help users interact with the EHR application. The system needs to be capable of capturing feedback about the task progress and completeness.

Figure 7.18 represents the steps of the e-learning scheme and the role of the KArch in providing continuous user-feedback to the EHRs database system designer. This aids in understanding the user-needs in real time. The recurring needs can be fulfilled by system enhancement.

Figure 7.19: Internal structure of the proposed Knowledge Archive (KArch). The rule repository stores rules such as R1: if age >= 30 and age <= 60 then if <setting = city hospital> and <specialty = Physician> then user_category = city doctor. The pattern repository holds three relations: Users (user_id numeric, category_id numeric), which maps user ids to their categories; User_categories (category_id numeric, age numeric, location categorical, setting categorical, computer_efficiency boolean, qualification categorical, specialization categorical, expertise categorical, workflow_id numeric), which defines the user attributes for a user-category; and Workflows (workflow_id numeric, workflow_pattern categorical, timestamp system_time), which defines the workflow sequences for a user-category.

Knowledge Archive (Repository)

The knowledge archive (KArch) contains the patterns discovered in the user-system interactions, which represent the preferred (optimal) user work-flows. These provide continuous feedback to the EHR system designer. The frequently accessed features per user-group and per task can be customized, and unused features (considering the time of last access) can be removed.
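The three relations of the knowledge archive shown in Figure 7.19 can be sketched with an in-memory SQLite database. The column names follow the figure; the SQL types and the sample values are invented for illustration and are only a stand-in for the actual KArch store.

```python
import sqlite3

# A minimal sketch of the KArch relations from Figure 7.19.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users (
    user_id     INTEGER PRIMARY KEY,
    category_id INTEGER REFERENCES User_categories(category_id)
);
CREATE TABLE User_categories (
    category_id         INTEGER PRIMARY KEY,
    age                 INTEGER,
    location            TEXT,
    setting             TEXT,
    computer_efficiency INTEGER,  -- boolean in the figure
    qualification       TEXT,
    specialization      TEXT,
    expertise           TEXT,
    workflow_id         INTEGER REFERENCES Workflows(workflow_id)
);
CREATE TABLE Workflows (
    workflow_id      INTEGER PRIMARY KEY,
    workflow_pattern TEXT,
    timestamp        TEXT DEFAULT CURRENT_TIMESTAMP
);
""")
conn.execute("INSERT INTO Workflows (workflow_id, workflow_pattern) "
             "VALUES (1, 'HPI,AP,DIAG')")
conn.execute("INSERT INTO User_categories (category_id, age, location, setting, "
             "computer_efficiency, qualification, specialization, expertise, workflow_id) "
             "VALUES (10, 45, 'local', 'hospital', 1, 'Masters', 'Physician', 'expert', 1)")
conn.execute("INSERT INTO Users VALUES (100, 10)")

# Preferred work-flow for a user: join Users -> User_categories -> Workflows
row = conn.execute("""
    SELECT w.workflow_pattern
    FROM Users u
    JOIN User_categories c ON u.category_id = c.category_id
    JOIN Workflows w ON c.workflow_id = w.workflow_id
    WHERE u.user_id = 100
""").fetchone()
print(row[0])  # HPI,AP,DIAG
```

The join mirrors the 1:1 user-to-category and category-to-workflow links of the figure; a system designer can query it to retrieve the preferred application flow for any user.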
As shown in Figure 7.19, each user has a 1:1 relationship with a user-category, and each user-category may be associated with multiple work-flows (1:n). The associations (correlations) are stored in the rule repository, and the details about the users and patterns can be extracted from the pattern repository. These rules are comprehensible by the EHR system designers and can be implemented to customize the system. The knowledge archive may be updated periodically (weekly/fortnightly/monthly) with newly discovered patterns in user work-flows. Hence, KArch provides the following capabilities:

1. Ability to identify the user category and the preferred work-flow sequences,

2. Ability to identify the outlier and frequent patterns from the preferred work-flows, and

3. A continuous feedback system capable of eliminating errors and minimizing usability failures.

Table 7.13: The Usability Challenges and e-Learning Support Scheme

1. Understandable Interface. Desirable improvement: optimal feature access sequence implementation. e-Learning scheme: discover maximal patterns (within an EHR feature) during a session for each user category. Users: skilled and semi-skilled users, with/without knowledge of archetypes.

2. Task Compliance. Desirable improvement: optimal number of steps for task completion. e-Learning scheme: discover maximal patterns (of EHR features) during a session for each user-task. Users: semi-skilled and skilled users, able to use the most optimal application flows that match their workflows.

3. Efficient Application Flow. Desirable improvement: optimal paths of features in the EHRs database system. e-Learning scheme: discover maximal patterns (of EHR features) during a session for each user-task and each user-category. Users: all users.

4. User-Prompts. Desirable improvement: any change in EHR version, template or archetype modification is notified. e-Learning scheme: outlier patterns discovered during SPA; archetypes added to user forms and application flow. Users: all users; users learn new features and template modifications.

5. Predictability. Desirable improvement: intuitive application flow as per preferred user requirements. e-Learning scheme: accurate user-work-flow associations. Users: novice and semi-skilled users, with no or little knowledge of archetypes and templates.

6. Feedback Suitability. Desirable improvement: continuous enhancement of the system. e-Learning scheme: continuous real-time user feedback. Users: all users, supported through customization of frequent features and removal of obsolete features.

Discussion: Qualitative Performance of the e-Learning Scheme

Table 7.13 gives the qualitative analysis and interpretations for the proposed scheme w.r.t. the learnability challenges, their implications on the system and the e-learning support.
The knowledge archive is capable of supporting users with varying understanding-levels. This aids in understanding the target users and helps them use the system in an easier manner. Table 7.14 gives a brief comparison between the traditional methods and the proposed e-learning scheme

w.r.t. various measures of improved system usability. The proposed e-learning scheme reduces the cost in terms of time and effort as compared to the previous usability studies.

Table 7.14: Traditional methods vs. the proposed e-learning scheme w.r.t. the qualitative measures of usability enhancement

Performance Measure: Traditional Methods (Video Recording and Questionnaires) vs. Proposed e-Learning Scheme
- Mode of Support: Manual vs. Automatic
- Completeness of Information: Interviewees may forget exact issues vs. Yes
- Time Interval: Long periods vs. Continuous
- Cost: Expensive vs. Few overheads
- Training Purposes: Optimal interpretations may not be available vs. Optimal interpretations are obtained with automated tools
- Dependency: User responses vs. None (automated)

Several usability challenges arise in the standardized EHRs database systems due to the large volume of complex-structured and temporally changing usage and user data. The relationships between users, their characteristics and preferred work-flows are discovered using pattern-mining tools and stored in a Knowledge archive (KArch). With the overhead of one-time creation of the knowledge archive, several benefits have been achieved, for example, improved learning for novice users, an application flow matching the preferred end-user work-flows, and the development of easy-to-use query forms.

7.7 Summary and Conclusions

The standardized EHRs databases store life-long EHRs of the patients, which continuously evolve over time. The dynamic nature of these EHRs and the complex structure of the participating archetypes give rise to critical usability barriers. The earlier usability studies using manual methods of post-release surveys and user-feedback sessions are not applicable in the complex environment of these databases. In this study, we propose an automated support framework to address these usability barriers. The proposed framework is evaluated by a pre-study with the actual end-users of the EHRs databases.
The datasets for the evaluation study have been prepared by using the responses of actual end-users. The results successfully correlate the users and the interesting work-flows for application customization (and modification). The study considers the granular-level factors that impact usability. The pattern-mining techniques of decision-tree classification, SPA and temporal association mining give results with high accuracy on the considered data. In addition, the framework helps the system designer to infer corrective actions such as a focused application-flow per user-class (and given task). In view of the continuous evolution of the standardized EHRs databases, the proposed enhancements will meet the challenges for improved usability. Tables 7.12 and 7.15 give snippets of the datasets used for evaluation of the proposed framework.

Table 7.15: A snippet of Work-flows relation

Attribute domains:
- user {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...}
- Feature1 through Feature10, each with domain {HPI, AP, FH, SH, DIAG, PE, LT, PROC, VACC, MED, MSE, RS, OT, AA, AF, EP, CI, QR, FB, HR, LTR, DF, ...}

Sample tuples:
1, HPI, AP, DIAG, AP, DIAG, AP, DIAG, AP, PE, RS
 , HPI, AP, DIAG, LT, AP, MSE, DIAG, AP, PE, AP
 , HPI, AP, DIAG, AP, DIAG, AP, DIAG, AP, PE, RS
 , HPI, AP, DIAG, AP, DIAG, AP, DIAG, AP, PE, RS
 , PE, PROC, VACC, LT, DIAG, AP, MSE, QR, AP, MSE
 , PE, PROC, VACC, LT, DIAG, AP, MSE, QR, AP, MSE
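The sequential-pattern step over work-flow rows such as those in Table 7.15 can be sketched as follows. This is a simplified stand-in that counts only contiguous feature subsequences; the actual SPA algorithms used in the study (e.g., for maximal frequent patterns) also handle non-contiguous patterns.

```python
from collections import Counter

def frequent_subsequences(sessions, min_support):
    """Count contiguous feature subsequences of length >= 2 and keep
    those whose support (fraction of sessions containing them) reaches
    min_support. Each pattern is counted at most once per session."""
    counts = Counter()
    for s in sessions:
        seen = set()
        for i in range(len(s)):
            for j in range(i + 2, len(s) + 1):
                seen.add(tuple(s[i:j]))
        counts.update(seen)
    n = len(sessions)
    return {p: c / n for p, c in counts.items() if c / n >= min_support}

def maximal(patterns):
    """Keep only patterns not contained (contiguously) in a longer frequent one."""
    def contains(big, small):
        return any(big[i:i + len(small)] == small
                   for i in range(len(big) - len(small) + 1))
    return [p for p in patterns
            if not any(p != q and contains(q, p) for q in patterns)]

# Work-flow rows in the spirit of Table 7.15 (invented sessions)
sessions = [
    ["HPI", "AP", "DIAG", "AP", "PE", "RS"],
    ["HPI", "AP", "DIAG", "LT", "AP", "PE"],
    ["HPI", "AP", "DIAG", "AP", "PE", "RS"],
]
freq = frequent_subsequences(sessions, min_support=1.0)
print(sorted(maximal(freq), key=len, reverse=True)[0])
# ('HPI', 'AP', 'DIAG')
```

The surviving maximal patterns are what gets correlated with user categories and stored in the knowledge archive.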

Chapter 8

Summary and Conclusions

The research study considers the two most commonly used instances of medical information repositories: the knowledge-based information resources and the patient-specific information resources. It describes the model of medical information and the features which make it complex. It considers the heterogeneous end-users, in-depth medical information needs and the variety of medical information repositories. The main focus of the research work is the need for better query methods for medical information in the areas of specialized medical document repositories on the Web (on-line medical encyclopedias) and standardized electronic health records repositories. It distinguishes the domain experts from the technical experts. It considers the medical domain experts such as specialists, researchers, and surgeons as the target audience. Further, it considers that querying these resources is a complex task due to factors such as the variety of end-users, their varying domain expertise and query skills, the different contexts in which they require information, the authenticity of resources and time-constraints. The study also analyzes the medical domain experts in contrast to the Web users, the IR users and the database users, and highlights the assumption that these practitioners, specialists and researchers are well-versed with the medical knowledge and terminologies (schema) but often lack tools for efficient querying. Hence, the need is to provide medical experts with the ability to query these resources. The query capability of the novice users is also expected to improve with the proposed support. In the first part of the study, a literature survey was undertaken to understand the structure of Web documents, their building blocks, and the various structural and semantic associations between the contents and their enclosing labels (headings).
It also discusses the various techniques of Web document segmentation and compares the existing approaches on key features of the underlying approach, time and space requirements, accuracy of segmentation and steps for segmentation. Subsequently, for the knowledge-based information resources such as medical encyclopedias, the study analyzes the shortcomings and strengths of the existing methods of search/query (keyword and form-based search). Considering the limitations of the existing general-purpose search engines, to address the complex query needs of the medical experts, the study proposes a user-level schema and a high-level query language over it. Such a schema is expected to map the medical processes and terminologies given in the Web document repositories as attributes that correspond directly to how a domain expert perceives them during patient-care. For creating the user-level schema, the MedlinePlus medical encyclopedia is studied. The study defines each segment of a Web document as a visually and semantically distinct block of content. Using this definition, it models the Web document of the repository as a two-dimensional array of document segments and the contents enclosed within them. Further, it models a complete medical document repository as a three-dimensional cube, with the attributes (segment labels), the contents enclosed by these segments and the number of documents in the repository forming the three dimensions. Once a Web document is segmented, each of

the segment labels is mapped as a node of a hierarchical structure. This hierarchical structure is then mapped to an XML schema with concept-based, user-understandable tags. Hence, the study creates a user-level schema by using the structure of information and the medical concepts described in the document repository for the segmentation process. The attributes of this schema are the various concepts required (queried) during patient-diagnosis and general health-information seeking. Next, the study exhibits how the existing query languages such as XQuery can be enabled over the user-level schema (corresponding to the Web document repository). The various query-language functions of XQuery are tested over the transformed user-level schema. The study considers that a medical domain expert seeking information from (querying) these resources during patient-care follows sequential steps. The study models the query-flows and classifies them as complex, recursive, simple and medium. It categorizes the types of queries required by the users w.r.t. segments of the documents and w.r.t. clinical tasks. To cater to these requirements, it also considers that in complex medical scenarios and patient-diagnosis a medical expert may need to iteratively seek information at various stages of medical-care. To address these complex query requirements, a high-level query language interface, Query-by-Segment Tag (QBT), is developed. The existing XQBE graphical query language is enabled over the user-level schema (as MXQBE - MedlinePlus XQuery By Example). Experiments and evaluation have been performed for the proposed query methods to exhibit their effectiveness, strengths and shortcomings. The experimental evaluation of the high-level query language interfaces (QBT and MXQBE) shows that the interfaces are easy-to-use, accurate and efficient for querying.
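A query over such a user-level schema - restricted to a named segment rather than the whole document - can be sketched with Python's standard-library ElementTree in place of a full XQuery engine. The documents and tag names below are invented for illustration and do not reflect the actual MedlinePlus-derived schema.

```python
import xml.etree.ElementTree as ET

# A toy fragment of a user-level schema for an encyclopedia repository.
# The tags (document, symptoms, treatment) are illustrative; the study
# maps actual segment labels of MedlinePlus documents to such tags.
repo = ET.fromstring("""
<repository>
  <document title="Hypertension">
    <symptoms>Headache; blurred vision</symptoms>
    <treatment>Lifestyle changes; antihypertensive drugs</treatment>
  </document>
  <document title="Migraine">
    <symptoms>Throbbing headache; nausea</symptoms>
    <treatment>Pain relievers; rest</treatment>
  </document>
</repository>
""")

def query_by_segment(repo, segment, keyword):
    """QBT-style query: return titles of documents whose given segment
    mentions the keyword, i.e. querying within a segment instead of
    over the whole document text."""
    return [doc.get("title") for doc in repo.findall("document")
            if keyword.lower() in (doc.findtext(segment) or "").lower()]

print(query_by_segment(repo, "symptoms", "headache"))
# ['Hypertension', 'Migraine']
```

Restricting the match to one segment is what reduces the search space compared with whole-document keyword search; in the actual system the equivalent query would be expressed in XQuery (or drawn in MXQBE) over the transformed schema.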
The interfaces allow the users to query medical information at a granular level (within the individual segments). The users can formulate their queries in a step-by-step, iterative manner based on their needs and the various stages of point-of-care. A sample set of 30 queries is considered for the experiments. The queries are formulated using user-questionnaires and a literature survey of the related work. Experiments performed on the prototype system show a reduction in search space and the accuracy of the results obtained. A usability study with a group of 20 novice users was conducted for the QBT query interface. The results of the study describe the effectiveness of the interface w.r.t. query formulation effort, time to obtain results, steps to execute a query and accuracy of results. The study summarizes these results and compares them with the existing keyword-search-based interfaces. The study shows that the users were able to use the interfaces easily after a short training session. The training session explained the queries and interface features to the users. For some complex queries, the users were not able to formulate accurate queries using the interface. But overall, they were satisfied with the quality of results and the time within which these results were obtained. All of them agreed that they would prefer to use the interface for querying on-line medical information over the existing general-purpose search engines such as Google and Yahoo. Hence, the research work emphasizes that database-style query languages can support the detailed, complex and precise query needs of the medical domain experts. In the second part of the thesis, the patient-specific instance of medical information is considered. First, the study highlights the problem domain, in which patient-data is stored at various distributed health-care settings. It considers the heterogeneity in format and the end-users using this information.
It describes the evolution of the earlier paper-based health-care records to EMRs (electronic medical records), personal health records (PHRs), EHRs and the recently introduced standardized EHRs. The study considers the electronic health records (EHRs) based on the openehr standard, since this is the latest and increasingly preferred standard for EHRs. The study describes the key features of the openehr standard: dual-level modeling and the use of archetypes (as medical concepts). It studies and analyzes these features to provide semantic interoperability among the distributed health-care organizations and to support easy-to-use query interfaces. It discusses the complexity of these standardized EHRs databases and the

challenges that occur during information retrieval and querying over them. It further classifies the existing methods and elaborates their shortcomings w.r.t. these databases. Further, the study summarizes the key requirements that need to be considered for developing a query language over them. The study also highlights the human-computer interactions that occur in these databases. Subsequently, it emphasizes the need for an efficient persistence mechanism for the huge amount of data generated by the standardized EHRs databases in various clinical settings. The study proposes the use of a document-oriented, cloud-based NoSQL database (MongoDB) to store the EHRs, in comparison to a relational or XML-based store. The shortcomings and strengths of this choice are also given in the study. The study addresses the need for a query language that can retrieve in-depth information at the single-patient or population levels. The study proposes a relational-like query language, AQBE (Archetype Query-by-Example), over the NoSQL-based persistence for the openehr-based EHRs. The experimental analysis for the AQBE system focuses on the appropriateness of a NoSQL data store for the openehr-compliant data and on the query functions provided by the AQBE query language. A qualitative comparison of the NoSQL database with the relational and XML databases for the standardized EHRs is drawn, which emphasizes the strength of the NoSQL database (document-oriented MongoDB) to provide more than pure object-relational mapping and also to allow the users to query the EHRs up to any arbitrary level. A set of 24 queries is formed using user-questionnaires and a literature survey. These queries are tested w.r.t. the query functions provided by any database query language such as SQL and XQuery. The AQBE query language can perform most query functions (except division) for single-patient and multiple-patient queries.
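The document-oriented storage and querying can be sketched without a running MongoDB server by evaluating a MongoDB-style filter over plain dictionaries. The documents, field names and the `match` helper below are invented for illustration (real openEHR compositions nest archetype data much more deeply); with a real MongoDB client the same filter dict would simply be passed to a `find` call.

```python
# EHR entries stored as documents (fields simplified for illustration)
ehrs = [
    {"patient_id": 1, "archetype": "blood_pressure", "systolic": 150, "diastolic": 95},
    {"patient_id": 2, "archetype": "blood_pressure", "systolic": 118, "diastolic": 76},
    {"patient_id": 1, "archetype": "body_weight", "weight_kg": 82},
]

def match(doc, flt):
    """Evaluate a MongoDB-style filter against one document.
    Only equality and the $gt operator are sketched here."""
    for field, cond in flt.items():
        if isinstance(cond, dict):
            if "$gt" in cond and not (field in doc and doc[field] > cond["$gt"]):
                return False
        elif doc.get(field) != cond:
            return False
    return True

# AQBE-style query: patients with a systolic reading above 140
flt = {"archetype": "blood_pressure", "systolic": {"$gt": 140}}
print([d["patient_id"] for d in ehrs if match(d, flt)])  # [1]
```

Because each composition is one document, such filters can reach into it at an arbitrary level without the object-relational mapping a relational store would require.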
The interface of the AQBE system is easy-to-use, and data-insertion and query can be done using the same interface. Therefore, the study attempts to develop a relationally complete query language over a cloud-based EHRs database. A user-study was conducted with 15 clinicians belonging to diverse specializations across India, the USA and Japan to understand the challenges faced by the doctors, practitioners and researchers in everyday use of an EHRs database system and to learn about their expectations from these database systems. Considering the end-users' usability concerns and the literature survey performed in the area of usability of EHRs systems, the study proposes an automated usability enhancement framework. This framework uses various pattern-mining techniques to automate the process of understanding the target users, their characteristics and the user-system interactions. The framework is capable of providing continuous feedback to the EHRs system designer by creating a knowledge archive. The knowledge archive stores the classification of the users based on their demographics and characteristics. It also stores the frequent maximal patterns discovered in user-system interactions. It correlates the user categories with the maximal patterns, which form a knowledge base for the EHR system designer. This enables him or her to develop application flows in the EHRs database system that match the user work-flows. This is helpful in reducing medical errors and the time taken by the skilled and semi-skilled users in performing their everyday tasks. It eliminates the dependency on the existing manual methods, such as questionnaires and video recordings, which are generally expensive in time and cost. The knowledge archive summarizes the relevant usage-patterns and user-data, eliminating the need for storing huge amounts of EHRs data, and provides real-time feedback in a continuous manner.
The study also describes the application of the proposed usability enhancement framework as an e-learning framework. In such a framework, the system understands the users using the knowledge archive created by the use of pattern-mining techniques.

8.1 Limitations of the Study

The QBT and MXQBE query language interfaces on the medical document repository (the MedlinePlus medical encyclopedia) and the AQBE query language for the standardized EHRs can be improved by actual usability studies in real clinical and hospital settings with the medical experts. The QBT and MXQBE query language interfaces give multiple results for certain user queries, as the queried segments exist across multiple documents. To list these results, a ranking mechanism is required to present the most relevant result at the top of the result list. The study discusses the possible ranking approaches, which need to be implemented and integrated with these interfaces. Query functions such as division need to be implemented for the AQBE query language. The query language further needs to be improved to include the epidemiological queries. These queries are critical for improving health-care services such as epidemic prediction. The usability enhancement framework needs to be integrated with the AQBE database system for understanding the needs of the end-users in an on-line manner and improving the system application flow. This can develop the AQBE database system into an open-source openehr-based EHR system, which allows users to query and store patient-data.

8.2 Future Work

As future work, we plan to enhance the features of these query languages by implementing them in a real-world setting. Also, testing them with a larger set of queries and performing usability studies with the actual users (clinicians, practitioners and medical researchers) can make these query language interfaces more practical in use. The MXQBE and QBT query language interfaces over the MedlinePlus medical encyclopedia can further be implemented over other similarly structured Web-based medical document repositories. This can provide the medical domain-experts a reliable, authentic database that integrates the major on-line medical information resources.
This can help in everyday querying during the patient-care process and improve the quality of care. We also aim to implement the missing query features of the AQBE query language to make it a complete database query language for the standardized EHRs databases. With these enhancements, the proposed query language can help the medical domain-experts seeking health and patient information from the standardized EHRs database. A system such as AQBE can be linked to Web services such as MedlinePlusConnect; such a service can link the EHR of the patient to an on-line medical information repository. The patients and clinicians can seek information specific to a person's health issues using it. The cloud-based persistence proposed for the standardized EHRs can further be explored and tested for epidemiological queries over the openehr-based database. This can help to integrate the patient data from the distributed health-care organizations through web services. Also, the NoSQL-based persistence may prove to be a solution to the big-data issues in the huge volume of large EHRs databases. The pattern-mining techniques can be applied to secondary health-care applications such as disease prediction, providing previous results to health experts for accurate patient diagnosis. Further, this will help to improve health-care delivery. As part of future work, the proposed automated usability enhancement framework needs to be integrated with the AQBE database system; this can provide continuous real-time user-feedback about the system over the lifetime of an EHR database system. This can prove significant for the usability and learnability support for complex standardized EHRs-based database systems. As an initial step, the study proposed

an e-learning scheme based on the usability framework.

Appendix A

A.1 Query by Segment Tag: Usability Studies with End-users

Table A.1: Part I: User Information

1. Name
2. Age (25-30 / 30-40 / 40-50 / above 50)
3. Country (India / USA / Japan)
4. Location (health-care setting (city/rural) or university lab)
5. Specialization (in case of doctor/health-care worker)
6. Education/Qualification

A.2 Pre-Study with the Clinicians

Table A.6 gives the details of the questions used to interview the actual clinicians. It also explains the kind of response expected from them.

1. Do you use an electronic health record (EHR) system in your day-to-day activities (such as patient assessment/medication/patient revisit and so on)? [YES/NO]
2. If YES, please tell us which software or software provider?
3. If NO, please tell us the reason (not available/time consuming/difficult to use/complex screens)?
4. Have you heard of the standards for EHRs such as HL7, CEN or openehr? [YES/NO]

5. Have you ever used a clinical application based on any of these standards? [YES/NO]
6. What are the main tasks or activities you think are important in everyday clinical activities that can be performed by using a clinical application? (e.g., patient diagnosis/referrals/revisit of patient/assignment of assessment plan to the patient/clinical task such as surgery etc.)
7. Does the application flow (the order in which the screens appear) of the system serve your purpose easily? [YES/NO]
8. How frequently do you face difficulties or errors in accessing a clinical application? [Very frequently/frequently/rarely/not at all]
9. How do you report the errors encountered (report to admin staff of the clinic or hospital/fill up a form/others)? Please specify.
10. Were you consulted before the clinical application was deployed at your organization? [YES/NO]
11. Did you receive any training session before using the system?
12. Do you agree that considering your needs during system design, rather than during post-use surveys/interviews, would result in a user-friendly system? [YES/NO]
13. Do you agree that a clinician's qualification, specialization and work-experience affect the use of a clinical application? [YES/NO]
14. Do you agree that the geographical location of the clinician and the setting (clinic or hospital) affect the use of a clinical application? [YES/NO]
15. Do you agree that the age of the health-care users affects their use of the clinical application? [YES/NO]
16. Do you agree that a clinical application's flow must be customized to suit your common daily tasks? [YES/NO]
17. Do you agree that the application flow should match the end-user workflows within a clinical application? [YES/NO]
18. Do you recall all the issues/errors encountered by you during system use during the post-release feedbacks and surveys or interviews? [YES/NO]
19.
Do you agree the IT providers should capture your needs based on your usage of the system automatically or ask for manual feedbacks or interview sessions? [YES/NO] 20. Please give us some reason for the above question. Name: Qualification: Age: Department: Organization: Address: Table A.6: Questionnaire for the pre-study performed with the end-users (clinicians). 163

Table A.2: Part II: Questionnaire about use of on-line medical information.

S.No.  Question                                                       User Response
1.     How often do you use medical information during patient consultation/medical visits?
2.     How promptly do you wish to receive the results of your queries?
3.     Have you ever used an on-line medical encyclopedia or dictionary (say, WebMD/ MedlinePlus/ A.D.A.M./ Merriam-Webster Medical Dictionary)?
4.     When using search engines such as Google/ Yahoo/ Bing for searching information, do you generally find results on the first page?
5.     What do you think is the reason that search engines do not fulfill the query needs of the health-care expert?

A.3 Prototype EHR System Based on the openEHR Standard and Usability Concerns

The AQBE system is an EHR system based on the openEHR standard, in the early stages of development (Madaan et al., 2013). The prototype makes use of a small set of archetypes from the Clinical Knowledge Manager (CKM). The following figures highlight the usability concerns with respect to the features of this EHR system.

Figure A.1 shows the data-inserter user interface: a list of archetypes, a form for patient details and a query-formulation table. A dynamic template (form) corresponding to the archetype selected in the drop-down menu is generated. The foreseeable usability challenges need to be addressed, and most of them can be if the end-users and their preferred workflows are available as rules (guidelines) to the system designer. Classifying users and understanding their relationship with various attributes helps to decide which features of the system are mandatory and which are optional. The customization of archetypes can be performed on the basis of user attributes and expected workflows.

Figure A.3 and Figure A.4 show a template for the blood pressure archetype, generated dynamically after a user selects the concept from the list (Figure A.2). They depict the long list of attributes for a single archetype (blood pressure). The complexity increases when a user selects multiple archetypes and needs forms (templates) spanning all of them.
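The dynamic template generation described above can be illustrated with a small sketch. The archetype identifiers, section names and attribute lists below are simplified, hypothetical stand-ins for real openEHR archetypes (a real implementation would parse ADL definitions from the CKM), but the sketch shows why a form for even one archetype is long, and why it grows quickly when several archetypes are combined into a template.

```python
# Sketch: generating a flat form definition from simplified archetype
# descriptions. The dictionaries below are hypothetical, simplified
# stand-ins for real openEHR ADL archetypes.

ARCHETYPES = {
    "openEHR-EHR-OBSERVATION.blood_pressure.v1": {
        "Data": ["Systolic", "Diastolic", "Mean arterial pressure"],
        "State": ["Position", "Exertion", "Sleep status"],
        "Protocol": ["Cuff size", "Location of measurement"],
    },
    "openEHR-EHR-OBSERVATION.body_weight.v1": {
        "Data": ["Weight"],
        "State": ["State of dress"],
    },
}

def generate_form(archetype_ids):
    """Flatten the selected archetypes into an ordered list of form fields.

    Each field is labelled 'archetype/section/attribute', mirroring how a
    template spanning several archetypes accumulates attributes."""
    fields = []
    for aid in archetype_ids:
        for section, attrs in ARCHETYPES[aid].items():
            for attr in attrs:
                fields.append(f"{aid}/{section}/{attr}")
    return fields

form = generate_form(["openEHR-EHR-OBSERVATION.blood_pressure.v1",
                      "openEHR-EHR-OBSERVATION.body_weight.v1"])
print(len(form))   # 10 fields even for this toy pair of archetypes
print(form[0])     # openEHR-EHR-OBSERVATION.blood_pressure.v1/Data/Systolic
```

Even this toy pair of archetypes yields ten fields; the full CKM blood pressure archetype alone defines many more, which is the usability concern Figures A.3 and A.4 illustrate.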

Table A.3: Part III (a): Queries for the study with the end-users (clinicians and other health experts).

S.No.  Queries                                                        Query Operation
1.     Find cases where fever is caused by affliction of pneumonia and tuberculosis
2.     Find cases where a patient has abdominal pain because of gastritis
3.     Find cases where a patient has anaphylaxis allergy from shrimp, but tested NEGATIVE for allergy
4.     Find cases where fever is caused by a virus
5.     Find cases where a patient has eczema symptoms caused by allergy to salt
6.     Find whether a bipolar communication patient should be advised an eye exam OR a thyroid exam
7.     Find whether cardiovascular disease occurs due to high triglyceride intake
8.     Find cases where a tumour may be caused by pacemakers
9.     Find treatment options for patients with OSTEOPOROSIS and fewer side effects
10.    Find whether oxygen therapy works for the treatment of chronic respiratory failure when the symptoms are lethargy OR shortness of breath

[Figure annotations: Archetypes to be listed?  Task?  End-user category?  Patient attributes?  Template to be generated?  User attributes required?  Details of care-provider?]

Figure A.1: AQBE prototype system representing the usability concerns: (i) flow of contents on the user interface, (ii) mandatory and optional user attributes, (iii) archetypes required based on the needs of the (specialized) health-care environment, (iv) when and where the template needs to be generated, and (v) distinguishing mandatory and optional user data.
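A query such as No. 2 above ("abdominal pain because of gastritis") can be thought of as a set of (segment tag, term) predicates evaluated against tagged document segments. The sketch below illustrates that idea only; the tag names, the sample documents and the substring-matching rule are hypothetical stand-ins, not the actual Query by Segment Tag algorithm evaluated in this study.

```python
# Sketch: a clinical query as (segment-tag, term) predicates matched
# against tagged document segments. Tags, documents and the matching
# rule are hypothetical illustrations, not the thesis's QBT algorithm.

query = [("Symptoms", "abdominal pain"), ("Causes", "gastritis")]

documents = {
    "doc-1": {"Symptoms": "abdominal pain, nausea", "Causes": "gastritis"},
    "doc-2": {"Symptoms": "fever, cough", "Causes": "pneumonia"},
    "doc-3": {"Symptoms": "abdominal pain", "Causes": "appendicitis"},
}

def matches(doc_segments, predicates):
    # A document qualifies only if every (tag, term) predicate is
    # satisfied by the text of the correspondingly tagged segment.
    return all(term in doc_segments.get(tag, "") for tag, term in predicates)

hits = [doc_id for doc_id, segs in documents.items() if matches(segs, query)]
print(hits)  # ['doc-1']
```

Restricting each term to a named segment, rather than searching the whole document, is what distinguishes such tag-scoped queries from plain keyword search with a general-purpose engine.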

Table A.4: Part III (b): End-user Questionnaire for the QBT Interface.

S.No.  Parameters                                                     Q1  Q2  Q3  Q4  Q5
1.     Was it possible to understand the interface using the above example? (Yes/No)
2.     Was it (easy/ difficult/ not-so-difficult) to formulate the queries?
3.     Do you think the query is relevant in day-to-day clinical practice? (yes/ no/ may be)
4.     The query results received from the interface are (relevant/ not-so-relevant/ irrelevant)
5.     Average time taken to formulate the query (in minutes/seconds)
6.     Average no. of steps (clicks) taken to formulate the query
7.     No. of results received for the query
8.     After using the interface for finding results of everyday queries, are you (satisfied/ dissatisfied/ not-so-satisfied)?
9.     Would you like to use this interface over Google or any other search engine? (Yes/ No/ Not sure)

[Figure annotations: Select blood pressure concept; Archetype list; Dynamic form (template) generation]

Figure A.2: Archetype list in the AQBE system, used for selection of archetype(s) for dynamic form generation.

Table A.5: Part III (b): End-user Questionnaire for the QBT Interface (contd.).

S.No.  Parameters                                                     Q6  Q7  Q8  Q9  Q10
1.     Was it possible to understand the interface using the above example? (Yes/No)
2.     Was it (easy/ difficult/ not-so-difficult) to formulate the queries?
3.     Do you think the query is relevant in day-to-day clinical practice? (yes/ no/ may be)
4.     The query results received from the interface are (relevant/ not-so-relevant/ irrelevant)
5.     Average time taken to formulate the query (in minutes/seconds)
6.     Average no. of steps (clicks) taken to formulate the query
7.     No. of results received for the query
8.     After using the interface for finding results of everyday queries, are you (satisfied/ dissatisfied/ not-so-satisfied)?
9.     Would you like to use this interface over Google or any other search engine? (Yes/ No/ Not sure)

[Figure annotations: Data fields and attributes; Example of blood pressure archetype as a form; Systolic; Diastolic; Data]

Figure A.3: Example form corresponding to the blood pressure concept in the AQBE system, showing the various attributes of the concept.
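The objective measures in the questionnaire above (items 5-7: formulation time, click counts, result counts) are recorded per participant and per query, then averaged across participants. A minimal aggregation sketch follows; the response rows are fabricated for illustration only and are not data from the study.

```python
# Sketch: aggregating per-query usability measures across participants.
# The response rows below are fabricated for illustration only.
from statistics import mean

# (participant, query, seconds_to_formulate, clicks)
responses = [
    ("P1", "Q1", 42, 5), ("P2", "Q1", 55, 6), ("P3", "Q1", 38, 4),
    ("P1", "Q2", 61, 7), ("P2", "Q2", 49, 6),
]

def per_query_averages(rows):
    """Return {query: (mean seconds, mean clicks)} over all participants."""
    by_query = {}
    for _, query, seconds, clicks in rows:
        by_query.setdefault(query, []).append((seconds, clicks))
    return {q: (mean(s for s, _ in v), mean(c for _, c in v))
            for q, v in by_query.items()}

avg = per_query_averages(responses)
print(avg["Q1"])  # mean seconds and mean clicks for query Q1
```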

[Figure annotations: Data; Example of blood pressure archetype as a form (contd.); Fields; State]

Figure A.4: Example form corresponding to the blood pressure concept (Figure A.3), continued.

[Figure annotations: Heart Failure Template; Participating archetypes: Body Weight, Blood Pressure, Encounter]

Figure A.5: The participating archetypes of the heart failure summary template of the EU-SHN Project.


Consultation document: Summary of Clinical Trial Results for Laypersons

Consultation document: Summary of Clinical Trial Results for Laypersons SANTE-B4-GL-results-laypersons@ec.europa.eu Consultation document: Summary of Clinical Trial Results for Laypersons Professor DK Theo Raynor, University of Leeds d.k.raynor@leeds.ac.uk This is my response

More information

The Role of Data Profiling In Health Analytics

The Role of Data Profiling In Health Analytics WHITE PAPER 10101000101010101010101010010000101001 10101000101101101000100000101010010010 The Role of Data Profiling In Health Analytics 101101010001010101010101010100100001010 101101010001011011010001000001010100100

More information

RD-Action WP5. Specification and implementation manual of the Master file for statistical reporting with Orphacodes

RD-Action WP5. Specification and implementation manual of the Master file for statistical reporting with Orphacodes RD-Action WP5 Specification and implementation manual of the Master file for statistical reporting with Orphacodes Second Part of Milestone 27: A beta master file version to be tested in some selected

More information

SAMPLE LIS01-A2. Archived Document

SAMPLE LIS01-A2. Archived Document Archived Document This archived document is no longer being reviewed through the CLSI Consensus Document Development Process. However, this document is technically valid as of January 2017. Because of

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

How are XML-based Marc21 and Dublin Core Records Indexed and ranked by General Search Engines in Dynamic Online Environments?

How are XML-based Marc21 and Dublin Core Records Indexed and ranked by General Search Engines in Dynamic Online Environments? How are XML-based Marc21 and Dublin Core Records Indexed and ranked by General Search Engines in Dynamic Online Environments? A. Hossein Farajpahlou Professor, Dept. Lib. and Info. Sci., Shahid Chamran

More information

Managing Learning Objects in Large Scale Courseware Authoring Studio 1

Managing Learning Objects in Large Scale Courseware Authoring Studio 1 Managing Learning Objects in Large Scale Courseware Authoring Studio 1 Ivo Marinchev, Ivo Hristov Institute of Information Technologies Bulgarian Academy of Sciences, Acad. G. Bonchev Str. Block 29A, Sofia

More information

MURDOCH RESEARCH REPOSITORY

MURDOCH RESEARCH REPOSITORY MURDOCH RESEARCH REPOSITORY http://researchrepository.murdoch.edu.au/ This is the author s final version of the work, as accepted for publication following peer review but without the publisher s layout

More information

University of Wollongong. Research Online

University of Wollongong. Research Online University of Wollongong Research Online University of Wollongong Thesis Collection 1954-2016 University of Wollongong Thesis Collections 2009 Complete interoperability in healthcare : technical, semantic

More information

The Data Center is Dead Long Live the Virtual Data Center

The Data Center is Dead Long Live the Virtual Data Center The Data Center is Dead Long Live the Virtual Data Center Hector Rodriguez, MBA Microsoft WW Health Chief Industry Security Officer September 12, 2018 Enterprise Data Centers are Vanishing Enterprise data

More information

The Clinical Data Repository Provides CPR's Foundation

The Clinical Data Repository Provides CPR's Foundation Tutorials, T.Handler,M.D.,W.Rishel Research Note 6 November 2003 The Clinical Data Repository Provides CPR's Foundation The core of any computer-based patient record system is a permanent data store. The

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 4, Jul-Aug 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 4, Jul-Aug 2015 RESEARCH ARTICLE OPEN ACCESS Multi-Lingual Ontology Server (MOS) For Discovering Web Services Abdelrahman Abbas Ibrahim [1], Dr. Nael Salman [2] Department of Software Engineering [1] Sudan University

More information

TABLE OF CONTENTS CHAPTER NO. TITLE PAGENO. LIST OF TABLES LIST OF FIGURES LIST OF ABRIVATION

TABLE OF CONTENTS CHAPTER NO. TITLE PAGENO. LIST OF TABLES LIST OF FIGURES LIST OF ABRIVATION vi TABLE OF CONTENTS ABSTRACT LIST OF TABLES LIST OF FIGURES LIST OF ABRIVATION iii xii xiii xiv 1 INTRODUCTION 1 1.1 WEB MINING 2 1.1.1 Association Rules 2 1.1.2 Association Rule Mining 3 1.1.3 Clustering

More information

Ensuring Quality Terminology Mappings in Distributed SOA Environments

Ensuring Quality Terminology Mappings in Distributed SOA Environments Ensuring Quality Terminology Mappings in Distributed SOA Environments SOA in Healthcare Chicago Illinois April 16, 2008 Russell Hamm Informatics Consultant Apelon, Inc. 1 Outline Data standardization Goals

More information

Employing Query Technologies for Crosscutting Concern Comprehension

Employing Query Technologies for Crosscutting Concern Comprehension Employing Query Technologies for Crosscutting Concern Comprehension Marius Marin Accenture The Netherlands Marius.Marin@accenture.com Abstract Common techniques for improving comprehensibility of software

More information

PAKISTAN HOW TO SPEED UP THE INTRODUCTION OF EHEALTH SERVICES IN DEVELOPING COUNTRIES

PAKISTAN HOW TO SPEED UP THE INTRODUCTION OF EHEALTH SERVICES IN DEVELOPING COUNTRIES HOW TO SPEED UP THE INTRODUCTION OF EHEALTH SERVICES IN DEVELOPING COUNTRIES V. Androuchko¹, Asif Zafar Malik² ¹International University in Geneva, Switzerland ² Rawalpindi Medical College, Pakistan 1

More information

warwick.ac.uk/lib-publications

warwick.ac.uk/lib-publications Original citation: Zhao, Lei, Lim Choi Keung, Sarah Niukyun and Arvanitis, Theodoros N. (2016) A BioPortalbased terminology service for health data interoperability. In: Unifying the Applications and Foundations

More information

Copyright protected. Use is for Single Users only via a VHP Approved License. For information and printed versions please see

Copyright protected. Use is for Single Users only via a VHP Approved License. For information and printed versions please see TOGAF 9 Certified Study Guide 4th Edition The Open Group Publications available from Van Haren Publishing The TOGAF Series: The TOGAF Standard, Version 9.2 The TOGAF Standard Version 9.2 A Pocket Guide

More information

SharePoint 2013 End User Level II

SharePoint 2013 End User Level II SharePoint 2013 End User Level II Course 55052A; 3 Days, Instructor-led Course Description This 3-day course explores several advanced topics of working with SharePoint 2013 sites. Topics include SharePoint

More information

IHE IT Infrastructure Technical Framework Supplement. Non-patient File Sharing (NPFSm) Rev. 1.1 Trial Implementation

IHE IT Infrastructure Technical Framework Supplement. Non-patient File Sharing (NPFSm) Rev. 1.1 Trial Implementation Integrating the Healthcare Enterprise 5 IHE IT Infrastructure Technical Framework Supplement 10 Non-patient File Sharing (NPFSm) HL7 FHIR STU 3 15 Using Resources at FMM Level 3-5 Rev. 1.1 Trial Implementation

More information

Use Cases for Argonaut Project -- DRAFT Page

Use Cases for Argonaut Project -- DRAFT Page Use Cases for Argonaut Project -- DRAFT Page 1 Use Cases for Argonaut Project DRAFT V0.3 March 03, 2015 Use Cases for Argonaut Project -- DRAFT Page 2 Introduction The Argonaut Project seeks to rapidly

More information

SEXTANT 1. Purpose of the Application

SEXTANT 1. Purpose of the Application SEXTANT 1. Purpose of the Application Sextant has been used in the domains of Earth Observation and Environment by presenting its browsing and visualization capabilities using a number of link geospatial

More information

Information Systems Interfaces (Advanced Higher) Information Systems (Advanced Higher)

Information Systems Interfaces (Advanced Higher) Information Systems (Advanced Higher) National Unit Specification: general information NUMBER DV51 13 COURSE Information Systems (Advanced Higher) SUMMARY This Unit is designed to develop knowledge and understanding of the principles of information

More information

Flexibility and Robustness of Hierarchical Fuzzy Signature Structures with Perturbed Input Data

Flexibility and Robustness of Hierarchical Fuzzy Signature Structures with Perturbed Input Data Flexibility and Robustness of Hierarchical Fuzzy Signature Structures with Perturbed Input Data B. Sumudu U. Mendis Department of Computer Science The Australian National University Canberra, ACT 0200,

More information

Design A Database Schema For A Hospital

Design A Database Schema For A Hospital Design A Database Schema For A Hospital Information System Databases for Clinical Information Systems are difficult to design and implement, but how to query the data and use it for healthcare (documentation,

More information

Towards Ease of Building Legos in Assessing ehealth Language Technologies

Towards Ease of Building Legos in Assessing ehealth Language Technologies Towards Ease of Building Legos in Assessing ehealth Language Technologies A RESTful Laboratory for Data and Software Hanna Suominen 1, 2, Karl Kreiner 3, Mike Wu 1, Leif Hanlen 1, 2 1 NICTA, National ICT

More information