Semantic Enrichment for Ontology Mapping


Xiaomeng Su

Semantic Enrichment for Ontology Mapping

Department of Computer and Information Science
Norwegian University of Science and Technology
N-7491 Trondheim, Norway

NTNU Trondheim
Norges teknisk-naturvitenskapelige universitet
Doktor ingeniøravhandling 2004:116
Institutt for datateknikk og informasjonsvitenskap
ISBN ISSN

Abstract

System interoperability is an important issue, widely recognized in information technology intensive organizations and in the research community of information systems. The wide adoption of the World Wide Web to access and distribute information further stresses the need for system interoperability. Initiatives like the Semantic Web strive to allow software agents to locate and integrate data in a more intelligent way through the use of ontologies. The Semantic Web offers a compelling vision, yet it raises a number of research challenges. One of the key challenges is to compare and map different ontologies, a task that evidently appears in integration work.

The main aim of this work is to introduce a method for finding semantic correspondences among component ontologies, with the intention of supporting the interoperability of information systems. The approach brings together techniques from modeling, computational linguistics, information retrieval and agent communication in order to provide a semi-automatic mapping method, and a prototype mapping system, that support the process of ontology mapping for the purpose of improving semantic interoperability in heterogeneous systems.

The approach consists of two phases: an enrichment phase and a mapping phase. The enrichment phase is based on an analysis of the extension information the ontologies have. The extension we make use of in this work is written documents that are associated with the concepts in the ontologies. The intuition is that, given two to-be-compared ontologies, we construct representative feature vectors for each concept in the two ontologies. The documents are building material for the construction process, as they reflect the common understanding of the domain. The output of the enrichment phase is ontologies with feature vectors as enrichment structure. The mapping phase takes the enriched ontologies and computes similarity pairwise for the elements in the two ontologies. The calculation is based on the distance between the feature vectors. Further refinements re-rank the results through the use of WordNet. A number of filters, variables and heuristics can be tuned to include or exclude certain mapping correspondences.

The approach has been implemented in a prototype system, imapper, and has been evaluated through a controlled accuracy evaluation with a set of test users on two limited but real-world cases. The system was tested under different configurations of variables to indicate the robustness of the approach. The preliminary case studies show encouraging results. The applicability of the approach is demonstrated in an attempt to use the mapping assertions generated by the approach to bridge communication between heterogeneous systems. We present a framework where the mapping assertions are used to improve system interoperability in multi-agent systems. Furthermore, to demonstrate the practical feasibility of the approach, we show how to instantiate the framework in a running agent platform, AGORA. Future directions of this work include studies on extended customizability, user studies, model quality and technical method revision.

Contents

Preface

Part I: Background and Context

1 Introduction
  Background
  About the Problem
  Objectives
  Approach and Scope
  Way of Working and Major Contributions
  Publications
  Thesis Outline

2 Basic Ontology Concepts
  The Semantic Web
  The Role of Ontology
    Shared Vocabularies and Conceptualizations
    Types of Ontologies
    Beneficial Applications
  Ontology Languages
    Traditional Ontology Languages
    Web Standards
    Web-based Ontology Specification Languages
  Ontology Engineering
    Life Cycle of an Ontology
    Ontology-based Architectures
  Concluding Remarks

3 Technological Overview
  Information Retrieval
    Vector Space Models
  Computational Linguistics
    Morphological Analysis
    Part-of-Speech Tagging
    Lexical Semantics
  Concluding Remarks

4 State-of-the-Art Survey
  Ontology Heterogeneity
    Ontology Mismatch
    Current Approaches and Techniques
  Ontology Mapping Concepts
    Definition and Scope of Ontology Mapping
    Application Domains
    Terminology
  Automatic Ontology Mapping Tools
    Automatic Schema Matching
    Systems for Ontology Merging and Mapping
    A Comparison of the Studied Systems
  Concluding Remarks

Part II: Design and Architecture

5 Ontology Comparison and Semantic Enrichment
  Prerequisites
    Scope and Assumption
    The RML Modeling Language
  The Abstract Ontology Mapping Model
    Semantic Discrepancies
    Mapping Assertions
  Semantic Enrichment of Ontology
  Extension Analysis-based Semantic Enrichment
    The Concept of Intension and Extension
    Extension Analysis for Semantic Enrichment
  Feature Vector as Generalization of Extension
    Feature Vectors
    Steps in Constructing Feature Vectors
    Document Assignment
    Feature Vector Construction
    Feature Vectors as Semantic Enrichment
  Concluding Remarks

6 Ontology Mapping Approach
  Algorithm Overview
  The Similarity Calculation for Concepts
  Adjust Similarity Value with WordNet
    WordNet
    The Path Length Measurement
  The Similarity Calculation for Complex Elements
    Relations
    Clusters
    Ontologies
  Further Refinements
    Heuristics for Mapping Refinement Based on the Calculated Similarity
    Managing User Feedback
    Other Matchers and Combination of Similarity Values
  Application Scenarios
  Concluding Remarks

Part III: Implementation and Assessment

7 The Prototype Realization
  Components in the Realization
    The Modeling Environment
    The CnS Client as a Classifier
    The imapper System
  Concluding Remarks

8 Case Studies and Evaluation
  Experiment Design
    Performance Criteria
    Domains and Source Ontologies
    Experiment Setup
  The Analysis Results
    Filters and Variables
    Quality of imapper's Predictions
    Further Experiment
  Discussion
  Concluding Remarks

9 Applicability of the Approach
  A Scenario Introduction
  Agent Communication
    KQML
    FIPA
  The Explanation Ontology
    Explanation Interaction Protocol
    Explanation Profile
    Explanation Strategy
  A Working Through Example
    Two Product Catalogues
    A Specific Explanation Interaction Protocol
    A Specific Explanation Profile and Strategy
  Implementing the Explanation Ontology in AGORA
    The AGORA Multi-agent System
    Implementing Explanation Algorithm in AGORA
  Concluding Remarks

10 Conclusions and Future Work
  Summary of Contributions
  Limitation and Future Directions
    Extended Customizability
    User Studies on Semantic Enrichment
    Model Quality
    Technical Method Revision

A Nomenclature
  A.1 Abbreviations

B XML Formats Used in the imapper System
  B.1 Ontology Exported from RefEdit
  B.2 Classification Results Returned by CnS Client
  B.3 Mapping Assertions Generated by imapper

C The Plan and Action File Formats in AGORA
  C.1 DTD of the Plan File
  C.2 DTD of the Action File

D The KQML Reserved Performatives

List of Figures

2.1 The basic layer of data representation standards for the Semantic Web
2.2 Classification of types of ontologies, based on the level of formality (adopted from [81])
2.3 Classification of ontology specification languages
2.4 States and activities in the ontology life-cycle [57]
2.5 A generic architecture of ontology-based applications, adopted from [111]
3.1 The cosine of β is used to measure the similarity between d_j and q
3.2 Examples of two steps in the morphological parser
3.3 A portion of the WordNet 2.0 entry for the noun book
3.4 Hypernym chains for sense one of the noun book
4.1 Framework of issues on ontology integration, from [83]
4.2 Hard problems in ontology mismatches
4.3 Classification of schema matching approaches, from [135]
4.4 Chimaera in name resolution mode, suggesting a merge of Mammal and Mammalia
4.5 PROMPT screenshot
4.6 FCA-merge process
4.7 The MOMIS architecture
4.8 The GLUE architecture
4.9 Characteristics of studied ontology mapping and merging systems
5.1 Graphical notations of basic RML constructs
5.2 Graphical notations of RML abstraction mechanism
5.3 Mapping assertion metamodel (adapted from Sari Hakkarainen [1999])
5.4 Semantic enrichment in ontology comparison
5.5 Semantic enrichment through extension analysis
5.6 Representative feature vector as enrichment structure
5.7 Two phases of the whole mapping process
5.8 Overview of the semantic enrichment process
5.9 Contributions from relevant parts when calculating the feature vector for a non-leaf concept
6.1 Two phases of the whole mapping process
6.2 Major steps in the mapping phase
6.3 Example of the hyponymy relation in WordNet used for the path length measurement
6.4 Example of calculating cluster similarity
7.1 Components of the system
7.2 The Referent Modeling Editor
7.3 CnS Client in the classification mode
7.4 The imapper architecture
7.5 The GUI of the imapper system
8.1 Precision and recall for the mapping results
8.2 Snapshots of the product catalogue extracted from UNSPSC
8.3 Snapshots of the product catalogue extracted from
8.4 Snapshots of the travel ontology extracted from the Open Directory Project
8.5 Snapshots of the travel ontology extracted from the Yahoo directory
8.6 Precision versus recall curve for the two tasks
8.7 Precision versus recall curves before and after using WordNet for postprocessing in the tourism domain
8.8 Precision versus recall curves before and after using WordNet for postprocessing in the product catalogue domain
8.9 Precision recall curves at three confidence levels in the case of the individual based gold standard in the tourism domain
8.10 Precision recall curves at three confidence levels in the case of the group discussion based gold standard in the tourism domain
8.11 Precision recall curves at high confidence level in the case of individual and group based gold standards in the tourism domain
8.12 Precision recall curves at medium confidence level in the case of individual and group based gold standards in the tourism domain
8.13 Precision recall curves at low confidence level in the case of individual and group based gold standards in the tourism domain
8.14 Precision recall curves when structure information is turned on/off in the tourism domain
9.1 The composition of an explanation mechanism
9.2 An ER model of the general explanation interaction protocol
9.3 An ER model of the main concepts in the explanation profile
9.4 Segments of two product catalogues
9.5 A specific explanation interaction protocol
9.6 Agora node functions
9.7 Simple agent architecture

List of Tables

3.1 An example of a tagged output using the Penn Treebank tagset
3.2 Scope of the current WordNet 2.0 release in terms of number of words, synsets and senses
3.3 Noun relations in WordNet
3.4 Verb relations in WordNet
3.5 Adjective and adverb relations in WordNet
8.1 The product catalogue ontologies: characteristics of the fraction of the ontologies used for the experiment
8.2 The tourism ontologies: characteristics of the fraction of the ontologies used for the experiment
8.3 Summary of the manually discovered mappings
8.4 Analysis of the inter-user agreement
8.5 An example of mappings between two product catalogues
9.1 Meaning of performatives in the Explanation Ontology
A.1 Abbreviations used in the thesis
C.1 Explanation of the plan DTD
D.1 List of KQML reserved performatives

Preface

This thesis is submitted to the Norwegian University of Science and Technology (NTNU) in partial fulfillment of the requirements for the degree doktor ingeniør. The work has been carried out at the Information Systems Group (IS-gruppen) within the Department of Computer and Information Science (IDI), under the supervision of Professors Arne Sølvberg and Jon Atle Gulla. Part of the work was conducted during a six-month research stay at the Business Informatics Group, Free University of Amsterdam. The work presented in this thesis has been financed by Accenture Norway, for which I am grateful.

Acknowledgments

I thank my supervisors for their time, patience, discussions and valuable comments. I also enjoyed the freedom I was given in pursuing my research directions.

Part of the work has been carried out at the Business Informatics Group, Free University of Amsterdam. I would like to thank Professor Hans Akkermans for inviting me to work with his group. I would also like to thank my fellow colleagues there, in particular Ziv Baida, Vera Kartseva, Michel Klein and Borys Omelayenko, for inspiring discussions and practical support.

I enjoyed cooperating with Professor Mihhail Matskin, who gave me valuable guidance and constructive criticism. To Sari Hakkarainen, I am grateful for her guidance and help in the early phase of my thesis writing, as well as for her proofreading in the final stage of the work.

Thanks to all at IDI, in particular my colleagues in the Information Systems Group, for the stimulating working atmosphere. A warm thanks to Darijus Strašunskas, with whom I shared an office for three years. I have great memories. To friends both in Norway and in China, it is a great pleasure to record my appreciation of the joy I shared with them and the help I received from them. A warm thanks to the group with which I have shared lunch for the last two years, for all the jokes, laughter and lively discussions.

To my parents, I owe thanks for their wonderful love and encouragement. I would also like to thank my brother, since he insisted that I should do so. My sincere thanks go to Jinghai for his support, understanding and encouragement all the way through.

Xiaomeng Su
October 27, 2004

Part I

Background and Context


Chapter 1

Introduction

System interoperability is an important issue, widely recognized in information technology intensive enterprises and in the research community of information systems (IS). Increasing dependence and cooperation among organizations have created a need for many organizations to access remote as well as local information sources. The wide adoption of the World Wide Web to access and distribute information further stresses the need for system interoperability.

1.1 Background

The current World Wide Web has well over 4.2 billion pages [63], but the vast majority of them are in human-readable format only. As a consequence, software agents cannot understand and process this information, and much of the potential of the Web has so far remained untapped. In response, researchers have created the vision of the Semantic Web [12], where data has structure and ontologies describe the semantics of the data. The idea is that ontologies allow users to organize information into taxonomies of concepts, each with their attributes, and to describe relationships between concepts. When data is marked up using ontologies, software agents can better understand the semantics and therefore more intelligently locate and integrate data for a wide variety of tasks.

Ontology as a branch of philosophy is the science of what is, that is, the kinds and structures of objects, properties, events, processes and relations in every area of reality. Philosophical ontology seeks a classification that is exhaustive in the sense that all types of entities are included in the classification [147]. In information systems, a more pragmatic view of ontologies is taken, where an ontology is considered a kind of agreement on a domain representation. As such, an engineering viewpoint of ontologies is often adopted in information systems, as reflected in a commonly cited definition: an ontology is a formal, explicit specification of a shared conceptualization [64]. Conceptualization refers to an abstract model of phenomena in the world, obtained by identifying the relevant concepts of those phenomena. Explicit means that the types of concepts used, and the constraints on their use, are explicitly defined. Formal refers to the fact that the ontology should be machine readable. Shared reflects that an ontology should capture consensual knowledge accepted by the communities.

Ontology is a key factor for enabling interoperability in the Semantic Web [12]. Ontologies are central to the Semantic Web because they allow applications to agree on the terms that they use when communicating. An ontology facilitates communication by providing precise notions that can be used to compose messages (queries, statements) about the domain. For the receiving party, the ontology helps to understand messages by providing the correct interpretation context. Thus ontologies, if shared among stakeholders, may improve system interoperability across ISs in different organizations and domains. However, it has long been argued that there is no single universal shared ontology that will be applauded by all players. It seems clear that ontologies face the same or even harder problems with respect to heterogeneity as any other piece of information [168]. Attempts to improve system interoperability will therefore rely on the reconciliation of the different ontologies used in different systems. The reconciliation is often approached by manual or semi-automated integration of ontologies. The technical issue here is to help resolve the ontology mismatches that evidently appear in semantic integration.

1.2 About the Problem

The Semantic Web offers a compelling vision, but it also raises many difficult challenges. The Semantic Web proposes to standardize a semantic markup method for resources, based on the one hand on a uniform formalism, XML, and on the other hand on an organization of knowledge into ontologies. In this perspective, it is necessary to carry out complex tasks such as answering queries or computing globally on distributed information sources managed by distinct, heterogeneous entities. The scientific difficulties are linked to the exact definition of the formalisms to be chosen, and to the impossibility of maintaining a worldwide centralization of the ontologies, which raises problems of application interoperability. Other challenges concern robustness, because minor errors must in no event have major consequences, and the scalability of these techniques, which must work in a reasonable time with the huge amounts of distributed data present on the whole Web and with ontologies that can contain hundreds of thousands of semantic concepts, even when they only concern specialized fields.

Among the above listed scientific challenges, the key focus of this work is on comparing and mapping different ontologies. Given the decentralized nature of the development of the web, the number of ontologies will be huge. Many of these ontologies will describe similar domains, but using different terminologies, and others will have overlapping domains. To integrate data from disparate ontologies, we must know the semantic correspondence between their elements. To motivate the importance of ontology comparison, we give two examples of its usage in relevant application domains.

1. Ontology integration: Much work on ontology comparison has been motivated by ontology integration: given a set of independently developed ontologies, construct a single global ontology. In a database setting, this is the problem of integrating independently developed schemas into a global view. The first step in integrating the ontologies is to identify and characterize inter-ontology correspondences. This is the process of ontology comparison. Once the correspondences are identified, matching elements can be confirmed or reconciled under a coherent, integrated ontology.

2. Message translation: In an electronic commerce setting, trading partners frequently exchange messages that describe business transactions. Usually, each trading partner uses its own message format. Message formats may differ in both syntax (i.e., EDI, XML or custom data structures) and semantics (i.e., different referent ontologies). To enable systems to exchange messages, application developers need to convert messages between the formats required by different trading partners. Part of the message translation problem is translating between different message ontologies, which is, in part, an ontology mapping problem. Today, application designers need to specify manually how message formats are related. A mapping operation would reduce the amount of manual work by generating a draft mapping between the two message ontologies, which an application designer can subsequently validate and modify if needed. In the Semantic Web setting, this contributes to mapping messages between autonomous agents.

In order to achieve integration of ontologies, it is necessary to integrate both the syntax and the semantics of the ontologies involved. There is wide agreement on syntactical issues in the software community, and the syntax problem may be solved if there is a willingness among the actors to do so. For instance, [64] describes a mechanism for defining ontologies that are portable over representation systems. Definitions written in a standard format for predicate calculus are translated by a system called Ontolingua into specialized representations, including frame-based languages as well as relational languages. The deep and unsolved problems thus lie with the semantic integration issue. As stated in [73], the integration of ontologies remains an expensive, time consuming and manual activity, even though ontology interchange formats exist.

Summing up, as one of the fundamental elements of the ontology integration process, mapping processes typically involve analyzing the ontologies and comparing them to determine the correspondence among concepts and to detect possible conflicts. A set of mapping assertions is the main output of a mapping process. The mapping assertions can be used directly in a translator component, which translates statements that are formulated in terms of different ontologies. Alternatively, a follow-up integration process can use the mappings to detect merging points. Interoperability among applications in heterogeneous systems thus depends critically on the ability to map between their corresponding ontologies. Today, matching between ontologies is still largely done by hand, in a labor-intensive and error-prone process [124]. As a consequence, semantic integration issues have become a key bottleneck in the deployment of a wide variety of information management applications.
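To make the role of mapping assertions in such a translator component concrete, the following sketch reduces the assertions to a simple dictionary that relates concepts in the sender's ontology to the best-matching concepts in the receiver's ontology. All concept names are hypothetical, and a real translator would also have to handle structural differences and one-to-many correspondences, not just renaming:

```python
# A minimal sketch: mapping assertions reduced to a dictionary from
# sender-side concepts to receiver-side concepts (names hypothetical).
MAPPING_ASSERTIONS = {
    "LaserPrinter": "Printer_Laser",
    "InkjetPrinter": "Printer_Inkjet",
}

def translate_message(message: dict, assertions: dict) -> dict:
    """Rewrite the ontology terms of a message into the receiver's
    vocabulary, passing unmapped terms through unchanged."""
    return {assertions.get(term, term): value
            for term, value in message.items()}

order = {"LaserPrinter": 2, "Cable": 10}
print(translate_message(order, MAPPING_ASSERTIONS))
# -> {'Printer_Laser': 2, 'Cable': 10}
```

The draft mappings produced by a mapping operation are exactly what an application designer would validate and correct before such a component is put to use.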

1.3 Objectives

The purpose of this work is to introduce a method for finding semantic correspondences among ontologies, with the intention of supporting the interoperability of ISs. The overall purpose is decomposed into the intermediate goals of this work. The goals of this work are to:

1. introduce a theoretical framework for ontology comparison and for specification of mappings between ontologies,

2. propose a method for semantic enrichment and discovery of semantic correspondences between ontologies,

3. provide an analysis of the implementation and evaluation of the method in empirical experiments, and

4. analyze the applicability of the mapping approach in supporting interoperability.

In the sequel we explain how the above objectives have been approached and motivate the main decisions made during the work on the thesis.

1.4 Approach and Scope

Ontology mapping concerns the interpretations of models of a Universe of Discourse (UoD), which in their turn are interpretations of the UoD. There is no argument that these interpretations are the only existing or complete conceptualizations of the state of affairs in the real world. We assume that the richer a description of a UoD is, the more accurate a conceptualization we achieve of the same UoD through interpretation of the descriptions. Hence, the starting point for comparing and mapping heterogeneous semantics in ontology mapping is to semantically enrich the ontologies. Semantic enrichment facilitates ontology mapping by making explicit different kinds of hidden information concerning the semantics of the modeled objects. The underlying assumption is that the more semantics are explicitly specified about the ontologies, the more feasible their comparison becomes.

Semantic enrichment techniques may be based on different theories and make use of a variety of knowledge sources [71]. We base our approach on extension analysis, i.e., on the instance information that a concept possesses. The instances that we use are documents that have been associated with the concepts. The idea behind this is that the written documents used in a domain inherently carry the conceptualizations that are shared by the members of the community. This approach is particularly attractive on the World Wide Web, because huge amounts of free text resources are available.

On the other hand, we also consider information retrieval (IR) techniques as one of the vital components of our approach. With information retrieval, a concept node in the first ontology is considered a query to be matched against the collection of concept nodes in the second ontology. Ontology mapping thus becomes a question of finding the concept nodes from the second ontology that best relate to the query node. One of the major advantages of employing IR is domain independence.

Converging the above two ideas, it becomes clear that the enriched semantic information of a concept needs to be represented in a way that is compatible with an IR framework. Given that the vector space model is the most used one in IR, it is natural to represent the instance information in vectors, where the documents under one concept become building material for the feature vector of that concept. In some cases, ontologies exist without any available instance information. We tackle that by assigning instances to the ontologies. That is where document classification comes into play, aiming at automating the process of assigning documents to concept nodes.
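As a minimal illustration of this vector space idea (using raw term frequencies rather than the weighting scheme an actual system would apply, and with hypothetical document snippets), the following sketch builds a feature vector for a concept from its assigned documents and compares two concepts by the cosine of the angle between their vectors:

```python
import math
from collections import Counter

def feature_vector(documents):
    """Build a term-frequency feature vector for a concept from the
    documents assigned to it (tokenization kept naive for brevity)."""
    vector = Counter()
    for doc in documents:
        vector.update(doc.lower().split())
    return vector

def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse term-count vectors;
    1.0 means identical direction, 0.0 means no shared terms."""
    dot = sum(v1[t] * v2[t] for t in set(v1) & set(v2))
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

hotel = feature_vector(["rooms rates booking", "hotel booking online"])
lodging = feature_vector(["lodging rooms booking"])
print(cosine_similarity(hotel, lodging))
```

In this framing, the vector of a concept from the first ontology plays the role of the query, and the concepts of the second ontology are ranked by this similarity score.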

1.5 Way of Working and Major Contributions

Considering the research methodology in the above context, the way of working consists of a descriptive analysis phase, a normative development and construction phase and an empirical evaluation phase. Together, the phases include the following steps.

1. The survey of ontology mapping methods step includes an investigation of existing methods of ontology mapping and an analysis of the process of ontology mapping, together with the properties characterizing such a process.

2. The survey of applicable parts of information retrieval and computational linguistics step includes an investigation of applicable parts of the relevant theories and an analysis of the linguistic basis of the theories.

3. The analysis of requirements step includes an inventory of the problems in mapping of ontology concepts on the specification level and an analysis of the requirements raised.

4. The development of semantic enrichment instruments step includes a specification of the component (the result of extension analysis) to be used for semantic enrichment of ontologies, and stepwise instructions for its construction.

5. The development of mapping algorithm step includes the definition of an abstract ontology mapping algorithm and a description of the stepwise calculation of correspondences of ontology concepts based on the enriched structure as specified in the previous step.

6. The prototype application step includes the development and implementation of a prototypical environment for ontologies based on the mapping algorithm in the previous step.

7. The empirical application step includes experimental evaluation of the approach of using semantic enrichment and the proposed mapping algorithm in two case studies.

8. The applicability analysis step includes the experiment of using the discovered mappings to improve semantic interoperability in a multi-agent environment, AGORA.

The application of the above way of working has resulted in the contributions of this thesis and the earlier deliverables described below. A major contribution of this thesis is the development and specification of an approach to semantic integration of ontologies. The work has been directed at improving interoperability across heterogeneous systems, in particular that of multi-agent systems. During the work it has been natural to incorporate results from earlier and parallel work done by other members of the Information Systems Group and the Distributed Intelligent Systems Group. Some relevant venues have also been explored by formulating suitable tasks for the diploma students whom I have supervised at the institute. My own contributions are in particular related to the following:

1. establishing a particular approach to using an extension-based semantic enrichment method for ontology mapping and integration,

2. proposing an architecture for a system to support our approach, as well as implementing the system in a prototype, and

3. presenting the results from the validation experiment that evaluates our approach against manual activities performed by users.

The major contributions of the thesis as a whole may be summarized as follows:

1. The thesis has, apart from proposing and experimenting with a particular approach for semantic integration of ontologies, contributed to the understanding of semantic distance between ontologies in general.

2. Moreover, the work has shown the feasibility of using the discovered mappings to improve interoperability in a multi-agent environment, AGORA.

3. Finally, the work has laid the ground for analyzing and experimenting with other mapping approaches, as well as different combinations of them.

1.6 Publications

This thesis is partly based on papers presented at conferences during the work, as listed below:

Xiaomeng Su and Lars Ilebrekke, A comparative study of ontology languages and tools, in Proceedings of the Conference on Advanced Information Systems Engineering (CAiSE 02), Toronto, Canada, 2002, LNCS, Springer-Verlag.
This is a state-of-the-art paper, presenting the result of our initial literature study on ontology engineering languages and tools. It reviews existing ontology languages and tools with respect to a quality evaluation framework.

Xiaomeng Su and Lars Ilebrekke, Using a Semiotic Framework for a Comparative Study of Ontology Languages and Tools, book chapter in J. Krogstie, T. Halpin and K. Siau (Eds.), Information Modeling Methods and Methodologies, IDEA Group Publishing.
This is an extended version of the previous state-of-the-art paper.

Xiaomeng Su, Terje Brasethvik and Sari Hakkarainen, Ontology mapping through analysis of model extension, in The 15th Conference on Advanced Information Systems Engineering (CAiSE 03), CAiSE Forum, Short Paper Proceedings, published by Technical University of Aachen (RWTH), Klagenfurt/Velden, Austria, June 2003.
This is a position paper, introducing the basic design rationale of the approach and the intended way of implementation. It gives an overview of the ideas of the approach.

Xiaomeng Su, Sari Hakkarainen and Terje Brasethvik, Semantic enrichment for improving system interoperability, in Proceedings of the 19th ACM Symposium on Applied Computing (SAC 04), ACM Press, Nicosia, Cyprus, March 2004.
This is a core paper, following up the ideas generated from the previous position paper. It presents the specification, design and implementation of the imapper approach in detail, constituting the base of this thesis.

Xiaomeng Su and Jon Atle Gulla, Semantic enrichment for ontology mapping, in Proceedings of the 9th International Conference on Natural Language to Information Systems (NLDB 04), LNCS, Springer-Verlag.
This is a follow-up paper to the previous SAC 04 paper. It describes the added linguistic analysis functionality of the mapping algorithm using WordNet. Moreover, it presents the evaluation of the system in terms of precision and recall of the mapping predictions in two case studies.

Xiaomeng Su, Mihhail Matskin and Jinghai Rao, Implementing Explanation Ontology for Agent System, in Proceedings of the IEEE International Conference on Web Intelligence (WI 03), IEEE Computer Society, Halifax, Canada, 2003.
This paper describes the applicability of the mapping approach in an agent communication setting. It presents both the theoretical framework for using the results in an agent environment and a practical example of integrating the results into a running agent platform, AGORA.

1.7 Thesis Outline

In this chapter, an introduction to the thesis is given: the background of the work, the main problem tackled, the overall objectives, the way of working and the main contributions achieved are described. The structure of the rest of the thesis follows the way of working, and it implicitly includes a descriptive, a normative and an empirical part. The outline of the thesis is as follows.

Related work and underlying existing theories are outlined in the descriptive part. Chapter 2 introduces the basic concepts of ontology engineering in order to provide a basic understanding of ontologies, which are the basis of this work. Chapter 3 provides a brief overview of the various fields of research that are referred to and have influenced the work presented in this thesis. In Chapter 4, a brief survey of the state of the art in the development of ontology languages and tools is given. In addition, a general taxonomy of different ontology mapping methods is proposed.

The main contributions of this thesis are presented in the normative part. A novel ontology mapping framework, a semantic enrichment method and an ontology mapping algorithm are introduced. Chapter 5 proposes and specifies an extension analysis based semantic enrichment method in the context of ontology mapping. The modeling language used in the examples throughout the thesis is described in this chapter as well. Chapter 6 introduces a computational framework for mapping ontology elements that are semantically enriched. Chapter 7 describes the prototype implementation of the computational framework.

Two case studies underlying an evaluation of the proposed approach and technique are discussed in the empirical part. Chapter 8 presents experiences from the two case studies as well as an analysis of empirical observations of the proposed semantic enrichment and mapping methods. The application domain of the first case study is a product catalogue integration task. The performance of the prototype system is evaluated in terms of precision and recall. In the same chapter, another case study, in the application domain of the tourism sector, is presented, which is also aimed at evaluating the validity of the proposed approach. Chapter 9 presents a scenario where the mapping results generated by the system can be used to improve system interoperability in a multi-agent environment, AGORA. Finally, Chapter 10 outlines a number of directions for future work, presents the conclusions and summarizes the contributions of the work.


Chapter 2

Basic Ontology Concepts

This chapter introduces the basic concepts of ontology engineering. Its main goal is to provide a basic understanding of ontologies, which are the basis of this work. This chapter is partly based on previously published papers [159] [160].

2.1 The Semantic Web

"...The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in co-operation." (Tim Berners-Lee, James Hendler and Ora Lassila, The Semantic Web, Scientific American, May 2001)

The Web today enables people to access documents and services on the Internet. Today's methods, however, require human intelligence. The interface to services is represented in web pages written in natural language, which must be understood and acted upon by a human. The Semantic Web is an extension of the current Web in which information is given well-defined meaning, enabling computers and people to work in better cooperation.

The vision of the Semantic Web was first introduced by Tim Berners-Lee [12]. An example in [13] illustrates how the Semantic Web might be useful. Suppose you want to compare the price and choice of flower bulbs that grow best in your zip code, or you want to search online catalogs from different manufacturers for equivalent replacement parts for a Volvo 740. The raw information that may answer these questions may indeed be on the Web, but it is not in a machine-usable form. You still need a person to discern the meaning of the information and its relevance to your needs.

The Semantic Web addresses this problem in two ways. First, it will enable communities to expose their data so that a program does not have to strip the formatting, pictures and ads from a Web page to guess at the relevant bits of information. Second, it will allow people to write (or generate) files which explain, to a machine, the relationships between different sets of data. For example, one will be able to make a semantic link between a database with a zip-code column and a form with a zip field, stating that they actually mean the same thing. This will allow machines to follow links and facilitate the integration of data from many different sources.

The Semantic Web will be built on layers of enabling standards. Figure 2.1 shows the enabling standards of the Semantic Web. Uniform Resource Identifiers (URIs) are a fundamental component of the current Web, providing the ability to uniquely identify resources as well as relations among resources. The Extensible Markup Language (XML) is a fundamental component for syntactical interoperability. The Resource Description Framework (RDF) family of standards leverages URIs and XML to allow documents to be described in the form of metadata. RDF Schema (RDFS) is an extension of RDF, which defines a simple modeling language on top of RDF. The ontology layer provides more meta-information, such as the cardinality of relationships, the transitivity of relationships, etc. The logic layer enables the writing of rules. The proof layer executes the use of rules and evaluates, together with the trust layer, mechanisms for applications to decide whether to trust the given proof or not. Digital signatures are used to detect alterations to documents.

Figure 2.1: The basic layer of data representation standards for the Semantic Web.
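The zip-code example above can be made concrete with a few RDF statements. The sketch below is a minimal illustration, assuming the rdflib Python library and two hypothetical example.org vocabularies; it records that a database column and a form field mean the same thing, which is precisely the kind of machine-followable link the Semantic Web relies on:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL

# Hypothetical vocabularies of two independently developed sources.
DB = Namespace("http://example.org/database#")
FORM = Namespace("http://example.org/webform#")

g = Graph()
g.bind("db", DB)
g.bind("form", FORM)

# One triple states that the database's zip-code column and the
# form's zip field carry the same meaning; an agent can follow it
# to integrate data across the two sources.
g.add((DB.zipCode, OWL.equivalentProperty, FORM.zip))

print(g.serialize(format="turtle"))
```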

2.2 The Role of Ontology

The word ontology comes from the Greek ontos, for being, and logos, for word. It is a relatively new term in the long history of philosophy, introduced by the 19th century German philosophers to distinguish the study of being as such from the study of various kinds of beings in the natural sciences. The more traditional term is Aristotle's word category (kathgoria), which he used for classifying anything that can be said or predicated about anything [151] [150].

The term ontology has been used in many ways and across different communities [65] [66]. Ontology as a branch of philosophy is the science of what is, that is, the kinds and structures of objects, properties, events, processes and relations in every area of reality. Philosophical ontology seeks a classification that is exhaustive in the sense that all types of entities are included in the classification [147] [146]. In information systems, a more pragmatic view of ontology is taken, where an ontology is considered a kind of agreement on a domain representation. As such, an engineering viewpoint of ontology is often taken in information systems, as reflected in a commonly cited definition: an ontology is an explicit account or representation of a conceptualization [166]. This conceptualization includes a set of concepts, their definitions and their inter-relationships. Preferably, this conceptualization is shared, or agreed upon. We also observe that ontologies are a natural continuation of thesauri in digital library research and of conceptual schemas in database and information systems research.

Next, we briefly describe the way an ontology explicates concepts and their properties. Furthermore, we list the benefits of this explication in different typical application scenarios.

2.2.1 Shared Vocabularies and Conceptualizations

In general, every person has her individual view of the world and the things she has to deal with every day. However, there is a common basis of understanding in terms of the language we use to communicate with each other. Terms from natural language can therefore be assumed to be a shared vocabulary relying on a (mostly) common understanding of certain concepts, with little variety. We often call this idea a conceptualization of the world. Such conceptualizations provide terminologies that can be used for communication.

The example of natural language already shows that a conceptualization is never universally valid, but rather is only valid for a limited number of persons committing to that conceptualization. This fact is reflected in the existence of different languages, which differ more or less. Things get even worse when we are concerned not with everyday language but with terminologies developed for specific areas. In these cases, we often find situations where even the same term may refer to different phenomena. The use of the term ontology in philosophy and its use in computer science may well serve as an example. The consequence is a separation into different groups that share a common terminology and its conceptualization. These groups, which commit to the same ontologies, are also called information communities or ontology groups [55].

The main problem with the use of a shared vocabulary according to a specific conceptualization of the world is that much of the information remains implicit. Ontologies have been set out to overcome the problem of implicit and hidden knowledge by making the conceptualization of a domain explicit. This corresponds to one of the early definitions of the term ontology in computer science [64]: An ontology is a formal explicit specification of a shared conceptualization. A conceptualization refers to an abstract model of some phenomenon in the world that identifies the relevant concepts of the phenomenon. Explicit means that the types of concepts used and the constraints on their use are explicitly defined. Formal refers to the fact that the ontology should be machine understandable. Shared reflects the notion that an ontology captures consensual knowledge; that is, it is not restricted to one individual but accepted by a group [56].

An ontology is used to make assumptions about the meaning of a term available. It can also be seen as an explication of the context a term is normally used in. Lenat [91] [92], for example, describes context in terms of twelve independent dimensions that have to be known in order to understand a piece of knowledge completely, and also shows how these dimensions can be explicated using the Cyc ontology.

2.2.2 Types of Ontologies

There are different ways in which an ontology may explicate a conceptualization and the corresponding context knowledge. These may range from a purely informal natural language description of a term, corresponding to a glossary, up to strictly formal approaches with the expressive power of full first order predicate logic or even beyond (e.g., Ontolingua [64] [58]). There exist several ways to categorize types of ontologies. Jasper and Uschold distinguish two ways in which the mechanisms for the conceptualization of domain knowledge by an ontology can be compared [167].

Level of Formality

One of the well-known divisions categorizes types of ontologies by their level of formality, ranging from a list of terms to concepts having relations and axioms. Figure 2.2 summarizes these distinctions. It also includes other terminologies for these differences, as used by, for example, [33], discussing lightweight and heavyweight ontologies.

Figure 2.2: Classification of types of ontologies, based on the level of formality (adopted from [81]).

Extent of Explication

The other comparison criterion is the extent of explication that is reached by the ontology. This criterion is very much connected with the expressive power of the specification language used. The least expressive specification of an ontology consists of an organization of terms in a network using two-placed relations. This idea goes back to the use of semantic networks. More expressive ontology languages like RDF Schema contain class definitions with associated properties that can be restricted by so-called constraint properties. However, default values and value range descriptions are not expressive enough to cover all possible conceptualizations. Greater expressive power can be provided by allowing classes to be specified by logical formulas. These formulas can be restricted to a decidable subset of first order logic; this is the approach of description logic [45]. Nevertheless, there are also approaches allowing for more expressive descriptions. In Ontolingua, for example, classes can be defined by arbitrary KIF expressions. Beyond the expressiveness of first order predicate logic, there are also special purpose languages that have an extended expressiveness to cover specific needs of their application areas.

On the other hand, the above two criteria are not the only methods of categorizing ontologies. Other variations include the level of generality [65], and ontology base and commitment layer [78] [122].

2.2.3 Beneficial Applications

In [36], it is stated that ontologies are used in e-commerce to enable machine-based communication between buyers and sellers, vertical integration of markets (such as VerticalNet) and description reuse between different marketplaces. Search engines also use ontologies to find pages with words that are syntactically different but semantically similar. In particular, the following areas will benefit from the use of ontologies.

Semantic Web

The Semantic Web aims at tackling the growing problems of traversing the expanding web space, where currently most web resources can only be found by syntactical matches. The Semantic Web relies heavily on formal ontologies that structure underlying data for the purpose of comprehensive and transportable machine understanding. They properly define the meaning of data and metadata [152]. In general, one may consider the Semantic Web more a vision than a concrete application.

Knowledge Management

Knowledge management deals with acquiring, maintaining and accessing the knowledge of an organization. The technologies of the Semantic Web build the foundation for moving from a document oriented view of knowledge management to a view oriented around knowledge pieces, where knowledge pieces are connected in a flexible way. Intelligent push services, the integration of knowledge management and business processes, as well as concepts and methods for supporting the vision of ubiquitous knowledge, are urgently needed. Ontologies are the key means to achieve this functionality. They are used to annotate unstructured information with semantic information, to integrate information and to generate user specific views that make knowledge access easier. Applications of ontologies in knowledge management are described in [162] [39].

Interoperability

An important application area for ontologies is the integration of existing systems. In order to enable machines to understand each other, we need to explicate the context of each system in a formal way. Ontologies are then used as an inter-lingua for providing interoperability, since they serve as a common format for data interchange [153] [166]. Such a feature is especially desirable in large scale web commerce environments [129] [56].

Information Retrieval

Common information retrieval techniques rely either on a specific encoding of available information or on simple full-text analysis. Both approaches suffer from problems: the query entered by the user may not be completely consistent with the vocabulary of the documents, and the recall of a query will be reduced since related information with a slightly different encoding is not matched. Using an ontology to explicate the vocabulary may overcome some of these problems. When used for the description of available documents as well as for query formulation, an ontology serves as a common basis for matching queries against potential results on a semantic level. In some cases, the ontology can also be directly used as a user interface to navigate through the available documents [19]. On the other hand, commercial shopping sites, e.g. IBM's, have a dictionary of terms (a simple ontology) that they use to help the search function. To summarize, information retrieval benefits from the use of ontologies, because ontologies help to decouple description and query vocabulary and increase retrieval performance [67].
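A small sketch of this decoupling (plain Python, with a hypothetical one-concept vocabulary) shows how an ontology that records the terms expressing each concept can expand a query, so that documents using different wording still match:

```python
# Hypothetical fragment of an ontology: a concept mapped to the
# vocabulary that may express it in documents.
CONCEPT_TERMS = {
    "automobile": ["automobile", "car", "motorcar"],
}

def expand_query(query, concept_terms):
    """Replace each query word by the full vocabulary of the concept
    it names, so semantically related documents are also matched."""
    expanded = []
    for word in query.lower().split():
        expanded.extend(concept_terms.get(word, [word]))
    return expanded

print(expand_query("automobile dealer", CONCEPT_TERMS))
# -> ['automobile', 'car', 'motorcar', 'dealer']
```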

41 2.3. ONTOLOGY LANGUAGES 23 Figure 2.3: Classification of ontology specification languages. Service Retrieval The ability to rapidly locate useful online service (e.g. software applications, software components, process models, or service organizations), as opposed to simple useful documents, is becoming increasingly critical in many domains. Berstain and Klein describes a novel service retrieval approach based on the sophisticated use of process ontologies [14]. The evaluation suggested that using process ontology-based queries produced retrieval precision higher than that of existing service retrieval approaches, while retaining polynomial complexity for query enactment. Along this line of work are also approaches that try to combine services to fulfill the users needs [164] [133]. 2.3 Ontology Languages Over the years, a number of ontology languages have been developed, focusing on different aspects of ontology modelling. Many ontology languages have their roots in first order logic. Some of them have particular focus on modelling ontology in a formal yet intuitive way (mainly framebased languages, Ontolingua, F-logic, OCML and OKBC compatible languages), while others (mainly various description logic based languages

42 24 CHAPTER 2. BASIC ONTOLOGY CONCEPTS like LOOM, OIL and OWL) are more concerned with finding an appropriate subset of first order logic with decidable and complete subsumption inference procedures. Within the vision of the Semantic Web [12], RDF(S) is proposed as a modelling language particularly designed for the Semantic Web metadata and applications. In general, the web languages offer only elementary modelling support, but they form a sound base for other languages to build on top of them. Newly developed languages like OIL, DAML+OIL, and OWL are just like that. In this section, we briefly introduce these languages. For a comparative study of the different languages, we refer to [159] [160] [37]. Figure 2.3 depicts the different languages and how they are related to each other. The categorization is adopted from [34] Traditional Ontology Languages CycL. CycL is a formal language whose syntax derives from first-order predicate calculus and was first developed in the Cyc project [91] [92] in the 80s, which aims at providing a general ontology for commonsense knowledge. The Cycorp has created a large knowledge base for common sense knowledge using the CycL language. To express real-world concepts, the language has a vocabulary of terms (about 160), which can be combined into meaningful CycL expressions. The main concepts of CycL are: constants (the vocabulary or words of the language like thing, concept, etc.), variables (stand for constant or formulas), formulas (combine terms into meaningful expression), predicates (express relationship between terms) and micro-theories. Micro-theories are sets of formulas, but they can also participate in formulas, i.e. reification. Ontolingua. The term Ontolingua denotes both the system and the language [50]. The Ontolingua language is based on KIF (Knowledge Interchange Format) and the Frame Ontology. KIF has a declarative semantic and is based on first order predicate calculus. It provides definitions for objects, function, relation and logic constants. KIF is a language for knowledge exchange and is tedious to use for the development of ontologies. Therefore, the Frame Ontology is built on top of KIF and provides definitions in an object oriented paradigm, like class, subclass-of, instance-of etc. But ad hoc axioms (model sentences, which are always true) cannot be expressed in Frame Ontology. Ontolingua lets the developer decide whether to use the full expressive power of KIF, where axioms can be expressed, or to be more restricted during the specifica-

43 2.3. ONTOLOGY LANGUAGES 25 tion by using only the Frame Ontology. An ontology using Ontolingua is typically defined by: relations, classes (treated as unary relations), functions (a special kind of relation), individuals (distinguished objects) and axioms (relate the relations). F-logic (Frame Logic). F-logic [82] was developed in the late 80s. It is a logic language integrated with the object-oriented (or frame-based) paradigm. Some fundamental concepts from the object-oriented modelling paradigm have a direct representation in F-logic, such as the concepts of class, method, type and inheritance. One of the main problems with the object-oriented approaches lack of formal logic semantics is overcome by the logical foundation of F-logic. There are many similarities between F-logic and Ontolingua, since both try to integrate frames into a logical framework. But the frame-based modelling primitives are explicitly defined as first class citizens in the semantics of F-logic, while Ontolingua treats them as second-order terms defined by KIF axioms. On the other hand, F-logic lacks the powerful reification mechanism Ontolingua inherits from KIF, which allows the use of formulas as terms of meta-formulas. OKBC (Open Knowledge-Base Connectivity). OKBC specifies a knowledge model of knowledge representation systems (classes, slots, facets and individuals) as well as a set of operations based on this model (e.g., find a frame, match a name, delete a frame) [31]. An application uses these operations to access and modify the knowledge stored in an OKBC compliant system. The OKBC knowledge model supports an object-oriented representation of knowledge and provides a set of constructs commonly found in that modelling paradigm, including: constants, frames, slots, facets, classes, individuals and knowledge bases. For representation of axioms and rules, the OKBC knowledge model is not sufficient. OKBC is complementary to KIF, which provides a declarative language for describing knowledge. KIF does not include elements to manipulate or query the ontology and the knowledge base. On the other hand, KIF is more expressive than OKBC, as OKBC focuses on modelling elements that are efficiently supported by most of the knowledge representation systems. OCML (Operational Conceptual Modelling Language). OCML was developed at the Knowledge Media Institute (KMI) at the Open University in the VITAL project [119] [44]. Its primary purpose was to provide operational knowledge modelling facilities. To achieve this, it supports the specification of three types of constructs: functional terms (specify an ob-

Further, interpreters for functional and control terms, as well as a proof system, are included. The operational nature of OCML supports quick prototyping, which is a desirable feature for model validation. OCML provides a set of base ontologies (including meta, functions, relations, sets, numbers, lists, strings, mapping, frames, inferences, environment and task-method) that forms a rich modelling library on top of which other ontologies can be built.

LOOM. LOOM [96] is a knowledge representation system developed at the University of Southern California's Information Sciences Institute in the early 1990s. It was designed to support the construction and maintenance of model-based applications. To that end, the LOOM model specification language facilitates the specification of explicit domain models, while the LOOM behaviour specification language provides programming paradigms (object-oriented and rule-based) that can be employed to query and manipulate the models. In that sense, LOOM is also an operational language. The main feature of LOOM is its powerful classification mechanism, which integrates a sophisticated concept definition language with reasoning. Having its roots in Description Logic, LOOM has a powerful classifier that can, at the concept and relation level, infer the existence of subsumption relations between defined concepts and, at the instance or fact level, infer new factual relations (class membership, for instance). The language and the system are continuously updated.

Telos. Telos is a language intended to support the development of information systems, developed at the University of Toronto [120]. The language was founded on concepts from knowledge representation, but also brought in ideas from requirements languages and deductive databases (an object-oriented framework which supports aggregation, generalization and classification). Other Telos features include an explicit representation of time and primitives for specifying integrity constraints and deductive rules.

Web Standards

XML (Extensible Markup Language). XML [20] is the universal format for structured documents and data on the Web, proposed by the W3C. The main contribution of XML is that it provides a common and communicable syntax for web documents. XML itself is not an ontology language, but XML Schemas, which define the structure, constraints and semantics of XML documents, can to some extent be used to specify ontologies. Since XML Schema was created mainly for the validation of XML documents, and its modelling primitives are application-oriented rather than concept-oriented, it is in general not viewed as an ontology language.
RDF (Resource Description Framework). RDF [89] was developed by the W3C (World Wide Web Consortium) as part of its Semantic Web effort. It is a framework for describing and interchanging metadata by means of resources (subjects: available or imaginable entities), properties (predicates: describing the resources) and statements (objects: a value assigned to a property of a resource). RDF Schema [21] extends RDF with modelling primitives commonly found in ontology languages, such as domain and range restrictions on properties and class and property taxonomies. More expressive constructs, like axioms, cannot be expressed in RDF Schema. In combination, RDF Schema enables the representation of classes, properties and constraints, while RDF allows the representation of instances and facts, which together make a qualified lightweight ontology language. While RDF and RDFS are different, they are complementary; the combination of the two is usually denoted RDF(S).

Web-based Ontology Specification Languages

SHOE (Simple HTML Ontology Extension). SHOE [73] [95] is an extension of HTML that incorporates semantic knowledge in ordinary web documents by annotating HTML pages with ontologies. SHOE provides modelling primitives both to specify ontologies and to annotate web pages. Each page declares which ontologies it uses, which makes it possible for agents that are aware of the semantics to perform more intelligent searching. SHOE allows declaring classifications of entities, relationships between entities and inference rules (in the form of Horn clauses without negation), as well as ontology inclusion and versioning information.

OIL (Ontology Inference Layer). OIL [35] [53] is an initiative funded by the European Union programme for Information Society Technologies as part of the On-To-Knowledge project. OIL is both a representation and an exchange language for ontologies. The language synthesizes work from different communities (modelling primitives from frame-based languages, semantics of the primitives defined by Description Logic, and an XML syntax) to achieve the aim of providing a general-purpose markup language for the Semantic Web. OIL is also compatible with RDF(S), as it is defined as an extension of RDF(S). The language is defined in a layered approach. The three layers are Standard OIL (mainstream modelling primitives usually found in ontology languages), Instance OIL (which includes individuals in the ontology) and Heavy OIL (not yet defined, but aimed at additional reasoning capabilities). OIL provides a predefined set of axioms (such as disjointness and covering of classes) but does not allow defining arbitrary axioms.
DAML+OIL. DAML+OIL [75] [76] is the product of an effort to merge two languages: DAML (DARPA Agent Markup Language) and OIL. DAML+OIL is a language based on RDF(S) with richer modelling primitives. In general, what DAML+OIL adds to RDF Schema are additional ways to constrain the allowed values of properties and the properties a class may have. The differences between OIL and DAML+OIL are subtle, as the same effect can often be achieved using different constructs of the two languages (for instance, DAML+OIL has no direct equivalent of OIL's covered axiom, but the same effect can be achieved using a combination of unionOf and subclass). In addition, DAML+OIL has better compatibility with RDF(S) (for instance, OIL has explicit OIL instances, while DAML+OIL relies on RDF for instances). DAML+OIL has also been a proposed W3C recommendation for a semantic markup language for web resources.

OWL (Web Ontology Language). OWL [107] is a semantic markup language for publishing and sharing ontologies on the Web, and it is the latest W3C proposed recommendation for that purpose. The language incorporates lessons learned from the design and application of DAML+OIL. OWL has three increasingly expressive sublanguages: OWL Lite (classification hierarchies and simple constraints), OWL DL (adding class axioms, Boolean combinations of class expressions and arbitrary cardinality) and OWL Full (which also permits the meta-modelling facilities of RDF(S)). Ontology developers should consider which sublanguage best suits their needs. The choice between OWL Lite and OWL DL depends on the extent to which users require the more expressive constructs provided by OWL DL. The choice between OWL DL and OWL Full mainly depends on the extent to which users require the meta-modelling facilities of RDF Schema. The reason why OWL DL contains the full vocabulary but restricts how it may be used is to give logical inference engines properties that are desirable for optimization.
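To make the flavour of these web languages concrete, the following minimal sketch builds a small RDF(S) fragment, plus one OWL statement, programmatically. It uses the Python rdflib library purely as an illustration; the library choice and the example vocabulary (a hypothetical Car class) are ours, not something prescribed by the languages discussed above.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS, URIRef
from rdflib.namespace import OWL

# A hypothetical namespace for the example vocabulary.
EX = Namespace("http://example.org/vehicles#")

g = Graph()
g.bind("ex", EX)

# RDF Schema level: a small class taxonomy with domain/range restrictions.
g.add((EX.Vehicle, RDF.type, RDFS.Class))
g.add((EX.Car, RDF.type, RDFS.Class))
g.add((EX.Car, RDFS.subClassOf, EX.Vehicle))      # class taxonomy
g.add((EX.hasOwner, RDF.type, RDF.Property))
g.add((EX.hasOwner, RDFS.domain, EX.Car))         # domain restriction
g.add((EX.hasOwner, RDFS.range, RDFS.Literal))    # range restriction

# OWL adds constructs RDF(S) lacks, e.g. equating classes from
# different ontologies (here, a made-up second namespace).
g.add((EX.Car, OWL.equivalentClass,
       URIRef("http://example.org/auto#Automobile")))

# RDF level: an instance (a fact).
g.add((EX.myCar, RDF.type, EX.Car))
g.add((EX.myCar, EX.hasOwner, Literal("Alice")))

print(g.serialize(format="turtle"))
```

Running the sketch serializes the graph in Turtle syntax, which makes the division of labour visible: the schema triples carry the lightweight ontology, while the instance triples carry the facts.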
Figure 2.4: States and activities in the ontology life-cycle [57].

2.4 Ontology Engineering

Life Cycle of an Ontology

The design of an ontology is an iterative maturing process: the ontology matures by evolving through intermediate states towards a desired final condition. As soon as an ontology becomes important, the ontology engineering process has to be treated as a project, and project management methods must therefore be applied. [57] recognized that planning and specification are important activities, and the authors list the activities that need to be performed during the ontology development process. They explain that the life of an ontology moves through the following states: specification, conceptualization, formalization, integration, implementation, and maintenance. Knowledge acquisition, documentation and evaluation are support activities that are carried out during most of these states (cf. figure 2.4). Ontology design is a project and should be treated as such, especially when it becomes large. Project management and software engineering techniques and guidelines should be adapted and applied to ontology engineering. For a comparative study of ontology guidelines, we refer to [72].
Figure 2.5: A generic architecture of ontology-based applications, adapted from [111].

Ontology-based Architectures

Effective and efficient work with the Semantic Web in general, and with ontologies in particular, must be supported by advanced tools that enable the full power of the technology. [163] suggests reviewing the different tools within an ontology-based architecture instead of focusing on individual tools separately. In fact, many of the current ontology engineering environments provide a broad range of services rather than a single service. Figure 2.5 sketches a decomposed design of ontology-based applications, highlighting the different elements that contribute to their success.

The Ontology Layer

The components in this layer serve the common goal of acquiring ontologies. In particular, this requires the following elements.

Ontology extraction applies Natural Language Processing (NLP) techniques to domain documents to determine the most relevant concepts of a domain and their relationships.
Ontology learning is a more generic term applied to all bottom-up approaches to ontology acquisition that start from a given set of data reflecting human communication and interaction processes.

Ontology annotation tools are used to create an instance set on the basis of an existing ontology.

An ontology editor is an application intended for creating or editing ontologies manually by a knowledge engineer.

Ontology evaluation tools aim at improving the quality of ontologies.

Ontology mapping, aligning and merging tools support users in finding similarities and differences between source ontologies.

The Middleware Layer

Ontology middleware plays the role of hiding the ontology layer in systems and providing advanced services to applications, such as ontology management, storage, query and inference.

Ontology storage facilities (also called ontology servers) provide database-like functionality for the persistent storage and selective retrieval of ontologies.

The goal of querying is to provide high-level access to the ontology through questions formulated in a query language that is easy for people to write and for machines to evaluate.

Inference engines process the knowledge structures captured in ontologies to derive knowledge that is implicit in them.

Ontology management is the set of techniques necessary to efficiently use multiple variants of ontologies; it includes issues like version control, security, access rights and trust management.

Ontology transfer refers to the ability of the middleware to connect ontology servers over the network.
The Application Layer

The application layer is the home of ontology-based applications: software that supports users in accessing, organizing, exchanging and aggregating information through the use of ontologies [167]. Example applications are:

Ontology-based search and browsing supports different information-seeking modes for accessing large collections of instance sets or data items referred to by the ontology.

Ontology-based sharing provides interoperability between different systems through reference to a common ontology.

2.5 Concluding Remarks

This chapter has outlined the theoretical background of the work. It has introduced the basic concepts of ontology engineering with the intention of providing a basic understanding of ontologies, which are the foundation of this work. The following is a summary of some of the main points discussed:

The Semantic Web will be built on layers of enabling technology, and ontologies will be a core element of the Semantic Web.

An ontology is a formal explicit specification of a shared conceptualization.

Ontologies can be classified according to their level of formality and extent of explication.

A number of applications, ranging from system interoperability to knowledge management, can benefit from using ontologies as a core element.

There exist several ontology specification languages with different focuses. Among them, DAML+OIL and OWL are W3C ontology language recommendations.

The design of an ontology is an iterative maturing process.
To enable the full power of ontologies, a variety of tools is needed. They can be classified into three layers: the ontology layer, the middleware layer, and the application layer.
Chapter 3

Technological Overview

This chapter provides a brief overview of the various fields of research that are referred to and that have influenced the work presented in this thesis. The aim of the chapter is not to give a complete overview of the fields, but rather to introduce the basic concepts of the relevant techniques that have been adopted for this work.

3.1 Information Retrieval

Information retrieval (IR) deals with the representation, storage, organization of, and access to information items. The representation and organization of the information items should provide the user with easy access to the information in which she is interested [3]. In the past 20 years, the area of information retrieval has grown well beyond its primary goals of indexing text and searching for useful documents in a collection. Nowadays, research in IR includes modeling, document classification and categorization, systems architecture, user interfaces, data visualization, filtering, languages, search engines, etc. The part that is of particular relevance to this work is the vector space model.

Vector Space Models

The vector space model is one of the three classical retrieval models (the other two being the boolean model and the probabilistic model) [141] [142]. The vector space model recognizes that the use of binary weights in the boolean model is too limiting, and proposes a framework in which partial matching is possible.
Figure 3.1: The cosine of β is used to measure the similarity between $d_j$ and $q$.

This is accomplished by assigning non-binary weights to index terms in queries and in documents. These term weights are ultimately used to compute the degree of similarity between a document and a query. The procedure can be divided into three stages. The first stage is document indexing, where content-bearing terms are extracted from the document text. The second stage is the weighting of the indexed terms to enhance the retrieval of documents relevant to the user. The last stage ranks the documents with respect to the query according to a similarity measure. (In commercial search engines, documents are ranked not only based on similarity, but also on static qualities, like popularity, document length, language, etc.)

Document Indexing

In [142], it is defined that:

Definition 3.1 Let $k_i$ be an index term, $d_j$ be a document, and $w_{i,j} \geq 0$ be a weight associated with the pair $(k_i, d_j)$. This weight quantifies the importance of the index term for describing the semantic contents of the document.

For the vector model, the weight $w_{i,j}$ associated with a pair $(k_i, d_j)$ is positive and non-binary. The vector for a document $d_j$ is therefore represented by $\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$, where $t$ is the total number of index terms in the system. Further, the index terms in the query are also weighted. Let $w_{i,q}$ be the weight associated with the pair $(k_i, q)$. Then the query vector $\vec{q}$ is defined as $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$. A document $d_j$ and a user query $q$ are thus represented as $t$-dimensional vectors, as shown in figure 3.1.
It is obvious that many of the words in a document do not describe the document's content: words such as "the" and "is". Automatic document indexing removes such non-significant words (function words) from the document vector, so that the document is represented by content-bearing words only [142]. This indexing can be based on term frequency, where terms that have either very high or very low frequency within a document are considered to be function words [142]. In practice, term frequency has been difficult to implement in automatic indexing. Instead, a stop list of common words is used to remove high-frequency words (stop words) [142]. In general, 40-50% of the total number of words in a document are removed with the help of a stop word list [142] (although there are search engines that do not use stop word lists at all).

Non-linguistic methods for indexing have also been implemented. Probabilistic indexing is based on the assumption that there is some statistical difference in the distribution of content-bearing words and function words [108]. Probabilistic indexing ranks the terms in the collection with respect to their frequency in the whole collection. The function words are modeled by a Poisson distribution over all documents, whereas content-bearing terms cannot be modeled in this way. The Poisson model has been extended to a Bernoulli model [28]. More recently, an automatic indexing method that uses serial clustering of words in text has been introduced [16]; the value of such clustering indicates whether a word is content-bearing.

Term Weighting

Term weighting can be explained in terms of controlling the exhaustivity and specificity of the search, where exhaustivity is related to recall and specificity to precision. Term weighting for the vector space model has been based entirely on single-term statistics. There are three main factors that affect term weighting: a term frequency factor, a collection frequency factor and a length normalization factor. These three factors are multiplied together to produce the resulting term weight.

A common weighting scheme for terms within a document is the frequency of occurrence. The term frequency is somewhat content-descriptive for the documents and is generally used as the basis of a weighted document vector [140]. It is also possible to use binary document vectors, but the results have not been as good as with term frequency in the vector space model [140].
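As a minimal illustration of the indexing stage (tokenization plus stop-word removal), consider the following sketch; the tiny stop list and the example text are ours, purely for demonstration:

```python
import re
from collections import Counter

# A toy stop list; real systems use lists of a few hundred words.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "on", "to", "in"}

def index_terms(text: str) -> Counter:
    """Extract content-bearing index terms and their raw frequencies."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

doc = "The cat sat on the mat, and the cat slept."
print(index_terms(doc))
# Counter({'cat': 2, 'sat': 1, 'mat': 1, 'slept': 1})
```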
There are various weighting schemes for discriminating one document from another; in general, this factor is called the collection frequency factor. Most of them, e.g. the inverse document frequency, assume that the importance of a term is inversely proportional to the number of documents the term appears in. It has been shown experimentally that these document discrimination factors lead to more effective retrieval, i.e., an improvement in precision and recall [140].

The third possible weighting factor is a document length normalization factor. Long documents usually have a much larger term set than short documents, which makes long documents more likely to be retrieved than short ones [140].

Different weighting schemes have been investigated, and the best results with respect to recall and precision are obtained by using term frequency with inverse document frequency and length normalization [140] [90]. The tf-idf weighting is therefore defined as follows:

Definition 3.2 Let $N$ be the total number of documents in the system and $n_i$ be the number of documents in which the index term $k_i$ appears. Let $freq_{i,j}$ be the raw frequency of term $k_i$ in document $d_j$. Then the normalized frequency $f_{i,j}$ of term $k_i$ in document $d_j$ is given by

$$f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}} \quad (3.1)$$

where the maximum is computed over all terms mentioned in the text of document $d_j$. Further, let $idf_i$, the inverse document frequency for $k_i$, be given by

$$idf_i = \log \frac{N}{n_i} \quad (3.2)$$

Then the tf-idf term weighting scheme is given by

$$w_{i,j} = f_{i,j} \times \log \frac{N}{n_i} \quad (3.3)$$

Several variations of the above expression for the weight $w_{i,j}$ are described in an interesting paper by Salton and Buckley from 1988 [139]. In general, however, the above expression should provide a good weighting scheme for many collections. For the query term weight, Salton and Buckley suggest

$$w_{i,q} = \left(0.5 + \frac{0.5 \, freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log \frac{N}{n_i} \quad (3.4)$$

where $freq_{i,q}$ is the raw frequency of the term $k_i$ in the text of the information request $q$.
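The following sketch implements equations (3.1)-(3.3) directly for a toy in-memory corpus; the variable names mirror the symbols of Definition 3.2, and the corpus itself is invented for illustration:

```python
import math
from collections import Counter

# Toy corpus: each document is a list of already indexed terms.
docs = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cat", "cat", "dog"],
]

N = len(docs)                                 # N: total number of documents
n = Counter(t for d in docs for t in set(d))  # n_i: documents containing k_i

def tf_idf(doc):
    """Term weights w_ij for one document, following eqs. (3.1)-(3.3)."""
    freq = Counter(doc)                       # freq_ij: raw term frequencies
    max_freq = max(freq.values())             # max_l freq_lj
    return {
        term: (f / max_freq) * math.log(N / n[term])
        for term, f in freq.items()
    }

print(tf_idf(docs[2]))  # 'cat' gets the highest weight in this document
```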
Similarity Coefficients

The similarity in vector space models is determined using associative coefficients based on the inner product of the document vector and the query vector, where word overlap indicates similarity. The inner product is usually normalized. The most popular similarity measure is the cosine coefficient, which measures the angle between the document vector and the query vector, as shown in figure 3.1. That is,

$$sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{t} w_{i,q}^2}} \quad (3.5)$$

Other measures are, e.g., the Jaccard and Dice coefficients [138].
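A direct implementation of equation (3.5) over sparse term-weight dictionaries, such as those produced by the tf-idf sketch above, might look as follows (a sketch; vectors are represented as term-to-weight mappings):

```python
import math

def cosine_sim(d: dict, q: dict) -> float:
    """Cosine of the angle between two sparse term-weight vectors, eq. (3.5)."""
    dot = sum(w * q.get(term, 0.0) for term, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

print(cosine_sim({"cat": 0.4, "dog": 0.2}, {"cat": 0.3}))  # ~0.894
```

Note that only terms occurring in both vectors contribute to the numerator, which is exactly the word-overlap intuition mentioned above.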
3.2 Computational Linguistics

Computational linguistics is an interdisciplinary field centred around the use of computers to process or produce human language (also known as natural language, to distinguish it from computer languages) [79]. To this field, linguistics contributes an understanding of the special properties of language data, and provides theories and descriptions of language structure and use. Computer science contributes theories and techniques for designing and implementing computer systems. Computational linguistics is largely an applied field, concerned with practical problems. There are as many applications as there are reasons for computers to process or produce language: for example, situations where humans are unavailable, too expensive, too slow, or busy doing tasks that humans are better at than machines. Some current application areas include translating from one language to another (machine translation), finding relevant documents in large collections of text (information retrieval), and answering questions about a subject area (expert systems with natural language interfaces).

The linguistic side can be broken down into many smaller pieces, e.g. phonetics, phonology, the lexicon, morphology, syntax, semantics, pragmatics, and so on (following the divisions of linguistic theory). The parts that are relevant to this work include morphological analysis of words, part-of-speech tagging and lexical semantics. We will briefly review each of these techniques in the sequel.

Morphological Analysis

Morphology

Morphology is the study of the way words are built up from smaller meaning-bearing units, morphemes. A morpheme is often defined as the minimal meaning-bearing unit in a language. For example, the word fox consists of a single morpheme (the morpheme fox), while the word cats consists of two: the morpheme cat and the morpheme -s. It is often useful to distinguish two broad classes of morphemes: stems and affixes. The exact details of the distinction vary from language to language, but intuitively the stem is the main morpheme of the word, supplying the main meaning, while the affixes add additional meanings of various kinds. Affixes are further divided into prefixes, suffixes, infixes and circumfixes: prefixes precede the stem, suffixes follow the stem, circumfixes do both, and infixes are inserted inside the stem.

There are two broad ways to form words from morphemes: inflection and derivation. Inflection is the combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class as the original stem, and usually filling some syntactic function like agreement. For example, English has the inflectional morpheme -s for marking the plural on nouns, and -ed for marking the past tense on verbs. Derivation is the combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning that is hard to predict exactly. For example, the verb computerize can take the derivational suffix -ation to produce the noun computerization.

Morphological Parsing

The goal of morphological parsing is to find out what morphemes a given word is built from. For example, a morphological parser should be able to tell us that the word cats is the plural form of the noun stem cat, and that the word mice is the plural form of the noun stem mouse. So, given the string cats as input, a morphological parser should produce an output that looks similar to cat N PL.
Morphological parsing yields information that is useful in many NLP applications. In syntactic parsing, for example, it helps to know the agreement features of words; similarly, grammar checkers need agreement information to detect agreement mistakes. Morphological information also helps spell checkers decide whether something is a possible word, and in information retrieval it is used to search not only for cats, if that is the user's input, but also for cat.

Getting from the surface form of a word to its morphemes usually proceeds in two steps. First, the word is split up into its possible components. So cats will be split into cat + s, using + to indicate morpheme boundaries. In this step, spelling rules are also taken into account, so that there are two possible ways of splitting up foxes, namely foxe + s and fox + s. The first assumes that foxe is a stem and s the suffix, while the second assumes that the stem is fox and that the e has been introduced by the spelling rules. In the second step, a lexicon of stems and affixes is used to look up the categories of the stems and the meanings of the affixes. So cat + s is mapped to cat N PL, and fox + s to fox N PL. At this point we also find out that foxe is not a legal stem, which tells us that splitting foxes into foxe + s was an incorrect split that should be discarded. Note, however, that for the word houses the split house + s is correct. Figure 3.2 illustrates the two steps of the morphological parser with some examples.

The automaton used for performing the mapping between the two levels is the finite-state transducer, or FST. Two transducers are used in the parsing process: one maps the surface form to the intermediate form, and the other maps the intermediate form to the underlying form. We will not go into the details of FSTs in this work; for details, we refer to [79].

Stemming

While building a transducer from a lexicon plus rules is the standard algorithm for morphological parsing, there are simpler algorithms that do not require the large online lexicon this demands. These are used especially in information retrieval. Since a document with the word cats might not match the user search keyword cat, some IR systems first run a stemmer on the keywords and on the words in the document.
Figure 3.2: Examples of the two steps of the morphological parser.

Since morphological parsing in IR is only used to help form equivalence classes, the details of the suffixes are irrelevant; what matters is determining that two words have the same stem. One of the most widely used stemming algorithms is the simple and efficient Porter algorithm [134]. The Porter stemming algorithm (or Porter stemmer) is a process for removing the common morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually performed when setting up information retrieval systems. The Porter algorithm can be thought of as a lexicon-free FST, and Porter stemming can be adapted to other languages as well.
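As an illustration, the widely available NLTK implementation of the Porter stemmer can be used as follows; the choice of NLTK is ours, and any Porter implementation would behave similarly:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# The stemmer maps morphological variants onto a common stem,
# which need not itself be a dictionary word.
for word in ["cats", "ponies", "relational", "generalization"]:
    print(word, "->", stemmer.stem(word))
# cats -> cat, ponies -> poni, relational -> relat, generalization -> gener
```

The output illustrates the point above: stems like poni or gener are not words, but they are irrelevant as long as all variants of a word collapse into the same equivalence class.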
Terminology

Based on the above discussion, we put forward our usage of the relevant terms in this area. Throughout the work, we will use the following terms consistently, according to their specific meanings:

Word - any word in a document. A word may take on several word forms.

Word forms - most words can take on several different word forms, for example through inflections or other morphological variations.

Base form - denotes the main word form of a word. In cases where lemmatization is performed according to a dictionary, the base form is the dictionary entry of a set of word forms. The base form is also known as the lemma.

Phrase - a combination (i.e. a sequence) of words.

Term - refers to both words and phrases, i.e., a term can be a particular word or a combination of words, especially one used to mean something very specific or one used in a specialized area of knowledge or work [47].

Stem - the form of a word after its endings are removed.

Stemming - the process of removing word endings.

Lemmatization - the process of finding the base form of a word.

Stop words - small, frequently occurring words that are often ignored when typed into a database or search engine. Some examples: THE, AN, A, OF.

Part-of-Speech Tagging

Part-of-speech tagging is the process of assigning a part-of-speech, like noun, verb, pronoun, preposition, adverb, adjective or another lexical class marker, to each word in a sentence. The input to a tagging algorithm is the string of words of a natural language sentence and a specified tagset (a finite list of part-of-speech tags). The output is a single best POS tag for each word. In general, there are two types of taggers: one attaches syntactic roles to each word (subject, object, etc.), while the other attaches only functional roles (noun, verb, etc.). Table 3.1 shows an example of a sentence and a potential tagged output using the Penn Treebank tagset [101].

Tags play an important role in natural language applications like speech recognition, natural language parsing, information retrieval and information extraction. Much work has been done on POS tagging for English. The earliest algorithms for automatically assigning parts-of-speech were based on a two-stage architecture [84]. The first stage used a dictionary to assign each word a list of potential parts-of-speech. The second stage used large lists of hand-written disambiguation rules to narrow the list down to a single part-of-speech for each word.
Does that flight serve dinner ?
VBZ  DT   NN     VB    NN     ?

Tag  Description              Example
VBZ  Verb, 3sg present        eats
DT   Determiner               a, the
NN   Noun, singular or mass   cat
VB   Verb, base form          eat
?    Sentence-final punc.     . ! ?

Table 3.1: An example of a tagged output using the Penn Treebank tagset.

Disambiguation is needed when a word may have multiple parts-of-speech. For example, the word book is ambiguous, meaning it can be a noun or a verb. To disambiguate, a rule-based tagger can have a hand-written rule which specifies, for example, that an ambiguous word is a noun rather than a verb if it follows a determiner.

Taggers can be characterized as rule-based or stochastic. Rule-based taggers use hand-written rules to resolve tag ambiguity. Stochastic taggers generally resolve tagging ambiguities by using a training corpus to compute the probability of a given word having a given tag in a given context. They are either HMM (Hidden Markov Model) based, choosing the tag sequence which maximizes the product of word likelihood and tag sequence probability, or cue-based, using decision trees or maximum entropy models to combine probabilistic features.
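For illustration, the sentence from table 3.1 can be tagged with the off-the-shelf stochastic tagger shipped with NLTK (our choice of library; it requires the tokenizer and tagger data to be downloaded once):

```python
import nltk

# One-time downloads of tokenizer and tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Does that flight serve dinner?")
print(nltk.pos_tag(tokens))
# typical output (may vary slightly with tagger version):
# [('Does', 'VBZ'), ('that', 'DT'), ('flight', 'NN'),
#  ('serve', 'VB'), ('dinner', 'NN'), ('?', '.')]
```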
Lexical Semantics

Lexical semantics is the study of the systematic meaning-related connections among words and of the internal meaning-related structure of each word. Each individual entry in the lexicon is called a lexeme. A lexeme should be thought of as a pairing of a particular orthographic and phonological form with some form of symbolic meaning representation. The term sense is used to refer to a lexeme's meaning component. The lexicon is therefore a finite list of lexemes.

Relations Among Lexemes and Their Senses

A variety of relations can hold among lexemes and among their senses. We introduce those that have had significant computational implications.

Homonymy refers to lexemes with the same form but unrelated meanings. For example, the lexemes wood and would are homonyms, since they share the same phonological form.

Polysemy refers to the notion of a single lexeme with multiple related meanings, for example the lexeme serve in serve red meat and in serve the country.

Synonymy holds between different lexemes with the same or similar meaning, such as price and fare.

Hyponymy relations hold between lexemes that are in class-inclusion relationships. For example, puppy is a hyponym of dog.

Meronymy describes the part-whole relation, e.g. car and wheel.

Antonymy holds between different lexemes that differ in a significant way on at least one essential semantic dimension, e.g. cheap and expensive.

WordNet

The usefulness of lexical relations in linguistic, psycho-linguistic and computational research has led to a number of efforts to create large electronic databases of such relations. WordNet is so far the most well-developed and widely used lexical database for English [52] [114] [112]. WordNet consists of three separate databases: one for nouns, one for verbs, and a third for adjectives and adverbs. Each of the three databases consists of a set of lexical entries corresponding to unique orthographic forms, accompanied by the sets of senses associated with each form. Table 3.2 gives an idea of the scope of the WordNet 2.0 release. The database can be accessed directly with a browser (locally or over the Internet), or programmatically through an API. In their complete form, WordNet's sense entries consist of a set of synonyms (a synset, in WordNet terminology), a dictionary-style definition, or gloss, and some example uses. Figure 3.3 shows the WordNet entry for the noun book.
Table 3.2: Scope of the WordNet 2.0 release in terms of the number of unique forms, synsets and word senses for each part-of-speech (noun, verb, adjective, adverb).

The synset is the fundamental basis for synonymy in WordNet. Consider the following example of a synset: {ledger, leger, account book, book of account}. The dictionary-like definition, or gloss, of this synset describes it as a record in which commercial accounts are recorded. Each of the lexical entries included in the synset can therefore be used to express this notion in some setting. In practice, synsets are what actually constitute the senses associated with WordNet entries. Specifically, it is this exact synset, with its associated definition and examples, that makes up one of the senses for each of the entries listed in the synset. From a more theoretical point of view, each synset can be taken to represent a concept that has become lexicalized in the language. Thus, instead of representing concepts with logical terms, WordNet represents them as lists of the lexical entries that can be used to express the concept.

Of course, a simple listing of lexical entries would not be much more useful than an ordinary online dictionary. The power of WordNet lies in its set of domain-independent lexical relations. These relations hold among WordNet synsets and are, for the most part, restricted to items with the same part-of-speech. Tables 3.3, 3.4 and 3.5 show a subset of the relations associated with each of the parts-of-speech, along with a brief definition and an example. Following the hypernym relations, each synset is related to its immediately more general and more specific synsets. To find chains of more general or more specific synsets, one can simply follow a transitive chain of hypernym and hyponym relations. Figure 3.4 shows the hypernym chain for book (sense 1); this chain eventually leads to the top of the hierarchy, entity. Note that WordNet does not have a single top concept; rather, it has several top concepts, which are called unique beginners.
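These lookups are easy to reproduce programmatically. The sketch below uses the WordNet interface bundled with NLTK (our choice; any WordNet API would do) to list some synsets of the noun book and to follow one hypernym chain up to a unique beginner:

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

# Some noun senses (synsets) of "book", each with its gloss.
for synset in wn.synsets("book", pos=wn.NOUN)[:3]:
    print(synset.name(), "-", synset.definition())

# Follow the transitive hypernym chain of sense 1 upwards.
sense1 = wn.synset("book.n.01")
for path in sense1.hypernym_paths()[:1]:
    print(" -> ".join(s.name() for s in reversed(path)))
# book.n.01 -> publication.n.01 -> ... -> entity.n.01
```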
Figure 3.3: A portion of the WordNet 2.0 entry for the noun book.

Relation   Definition                                                   Example
Hypernym   synset which is the more general class of another synset     breakfast -> meal
Hyponym    synset which is a particular kind of another synset          meal -> lunch
Holonym    synset which is the whole of which another synset is a part  flower -> plant
Meronym    synset which is a part of another synset                     bumper -> car
Antonym    synsets which are opposite in meaning                        man <-> woman

Table 3.3: Noun relations in WordNet.
Relation   Definition                                                    Example
Hypernym   synset which is the more general class of another synset      fly -> travel
Troponym   synset which is one particular way to perform another synset  walk -> stroll
Entails    synset which is entailed by another synset                    snore -> sleep
Antonym    synsets which are opposite in meaning                         increase <-> decrease

Table 3.4: Verb relations in WordNet.

Relation     Definition                                                          Example
A-value-of   adjective synset which represents a value for a nominal target synset  slow -> speed
Antonym      synsets which are opposite in meaning                               quickly <-> slowly

Table 3.5: Adjective and adverb relations in WordNet.
Figure 3.4: Hypernym chain for sense one of the noun book.

For nouns, there are 25 unique beginners. Lexicons for other languages that resemble the structure and function of WordNet have been constructed as well. EuroWordNet is a multilingual database with wordnets for several European languages (Dutch, Italian, Spanish, German, French, Czech and Estonian). EuroWordNet is structured in the same way as the American wordnet for English (Princeton WordNet), in terms of synsets (sets of synonymous words) with basic semantic relations between them. Each wordnet represents a unique language-internal system of lexicalizations. In addition, the wordnets are linked to an Inter-Lingual-Index based on the Princeton WordNet. Via this index the languages are interconnected, so that it is possible to go from the words in one language to similar words in any other language [48].

3.3 Concluding Remarks

This chapter has covered a wide range of issues concerning the supporting technologies used in this work. We consider information retrieval techniques, in particular the vector space model, a vital component of the algorithm we propose.
The following are among the highlights:

In the vector model, both documents and queries are represented as high-dimensional vectors. Each element in a vector reflects the significance of a particular index word for the document or the query. The significance is measured by the term weight.

There are three main factors that affect term weighting: a term frequency factor, a collection frequency factor and a length normalization factor. These three factors are multiplied together to produce the resulting term weight.

The similarity in vector space models is determined using associative coefficients based on the inner product of the document vector and the query vector, where word overlap indicates similarity.

We employ the vector space model to represent concepts and to calculate similarities between them, as will be introduced in chapters 5 and 6.

The second major supporting technology we use comes from computational linguistics. The parts that are relevant to this work include morphological analysis of words, part-of-speech tagging and lexical semantics. Some of the highlights are:

Morphology is the study of the way words are built up from smaller meaning-bearing units, morphemes. The goal of morphological parsing is to find out what morphemes a given word is built from.

Obtaining the morphemes from the surface form of a word usually proceeds in two steps. First, the word is split up into its possible components. Second, a lexicon of stems and affixes is used to look up the categories of the stems and the meanings of the affixes. Morphological analysis can be automated using FSTs.

Part-of-speech tagging is the process of assigning a part-of-speech, like noun, verb, pronoun, preposition, adverb, adjective or another lexical class marker, to each word in a sentence.
The input to a tagging algorithm is the string of words of a natural language sentence and a specified tagset (a finite list of part-of-speech tags). The output is a single best POS tag for each word.

Taggers can be characterized as rule-based or stochastic. Rule-based taggers use hand-written rules to resolve tag ambiguity. Stochastic taggers generally resolve tagging ambiguities by using a training corpus to compute the probability of a given word having a given tag in a given context.

Lexical semantics is the study of the systematic meaning-related connections among words and of the internal meaning-related structure of each word. A variety of relations can hold among lexemes and among their senses; we have introduced those that have had significant computational implications. WordNet is a large database of lexical relations for English words.

In our work, the relevant linguistic analysis has been implemented using third-party software. We mainly explore the hypernym/hyponym relations in WordNet, with the intention of augmenting the mapping process.
Chapter 4

State-of-the-Art Survey

The aim of this chapter is to provide a state-of-the-art survey of tools and environments for automatic ontology mapping. We start with an introduction to the problem of ontology heterogeneity, which is characterized by different kinds of mismatches between ontologies. This kind of heterogeneity hampers the combined usage of multiple ontologies, which is needed in many applications. To solve the heterogeneity problem, the mismatches need to be reconciled; that is, different ontologies need to be mapped and aligned. A number of approaches have been proposed in the literature. We conduct a feature analysis of these approaches and compare their characteristics. To lay the foundation for a clearer elaboration, we also examine the relevant terminology used in the ontology mapping literature.

4.1 Ontology Heterogeneity

Ontology Mismatch

Differences between ontologies are called mismatches in [83], and this term will be used throughout this work. Figure 4.1 depicts a framework of issues related to the integration of ontologies [83]. Among the three issues discussed (practical problems, mismatches between ontologies, and versioning), the main concern in this thesis is mismatches between ontologies. These are further divided into the language level and the ontology level; the former corresponds to the syntactic layer, and the latter to the semantic layer.
Figure 4.1: Framework of issues in ontology integration, from [83].

Language Level Mismatches

Mismatches at the language level occur when ontologies written in different ontology languages are combined. Chalupsky [30] [29] defines mismatches in syntax and expressivity. In [83], four types of mismatches are identified.

Syntax. Different ontology languages often use different syntaxes. For example, to define the class of cars in RDF Schema, one writes <rdfs:Class ID="Car">, whereas in LOOM the expression (defconcept Car) is used to define the same class.

Logical representation. A slightly more complicated mismatch at this level is the difference in the representation of logical notions. For example, in some languages it is possible to state explicitly that two classes are disjoint (e.g. disjoint A B), whereas in other languages one must use negation in subclass statements (e.g. A subclass-of (NOT B), B subclass-of (NOT A)).

Semantics of primitives. A more subtle possible difference at the language level lies in the semantics of language constructs. Even when the same name is used for a language construct in two languages, the semantics may differ; e.g., there are several interpretations of A equalTo B.

Language expressivity. The mismatch at the language level with the most impact is the difference in expressivity between two languages. This difference implies that some languages are able to express things that other languages cannot. For example, some languages have constructs for negation, whereas others have not.
Ontology Level Mismatches

Mismatches at the ontology level occur when two or more ontologies that describe partly overlapping domains are combined. These mismatches may occur when the ontologies are written in the same language, as well as when they use different languages. Several classification frameworks have been proposed in the literature [170] [169] [174] [29]. In [83], Klein integrated the different types of mismatches discussed in these frameworks. At the ontology level, a distinction is made between conceptualization and explication, as described in [171]. A conceptualization mismatch is a difference in the way a domain is interpreted, whereas an explication mismatch is a difference in the way the conceptualization is specified. Conceptualization mismatches are further divided into model coverage and concept scope (granularity).

Scope. Two classes seem to represent the same concept but do not have the same instances, although these may intersect. The classical example is the class employee, where several administrations use slightly different concepts of employee, as mentioned by Wiederhold [174].

Model coverage and granularity. This is a mismatch in the part of the domain that is covered by the ontology, or in the level of detail to which that domain is modeled. Chalupsky [29] gives the example of ontologies about cars: one ontology might model cars but not trucks; another might represent trucks but only classify them into a few categories; a third might make very fine-grained distinctions between types of trucks based on their physical structure, weight, purpose, etc.

Explication mismatches are divided into terminological mismatches, modeling style mismatches and encoding mismatches. Two types of differences can be classified as terminological mismatches.
Synonym terms. Concepts are represented by different names. One example is the use of the term car in one ontology and the term automobile in another.

Homonym terms. The meaning of the same term differs between contexts. For example, the term conductor has a different meaning in the music domain than in the electrical engineering domain.

Modeling style mismatches are related to the paradigm and conventions adopted by the developers.

Paradigm. Different paradigms can be used to represent concepts such as time, action, plans, causality, propositional attitudes, etc. For example, one model might use temporal representations based on interval logic, while another might use a representation based on time points [29].

Concept description. This type of difference is called a modeling convention in [29]. Several choices can be made when modeling concepts in ontologies. For example, a distinction between two classes can be modeled using a qualifying attribute or by introducing a separate class.

The last mismatch in the explication category is encoding. Encoding mismatches are differences in value formats, like measuring distance in miles versus kilometers.

Current Approaches and Techniques

The focus of this work is on ontology level mismatches (semantic mismatches). There are also approaches that tackle syntactic mismatches; we briefly describe some of them in order to give a complete picture of the state of the art.

Solving Language Mismatches

In [83], four approaches to enabling interoperability between different ontologies at the language level are identified.

Aligning the metamodel. The constructs in the language are formally specified in a general model [17].
Figure 4.2: Hard problems in ontology mismatches.

Layered interoperability. Aspects of the language are split up into clearly defined layers, and interoperability is resolved layer by layer [109] [49].

Transformation rules. The relation between two specific constructs in different ontology languages is described in the form of a rule that specifies the transformation from the one to the other [29].

Mapping onto a common knowledge model. The constructs of an ontology language are mapped onto a common knowledge model, e.g. OKBC (Open Knowledge Base Connectivity) [31].

Solving Ontology Level Mismatches

The alignment of concepts at the ontology level is a task that requires understanding of the meaning of concepts, and it cannot be fully automated. Consequently, at this level there exist mainly tools that suggest alignments and mappings based on heuristic matching algorithms, and that provide means to specify these mappings. Such tools support the user in finding the concepts in the separate ontologies that might be candidates for merging; some go further by suggesting actions to be performed. Approaches that concentrate on semantic level (ontology level) mismatches are the ones we focus on; they are studied in detail in section 4.3.
Finally, in order to integrate ontologies, it is important to distinguish mismatches that are hard to solve from those that are not. Both [83] and [171] conclude that conceptualization mismatches often need human intervention to be solved. The same view is stated in [174], where scope differences are said to be hard to solve. Most explication mismatches can be solved automatically, but terminological mismatches may be difficult. Encoding mismatches can be solved quite easily with a wrapper or a transformation step [83]. Figure 4.2 depicts the framework once again, where the circles mark the mismatches that can be hard to reconcile.

4.2 Ontology Mapping Concepts

In this section, we set the context and scope for ontology mapping. We first outline the definition and scope of ontology mapping and discuss the relevant term usage in the literature. We proceed with some motivating applications in which ontology mapping plays an important role.

Definition and Scope of Ontology Mapping

In the ontology-related research literature, the concept of mapping has a range of meanings, including integration, unification, merging, mapping, etc. To provide a clearer context for the discussion, we list some of the definitions that we consider compatible with our usage of the term. In [97], a mapping is defined as a set of formulae that provide the semantic relationships between the concepts in the models. In [124], it is said that mapping is to establish correspondences among the source ontologies, and to determine the set of overlapping concepts, concepts that are similar in meaning but have different names or structure, and concepts that are unique to each of the sources. Further, two related concepts, merging and alignment, are also defined: merging is to create a single coherent ontology that includes the information from all the sources, while alignment is to make the source ontologies consistent and coherent with one another while keeping them separate. In [24], it is stated that the aim of mapping is to map concepts in the various ontologies to each other, so that a concept in one ontology corresponds to a query (i.e. a view) over the other ontologies.
To sum up, we take a general definition of ontology mapping to be determining a set of correspondences that identify similar elements in different ontologies. A well-defined mapping process can be considered a component which provides a mapping service, and this service can be plugged into various applications. For example, an ontology integration application can use the discovered mappings as the first step towards an integrated ontology.

Two tasks have to be conducted in the ontology mapping process: one is to discover the correspondences between ontology elements, and the other is to describe and define the discovered mappings so that follow-up components can make use of them. For the first task, several different approaches have been proposed in the literature [124] [43] [154] [11] [106] [110] [32]. Mapping correspondences are produced in roughly two ways: (1) applying a set of matching rules, or (2) evaluating similarity measures that compare a set of possible correspondences and help to choose the valid ones among them. These heuristics often use syntactic information, such as the names of the concepts or nesting relationships between concepts. They might also use semantic information, such as the inter-relationships between concepts (slots of frames in [124]), the types of the concepts, or the labeled-graph structure of the models [23] [110]. Other techniques use data instances belonging to the input models to estimate the likelihood of these correspondences [154] [43]. Several systems also have powerful features for the efficient capture of user interaction [124] [106]. A detailed comparison of the different approaches is given in section 4.3. The work presented in [97] [24] [99] discusses the necessary components that should be included in mapping correspondences and the desired features of the correspondences, including: (1) the ability to answer queries over a model, (2) inference of mapping formulas, and (3) compositionality of mappings.
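To make the second task concrete, a discovered correspondence can be recorded as a small, self-contained mapping assertion. The following sketch shows one possible, entirely illustrative representation; it is not the format used by any particular system discussed in this chapter:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MappingAssertion:
    """One correspondence between elements of two ontologies."""
    source_concept: str   # element of the first ontology
    target_concept: str   # element of the second ontology
    relation: str         # e.g. "equivalent", "subsumes", "overlaps"
    confidence: float     # similarity score in [0, 1]

# Example assertions, as a follow-up merging component might consume them.
mappings = [
    MappingAssertion("onto1#Car", "onto2#Automobile", "equivalent", 0.92),
    MappingAssertion("onto1#Vehicle", "onto2#Automobile", "subsumes", 0.71),
]
for m in mappings:
    print(f"{m.source_concept} --{m.relation}/{m.confidence:.2f}--> "
          f"{m.target_concept}")
```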
Application Domains

We motivate the study of ontology mapping by first demonstrating the role it plays in several applications. Mapping between ontologies is the foundation of several classes of applications.

Information integration and the Semantic Web. In many contexts, data resides in a multitude of data sources. In the Semantic Web context, an ontology captures the semantics of the data. Data integration enables users to pose queries in a uniform fashion, without having to access each data source independently. In an information integration system, users pose queries over a mediated ontology, which captures only the aspects of the domain that are salient to the application. The mediated ontology is solely a logical one, and mappings are used to describe the relationship between the mediated ontology and the local ontologies. In addition to querying, mappings between ontologies are necessary for agents to interoperate.

Ontology merging. Several applications require that multiple ontologies be combined into a single coherent ontology [97] [106]. In some cases, these are independently developed ontologies that model overlapping domains; in others, we merge two ontologies that evolved from a single base ontology. In both cases, the first step in merging ontologies is to create a mapping between them, which identifies similarities and conflicts between the source ontologies. Once the mapping is given, the challenge of a merge algorithm is reduced to creating a minimal ontology that covers the given ones.

Terminology

Based on the above discussion, we put forward our usage of the relevant terms. We have tried to be as consistent as possible with definitions and descriptions found elsewhere. Throughout this work, we will use the following terms consistently, according to their specific meanings:

Merging, integrating. Creating a new ontology from two or more existing ontologies with overlapping parts.

Aligning. Bringing two or more ontologies into mutual agreement, making them consistent and coherent with one another.

Mapping. Relating similar (according to some metric) concepts or relations from different sources to each other by specifying the correspondence between them.

Mapping assertions, correspondence assertions. The specification of the mappings, which describes the relations between the source ontology concepts, as well as other mapping-relevant information.
Articulation. The points of linkage between two aligned ontologies, i.e. the specification of the alignment.

Translating. Changing the representation formalism of an ontology while preserving the semantics.

Transforming. Changing the semantics of an ontology slightly (possibly also changing the representation) to make it suitable for purposes other than the original one.

Combining. Using two or more different ontologies for a task in which their mutual relations are relevant.

4.3 Automatic Ontology Mapping Tools

The creation of mappings will rarely be completely automated. However, automated tools can significantly speed up the process by proposing plausible mappings. In large domains, many mappings might be fairly obvious, but some parts need expert intervention. There are several approaches to building such tools. The first is to use a wide range of heuristics to generate mappings; the heuristics are often based on structure or on naming, and in some cases domain-independent heuristics may be augmented by more specific heuristics for a particular representation language or application domain. A second approach is to learn mappings: in particular, manually provided mappings serve as examples for a learning algorithm that can generalize and suggest subsequent mappings.
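As a toy illustration of the heuristic family, the sketch below proposes candidate correspondences purely from name similarity (here, a normalized edit-based ratio from Python's standard library). Real systems combine many such signals, so this is illustrative only:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Crude lexical similarity between two concept names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def suggest_mappings(onto1, onto2, threshold=0.5):
    """Propose candidate correspondences for a human expert to review."""
    return sorted(
        ((c1, c2, name_similarity(c1, c2))
         for c1 in onto1 for c2 in onto2
         if name_similarity(c1, c2) >= threshold),
        key=lambda t: -t[2],
    )

onto1 = ["Car", "Person", "PublishedBook"]
onto2 = ["Automobile", "People", "Book"]
print(suggest_mappings(onto1, onto2))
# [('Person', 'People', 0.5)]; purely lexical matching misses Car/Automobile
```

The failure on the Car/Automobile pair is exactly the synonym-terms mismatch discussed in section 4.1, and it is why name heuristics are usually combined with structural, instance-based or dictionary-based evidence.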
In this section we first discuss the relevant research in the database area, namely automatic schema matching. Thereafter, we present a list of ontology mapping approaches and provide a brief comparison among them.

Automatic Schema Matching

Mapping between models has been approached in several research areas. One topic closely related to ontology mapping is schema matching in database research. Integrating heterogeneous data sources is a fundamental problem in databases, which has been studied extensively in the last two decades, both from a formal and from a practical point of view [144] [94] [27] [77] [6].

Figure 4.3: Classification of schema matching approaches, from [135].

Even though ontologies are semantically more complex and often larger than database schemas, the two topics share many features. Given the substantial efforts that have been directed at schema management, it is worthwhile to give a brief account of the state of the art in that area. In [135], a comprehensive survey on schema matching is reported. Schema matching is defined as determining a set of correspondences that identify similar elements in different schemas. Figure 4.3 shows the classification scheme together with some sample approaches. For each individual match operator, the following largely orthogonal classification criteria are identified:

Instance vs schema: matching approaches can consider instance data (i.e., data contents) or only schema-level information.

Element vs structure matching: matching can be performed on individual schema elements, such as attributes, or on combinations of elements, such as complex schema structures.

Language vs constraint: a matcher can use a linguistically-based approach (e.g., based on names and textual descriptions of schema elements) or a constraint-based approach (e.g., based on keys and relationships).
elements) or a constraint-based approach (e.g., based on keys and relationships).

Matching cardinality: the overall match result may relate one or more elements of one schema to one or more elements of the other, yielding four cases: 1:1, 1:n, n:1, n:m. In addition, each mapping element may interrelate one or more elements of the two schemas. Furthermore, there may be different match cardinalities at the instance level.

Auxiliary information: most matchers rely not only on the input schemas but also on auxiliary information, such as dictionaries, global schemas, previous matching decisions, and user input.

Note that this classification does not distinguish between different types of schemas (relational, XML, object-oriented, etc.) and their internal representation, because algorithms depend mostly on the kind of information they exploit, not on its representation. Further, the individual matchers can be combined either by using multiple matching criteria (e.g., name and type equality) within an integrated hybrid matcher or by combining multiple match results produced by different match algorithms within a composite matcher. In [135], seven published prototype implementations were compared according to the classification criteria, including SemInt [93], LSD [42], SKAT [118], TranScm [115], DIKE [131] [132], ARTEMIS [26], and CUPID [98]. One of the conclusions is that more attention should be given to the utilization of instance-level information and of reuse opportunities to perform matching [135].

4.3.2 Systems for Ontology Merging and Mapping

In this section we describe a number of systems that can be used to support automatic ontology mapping. Due to the close relatedness of ontology mapping and merging, tools that are used for merging are included here as well. The systems demonstrated here are not intended to be exhaustive. They were primarily chosen based on their relevance to this research, and to illustrate the diversity of existing solutions. For each tool we provide a short introduction, and an overall comparison of these systems is presented in the end. We first present Chimaera, a web-based ontology merging and diagnosing environment. Then, we present

PROMPT, an algorithm used in Protégé for ontology merging. Next is FCA-Merge, which merges ontologies using documents on the same domain as the ontologies to be merged. We go on to present MOMIS, which merges ontologies by means of ontology clustering, and finally we present GLUE, which performs ontology mapping by machine learning techniques.

Chimaera

Chimaera [106] is an ontology merging and diagnosis tool developed by the Stanford University Knowledge Systems Laboratory (KSL). Its initial design goal was to provide substantial assistance with the task of merging KBs produced by multiple authors in multiple settings. Later, it took on another goal of supporting testing and diagnosing ontologies as well. Finally, inherent in the goals of supporting merging and diagnosis are requirements for ontology browsing and editing. It is mainly targeted at lightweight ontologies. Its design and implementation are based on other applications such as the Ontolingua ontology development environment [50]. Chimaera is built on a platform that handles any OKBC-compliant [31] representation system. The two major tasks in merging ontologies that Chimaera supports are (1) coalescing two semantically identical terms from different ontologies so that they are referred to by the same name in the resulting ontology, and (2) identifying terms that should be related by subsumption, disjointness, or instance relationships and providing support for introducing those relationships. There are many auxiliary tasks inherent in these, such as identifying the locations for editing, performing the edits, and identifying when two terms could be identical if they had small modifications such as a further specialization on a value-type constraint. Chimaera generates name resolution lists that help the user in the merging task by suggesting terms, each from a different ontology, that are candidates to be merged or to have taxonomic relationships not yet included in the merged ontology. It also generates a taxonomy resolution list, where it suggests taxonomy areas that are candidates for reorganization. It uses a number of heuristic strategies for finding such edit points. Figure 4.4 shows the result of loading two ontologies (Test1 and Test2) and then choosing the name resolution mode for the ontologies.

Figure 4.4: Chimaera in name resolution mode suggesting a merge of Mammal and Mammalia.

PROMPT

PROMPT [124] is a tool for semi-automatic guided ontology merging. It is a plugin for Protégé [126] [121] [123]. PROMPT leads the user through the ontology-merging process, identifying possible points of integration, and making suggestions regarding what operations should be done next, what conflicts need to be resolved, and how those conflicts can be resolved. PROMPT's ontology-merging process is interactive. A user makes many of the decisions, and PROMPT either performs additional actions automatically based on the user's choices or creates a new set of suggestions and identifies additional conflicts among the input ontologies. The tool takes into account different features in the source ontologies to make suggestions and to look for conflicts. These features include: names of classes and slots (e.g., if frames have similar names and the same type, then they are good candidates for merging), class hierarchy (e.g., if the user merges two classes and PROMPT has already suggested that their superclasses were similar, it will have more confidence in that suggestion, since these superclasses play the same role to the classes that the user said are the same), slot attachment to classes (e.g., if two slots from different ontologies

are attached to a merged class and their names, facets, and facet values are similar, these slots are candidates for merging), and facets and facet values (e.g., if a user merges two slots, then their range restrictions are good candidates for merging). In addition to providing suggestions to the user, PROMPT identifies conflicts. Some of the conflicts that PROMPT identifies are: name conflicts (more than one frame with the same name), dangling references (a frame refers to another frame that does not exist), redundancy in the class hierarchy (more than one path from a class to a parent other than root), and slot-value restrictions that violate class inheritance. Figure 4.5 shows a screenshot of PROMPT. The main window (in the background) shows a list of current suggestions in the top left pane and the explanation for the selected suggestion at the bottom. The right-hand side of the window shows the evolving merged ontology. The internal screen presents the two source ontologies side by side (the superscript m marks the classes that have been merged or moved into the evolving merged ontology). Summarizing, PROMPT gives iterative suggestions for concept merges and changes, based on linguistic and structural knowledge, and it points the user to possible effects of these changes.

FCA-Merge

FCA-Merge is a method for merging ontologies, which follows a bottom-up approach offering a global structural description of the merging process [154]. For the source ontologies, it extracts instances from a given set of domain-specific text documents by applying natural language processing techniques. Based on the extracted instances, mathematically founded techniques taken from Formal Concept Analysis [61] are applied to derive a lattice of concepts as a structural result of FCA-Merge. The produced result is explored and transformed into the merged ontology by the ontology engineer.
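For readers unfamiliar with Formal Concept Analysis, the following small sketch illustrates its core construction only: deriving formal concepts (extent/intent pairs) from a binary object-attribute context. It illustrates FCA in general, not the FCA-Merge algorithm itself, and the toy context is invented.

```python
# Toy formal context: which objects (documents) mention which attributes (terms).
context = {
    "doc1": {"hotel", "price"},
    "doc2": {"hotel", "breakfast"},
    "doc3": {"flight", "price"},
}
objects = set(context)
attributes = set().union(*context.values())

def common_attributes(objs):
    """Derivation operator: attributes shared by all objects in objs."""
    return set.intersection(*(context[o] for o in objs)) if objs else set(attributes)

def objects_having(attrs):
    """Dual derivation operator: objects possessing all attributes in attrs."""
    return {o for o in objects if attrs <= context[o]}

# The intents form a closure system: generate them by repeatedly
# intersecting the candidate intents with the object rows.
intents = {frozenset(attributes)}
changed = True
while changed:
    changed = False
    for o in objects:
        for intent in list(intents):
            new = frozenset(intent & context[o])
            if new not in intents:
                intents.add(new)
                changed = True

# Sanity check: every generated intent is closed under the two derivations.
assert all(common_attributes(objects_having(i)) == set(i) for i in intents)

# Each formal concept is (extent, intent); ordered by inclusion they form the lattice.
for intent in sorted(intents, key=len):
    print(sorted(objects_having(intent)), "<->", sorted(intent))
```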

Figure 4.5: PROMPT screenshot.

Figure 4.6: FCA-Merge process.

This method is based on application-specific instances of the two given ontologies O1 and O2 that are to be merged. The overall process of merging two ontologies is depicted in figure 4.6 and consists of three steps, namely (i) instance extraction and computation of two formal contexts K1 and K2, (ii) the FCA-Merge core algorithm, which derives a common context and computes a concept lattice, and (iii) the interactive generation of the final merged ontology based on the concept lattice.

MOMIS

The Mediator Environment for Multiple Information Sources (MOMIS), developed by the database research group at the University of Modena and Reggio Emilia, aims at constructing synthesized, integrated descriptions of information coming from multiple heterogeneous sources. MOMIS [9] [10] [8] [7] (see figure 4.7) follows a semantic approach to information integration based on the conceptual schema, or metadata, of the information sources. In the MOMIS system, each data source provides a schema, and a global virtual schema of all the sources is semi-automatically obtained. The global schema has a set of mapping descriptions that specify the semantic mapping between the global schema and the source schemas. The system architecture is composed of functional elements that communicate using the CORBA standard. A data model, ODM I3, and a language, ODL I3, are used to describe information sources. ODL I3 and ODM I3 have been defined as subsets of the corresponding ones in ODMG, augmented by primitives to perform integration. To interact with a specific local source, MOMIS uses a Wrapper, which has to be placed over each source. The wrapper translates metadata descriptions of a source into the common ODL I3 representation. The Global Virtual Schema (GSB) module processes and integrates descriptions received from wrappers to derive the global shared schema by interacting with different service modules, namely ODB-Tools, an integrated environment for reasoning on object-oriented databases based on Description Logics; the WordNet lexical database, which supports the mediator in building lexicon-derived relationships; and the ARTEMIS tool, which performs the clustering operation [26]. In order to create a global virtual schema of the involved sources, MOMIS generates a common thesaurus of terminological intensional and extensional relationships describing intra- and inter-schema knowledge about

Figure 4.7: The MOMIS Architecture.

classes and attributes of the source schemas. On the basis of the common thesaurus contents, MOMIS evaluates affinity between intra- and inter-source classes and groups similar classes together in clusters using hierarchical clustering techniques. A global class, which becomes representative of all the classes belonging to the cluster, is defined for each cluster. The global view of the involved source data consists of all the global classes. A graphical tool, the Source Integration Designer (SI-Designer), supports the MOMIS methodology. In particular, the SI-Designer [7] module of MOMIS is considered the main module of the integration process.

GLUE

The basic architecture of GLUE is shown in figure 4.8 [43]. It consists of three main modules: Distribution Estimator, Similarity Estimator, and Relaxation Labeler. The Distribution Estimator takes as input two taxonomies O1 and O2, together with their data instances. Then it applies machine learning techniques to compute for every pair of concepts their joint probability dis-

Figure 4.8: The GLUE Architecture.

tributions. The Distribution Estimator uses a set of base learners and a meta-learner. Next, GLUE feeds the above numbers into the Similarity Estimator, which applies a user-supplied similarity function to compute a similarity value for each pair of concepts. The output from this module is a similarity matrix between the concepts in the two taxonomies. The Relaxation Labeler module then takes the similarity matrix, together with domain-specific constraints and heuristic knowledge, and searches for the mapping configuration that best satisfies the domain constraints and the common knowledge, taking into account the observed similarities. This mapping configuration is the output of GLUE.

4.3.3 A Comparison of the Studied Systems

To effectively compare the above studied systems, we need to develop a comparison framework. In [125], a set of evaluation criteria for ontology mapping and merging tools was proposed, namely:

Input requirements. Tools vary in the kind of information they take into account for analysis; e.g., some of them only compare taxonomies (subclass hierarchies), while others also look at properties and their restrictions. They may work on classes, instances or both. Some methods make use of mappings to a common thesaurus or foundational ontology.

Level of user interaction. Tools might work automatically (batch mode) or interactively. In the latter case, they can build on feedback from the user to improve the quality of the mapping.

Type of output. The output of the analysis can be a set of articulation rules (defining similarities and differences), a single merged ontology, an instantiated mapping ontology in a particular language, pairs of related concepts, etc.

Content of output. As with input, tools differ in what kind of elements in the ontology they relate in their output.

We studied the characteristics of the proposed approaches using a set of criteria which combines the evaluation framework in [125] and the classification scheme in section 4.3.1. Figure 4.9 shows what the five studied systems have in common and where they differ according to the set of criteria.

Figure 4.9: Characteristics of studied ontology mapping and merging systems.

The figure shows that most of the systems are heuristically based and that they use more than one basic mapping approach, either in a hybrid way or in a combined way. Most systems provide both structure-level and element-level matching, in particular name-based and graph-structure-based matching. However, only two of the systems consider instance data, and 1:1 matching is the main focus. The elements that are compared include concepts and relations for most of the systems, whereas comparing more complicated structures, like constraints or axioms, is not yet supported. Most prototypes have been developed in the context of a particular application domain, and some of them also use auxiliary information like thesauri to enhance the match. For the machine-learning-based technique (GLUE), an additional training set is needed. The main forms of output are a merged ontology or a list of pairs of related concepts. Even though each system chooses a particular representation language to base its implementation on, it is possible to map the source ontology to most common types of ontology representation languages. This is partly because most of the studied systems consider only the core elements of ontologies, like concepts and relations, which most of the representation languages support. Furthermore, it is possible to translate between different representation languages [30].

4.4 Concluding Remarks

In this chapter, we have elaborated the different kinds of mismatches that can occur in ontology integration and sketched the current solutions to reconcile different mismatches. We have argued that mappings are crucial components for many applications. Much work on ontology mapping has been done in the context of a particular application domain. Since the problem is so fundamental, we believe the field would benefit from treating it as an independent problem. We also studied the relevant terminology related to ontology mapping and defined our specific meaning for the terms. Several existing ontology mapping methods have been analyzed and situated in a table of comparison. The methods have been compared with respect to the kind of knowledge source they make use of, the input and output requirements, and the level of user interaction. Five approaches to ontology mapping have been described in more detail in order to illustrate the problems and solutions that are characteristic for the methods

in ontology mapping. Based on the survey, we have identified a list of requirements that an ontology mapping approach should meet, some of which we hope to target in our own work:

The approach should be able to generate mappings automatically.

Users should be able to intervene in the process. Users should be able to accept, reject and add mappings.

All the information that is useful for deriving mappings should be exploited by the approach. It should be possible to switch the use of each particular type of information on or off conveniently.

There is a need to make use of instance information (if available) to augment the mapping process, because instance-level data can give important insight into the contents and meaning of ontology elements. This is especially true when useful schema information is limited, as is often the case for semi-structured data.

Hierarchy information of the concepts should be considered.

Information on which source has been used to derive a mapping (and why it is derived) is necessary.

Due to the semi-automatic nature of the mapping process, a visual language is needed to represent the ontologies so that knowledge engineers can easily navigate the ontology structure, locate any element and ultimately approve or disapprove the suggested mappings.

The derived mappings need to be defined and organized systematically, so that other components can make use of the mappings in various settings. It should be possible to conduct reasoning over the derived mappings.

The literature study also confirmed that the process of ontology mapping is a difficult problem, since it concerns semantic interpretation of models, where the semantics is only partially available in the syntactic structure of the models. The models are created by different people, at different

times, in different styles and for different purposes. Therefore, complete automation of the ontology mapping process can be motivated only if incomplete results can be accepted and the validity of correspondence assertions can be compromised. However, it should be possible to design heuristic methods and tools to assist a user in discovering correspondences between ontologies and in checking the validity of the proposed correspondences.


Part II

Design and Architecture


Chapter 5

Ontology Comparison and Semantic Enrichment

A framework for analysis and development of ontology comparison techniques is described in this chapter. A background for the semantic enrichment method and the ontology mapping algorithm is set out, and a novel method of semantic enrichment based on extension analysis is proposed. We start by outlining the prerequisites of our work in section 5.1, including the scope and assumptions of the work and a brief introduction to the modeling language used in this work. Ontology comparison is needed due to the existence of semantic discrepancy among different ontologies. We briefly review the causes of semantic discrepancy and the classification of different discrepancies in section 5.3. The different semantic discrepancies are reflected in the meta model of mapping assertions in section 5.4. The meaning of semantic enrichment, in its broad sense and in its particular usage in our work, is explained in sections 5.5 and 5.6 respectively. Section 5.7 describes in detail the enrichment technique we propose in this work. This chapter is partly based on previously published papers [155] [156] [158].

5.1 Prerequisites

In this section, we list the scope and assumptions of the work and briefly describe the modeling language RML, which is chosen as the underlying modeling language for the ontologies in question.

5.1.1 Scope and Assumption

The word ontology has been used to describe artifacts with different degrees of structure. These range from simple taxonomies (such as the Yahoo! hierarchy), to metadata schemes (such as the Dublin Core [46]), to logical theories. We now define the scope and assumptions of our work. We start with the definition of ontology as the underlying model for the ontologies that we aim to compare. An ontology specifies a conceptualization of a domain in terms of concepts and relations. (We use the terms ontology and concept model interchangeably in the rest of this thesis unless otherwise explicitly specified. Concepts are also called classes or entities; relations are also called attributes, slots, or properties in the literature.) Concepts are typically organized into a tree structure based on subsumption relationships among concepts. To be more exact:

Definition 5.1 (Ontology) An ontology is a sign system O := (L, F, G, C, R, T), which consists of:

A lexicon L: the lexicon contains a set of lexical entries for concepts, $L_c$, and a set of lexical entries for relations, $L_r$. Their union is the lexicon $L := L_c \cup L_r$.

A set of concepts C: for each $c \in C$, there exists at least one statement concerning c in the ontology.

A set of relations R: a relation $r \in R$ specifies a pair (DM, RG), where $DM, RG \in C$. DM is called the domain concept of relation r, and RG the range concept of r. An instance $i_1$ of DM may be related via r to another instance $i_2$ only if $i_2 \in RG$.

A taxonomy T: concepts are taxonomically related by the acyclic, transitive relation $T \subseteq C \times C$. $T(C_1, C_2)$ means that $C_1$ is a subconcept of $C_2$. There is a ROOT concept in C, and it holds that $\forall c \in C: T(c, ROOT)$. $C_1$ is a subconcept of $C_2$ if $C_1$ is a specialization or a part of $C_2$, or in other words, if $C_2$ is a generalization or an aggregation of $C_1$.

Two reference functions F, G: with $F: 2^{L_c} \rightarrow 2^C$ and $G: 2^{L_r} \rightarrow 2^R$. F and G link sets of lexical entries $\{L_i\} \subseteq L$ to the sets of concepts and relations they refer to, respectively, in the ontologies. In general, one lexical

entry may refer to several concepts or relations, and one concept or relation may be referred to by several lexical entries. This model summarizes the features which most ontology languages support. Other features, like the has-value constraint in Description Logics or additional axioms in F-Logic, are too diverse to be included in a common model. Starting from the definition, we further elaborate the scope and assumptions of our work:

1. The ontologies that are to be compared express overlapping knowledge in a common domain.

2. Ontologies can be expressed in different representation languages [159]. Though we assume that it is possible to translate between different formats, in practice a particular representation must be chosen for the input ontologies.

3. Our approach is based on the Referent Model Language (RML) [148], which is an extended ER-like (Entity Relationship) graphic language with strong abstraction mechanisms and a sound formal basis. The language has an XML representation.

5.1.2 The RML Modeling Language

The Referent Model Language (RML) is a concept modelling language targeted towards applications in areas of information management and heterogeneous organisation of data [148] [149]. It has a formal basis from set theory and provides a simple and compact graphical modelling notation for set-theoretic definitions of concepts and their relations. RML defines constructs for modelling of concepts; the selection of constructs is based on the concept types given by [22]:

Individual concepts - individual concepts apply to individuals. Individuals can be either specific or generic.

Class concepts - concepts that apply to collections of individuals.

Relation concepts - concepts that refer to relations among objects (individual or class concepts). There is a somewhat blurred distinction between class concepts and relation concepts, as a relation may be considered a class concept in its own right. The concept of marriage, for example, may be considered a relation between persons (that is, a relation between two individual concepts, the two persons, or a recursive relation on the concept class of Persons); however, marriage is also considered a distinct legal entity, and is thus viewed as a class concept in its own right.

Quantitative concepts - quantitative concepts do not represent distinct objects, but refer to magnitudes often associated with individual or class concepts.

The modelling constructs in RML are derived from the different types of concepts given above. In order to formalise the language, each concept modelling construct is given a definition from set theory. For a detailed elaboration of the RML language, we refer to [18]. The graphical notation of the basic constructs and the abstraction mechanisms of RML are illustrated in figure 5.1 and figure 5.2.

Figure 5.1: Graphical notations of basic RML constructs.

Figure 5.2: Graphical notations of RML abstraction mechanism.
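To close section 5.1 in executable form, the following sketch shows how the ontology sign system of Definition 5.1 might be represented as a data structure. The class and field names are our own illustrative choices, not part of RML or of the thesis prototype.

```python
from dataclasses import dataclass, field

@dataclass
class Relation:
    name: str
    domain: str   # domain concept DM
    range: str    # range concept RG

@dataclass
class Ontology:
    """A minimal rendering of O := (L, F, G, C, R, T) from Definition 5.1."""
    concepts: set[str] = field(default_factory=lambda: {"ROOT"})
    relations: dict[str, Relation] = field(default_factory=dict)
    taxonomy: set[tuple[str, str]] = field(default_factory=set)   # T(c1, c2): c1 subconcept of c2
    lexicon_c: dict[str, set[str]] = field(default_factory=dict)  # F: lexical entries -> concepts
    lexicon_r: dict[str, set[str]] = field(default_factory=dict)  # G: lexical entries -> relations

    def add_concept(self, name: str, parent: str = "ROOT") -> None:
        self.concepts.add(name)
        self.taxonomy.add((name, parent))

    def subconcepts(self, concept: str) -> set[str]:
        """Direct subconcepts of a concept under the taxonomy T."""
        return {c for (c, p) in self.taxonomy if p == concept}

# Example: a fragment of a travel ontology.
o = Ontology()
o.add_concept("Accommodation")
o.add_concept("Hotel", parent="Accommodation")
o.add_concept("City")
o.relations["locatedIn"] = Relation("locatedIn", domain="Hotel", range="City")
o.lexicon_c["hotel"] = {"Hotel"}
print(o.subconcepts("Accommodation"))  # {'Hotel'}
```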

5.2 The Abstract Ontology Mapping Model

To discuss the ontology mapping task more precisely, we introduce the abstract ontology mapping model. The overall process of ontology mapping is defined as follows: given two ontologies $O_a$ and $O_b$, mapping one ontology onto the other means that for each element in ontology $O_a$, we find corresponding element(s) with the same or similar semantics in ontology $O_b$, and vice versa. The comparison activity focuses on basic elements first (i.e., concepts or entities); then it deals with those modeling constructs that represent associations among basic elements (i.e., relationships). We define the relevant terminology as follows:

Definition 5.2 (Element) An ontology element is one of the following: a concept, a relation or a cluster. Concepts are called basic elements, while relations and clusters are complex elements. A cluster is a fragment of the whole ontology (its definition will be presented in chapter 6).

Definition 5.3 (Abstract mapping model) An ontology mapping model is a 5-tuple $(S, T, F, R, A)$ where

1. S is a set composed of logical views (representations) of the elements in the source ontology.

2. T is a set composed of logical views (representations) of the elements in the target ontology.

3. F is a framework for representing ontology elements and calculating relationships between elements in the two ontologies.

4. $R(s_i, t_j)$ is a ranking function which associates a real number with an element $s_i \in S$ and an element $t_j \in T$. Such a ranking defines an order among the elements in the source ontology with regard to one element $t_j$ in the target ontology.

5. A is a set composed of mapping assertions. A mapping assertion is a formal description of the mapping result, which supports further description of the exact nature of the derived mappings.

In other words, we can define the mapping process as $S, T \xrightarrow{F, R} A$, i.e. given the ontologies S and T as input, using the framework and the ranking

function to produce A as output. The model abstracts away from any specific implementation detail, yet it outlines at a high level the task at hand. There are different ways to fulfil the model. Our approach, which will be introduced in this thesis, can be seen as one concrete implementation of the model. The rest of this chapter, together with the next chapter, will discuss the process and each of the individual components in greater detail.

5.3 Semantic Discrepancies

As mentioned earlier, the process of ontology comparison consists of both identifying similarities and analyzing discrepancies among ontology structures. There are various causes of ontology diversity: different perspectives, equivalence among constructs of the model, and incompatible design specifications. Owing to these causes, conflicts inevitably exist in the representation of the same objects in different ontologies. Two types of conflicts are broadly distinguished: terminological discrepancy and structural discrepancy.

Terminological discrepancies arise because people from different organizations, or from different areas of the same organization, often refer to the same things using their own terminology and names. This results in a proliferation of names as well as possible inconsistency among terminologies in the component ontologies. The terminological discrepancies are classified as:

a synonym, which occurs when the same object or relationship in the UoD is represented by different names (a name can be either a single word or a phrase) in the component ontologies.

a homonym, which occurs when different objects or relationships in the UoD are represented by the same name in the component ontologies.

Structural discrepancies arise as a result of a different choice of modeling constructs or integrity constraints. The following types of structural discrepancies can be distinguished:

type discrepancies arise when the same phenomena in a UoD have been modeled using different ontology constructs. For example, a real-world phenomenon can be classified into categories in at least two different ways. One is to use a number of subtypes of a given entity type. Another is to use an attribute that has a fixed set of values in order to indicate the category to which an object in the UoD belongs.

dependency discrepancies arise when a group of concepts are related among themselves with different dependencies in different ontologies. For example, the relationship ProjectLeader between Project and Person is 1:1 in one ontology, but m:n in another.

An ontology integration technique may analyze discrepancies among ontology constructs in order to support identification of correspondences, to support canonization, or to support conflict analysis. In this work, the focus is on the comparison step of the integration process. In the next section, the notion of mapping assertion is described. The implication of the different discrepancies is reflected in the correspondence assertion.

5.4 Mapping Assertions

In order to describe, store and transmit the derived mappings in a systematic way, a model for describing the mappings is defined. In [71], a notion of correspondence assertion is introduced for that purpose. We adopt that correspondence assertion model as a base for organizing the different aspects of the mappings. In order to compare different comparison methods, it is important to identify the possible types of semantic relationships between the compared ontology structures. It is also important to include the emerging methods which use some form of semantic enrichment in the development or analysis of comparison techniques. Therefore, the mapping assertion also contains references to semantic enrichment structures, i.e. the source of the assertion, if any. The intention of this concept is to provide an explanation of why the particular assertion is chosen. The mapping assertion also contains a measurable degree of correspondence. It is included with the intention to cover comparison methods leading to competitive assertions which may be selected through ranking.

Figure 5.3: Mapping assertion metamodel (adapted from Sari Hakkarainen [1999]).

The adapted notion of mapping assertion is schematized in figure 5.3. The model is graphically represented using the RML graphic modeling language [148]. It has the following meaning: a mapping assertion is a reified class which describes the relationship between two ontology elements and supports further description of the involved resources. A mapping assertion involves two ontology elements. Each ontology element belongs to one ontology. A mapping type is attached to a mapping assertion, which specifies how the pair of ontology elements is related. Further, a mapping degree is attached to a mapping assertion to indicate the confidence of the derived mapping. The measure of the strength of the correspondence relationship provides a way of ordering the output. As a side effect, it also permits imperfect matching and introduces the notion of uncertainty into the comparison process. The intention of the assertion source is to provide an explanation of why the particular assertion is derived (derived from linguistic information of names, for instance). Note, however, that two ontology elements can be involved in several mapping assertions where the mapping type and degree, as well as the assertion source, differ depending on the focus of the comparison analysis.

The output of comparing these structures is a set of mapping assertions. Each mapping is described in a way that is consistent with the assertion model. A more precise definition of such assertions is given in the following.

Definition 5.4 (Mapping assertion) A mapping assertion describes the relationship between two ontology elements, and it has the following four components: a pair of ontology elements, a type of correspondence, a degree of correspondence, and a set of sources of assertion.

In the following, we discuss the above concepts further.

An ontology element is a valid expression in an ontology modeling language, either on the specification or the instantiation level. In our work, an ontology element can be a concept, a relation or a cluster.

A type of correspondence is one of the following five types: similar, narrower, broader, related-to and dissimilar. The first four types are commonly used in thesauri [176]. Technically, the term dissimilar is used to specify the situation where two concepts have the same (or similar) names, but denote two different things (i.e. homonyms). Note, however, that the intention of related-to is to cover complex inter-schema relationships which do not fall into the categories of similarity and subset. For example, to connect the concept country in one ontology with city in another ontology, the inter-schema relationship belongs-to might be specified. As it is impossible to enumerate all such ad hoc relationships, we unify them under related-to. It is, of course, possible to specify the exact nature of the relationship in more detail when related-to is chosen.

The degree of correspondence specifies how strongly a particular mapping type holds between two ontology elements. It is used to describe and modify a mapping type assigned between two ontology elements. It measures the strength of their correspondence of a given type.

The source of assertion denotes the enrichment structures involved which led to the choice of a specific type of correspondence, if any. The mapping assertion contains the reference to the sources of the assertion in order to enable description and classification of the methods involving semantic enrichment. The intention of the source of assertion is to provide an explanation of why a particular assertion is proposed.
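A minimal sketch of Definition 5.4 as a data structure follows. The enumeration values mirror the five correspondence types above, while the class and field names are illustrative assumptions rather than the prototype's actual interfaces.

```python
from dataclasses import dataclass
from enum import Enum

class CorrespondenceType(Enum):
    SIMILAR = "similar"
    NARROWER = "narrower"
    BROADER = "broader"
    RELATED_TO = "related-to"
    DISSIMILAR = "dissimilar"

@dataclass(frozen=True)
class MappingAssertion:
    """One mapping assertion per Definition 5.4."""
    source_element: str                 # element in the source ontology
    target_element: str                 # element in the target ontology
    mapping_type: CorrespondenceType
    degree: float                       # strength of the correspondence, e.g. in [0, 1]
    assertion_sources: frozenset[str]   # enrichment structures that motivated the assertion

# Example: two assertions over the same element pair, with different
# degrees and sources - explicitly allowed by the metamodel.
a1 = MappingAssertion("Accommodation", "Lodging", CorrespondenceType.SIMILAR,
                      0.85, frozenset({"feature-vector"}))
a2 = MappingAssertion("Accommodation", "Lodging", CorrespondenceType.SIMILAR,
                      0.90, frozenset({"feature-vector", "WordNet"}))
ranked = sorted([a1, a2], key=lambda a: a.degree, reverse=True)
```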

The careful reader will have noticed that a concept like mapping degree is modeled as a class rather than as a primitive data type (in that case, it would have been a number). The reason is a matter of flexibility and extensibility: modeling it as a class makes it possible to add attributes or relations more easily in the future. To make the result useful in a wider context, it is important to be compatible with current web standards. Therefore, the model is also represented using the Web Ontology Language OWL [130]. Exporting the results in OWL makes it possible for other OWL-aware applications to process and reason over the mapping results. In addition, thanks to the formal semantics of OWL, translating the model into OWL gives us a more precise semantic definition for each of the concepts and relations. The formality allows using inference engines to check the consistency or completeness of the mappings.

5.5 Semantic Enrichment of Ontology

The first step in handling semantic heterogeneity should be an attempt to enrich the semantic information of the concepts in the ontologies, as it is well understood that the richer the information the ontologies possess, the higher the probability that high-quality mappings will be derived [71]. An ontology mapping method based on semantic enrichment involves the usage of knowledge sources other than the original ontology. A semantically enriched ontology expresses more of the semantics of the UoD (Universe of Discourse) than the original ontology by introducing additional, often generic, information about an application domain in the UoD. The semantic enrichment techniques may be based on different theories and make use of a variety of knowledge sources [71], such as a concept hierarchy, a shared thesaurus, linguistic knowledge, fuzzy terminology and extension analysis. An abstract description of any semantic enrichment technique may contain the following components.

Figure 5.4: Semantic enrichment in ontology comparison.

A semantically enriched ontology, E(O), expresses more of the semantics of a UoD than an original component ontology O, where an enrichment structure introduces additional information about its ontology structures.

An enrichment structure, C, is a structure that captures the enriched knowledge. The syntax and semantics of such an enrichment structure are defined in a language of the chosen enrichment technique.

Figure 5.4 depicts the impact of semantic enrichment on the process of ontology comparison. The intuition is that by semantically enriching the two compared ontologies, O1 and O2, into E(O1) and E(O2), we transform the problem of comparing O1 and O2 into that of comparing E(O1) and E(O2). In our work, we instantiate the enrichment structure C with a representative feature vector, which comes out as the result of extension analysis of the relevant ontologies. In the next section, we discuss the above concepts in greater detail.

5.6 Extension Analysis-based Semantic Enrichment

5.6.1 The Concept of Intension and Extension

The concepts of intension and extension (also called terminological and extensional) have been introduced to understand the meaning of individual words and expressions [15]. The extension of an expression means the object or the set of objects in the real world to which the expression refers. The extension of the word dog is

Figure 5.5: Semantic enrichment through extension analysis.

the set of all dogs. The intension of an expression means its sense, which a person normally understands by the expression. For example, the intension of the word dog might be something like 'hairy mammal with four legs and tail, often kept as pet'. An ontology often (but not always) specifies the intensional part of the UoD, which identifies the concepts in the domain and the relations between them. The extensional part consists of facts about specific individuals in the domain, with which the model is populated. There are still different opinions on whether it is the intension or the extension that best decides the semantics of a concept. It is, nevertheless, widely accepted that the more we know about the intension and the extension, the more likely we are to come closer to a complete understanding of the concepts. It is from this intuition that we develop our enrichment technique on the basis of extension analysis, as depicted in figure 5.5.

5.6.2 Extension Analysis for Semantic Enrichment

Ontology mapping concerns the interpretation of models of a Universe of Discourse (UoD), which in turn are interpretations of the UoD. There is no guarantee that these interpretations are the only existing or complete conceptualizations of the state of affairs in the real world. We assume that the richer a description of a UoD is, the more accurate a conceptualization we achieve of the same UoD through interpretation of the descriptions.

Hence, the starting point for comparing and mapping heterogeneous semantics in ontology mapping is to semantically enrich the ontologies. Semantic enrichment facilitates ontology mapping by making explicit different kinds of hidden information concerning the semantics of the modeled objects. The underlying assumption is that the more semantics is explicitly specified about the ontologies, the more feasible their comparison becomes. The semantic enrichment techniques may be based on different theories and make use of a variety of knowledge sources [71]. We base our approach on extension analysis, i.e. on the instance information a concept possesses. The instances we use are documents that have been classified to the concepts. The idea behind this is that written documents used in a domain inherently carry the conceptualizations that are shared by the members of the community. This approach is particularly attractive on the World Wide Web, because huge amounts of free text resources are available. The belief is that the semantic meaning of a concept should be augmented with its extensions [87] (though in certain approaches, e.g. Description Logics, an intensional definition is believed to best describe the semantic meaning). Therefore, the concept is semantically enriched with a generalization of the information its instances provide. The generalization takes the form of a high-dimensional vector. These concept vectors are ultimately used to compute the degree of similarity between pairs of concepts. As illustrated in figure 5.6, the intuition is that given two ontologies A and B, we construct a representative feature vector for each concept in the two ontologies. The documents are building materials for the construction process. Then, with the feature vectors at hand, we calculate a similarity measure $sim(a_i, b_j)$ pairwise for the concepts in the two ontologies.

5.7 Feature Vector as Generalization of Extension

In figure 5.7, we show that the whole ontology mapping process is made up of two phases: the semantic enrichment phase and the mapping phase. We have been focusing on the semantic enrichment phase in this chapter. The main task in the semantic enrichment phase is to generate the enrichment structure, namely the representative feature vector. We first define what a feature vector of a concept is, and proceed with the procedures to

Figure 5.6: Representative feature vector as enrichment structure.

Figure 5.7: Two phases of the whole mapping process.

generate feature vectors.

5.7.1 Feature Vectors

Definition 5.5 (Feature vector) Let $C^K$ be the feature vector of concept K, and let V be the collection of all index words in the document collection:

$C^K = (weight_1, weight_2, \ldots, weight_t)$
$V = (word_1, word_2, \ldots, word_t)$

$C^K_i$ denotes the representativeness of index word $V_i$ for concept K.

We give a simple example to illustrate the structure of a feature vector. If K = Accommodation and

V = (bed, breakfast, car, computer, flight, hotel, price, travel, tree, water)

(in reality, the vocabulary dimension is much higher), then the feature vector

$C^{Accommodation}$ = (0.44, 0.5, 0, 0, 0.2, 0.8, 0.8, 0.4, 0, 0)

means, for instance, that the index word bed has a representativeness of 0.44 for the concept Accommodation, while the word car has no representativeness when it comes to the concept Accommodation. We now describe in detail how the feature vectors are generated; both the steps and the algorithm will be elaborated.

5.7.2 Steps in Constructing Feature Vectors

Figure 5.8 shows the two steps performed in the semantic enrichment process. The algorithm takes the two to-be-mapped ontologies in RML format, together with document sets, as input. There can be one or two document sets. In the former case, we assume the documents are relevant to both ontologies, while in the latter, it is assumed that the two document sets share the same vocabulary.

Document Assignment

The document assignment step aims to automatically assign documents to one or more predefined categories based on their contents. We use a linguistically based classifier, CnS (Classification and Search) [19], to associate documents with the ontologies. Multiple association is allowed. This is a semi-automatic process, where users need to manually adjust the assignment results to guarantee correct assignments.

Figure 5.8: Overview of the semantic enrichment process.

Alternatively, as a basic method of assigning documents to concepts, we may consider each concept as a query that is fired against a general-purpose search engine, which maintains the documents in question. Each document in the result set that has a ranking value greater than a certain minimum threshold is then assigned to the query concept. Which method to choose is partially determined by the kind of resources that are readily accessible. Assigning documents to concepts is necessary when no instance knowledge of the ontology is available. However, if documents have already been assigned to specific concepts (as in the Open Directory Project, for example), we can skip the first step and construct feature vectors for the concepts directly.
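As a rough illustration of the query-based alternative described above, the following sketch scores documents against a concept name with a hypothetical `search` ranking function. The scoring scheme and the threshold are assumptions for illustration; this is not the CnS classifier actually used in this work.

```python
def search(concept_name: str, documents: dict[str, str]) -> dict[str, float]:
    """Hypothetical stand-in for a search engine: rank documents by the
    fraction of query terms they contain."""
    terms = concept_name.lower().split()
    scores = {}
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        scores[doc_id] = sum(t in words for t in terms) / len(terms)
    return scores

def assign_documents(concepts, documents, threshold=0.5):
    """Assign each document to every concept for which its ranking value
    exceeds the threshold (multiple assignment is allowed)."""
    assignment = {c: [] for c in concepts}
    for c in concepts:
        for doc_id, score in search(c, documents).items():
            if score > threshold:
                assignment[c].append(doc_id)
    return assignment

docs = {"d1": "hotel room price per night", "d2": "flight booking and travel"}
print(assign_documents(["Price", "Travel"], docs))
# {'Price': ['d1'], 'Travel': ['d2']}
```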

Feature Vector Construction

The previous step provides as output two ontologies, where documents have been assigned to each concept in the ontologies. The next step is to calculate a feature vector for each concept in the two ontologies respectively. To calculate a feature vector for each concept, we first need to establish feature vectors for each document that belongs to the concept. The second step then builds feature vectors for each concept in the two ontologies. The intuition is that for each concept a feature vector can be calculated based on the documents assigned to it. Following the classic Rocchio algorithm [1], the feature vector for concept $a_i$ is computed as the average vector over all document vectors that belong to concept $a_i$. Following the same idea, the feature vector of a non-leaf concept is computed as the centroid of its instance vectors, sub-concept vectors and related-concept vectors. Thus, hierarchical and contextual information is partially taken into consideration. The output of this step is two intermediate ontologies, $O_A$ and $O_B$, where each concept has been associated with a feature vector, as depicted in figure 5.6. Three sub-steps constitute the process. The first two aim at building document vectors, while the third uses the document vectors to build feature vectors for concepts.

1. Pre-processing. The first sub-step is to transform the documents, which typically are strings of characters, into a representation suitable for the task. The text transformation is of the following kind: remove HTML (or other) tags; remove stop words; perform word stemming (lemmatization). Auxiliary information like a stop word list and an English lexicon (WordNet in this particular case) is used to perform the necessary linguistic transformations.

2. Document representation. We use the vector space model [142] to construct the generalization of the documents. In the vector space model, documents are represented by vectors of words. There are several ways of determining the weight of word i in a document d. We use the standard tf/idf weighting [142], which assigns the weight to word i in document d in proportion to the number of occurrences of the word in the document, and in inverse proportion to the number of documents in the collection in which the word occurs. Thus, for each document d in a document collection D, a weighted vector is constructed as follows:

$d = (w_1, \ldots, w_n)$  (5.1)

where $w_i$ is the weight of word i in document d:

$w_i = f_i \cdot \log(N / n_i)$  (5.2)

where $f_i$ is the frequency of word i in document d, N is the number of documents in the collection D, and $n_i$ is the number of documents that contain word i.
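A minimal sketch of equations (5.1) and (5.2) follows. Tokenization here is naive whitespace splitting; the pipeline described above additionally removes tags and stop words and performs stemming, which is omitted for brevity.

```python
import math
from collections import Counter

def tfidf_vectors(documents: list[str], vocabulary: list[str]) -> list[list[float]]:
    """Build one weighted vector per document: w_i = f_i * log(N / n_i),
    as in equations (5.1) and (5.2)."""
    N = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    # n_i: number of documents containing word i.
    doc_freq = {w: sum(w in set(toks) for toks in tokenized) for w in vocabulary}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[w] * math.log(N / doc_freq[w]) if doc_freq[w] else 0.0
               for w in vocabulary]
        vectors.append(vec)
    return vectors

docs = ["hotel price hotel", "flight price", "hotel breakfast"]
vocab = ["hotel", "price", "flight", "breakfast"]
for v in tfidf_vectors(docs, vocab):
    print([round(x, 2) for x in v])
```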

Figure 5.9: Contributions from relevant parts when calculating the feature vector for a non-leaf concept.

3. Concept vector construction. We differentiate here between leaf concepts and non-leaf concepts in the ontology. Leaf concepts are those which have no sub-concepts. For each leaf concept, the feature vector is calculated as the average over the document vectors that have been assigned to this concept. Let $C^K$ be the feature vector for concept K and let $D^K$ be the collection of documents that have been assigned to concept K. Then each feature i of the concept vector is calculated as:

$C^K_i = \frac{\sum_{j \in D^K} w_{ij}}{|D^K|}$  (5.3)

When it comes to non-leaf concepts, the feature vector $C^K$ for a non-leaf concept K is calculated by taking into consideration contributions from the documents that have been assigned to it, from its direct sub-concepts and from the concepts with which K has relations (at this point, all ad hoc relations other than subsumption are treated as related-to). Figure 5.9 illustrates that contributions from the instances, the sub-concepts and the related concepts

are counted when calculating feature vectors for such non-leaf concepts. Let $D^K$ be the collection of documents that have been assigned to concept K, let $S_t$ be the collection of its direct sub-concepts and let $S_r$ be the collection of its related concepts. The ith element of $C^K$ is defined as:

$C^K_i = \alpha \frac{\sum_{j \in D^K} w_{ij}}{|D^K|} + \beta \frac{\sum_{t \in S_t} w_{it}}{|S_t|} + \gamma \frac{\sum_{r \in S_r} w_{ir}}{|S_r|}$  (5.4)

where $\alpha + \beta + \gamma = 1$. $\alpha$, $\beta$ and $\gamma$ are used as tuning parameters to control the contributions from the concept's instances, sub-concepts and related concepts respectively.
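A sketch of equations (5.3) and (5.4) in code follows; vectors are plain lists, and the non-leaf case assumes sub-concept vectors have already been computed bottom-up. The parameter values are illustrative defaults of this sketch, not the tuned settings of the thesis experiments.

```python
def average(vectors: list[list[float]], dim: int) -> list[float]:
    """Element-wise average of a list of equal-length vectors."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def leaf_concept_vector(doc_vectors: list[list[float]], dim: int) -> list[float]:
    """Equation (5.3): average over the vectors of documents assigned to the concept."""
    return average(doc_vectors, dim)

def nonleaf_concept_vector(doc_vectors, sub_vectors, related_vectors, dim,
                           alpha=0.6, beta=0.3, gamma=0.1):
    """Equation (5.4): weighted combination of the document average, the
    sub-concept centroid and the related-concept centroid (alpha+beta+gamma=1)."""
    d = average(doc_vectors, dim)
    s = average(sub_vectors, dim)
    r = average(related_vectors, dim)
    return [alpha * d[i] + beta * s[i] + gamma * r[i] for i in range(dim)]

# Example with a 3-word vocabulary:
hotel = leaf_concept_vector([[0.8, 0.4, 0.0], [0.6, 0.2, 0.0]], dim=3)
camping = leaf_concept_vector([[0.2, 0.0, 0.9]], dim=3)
accommodation = nonleaf_concept_vector(doc_vectors=[[0.5, 0.5, 0.1]],
                                       sub_vectors=[hotel, camping],
                                       related_vectors=[], dim=3)
print([round(x, 2) for x in accommodation])
```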

5.7.3 Feature Vectors as Semantic Enrichment

The above steps give us a feature vector for each concept in the ontologies. What does a feature vector tell us? As we mentioned earlier, the intention of the feature vector is to capture and make explicit the extensional information a concept bears. From the way we build up the feature vectors, it is reasonable to say that a feature vector is a statistical measure and representation of a concept's extensional information. Representing this information in a feature vector gives a computationally convenient way to study it and to read information out of it.

5.8 Concluding Remarks

The framework proposed in this chapter intends to capture the semantics of comparison of various aspects of ontologies. It can be interpreted as a descriptive framework, aimed at supporting analysis and development of methods and tools for ontology comparison. The intention is to capture the specific properties of conflicts and correspondences the existing methods handle and consider in ontology comparison. The framework clarifies the extent to which the existing ontology comparison methods detect different kinds of correspondences. The method we propose is not universally applicable, however; the relevant scope and assumptions of our work have therefore been elaborated. We introduced the abstract mapping model to formalize the mapping task in an implementation-independent fashion. The concept of semantic discrepancy is included with the intention to set the design considerations for mapping assertions. The mapping assertion model, on the other hand, describes the structure and intended semantic meaning of the mapping results. The particular semantic enrichment method used in this work (i.e. extension analysis based) has been elaborated both in terms of the enrichment structure and in terms of the construction process. The two steps that constitute the semantic enrichment phase, namely document assignment and feature vector construction, have been extensively discussed. The second phase of the whole approach, the mapping phase, will be explained in the next chapter.


Chapter 6

Ontology Mapping Approach

In this chapter, an approach for mapping of semantically enriched ontologies is proposed. The defined steps for mapping are intended to support a knowledge worker or application engineer in the problematic task of ontology mapping or integration. The current version of the prototype implementation of the algorithm is for the purpose of experiments with the proposed approach. Parts of the work in this chapter have been published before [158] [157].

6.1 Algorithm Overview

The focal point of this chapter is the second phase of the approach, namely the mapping phase, as illustrated in figure 6.1. A novel algorithm for ontology mapping is specified in the following sections. The basic idea of mapping assertion analysis from Chapter 5 is further developed and applied in practice for the comparison of relevant elements of two ontologies. The algorithm takes as input semantically enriched elements of two ontologies and produces as output suggestions to the user for possible correspondences. As figure 6.2 illustrates, the algorithm has the following five main components.

Figure 6.1: Two phases of the whole mapping process.

The mapper performs a computation of a correspondence measure for the pairs of compared ontology elements, based on the similarity of their enriched structures.

The enhancer utilizes an electronic lexicon to adjust the similarity values that have been computed by the mapper, with the intention of re-ranking the mapping assertions in the result list.

The presenter determines which recommendations to suggest to the user, based on the partial ordering of correspondence measures and the current configuration profile.

The exporter translates and exports the mapping results to a desired format so that other follow-up applications can import and use the results in a loosely coupled way.

The configuration profile is a user profile that assigns individual values to the different tuning parameters and a threshold value for the exclusion of mappings with low similarity.

The mapping algorithm is used to semi-automate the process of comparing and mapping two semantically enriched ontologies represented in RML. It analyses the extension and intension of ontology elements in order to determine quantitatively the measure of similarity of the two compared elements. Based on the degree of similarity among element pairs, the algorithm produces a set of ranked suggestions. The user is in control of accepting, rejecting or altering the assertions. The level of automatic exclusion from user presentation is adjustable. The mapping phase starts with the mapper taking the two semantically enriched ontologies as input and calculating similarity values for the concepts in the two ontologies. The enhancer works in a plug-in manner and updates the initially computed similarity values for the mapping assertions. Next, the mapper works on the refined mapping results for concepts to calculate correspondences for more complex structures, ranging from relations to clusters to whole ontologies. Then, the results are presented to the user for possible manual inspection before they are exported. Sections 6.2, 6.3 and 6.4 describe each of these steps in greater detail. Further refinement of the algorithm for the sake of more accurate mappings is presented in section 6.5. Prerequisites for applying the algorithm are discussed in section 6.6, and possible application scenarios that satisfy

Figure 6.2: Major steps in the mapping phase.

the prerequisites are identified in the same section. Finally, section 6.7 summarizes the chapter.

6.2 The Similarity Calculation for Concepts

To find concept pairs that are similar, we calculate a similarity value pairwise for the concepts in the two ontologies. A threshold value is defined by the user to exclude pairs that have too low similarity values. The calculation of concept similarity is the foundation for the similarity calculation of the other, more complex elements. The similarity of two concepts in two ontologies is directly calculated as the cosine measure between the two representative feature vectors. Let two feature vectors, for concepts a and b respectively, both of length n, be given. The cosine similarity between concept a and concept b is defined as:

$sim(a, b) = sim(C^a, C^b) = \frac{C^a \cdot C^b}{\|C^a\| \, \|C^b\|} = \frac{\sum_{i=1}^{n} C^a_i C^b_i}{\sqrt{\sum_{i=1}^{n} (C^a_i)^2} \sqrt{\sum_{i=1}^{n} (C^b_i)^2}}$  (6.1)

where $C^a$ and $C^b$ are the feature vectors for concepts a and b respectively, n is the dimension of the feature vectors, and $\|C^a\|$ and $\|C^b\|$ are the lengths of the two vectors.
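Equation (6.1) in code, continuing the list-based vector representation used in the earlier sketches:

```python
import math

def cosine_similarity(ca: list[float], cb: list[float]) -> float:
    """Equation (6.1): cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(ca, cb))
    norm_a = math.sqrt(sum(x * x for x in ca))
    norm_b = math.sqrt(sum(x * x for x in cb))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no extensional information
    return dot / (norm_a * norm_b)

accommodation = [0.44, 0.5, 0.0, 0.0, 0.2, 0.8, 0.8, 0.4, 0.0, 0.0]
lodging = [0.40, 0.6, 0.0, 0.0, 0.1, 0.7, 0.9, 0.3, 0.0, 0.0]
print(round(cosine_similarity(accommodation, lodging), 3))
```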

For a concept a in ontology A, to find the most related concept b in ontology B, the top k ranked concept nodes in ontology B are selected according to the initial similarity measure calculated above. For the pairs that are not selected, either because their similarity value is lower than the threshold value or because they are not among the top k ranked, the similarity values are reduced to zero. The chosen pairs are further evaluated by other matching strategies. For instance, if one concept has a name similar to that of concept a, its similarity measure gets a boost, which results in a change of its rank in the result set. How large the boost is can be tuned by a parameter. This leads us to the next section, where the adjustment of similarity values is elaborated.

6.3 Adjusting Similarity Values with WordNet

Given the central position of the concept similarity calculation, it is desirable to make the suggestions as accurate as possible. This requires an additional technique to adjust the similarity values. We use an electronic lexicon, WordNet, for that purpose. We start this section with a brief description of WordNet. Then we introduce the path length measurement, which is used to compute the semantic relatedness of concepts.

6.3.1 WordNet

WordNet is a lexical database constructed on lexicographic and psycholinguistic principles, under active development for the past 20 years at the Cognitive Science Laboratory at Princeton University. It contains 138,838 English words [113]. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. In WordNet, each unique meaning of a word is represented by a synonym set, or synset. Each synset has a gloss that explains the concept the

For example, the words car, auto, automobile, and motorcar constitute a single synset that has the following gloss: four wheel motor vehicle, usually propelled by an internal combustion engine. Synsets are connected to each other through explicit semantic relations that are defined in WordNet. These relations only connect word senses that are used in the same part of speech. Noun synsets are connected to each other through hypernym, hyponym, meronym, and holonym relations. If a noun synset A is connected to another noun synset B through the is-a-kind-of relation, then B is said to be a hypernym of synset A and A a hyponym of B. In the car example, the synset containing car is a hypernym of the synset containing ambulance, and the ambulance synset is a hyponym of the car synset. If a noun synset A is connected to another noun synset B through the is-a-part-of relation, then A is said to be a meronym of B and B a holonym of A. In the car example, the synset containing bumper is a meronym of car and car is a holonym of bumper. Noun synset A is related to adjective synset B through the attribute relation when B is a value of A. For example, the adjective fast is a value of the noun synset speed. Taxonomic or is-a relations also exist for verb synsets. Verb synset A is a hypernym of verb synset B if to B is one way to A. Synset B is then called a troponym of A. For example, the verb synset containing the word operate is a hypernym of drive, since to drive is one way to operate. Conversely, drive is a troponym of operate. Adjective synsets are related to each other through the similar-to relation. For example, fast is similar to rapid. Verb and adjective synsets are also related to each other through cross-reference also-see links. While there are other relations in WordNet, those described above make up more than 93% of the total number of links in WordNet. Our approach does not explore beyond the scope described above.

6.3.2 The Path Length Measurement

With the initial mapping suggested by the previous step, the users can choose to go for a post-processing step to strengthen the prominent mappings. WordNet [52] may be used to strengthen the mappings whose concept names have a close relatedness in WordNet. The goal is to update the similarity value two concepts have based on the distance between the two concepts in WordNet. As we explained in the introduction, in WordNet, nouns are organized into taxonomies where each node is a set of synonyms (a synset) representing a single sense.

Figure 6.3: Example of the hyponymy relation in WordNet used for the path length measurement.

If a word has multiple senses, it will appear in multiple synsets at various locations in the taxonomy. These synsets contain bidirectional pointers to other synsets to express a variety of semantic relations. The semantic relation among synsets in WordNet that we use in this experiment is that of hyponymy/hypernymy, or the is-a-kind-of relation, which relates more general and more specific senses. Verbs are structured in a similar hierarchy, with the relation being troponymy instead of hypernymy. One way to measure the semantic similarity between two words a and b is to measure the distance between them in WordNet. This can be done by finding the paths from each sense of a to each sense of b and then selecting the shortest such path. Note that path length is measured in nodes rather than links, so the length between sister nodes is 3, and the length of the path between members of the same synset is 1. In the example of figure 6.3, the length between car and automobile is 1, since they belong to the same synset. Similarly, the path between car and bike is 4 and the path between car and fork is 12. We did not make any effort to join together the 11 different top nodes of the noun taxonomy. As a consequence, a path cannot always be found between two nouns. When that happens, the algorithm returns a not related message.
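The node counting can be pictured as a breadth-first search over the hypernym/hyponym links. The following is a toy sketch under the assumption that the taxonomy is given as an undirected adjacency list between synset identifiers; the real system works against the full WordNet database through JWNL:

    import java.util.*;

    final class WordNetPath {
        /** Shortest path between two synsets, counted in nodes rather than links. */
        static int pathLength(String src, String dst, Map<String, List<String>> neighbours) {
            if (src.equals(dst)) return 1;                 // members of the same synset
            Map<String, Integer> links = new HashMap<>();
            Deque<String> queue = new ArrayDeque<>();
            links.put(src, 0);
            queue.add(src);
            while (!queue.isEmpty()) {
                String current = queue.poll();
                for (String next : neighbours.getOrDefault(current, List.of())) {
                    if (!links.containsKey(next)) {
                        links.put(next, links.get(current) + 1);
                        if (next.equals(dst)) return links.get(next) + 1;  // nodes = links + 1
                        queue.add(next);
                    }
                }
            }
            return -1;  // no path: the 11 noun taxonomies are not joined ("not related")
        }
    }

Two sister synsets are two links apart, so the function returns 3, matching the convention above.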

The path length measurement above gives us a simple way of calculating the relatedness between two words. However, there are issues that need to be addressed.

Word form. When looking up a word in WordNet, the word is first lemmatized. So the distance between system and systems is 0.

Multiple part-of-speech. The path length measurement can only compare words that have the same part-of-speech. This implies that we do not compare, for instance, a noun and a verb, since they are located in different taxonomy trees. The words we compare in this context are concept names in the ontologies. Even though most of the names consist of a single noun or noun phrase, verbs and adjectives do occasionally appear in a concept name label. In some cases, one word may have more than one part-of-speech (for instance, backpacking is both a noun and a verb in WordNet). For these words, we first check if the word is a noun, and if the answer is yes, we treat it as a noun. In the case of backpacking, for instance, it will be treated as a noun and its verb sense will be disregarded. If it is not a noun, we check if the word is a verb, and if the answer is yes, we treat it as a verb. Words that are neither nouns, verbs nor adjectives will be disregarded. This makes sense since the different parts of speech of the same word are usually quite related, and choosing one of them is representative enough.

Compound nouns. Those compound nouns which have an entry in WordNet (for example, jet lag, travel agent and bed and breakfast) will be treated as single words. Others, like railroad transportation, which have no entry in WordNet, will be split into tokens (railroad and transportation in the example), and their relatedness to another word will be calculated as an average over the relatedness between each token and the other word. For instance, the relatedness between railroad transportation and train will be the average relatedness of railroad with train and transportation with train.

We have integrated the Java WordNet Library (JWNL) [38], a Java API, into the system for accessing the WordNet relational dictionary and calculating semantic relatedness based on the path length measurement described above.
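The token-averaging rule for compound nouns, and the subsequent boost, can be sketched as follows. The conversion from a path length to a [0,1] relatedness score is our assumption for illustration; the thesis fixes only that shorter paths mean closer relatedness:

    import java.util.function.ToIntBiFunction;

    final class Relatedness {
        /** One plausible path-length-to-relatedness conversion (an assumption, not the actual formula). */
        static double fromPathLength(int pathLength) {
            return pathLength <= 0 ? 0.0 : 1.0 / pathLength;   // "not related" maps to 0
        }

        /** Compounds without a WordNet entry: average the tokens' relatedness to the other word. */
        static double compound(String compound, String word,
                               ToIntBiFunction<String, String> pathLength) {
            String[] tokens = compound.split("\\s+");          // e.g. "railroad transportation"
            double sum = 0;
            for (String token : tokens) sum += fromPathLength(pathLength.applyAsInt(token, word));
            return sum / tokens.length;
        }

        /** Post-processing: the amplified relatedness is added to the initial cosine value. */
        static double boost(double cosineSim, double relatedness, double relatednessWeight) {
            return cosineSim + relatednessWeight * relatedness;
        }
    }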

The computed relatedness is amplified by a tuning parameter and then added to the similarity values computed in the previous step. The changing of similarity values will change the ranks of the involved mappings. The intention is that the mappings more likely to be correct will be strengthened by this post-processing procedure and be ranked high in the results. Whether the post-processing step achieves that goal or not has to be checked by the evaluation process, which comes in chapter 8.

6.4 The Similarity Calculation for Complex Elements

Based on the correspondences calculated for the concepts, we can further expand the correspondence discovery to other elements and structures in the ontologies. In this section, we introduce how the similarity between relations and between clusters of concepts is defined.

6.4.1 Relations

The similarity of relations is calculated based on the corresponding domain concepts and range concepts of the relations. Precisely, the similarity between relations R(X, Y) and R'(X', Y') is defined as the arithmetic mean of the similarity values their domain and range concepts have:

sim(R, R') = \frac{sim(X, X') + sim(Y, Y')}{2}    (6.2)

where X and X' are the domain concepts of R and R', respectively, Y and Y' are the range concepts of R and R', respectively, and sim(X, X') and sim(Y, Y') are calculated by equation 6.1 for concept similarity.

6.4.2 Clusters

Based on the correspondences calculated for the concepts, we can further expand the correspondence discovery into more complex structures. For this, we define the concept of a cluster. A cluster is a group of related concepts, which includes a center concept a and its k-nearest neighbors.

Figure 6.4: Example of calculating cluster similarity.

A cluster of 1-nearest neighbors includes a center concept, its direct parent, and its direct children. A cluster of 2-nearest neighbors includes the grandparent, the siblings and the grandchildren, in addition to the 1-nearest neighbors. The correspondences between clusters in two ontologies reveal areas that are likely to be similar. This helps knowledge workers to locate and concentrate on a bigger granularity level. The similarity of clusters is calculated based on the weighted percentage of established mappings between member concepts in proportion to the number of all connections between the two clusters. Figure 6.4 illustrates an example of two 1-nearest-neighbor clusters, A and B, where a2 and b2 are the center concepts of A and B respectively. Four mappings between member concepts exist, namely (a1, b1), (a2, b2), (a4, b3) and (a4, b4). That situation given, the similarity between cluster A and cluster B is computed as:

sim(A, B) = \frac{sim(a1, b1) + sim(a2, b2) + sim(a4, b3) + sim(a4, b4)}{4 \times 5}    (6.3)

In general, we define the similarity of two clusters as:

sim(X, Y) = \frac{\sum_{(a_i, b_j) \in M} sim(a_i, b_j)}{|X| \cdot |Y|}    (6.4)

where X and Y are clusters of k-nearest neighbors, X = \{a_1, a_2, a_3, ..., a_n\} and Y = \{b_1, b_2, b_3, ..., b_m\}, M is a subset of the cartesian product of X and Y, M \subseteq X \times Y, with M = \{(a_i, b_j) \mid a_i \in X, b_j \in Y, sim(a_i, b_j) > 0\}, |X| and |Y| are the numbers of elements in the two sets, respectively, and sim(a_i, b_j) is calculated by equation 6.1 for concept similarity.
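Equations 6.2 and 6.4 are direct to implement once the concept similarities from equation 6.1 are available. A sketch, assuming the similarities are kept in a map from concept-name pairs to values, with absent pairs meaning zero; the ontology-level measure defined in the next subsection has exactly the same form, applied to the full concept sets:

    import java.util.*;

    final class ComplexSimilarity {
        /** Equation 6.2: relation similarity as the mean of domain- and range-concept similarity. */
        static double relationSim(double simDomains, double simRanges) {
            return (simDomains + simRanges) / 2.0;
        }

        /** Equations 6.4/6.5: sum of member-pair similarities over |X| * |Y|. */
        static double clusterSim(Set<String> x, Set<String> y, Map<List<String>, Double> sim) {
            double total = 0;
            for (String a : x)
                for (String b : y)
                    total += sim.getOrDefault(List.of(a, b), 0.0);  // pairs outside M add 0
            return total / (x.size() * y.size());
        }
    }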

6.4.3 Ontologies

Extending the idea of cluster similarity one step further, we come to the point where the similarity between two ontologies can be quantified as the weighted percentage of established mappings in proportion to all the connections between concepts in the two ontologies, as defined in the following equation:

sim(O_1, O_2) = \frac{\sum_{(a_i, b_j) \in M} sim(a_i, b_j)}{|O_1| \cdot |O_2|}    (6.5)

where O_1 and O_2 are two ontologies, O_1 = \{a_1, a_2, a_3, ..., a_n\} and O_2 = \{b_1, b_2, b_3, ..., b_m\}, a_i (i = 1...n) are the concepts in O_1 and b_j (j = 1...m) are the concepts in O_2, M is a subset of the cartesian product of O_1 and O_2, M \subseteq O_1 \times O_2, with M = \{(a_i, b_j) \mid a_i \in O_1, b_j \in O_2, sim(a_i, b_j) > 0\}, |O_1| and |O_2| are the numbers of concepts in the two ontologies, respectively, and sim(a_i, b_j) is calculated by equation 6.1 for concept similarity.

Such a value is in particular useful when several ontologies in a domain need to be merged, for this value can help reveal the two most similar ones, which constitute good candidates for the merging process.

6.5 Further Refinements

In the design of the system, we noticed that to achieve more accurate mapping results, further refinements of the results are always welcome. Even though our current implementation does not directly incorporate them into the system, in order to keep focus and because of the marginal

cost/benefit gain, it is still advisable to discuss them in theory and prepare the system architecture in such a way that adding them will be relatively easy and require the least possible extra effort. There are mainly three kinds of efforts which fall in the realm of our further refinement techniques. We will discuss them in turn.

6.5.1 Heuristics for Mapping Refinement Based on the Calculated Similarity

To further improve the mapping accuracy, it is desirable to incorporate commonsense knowledge and domain constraints into the mapping process. For that purpose, domain-independent and domain-dependent heuristic rules are defined. The goal is to update the similarity value two elements have based on the execution of the heuristic rules. The heuristic rules can be domain independent or domain dependent. Some example domain-independent heuristic rules are listed as follows (the third rule is sketched in code below):

If all children of concept A match concept B, then A also matches B.

Two concepts match if their children also match.

Two concepts match if their parents match and k% of their children also match.

The domain-dependent heuristic rules incorporate domain knowledge into the mapping process. For example, a domain-dependent heuristic rule in the tourism domain can be that if concept B is a descendant of concept A and B matches hotel, it is unlikely that A matches Bed and Breakfast.
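As an illustration, the following is a minimal sketch of the third domain-independent rule, assuming pairwise similarities in a map and an acceptance threshold; the matched() criterion and the size of the boost are our assumptions:

    import java.util.*;

    final class HeuristicRules {
        /** A pair counts as matched once its similarity passes an acceptance threshold. */
        static boolean matched(String a, String b, Map<List<String>, Double> sim, double accept) {
            return sim.getOrDefault(List.of(a, b), 0.0) >= accept;
        }

        /** Boost sim(a, b) if the parents match and at least k% of a's children match a child of b. */
        static double parentChildrenRule(String a, String b, String parentA, String parentB,
                                         List<String> childrenA, List<String> childrenB,
                                         Map<List<String>, Double> sim,
                                         double accept, double k, double boost) {
            double current = sim.getOrDefault(List.of(a, b), 0.0);
            if (childrenA.isEmpty() || !matched(parentA, parentB, sim, accept)) return current;
            long hits = childrenA.stream()
                    .filter(ca -> childrenB.stream().anyMatch(cb -> matched(ca, cb, sim, accept)))
                    .count();
            return ((double) hits / childrenA.size() >= k) ? current + boost : current;
        }
    }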

6.5.2 Managing User Feedback

In the design of the mapping system, we observed that, in general, fully automatic solutions to the mapping problem are not possible, due to the potentially high degree of semantic heterogeneity between ontologies. We thus allow an interactive mapping process, e.g. allowing users to manually add, confirm, reject, or alter mapping assertions. On the other hand, the users' actions on the mapping results are a good source for improving the algorithm's performance in the next round of mapping result calculation or updating. Currently, we track the users' actions on the mapping results in a log file. This leaves room for integrating learning components in an iterative mapping process.

6.5.3 Other Matchers and Combination of Similarity Values

Even though our approach is based mainly on the idea of exploring the extension of concepts for deriving similarity between pairs of concepts, it is desirable to have the flexibility of adding other, complementary mapping strategies to the algorithm. This is because, to achieve high mapping accuracy for a large variety of different ontologies, a single technique is unlikely to be successful. An example of an alternative mapping strategy may be one that is based on studying and comparing the data types of relevant elements. Hence, it is necessary to combine different approaches in an effective way. We therefore introduce the coordinator component in the system architecture to be responsible for combining the similarity values returned by different mapping components. The coordinator assigns to each mapping strategy a weight that indicates how much that particular strategy will contribute to the whole picture. Then the coordinator combines the returned similarity values via a weighted sum.
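The weighted-sum combination can be sketched as follows; the Matcher interface and the weight handling are illustrative assumptions rather than the actual coordinator code:

    import java.util.List;

    /** A pluggable mapping strategy producing a similarity value for a concept pair. */
    interface Matcher {
        double similarity(String conceptA, String conceptB);
    }

    final class Coordinator {
        /** Combine the strategies' values via a weighted sum; weights reflect each strategy's share. */
        static double combined(String a, String b, List<Matcher> matchers, double[] weights) {
            double sum = 0;
            for (int i = 0; i < matchers.size(); i++)
                sum += weights[i] * matchers.get(i).similarity(a, b);
            return sum;
        }
    }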

6.6 Application Scenarios

The proposed mapping algorithm is not universally applicable, however. In order to successfully apply the algorithm, the component ontologies, i.e. any two ontologies considered as input, should fulfill the following two conditions.

There exists a fair amount of textual resources that reflect the extension of the relevant ontologies.

The component ontologies need to be semantically enriched by use of the linguistic instrument as described in Chapter 5, or an equivalent feature vector construction system.

The first requirement presumes that the extension of the concerned ontologies is in the format of textual resources, so that textual analysis techniques can be applied in the process of extracting essential information from the extensions. The second requirement presumes that, prior to the mapping, the ontology structures have to be semantically enriched using the semantic enrichment system of Chapter 5, or an equivalent. Given the above requirements, there are several scenarios where this algorithm can naturally fit in. One is document retrieval and publication between different web portals. Users may conform to their local ontologies, through which the web portals are organized. It is desirable to have support for automated exchange of documents between the portals and still let the users keep their perspectives. To achieve this, we need to map terms in one ontology to their equivalents in the other portal's ontology. Using the documents that have been assigned to each category to enrich the concepts, and afterwards computing similarity for pairs of ontology concepts, fits nicely with our approach. Another area is product catalogue integration. In accordance with [54], different customers will make use of different classification schemas (UNSPSC and UCEC, to name a few). We need to define links between the different classification schemas that relate the various concepts. Establishing such a connection helps to classify new products in other classification schemas, and this in turn will enable full-fledged B2B business, where a certain number of well-known standard vocabularies coexist among business partners and mappings relate their mutual vocabularies [74]. Service matching is yet another good candidate for applying the method, though we have to assume that there are some service description hierarchies (the MIT process handbook [100], for instance) and that the provider and the requester are using different classification schemas. By using some extension description of the service hierarchy, we can compute a feature vector for each service concept. Then the matching can be conducted by calculating the distance between the representative feature vectors.

6.7 Concluding Remarks

In this chapter, the particular problems in the computational comparison of ontology elements have been analyzed, and a novel approach to meet the requirements arising from the analysis has been proposed. The algorithm supports a semi-automatic computation of correspondences between concepts, based on the enriched structures.

In addition, the algorithm also takes advantage of hierarchies or taxonomies of concepts. This is reflected in the way the similarity of non-leaf concepts is calculated. The algorithm produces a set of ranked suggestions of possible concept correspondences. Concept correspondence plays a central role in the whole algorithm, for it is the foundation for computing correspondences for other ontology elements and structures. In order to produce more accurate concept correspondence suggestions, additional resources are used to update the similarity values. WordNet has been used extensively in computational linguistics for tasks ranging from relationship discovery to word sense disambiguation [5]. We used it mainly for calculating the semantic relatedness of concepts in the WordNet hierarchy. Based on the suggested concept correspondences, the correspondences between relations in the two ontologies, the correspondences between clusters in the two ontologies, and ultimately the correspondence between the two ontologies are defined in sequence. The correspondences at a bigger granularity level (the cluster and ontology levels, for instance) are complementary to the more fine-grained single-concept-level correspondences. The former help the user to get a quick overview of the distribution, and the latter provide the user with more detailed information. The algorithm compares solely the representative feature vectors, i.e. the enrichment structures, and the names of the concepts. In that respect, the algorithm is modeling language independent. Further refinement techniques for the purpose of even more accurate mappings have been discussed. These include using domain-independent or domain-dependent heuristic rules to update the results, logging user feedback for learning components, and combining other mapping strategies into the algorithm. Finally, we have also identified the conditions under which the algorithm will most likely succeed. Three example scenarios which meet the conditions have been introduced, and how the algorithm can be applied in these scenarios has been discussed as well.

Part III Implementation and Assessment


Chapter 7 The Prototype Realization

A prototype of the approach has been implemented in order to verify that the proposed approach is an applicable solution. It also paves the way for evaluating the approach in a quantitative manner, as described in the next chapter. This chapter is focused on functionality specification, rather than technical details.

7.1 Components in the Realization

The system architecture is composed of three separately developed parts that communicate using XML.

The modeling environment: to build the necessary ontologies, we need a modeling environment. During the work, it has been natural to incorporate previous modeling methodologies and tools accumulated in the Information Systems group into our system. We have not developed any new modeling tools; rather, we have used the current baseline of modeling support in the Information Systems group [70] [69] [145] [175] [143] [2] [25] [85] [51] [18].

The CnS Client: to classify documents when no instances are available, we need a classifier. The CnS (Classification n Search) is a client for model-based classification and retrieval of documents. In our context, we use only the classification function of the CnS. The client interacts with the server-side classification component and presents the results through a graphical user interface, which allows the user to manually adjust the classification results. The implementation of the CnS client and the classification component was part of another doctoral thesis, and is described in more detail in [18].

Figure 7.1: Components of the system.

The imapper system: to implement the process of constructing feature vectors and generating mappings, we have developed the imapper as the core part of our system. The prototype is developed as a standalone Java application, which communicates with the other software components through XML file exchange.

Figure 7.1 illustrates the three parts and how they interact with each other. The modeling environment is responsible for constructing/importing the ontologies in RML format. The ontologies are passed on to CnS for assigning documents to the relevant ontology elements. The ontologies, together with the classification results stored in XML, are delivered to the imapper system for the mapping process. In the rest of the chapter, we present each part in greater detail.

7.2 The Modeling Environment

The Referent Model Language and the corresponding tools were developed at the Information Systems group at IDI, NTNU. The RML language is a recent language that initially springs out from the PPP integrated modelling environment [70]. PPP initially contained support for several modelling languages: a Process Model Language (PrM), an extended ER modelling language (ONE-R) and a rule modelling language (PLD),

Figure 7.2: The Referent Modeling Editor.

and also comprised specifications and partial implementations of extensive methodology support: versioning mechanisms [2], view generation [143], concepts and notation for hierarchical modelling [145], prototyping and execution [175], as well as explanation generation and translation of models [69]. Work on cooperative support, in terms of enabling discussions and awareness in the process of constructing models, was carried out in [51]. Later work has refined the initial modelling languages and also added new languages. The most recent are the RML concept modelling language [148], the APM workflow modelling language [25], and the task modeling and dialogue modeling languages for user interface design [165]. The toolset we are using for the ontology construction process consists of the following components:

The RML modelling editor, RefEdit. The editor is a standalone Windows tool that stores the models as XML files. Figure 7.2 shows a snapshot of the RefEdit modeling tool.

The XML-based model repository, with support for importing/exporting, consistency checking and versioning (in progress).

In this thesis, we have focused on semantic issues rather than syntactical issues, and have assumed that the same representation language is used in both ontologies; furthermore, we have picked RML as our representation language. For the approach to be useful in a wider context, we have to cope with ontologies that are represented in other languages and transfer them into RML. This requires the toolset to have extensive import/export support for different representation formats.

7.3 The CnS Client as a Classifier

Figure 7.3 shows the CnS Client in classification mode, with a fragment of a particular ontology of collaboration technology and a small set of corresponding documents. The ontology fragment shows the hierarchy of collaborative processes, which is defined as an aggregation of coordination, production and other activities. The fragment also shows the specialization of the concepts representing coordination and production activities. The classification of a document according to the domain model amounts to selecting the model concepts considered relevant for this document. In the classification mode, the toolbar provides the user with the option of getting suggestions from the server-side model-matcher, as well as quick keys for accepting or rejecting these suggestions. The Fast classification button lets the user accept whatever suggestions the server provides without further examination, which is our alternative to automatic classification. The document management function allows the user to manage a local set of documents that are under classification; these may be divided into folders. While working in classification mode, a user may switch between working with one document at a time or with a selection of documents simultaneously in a summary view. Figure 7.3(a) shows the summary view of all the selected documents. In this view the user has first received suggestions from the model-matcher and is now manually refining these suggestions. Suggestions are marked with a green triangle. In the document list, documents with unprocessed suggestions are marked with a small green triangle, while documents where the user has made actual selections (or accepted the suggestions) are marked with a filled rectangle.

Figure 7.3: CnS Client in the classification mode. (a) Working with multiple documents. (b) Working with one document.

Figure 7.4: The imapper architecture.

Green triangles illustrate concepts that have suggestions; the size of the triangle (along the bottom line) illustrates the percentage of documents in which this concept is suggested. The more suggestions (i.e., the more documents in which the concept is located), the more the triangle grows from right to left. Alternatively, a user can choose to work on one single document at a time (figure 7.3(b)). Similarly, users may accept and reject a single suggestion by clicking either on the green or the white part of the concept, respectively, or all suggestions at once by using the toolbar buttons. When examining the classifications for one document, the user can examine the full text of the document by clicking on the document tab. Finally, the client performs a check of the selected documents' classifications and provides warnings to the user according to some predefined rules. For instance, the system will signal when classifications are saved with unprocessed suggestions. The classification results are saved in XML files.

7.4 The imapper System

As the core part of the whole mapping system, figure 7.4 shows an overview of the imapper system architecture. At the storage level, five kinds of data exist:

The ontology, represented in RML, is stored in an XML file. The ontologies are exported from the Referent modeling editor, RefEdit. The XML format used to export ontologies from RefEdit is presented in appendix B.

The classification results returned by the CnS client are stored in XML format, documenting which concept has which documents as instances. The relevant XML format is explained in appendix B.

The actual documents are stored in plain txt or html formats.

We use the WordNet lexical database version 2.0. In our setting, we use only the dictionary files (plain text).

The final mapping assertions are stored in an XML file. The relevant XML format is presented in appendix B.

At the service level, the system is composed of eight functional elements that communicate through well-defined Java interfaces:

Model Manager is the component that reads the ontology and provides model-related services, including finding a particular concept by its name or ID, getting the attributes of a concept, getting all the sub-concepts of a given concept, getting all the related concepts of a given concept, and so on.

Extension Manager reads the classification results and the relevant documents to build feature vectors for the ontology concepts, both leaf and non-leaf concepts. Within this module, a linguistic analyzer is responsible for preprocessing the documents if necessary, performing morphological processing if required, and of course building up the document term frequency matrix. The morphological processing is performed using the default morphological processor in JWNL, which will be introduced next in the list. The main service provided by this module is getting the feature vector for a given concept.
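To make the service level concrete, the following Java fragment sketches the interfaces of the first two components. The method names and the Concept type are assumptions of ours, abstracted from the service descriptions above:

    import java.util.List;

    // A placeholder for the RML concept representation.
    record Concept(String id, String name) {}

    /** Model-related services backed by the ontology's XML representation. */
    interface ModelManager {
        Concept findByName(String name);
        Concept findById(String id);
        List<String> attributesOf(Concept concept);
        List<Concept> subConceptsOf(Concept concept);
        List<Concept> relatedConceptsOf(Concept concept);
    }

    /** Builds and serves feature vectors from the classification results and documents. */
    interface ExtensionManager {
        double[] featureVectorOf(Concept concept);  // for both leaf and non-leaf concepts
    }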

JWNL (Java WordNet Library) is a Java API for accessing WordNet-style relational dictionaries [38]. It is an open source project and provides API-level access to the WordNet data. It is pure Java (it uses no native code), so it is completely portable. Apart from data access, it also provides functionality such as relationship discovery and morphological processing. JWNL implements a default morphological processor, but also leaves space for using other, user-customized morphological processors. The basic usage of JWNL is to look up WordNet index words for a given token, find relationships of a given type between two words (such as ancestry), find the path from a source synset to a target synset, and find chains of pointers of a given type (for example, get the hypernym tree of a given synset).

WordNet Adapter is the component that wraps JWNL and provides the services that are needed in this application. To get WordNet-related services, other components interact with the WordNet Adapter rather than with JWNL. In the future, when a newer version of JWNL is in place, only the WordNet Adapter needs to be updated to cope with the change. This guarantees minimal updating effort. Apart from mediating between JWNL and the other system components, the WordNet Adapter also implements higher-level services, like getting the relatedness of two words using a particular measurement (the path length measurement, the most informative class measures, and so on).

Mapper is the component that computes similarity values for concepts in the two ontologies. As a core component, it uses the services the other, supportive components provide. To get concepts and relations, it uses the Model Manager; for the initial mappings, it gets feature vectors from the Extension Manager; and for the post-processing, it gets word relatedness from the WordNet Adapter. The only service provided by the Mapper is to return the mapping results for two given ontologies.

The Graphical User Interface (GUI) of imapper is shown in figure 7.5. The figure illustrates a mapping process and the obtained mapping assertions.

Figure 7.5: The GUI of the imapper system.

Both ontologies are represented visually in RML (Referent Modelling Language) [148] [137]. The steps in the approach are triggered in sequence by the user by pushing the relevant buttons. As a result, a list of top-ranked mapping assertions is generated and listed in the table at the lower part of the frame (see figure 7.5). An ID uniquely identifies each mapping assertion. It concerns two concepts, one from the ontology on the left, the other from the one on the right. The fourth column of the table describes what kind of mapping relation holds between these two. A confidence level is given to indicate the probability that this prediction is true. An explanation about the source of the mapping assertion is given in the last column. For example, in figure 7.5, mapping assertion 1 states that the concept family travel in the left ontology is most similar to the concept family in the right ontology, with the similarity value shown in the table, and that this mapping is derived mainly from both extension analysis and WordNet relatedness calculation. When the user selects one mapping assertion (by clicking the row of that assertion in the mapping table), the corresponding concepts in the two ontologies are highlighted, making it possible for the user to get a clear overview of the relevant locations of the concepts in the ontologies.

The user can sort the mapping assertions according to any of the columns by clicking the relevant column heading. It is also possible for the user to edit, delete or add mapping assertions.

The service provided by the Exporter is straightforward: to save the mapping assertions in XML format after the user has approved or adjusted the mapping results.

The Coordinator is responsible for controlling the interactions among components, sequencing the sub-tasks, and passing information among components. In the future, when new matchers are added into the architecture, it is also the Coordinator that combines the newly plugged-in components into the existing mapping strategy.

In developing the architecture for this version of imapper, one design goal was to make testing out different combinations of mapping strategies as simple as possible, requiring as little extra code as possible. A major part of this was making imapper run off a configuration file. This resulted in a plugin-style architecture. For example, in the configuration file, you specify whether lemmatization is turned on or off via the lemmatization tag. Also, the various tuning parameters are defined in the configuration file, so that changing their values becomes easy and does not affect any other part of the system. The configuration file is loaded during initialization.

For morphological analysis, we used JWNL. An alternative is to use the linguistic workbench developed at the Norwegian University of Science and Technology (NTNU) [80]. In that setting, documents are processed through a chain of components that each transform the document contents into different intermediate products. The workbench consists of 8 components: POS (Part-Of-Speech) tagging, lemmatization, phrase detection, language detection, stop-word removal, word class filtering, weirdness test and XSL style sheet. Each component runs as an XML-RPC server and each has a specific port number for addressing. XML is used as the data exchange format between the various components. This architecture makes it possible to use the workbench in a flexible way, since each task can be executed independently and the order of the tasks can be controlled freely.

7.5 Concluding Remarks

In this chapter, we have elaborated on the three different parts of the system, namely the modeling environment, the CnS software and the imapper. The implementation is of prototype quality, and we have tried to integrate available tools into the system. We further discussed at length the functional settings of the three parts respectively. The modeling environment consists of an RML editor and a model repository. It is employed to build or import ontologies. The CnS software is implemented as a standalone Java application, which supports the semi-automatic assignment of documents to ontology elements. Both provide inputs in XML format to the imapper. The imapper system consists of eight components: Model Manager, Extension Manager, JWNL, WordNet Adapter, Mapper, Exporter, Coordinator and the GUI. These components work together to perform the task of predicting mappings. The quality of the predicted mappings is the subject of the coming evaluation chapter.


Chapter 8 Case Studies and Evaluation

A comprehensive evaluation of the match processing strategies supported by the imapper system has been performed on two domains. One domain is described by two different catalogues applying two different ontologies. The other domain is tourism, where two different vocabularies and conceptual structures are found in, e.g., the Open Directory Project (ODP) and Yahoo!. The main goal was to evaluate the matching accuracy of imapper, to measure the relative contributions from the different components of the system, and to verify that imapper can contribute to helping the user in performing the labor-intensive mapping task. The design and methodology of the evaluation are described first. Afterwards, the results and an analysis of the results are presented. This chapter is partly based on a previously published paper [157].

8.1 Experiment Design

8.1.1 Performance Criteria

Traditionally, in database schema integration tasks, the performance of the mapping algorithm is measured according to system performance or, in some cases, based merely on a feature analysis [6]. System performance evaluations consider measures such as response time and algorithm complexity. More user-oriented evaluation metrics have also been proposed. They borrow ideas from information retrieval and focus on the usefulness of the suggested mappings [125] [43] [110] [41] [40]. The usefulness is typically measured based on the classical precision and recall measures, although the two measures have been given adjusted interpretations in the task of mapping.

System performance criteria are not relevant for our trial, since our implementation is a prototype, which has not been implemented with system performance criteria in mind. Moreover, in reality, even though certain time and resource limitations will apply, the task of mapping is generally not a time- or resource-critical task. The precision and recall measures are designed to compare the set of correctly identified mappings in the automatically suggested mapping results with the correct set of mappings predefined by the users. One obvious problem with these measures is that correctness is a subjective measure, which is bound to differ from user to user. This makes the correct set of mappings a more or less moving target. In standard information retrieval tests, such as the TREC (Text REtrieval Conference) series, the predefined relevant document sets are determined by expert judges. In the context of ontology mapping, such predefined tasks and results are so far not available. We therefore measure only the relative usefulness of the approach in different settings, by tuning a number of variables, in order to suggest in what circumstances the algorithm is likely to be useful, as well as to measure the robustness of the system. Following the discussion above, we use the following measures in our trial. To evaluate the quality of the match operations, we compare the match result returned by the automatic matching process (P) with the manually determined match result (R). We determine the true positives, i.e. the correctly identified matches (I). Figure 8.1 illustrates these sets. Based on the cardinalities of these sets, the following quality measures are computed.

Precision = |I| / |P| is the fraction of the automatically discovered mappings which is correct, that is, belongs to the manually determined mappings. It estimates the reliability of the automatic match prediction relative to the manual procedure.

Recall = |I| / |R| is the fraction of the correct matches (the set R) which has been discovered by the mapping process. It specifies the share of real matches that are found.
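In code, with each mapping represented as a string such as "family travel->family", the two measures reduce to simple set operations; a minimal sketch:

    import java.util.HashSet;
    import java.util.Set;

    final class MatchQuality {
        /** Returns {precision, recall} for predicted mappings P against the gold standard R. */
        static double[] precisionRecall(Set<String> predicted, Set<String> reference) {
            Set<String> truePositives = new HashSet<>(predicted);
            truePositives.retainAll(reference);              // I = P intersected with R
            double precision = predicted.isEmpty() ? 0.0
                    : (double) truePositives.size() / predicted.size();
            double recall = reference.isEmpty() ? 0.0
                    : (double) truePositives.size() / reference.size();
            return new double[] { precision, recall };
        }
    }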

Figure 8.1: Precision and recall for the mapping results.

Precision and recall have been used extensively to evaluate the performance of retrieval algorithms in the information retrieval field [3] and have also been used in other studies [41] [124]. For each mapping the system predicted, there is a similarity degree associated with it. The degree indicates the confidence of the prediction. It also provides a practical way to rank the mappings. As the mappings are ranked in descending order of the degree, we can calculate precision at different recall levels by gradually adding more mappings into consideration. We plot the precision versus recall curve at the 11 standard recall levels [3]. Precision versus recall figures are useful because they allow us to evaluate quantitatively both the quality of the overall mapping collection and the breadth of the mapping algorithm. Further, they are simple and intuitive, and they can be combined in a single curve. Finally, for this version of the experiment, we evaluated only concept-concept mappings, whereas the more complex mappings between relations, clusters, etc. are not in focus. Two reasons account for this choice:

Concept-concept mappings are the basis for any other, more complex mappings, and ensuring their high quality will form a sound base for the other types of mappings.

When user manual work is involved, we have to carefully limit the scope and complexity of the task. Therefore, the more complex mappings are omitted in this version of the evaluation.

Thus, even though the purpose of the experiment is to test the performance of the proposed approach in general, the results should be interpreted only as preliminary, due to the limited amount of data and the scope of the test.
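The 11-level curve mentioned above is conventionally computed with interpolated precision: at each recall level 0.0, 0.1, ..., 1.0, one takes the maximum precision observed at any recall at or above that level. A sketch, assuming the predicted mappings are sorted by descending similarity and flagged against the gold standard:

    final class PrCurve {
        /** Interpolated precision at the 11 standard recall levels for a ranked result list. */
        static double[] elevenPointPrecision(boolean[] correctAtRank, int totalCorrect) {
            int n = correctAtRank.length, hits = 0;
            double[] precision = new double[n], recall = new double[n];
            for (int i = 0; i < n; i++) {
                if (correctAtRank[i]) hits++;
                precision[i] = (double) hits / (i + 1);
                recall[i] = (double) hits / totalCorrect;
            }
            double[] curve = new double[11];
            for (int level = 0; level <= 10; level++) {
                double r = level / 10.0;
                double best = 0.0;
                for (int i = 0; i < n; i++)
                    if (recall[i] >= r) best = Math.max(best, precision[i]);
                curve[level] = best;     // stays 0 if the recall level is never reached
            }
            return curve;
        }
    }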

8.1.2 Domains and Source Ontologies

We evaluated imapper on two domains, whose characteristics are shown in table 8.1 and table 8.2. Next, we describe the backgrounds, the contents, and the peculiarities of the chosen ontologies in detail.

The Product Catalogues

The product catalogue integration task was first introduced in [54], where consumers and vendors may use different classification schemas (UNSPSC and UCEC, to name a few) to identify their requests or products. Links between the different classification schemas need to be defined in order to relate the corresponding concepts. Establishing such a connection helps to classify new products in other classification schemas, and this in turn will enable full-fledged B2B business, where a certain number of well-known standard vocabularies will coexist among business partners and mappings relate their mutual vocabularies [74]. In our experiment, the two relevant product catalogues are the United Nations Standard Products and Services Code (UNSPSC) and the Standardized Material and Service Classification ecl@ss. The UNSPSC categories are organized into four levels. Each UNSPSC category definition contains a category code and a short description of the product (for example, the category Personal communication devices). ecl@ss likewise organizes its categories in a four-level taxonomy.

Table 8.1: The product catalogue ontologies - characteristics of the fraction of the ontologies used for the experiment (number of concepts, non-leaf concepts and relations, maximum depth, average number of instances per concept, and maximum number of subconcepts of a concept, for UNSPSC and ecl@ss).

It is generally understood that UNSPSC classifies products from a supplier's perspective, while ecl@ss classifies from a buyer's perspective.

Figure 8.2: Snapshots of the product catalogue extracted from UNSPSC.

Figure 8.3: Snapshots of the product catalogue extracted from ecl@ss.

Table 8.2: The tourism ontologies - characteristics of the fraction of the ontologies used for the experiment (number of concepts, non-leaf concepts and relations, maximum depth, average number of instances per concept, and maximum number of subconcepts of a concept, for the ODP and Yahoo travel ontologies).

For our current experiment, two small segments of the relevant catalogues, both of which concern the domain of computer and telecommunication equipment, are selected. They contain the concepts corresponding to the categories and are organized in 3-4 levels by generalization relationships. Two datasets of product descriptions, collected from online computer vendor websites, are classified according to UNSPSC and ecl@ss. The classification is performed in two steps: first the automatic classification by the CnS client, then human adjustment of the automatic results. The classified product descriptions are viewed as the instances of the relevant concepts.

Tourism Ontologies

The second domain we chose is the tourism sector. The two ontologies are constructed based on vocabularies and structures from the relevant travel sections of the Open Directory Project (ODP) and the Yahoo! Category directory. In both ODP and Yahoo!, categories are organized in hierarchies augmented with related-to links. However, the exact nature of the hierarchical relationship is not specified. Therefore, it is not clear whether a specific hierarchical relationship is a generalization abstraction (is-a relationship), an aggregation abstraction (part-of relationship), or something else. Accordingly, we further specify the two ontologies by making explicit the nature of the hierarchical relationships, using our own modeling knowledge. For example, it is reasonable to say that travel is an aggregation of lodging, destination, transportation and preparation, while business travel is a kind of special travel. It is worth noting that we did not change the hierarchical structure or vocabulary of the original categories,

because we wanted to maintain as much as possible the original design rationale of their respective creators. The refined ontologies were then modeled in the Referent Modeling Language with the tool RefEdit. Figure 8.4 and figure 8.5 show snapshots of the two tourism ontologies. The Open Directory Project aims to build the largest human-edited directory of Internet resources and is maintained by community editors who evaluate sites in order to classify them in the right directory. The Yahoo! category is maintained by the Yahoo! directory team for the inclusion of web sites into the Yahoo! directory. We consider the web pages under one category the instances of that category. As a result, in this domain, unlike the product catalogue example above, instances of each concept are directly available, without the need to classify them. For each category we downloaded the first 12 web site introductions. If there were fewer than 12 instances in the category, we downloaded all that were available. A very small number of categories (more in ODP than in Yahoo!) have no instances classified, and we just leave them as they are. It is worth reiterating here that even if a concept has no instance information available, a match involving that concept is still possible, since the sub-concepts of this concept will contribute to the construction of its feature vector. These two sets of ontologies constitute good targets for the mapping experiment. First, the two ontologies in each pair cover similar subject domains; on the other hand, they were developed independently of each other, and therefore there was no intentional correlation among terms in the ontologies. In addition, the domains are relatively easy for everyone to understand.

8.1.3 Experiment Setup

For the manual part, we conducted a user study in the Information Systems Group at the Norwegian University of Science and Technology. Six users conducted the manual mapping independently. All of them have good knowledge of modeling in general. None of the users had addressed the problem of mapping ontologies prior to the experiment. For each of the two mapping tasks, each participant received a package containing:

1. a diagrammatic representation of the two ontologies to be matched

Figure 8.4: Snapshots of the travel ontology extracted from the Open Directory Project.

Figure 8.5: Snapshots of the travel ontology extracted from the Yahoo directory.

2. a brief instruction for the mapping task

3. a scoring sheet to fill in the user-identified mappings

The participants performed the mapping independently, at their convenience, in their own offices. They were asked to use their background knowledge to perform the judgment. They were also informed that:

1. there are no cardinality constraints, meaning one-to-many, many-to-one and many-to-many mappings are allowed

2. any pair of concepts can make a legal mapping, meaning leaf to leaf, leaf to non-leaf, and non-leaf to non-leaf mappings are allowed

3. to help them make decisions, they can use numbers to indicate how confident they are about each match (3 for fairly confident, 2 for likely and 1 for need to know more to suggest the match). This also helps to compare system performance when different confidence levels are considered

After they finished the task, they sent back the scoring sheets for analysis. The product catalogue mapping task was performed first, and the tourism ontology mapping task was conducted a month later. Both used the same 6 participants.

8.2 The Analysis Results

The primary goal of our experiment is to evaluate the quality of imapper's suggestions and to examine the contributions from the different components of the system. We also aim at testing the robustness of the approach by a series of sensitivity analyses. In addition, we analyze the overlap between ontologies by studying the inter-user differences in the manual mappings. This section presents the results of our evaluation. We start by explaining the different variables that may affect the results before we present the initial results. The different variables also constitute the subjects of a series of sensitivity tests.

8.2.1 Filters and Variables

Filters

Filters are used for choosing a selection of mapping candidates from the list of ranked mapping pairs returned by the mapping algorithms. Usually, for every element in the ontologies, the algorithm delivers a large set of match candidates. In the literature, it is argued that many of the mapping algorithms are of limited usefulness, because too many false positives are generated. It would be an overwhelming task for a user to select the right mappings from the wrong ones if too many candidates were presented. It is therefore of vital importance that the mappings are filtered and ranked in a correct way. It is not evident, though, which criteria can be useful for selecting a desirable subset from the initially suggested mappings and presenting it to the user. For a set of n mapping pairs, as many as 2^n different subsets can be formed. In our approach, we used basically two filters:

Cardinality, to constrain whether we want a 1:1 mapping or an n:m mapping. We use the parameter CARDINALITY to denote this.

Threshold, to remove mappings with low confidence scores. We use the parameter THRESHOLD VALUE to denote this.

Variables

A number of variables affect the results. They are subjected to a sensitivity test.

Desired Mapping Results. Both precision and recall are relative measures that depend on the desired mapping results (also referred to as the gold standard), i.e. the user-identified mappings. For a meaningful assessment of mapping quality, the desired mapping result must be specified precisely. In this experiment, we have two versions of the desired mapping results. One was developed by 6 users independently; the other is based on group discussion. The intention is to test whether different user efforts lead to different mapping results, and to what extent the different desired mapping results affect the final precision and recall values. Another variable in the gold standard is related to the fact that we allow users to specify a confidence level for each mapping they suggest:

3 for fairly confident, 2 for likely and 1 for need to know more to suggest the match. Therefore, two variables are relevant here:

DESIRED MAPPING RESULT, to indicate whether the gold standard is individual or based on group discussion.

CONFIDENCE LEVEL, to specify whether only confident mappings are included in the gold standard (when the confidence level is set to 3) or less confident ones are included as well (when the confidence level is 2 or 1; confidence level 2 includes all the mappings that have a confidence level equal to or bigger than 2, and confidence level 1 includes those that are equal to or bigger than 1).

Structural Information. Recall from chapter 5 that, for non-leaf concepts, contributions from the instances, the sub-nodes and the related nodes are counted in when calculating the feature vectors for such non-leaf concepts. Let D_j be the collection of documents that have been assigned to the node K, let S_t be the collection of its direct sub-nodes and let S_r be the collection of its related nodes. The ith element of C_K is defined as:

C_K^i = \alpha \frac{\sum_{j \in D_j} w_{ij}}{|D_j|} + \beta \frac{\sum_{t \in S_t} w_{it}}{|S_t|} + \gamma \frac{\sum_{r \in S_r} w_{ir}}{|S_r|}

where \alpha + \beta + \gamma = 1, and \alpha, \beta and \gamma are used as tuning parameters to control the contributions from the concept's instances, sub-concepts, and related concepts, respectively. For instance, if we assign 1 to \alpha and 0 to \beta and \gamma, no structure information will be counted.

WordNet Contribution. In chapter 6, we mentioned that the contribution from the WordNet post-processing is adjusted by a tuning parameter RELATEDNESS WEIGHT. If RELATEDNESS WEIGHT = 0, WordNet contributions are not counted.
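A sketch of this weighted combination, under our reading of the reconstructed formula above, with the per-source term-weight vectors already computed over a shared dimension:

    import java.util.List;

    final class Enrichment {
        /** Feature vector of a non-leaf concept as a weighted mix of its documents,
            sub-concepts and related concepts (alpha + beta + gamma = 1). */
        static double[] nonLeafVector(List<double[]> documentVectors, List<double[]> subVectors,
                                      List<double[]> relatedVectors,
                                      double alpha, double beta, double gamma, int dim) {
            double[] vector = new double[dim];
            addAverage(vector, documentVectors, alpha);
            addAverage(vector, subVectors, beta);
            addAverage(vector, relatedVectors, gamma);
            return vector;
        }

        /** Adds w times the element-wise average of the given vectors to the accumulator. */
        static void addAverage(double[] acc, List<double[]> vectors, double w) {
            if (vectors.isEmpty()) return;      // e.g. a concept without instances
            for (double[] v : vectors)
                for (int i = 0; i < acc.length; i++)
                    acc[i] += w * v[i] / vectors.size();
        }
    }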

8.2.2 Quality of imapper's Predictions

Baseline Filter and Variable Configuration

To compare the situation in different configurations, we need a baseline configuration of the filters and variables. Since the variables will be subjected to a sensitivity test later, we first need to determine the values of the filters. For both domains, we set the variable values as follows: α = 0.5, β = 0.25, γ = 0.25, RELATEDNESS WEIGHT = 0, CONFIDENCE LEVEL = 1, and DESIRED MAPPING RESULT = individual.

If we assume that deleting a mapping takes as much user effort as adding one, then at precision 50% half of the suggestions are false positives, which means the user has no net gain or loss from using the tool. More than 50% means the user can save some effort by using the tool, compared to doing all the manual work herself, while less than 50% means the user will have to spend more effort when using the tool. Therefore, we compare different configurations of the filters by comparing the recall value at precision 50%. The configuration which gets the highest recall value is chosen. As a result of this comparison, we determined the values for the two filters as follows: for the product catalogue task, CARDINALITY = 3 and THRESHOLD VALUE = 0.2; for the tourism domain, CARDINALITY = 4 and THRESHOLD VALUE = 0.4. Only mappings that satisfy both the cardinality and the threshold constraints are included in the final results. The results obtained by using the baseline configuration above will be referred to as the baseline version later on. The last parameter in the baseline configuration, the desired mapping results, will be discussed at the beginning of the next section.

Table 8.3: Summary of the manually discovered mappings (average, maximum and minimum number of manual mappings per domain).

Baseline Analysis

For the two tasks, a number of mappings were identified manually by the users. Table 8.3 summarizes the manual results. Overall, an average of 30 mappings were discovered by the users between the two product catalogues, and an average of 62 in the tourism domain. The individual manual mappings are determined to be correct and are used as a gold standard to evaluate the quality of the automatically suggested mappings in the baseline version. The automatic result is evaluated against each of the six manual mapping proposals to calculate the respective precision and recall, and then an average precision and an average recall are computed. Figure 8.6 summarizes the average precision versus recall figures for the two mapping tasks. Since the mappings are ranked according to their similarity degree, so that the most similar ones are ranked high, the precision drops at lower recall levels. For the tourism ontology mapping task, the precision is 93% at recall level 10% and drops gradually when more mappings are included. For the product catalogue task, the precision at levels of recall higher than 70% drops to 0 because, in the baseline version, not all user-identified mappings in this task can be discovered by the imapper system automatically. In this particular case, 69% have been discovered by the system. For the tourism domain, around 92% are discovered. Note that the tourism ontology mapping task achieved higher precision than the product catalogue task at all recall levels. There are several possible explanations for the difference. First, the number of instances in the product catalogues is smaller than that in the tourism ontologies. As a result, the feature vectors generated by a larger instance set have a better chance to capture and condense the terms that differentiate one concept from the others. More accurate feature vectors will in turn boost the accuracy of the mappings.

Figure 8.6: Precision versus recall curves for the two tasks.

Second, the degree of overlap in content and structure of the to-be-mapped ontologies varies between the two tasks. It appears that the overlap between the two tourism ontologies is larger than that between the two product ontologies. A higher overlap makes it easier for the system to detect the mappings correctly. To verify this, we pooled the manual results of all 6 users and analyzed the level of inter-user agreement for each task. We assume that the more similar two ontologies are in content and structure, the more likely users are to come up with similar results, and hence the higher the level of inter-user agreement. Table 8.4 summarizes the analysis. In the product catalogue task, 9.7% of the user-identified mappings were discovered by all 6 users, 2.4% by 5 users, and a significant 50% by only one user. In the tourism domain, 26.4% were agreed on by all 6 users, 8% by 5 users, and only 32% were noticed by a single user. These numbers give an indication that the inter-user agreement for the tourism ontology mapping task is higher than that for the product catalogue task. The higher agreement suggests that the overlap between the ontologies in the tourism task is likely more significant than in the product catalogue task.

Domain              6 users   5 users   4 users   3 users   2 users   1 user
Product catalogue   9.7%      2.4%      7.3%      9.7%      20.7%     50%
Tourism             26.4%     8%        8%        9.1%      16%       32%

Table 8.4: Analysis of the inter-user agreement (percentage of user-identified mappings agreed on by the given number of users).

Figure 8.7: Precision versus recall curves before and after using WordNet for post-processing in the tourism domain.

Third, the documents used in the two tasks have different characteristics. In the product domain, the product descriptions contain a fair number of technical terms, proper nouns and acronyms (for instance, 15inch, Thinkpad, LCD, etc.). The lack of special means to treat these terms hampers the system in generating high quality feature vectors. In contrast, the documents in the tourism domain contain far fewer technical terms or proper nouns.

Further Experiments

Tuning with WordNet

For both domains, we conducted a further experiment assessing the effect of using WordNet [52] to post-process the mappings initially generated by the system.

WordNet is used to strengthen the mappings whose concept names have a close relatedness in WordNet. In this experiment, the relatedness is defined as the hierarchical distance from one concept to the other in WordNet. Figure 8.7 shows the precision and recall curves before and after using WordNet for post-processing in the tourism ontology mapping task. The figure shows that WordNet marginally improves the precision at recall levels lower than 60%. This suggests that WordNet is useful in strengthening the similarity values of the correct mappings and boosting their ranks in the result set. The change of ranks makes the prediction more accurate at lower recall levels. At recall levels 20% and 50%, WordNet yields an apparent improvement in precision.

Figure 8.8: Precision versus recall curves before and after using WordNet for post-processing in the product catalogue domain.

Figure 8.8 shows the precision and recall curves before and after using WordNet in the product catalogue domain. In contrast to the tourism domain, the effect of WordNet here is not apparent; indeed, at high recall levels the precision gets worse after using WordNet than before. One possible explanation is the large number of technical terms used in the product catalogue domain. These technical terms are not documented and classified in WordNet in accordance with their usage in technical domains. For instance, IT has no entry in WordNet, and in the case of CD writable, writable has no entry either.

In the case of portable, its only noun sense in WordNet relates to a small light typewriter, which has little to do with a portable computer. This, together with the relatively small set of concepts, led to WordNet strengthening pairs in a more or less random way, which in turn slightly worsened the results.

We also noticed limitations of using WordNet to calculate concept relatedness in both domains. In WordNet, nouns are grouped into several hierarchies by hyponymy/hypernymy relations, each with a different unique beginner. Topic-related semantic relations are absent in WordNet, so travel agent and travel have no relation between them. In fact, they lie in two different taxonomies: travel agent is in the taxonomy with entity as its top node, while travel is in the taxonomy with act as its top node. Consequently, the path length measure reports the two terms as not related, which does not mirror human judgment. A possible way to overcome this limitation is to augment WordNet with domain specific term relations or to use other domain specific lexicons. In this experiment, we used the path length measure to estimate the semantic relatedness of terms in WordNet. Other measures have been proposed in the literature, for instance, the most informative class measure [136]. It would be interesting to see the performance of these alternative measures.
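The path length measure can be illustrated with NLTK's WordNet interface, a latter-day substitute for the thesis implementation; the helper below is a minimal sketch under that assumption. It gives identical terms like car and automobile (which share a synset) the maximal score, while topic-related pairs like travel agent and travel stay low, which is exactly the limitation discussed above.

    # Sketch of a path-based relatedness measure over WordNet noun synsets,
    # using NLTK (requires: import nltk; nltk.download('wordnet')).
    from nltk.corpus import wordnet as wn

    def path_relatedness(word1, word2):
        best = 0.0
        for s1 in wn.synsets(word1, pos=wn.NOUN):
            for s2 in wn.synsets(word2, pos=wn.NOUN):
                # path_similarity = 1 / (1 + length of shortest hypernym path)
                sim = s1.path_similarity(s2)
                if sim is not None and sim > best:
                    best = sim
        return best

    # path_relatedness('car', 'automobile') is 1.0 (same synset), while
    # path_relatedness('travel_agent', 'travel') stays low, since the two
    # terms sit in different noun hierarchies.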

Desired Mapping Results

To test the effect that the desired mapping results have on the precision recall figures, we gathered the users and conducted an extended user study on the tourism domain 6 months after the first user study. Five of the six users participated in the extended study (one user was not available in Trondheim at the time). The five users sat together, discussed the ontologies and made decisions jointly. As a result, a group discussion based gold standard came into being. A group of precision recall curves was generated using different combinations of the confidence levels and the individual vs. group gold standards:

- Precision recall curves under the individual based gold standard at three confidence levels.
- Precision recall curves under the group discussion based gold standard at three confidence levels.
- Precision recall curves of the individual vs. the group discussion based gold standard at each of the three confidence levels.

Figure 8.9: Precision recall curves at three confidence levels in the case of the individual based gold standard in the tourism domain.

Figure 8.9 presents the precision recall curves based on the individual gold standard at three confidence levels. As the figure shows, precision values are higher when the confidence level is high, and this holds at almost all recall levels. It indicates that the system is very accurate in identifying mappings that are more obvious to the users. For these high confidence mappings, a high consensus is achieved both among the users themselves and between the users and the system. Users typically give low confidence to a mapping when they need to make assumptions or construct scenarios under which the mapping may hold true, and choices tend to vary a lot for such low confidence mappings. If we adopt a pessimistic view, i.e., only high confidence mappings are considered valid, the algorithm works better than if a more optimistic view is taken. The figure indicates that the quality of the mapping algorithm may vary significantly in the presence of different mapping goals.
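The curves in this section follow the standard information retrieval definitions of precision and recall. A minimal sketch of how the raw points can be computed from a ranked suggestion list and a chosen (non-empty) gold standard is shown below; the interpolation used for the plotted figures is omitted.

    # Sketch: precision/recall points for a ranked list of suggested mappings,
    # evaluated against a gold standard (a set of correct concept pairs).
    def precision_recall_points(ranked_pairs, gold):
        points, hits = [], 0
        for i, pair in enumerate(ranked_pairs, start=1):
            if pair in gold:
                hits += 1
            # precision: fraction of suggestions so far that are correct;
            # recall: fraction of gold-standard mappings found so far.
            points.append((hits / len(gold), hits / i))
        return points  # list of (recall, precision) pairs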

Figure 8.10: Precision recall curves at three confidence levels in the case of the group discussion based gold standard in the tourism domain.

Figure 8.10 presents the precision recall curves based on the group discussion gold standard at three confidence levels. Unlike the individual case, the precision values under the high confidence gold standard and those under the medium confidence gold standard do not differ significantly. On the other hand, the difference in precision at the same recall level between the high/medium cases and the low confidence case is quite apparent. We observed that during the group discussion, people tended to be more cautious about assigning high confidence to a mapping. As a result, some of the mappings that had been assigned high confidence in the individual case dropped to medium confidence in the group case (in the group discussion session, the users did not have access to the mappings they had made in the first, individual user study).

The same numbers from both the individual and group cases are used to compare the effect of the individual and group gold standards on the precision recall curves at each confidence level. Figures 8.11, 8.12 and 8.13 present the precision recall curves of the individual vs. the group discussion based gold standard at the three confidence levels respectively.

Figure 8.11: Precision recall curves at high confidence level in the case of the individual and group based gold standards in the tourism domain.

At both the high and medium confidence levels, the precision is generally higher under the group gold standard than under the individual gold standard, especially as the recall level increases. We observed that in the group discussion session, users tended to read the ontologies more carefully. Discussions took place when users had different understandings or interpretations of the concepts. It is reasonable to conclude that users put more effort into the group discussion based session than into the individual sessions. Some of the suggestions a user had made in the individual case were identified as false in the group discussion session. In one typical scenario, one user proposed a mapping on the grounds that the two concepts are synonyms, while another user argued that the mapping is not valid, since the concepts actually have different meanings once their structural context (the parent node, the grandparent node, etc.) is taken into consideration. The two argued for a while, and the others added their opinions as well. In the end, all agreed that the mapping was not valid (or at least should not have high confidence). As a result, some of the very obvious mistakes individual users had made vanished through the group discussion phase. It seems that the group discussion based gold standard is the more accurate one.

Figure 8.12: Precision recall curves at medium confidence level in the case of the individual and group based gold standards in the tourism domain.

At the low confidence level, we observed no significant difference between the individual and the group gold standards. This may relate to the fact that when users had a dispute over a mapping, they quite often made compromises in the end: instead of completely deleting a proposed mapping or assigning it high confidence, they would agree, as a middle way, to assign it low confidence. Since the low confidence gold standard includes all the mappings, we end up with more or less the same set of mappings in the individual case and the group case (only at the higher confidence levels do the two sets differ). This explains why the two curves in figure 8.13 are very close.

Figure 8.13: Precision recall curves at low confidence level in the case of the individual and group based gold standards in the tourism domain.

Structural Information

The last experiment we did was to test whether taking structural information into account makes any difference for the mapping accuracy. A rather straightforward test was conducted on the tourism domain. We tuned the structural parameters β and γ to 0 and compared the precision recall curves in that setting with those of the baseline version. Recall that in the baseline version, α = 0.5, β = 0.25, and γ = 0.25. In figure 8.14, the β, γ = 0 version is referred to as the structure-off version, while the baseline is referred to as the structure-on version. As shown in the figure, the structure-off version shows a decrease in precision at recall level 20% and at recall levels 50% and above. This indicates that disregarding structural information completely makes the mapping accuracy worse at high recall levels.

Figure 8.14: Precision recall curves when structural information is turned on/off in the tourism domain.
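The baseline weights suggest that the overall similarity of two concepts is a weighted linear combination of a node-level term and two structural terms. The component names below are our own shorthand, and the linear form is an assumption consistent with the weights used here; the exact formula is the one defined earlier in the thesis.

    sim(c_1, c_2) = \alpha \cdot sim_{node}(c_1, c_2)
                  + \beta \cdot sim_{super}(c_1, c_2)
                  + \gamma \cdot sim_{sub}(c_1, c_2),
    \qquad \alpha + \beta + \gamma = 1

Setting β = γ = 0 removes the structural terms, giving the structure-off version; the structure-on baseline uses α = 0.5 and β = γ = 0.25.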

Discussion

To summarize, the main results of our study are the following:

- The system discovered most of the mappings and ranked them in a useful manner.
- The number of documents, the nature of the terms used in the documents, and the overlap of the ontologies account for the difference in mapping accuracy between the two tasks.
- The effect WordNet has on the mapping accuracy through re-ranking varies with the domain and document characteristics.
- The gold standards significantly influence the results. It seems that a group discussion based gold standard has fewer errors.
- Taking structural information into consideration helps improve the mapping accuracy.

8.3 Concluding Remarks

...user-based evaluation would seem to be much preferable over system evaluation: it is a much more direct measure of the overall goal. However, user-based evaluation is extremely expensive and difficult to do correctly.
Voorhees, 2001 [172]

The effectiveness of the proposed method and its implementation has been tested in this chapter. The algorithm was evaluated in an experiment based on observed data and control data. The mappings manually identified by the users were used as control data. The performance of the algorithm was analyzed by considering the precision (correctness of the predictions) and the recall (capability to predict) values for predicting mappings. A number of variables affect the precision and recall values, and we conducted experiments measuring the effect of those variables. As mentioned, the purpose of the experiment was to test the performance of the proposed approach. Yet the results should be interpreted only as preliminary, because of the limited amount and scope of the test data. Even though the system discovered most of the mappings and ranked them in a useful manner, it is still relevant to ask what prevents the system from achieving even better precision and recall figures.

There are several reasons that prevent imapper from correctly mapping the concepts. One reason is that some of the mappings the users identified are questionable. For example, one user mapped Destination to Hitchhiking and Automotive to Railroad transportation. On the other hand, there are also plausible mappings which the system discovered but which no user identified. For example, the system maps Backpacking to Budget travel, but this mapping was not reported in any of the users' results. Further, successful mapping depends on the successful construction of representative feature vectors that can differentiate one concept from another. The quality of the feature vectors is affected by the number of instance documents and by the natural language processing techniques used to extract textual information from the documents when constructing the feature vectors. A solution to this problem is to use more document instances and to employ more sophisticated textual processing techniques.

Our evaluation experiment was not ideal. We were limited by available resources and, in some cases, by circumstances. We had 2 domains and 6 users in the experiment. These numbers are still too small; such an experiment, however, still gives some credible indication of the performance of the system. The question is whether taking precision and recall figures in isolation from other mapping algorithms is meaningful. What is needed here is a larger scale experiment that compares different systems under similar experimental settings. This requires the community to develop resources like standard source ontologies, benchmark results and evaluation measures. Ideally, we need a venue that serves the kind of function TREC [173] serves in the Information Retrieval community. Another negative aspect of the evaluation is that we are aiming to develop a matcher that is as accurate as human users. As we don't know how human users make the match (i.e., we are not working with a cognitive theory of meaning), this seems to be a moving target. What we are missing is an underlying theory (i.e., a cognitive theory of meaning) that would guide our research and tell us what our automatic matcher needs to simulate or mimic. On the other hand, the usefulness of the approach can also be measured by weighing the amount of human work required to reach the perfect match by adjusting the automatically suggested matches against the amount of human work required to come up with a perfect match from scratch. Due to time and resource limitations, evaluation along this line has not been possible so far.

However, based on the discussion in this chapter, we suggest the following points to be considered in order to effectively measure the amount of human work:

- The time that a user needs to achieve the perfect results in the two settings (adjusting the suggested mappings vs. starting from scratch) is one possible indication of user effort.
- Another way of measuring user effort is to track the user's steps when performing the tasks. It is reasonable to assume that accepting a suggested mapping takes less effort than adding or deleting one. By assigning proper weights to the accept, add, delete, and adjust steps, we could come up with quantitative measures of the effort spent in the two settings (a sketch of such a weighted measure follows this list).
- The definition of the perfect results is still tricky. The process of mapping often involves multiple players; in that case, the perfect results would be the outcome of a social negotiation of meaning. As an added motivation, it would also be interesting to investigate whether the suggested mappings help the players reach agreement with less effort (easier or faster).
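As a sketch of the weighted-step measure in the second point above; the weights and counts are illustrative assumptions only, not measured values.

    # Hypothetical effort model: weight each interaction step and compare the
    # tool-assisted setting with building the mapping set from scratch.
    STEP_WEIGHTS = {"accept": 1.0, "adjust": 2.0, "delete": 2.0, "add": 3.0}

    def user_effort(step_counts):
        # step_counts: e.g. {"accept": 40, "add": 5, "delete": 8, "adjust": 3}
        return sum(STEP_WEIGHTS[step] * n for step, n in step_counts.items())

    # From scratch, every mapping must be added; with the tool, most of the
    # suggestions only need to be accepted.
    assisted = user_effort({"accept": 40, "add": 5, "delete": 8, "adjust": 3})
    scratch = user_effort({"add": 45})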

Chapter 9

Applicability of the Approach: A Scenario

As we mentioned earlier, one way to utilize the derived mapping assertions is to use them to bridge communication between heterogeneous systems. This chapter presents a scenario where the derived mapping assertions are used to improve system interoperability in multi-agent systems. We start with a brief recount of the semantic interoperability problem in a multi-agent setting. Thereafter, we introduce the idea of using explanation as a way to approach that problem. The explanation process is expressed in terms of an explanation ontology shared by the agents who participate in the explanation session. The explanation ontology is defined in a way general enough to support a variety of explanation mechanisms. This chapter describes the explanation ontology and provides a working-through example illustrating how the proposed generic ontology can be used to develop a specific explanation mechanism. Furthermore, the ontology is integrated into a running agent platform, AGORA, to demonstrate the practical usefulness of the approach. Parts of this chapter have been published before [161].

9.1 Introduction

Over the past few years, researchers and industry have both been involved in a great drive towards enabling interoperability between diverse information sources through the explicit modelling of the concepts used in communication. The Semantic Web is one of the most significant endeavours to that end.

The hope is that the Semantic Web can alleviate some of the problems with the current web and let computers process the interchanged data in a more intelligent way. Ontology is a key factor for enabling interoperability in the Semantic Web [12]. An ontology includes an explicit description of the assumptions regarding both the domain structure and the terms used to describe the domain. Ontologies are central to the Semantic Web because they allow applications to agree on the terms that they use when communicating. An ontology facilitates communication by providing precise notions that can be used to compose messages (queries, statements) about the domain. For the receiving party, the ontology helps it understand messages by providing the correct interpretation context. Within a multi-agent system, agents represent their view of the world by explicitly defined ontologies. A multi-agent system is achieved through the reconciliation of these views by a commitment to common ontologies that permit the agents to interoperate and cooperate. Thus ontologies, if shared among stakeholders, will improve system interoperability across agent systems. However, it is highly doubtful that there will ever be one single universal shared ontology applauded by all players. Besides, when ontologies are developed independently of each other in a large, distributed environment such as the web, it is inevitable that the same piece of domain knowledge will be captured in different ontologies. It is therefore highly likely that the involved agent systems will use different ontologies to represent their view of the domain. The problem of improving system interoperability therefore boils down to the reconciliation of the different ontologies used in different agent systems. This reconciliation usually takes the form of mappings that relate concepts in the two ontologies. The mappings are normally computed off-line, either manually or automatically (in most cases, semi-automatically). The mapping approach presented in the previous chapters can be used for that purpose, and a number of approaches introduced in the literature can be used for deriving mappings as well. This chapter concentrates on how to exploit the mappings derived by such methods and how to incorporate them into agent systems in order to achieve the goal of greater semantic interoperability within and across agent systems. Imitating the human way of communication, we propose to view the mappings as a source of explanations to clarify mismatches that occur during agent communication. We base our research on a running agent platform called AGORA.

The AGORA system is a multi-agent system environment which provides support for cooperation between agents [103] [104]. Ontologies are used during communication to give semantic meaning to the contents of messages sent between agents. For AGORA, the need for concept explanation arises when two agents use different ontologies to identify what are in fact similar or related concepts. For example, if two agents want to buy and sell cars, it is possible that they use different product ontologies. A simple mismatch would be that one uses the term car, while the other uses automobile. It is clear, however, that they have overlapping interests, and mechanisms should be developed to enable them to communicate. Explanation, like any other kind of agent communication, is a complex task, and a range of agreements has to be made before any meaningful communication can happen. Therefore, in order to use explanation for reaching consensus on terms used in heterogeneous multi-agent systems, we first need an agreement to use a consensual terminology for the explanation process itself. We denote the agreed conceptualization of the explanation process by the term explanation ontology. Committing to the explanation ontology is a prerequisite for successful explanation. Following the minimal ontology commitment principle [166], we try to keep our ontology simple and small. We believe that a commitment to such an ontology is essential and necessary. The ontology consists of three parts: an explanation interaction protocol, an explanation profile and an explanation strategy. Furthermore, the explanation ontology is seamlessly integrated into the current AGORA system.

The rest of the chapter is organized as follows. First, we introduce some of the basic concepts and terminology in the literature on agent communication in order to lay a clearer foundation for the discussion afterwards. Then, we present the proposed explanation ontology, which includes the interaction protocol, the explanation profile and the explanation strategy. Next, an example is provided to illustrate the idea. Then, we demonstrate how it can be incorporated into AGORA. Related work and future study conclude the chapter.

9.2 Agent Communication

Agents exchange information and knowledge using an Agent Communication Language (ACL) [62] [88]. Existing ACLs are KQML with its many dialects and variants [59] [86], and FIPA ACL [60].

KQML

The KQML (Knowledge Query and Manipulation Language) language (see the KQML Web page at the University of Maryland, Baltimore County) is divided into three layers: the content layer, the message layer, and the communication layer. The content layer bears the actual content of the message, in the program's own representation language. The communication layer encodes a set of features of the message which describe the lower level communication parameters, such as the identity of the sender and recipient, and a unique identifier associated with the communication. The message layer forms the core of the KQML language and determines the kinds of interactions one can have with a KQML-speaking agent. The primary function of the message layer is to supply a speech act or performative which the sender attaches to the content (such as that it is an assertion, a query, a command, or any of a set of known performatives). The KQML language has a set of reserved performatives (communicative acts) with associated arguments. The arguments (parameters) are indexed by keywords and connected to their respective values (key/value pairs). The syntax of KQML is based on the familiar s-expression used in Lisp, i.e., a balanced parenthesis list [88]. The following illustrates the syntax:

    (KQML-performative
       :sender ...
       :content ...
       :receiver ...
       :reply-with ...
       :language ...
       :ontology ...)

A KQML message from agent joe representing a query about the price of a share of IBM stock might be encoded as:

    (ask-one
       :sender joe
       :content (PRICE IBM ?price)
       :receiver stock-server
       :reply-with ibm-stock
       :language LPROLOG
       :ontology NYSE-TICKS)

In this message, the KQML performative is ask-one, the content is (PRICE IBM ?price), the ontology assumed by the query is identified by the token NYSE-TICKS, the receiver of the message is a server identified as stock-server, and the query is written in a language called LPROLOG. The value of the :content keyword forms the content layer; the values of the :reply-with, :sender and :receiver keywords form the communication layer; and the performative name (ask-one, in this case) together with :language and :ontology forms the message layer. KQML's reserved performatives are included in appendix D, but they are neither a minimal required set nor a closed one. In [59] it is emphasized that developers should try to follow the reserved performatives (be KQML compliant) and thereby enable interoperability.
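As an illustration of the syntax, the small helper below (our own sketch, not part of AGORA or any KQML toolkit) assembles such a message string from key/value pairs:

    # Sketch: compose a KQML message string from keyword parameters.
    def kqml_message(performative, **params):
        # Python identifiers cannot contain '-', so reply_with -> :reply-with.
        fields = " ".join(f":{key.replace('_', '-')} {value}"
                          for key, value in params.items())
        return f"({performative} {fields})"

    msg = kqml_message("ask-one",
                       sender="joe",
                       content="(PRICE IBM ?price)",
                       receiver="stock-server",
                       reply_with="ibm-stock",
                       language="LPROLOG",
                       ontology="NYSE-TICKS")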

FIPA

The Foundation for Intelligent Physical Agents (FIPA) is a non-profit association whose purpose is to promote the success of emerging agent-based applications, services and equipment. FIPA's goal is to maximize interoperability across agent-based systems. The FIPA Agent Communication Language (FIPA ACL), like KQML, is based on speech act theory: messages are actions, or communicative acts (FIPA's communicative acts roughly equal performatives in KQML), as they are intended to perform some action by virtue of being sent. Apart from speech act theory-based communicative acts, FIPA ACL also deals with message exchange interaction protocols and content language representations. In short, there are three groups of specifications in FIPA ACL:

- FIPA Communicative Act (CA) specifications deal with the different utterances for ACL messages. Like KQML, FIPA has a set of reserved communicative acts.
- FIPA Interaction Protocol (IP) specifications deal with pre-agreed message exchange protocols for ACL messages. Agents exchange sequences of messages to communicate, and the communication patterns they follow are called interaction protocols. Agents should know, or be able to figure out, the next move according to a message received from another agent and their own state. The current FIPA specification provides the normative description of a set of high-level interaction protocols, including requesting an action, the contract net protocol and several kinds of auctions.
- FIPA Content Language (CL) specifications deal with different representations of the content of ACL messages. The current specification defines the framework for using several languages as content language, including FIPA SL (Semantic Language), FIPA CCL (Constraint Choice Language), KIF (Knowledge Interchange Format), and RDF (Resource Description Framework).

KQML and FIPA ACL are almost identical with respect to their basic concepts and the principles they observe, and differ primarily in the details of their semantic frameworks. For a comparison of the two languages, we refer to [88]. In this work, however, the differences are not the focus. The system AGORA, which we will introduce later on, is both KQML and FIPA compliant. Both KQML and FIPA ACL try to maximize interoperability across agent-based systems by requesting the participating agents to follow the same communication framework. However, at the semantic level, heterogeneity still exists. When two KQML/FIPA-speaking agents use two different ontologies to define their contents, they will still not be able to understand each other, even though the syntax of the message and the intended meaning of the message (such as whether it is a request or an inform) are predefined in the ACL. This work is therefore intended to tackle the interoperability problem at the semantic level.

9.3 The Explanation Ontology

We believe two types of knowledge are required for the agents engaged in an explanation process:

Figure 9.1: The composition of an explanation mechanism.

- Knowledge concerning the domain of interest, i.e., the concepts which represent the objects that are to be explained and the parameters of the explanation.
- Knowledge concerning the explanation, i.e., the concepts that describe the explanation process and the permitted interactions in the process.

The first kind of knowledge is encoded in the agents' own ontologies and in the mapping assertions that relate their ontologies. As mentioned earlier, these are used as sources of explanations. This kind of knowledge varies from domain to domain as well as from application to application. The second kind of knowledge, however, can be generalized and applied across different applications. It is therefore this second part, the knowledge concerning the explanation, that constitutes the explanation ontology. The structuring of the ontology is motivated by the need to provide three essential types of knowledge about the explanation:

- knowledge about interaction (explanation interaction protocol),
- knowledge about the description of the explanation (explanation profile) and
- knowledge about how explanations are derived (explanation strategy).

Each of the three classes provides an essential type of information about the explanation, as characterized in the rest of this section. As a summary, figure 9.1 illustrates the different components involved in a complete explanation mechanism, as well as the relations between them. Two components constitute the explanation mechanism, namely the way of explaining and the source of explaining. Each corresponds to one of the two types of knowledge listed at the beginning of this section. The knowledge about the way of explaining is further partitioned into three parts: the explanation interaction protocol, the explanation profile and the explanation strategy. The three parts together are named the explanation ontology. The knowledge about the source of explaining may come from the mapping assertions, which can be generated using, for example, the imapper system, or from a generic electronic dictionary or any other source that relates the concepts in the two agent ontologies. The rest of the chapter focuses on the explanation ontology.

The generic explanation ontology is intended to capture the similarities between different explanation mechanisms. It can be used as a classification framework that permits the analysis of the available explanation mechanisms and, more importantly, the development of new ones. Furthermore, by committing to the same high-level concepts, the communication among agents is facilitated in a more flexible way. It should be noted that while we define a particular generic ontology for interaction, for profile, and for strategy, the construction of alternative approaches in each case is allowed. Our intention here is not to prescribe a single approach in each of the three areas, but rather to provide generic approaches that will be useful in the majority of cases. In the following three subsections we discuss the resulting explanation interaction protocol, explanation profile, and explanation strategy in greater detail.

Explanation Interaction Protocol

An explanation interaction protocol defines how the explanation is performed. In particular, it describes the dataflow and the possible interactions among participants during the explanation process.

Figure 9.2: An ER model of the general explanation interaction protocol.

The concept of an explanation interaction protocol is a pre-specified pattern of message exchange between agents. It is a pragmatic solution for agent conversation: an agent can engage in meaningful conversation with other agents simply by carefully following the interaction protocol. There can be different explanation interaction protocols. A general explanation interaction protocol is built by generalizing the commonalities of the different protocols. Figure 9.2 identifies the main concepts of the general protocol. The concept protocol defines a generic explanation interaction protocol. Several roles are involved in a protocol, and each role is played by one or more agents (participants). Possible roles are the Initiator, the agent who asks for explanation; the Explainer, the agent who provides explanation; and the Explanation Manager, who mediates the explanation. A protocol is also guided by a number of explanation rules. Each rule specifies what action a role is to take when a certain precondition is met. By refining the concepts in the general protocol we can define different specific protocols. The refinement of a concept is achieved by restricting the value set of an attribute or by adding new attributes to a concept. For example, we define the concept protocol to have an attribute has-role whose minimum cardinality is 2. By that, we say that at least two agents (initiator and explainer) need to be engaged in a protocol. When defining a mediated explanation protocol, however, we restrict the minimum cardinality to 3, since an explanation manager, who functions as a mediator, is added to the interaction. We consider a general framework for presenting such protocols as well as some specific protocols. A specific explanation interaction protocol is presented in the example section.

Figure 9.3: An ER model of the main concepts in the explanation profile.

Explanation Profile

The explanation profile defines the main concepts that are used in the explanation. It is presented in figure 9.3. We use RDF Schema to encode the profile in order to be compatible with the Semantic Web initiative. The main concepts in the explanation are query and explanation. Each query is in correspondence with a number of explanations. Each query consists of one source element, one source ontology and one target ontology. Each explanation concerns two ontology elements, one source and one target, and each ontology element belongs to an ontology. An explanation also has a type, which defines the kind of relationship between the corresponding source and target ontology elements. Ideally, the correspondence between ontology elements should be a translation, where the semantics of the concept are completely preserved. However, transformations (mappings that lose some of the semantics) are also permitted, to allow for approximate explanation. Thus the type of an explanation can be one of the following:

1. Similar concept (car and automobile)
2. Narrower concept (station wagon is a car)
3. Broader concept (car is-a-kind-of vehicle)
4. Related concept (car is related to transport)

A degree is associated with each explanation to indicate the confidence of the mapping. Additional explanation methods can easily be added by making another instance of ExplanationType, for example, logical expression. It is worth noting that the explanation profile is not intended to be an exhaustive list of methods for solving semantic heterogeneity between agents; rather, it serves as an anchor point for accommodating potentially useful techniques that deal with resolving ontology mismatches. Newly agreed explanation methods can be added to the explanation ontology and integrated into the AGORA system afterwards.
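To make the profile concrete, the following minimal sketch renders the query and explanation structures in code form. The names and fields are our paraphrase of figure 9.3, not the normative RDF Schema encoding used in the thesis.

    # Sketch of the explanation profile structures described above.
    from dataclasses import dataclass
    from enum import Enum

    class ExplanationType(Enum):
        SIMILAR = "similar concept"
        NARROWER = "narrower concept"
        BROADER = "broader concept"
        RELATED = "related concept"

    @dataclass
    class Explanation:
        source_element: str    # element of the source ontology
        target_element: str    # element of the target ontology
        type: ExplanationType  # relationship between the two elements
        degree: float          # confidence of the mapping, e.g. 0.8

    @dataclass
    class Query:
        source_element: str
        source_ontology: str
        target_ontology: str
        explanations: list     # the corresponding Explanation objects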

Explanation Strategy

The explanation strategy describes how the explainer analyses the discrepancy and what evaluation criteria the initiator uses to decide whether or not to accept a certain explanation. For the explainer, different strategies may be configured by combining the following aspects:

- Source of explanation. The explainer may use different information sources to derive explanations. Possible sources are mappings between local ontologies, mappings between local and global ontologies, and external lexicons (e.g., WordNet [112]).
- Ranking strategy. When multiple explanations are available for a given concept, a ranking strategy is needed to determine which one to use first. The ranking strategy may take the form of a set of ranking rules, or employ a ranking function to calculate a numerical figure for each explanation in the result set.
- Termination criteria. The explainer can terminate the process when no more explanations are available, or when a predefined maximum number of explanation rounds is exceeded.

On the initiator side, two main aspects define its strategy:

1. Acceptance criteria. This is the strategy about when to accept an explanation, when to reject it, and when to ask for more explanation.
2. Termination criteria. This defines when to terminate the explanation process.

Based on the general framework discussed above, specific configurations of the explanation strategy can be introduced (for example, the explanation strategy in the next section).

9.4 A Working Through Example

In order to better illustrate how the generic explanation ontology presented above can be used to develop a new explanation mechanism, we consider the application scenario of an electronic market place. In an electronic market place, where buyers and sellers are brought together, each individual participant (possibly a software agent) may use its own product catalogue to represent the required and available products respectively [54]. In that context, making it possible for agents to understand what is required and what is offered becomes nontrivial. We see explanation as one possible solution. A specific explanation interaction protocol, the generic explanation profile, and a specific explanation strategy together constitute this specific explanation mechanism.

Two Product Catalogues

We have given an extensive account of the product catalogue integration problem in chapter 8, and here we use the same context. Consider the two product catalogue extracts in figure 9.4. The first product encoding standard, UNSPSC, contains thousands of categories organized into four levels. Each of the UNSPSC category definitions contains a category code and a short description of the product (for example, the category Personal communication devices). The other product catalogue is ecl@ss, which likewise defines thousands of categories organized in a four-level taxonomy. It is generally understood that UNSPSC classifies products from a supplier's perspective, while ecl@ss does so from a user's perspective. We assume that a supplier uses the UNSPSC standard, while a buyer uses the ecl@ss standard. We further assume that the mappings between the two standards are maintained by a service agent. The mappings can be derived either by the mapping approach described in this work or any of the other approaches introduced in the literature.

Figure 9.4: Segments of two product catalogues: (a) segment of UNSPSC; (b) segment of ecl@ss.


More information

NOTES ON OBJECT-ORIENTED MODELING AND DESIGN

NOTES ON OBJECT-ORIENTED MODELING AND DESIGN NOTES ON OBJECT-ORIENTED MODELING AND DESIGN Stephen W. Clyde Brigham Young University Provo, UT 86402 Abstract: A review of the Object Modeling Technique (OMT) is presented. OMT is an object-oriented

More information

a paradigm for the Introduction to Semantic Web Semantic Web Angelica Lo Duca IIT-CNR Linked Open Data:

a paradigm for the Introduction to Semantic Web Semantic Web Angelica Lo Duca IIT-CNR Linked Open Data: Introduction to Semantic Web Angelica Lo Duca IIT-CNR angelica.loduca@iit.cnr.it Linked Open Data: a paradigm for the Semantic Web Course Outline Introduction to SW Give a structure to data (RDF Data Model)

More information

An agent-based peer-to-peer grid computing architecture

An agent-based peer-to-peer grid computing architecture University of Wollongong Research Online University of Wollongong Thesis Collection 1954-2016 University of Wollongong Thesis Collections 2005 An agent-based peer-to-peer grid computing architecture Jia

More information

Semantic Web. Ontology Engineering and Evaluation. Morteza Amini. Sharif University of Technology Fall 95-96

Semantic Web. Ontology Engineering and Evaluation. Morteza Amini. Sharif University of Technology Fall 95-96 ه عا ی Semantic Web Ontology Engineering and Evaluation Morteza Amini Sharif University of Technology Fall 95-96 Outline Ontology Engineering Class and Class Hierarchy Ontology Evaluation 2 Outline Ontology

More information

1 Executive Overview The Benefits and Objectives of BPDM

1 Executive Overview The Benefits and Objectives of BPDM 1 Executive Overview The Benefits and Objectives of BPDM This is an excerpt from the Final Submission BPDM document posted to OMG members on November 13 th 2006. The full version of the specification will

More information

A Collaborative User-centered Approach to Fine-tune Geospatial

A Collaborative User-centered Approach to Fine-tune Geospatial A Collaborative User-centered Approach to Fine-tune Geospatial Database Design Grira Joel Bédard Yvan Sboui Tarek 16 octobre 2012 6th International Workshop on Semantic and Conceptual Issues in GIS - SeCoGIS

More information

The Semantic Web & Ontologies

The Semantic Web & Ontologies The Semantic Web & Ontologies Kwenton Bellette The semantic web is an extension of the current web that will allow users to find, share and combine information more easily (Berners-Lee, 2001, p.34) This

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 4, Jul-Aug 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 4, Jul-Aug 2015 RESEARCH ARTICLE OPEN ACCESS Multi-Lingual Ontology Server (MOS) For Discovering Web Services Abdelrahman Abbas Ibrahim [1], Dr. Nael Salman [2] Department of Software Engineering [1] Sudan University

More information

Web Semantic Annotation Using Data-Extraction Ontologies

Web Semantic Annotation Using Data-Extraction Ontologies Web Semantic Annotation Using Data-Extraction Ontologies A Dissertation Proposal Presented to the Department of Computer Science Brigham Young University In Partial Fulfillment of the Requirements for

More information

Requirements Engineering for Enterprise Systems

Requirements Engineering for Enterprise Systems Association for Information Systems AIS Electronic Library (AISeL) AMCIS 2001 Proceedings Americas Conference on Information Systems (AMCIS) December 2001 Requirements Engineering for Enterprise Systems

More information

The Semantic Web: A Vision or a Dream?

The Semantic Web: A Vision or a Dream? The Semantic Web: A Vision or a Dream? Ben Weber Department of Computer Science California Polytechnic State University May 15, 2005 Abstract The Semantic Web strives to be a machine readable version of

More information

CHAPTER 5 SEARCH ENGINE USING SEMANTIC CONCEPTS

CHAPTER 5 SEARCH ENGINE USING SEMANTIC CONCEPTS 82 CHAPTER 5 SEARCH ENGINE USING SEMANTIC CONCEPTS In recent years, everybody is in thirst of getting information from the internet. Search engines are used to fulfill the need of them. Even though the

More information

Rich Hilliard 20 February 2011

Rich Hilliard 20 February 2011 Metamodels in 42010 Executive summary: The purpose of this note is to investigate the use of metamodels in IEEE 1471 ISO/IEC 42010. In the present draft, metamodels serve two roles: (1) to describe the

More information

Trust4All: a Trustworthy Middleware Platform for Component Software

Trust4All: a Trustworthy Middleware Platform for Component Software Proceedings of the 7th WSEAS International Conference on Applied Informatics and Communications, Athens, Greece, August 24-26, 2007 124 Trust4All: a Trustworthy Middleware Platform for Component Software

More information

Semantic web. Tapas Kumar Mishra 11CS60R32

Semantic web. Tapas Kumar Mishra 11CS60R32 Semantic web Tapas Kumar Mishra 11CS60R32 1 Agenda Introduction What is semantic web Issues with traditional web search The Technology Stack Architecture of semantic web Meta Data Main Tasks Knowledge

More information

Proposed Revisions to ebxml Technical Architecture Specification v ebxml Business Process Project Team

Proposed Revisions to ebxml Technical Architecture Specification v ebxml Business Process Project Team 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 Proposed Revisions to ebxml Technical Architecture Specification v1.0.4 ebxml Business Process Project Team 11

More information

Towards the Semantic Web

Towards the Semantic Web Towards the Semantic Web Ora Lassila Research Fellow, Nokia Research Center (Boston) Chief Scientist, Nokia Venture Partners LLP Advisory Board Member, W3C XML Finland, October 2002 1 NOKIA 10/27/02 -

More information

An Improving for Ranking Ontologies Based on the Structure and Semantics

An Improving for Ranking Ontologies Based on the Structure and Semantics An Improving for Ranking Ontologies Based on the Structure and Semantics S.Anusuya, K.Muthukumaran K.S.R College of Engineering Abstract Ontology specifies the concepts of a domain and their semantic relationships.

More information

Chapter 3 Research Method

Chapter 3 Research Method Chapter 3 Research Method 3.1 A Ontology-Based Method As we mention in section 2.3.6, we need a common approach to build up our ontologies for different B2B standards. In this chapter, we present a ontology-based

More information

Helmi Ben Hmida Hannover University, Germany

Helmi Ben Hmida Hannover University, Germany Helmi Ben Hmida Hannover University, Germany 1 Summarizing the Problem: Computers don t understand Meaning My mouse is broken. I need a new one 2 The Semantic Web Vision the idea of having data on the

More information

Recommended Practice for Software Requirements Specifications (IEEE)

Recommended Practice for Software Requirements Specifications (IEEE) Recommended Practice for Software Requirements Specifications (IEEE) Author: John Doe Revision: 29/Dec/11 Abstract: The content and qualities of a good software requirements specification (SRS) are described

More information

Extension and integration of i* models with ontologies

Extension and integration of i* models with ontologies Extension and integration of i* models with ontologies Blanca Vazquez 1,2, Hugo Estrada 1, Alicia Martinez 2, Mirko Morandini 3, and Anna Perini 3 1 Fund Information and Documentation for the industry

More information

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.4, April 2009 349 WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY Mohammed M. Sakre Mohammed M. Kouta Ali M. N. Allam Al Shorouk

More information

Ontology-Driven Conceptual Modelling

Ontology-Driven Conceptual Modelling Ontology-Driven Conceptual Modelling Nicola Guarino Conceptual Modelling and Ontology Lab National Research Council Institute for Cognitive Science and Technologies (ISTC-CNR) Trento-Roma, Italy Acknowledgements

More information

An Annotation Tool for Semantic Documents

An Annotation Tool for Semantic Documents An Annotation Tool for Semantic Documents (System Description) Henrik Eriksson Dept. of Computer and Information Science Linköping University SE-581 83 Linköping, Sweden her@ida.liu.se Abstract. Document

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

Agent-Enabling Transformation of E-Commerce Portals with Web Services

Agent-Enabling Transformation of E-Commerce Portals with Web Services Agent-Enabling Transformation of E-Commerce Portals with Web Services Dr. David B. Ulmer CTO Sotheby s New York, NY 10021, USA Dr. Lixin Tao Professor Pace University Pleasantville, NY 10570, USA Abstract:

More information

Applying the Semantic Web Layers to Access Control

Applying the Semantic Web Layers to Access Control J. Lopez, A. Mana, J. maria troya, and M. Yague, Applying the Semantic Web Layers to Access Control, IEEE International Workshop on Web Semantics (WebS03), pp. 622-626, 2003. NICS Lab. Publications: https://www.nics.uma.es/publications

More information

Introduction to the Semantic Web

Introduction to the Semantic Web Introduction to the Semantic Web Charlie Abela Department of Artificial Intelligence charlie.abela@um.edu.mt Lecture Outline Course organisation Today s Web limitations Machine-processable data The Semantic

More information

Semantic Annotation for Process Models:

Semantic Annotation for Process Models: Yun Lin Semantic Annotation for Process Models: Facilitating Process Knowledge Management via Semantic Interoperability Department of Computer and Information Science Norwegian University of Science and

More information

Ontology Based Prediction of Difficult Keyword Queries

Ontology Based Prediction of Difficult Keyword Queries Ontology Based Prediction of Difficult Keyword Queries Lubna.C*, Kasim K Pursuing M.Tech (CSE)*, Associate Professor (CSE) MEA Engineering College, Perinthalmanna Kerala, India lubna9990@gmail.com, kasim_mlp@gmail.com

More information

The Analysis and Proposed Modifications to ISO/IEC Software Engineering Software Quality Requirements and Evaluation Quality Requirements

The Analysis and Proposed Modifications to ISO/IEC Software Engineering Software Quality Requirements and Evaluation Quality Requirements Journal of Software Engineering and Applications, 2016, 9, 112-127 Published Online April 2016 in SciRes. http://www.scirp.org/journal/jsea http://dx.doi.org/10.4236/jsea.2016.94010 The Analysis and Proposed

More information

The Semantic Planetary Data System

The Semantic Planetary Data System The Semantic Planetary Data System J. Steven Hughes 1, Daniel J. Crichton 1, Sean Kelly 1, and Chris Mattmann 1 1 Jet Propulsion Laboratory 4800 Oak Grove Drive Pasadena, CA 91109 USA {steve.hughes, dan.crichton,

More information

CE4031 and CZ4031 Database System Principles

CE4031 and CZ4031 Database System Principles CE431 and CZ431 Database System Principles Course CE/CZ431 Course Database System Principles CE/CZ21 Algorithms; CZ27 Introduction to Databases CZ433 Advanced Data Management (not offered currently) Lectures

More information

Intelligent flexible query answering Using Fuzzy Ontologies

Intelligent flexible query answering Using Fuzzy Ontologies International Conference on Control, Engineering & Information Technology (CEIT 14) Proceedings - Copyright IPCO-2014, pp. 262-277 ISSN 2356-5608 Intelligent flexible query answering Using Fuzzy Ontologies

More information

Information Management (IM)

Information Management (IM) 1 2 3 4 5 6 7 8 9 Information Management (IM) Information Management (IM) is primarily concerned with the capture, digitization, representation, organization, transformation, and presentation of information;

More information

Proposed Revisions to ebxml Technical. Architecture Specification v1.04

Proposed Revisions to ebxml Technical. Architecture Specification v1.04 Proposed Revisions to ebxml Technical Architecture Specification v1.04 Business Process Team 11 May 2001 (This document is the non-normative version formatted for printing, July 2001) Copyright UN/CEFACT

More information

An Evaluation of Geo-Ontology Representation Languages for Supporting Web Retrieval of Geographical Information

An Evaluation of Geo-Ontology Representation Languages for Supporting Web Retrieval of Geographical Information An Evaluation of Geo-Ontology Representation Languages for Supporting Web Retrieval of Geographical Information P. Smart, A.I. Abdelmoty and C.B. Jones School of Computer Science, Cardiff University, Cardiff,

More information

Summary of Contents LIST OF FIGURES LIST OF TABLES

Summary of Contents LIST OF FIGURES LIST OF TABLES Summary of Contents LIST OF FIGURES LIST OF TABLES PREFACE xvii xix xxi PART 1 BACKGROUND Chapter 1. Introduction 3 Chapter 2. Standards-Makers 21 Chapter 3. Principles of the S2ESC Collection 45 Chapter

More information

INTRODUCTION Background of the Problem Statement of the Problem Objectives of the Study Significance of the Study...

INTRODUCTION Background of the Problem Statement of the Problem Objectives of the Study Significance of the Study... vii TABLE OF CONTENTS CHAPTER TITLE PAGE DECLARATION... ii DEDICATION... iii ACKNOWLEDGEMENTS... iv ABSTRACT... v ABSTRAK... vi TABLE OF CONTENTS... vii LIST OF TABLES... xii LIST OF FIGURES... xiii LIST

More information

Business Rules in the Semantic Web, are there any or are they different?

Business Rules in the Semantic Web, are there any or are they different? Business Rules in the Semantic Web, are there any or are they different? Silvie Spreeuwenberg, Rik Gerrits LibRT, Silodam 364, 1013 AW Amsterdam, Netherlands {silvie@librt.com, Rik@LibRT.com} http://www.librt.com

More information

Knowledge Representation, Ontologies, and the Semantic Web

Knowledge Representation, Ontologies, and the Semantic Web Knowledge Representation, Ontologies, and the Semantic Web Evimaria Terzi 1, Athena Vakali 1, and Mohand-Saïd Hacid 2 1 Informatics Dpt., Aristotle University, 54006 Thessaloniki, Greece evimaria,avakali@csd.auth.gr

More information

Ontology Exemplification for aspocms in the Semantic Web

Ontology Exemplification for aspocms in the Semantic Web Ontology Exemplification for aspocms in the Semantic Web Anand Kumar Department of Computer Science Babasaheb Bhimrao Ambedkar University Lucknow-226025, India e-mail: anand_smsvns@yahoo.co.in Sanjay K.

More information

Linked Open Data: a short introduction

Linked Open Data: a short introduction International Workshop Linked Open Data & the Jewish Cultural Heritage Rome, 20 th January 2015 Linked Open Data: a short introduction Oreste Signore (W3C Italy) Slides at: http://www.w3c.it/talks/2015/lodjch/

More information

TOMRAS: A Task Oriented Mobile Remote Access System for Desktop Applications

TOMRAS: A Task Oriented Mobile Remote Access System for Desktop Applications DOCTORAL DISSERTATION TOMRAS: A Task Oriented Mobile Remote Access System for Desktop Applications Khaled Khankan kkhankan@it. uts. edu. au Supervised by: Prof. Robert Steele robert.steele@usyd.edu.au

More information

On the Relation Between "Semantically Tractable" Queries and AURA s Question Formulation Facility

On the Relation Between Semantically Tractable Queries and AURA s Question Formulation Facility On the Relation Between "Semantically Tractable" Queries and AURA s Question Formulation Facility 1. Introduction Working Note 34 Peter Clark, peter.e.clark@boeing.com, Sept 2009 In 2003, Popescu, Etzioni

More information

X-KIF New Knowledge Modeling Language

X-KIF New Knowledge Modeling Language Proceedings of I-MEDIA 07 and I-SEMANTICS 07 Graz, Austria, September 5-7, 2007 X-KIF New Knowledge Modeling Language Michal Ševčenko (Czech Technical University in Prague sevcenko@vc.cvut.cz) Abstract:

More information