SCALABLE MATCHING OF ONTOLOGY GRAPHS USING PARTITIONING


SCALABLE MATCHING OF ONTOLOGY GRAPHS USING PARTITIONING

by

RAVIKANTH KOLLI

(Under the Direction of Prashant Doshi)

ABSTRACT

The problem of ontology matching is crucial due to the decentralized development and publication of ontological data. One proposed approach casts the match between two ontologies as a maximum likelihood problem and solves it using the technique of expectation-maximization (EM). This approach identifies the structural and lexical similarities between the ontology graphs and produces a many-one map between their concept nodes. Because the computations during the EM iterations are highly intensive, it becomes extremely difficult to work with larger ontologies. In order to scale the method to large ontologies, we identify the computational bottlenecks and modify the generalized EM by using a memory-bounded partitioning scheme. We further improve the algorithm by implementing two additional similarity measures: edge label similarity and instance similarity. We also present a tool for visual ontology alignment called Optima, which uses generalized EM for matching and is supported by an intuitive, interactive user interface that facilitates the visualization and analysis of ontologies in N3, RDF and OWL, as well as of the alignment results. We provide comparative experimental results in support of our method on two well-known ontology alignment benchmarks and discuss their implications.

Keywords: Ontologies, matching, homomorphism, expectation-maximization, OWL, RDF.

SCALABLE MATCHING OF ONTOLOGY GRAPHS USING PARTITIONING

by

RAVIKANTH KOLLI

B.E., Osmania University, India, 2006

A Thesis Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment of the Requirements for the Degree

MASTER OF SCIENCE

ATHENS, GEORGIA

2008

© 2008 Ravikanth Kolli

All Rights Reserved

SCALABLE MATCHING OF ONTOLOGY GRAPHS USING PARTITIONING

by

RAVIKANTH KOLLI

Major Advisor: Prashant Doshi
Committee: John A. Miller
           Ismailcem Budak Arpinar

Electronic Version Approved:

Maureen Grasso
Dean of the Graduate School
The University of Georgia
August 2008

DEDICATION

To my family, who have supported me all through the path towards achieving my goals and aspirations.

ACKNOWLEDGEMENTS

I would like to thank Dr. Prashant Doshi for his guidance, encouragement and support throughout my academic and research career at the University of Georgia, and for giving me the opportunity to contribute to the field of the Semantic Web and ontologies. I would also like to thank Dr. John A. Miller and Dr. Budak Arpinar for their valuable suggestions during my course work and on my thesis. I would like to thank my sister Vasanthi for all the support that she has given me during my Masters, without which it would not have been possible. Last but not least, I would like to thank my friends for their support and for just being there for me during my stay at UGA.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
   1.1 Ontology Matching And Its Applications
   1.2 Challenges Of Ontology Alignment
   1.3 Structure Of Thesis
2. BACKGROUND
   2.1 Ontology
   2.2 Resource Description Framework (RDF)
   2.3 Web Ontology Language (OWL)
   2.4 Ontology Matching
   2.5 Expectation Maximization
3. RELATED WORK
   3.1 Ontology Matching Systems
   3.2 Comparison Between Matching Algorithms
4. ONTOLOGY MATCHING USING EXPECTATION MAXIMIZATION
   4.1 Graph Matching Using GEM
   4.2 Expectation Step
   4.3 Maximization Step
   4.4 Lexical Similarity Between Concept Names
   4.5 Random Sampling with Local Improvements
   4.6 Computational Complexity
5. SCALABLE MATCHING USING MEMORY-BOUNDED PARTITIONING
   5.1 Partitioning Of Ontologies
   5.2 Instance Based Similarity
   5.3 Edge Labels Based Similarity
   5.4 Optima
6. EVALUATION
   6.1 The OAEI Campaign and its Test Cases
   6.2 I3CON Repository
7. CONCLUSION AND FUTURE WORK
   7.1 Summary Of Contributions
   7.2 Future Work
APPENDIX
REFERENCES

LIST OF FIGURES

1.1: Ontology Matching
1.2: Data integration using ontology matching
2.1: Ontology
3.1: The GLUE Architecture
3.2: System Architecture
3.3: COMA System Architecture
3.4: RiMOM System Architecture
3.5: Architecture of PRIOR+ Approach
4.1: Example application of heuristics
4.2: Incorrect matching using heuristics
5.1: Partitioning of Ontologies
5.2: Transforming an edge labeled graph into bipartite graph
5.3: Screenshot of Optima User Interface showing two loaded ontologies
5.4: Optima highlighting the matched nodes after alignment
6.1: Average recall, precision and F-measure of ontologies in the OAEI 2006 benchmark tests
6.2: Performance comparison with the other participating ontology matchers
6.3: Performances on very large ontology pairs independently developed by different groups of domain experts
6.4: Performance comparisons with other ontology matchers on very large ontology pairs
6.5: Recall, precision and F-measure of the identified matches between the ontologies
6.6: Performance comparison with the other participating ontology matchers

A.1: Optima User Interface
A.2: The load dialog box
A.3: Status bar to choose seed map
A.4: Status bar to enter parameters
A.5: Alignment progress dialog
A.6: Optima displaying the matched nodes

LIST OF TABLES

3-1: Comparison of Alignment Tools based on Matchers
3-2: Comparison of Alignment Tools based on I/O
6-1: Ontology pairs from I3CON repository

1. INTRODUCTION

The growing usefulness of the semantic Web is fueled in part by the development and publication of an increasing number of ontologies. Ontologies are formalizations of commonly agreed upon knowledge, often specific to a domain. Each ontology consists of a set of concepts and relationships between the concepts, typically organized in the form of a directed graph. Rather than a central repository of ontologies, we are witnessing the growth of disparate communities of ontologies that cater to specific applications. Naturally, many of these communities contain ontologies that describe the same or overlapping domains but use different concept names and may exhibit varying structure. Because the development of these ontologies is occurring in a decentralized manner, the problem of matching similar ontologies to produce an alignment, and of merging them into a single comprehensive ontology, gains importance.

1.1 ONTOLOGY MATCHING AND ITS APPLICATIONS

The semantic web presents many technologies and features that overcome the limitations of the current web. Ontologies play an important role in the semantic web and are used to address the problem of semantic heterogeneity. Given many ontologies of the same domain, ontology matching aims to find semantic correspondences between similar elements of the different ontologies. In other words, ontology alignment tools find classes of data that are semantically equivalent. The goal of this thesis is to make heterogeneous information more accessible.

Figure 1.1: Ontology Matching (a data ontology with the concepts APC, Conventional Weapon, Combat Vehicle and Tank Vehicle is matched against a model ontology with the concepts Conventional Weapon, Armored Vehicle and Tank Vehicle, connected by is-a relationships)

The figure above illustrates an example of ontology matching. It contains two ontologies: a model ontology and a data ontology. Following conventional EM and graph matching terminology, the algorithm treats the ontology with the larger number of nodes as the data ontology and the one with the smaller number of nodes as the model ontology. This is an example of a many-one ontology match, with the matched pairs being (APC, Tank Vehicle) and (Tank Vehicle, Tank Vehicle). Thus, our ontology matching algorithm involves discovering many-one correspondences from the nodes of the data graph to the nodes of the model graph.

Ontology matching is important to the success of the Semantic Web, which is characterized by the extensive use of software agents and web services. With the decentralized development of ontologies, agents need a way to interpret each other's ontologies. Ontology alignment provides the mapping between ontologies that allows agents and services either to translate their messages or to integrate bridge axioms into their models.

Figure 1.2: Data integration using ontology matching

Ontologies are extensively used in database integration and information transformation. For example, consider the scenario [8] in Figure 1.2, where the data resides in two different data sources, D1 and D2, which are associated with ontologies O1 and O2. To integrate instances from the data sources, a mapping relation m between the ontologies has to be generated.

Ontology matching is also used in the area of sensor networks. Identification and categorization of metadata on incoming data feeds is a key step for the automatic publication and querying of streaming data on Web-based portals like SensorMap. Several data providers develop and curate metadata on their feeds, often in the form of ontologies, yet the existing frameworks of data publishers make no provision for these provider-defined ontologies. Ontology matching aims to link the provider ontologies with those of the publishers, thereby automating the step of metadata registration and identification. Peer-to-peer systems, e-commerce, semantic web services and social networks [9] are some of the other areas where ontology matching is widely used.

1.2 CHALLENGES OF ONTOLOGY ALIGNMENT

Though many researchers work on ontology matching, ontology matching systems still have a long way to go. Ontologies are created to describe the existence of things in the world by different people, who usually have different viewpoints about what the world looks like. The resulting information heterogeneity [8] and the importance of ontology matching in different applications motivate our research interest in this area. The ultimate goal of our research is to provide a solution to the problem of ontology matching, and thus enable semantic interoperability between different web applications and services in the WWW. In this thesis, we adopt precision, recall and F-measure as our evaluation criteria, consistent with the approach used by other researchers. We represent our mapping results as a list of mapping pairs between the concepts in the ontologies.

1.3 STRUCTURE OF THESIS

The document is organized as follows. In the next chapter, we briefly explain ontologies, languages for representing ontologies, ontology matching, and the EM technique. Chapter 3 reviews representative work by different researchers on ontology matching, and we provide a comparison between the different algorithms. In Chapter 4, we describe the ontology matching approach using the generalized EM scheme; we also discuss the computational complexity of the algorithm and the reasons the system is unable to scale to large ontologies. Chapter 5 describes the improvements in scalability and the enhancements to the features of matching using EM. We also describe an ontology alignment visualization tool called Optima that has an interactive user interface and uses generalized EM as the underlying algorithm for matching. In Chapter 6 we present the evaluation results of the algorithm based on test cases from the Ontology Alignment Evaluation Initiative (OAEI), and we conclude in Chapter 7.

2. BACKGROUND

2.1 ONTOLOGY

There are several definitions of the term ontology in the literature. One of the most cited is the one proposed by Gruber [10]: an ontology is a formal, explicit specification of a shared conceptualization. A more recent definition describes an ontology as a set of concepts, properties of each concept describing various features and attributes of the concept, and instances of the concepts [11]. Ontologies enable knowledge sharing and reuse, where information resources can be exchanged between humans or software agents. The semantic relationships in ontologies are machine readable, and thus enable inferencing and answering queries about a subject domain. According to [11], ontologies are used to build knowledge bases, where a knowledge base is formed by adding instances to the concepts in the ontology; agents then use, reuse and maintain these knowledge bases. The following figure [5] is an example ontology of a computer science department domain with concepts and properties.

Figure 2.1: Ontology

Ontologies can be used to support a great variety of tasks in diverse research areas such as knowledge representation, natural language processing, information retrieval, databases, knowledge management, online database integration, digital libraries, geographic information systems, and visual information retrieval or multi-agent systems.

2.2 RESOURCE DESCRIPTION FRAMEWORK

The Resource Description Framework (RDF) [1, 2] is a general purpose language for representing information on the web. It is a foundation for processing metadata; it provides interoperability between applications that exchange machine-understandable information on the Web, and it emphasizes facilities that enable the automated processing of Web resources. RDF provides an information model based on graphs: each edge corresponds to a statement whose predicate is the label of the edge, whose subject is the source vertex of the edge, and whose object is the target of the edge. Like HTML, RDF is machine processable and, using URIs, can link pieces of information across the web. RDF can be used in a variety of application areas: in resource discovery, to provide better search engine capabilities; in cataloging, for describing the content and content relationships available at a particular Web site, page, or digital library; by intelligent software agents, to facilitate knowledge sharing and exchange; in content rating; in describing collections of pages that represent a single logical "document"; for describing the intellectual property rights of Web pages; and for expressing the privacy preferences of a user as well as the privacy policies of a Web site. RDF with digital signatures will be key to building the "Web of Trust" for electronic commerce, collaboration, and other applications.
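To make the graph model concrete, the short sketch below builds a two-statement RDF graph with the rdflib Python library. It is a minimal illustration on our part; the example namespace and class names are hypothetical, not taken from the thesis.

```python
from rdflib import Graph, Namespace, RDFS

EX = Namespace("http://example.org/vehicles#")  # hypothetical namespace

g = Graph()
# Each add() creates one edge: (subject, predicate-as-edge-label, object).
g.add((EX.APC, RDFS.subClassOf, EX.ConventionalWeapon))
g.add((EX.TankVehicle, RDFS.subClassOf, EX.CombatVehicle))

for s, p, o in g:  # iterate the statements, i.e., the labeled edges
    print(s, p, o)
```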

2.3 WEB ONTOLOGY LANGUAGE

The OWL Web Ontology Language [3] is designed for use in applications that need to process the content of information rather than just present it. OWL enables greater machine interpretability of web content than that supported by XML, RDF and RDF Schema (RDF-S) by providing additional vocabulary along with a formal semantics. OWL defines three sublanguages:

OWL Lite: supports those users who primarily need a classification hierarchy and simple constraints;
OWL DL: supports those users who want maximum expressiveness while retaining computational completeness and decidability;
OWL Full: supports users who want maximum expressiveness and the syntactic freedom of RDF, with no computational guarantees.

Among these, OWL Lite is a subset of OWL DL, and OWL DL is a subset of OWL Full. OWL adds more vocabulary for describing properties and classes: among others, relations between classes such as disjointness, cardinality, equality, enumerated classes, and additional features of properties. Inference can be specified as part of the semantics of the language. The inferences that the dialects of OWL support go beyond various other forms of representation to capture increasingly subtle shades of meaning, and each well-formed OWL document legitimizes a number of inferences.

2.4 ONTOLOGY MATCHING

Ontology matching aims to find correspondences between semantically related entities of different ontologies. The need for ontology matching arose out of the need to integrate heterogeneous databases, each developed independently and thus each having its own data vocabulary. In the semantic web context, with its decentralized development of ontologies, matching ontologies and merging them into a single comprehensive ontology gain importance. Matching between ontologies is a critical challenge for semantic interoperability. The space of candidate alignments is enormous: there are $2^{mn}$ possible undirected alignments for ontology graphs of sizes $m$ and $n$ [12]. For large

ontologies with tens of thousands of elements, manual matching methods are clearly impractical [13], and semi-automated approaches are not suitable for real-time applications. There is no guarantee that two ontologies in the same domain will have terms that all precisely and completely overlap: an element name in one ontology might be equivalent to several element names, or none, in the other. Clearly, matching techniques must be sensitive to a number of ontology features to find corresponding elements.

Contemporary languages for describing ontologies such as RDF(S) [1, 2] and OWL [3] allow ontology schemas to be modeled as directed labeled graphs. Therefore, analogous to graph matching techniques, ontology matching approaches differ in the cardinality of the correspondence that is generated between the ontological concepts. For example, several of the existing approaches [4-6] focus on identifying a one-one mapping between the concepts. A correspondence (or map) between two ontologies is one-one if a concept within each ontology is mapped to at most one concept in the other ontology; these are also called exact maps in graph matching terminology. Less restrictive are the many-one and many-many correspondences between the concepts, also called inexact maps, in which multiple concepts may each be mapped to a single or to multiple target concepts. For example, in the absence of identical concepts, a many-one mapping method may map the concepts of day, month, and year in one ontology each to the concept date in the target ontology.

2.5 EXPECTATION MAXIMIZATION

We briefly describe the EM approach below, and point the reader to [14] for more details. The expectation-maximization (EM) technique was originally developed in [15] to find the maximum likelihood estimate of the underlying model from observed data instances in the presence of missing values. It is increasingly being employed in diverse applications such as unsupervised clustering [16], learning Bayesian networks with hidden variables from data [17], and object recognition [18, 19]. The main idea behind the EM technique is to compute the expected values of the hidden or missing variable(s)

using the observed instances and a previous estimate of the model, and then to recompute the parameters of the model using the observed and the expected missing values as if they were actual observations.

Let $X$ be the set of observed instances, $M^n$ the estimate of the underlying model in the $n$th iteration, and $Y$ the set of missing or hidden values. The expectation step is a weighted summation of the log likelihood, where the weights are the conditional probabilities of the missing variables:

E-Step: $Q(M \mid M^n) = \sum_{y \in Y} \Pr(y \mid X, M^n) \, L(M \mid X, y)$

where $L(M \mid X, y)$ is the log likelihood of the model, computed as if the value of the hidden variable were known. The logarithm is used to simplify the likelihood computation. The maximization step consists of selecting the model that maximizes the expectation:

M-Step: $M^{n+1} = \arg\max_{M \in \mathbb{M}} Q(M \mid M^n)$

where $\mathbb{M}$ is the set of all models. The above two steps are repeated until the model parameters converge. Each iteration of the algorithm is guaranteed to increase the log likelihood of the model estimate, and therefore the algorithm is guaranteed to converge to either a local or the global maximum (depending on the vicinity of the starting point to the corresponding maximum). Often, in practice, it is difficult to obtain a closed form expression in the E-step, and consequently a maximizing $M^{n+1}$ in the M-step. In this case, we may replace the original M-step with the following:

Select $M^{n+1}$ such that $Q(M^{n+1} \mid M^n) \ge Q(M^n \mid M^n)$

The resulting generalized EM (GEM) method [15] guarantees that the log likelihood of the model $M^{n+1}$ is greater than or equal to that of $M^n$. Therefore, the GEM retains the convergence property of the original algorithm while improving its applicability.
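The control flow of GEM can be summarized in a few lines. Below is a minimal sketch, assuming hypothetical callbacks q(m_new, m_old) for the expected log likelihood $Q$ and sample_model() for drawing a random candidate from the model space; neither name comes from the thesis.

```python
import random

def generalized_em(q, sample_model, m0, max_iters=100, num_candidates=25):
    """Generalized EM: instead of maximizing Q exactly, sample candidate
    models and accept any one that merely improves Q over the current model."""
    m = m0
    for _ in range(max_iters):
        candidates = [sample_model() for _ in range(num_candidates)]
        better = [c for c in candidates if q(c, m) > q(m, m)]
        if not better:      # no candidate improves Q; treat as converged
            return m
        m = random.choice(better)
    return m
```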

3. RELATED WORK

This chapter is a study of some of the most significant ontology alignment systems, namely GLUE [5], FALCON [20], COMA [21], RiMOM [22], PRIOR [23] and OLA [24], and a comparison between them. These systems use diverse, state-of-the-art techniques.

3.1 ONTOLOGY MATCHING SYSTEMS

GLUE

GLUE [5] is a system that aims to generate semantic mappings between the entities of two ontologies. The basic architecture of GLUE is shown in Figure 3.1.

Figure 3.1: The GLUE Architecture

The system uses multiple similarity measures and machine learning techniques on data instances and taxonomic structures to obtain matches between ontologies. GLUE has three components. The Distribution Estimator performs the initial mapping tasks with the learners. The Similarity Estimator applies a user-defined similarity measure, the Jaccard coefficient of equation (1), to compute a similarity value for each pair of concepts and generate the similarity matrix. Each concept is modeled as a set of instances from a finite universe of instances. The joint probability distribution between any two entities A and B consists of four probabilities: $P(A,B)$, $P(A,\bar{B})$, $P(\bar{A},B)$ and $P(\bar{A},\bar{B})$, where $P(A,\bar{B})$ is the probability that a randomly chosen instance from the universe belongs to entity A but not to entity B. The Jaccard coefficient is defined based on the four probabilities above:

Jaccard-sim(A, B) = $\dfrac{P(A,B)}{P(A,B) + P(A,\bar{B}) + P(\bar{A},B)}$   (1)

The Relaxation Labeler takes the similarity matrix and applies domain-specific constraints and heuristics to generate the best mappings between the two ontologies.
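Eq. (1) is easy to state operationally. The sketch below estimates the three joint probabilities from explicit instance sets and combines them into the Jaccard coefficient; it is a minimal illustration of the formula, not GLUE's actual code, and it assumes both concepts draw instances from one shared finite universe.

```python
def jaccard_sim(instances_a, instances_b, universe_size):
    """Eq. (1): P(A,B) / (P(A,B) + P(A,~B) + P(~A,B))."""
    a, b = set(instances_a), set(instances_b)
    p_ab     = len(a & b) / universe_size   # P(A, B)
    p_a_notb = len(a - b) / universe_size   # P(A, ~B)
    p_nota_b = len(b - a) / universe_size   # P(~A, B)
    denominator = p_ab + p_a_notb + p_nota_b
    return p_ab / denominator if denominator else 0.0

# Example: two concepts sharing two of their instances
print(jaccard_sim({"i1", "i2", "i3"}, {"i2", "i3", "i4"}, universe_size=10))
```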

The Distribution Estimator is the core component of the system. It takes two taxonomies O1 and O2 and their data instances as input. For every pair of concepts (A in O1, B in O2), the Distribution Estimator applies machine learning techniques to compute their joint probability distribution, using two base learners and a meta-learner. Each base learner exploits a certain type of information from the training instances to build prediction hypotheses. The two base learners are as follows. The content learner exploits the frequency of words in the text of an instance and employs the Naïve Bayes learning technique; the textual information of an instance (usually its name and attributes) is treated as a set of tokens, and to make a prediction the content learner computes the probability that an input instance is an instance of entity A given its tokens. The name learner makes predictions based on the full name of the input instance rather than on its content, where the full name is the complete name of the entity starting from the root of the ontology. The meta-learner assigns each base learner a weight that specifies how much its predictions can be trusted, and the results of the base learners are combined in the meta-learner using the assigned weights.

FALCON-AO

Falcon-AO [20, 25] exploits both the language and the structure of the ontologies. Two matching algorithms are integrated in Falcon-AO: LMO, a linguistic matcher that matches concepts based on the linguistic similarity of nodes, and GMO, a graph matcher that matches concepts based on the structural similarity of nodes. The following figure illustrates the system architecture of Falcon-AO.

Figure 3.2: System Architecture

LMO, the linguistic matcher, combines lexical analysis and statistical analysis. In lexical analysis, the edit distance between entity names is calculated and the following function is used to capture the string similarity:

SS(s1, s2) = $1 - \dfrac{ed(s1, s2)}{\max(s1.len,\, s2.len)}$   (2)

where $ed$ denotes the edit distance between s1 and s2, and s1.len and s2.len denote the lengths of the input strings s1 and s2.

In statistical analysis [26], the Vector Space Model algorithm is used. A collection of virtual documents is generated, where each document corresponds to an entity in the ontology; the virtual document of an entity consists of a bag of terms extracted from the textual information of the entity along with information from its neighbors. Each document is represented as a vector in an N-dimensional space, where N denotes the number of unique terms in the collection, and each component of the vector is the weight of the corresponding term:

weight = TF × IDF   (3)

TF = $\dfrac{t}{T}$   (4)

IDF = $\log \dfrac{D}{d}$   (5)

where $t$ denotes the number of times a term occurs in a document, $T$ denotes the maximum number of times any term occurs in that document, $D$ denotes the number of documents in the collection, and $d$ denotes the number of documents in which the term appears at least once. The document similarity is the cosine similarity between the document vectors, obtained from their dot product:

DS = $\dfrac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert}$   (6)

The lexical comparison and the statistical analysis are then combined into a single score:

Linguistic Similarity = $w_{SS} \cdot SS + w_{DS} \cdot DS$   (7)

that is, a weighted combination of the lexical and the statistical similarity values.
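The statistical half of LMO is ordinary TF-IDF weighting plus cosine similarity, and a compact sketch of Eqs. (3)-(6) follows. This is our own illustration of the standard computation; Falcon-AO's actual tokenization and weighting details may differ.

```python
import math
from collections import Counter

def tfidf_vector(doc, docs):
    """Eqs. (3)-(5): weight = TF * IDF with TF = t/T and IDF = log(D/d).
    doc is a list of tokens; docs is the whole collection of token lists."""
    counts = Counter(doc)
    t_max = max(counts.values())            # T: count of the most frequent term
    num_docs = len(docs)                    # D: number of documents
    vector = {}
    for term, t in counts.items():
        d = sum(1 for other in docs if term in other)
        vector[term] = (t / t_max) * math.log(num_docs / d)
    return vector

def document_similarity(v1, v2):
    """Eq. (6): cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v2.get(term, 0.0) for term, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```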

GMO, the graph matcher, represents the ontologies as directed bipartite graphs and measures the structural similarity between the graphs. The similarity between two entities is determined based on the similarity of the statements in which the entities play the same role (subject, predicate, object). GMO takes the set of matched entity pairs generated by LMO and tries to generate additional matches by comparing structural similarity. The GMO similarity is calculated, and alignments whose values exceed a fixed threshold are added to the final alignment; the rest are discarded.

COMA

COMA [21] is developed as a schema and ontology matching tool.

Figure 3.3: COMA System Architecture

The figure above illustrates the system architecture of COMA along with its various components. The essential parts of COMA are: the Repository, which persistently stores the match-related data; the Execution Engine, which performs the actual match operations; the Match Customizer, which configures the matchers and match strategies for each iteration; and the Mapping Pools, which manage the ontologies and mappings in memory. COMA supports ontologies in OWL-Lite.

The Execution Engine iterates over the matcher execution for best results. It consists of component identification, which determines the components that are used for matching; matcher execution, which applies multiple matchers to compute the similarities; and similarity combination, which combines the similarity values from the various matchers and derives the correspondences between the entities in the ontologies. The main step during execution is running multiple independent matchers chosen from the matcher library. Current matchers fall into three classes: simple, hybrid and reuse-oriented. They exploit various kinds of schema information such as names, data properties, comments, and so on. The results of k matchers applied to m entities in ontology O1 and n entities in ontology O2 form a cube of similarity values, which is used in later selection steps. Element names are given the highest priority for finding the similarity between entities. Some of the matchers based on the name of an entity are:

1) Affix: strings are compared based on common prefixes and suffixes between the two names.
2) N-gram: strings are compared based on their sequences of n characters (a sketch follows this list).
3) Edit Distance: strings are compared based on the number of operations needed to transform one string into the other.
4) Soundex: strings are compared based on the phonetic similarity between names, derived from their corresponding soundex codes.
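Name matchers of this kind take only a few lines each. The sketch below shows one plausible reading of the N-gram matcher, scoring two names by their overlap in character trigrams; the normalization chosen here (shared grams over the larger gram set) is our assumption, not necessarily COMA's.

```python
def ngram_similarity(name1, name2, n=3):
    """Compare two entity names by their shared character n-grams."""
    def grams(s):
        s = s.lower()
        return {s[i:i + n] for i in range(len(s) - n + 1)}
    g1, g2 = grams(name1), grams(name2)
    if not g1 or not g2:
        return 0.0
    return len(g1 & g2) / max(len(g1), len(g2))

print(ngram_similarity("ConventionalWeapon", "Conventional_Weapon"))
```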

Taxonomy Matcher: this matching technique utilizes globally shared taxonomies for matching the nodes of two given ontologies. The similarity value of two nodes is determined based on the distance between the locations of the nodes and on whether they are homonyms of the same entity.

RiMOM

RiMOM [22, 27], Risk Minimization based Ontology Mapping, is based on Bayesian decision theory. RiMOM integrates different alignment strategies: an edit distance based strategy, a vector similarity based strategy, a path similarity based strategy, a background knowledge based strategy, and three similarity propagation based strategies. RiMOM uses risk minimization to search for optimal mappings among the results of the multiple strategies. The following figure illustrates the system architecture of RiMOM, with five major steps.

Figure 3.4: RiMOM System Architecture

1. User Interaction: RiMOM supports optional user interaction to capture information provided by the user, which is later used to improve the matching accuracy.
2. Multi-Strategy Execution: each strategy produces matching values between 0 and 1. For k strategies, m entities in O1 and n entities in O2, the output of the execution phase is a set of predicted values that is represented as a cube in the figure.

3. Multi-Strategy Combination: the results from all the strategies are combined to produce the final similarity values.
4. Mapping Iteration: mapping takes place in one or more iterations, depending on the mode of interaction. In interactive mode, the user can interact with RiMOM in each iteration to change the match strategies, manipulate the mappings, and create new mappings. In automatic mode, the output of an iteration is taken as the input to the next iteration.
5. Mapping Discovery: the mappings between the entities are generated by comparing the final similarity values with thresholds. Entity matches with similarity lower than the thresholds are discarded.

Three of the most important strategies employed are the following. The name based decision is generated from the similarity between words according to the WordNet thesaurus. The instance based decision exploits the word frequencies in the textual content of the instances to discover mappings; it formulates ontology mapping as a classification problem. Given two ontologies O1 and O2, the instances of O2 are considered training samples and the instances of O1 test samples. A naïve Bayesian classifier is employed to learn a model from the training samples, and this model is used to classify the test samples. The description based decision exploits the comments or description section of the ontology: word frequencies in the entity descriptions of the target ontology are used to construct a Bayesian classifier, and the entity descriptions in the source ontology are used for predictions.
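The instance based decision reduces to plain text classification, so it can be sketched directly with off-the-shelf components. The snippet below is a minimal illustration of that formulation using scikit-learn, not RiMOM's own implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def instance_based_mapping(o2_texts, o2_concepts, o1_texts):
    """Train a naive Bayes model on O2's instance texts (labeled by their
    concepts), then classify O1's instance texts to suggest correspondences."""
    vectorizer = CountVectorizer()
    x_train = vectorizer.fit_transform(o2_texts)    # O2 = training samples
    model = MultinomialNB().fit(x_train, o2_concepts)
    return model.predict(vectorizer.transform(o1_texts))  # O1 = test samples
```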

PRIOR+

PRIOR+ [23, 28] is an automated ontology alignment tool based on propagation theory, information retrieval techniques and an artificial intelligence model. The following figure gives an overview of the system architecture of the PRIOR+ alignment tool.

Figure 3.5: Architecture of PRIOR+ Approach

The three core components of PRIOR+ are:

1) IR-based Similarity Generator: the profile [29] of a concept is a combination of all linguistic information associated with the concept. This component generates similarities from both the linguistic and the structural information of the ontologies. To calculate the structural similarity, various structural features, such as the number of properties and the depth of an entity from the root, are considered, and the differences between these structural features are calculated. The output of this component is three similarity matrices, one for each comparison strategy.

2) Similarity Aggregator: the term harmony is used to represent the similarity between the ontologies. Three types of harmony are defined: name harmony, profile harmony and structural harmony. They are calculated from the similarity matrices produced by the similarity generator:

$h_k = \dfrac{c_k}{\min(m, n)}$   (8)

where $h_k$ denotes the type of harmony, $m$ and $n$ denote the numbers of elements in the two ontologies, and $c_k$ is the number of cells that hold the highest similarity in their corresponding row/column of the similarity matrix $M_k$.

3) Constraint Satisfier: ontology mapping can be treated as a constraint satisfaction problem (CSP), an intriguing research problem due to the characteristics of ontologies and their representations. The various constraints expressed in RDFS and OWL must be handled to produce the best result. CSPs are typically solved by some form of search, e.g., backtracking, constraint propagation or local search. The constraint "only one-to-one mapping is allowed" means that a pair of candidate matches (E1i, E2j) and (E1i, E2k), where k differs from j, should return a negative connection, specifying that one of the matches is incorrect and should be removed. The constraint "two elements match if their children match" means that a pair of matches (E1i, E2j) and (E1k, E2t) whose children match returns a positive connection, specifying that the match between the nodes is correct.

OWL-LITE ALIGNMENT

OLA [24] is designed specifically for the alignment of ontologies expressed in OWL-Lite. OLA starts by producing results with a lexical similarity measure and later brings in contributions from structural matching. It provides an easy-to-use graph visualizer for ontologies called OL-Graph. The similarity model of OLA applies a similarity function to each node category in the OL-Graph. The functions are designed to make use of all the descriptive information available for a pair of entities. Thus, given a category X, the similarity of two nodes from X depends on:

- the similarities of the terms used to designate them, such as node names, comments, labels, and so on;
- the similarity of the pairs of neighbor nodes in the respective OL-Graphs that are linked by edges expressing the same relationships;

- the similarity of other local descriptive features, depending on the specific category, such as cardinality and property types.

Lexical comparison of the entities relies on WordNet. For a given set of entities, the lexical similarity mechanism retrieves the set of synonyms for each term, and a normalized Hamming distance is applied to these sets. For entities with no entry in WordNet, a default similarity measure, a variant of the substring distance, is applied.

3.2 COMPARISON BETWEEN ALIGNMENT ALGORITHMS

The following table compares the matchers of the various alignment tools discussed above. Element-level matchers compute correspondences by analyzing entities, or instances of those entities, in isolation. External matchers exploit external resources of a domain and common knowledge in order to interpret the input. Structural-level matchers compute correspondences by analyzing how entities, or their instances, appear together in a structure.

Ontology Alignment Tool | Element Level Matcher | External Matcher | Structural Level Matcher | Number of Matchers | Cardinality of Mappings
GLUE | Naïve Bayes | Domain constraints | Hierarchical structure | 2 | one-one
FALCON-AO | String based | WordNet | Structural affinity | 2 | one-one
COMA | String based, language based | Auxiliary thesauri, alignment reuse | DAG matching with bias towards various structures | 12 | one-one
RiMOM | String based, Naïve Bayes | WordNet | Taxonomic structure, similarity propagation | 6 | many-one
PRIOR | Propagation theory, information retrieval | - | Properties, depth of the entities from root | 3 | one-one
OLA | String based, language based | WordNet | Matching of neighbors, taxonomic structure | 3 | -
EM | String based (entity, instance, edge labels) | none | Hierarchical structure | 2 | many-one

Table 3-1: Comparison of Alignment Tools based on Matchers [7]

The following table compares the inputs to be supplied, the interaction with the system, the output produced by the system, and the results on the OAEI 2006 [30] benchmarks.

Ontology Alignment Tool | Input | Interaction | Output | OAEI 2006 Benchmark Results (F-Measure)
GLUE | Relational schema, taxonomy | Auto | Alignment | Did not participate
FALCON-AO | RDF, OWL | Auto | Alignment | 87%
COMA | OWL-Lite | User | Alignment | 85%
RiMOM | OWL | Auto | Alignment | 88%
PRIOR | OWL | Auto | Alignment | 69%
OLA | OWL-Lite | Auto | Alignment | 89%
EM | RDF, OWL, N3 | Auto | Alignment, Visualization | Did not participate

Table 3-2: Comparison of Alignment Tools based on I/O [7]

4. ONTOLOGY MATCHING USING EXPECTATION MAXIMIZATION

A graph-theoretic method that generates mappings between the participating ontologies is presented in [48]. It focuses on the ontology schemas and uses directed graphs as the underlying models for the ontologies. The authors formulate the problem as finding one of the most likely maps between two ontologies, and compute the likelihood using the expectation-maximization (EM) technique [15]. The EM technique is typically used to find the maximum likelihood estimate of the underlying model from observed data containing missing values. In this formulation, the set of correspondences between the pair of ontologies to be matched is treated as the hidden variables, and a mixture model is defined over these correspondences. The EM algorithm revises the mixture models iteratively by maximizing a weighted sum of the log likelihood of the models. The weights, as in the general EM algorithm, are the posterior probabilities of the hidden correspondences. Within the EM approach, the structural as well as the lexical similarity between the schemas is exploited to compute the likelihood of a map.

The authors focus on identifying a many-one match between the ontologies. In particular, they allow multiple concepts of one ontology to be mapped to a single concept in the other ontology; thus, a concept of the target ontology may appear in more than one correspondence in the alignment. In identifying the maps, they limit themselves to utilizing the structural similarity between the two ontologies and the lexical similarities between the concept names, labels and instances. They do not use special-purpose matchers or domain knowledge [31] to discover the functions that relate groups of nodes. Although generating many-one maps is computationally more efficient than generating many-many maps because the search space is smaller, it is obviously more restrictive. While analogous approaches for graph matching appear in computer vision [19], they are restricted to unlabeled graphs. As with other graph-theoretic ontology matchers [32], this approach, while directly applicable to taxonomies, may be applied to edge-labeled ontologies using reification. Furthermore, in comparison to non-iterative matching techniques that produce an alignment in a single

step, iterative approaches allow the possibility of improving on the previous best alignment, though usually at the expense of computational time.

The particular form of the mixture models in this formulation precludes a closed form expression for the log likelihood. Consequently, standard maximization techniques such as (partial) differentiation are intractable. Instead, the authors adopt the generalized version of the EM [15], which relaxes the maximization requirement and simply requires the selection of a mixture model that improves on the previous one. Since the complete space of candidate mixture models tends to be large, and in order to avoid local maxima, they randomly sample a representative set of mixture models and select the candidate from among them. To speed up convergence, they supplement the sampled set with locally improved estimates of the mixture models that exploit general (domain-independent) mapping heuristics.

They first evaluate this approach on small ontology pairs obtained from the I3CON repository [33], reporting the accuracy of the matches generated as well as other characteristics such as the average number of sample sets generated and the average run times until the EM iterations converge. From these preliminary results, it is clear that the method, though accurate, does not scale well to ontologies with a large number of nodes. We identify two computational bottlenecks within the approach that make it difficult to apply to larger ontologies. We develop ways to mitigate the impact of these bottlenecks, and demonstrate favorable results on larger ontology pairs obtained from the I3CON repository, the OAEI 2006 campaign [30] and other independently developed real-world ontology pairs. We compare our results with those obtained by some of the other ontology matching approaches on similar ontology pairs and show that our approach performs better in many cases.

Several of the existing approaches utilize multiple matching techniques and external domain knowledge to gauge the similarities between ontologies. For example, COMA [21] uses four string based matchers and one language based matcher, as well as auxiliary thesauri such as WordNet [34], for assessing the similarity between node labels. GLUE [5] utilizes domain specific heuristics to uncover some of the mappings between the ontologies.

There is inconclusive empirical evidence on whether the improvement in performance due to the presence of multiple matchers and external sources outweighs the additional computational overhead. Indeed, the fact that these make the matching unwieldy, requiring tedious tuning of several parameters, has been noted in [35].

4.1 GRAPH MATCHING USING GEM

As mentioned previously, the authors model the ontologies as graphs and consequently focus on the graph matching problem. Let the data graph be $O_d = \langle V_d, E_d \rangle$ and the model graph be $O_m = \langle V_m, E_m \rangle$, and let $M$ be a $|V_d| \times |V_m|$ matrix that represents the match between the two graphs. In other words, $M = \{m_{\alpha a}\}$, where each assignment variable $m_{\alpha a} = 1$ if the data graph node $x_\alpha$ is matched to the model graph node $y_a$ under the correspondence $f$, and $0$ otherwise.

They call the property of preserving edges across the transformation edge consistency: if $x_\beta$ is adjacent to $x_\alpha$ in the data graph, then $f(x_\beta)$ must be adjacent to $f(x_\alpha)$ in the model graph. The correspondence $f : V_d \rightarrow V_m$ is a homomorphism if it is a many-one or many-many mapping and is edge consistent. In the paper, they focus on tractably generating homomorphisms with many-one mappings.
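Edge consistency is mechanical to check for a candidate map. The sketch below does so for a map represented as a Python dict; representing the graphs as sets of edge tuples is our own choice for illustration.

```python
def is_edge_consistent(f, data_edges, model_edges):
    """True if every edge between mapped data-graph nodes has a matching
    edge between the corresponding model-graph nodes (homomorphism test)."""
    return all((f[u], f[v]) in model_edges
               for (u, v) in data_edges
               if u in f and v in f)

data_edges = {("APC", "CombatVehicle"), ("TankVehicle", "CombatVehicle")}
model_edges = {("TankVehicle", "ArmoredVehicle")}
f = {"APC": "TankVehicle", "CombatVehicle": "ArmoredVehicle"}
print(is_edge_consistent(f, data_edges, model_edges))  # True
```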

They formulate graph matching as a maximum likelihood (ML) problem. Specifically, they are interested in the match matrix, $M$, that gives the maximum conditional probability of the data graph, $O_d$, given the model graph, $O_m$, and the match assignments. Formally,

$M^* = \arg\max_{M \in \mathcal{M}} \Pr(O_d \mid O_m, M)$   (9)

where $\mathcal{M}$ is the set of all match assignments. In general, there may be $2^{|V_d| \times |V_m|}$ different matrices, but by restricting the analysis to many-one correspondences, which may be partial, the search space is reduced to $(|V_m| + 1)^{|V_d|}$. As is common, the data graph nodes may be assumed to be conditionally independent, and we may sum over the model graph nodes using the law of total probability:

$\Pr(O_d \mid O_m, M) = \prod_{\alpha=1}^{|V_d|} \sum_{a=1}^{|V_m|} \Pr(x_\alpha \mid y_a, m_{\alpha a}) \, \pi_a$

where $\pi_a = \Pr(y_a \mid M)$ is the prior probability of the model graph vertex, $y_a$, given the mixture model, $M$. In order to solve the ML problem, they note that the correspondence, $f$, is hidden from us. Additionally, if each assignment variable, $m_{\alpha a}$, is viewed as a model, then the matrix $M$ may be treated as a mixture model. Consequently, the mixture model, $M$, is parameterized by the set of the constituent assignment variables. Both these observations motivate the formulation of an EM technique to compute the model with the maximum likelihood.

4.2 EXPECTATION STEP

They start by formulating a conditional expectation of the log likelihood with respect to the hidden variables, given the data graph and a guess of the mixture model:

$Q(M^{n+1} \mid M^n) = E\big[\log \Pr(O_d \mid O_m, M^{n+1}) \;\big|\; O_d, O_m, M^n\big]$   (10)

The expectation may be rewritten as a weighted summation of the log likelihood, with the weights being the posterior probabilities of the hidden correspondences under the matrix of assignment variables at iteration $n$. Equation 10 becomes:

$Q(M^{n+1} \mid M^n) = \sum_{\alpha=1}^{|V_d|} \sum_{a=1}^{|V_m|} \Pr(y_a \mid x_\alpha, M^n) \, \log\big[\Pr(x_\alpha \mid y_a, m^{n+1}_{\alpha a}) \, \pi^{n+1}_a\big]$   (11)

Next, they address the computation of each of the terms in the above equation, focusing first on the posterior, $\Pr(y_a \mid x_\alpha, M^n)$. Once a method of computation is established for this term, the generation of the log likelihood term follows analogously. Using Bayes' theorem, the posterior may be rewritten as:

$\Pr(y_a \mid x_\alpha, M^n) = \dfrac{\Pr(x_\alpha \mid y_a, M^n) \, \Pr(y_a \mid M^n)}{\sum_{b=1}^{|V_m|} \Pr(x_\alpha \mid y_b, M^n) \, \Pr(y_b \mid M^n)}$   (12)

They now evaluate the term $\Pr(x_\alpha \mid y_a, M^n)$ in Eq. 12. This term represents the probability that the data graph node, $x_\alpha$, is in correspondence with the model graph node, $y_a$, under the match matrix of iteration $n$. As mentioned before, $M^n$ is a mixture of the models, $m^n_{\alpha a}$, and the models are treated as independent of each other. This allows them to write the above term as:

$\Pr(x_\alpha \mid y_a, M^n) \;\propto\; \Pr(x_\alpha \mid y_a, m^n_{\alpha a}) \, \Pr(x_\alpha \mid y_a)$   (13)

after noting that the remaining factors of the mixture do not depend on the pair under consideration and substituting the decomposition into the numerator of the previous equation.

They first focus on the term $\Pr(x_\alpha \mid y_a, m^n_{\alpha a})$, which represents the probability that $x_\alpha$ is in correspondence with $y_a$ given the assignment model, $m^n_{\alpha a}$. As noted previously, $m_{\alpha a}$ is 1 if $x_\alpha$ is matched with $y_a$ under the correspondence $f$, and 0 otherwise. The set of nodes adjacent to $x_\alpha$ is called its neighborhood, $\mathcal{N}(x_\alpha)$. Since $f$ is sought to be a homomorphism, which must be edge consistent, for each vertex $x_\beta \in \mathcal{N}(x_\alpha)$ the corresponding vertex $f(x_\beta)$ must lie in $\mathcal{N}(y_a)$. Therefore, the probability of $x_\alpha$ being in correspondence with $y_a$ depends, in part, on whether the neighborhood of $x_\alpha$ is mapped to the neighborhood of $y_a$ under $f$. Several approaches for schema matching and graph matching [19] are based on this observation. To formalize this, they introduce EC, an exponential in the number of assignments in $M^n$ that map the neighborhood of $x_\alpha$ into the neighborhood of $y_a$:

$EC = \exp\Big( \sum_{x_\beta \in \mathcal{N}(x_\alpha)} \; \sum_{y_b \in \mathcal{N}(y_a)} m^n_{\beta b} \Big)$   (14)

In addition to the structural similarity, $\Pr(x_\alpha \mid y_a, m^n_{\alpha a})$ is also influenced by the lexical similarity between the concept labels of the nodes $x_\alpha$ and $y_a$:

$\Pr(x_\alpha \mid y_a, m^n_{\alpha a}) \;\propto\; EC \times (1 - \rho_{\alpha a})$   (15)

Here $\rho_{\alpha a}$ is the correspondence error based on the lexical similarity of the node labels. We address the computation of $\rho_{\alpha a}$ later in this paper.

In the term $\Pr(x_\alpha \mid y_a)$ in Eq. 13, $x_\alpha$ is independent of $y_a$ in the absence of the mixture model. Therefore $\Pr(x_\alpha \mid y_a) = \Pr(x_\alpha)$, whose value depends only on the identity of the node, $x_\alpha$. In this paper, they assume this distribution to be uniform. Substituting Eqs. 13 and 15 into Eq. 12 yields:

$\Pr(y_a \mid x_\alpha, M^n) = \dfrac{1}{Z} \; EC \times (1 - \rho_{\alpha a})$

where $Z$ is the normalizing constant and EC is as in Eq. 14. They now look at the log likelihood term, $\log \Pr(x_\alpha \mid y_a, m^{n+1}_{\alpha a})$, in Eq. 11. The computation of this term follows a similar path as before, with the difference that the new mixture model, $M^{n+1}$, is used; the presence of the log considerably simplifies the computation, which otherwise proceeds analogously to Eqs. 13-15.

4.3 MAXIMIZATION STEP

The maximization step involves choosing the mixture model, $M^{n+1}$, that maximizes $Q(M^{n+1} \mid M^n)$, shown in Eq. 11. This mixture model then becomes the input for the next iteration of the E-step. However, the particular formulation of the E-step and the structure of the mixture model make it difficult

to carry out the maximization. Therefore, they relax the maximization requirement and settle for a mixture model, $M^{n+1}$, that simply improves the Q value. As mentioned before, this variant of the EM technique is called the generalized EM:

$Q(M^{n+1} \mid M^n) \ge Q(M^n \mid M^n)$   (16)

The priors, $\pi^{n+1}_a$, for each $a$ are those that maximize Eq. 11. They focus on maximizing the second term of the equation. Differentiating it partially with respect to $\pi_a$, and setting the resulting expression to zero, results in:

$\pi^{n+1}_a = \dfrac{1}{|V_d|} \sum_{\alpha=1}^{|V_d|} \Pr(y_a \mid x_\alpha, M^n)$

The term $\Pr(y_a \mid x_\alpha, M^n)$ was computed previously in Eq. 12. They use $\pi^{n+1}_a$ in the next iteration of the E-step.

4.4 LEXICAL SIMILARITY BETWEEN CONCEPT NAMES

They compute the correspondence error, $\rho_{\alpha a}$, between a pair of data graph and model graph nodes (Eq. 15) as one minus the normalized lexical similarity between their respective labels. Although several metrics for computing the similarity between strings exist under the umbrella of edit distance, such as n-grams, Jaccard, and sequence alignment, they use the Smith-Waterman (SW) sequence alignment algorithm [36] for calculating the lexical similarity between the node labels. The SW algorithm may be implemented as a fast dynamic program and requires a score for the similarity between two characters as input; they assign a 1 if the two characters in consideration are identical and 0 otherwise. The algorithm generates the optimal local alignment by storing the maximum similarity between each pair of segments of the labels and using it to compute the similarity between longer segments. They normalize the output of the SW algorithm by dividing it by the length of the longer of the two labels.
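A minimal sketch of this similarity measure follows, using the classic Smith-Waterman recurrence with match score 1 and mismatch score 0 as the thesis specifies; the zero gap penalty is our own assumption, since the thesis only specifies the character scores.

```python
def smith_waterman_sim(s1, s2, match=1.0, mismatch=0.0, gap=0.0):
    """Normalized Smith-Waterman similarity between two labels: best local
    alignment score divided by the length of the longer label."""
    rows, cols = len(s1) + 1, len(s2) + 1
    h = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            score = match if s1[i - 1] == s2[j - 1] else mismatch
            h[i][j] = max(0.0,
                          h[i - 1][j - 1] + score,  # align the two characters
                          h[i - 1][j] - gap,        # gap in s2
                          h[i][j - 1] - gap)        # gap in s1
            best = max(best, h[i][j])
    return best / max(len(s1), len(s2)) if s1 and s2 else 0.0

print(smith_waterman_sim("TankVehicle", "Tank_Vehicle"))
```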

Recently, several approaches to schema and ontology matching have used external sources such as WordNet [34] to additionally perform linguistic matching between the node labels [21], including the consideration of synonyms, hypernyms and word senses. While these approaches may potentially discover better, semantic matches between the node labels, they rely on external sources and incur associated computational overheads, which could be significant. Nevertheless, such approaches could easily be accommodated within our implementation of the correspondence error. For example, the lexical similarity between not only the node labels but also their synonyms and hypernyms, obtained from WordNet, could be computed and the best similarity used in computing the correspondence error. Investigating whether utilizing external sources such as WordNet is beneficial in comparison to the additional computational overhead is one line of our future work.

4.5 RANDOM SAMPLING WITH LOCAL IMPROVEMENTS

Here they address the computation of the mixture model, $M^{n+1}$, that satisfies the inequality in Eq. 16. They observe that an exhaustive search of the complete model space is infeasible due to its large size: there are $(|V_m| + 1)^{|V_d|}$ distinct mixture models. On the other hand, both the EM and its generalization, GEM, are known to often converge to a local maximum [15] (instead of the global one) when the search space for selecting $M^{n+1}$ is parsimonious. This suggests that any technique for generating $M^{n+1}$ should attempt to cover as much of the model space as possible, while maintaining tractability. A straightforward approach for generating $M^{n+1}$ is to randomly sample K mixture models and select as $M^{n+1}$ one that satisfies the constraint in Eq. 16. They sample the models by assuming a flat distribution over the model space; the set of samples may be seen as a representative of the complete model space. However, since there is no guarantee that a sample

within the sample set will satisfy the constraint in Eq. 16, several sample sets may have to be generated before a suitable mixture model is found. This problem becomes especially severe when the model space is large and a relatively small number of samples, K, is used.

Figure 4.1: Example application of heuristics (matched subclasses in the two taxonomies lead to their respective parents being matched)

In order to reduce the number of sampled candidates $M^{n+1}$ that are discarded, they exploit intuitive heuristics that guide the generation of $M^{n+1}$. For taxonomies in particular, if $M^n$ exhibits correspondences between some subclasses in the two graphs, then their respective parents are matched to generate a candidate $M^{n+1}$. For the case where a subclass has more than one parent, lexical similarity is used to resolve the conflict. This heuristic is illustrated in Fig. 4.1 and sketched in code below. However, a simple example, Fig. 4.2, demonstrates that solely utilizing such a heuristic is insufficient to generate all the correspondences; in particular, the performance of the heuristic is dependent on the initial correspondences that are provided. To minimize the convergence to local maxima, we augment the set of heuristically generated mixture models with those that are randomly sampled. In this manner, not only do they select candidate mixture models that have a better chance of satisfying Eq. 16, but they also cover the model space. This approach is analogous to the technique of random restarts, commonly used to dislodge hill climbing algorithms that have converged to local maxima.
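The parent-matching heuristic can be sketched as a local improvement step over a candidate map. The helper names (data_parents, model_parents, lexical_sim) are hypothetical, and the tie-breaking by lexical similarity follows the description above.

```python
def heuristic_improvement(candidate, data_parents, model_parents, lexical_sim):
    """If a pair of subclasses is matched, also match their parents.
    candidate maps data-graph nodes to model-graph nodes; *_parents map a
    node to the list of its parents (a subclass may have several)."""
    improved = dict(candidate)
    for x, y in candidate.items():
        xps, yps = data_parents.get(x, []), model_parents.get(y, [])
        if xps and yps:
            # multiple parents: pick the lexically most similar parent pair
            xp, yp = max(((xp, yp) for xp in xps for yp in yps),
                         key=lambda pair: lexical_sim(pair[0], pair[1]))
            improved[xp] = yp
    return improved
```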

Figure 4.2: Incorrect matching using heuristics (propagating subclass matches to parents can produce wrong correspondences when the initial matches are misleading)

4.6 COMPUTATIONAL COMPLEXITY

Concluding with some of the limitations of this approach, we propose a solution to the problem by reducing the computationally intensive overhead. We first analyze the complexity of computing $Q(M^{n+1} \mid M^n)$, which forms the E-step. From the expression for the posterior (Eq. 12, with Eqs. 13 and 15 substituted), the complexity of the posterior is a combination of the complexity of computing EC, the correspondence error, and the exponential term. We observe that the edge-consistency checks may be computed through a series of look-up operations, and are therefore of constant time complexity. The complexity of calculating the correspondence error depends on the algorithm used for the lexical similarity; it is $O(l^2)$ for the SW technique, where $l$ is the length of the largest concept label (if instances are present, $l$ is the length of the largest instance). The complexity of the exponential term is $O(|V_d| \times |V_m|)$, since its exponent sums over the assignment variables. Hence the computational complexity of the posterior is $O(1) + O(l^2) + O(|V_d| \times |V_m|) = O(l^2 + |V_d| \times |V_m|)$. The computational complexity of the log likelihood term is also $O(l^2 + |V_d| \times |V_m|)$, because its computation proceeds analogously. Since the product of these terms is summed over all pairs of data and model graph nodes, the final complexity of the E-step is $O(|V_d| \times |V_m| \times (l^2 + |V_d| \times |V_m|))$. In the M-step, if we generate K samples within a sample set, the worst case complexity is $O(K \times |V_d| \times |V_m| \times (l^2 + |V_d| \times |V_m|))$.

5. SCALABLE MATCHING USING MEMORY-BOUNDED PARTITIONING

The computational complexity analysis in Chapter 4 indicates why the algorithm is unable to match larger ontologies, those containing several hundreds of concepts. We identify two reasons why the approach does not scale well to larger tasks. First, within the E-step, the computation of the term $Q(M^{n+1} \mid M^n)$, as shown in Eq. 11, requires iterating over all the model and data graph nodes and computing the weighted log likelihood. Consequently, both ontologies need to be stored entirely in main memory to facilitate the computation. For large ontologies, this becomes a computational bottleneck. Second, the exponent in Eq. 14 causes the computed value to become unmanageably large (often causing overflows) for ontologies with many nodes.

5.1 PARTITIONING OF ONTOLOGIES

In order to address these difficulties, we partition each of the two ontologies to be matched. Let $P^d$ and $P^m$ be the partitions of the sets of data and model graph vertices (including the nodes in the transformed bipartite graphs), respectively. We denote each non-empty disjoint subset within the partition as $P^d_i$, where $i = 1 \ldots |P^d|$, and analogously $P^m_j$ for the model graph. Iterating over all the graph nodes is equivalent to iterating over all the partitions and over all nodes within a partition, so Eq. 11 may be decomposed into:

$Q(M^{n+1} \mid M^n) = \sum_{i=1}^{|P^d|} \sum_{x_\alpha \in P^d_i} \sum_{j=1}^{|P^m|} \sum_{y_a \in P^m_j} \Pr(y_a \mid x_\alpha, M^n) \, \log\big[\Pr(x_\alpha \mid y_a, m^{n+1}_{\alpha a}) \, \pi^{n+1}_a\big]$

A simple rearrangement of the summations yields the following:

$Q(M^{n+1} \mid M^n) = \sum_{i=1}^{|P^d|} \sum_{j=1}^{|P^m|} \Bigg( \sum_{x_\alpha \in P^d_i} \sum_{y_a \in P^m_j} \Pr(y_a \mid x_\alpha, M^n) \, \log\big[\Pr(x_\alpha \mid y_a, m^{n+1}_{\alpha a}) \, \pi^{n+1}_a\big] \Bigg)$   (17)

Notice that the terms within the parentheses are analogous to the terms in Eq. 11, except that we now sum over the nodes in a subset within the partition. One simple way to partition an ontology graph is to traverse the graph in a breadth-first manner starting from a root node and collect MB nodes into each subset, where 2 × MB is the total number of nodes that can be held in main memory at any given time. The resulting disjoint sets of nodes collectively form a partition over the set of ontology nodes. We label a node within an ontology as root if it has no outgoing arcs. While taxonomies, by definition, contain a root node, ontologies in general may not. In the absence of a root node, we may create one by selecting a node with the lowest number of outgoing arcs and connecting it to a dummy root node. This node, for example, could denote the rdfs:thing class within the RDF schema, which generalizes all classes in RDF. Of course, other approaches for traversing and partitioning a graph also exist, with varying computational overheads (for example, see [37, 38]). Whether these approaches result in partitions that better facilitate matching is an aspect of our future investigations.

Equation 17 helps alleviate the difficulty of insufficient memory for large ontologies by requiring that only the nodes within one subset of each partition be held in memory for the log likelihood computations. However, this approach does not address the problem of having to compute large exponents, referred to previously. In order to address this, we observe that in computing EC (Eq. 14), we utilize only the nodes within the ontology graphs that are in the immediate neighborhood of the nodes in consideration for matching. This simple insight allows us to replace the mixture models, $M^n$ and $M^{n+1}$, used in the computation of the weighted log likelihood with partial ones: if $P^d_i$ is a subset within the partition of the data graph nodes, let $P^{d+}_i$ be the expanded set that additionally includes all nodes (belonging to $V_d$) that are the immediate neighbors of the nodes in $P^d_i$. This involves identifying the fringe or boundary nodes in $P^d_i$

and including their immediate neighbors (Fig. 5.1). Let $P^{m+}_j$ be the analogously expanded set of nodes for the subset $P^m_j$. Then, define $M^n_{ij}$ to be the region (sub-matrix) of $M^n$ that consists of the nodes in $P^{d+}_i$ as rows and the nodes in $P^{m+}_j$ as columns of the sub-matrix, and define $M^{n+1}_{ij}$ analogously.

Figure 5.1: Partitioning of Ontologies

In Eq. 13, we replace the mixture models in the computation of the weighted log likelihood with the partial ones. This transforms Eq. 17 into the following:

$Q(M^{n+1} \mid M^n) = \sum_{i=1}^{|P^d|} \sum_{j=1}^{|P^m|} \Bigg( \sum_{x_\alpha \in P^d_i} \sum_{y_a \in P^m_j} \Pr(y_a \mid x_\alpha, M^n_{ij}) \, \log\big[\Pr(x_\alpha \mid y_a, m^{n+1}_{\alpha a}) \, \pi^{n+1}_a\big] \Bigg)$   (18)

While this modification will not affect Eq. 16, the exponent in Eq. 14 becomes bounded by the size of the partial sub-matrix, as we show below; for large ontologies, this is likely to be much less than the former:

$\sum_{x_\beta \in \mathcal{N}(x_\alpha)} \; \sum_{y_b \in \mathcal{N}(y_a)} m^n_{\beta b} \;\le\; |P^{d+}_i| \times |P^{m+}_j|$   (19)

Although the right hand side of Eq. 18 is no longer equivalent to that of Eq. 17, the property that high Q-values indicate better matches is preserved. To summarize, we modify the GEM in the following way so as to enable the matching of larger ontologies (see the sketch after this list):

1. Partition the data and model ontology graphs such that the number of nodes within any subset of the partitions does not exceed MB, where 2 × MB is the total number of nodes that can be held in memory.
2. Expand each of the subsets within the partitions to include the immediate neighbors of the original nodes in the subset.
3. Perform the E-step using Eq. 18, where partial regions of the mixture models are used to compute the weighted log likelihoods.
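As a concrete illustration of steps 1 and 2, the sketch below partitions a graph breadth-first from a given root into subsets of at most MB nodes and then expands each subset with its immediate neighbors. The adjacency-list representation is our choice for the example.

```python
from collections import deque

def bfs_partition(adj, root, mb):
    """Step 1: breadth-first traversal from root, collecting at most mb
    nodes per subset; the disjoint subsets together form the partition."""
    partition, subset = [], []
    seen, queue = {root}, deque([root])
    while queue:
        node = queue.popleft()
        subset.append(node)
        if len(subset) == mb:
            partition.append(subset)
            subset = []
        for nbr in adj.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    if subset:
        partition.append(subset)
    return partition

def expand(subset, adj):
    """Step 2: add the immediate neighbors of the subset's nodes, i.e.,
    include the fringe around the boundary nodes."""
    expanded = set(subset)
    for node in subset:
        expanded.update(adj.get(node, []))
    return expanded
```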

5.3 EDGE LABELS BASED SIMILARITY

To facilitate graph matching, we transform the edge-labeled graphs into unlabeled ones by elevating the edge labels to first-class citizens of the graph. This process involves treating the relationships as resources, thereby adding them as nodes to the graph. We note that the transformation becomes unnecessary, from the perspective of ontology matching, when all edges have the same label. We illustrate the transformation using a simple example in Fig. 5.2 and point out that the transformed graph is a bipartite graph [39].

Figure 5.2: Transforming an edge-labeled graph into a bipartite graph

However, this transformation into a bipartite graph comes at a price. The bipartite graph contains as many additional nodes as the number of edges and distinct edge labels: for each edge a blank node is generated, and for each distinct edge label a node carrying that label. This approximately doubles the number of nodes in an ontology with multiple edge labels. Because of this, the transformation works well for smaller ontologies with edge labels but may be considered inappropriate for large ontologies with multiple edge labels. Matching the transformed graphs generates much better matches, since the edge characteristics are exploited as well. Consequently, for large ontologies, we may consider matching the concept taxonomy and matching edge labels across ontologies as separate but dependent tasks. This amounts to matching the concept taxonomy first and passing the match results as external input for matching the edge labels.
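A minimal sketch of this transformation follows, assuming a string-keyed adjacency-list representation. The arc directions and node naming are illustrative; the exact construction depends on the graph encoding used.

```java
import java.util.*;

/** Sketch: elevate edge labels to nodes, yielding an unlabeled (bipartite) graph.
    Representation and naming are illustrative assumptions, not the thesis code. */
public class BipartiteTransform {

    /** Unlabeled adjacency list of the transformed graph. */
    final Map<String, List<String>> unlabeled = new LinkedHashMap<>();

    private void addArc(String from, String to) {
        unlabeled.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    /** Transform one labeled edge source --label--> target. A fresh blank node is
        created per edge; a single node is reused for each distinct edge label. */
    public void addLabeledEdge(String source, String label, String target, int edgeId) {
        String blank = "_:b" + edgeId;          // blank node standing for the statement
        String labelNode = "label:" + label;    // edge label elevated to a node
        addArc(source, blank);                  // source connected to the blank node
        addArc(blank, target);                  // blank node connected to the target
        addArc(blank, labelNode);               // blank node points to the elevated label
    }
}
```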

5.4 OPTIMA

We have developed Optima [40], a general-purpose tool for automatically matching concepts and relationships residing in multiple ontologies that address similar domains. The core of Optima is the expectation-maximization ontology matching algorithm described in the previous chapters. A tool related to Optima is OLA [24]. While OLA provides a graphical visualization of ontologies, it does not allow a visualization of the alignment. In comparison, AlViz [41] highlights clusters of similar concepts, but lacks a sophisticated alignment algorithm and does not show details about which individual concepts are matched. A somewhat related tool is PromptViz, which provides a visual representation of the differences between versions of the same ontology. Optima incorporates both a highly interactive visualization and a sophisticated algorithm, improving on these tools.

Figure 5.3: Screenshot of the Optima user interface showing two loaded ontologies

Optima's user interface builds on an open source ontology browser called Welkin, which is part of MIT's SIMILE project. As we show in Fig. 5.3, Optima provides an intuitive interface that allows the user to load and view the ontologies. To facilitate browsing the ontologies, several different layouts of the ontology graphs, such as tree, circle, scramble (randomly generated locations) and active, are available. The tree representation helps visualize the ontology hierarchically, as shown in Fig. 5.4. In addition, nodes may be clustered, and those belonging to different namespaces may be filtered out from the display to reduce clutter.

Figure 5.4: Optima highlighting the matched nodes after alignment

On performing the alignment, the nodes of both ontologies that are matched are highlighted in a different color than all other nodes, and the user may select any of these nodes to identify its corresponding match in the other ontology, as shown in Fig. 5.4. The progress of the alignment task is shown to the user; the task may range from a few seconds to hours depending on the size and complexity of the ontologies.

The user can stop the alignment at any time during the execution if not satisfied with the progress. The discovered alignment can be saved in the XML format specified by the Ontology Alignment Evaluation Initiative (OAEI). To facilitate or steer the alignment, users may initially enter seed matches by selecting pairs of nodes from the ontologies, or choose the default seed-match feature, which generates the seed matches based on similar strings in both ontologies. A more detailed explanation of the features and components of Optima is provided in the Appendix.
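To illustrate the saved output, the sketch below emits a single correspondence in the alignment XML format popularized by the OAEI tools. The element structure is hedged from the publicly documented Alignment API format, simplified (the full file wraps such cells in an Alignment element inside an RDF document), and the URIs are hypothetical.

```java
/** Sketch: emit one correspondence cell in a simplified OAEI alignment XML format. */
public class AlignmentWriter {

    /** Build the XML fragment for one matched pair with its confidence measure. */
    public static String cell(String entity1, String entity2, double measure) {
        return "<map><Cell>\n"
             + "  <entity1 rdf:resource=\"" + entity1 + "\"/>\n"
             + "  <entity2 rdf:resource=\"" + entity2 + "\"/>\n"
             + "  <measure rdf:datatype=\"xsd:float\">" + measure + "</measure>\n"
             + "  <relation>=</relation>\n"
             + "</Cell></map>";
    }

    public static void main(String[] args) {
        // Hypothetical concept URIs, for illustration only.
        System.out.println(cell("http://example.org/onto1#Book",
                                "http://example.org/onto2#Publication", 0.87));
    }
}
```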

6 EVALUATION

We evaluated the performance of the GEM algorithm, modified using the memory-bounded partitioning scheme, on two well-known benchmark sets for evaluating ontology alignment algorithms.

6.1 THE OAEI CAMPAIGN AND ITS TEST CASES

We used the Ontology Alignment Evaluation Initiative (OAEI), which utilizes a systematic approach to evaluate ontology matching algorithms and identify their strengths and weaknesses.

6.1.1 BENCHMARK TESTS

OAEI 2006 [30, 42] provides a series of ontology pairs called the benchmark set, each of which is a modification or mutation of a base ontology on bibliographic references (36 concepts, 61 relationships). The benchmark set formulates three categories of test ontology pairs: test pairs 101 to 104 pair the base ontology with itself, an irrelevant one, and those expressed in more generalized or specialized languages; test pairs 201 to 266 compare the base ontology with modifications that include node label mutations, structural mutations, removing edge labels and removing instances; and test pairs 301 to 304 pair the base ontology with different real-world bibliographic ontologies.

In Fig. 6.1, we show the performance of our matching algorithm on the benchmark set of OAEI 2006. The perfect recall for the set 1xx indicates that language generalization or specialization did not impact the match performance. Within the set 2xx, our technique compensates for the dissimilarity in node labels by relying on the structural similarities between the ontologies. However, for the ontology pairs where one of the ontologies has randomly generated node labels (for example, pairs 248 to 266), it performs relatively poorly, demonstrating its (partial) reliance on the node labels for matching.

Figure 6.1: Average recall, precision and F-measure on the OAEI 2006 benchmark tests

We obtain comparatively better performance when the base bibliographic ontology is compared to the different real-world bibliographic ontologies in the set 3xx. While the base ontology focuses exclusively on bibliographic references, the real-world ontologies have wider scopes, also covering peripheral domains such as departmental organization. Nevertheless, our technique detected the structural and lexical similarities between the ontologies. Notice that the standard deviations for 1xx and 3xx reveal performance variances that are not inordinately large. However, the performance on the set 2xx varies more, in particular in tests 248 to 266.

Figure 6.2: Performance comparison with the other participating ontology matchers

In Fig. 6.2, we compare the performance of our ontology alignment technique with the other best performing algorithms that participated in the benchmark evaluations in 2006. The participants include COMA [21], FALCON-AO [20, 32] and RiMOM [22], along with six other ontology matching algorithms. Notice that the weakest performances are obtained for the ontology pairs in set 3xx, indicating the relative difficulty of the task. Our matching technique did not improve on the top performances in the categories 1xx and 2xx, though it exhibited results that were significantly better than the average performance of the best five participating algorithms in each category. For 2xx, its performance would place it second in rank order. It showed better results than all participants for the test set 3xx. In particular, for the test pair 304, over 95% of the concepts and edge labels were correctly matched, though only about 65% were correctly matched for the pair 303. This is because the real-world ontology in the latter case had a much wider scope, covering domains such as departmental organization, than the base ontology.

6.1.2 REAL WORLD ONTOLOGIES

In addition to the benchmark set, pairs of real-world ontologies are also available for evaluation. As the true maps for these are not provided, we obtained or computed partial true maps ourselves. We proceed to evaluate the approach on very large ontology pairs that were developed independently and are often diverse in structure. We utilize three pairs of ontologies in our evaluations: the Directory pair, consisting of taxonomies that represent Web directories from Google, Yahoo and Looksmart; the Anatomy pair, consisting of ontologies on the adult mouse anatomy available from Mouse Genome Informatics and the human anatomy from NCI; and the Cell pair [43], consisting of ontologies on the human cell developed by FMA [44] and the cell portion of the Gene Ontology. The Directory and Anatomy taxonomies were used in the OAEI 2006 and 2007 contests, respectively.

Figure 6.3: Performances on very large ontology pairs independently developed by different groups of domain experts

We show the results in Fig. 6.3. As the true maps for these pairs are not publicly available, we either obtained them by contacting the source (for the Anatomy pair) or inferred them ourselves (for the Directory and Cell pairs). In order to generate the true correspondences ourselves, we utilized a semi-automated procedure. As the first step, we compared node labels and comments between subsets of the partitioned ontologies using our lexical matching technique. We considered both exact and near-exact matches as potential correspondences. Next, we manually filtered out correspondences from this set that did not seem correct. For this, we looked at the labels and comments pertaining to the identified correspondences. Additionally, we identified and visualized their structural contexts in the subsets of the respective ontologies using the Jambalaya tool bundled in Protégé [45]. The latter step also resulted in some new correspondences being added between nodes whose labels did not match. However, as not all true correspondences may have been discovered, the accuracy of our results could be slightly different than shown. The true map for the Anatomy pair contained 1,500 correspondences, while the one for the Directory pair contained 1,530 correspondences.

Our methodology for matching these large ontology pairs consists of partitioning them using our memory-bounded approach, as described in Chapter 5, and executing the GEM on each pair of parts.

The parts varied in size, ranging, for example, from 200 to approximately 500 nodes for the Directory pair. The results of evaluating the parts were gathered and aggregated to generate the final results. As fringe nodes are included in the parts, the parts may be evaluated independently of each other. Hence, we could execute several runs in parallel to speed up the computation. Overall, the total evaluation time was approximately 8 hours for the larger ontologies and approximately 2 hours for the Cell pair. The experiments were run as a JDK 1.5 program on single-processor Xeon 3.6GHz Linux machines with 3GB of RAM, of which 1.9GB was utilized by the Java virtual machine.

Figure 6.4: Performance comparisons with other ontology matchers on very large ontology pairs

In Fig. 6.4, we compare our results with those of other alignment tools on the Directory and Anatomy pairs. Note that these comparisons are rough, as the true maps used for computing the results may not be identical. For the Directory pair, our approach performed significantly better than the average of the top five performances among the other tools in the OAEI 2006 contest. On the Anatomy pair, our result was also significantly better than the average of the performances of the other tools in the OAEI 2007 contest (type C), though lower than the best performance; it would be placed third in rank order. Interestingly, a majority of the correspondences are between leaf nodes, and a naive approach of just comparing node names performs better than the average. The Cell ontology pair is not part of OAEI, and alignments by other matchers are not available for comparison.
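Because the expanded parts can be evaluated independently, the part pairs can be farmed out to a thread pool, which is how the parallel runs mentioned above could be organized. The sketch below assumes a hypothetical matchParts callback standing in for a GEM run on one pair of parts; it is not the thesis code.

```java
import java.util.*;
import java.util.concurrent.*;

/** Sketch: match all pairs of expanded parts in parallel and aggregate the results. */
public class ParallelMatcher {

    /** Hypothetical stand-in for running the GEM on one (data part, model part) pair. */
    interface PartMatcher {
        Map<String, String> matchParts(Set<String> dataPart, Set<String> modelPart);
    }

    public static Map<String, String> matchAll(List<Set<String>> dataParts,
                                               List<Set<String>> modelParts,
                                               PartMatcher gem) throws Exception {
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<Map<String, String>>> futures = new ArrayList<>();
        for (Set<String> d : dataParts)
            for (Set<String> m : modelParts)
                futures.add(pool.submit(() -> gem.matchParts(d, m)));
        Map<String, String> aggregated = new HashMap<>();
        for (Future<Map<String, String>> f : futures)
            aggregated.putAll(f.get());          // gather and aggregate part-level results
        pool.shutdown();
        return aggregated;
    }
}
```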

6.2 I3CON REPOSITORY

The next set is the previously mentioned I3CON repository [33], from which we selected six ontology pairs and the corresponding true maps for evaluation. We list these pairs along with the associated statistics, such as the number of concepts, the number of relationships (edge labels) and the depth of the hierarchies in the ontologies.

Table 6.1: Ontology pairs from the I3CON repository

Ontology pair        No. of concepts   No. of relationships   Max. depth
Wine A & B           33 & 33           1 & 1                  2 & 2
Weapons A & B        63 & 58           1 & 1                  4 & 5
Network A & B        28 & 28           6 & 7                  3 & 3
CS A & B             109 & 20          52 & 7                 6 & 4
People+Pets A & B    60 & —            — & 15                 4 & 3
Russia A & B         158 & 150         81 & 76                6 & 8

Of the ontology pairs, the pair labeled Russia A & B consists of the largest ontology graphs, with 158 and 150 nodes, respectively, which are related to each other using 81 and 76 edge labels, respectively. These ontologies partially describe the culture, places of interest, infrastructure and jobs in the country of Russia. We also point out the CS ontology pair because of its dissimilarity: one member of the pair contains 109 nodes and 52 edge labels, while the other contains 20 nodes and 7 edge labels. These ontologies describe the social organization, infrastructure and functions of a typical computer science department, though at widely varying granularities.

We show the performance of our matching algorithm on these ontology pairs in Fig. 6.5. The recall and precision of the maps were calculated according to Eq. 12, and the F-measure was calculated as:

F-measure = 2 × (precision × recall) / (precision + recall)

The seed maps consisted of mappings of 5% to 10% of the leaf nodes, or of correspondences obtained by matching node labels exactly. We observe that the performance on the CS and Russia ontologies is weak. This is because the CS ontologies exhibit large structural dissimilarities, as is evident from their respective sizes shown in Table 6.1. Though the Russia pair of ontologies has approximately similar numbers of nodes and edge labels, their internal graph structure as well as their node labels are largely dissimilar, leading to a low level of performance. On the other hand, our algorithm perfectly matched the Wine pair of ontologies and demonstrated satisfactory performance on the other pairs.

Figure 6.5: Recall, precision and F-measure of the identified matches between the ontologies

A comparison with the performance of other ontology matching algorithms that participated in the I3CON benchmark evaluations is shown in Fig. 6.6. The six participants include the ontology mapping engine Arctic [46] from AT&T, LOM [47] from Teknowledge, and an alignment algorithm from INRIA, among others. As these benchmark evaluations were performed in 2004, many of the alignment algorithms may have since been improved.

We show both the F-measure averaged over the performances of all the participants and the best F-measure obtained for each ontology pair. Note that the best F-measure for each of the ontology pairs was not obtained by the same participant. The ontology pairs Wine and Weapons have been excluded, as benchmark evaluations for these are not available.

Figure 6.6: Performance comparison with the other participating ontology matchers

We observe that the performance of our matching technique, as indicated by the F-measure, is significantly better than the average, and improves on the best performing participant for some of the ontology pairs. Notice that all the participants fare poorly on the tasks of matching the CS and Russia ontologies, indicating the difficulty of these tasks. All of our correspondences were obtained within a reasonable amount of time, usually on the order of a few minutes, on an Intel Xeon 3.6GHz machine with 3GB of RAM running Linux. These results demonstrate the comparative benefit of utilizing a lightweight but sophisticated matching algorithm that exploits general-purpose heuristics to guide its search over the space of maps. Analogous to the OAEI benchmarks, no single participating algorithm performed the best on all sets.

7 CONCLUSION AND FUTURE WORK

We presented an enhanced and scalable method for identifying maps between ontologies that model similar domains. It is competitive with the top-ranked systems on the benchmark, Web directory and anatomy tests, and it shows the potential to improve performance in real-world ontology matching tasks. In this chapter, the main contributions of this thesis are outlined and a number of directions for future work are presented.

7.1 SUMMARY OF CONTRIBUTIONS

We aimed to enhance the expectation-maximization ontology matching approach to scale to ontologies of hundreds of nodes. Within the GEM approach, we enhanced the algorithm with instance matching and edge-label matching to select the most accurate match. Similar to some of the recent approaches, we generate inexact matches between the ontologies that may result in multiple nodes of one ontology being mapped to a single node of the other; such methods have a wider applicability. While the initial experimental results of the algorithm illustrated its good performance, they also highlighted its inability to scale to large ontologies. To address this limitation, we modified the GEM to operate on smaller subsets within partitions of the ontologies. Our empirical results on benchmarks such as I3CON and OAEI 2006 indicate the favorable performance of the method and provide support for lightweight but sophisticated approaches to matching ontologies.

The systematic evaluation using the test ontology pairs provided by OAEI revealed some of the strengths and weaknesses of our method. First, because we utilize the lexical similarity of concept labels to partly establish the likelihood of a map, our technique exhibits a weak performance where the node labels are highly dissimilar. We offset this limitation by utilizing instances, if available. Examples of affected tasks include matching ontologies on similar domains written in different languages and ontologies expressed using alternate standards.

However, the performance improves if the structures of the ontologies are comparable. Second, small changes in the structures, such as flattening or expansion of the concept hierarchy, did not significantly impact the performance. This is because the most likely map that we discover often turns out to be the correct one, thereby reducing the impact of such structural changes. Of course, the task of matching ontologies that are both syntactically and structurally disparate is the most difficult, for example, matching ontologies written in different languages with partially overlapping scopes. We note that further efficiency is needed in performing the optimization step. This is especially pertinent if we allow both many-one and many-many matches between the ontologies.

7.2 FUTURE WORK

Given our findings, there is still much work to be done. For example, extending the 1-1 and many-1 matching to many-many matching requires improving our matching algorithm and adjusting its constraints. We are also investigating the efficacy of utilizing sources of general background knowledge, such as a global upper-level ontology, for improving the match quality, and we are looking at different ways in which structural contexts may be used efficiently to discover a map. Recently, several approaches to schema and ontology matching have used external sources such as WordNet [34] to additionally perform linguistic matches between the node labels [21], including considering synonyms, hypernyms and word senses. Such approaches could easily be accommodated within our implementation to compute the correspondence error. For example, the lexical similarity between not only the node labels but also their synonyms and hypernyms, obtained from WordNet, could be computed and the best similarity used in computing the correspondence error.

APPENDIX: OPTIMA USER GUIDE

USER INTERFACE LAYOUT

Visualizers: Optima has two visualizers, one for each of the ontologies. Optima uses the Welkin platform as its starting point.

Statistics: At the bottom of each visualizer, the following information is displayed:
- the number of nodes in the ontology
- the number of edges in the ontology
- matches: the number of matched nodes after the alignment
- elapsed time: the time it took to complete the alignment process

Figure A.1: Optima user interface

Predicates: Each predicate contained in the loaded RDF model is shown, grouped by the common prefixes of their URIs.

Resources: This panel is used to focus on the resources themselves. Two widgets perform actions on the relative URI:
- A checkbox turns on/off the visualization of the nodes (and the properties connected to them) associated with the resource. This is useful to restrict the model to only those resources that belong to a particular ontology or namespace.
- A slider changes the color of the nodes associated with the resource. This is useful to color-code different resources and isolate ontologies based on the information contained in the URI.

Charts: Three graph-theoretical properties of a node are charted (see the sketch after this section):
- In degree: the number of edges that point to the node.
- Out degree: the number of edges that start from the node.
- Clustering coefficient: the number of edges between the nodes within its neighbourhood divided by the number of edges that could possibly exist between them.

Tree structure visualization: Optima has an option of visualizing the ontologies in a tree format. It uses the breadth-first graph traversal algorithm. If the ontology has no root, Optima adds a virtual root node called "Top" to the ontology. It traverses all the nodes starting from the root to determine each node's position in the tree structure. Unlike the other forms of visualization, this provides a view of the hierarchical structure of the ontology.
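The following is a small sketch of how the three charted statistics could be computed over a directed adjacency-list graph; the representation is an assumption, and Welkin's own implementation may differ.

```java
import java.util.*;

/** Sketch of the three per-node statistics charted in the interface (illustrative only). */
public class NodeStats {

    static int outDegree(Map<String, Set<String>> g, String n) {
        return g.getOrDefault(n, Collections.emptySet()).size();
    }

    static int inDegree(Map<String, Set<String>> g, String n) {
        int d = 0;
        for (Set<String> targets : g.values())
            if (targets.contains(n)) d++;       // count sources pointing at n
        return d;
    }

    /** Edges present among the node's neighbours divided by the edges possible among them. */
    static double clusteringCoefficient(Map<String, Set<String>> g, String n) {
        Set<String> nbrs = g.getOrDefault(n, Collections.emptySet());
        int k = nbrs.size();
        if (k < 2) return 0.0;
        int links = 0;
        for (String a : nbrs)
            for (String b : g.getOrDefault(a, Collections.emptySet()))
                if (nbrs.contains(b)) links++;
        return (double) links / (k * (k - 1));  // directed graph: k(k-1) possible edges
    }
}
```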

Load ontology: The load ontology button performs the following actions:
1) Loads the source ontology.
2) Loads the target ontology.
3) Provides options to enter the alignment parameters and seed-map (details are specified in the seed-map and parameters sections below).

Figure A.2: The load dialog box

Clear: The clear button clears the visualizers so that Optima can be used to load a different set of ontologies.

Align button: The align button starts the alignment process.

Commands: The commands are divided into three tabs.

Drawing - contains the main commands that drive the activity and the drawing properties of the graph pane. The commands and their effects are:
- Active - turns on/off the activity of the graph pane.
- Circle - resets the position of all the nodes in a circle centered inside the graph pane.
- Scramble - resets the position of all the nodes randomly around the graph pane.

- Shake - adds a random shift from the current position. This is useful to "give the graph a kick": there are cases where the energy minimization algorithm encounters a local minimum rather than the global one. By shaking the nodes, you might force the graph to reach the global minimum, and thus a more meaningful representation of the clusters.
- Nodes - turns on/off the drawing of nodes.
- Edges - turns on/off the drawing of edges. This is very useful to speed up the calculations, since normally a lot of time is spent in drawing the edges. For large graphs it is useful to start the graph without edges and then turn them on once the graph has reached a more stable state.
- Antialias - turns antialiasing on/off. On some platforms, antialiasing can be very expensive; just like edges above, you can achieve faster drawing performance by disabling it during graph processing and re-enabling it when the graph reaches a more stable state.
- Background - turns background drawing for the graph panel on/off. On some platforms, disabling this gains a little time too.

Highlight - drives the search and highlighting of nodes that have literals or URIs matching particular substrings. The nodes that match get their labels turned on.

Parameters - fine-tunes the parameters of the graph activity algorithm.

Seed-map: To facilitate or steer the alignment, users may initially enter the seed matches by selecting pairs of nodes from the ontologies. The user is given an option to choose between specifying a seed-map and using a default seed-map.

Figure A.3: Status bar to choose the seed map

How do default values work? The general idea behind the default values is to generate the seed-map without the user worrying about it. The default seed-map is generated from the string similarity between the nodes of the two ontologies: if a node name of the source ontology has a 100% string match with a node name of the target ontology, those two nodes are added to the default seed-map for the alignment.
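A minimal sketch of this default seed-map generation follows, assuming the node names are available as plain strings; exactly matching names are paired, as described above.

```java
import java.util.*;

/** Sketch: default seed map built from exact (100%) string matches between node names. */
public class DefaultSeedMap {

    public static Map<String, String> generate(Collection<String> sourceNames,
                                               Collection<String> targetNames) {
        Set<String> targets = new HashSet<>(targetNames);
        Map<String, String> seed = new LinkedHashMap<>();
        for (String name : sourceNames)
            if (targets.contains(name))   // exact match between the two node names
                seed.put(name, name);
        return seed;
    }
}
```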

How do you choose a seed-map?

Clicking: A single click on a node displays the node name, which disappears when the button is released.

Double-clicking: Double-clicking a node highlights the node name; the name stays until the node is double-clicked again.

After loading the ontologies, the user should do the following:
1) Double-click a node in the source ontology and double-click a node in the target ontology to add the pair to the seed map. Continue this process until the complete seed map is selected.
2) Click OK in the status bar to have the seed map accepted; it is later used in the alignment.
3) If the user is not satisfied with the selected seed-map and wants to go with the default, they can do so by clicking the default seed-map button.

Parameters: Parameters control the alignment process.

Figure A.4: Status bar to enter parameters

How to set parameters? There are two parameters to be set to run the alignment algorithm.

1) Samples: the number of mixture model samples that should be generated at each iteration. A portion of these are generated by heuristics and the rest randomly.
2) Iterations: the number of iterations the algorithm should be run to obtain the results of the alignment.

The user can then click the OK button to accept the values and use them in the algorithm, or go with the default values by clicking the default values button.

What are the default values? The default values for the samples and iterations are 3 and 2, respectively.

Status bar: The status bar is present above the visualizers, beside the load and align buttons. It shows the status of the tool at any particular time:
1) Loading ontologies: displayed when the load button is clicked and the user is in the process of loading the ontologies.
2) Seed-map and parameter status: after the submit button in the load dialog box is clicked, the status bar displays the user's choice of seed-map and parameters.
3) Alignment completed: displayed after the alignment is finished.

Alignment dialog: After the alignment of the ontologies is started, the alignment dialog is displayed, showing the user the progress of the alignment. After the alignment completes, the dialog disappears.

Figure A.5: Alignment progress dialog

Figure A.6: Optima displaying the matched nodes


More information

Computer vision: models, learning and inference. Chapter 10 Graphical Models

Computer vision: models, learning and inference. Chapter 10 Graphical Models Computer vision: models, learning and inference Chapter 10 Graphical Models Independence Two variables x 1 and x 2 are independent if their joint probability distribution factorizes as Pr(x 1, x 2 )=Pr(x

More information

SKOS. COMP62342 Sean Bechhofer

SKOS. COMP62342 Sean Bechhofer SKOS COMP62342 Sean Bechhofer sean.bechhofer@manchester.ac.uk Ontologies Metadata Resources marked-up with descriptions of their content. No good unless everyone speaks the same language; Terminologies

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Classification and Clustering Classification and clustering are classical pattern recognition / machine learning problems

More information

Opus: University of Bath Online Publication Store

Opus: University of Bath Online Publication Store Patel, M. (2004) Semantic Interoperability in Digital Library Systems. In: WP5 Forum Workshop: Semantic Interoperability in Digital Library Systems, DELOS Network of Excellence in Digital Libraries, 2004-09-16-2004-09-16,

More information

Cluster-based Similarity Aggregation for Ontology Matching

Cluster-based Similarity Aggregation for Ontology Matching Cluster-based Similarity Aggregation for Ontology Matching Quang-Vinh Tran 1, Ryutaro Ichise 2, and Bao-Quoc Ho 1 1 Faculty of Information Technology, Ho Chi Minh University of Science, Vietnam {tqvinh,hbquoc}@fit.hcmus.edu.vn

More information

Ontology matching using vector space

Ontology matching using vector space University of Wollongong Research Online University of Wollongong in Dubai - Papers University of Wollongong in Dubai 2008 Ontology matching using vector space Zahra Eidoon University of Tehran, Iran Nasser

More information

New Approach to Graph Databases

New Approach to Graph Databases Paper PP05 New Approach to Graph Databases Anna Berg, Capish, Malmö, Sweden Henrik Drews, Capish, Malmö, Sweden Catharina Dahlbo, Capish, Malmö, Sweden ABSTRACT Graph databases have, during the past few

More information

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques 24 Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques Ruxandra PETRE

More information