SeMap: A Generic Schema Matching System

by Ting Wang
B.Sc., Zhejiang University, 2004

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in The Faculty of Graduate Studies (Computer Science)

The University Of British Columbia
August, 2006
© Ting Wang 2006

Abstract

The rapidly growing number of autonomous data sources on the web makes the need for effective tools for creating semantic mappings increasingly crucial. Moreover, the goal of allowing applications to have more expressive semantics requires a change in focus. Because most previous work focuses on creating mappings in specific data models for data transformation, it fails to capture a richer set of possible relationships between schema elements. For example, current schema matching approaches might discover that TA in one schema is equal to grad TA in another, even though the relationship can be modeled more accurately by saying that grad TA is a specialization of TA. Capturing these richer semantics in the mapping in turn enables applications that rely on richer semantics.

In this thesis we concentrate on the following problem: given initial match (correspondence) information produced by current schema matching techniques, how can we construct a complex, semantically richer mapping that can be used across data models? Specifically, we aim at detecting the relationship types Has-a, Is-a, Associates and Equivalent. Technically, we achieve this goal in three main steps: (1) exploiting various types of semantic evidence for possible matches; (2) finding a globally optimal match assignment; (3) identifying the relationships embedded in the selected matches. We implemented our semantic matching approach within a prototype system, SeMap, and tested its accuracy and effectiveness.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements

1 Introduction
  Motivation
  Contribution
  Organization

2 Related Work
  Relationship Classification
  Equivalence Relationships
  Set-Theoretic Relationships
  Generic Relationships
  Schema Matching Techniques
  Rule-Based Solutions
  Learning-Based Solutions
  Ontology Alignment Techniques
  Sample Prototypes
  Rondo
  Cupid
  COMA
  iMAP

3 Problem Formulation
  Representation
  Problem Statement
  Semantic Resources
  Internal Resources
  External Resources
  Approach Overview
  Schema Matcher
  Match Selector
  Mapping Assembler

4 SeMap System
  Schema Matcher
  Base Matcher
  Similarity Score and Lineage Information
  Element-Level Matcher
  Structure-Level Matcher
  Architecture of Schema Matcher
  Match Selector
  Representation
  Bidirectional search
  Modeling user interaction
  Mapping Assembler
  Combining Map_s and Map_t
  Identifying relationships
  Assembling mapping

5 Experimental Analysis
  Experimental Setting
  Data Set
  Expert Mapping
  Evaluation Metrics
  Experimental Methodology
  Experimental Result
  Matching Accuracy
  Component Contribution
  Incorporating User Feedback
  Discussion

6 Conclusion & Future Work

Bibliography

List of Tables

- Characteristics of the input schemas
- Characteristics of the expert mappings

List of Figures

- 1.1 An example of input schemas and output mapping.
- 2.1 A classification of current schema matching techniques. Courtesy of [22].
- Representation of a model. The left plot shows a graphical representation of a model, comprised of nodes (elements) and edges (relationships). The right table shows the tuple representation of edges.
- Illustration of four relationship types handled by SeMap.
- An example of complex mapping handled by SeMap.
- Illustration of the matching process.
- The basic system architecture of SeMap. It takes two models and external resources as input, and produces a generic semantic mapping. It consists of three main parts: the schema matcher, the match selector and the mapping assembler.
- Architecture of the schema matcher. It consists of three layers: base matcher, combining layer and structure matcher.
- Partial match assignments from the perspectives of source and target schemas respectively.
- Mapping assembling for matches of different types. Each 1-1 equivalence match corresponds to one mapping element, while each element of a complex match is associated with one mapping element.
- Matching accuracy of SeMap. The three plots show the recall, precision and F-measure of the matching results for the three relationship types Equivalent, Has-a, Is-a and total correct matches respectively.
- Error analysis of the resulting mappings.
- 5.3 The precision of SeMap after pruning incorrect matches. The bars from left to right show the matching results for the three relationship types Equivalent, Has-a, Is-a respectively.
- Relative contribution of different types of semantic evidence to the matching results of SeMap. The two plots (from top to bottom) show the F-measure of identified matches (correspondences) and identified relationships respectively.
- F-measure of correct correspondences versus the amount of user interaction (percentage of expert matches provided over the total number of matches). The curves for four datasets (Real Estate 1/2, Course Info 1/2) are shown.

Acknowledgements

I would like to express my gratitude to all those who have offered me help in completing this thesis. In particular, I owe the greatest thanks to my supervisor Rachel Pottinger, who provided me with excellent guidance and support throughout this thesis project. I want to thank Dr. Tsiknis for giving me insightful comments on this work and for being my second reader. I would also like to thank all the members of the database management lab, especially Jian Xu, for their constructive suggestions. Without their help, this work would not have been possible. Finally, I thank all my friends at the University of British Columbia. It has been a wonderful experience to grow up with them.

Chapter 1 Introduction

1.1 Motivation

Spurred by the growth of data sources on the web, information systems are witnessing a paradigm shift from monolithic databases to heterogeneous, interacting data sources. The fundamental problem in sharing data from multiple sources is dealing with the semantic heterogeneity inherent in their autonomous nature, and the key is identifying the semantic correspondences between them. The operation of finding such correspondences is called Match, which takes two schemas as input and produces a semantic mapping specifying the relationships between elements of the two schemas. Such semantic mappings play a crucial role in numerous data sharing applications, including web data integration, schema evolution and migration, and component-based development.

Currently, the creation of semantic mappings, especially complex ones, is still mostly done manually, possibly supported by a graphical user interface. Manually creating semantic mappings is a tedious, error-prone process, and the labor required grows linearly with the number of matches to be performed. Hence the rapidly increasing number of web data sources necessitates automatic support for schema matching.

The problem of semi-automatically creating mappings has attracted intensive research in both the database and AI communities [2, 4, 10, 15, 28]. The procedure is comprised of two phases, schema matching and mapping construction. In schema matching, equivalence correspondences between elements of both schemas are identified. The equivalence correspondences can be one-to-one (1-1) matches, e.g., class corresponds to course, or complex matches containing more than one element in each schema, e.g., TA maps to some combination of grad TA and ugrad TA. Note that the focus of schema matching is to find such potential correspondences, rather than to give a final mapping to the users. Finding this mapping is done in mapping construction, where the identified correspondences are built on by adding more specific semantic information to generate a semantically rich mapping.

[Figure 1.1: An example of input schemas and output mapping. The figure shows schema S, schema T and the mapping between them, with mapping elements m1 to m9 linked to the schema elements by Has-a, Is-a and Associates edges.]

As a typical example of mapping construction, Clio [32] includes a set of user-interaction techniques to create SQL-style mappings, based on the output of an initial schema match. Such semantic mappings are necessary to transform data. Clio, however, like most other previous work on mapping construction, is restricted to relational and XML-style schemas; it does not capture the general richness of the possible relationships between elements in a data-model-independent fashion. Thus, although many common relationship types exist across SQL and XML (e.g., specialization), this work cannot be used to create the XML-style mapping. Data sources on the web, however, use a variety of data models, e.g., XML, HTML, RDF, ontologies and text. Hence exploring how to create richer, general relationships between schema elements, rather than concentrating on the specific data model under consideration, allows us to understand the general space of possibilities. It also allows better reuse of ideas, since one does not have to create a separate algorithm for each new data model.

After a mapping with such general relationships is constructed, the transformations into a specific data model can be made more concretely. For example, the mapping can be easily transformed into specific forms, e.g., SQL views or XSLT transformations, thus removing the need to maintain specific mappings separately. Also, a generic mapping can create a uniform interface between domain knowledge (ontologies) and web interfaces (database schemas), which is helpful for semantic web applications. Furthermore, it can be fed into a model management system [17], which aims to solve meta-data problems in a data-model-neutral fashion, or used for knowledge inference when applied to the ontology domain.

An example of a generic semantic mapping is shown in Figure 1.1, where two schemas S and T represent the concepts of class and course respectively. A generic mapping between S and T is constructed, specifying a rich collection of semantic relationships between the elements of S and T, e.g., college of T Has-a dept of S, while instructor of S Is-a faculty of T. The relationship types adopted in this thesis follow the relationship classification of [21]. Compared with the equivalence relationships (1-1 or complex) considered in previous literature, this relationship classification is semantically richer and more expressive.

Equipped with such generic mappings, one can envision a number of applications. For example, one problem facing current semantic web applications is the lack of domain-specific knowledge (e.g., ontologies). If domain knowledge in different representations can be mutually converted, the collection of knowledge will be significantly enriched.

1.2 Contribution

In this thesis we explore constructing such generic semantic mappings, based on initial match information that shows correspondences between the elements of both schemas. This initial match information can be produced by current schema matching techniques. Mapping construction takes as input a set of initial matches produced by a set of schema matching algorithms, and generates a semantically richer mapping, such as the one in Figure 1.1, which describes complex relationships between elements of both schemas. Specifically, mapping construction is responsible for searching for a globally optimal match assignment from the pool of possible assignments, resolving the conflicts among the selected matches, and identifying the complex relationships between the schema elements, e.g., the Has-a relationships in Figure 1.1. However, constructing a generic semantic mapping is fundamentally difficult for several reasons:

- Finding correspondences with generic semantic relationships is substantially harder than finding simple equivalence, since the space of possibilities under consideration is much larger and more semantic evidence is needed;
- The pool of initial matches is possibly quite large. Even when only n:1 equivalence matches are considered, the search space is large enough that most matching algorithms restrict themselves to 1:1 matches; when relationships other than simple equivalence are considered, it is infeasible to try all possible combinations to find the optimal assignment;
- Various semantic constraints can be imposed, rendering match selection a complicated constrained optimization problem;
- Identifying the relationships implicit in matches is a hard problem, and one that is made more difficult by attempting to make our output data-model independent.

As in schema matching, mapping construction inherently cannot be fully automatic. The importance of user feedback is recognized in schema matching research [4, 31]; however, no systematic modeling of user interaction for mapping construction is available to date. One of the goals of our work is to limit interaction to critical points to help focus user attention and minimize user effort.

To overcome the problems listed above, in this thesis we describe a prototype system, SeMap, that creates a generic semantic mapping. We choose a graph-based representation similar to that used in model management [17], which is expressive enough to accommodate both schemas of many types and other meta-data, such as ontologies. Specifically, we make the following contributions:

- An architecture for semi-automatically constructing generic semantic mappings based on initial correspondence information;
- A novel probabilistic framework that incorporates match uncertainty and semantic constraints in a uniform way, and formulates match selection as a mathematical optimization problem;
- Effective modeling of user interaction to help focus user attention and minimize user effort, by detecting critical points where feedback is maximally useful;
- An effective solution for extracting the implicit relationships of initial matches based on various types of semantic evidence;
- A prototype system embodying the innovations above and a set of experiments to illustrate the correctness and effectiveness of our approach.

1.3 Organization

This thesis is a specification of our schema matching system SeMap. The goal is to present the technical details of implementing the system. Specifically, we intend to make clear the following three aspects:

1. The formulation of the problem, including the exact representation of the input/output of the system, the resources we use and the assumptions we have made;
2. The specification of the system, including the system architecture, the exact input/output and internal structure of each component, and their interaction;
3. The experimental analysis, including the datasets we use, the metrics we use to evaluate our approach, and the experimental results and their explanation.

The remainder of the thesis is organized as follows: Chapter 2 presents a survey of related work; Chapter 3 formally defines the problem of mapping construction and gives an overview of the architecture of our system; Chapter 4 describes our mapping construction approach in more detail; Chapter 5 presents the experimental analysis of our approach; and Chapter 6 concludes this thesis and presents future work.

Chapter 2 Related Work

Semi-automatically creating semantic mappings has attracted intensive research in both the database (schema matching) and AI (ontology alignment) communities. The key differences and similarities between schema matching and ontology alignment include:

- Differences. Ontologies are logical systems that obey some formal semantics, i.e., they can be interpreted as a set of logical axioms; database schemas, however, often provide no explicit semantics for their data.
- Similarities. Schemas and ontologies are quite similar in the sense that (1) they both provide a vocabulary of terms that describe a domain of interest and (2) they both constrain the meaning of terms used in the vocabulary [30].

Due to their differences, schema matching is usually performed with techniques that guess the semantics implicit in the schemas, while ontology alignment is designed to exploit the knowledge explicitly encoded in the ontologies. Their similarities, however, make the solutions to these two problems mutually beneficial. In the following, we discuss the problems of schema matching and ontology alignment as a whole.

In this chapter, we present a survey of related work in three parts: first, in Section 2.1 we classify the current schema matching/ontology alignment techniques based on the relationships they can handle; we then discuss some typical techniques used in these approaches, specifically, schema matching in Section 2.2 and ontology alignment in Section 2.3; finally, we present several example prototype matching systems in Section 2.4.

2.1 Relationship Classification

The relationship types created by matching techniques can be roughly divided into three categories: equivalence relationships, set-theoretic relationships and generic relationships. Specifically, two schema elements having the equivalence relationship are semantically equivalent, and the techniques to identify equivalence relationships are described in Section 2.1.1; the set-theoretic relationship classification regards each schema element as a set, and specifies their relationship as one of equivalence, subsumption, intersection, disjointness and incompatibility, which is discussed in Section 2.1.2; the generic relationships refer to non-equivalence relationships, such as the Has-a and Is-a relationships discussed in this thesis. Two typical classifications of generic relationships can be found in ontology modelling [18] and meta-data management [21]. The techniques developed so far to handle generic relationships are presented in Section 2.1.3.
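To make the set-theoretic view concrete, the following is a minimal Python sketch, not taken from any of the surveyed systems, that treats each schema element as the set of its data instances and reports which relationship holds between two such sets. The names `SetRelation` and `classify_sets` are hypothetical, and the sketch covers only equivalence, subsumption, intersection and disjointness; incompatibility is left out because the text above does not define it precisely.

```python
from enum import Enum


class SetRelation(Enum):
    """Set-theoretic relationship types (hypothetical names for illustration)."""
    EQUIVALENCE = "equivalence"
    SUBSUMPTION = "subsumption"
    INTERSECTION = "intersection"
    DISJOINTNESS = "disjointness"


def classify_sets(a: set, b: set) -> SetRelation:
    """Classify the relationship between two instance sets.

    Assumption: both sets are drawn from the same universe of instances,
    otherwise comparing containment is not meaningful.
    """
    if a == b:
        return SetRelation.EQUIVALENCE
    if a <= b or b <= a:
        # One set contains the other.
        return SetRelation.SUBSUMPTION
    if a & b:
        # The sets overlap, but neither contains the other.
        return SetRelation.INTERSECTION
    return SetRelation.DISJOINTNESS


# Usage: instances of "TA" versus instances of "grad TA" (made-up data).
ta = {"alice", "bob", "carol"}
grad_ta = {"alice", "bob"}
relation = classify_sets(grad_ta, ta)
```

The check order matters: equality is tested before containment so that identical sets are not reported as subsumption.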

2.1.1 Equivalence Relationships

With the main goal of data transformation in specific data models, most schema matching/ontology alignment algorithms to date aim at discovering equivalence relationships [2, 3, 4, 10, 13, 14, 16, 31]. The equivalence correspondence found can be a 1-to-1 match (e.g., course = class) or a complex match (e.g., name = concat(first-name, last-name)). Creating multi-arity (1-to-n or even n-to-m) matches is significantly harder than creating 1-to-1 matches for several reasons: (1) while the number of candidate 1-to-1 matches is bounded by the product of the sizes of the two schemas, the number of candidates to be considered in the complex case is much larger; (2) it is inherently difficult to generate a match to start with in the multi-arity case. That is, for an n-to-m match, it is difficult to determine n and m in order to generate a set of candidate matches. Hence, to date most work on schema matching has focused on discovering 1-to-1 equivalence correspondences between schema elements [3, 4, 10, 13, 14, 16, 31]. R. Dhamankar et al. [2] proposed iMAP, a prototype for identifying 1-to-n correspondence matches, which reformulates schema matching as a search in an often very large match space. To search effectively, it employs a set of searchers, each discovering specific types of complex matches.

However, although these techniques attempt to discover semantically equivalent correspondences, the matches they identify may not be exact equivalence relationships; they may instead be the semantically richer relationships we are endeavoring to find, such as the relationship between TA and {grad TA, ugrad TA} shown in Figure 1.1.

2.1.2 Set-Theoretic Relationships

The equivalence relationship can be considered a special case of the set-theoretic relationships, which specify the relative containment relationship between two sets. In [26], an effective solution is proposed to identify inter-set relationships by bidirectionally comparing the containment of data instances and meta-data of different schema elements. The problem with this approach is that the data instances associated with the two schemas must be in the same universe; otherwise the comparison of containment relationships is not meaningful. However, in many applications, especially web data integration, the data sources do not overlap.

2.1.3 Generic Relationships

There has been very little work on finding generic relationships between schema elements. The solution proposed by D. Embley et al. [5] relies heavily on a domain-specific ontology to find the relationships of Merge/Split (e.g., Address consists of Street, City and State), Superset/Subset (e.g., Phone contains both Phone day and Phone evening), and Set-Name as Value (e.g., the attribute Water-front in one schema appears as a value of the attribute House-description in the other schema). The basic idea is to first map the schema elements to a comprehensive domain-specific ontology; the relationships between schema elements can then be determined by those of their counterparts in the ontology. This approach requires (1) a comprehensive ontology that covers all possible concepts that may appear in schemas in that domain; (2) a domain-specific thesaurus that can map schema elements to their alternative representations in the ontology. Such an ontology and thesaurus are usually hard to obtain in real scenarios. Our work has fairly simple requirements for the needed semantic information, which is available in most schemas, and does not assume any comprehensive ontology. Nevertheless, the existence of such an ontology can improve the quality of the matching results of our system.

F. Giunchiglia et al. [7] proposed the concept of semantic matching, a purely schema-based approach. The basic idea is to first populate each element name with its meanings from some domain-specific dictionaries, and then compute the specialization relationship of schema elements based on the containment relationship of their meanings. Their approach, however, works only for identifying Is-a relationships and for tree-structured schemas.

2.2 Schema Matching Techniques

The research on schema matching/ontology alignment provides a wealth of techniques to semi-automatically find semantic matches. The techniques can be classified by the information they exploit [22], as shown in Figure 2.1: matches can be found by exploiting one type of semantic evidence (schema-level, data instance-level, etc.), or by combining multiple types of evidence (i.e., hybrid matchers, which integrate multiple matching criteria, and composite matchers, which combine results of independently executed matchers [22]). Matching techniques can also be classified by their methodologies into rule-based and learning-based solutions, which are discussed in Sections 2.2.1 and 2.2.2 respectively.

[Figure 2.1: A classification of current schema matching techniques. Courtesy of [22]. The tree divides schema matching techniques into individual matchers (schema-level and instance-level, operating at the element level or the structure level, with leaf examples such as name similarity, type similarity, graph matching, value patterns, data distribution and frequent terms) and combining matchers (hybrid matchers, and combinations of individual matchers performed manually through iterative user feedback or automatically through matcher selection and result combination).]

2.2.1 Rule-Based Solutions

Rule-based matching techniques [7, 14, 16] constitute a rich collection of schema matching solutions, which have been used in both early and current matching applications. Rule-based techniques discover similar schema elements by exploiting schema-level information using hand-crafted rules. A broad variety of rules have been devised to exploit all possible information, including element name (label), data types, structures, number of subelements, and integrity constraints. For example, F. Giunchiglia et al. [7] proposed exploiting the semantic meanings of element names to discover similar elements; Cupid [14] employs rules that categorize elements based on names, data types and domains; Similarity Flooding [16] measures pairwise similarity by propagating similarity from some fixed points according to the schema structures.
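As an illustration of what a hand-crafted rule over schema-level information can look like, here is a minimal Python sketch. It is not the rule set of Cupid, Similarity Flooding or any other cited system: the `Element` type, the `rule_based_score` function and the 0.7 weighting are assumptions made for the example, and name similarity is approximated with the standard library's `difflib.SequenceMatcher`.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Element:
    """A schema element described only by schema-level information (hypothetical type)."""
    name: str
    data_type: str


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rule_based_score(src: Element, tgt: Element, name_weight: float = 0.7) -> float:
    """Combine two hand-crafted rules: similar names and identical data types.

    The weighting is an arbitrary assumption for illustration.
    """
    name_score = name_similarity(src.name, tgt.name)
    type_score = 1.0 if src.data_type == tgt.data_type else 0.0
    return name_weight * name_score + (1.0 - name_weight) * type_score


# Usage: score one source element against every target element.
source = Element("grad_TA", "string")
targets = [Element("TA", "string"), Element("course", "string")]
ranked = sorted(targets, key=lambda t: rule_based_score(source, t), reverse=True)
```

No training data is involved: the score depends only on names and declared types, which is what makes such rules cheap to apply.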

The rule-based techniques have some desirable features: (1) they are usually computationally inexpensive and, unlike learning-based approaches, require no training process; (2) they usually require only schema-level information, which is available in most matching scenarios; (3) if some domain knowledge is available, one can specify domain-specific rules, which can work very well in certain types of applications. For example, users can write regular expressions that encode times or phone numbers, or quickly compile a collection of zip codes to help recognize these types of entities. Learning methods, in contrast, can hardly deal with these scenarios: they either cannot learn some complex rules, or require a large amount of training data with the correct representation of the desired result, which is usually hard to obtain.

However, the rule-based techniques have several drawbacks: (1) they cannot effectively exploit data instance-level information, even though the data instances provide valuable information, e.g., precise data format, data distribution and statistical values. In some cases the schema-level information is opaque or very difficult to interpret, e.g., element names like A or B1 are too abstract to be interpreted. In contrast, learning methods such as Naive Bayes can easily construct probabilistic rules that find similarity in such scenarios, based on the distribution of data instances [11]; (2) moreover, rule-based techniques cannot exploit previous matching results to improve the current matching process. Hence, in a matching application for a specific domain, rule-based techniques alone are usually insufficient.
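The domain-specific rules mentioned above (regular expressions for times or phone numbers, a compiled list of zip codes) can be sketched in Python as follows. This is only an illustration under stated assumptions: the patterns target North American phone numbers and 24-hour times, the zip code set is a made-up sample, and the names `recognize` and `same_entity_type` are hypothetical.

```python
import re

# Hand-written recognizers; the patterns and the zip list are illustrative assumptions.
PHONE_RE = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")
TIME_RE = re.compile(r"^([01]?\d|2[0-3]):[0-5]\d$")
KNOWN_ZIP_CODES = {"98195", "10027", "94305"}  # made-up sample collection

RECOGNIZERS = {
    "phone": lambda v: bool(PHONE_RE.match(v)),
    "time": lambda v: bool(TIME_RE.match(v)),
    "zip": lambda v: v in KNOWN_ZIP_CODES,
}


def recognize(values, threshold=0.8):
    """Return the entity type accepted by at least `threshold` of the values, else None."""
    if not values:
        return None
    best_label, best_fraction = None, 0.0
    for label, accepts in RECOGNIZERS.items():
        fraction = sum(1 for v in values if accepts(v.strip())) / len(values)
        if fraction > best_fraction:
            best_label, best_fraction = label, fraction
    return best_label if best_fraction >= threshold else None


def same_entity_type(column_a, column_b):
    """Rule: two columns are match candidates if both are recognized as the same type."""
    label = recognize(column_a)
    return label is not None and label == recognize(column_b)
```

A rule like `same_entity_type` can pair two columns even when their names are opaque, provided sample values are available to the recognizers.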

2.2.2 Learning-Based Solutions

Motivated by the drawbacks of rule-based matching methods, a collection of learning-based solutions have been proposed. These methods consider a variety of learning techniques and exploit both schema-level and data instance-level information. For example, Doan et al. proposed the LSD system, which applies the Naive Bayes learning method to data instances and also exploits the structure information of the XML data format; the iMAP system [2] pays attention to the description of elements, in addition to other schema information.

In developing learning techniques for schema matching, it has been realized that considering only the schema-level or data instance-level evidence in the schemas being matched is often insufficient for accurate matching. Hence, several types of external resources have been considered to improve the matching quality. For example, assuming a domain-specific ontology is available, one technique is to first map the schemas/ontologies into that ontology, and then construct the matches based on the relationships inherent in it [5]. For example, it is hard to identify the relationship between direct and free toll by using regular approaches such as string comparison. However, by mapping them to a domain-specific ontology, one can find that they are both specializations of the concept phone, and conclude that direct is highly similar to free toll.

Some recent work advocates exploiting past matching results to improve current ones [3, 4], with the basic idea of learning from past matches to predict unseen matching scenarios. An alternative solution considers learning from a corpus of schemas and matches [14]. Such a corpus provides alternative representations of concepts in the domain, i.e., it functions in the same way as an ontology, and thus can be leveraged to discover similarity between schema elements. However, it is not always practical to obtain such external resources, particularly since these resources must be domain-specific to be effective.

2.3 Ontology Alignment Techniques

Ontology alignment deals with finding corresponding concepts in different ontologies. In this section, we present some typical work on ontology alignment; we refer the reader to [10] for a comprehensive survey.

OntoMorph [1] focuses on the problem of translating symbolically represented knowledge between different knowledge representations. It uses a description-logic-based approach, and offers syntactic rewriting to support the translation between two different knowledge languages, and semantic rewriting to support inference-based transformation. OntoMorph requires users to provide transformation rules, and thus can be regarded as one type of rule-based technique.

Prompt [20] proposes an ontology alignment mechanism that finds corresponding concepts by refining an initial mapping (pairs of anchors) given by users or by some simple linguistic matching approaches. Specifically, it analyzes the paths in sub-graphs limited by the anchors and determines which concepts frequently appear in similar positions on similar paths. The philosophy followed by Prompt is similar to that of Similarity Flooding [16].

FCA-Merge [28] is an example of an alignment technique that depends on external resources. The resources used in FCA-Merge are domain-specific documents that cover the concepts in the ontologies. Through natural language analysis techniques, it generates a formal context for each ontology, which indicates which documents contain which concepts. Based on these formal contexts, the Is-a relationships between concepts are inferred. However, since the formal context is built upon the generalization/specialization hierarchy of the concepts, this approach cannot be extended to other relationships, such as Has-a. Moreover, the requirement of domain-specific documents cannot always be met.

MAFRA [15] proposes a framework for sharing distributed ontologies via mapping. A multi-strategy process is employed to calculate the similarities between ontology entities, including lexical similarity and property similarity (attributes or relations). Both top-down and bottom-up similarity propagation are employed. This can be considered a counterpart of the hybrid matching techniques in schema matching.

To the best of our knowledge, though an ontology itself can have complex relationships, e.g., Has-a or Is-a, the focus of most previous work on ontology alignment is finding semantically equivalent concepts, or one specific type of relationship (e.g., Is-a in FCA-Merge), in different ontologies [10], rather than discovering corresponding concepts with more types of generic relationships; the rich relationships in ontologies are used only as one type of semantic evidence.

26 Chapter 2. Related Work 2.4 Sample Prototypes In this section, we consider some recent prototype of schema matching systems Rondo Rondo [17] is a complete prototype of generic model-management system, in which high-level operators are used to manipulate models and mappings between models. As one of its main operators, match is implemented using the Similarity Flooding (SF) algorithm [16]. SF utilizes a hybrid matching approach based on the idea of similarity propagation. It starts from a stringbased comparison, e.g., common prefixes, suffixes, of the schema elements names to get an initial mapping, which is further refined using a fix-point computation. The matching process is well formulated as a mathematical optimization problem in SF Cupid Cupid [14] implements a hybrid matching algorithm that analyzes syntactic information at elements (e.g., string prefixes, suffixes), and structure information of schemas (e.g., tree matching weighted by leaves). Moreover, it exploits external resources, i.e., a pre-compiled thesaurus COMA COMA [3] is a composite schema matching system. It provides a matcher library composed of different matching algorithms. Its framework allows 18

the combination of partial results. The matcher library can be extended by adding new matching algorithms. Specifically, it contains 6 elementary matchers, 5 hybrid matchers and one reuse-oriented matcher. Compared with Cupid, this reuse-oriented matcher is a novel algorithm, which tries to leverage previously obtained results for new schemas.

iMAP

iMAP [2] is a matching system that considers 1-to-n equivalence matches. The authors regard the problem of matching as a search in a usually infinite match space. The overall goal is achieved in three steps: (1) a set of basic matchers, called searchers, are employed to detect similar elements according to different criteria (e.g., linguistic similarity, numerical equivalence, etc.). Specifically, for each element in the target schema, the searchers find a set of similar elements in the source schema, including 1-to-1 and n-to-1 matches. (2) The match candidates generated in the first step are evaluated by a similarity evaluator module, and the result is a similarity matrix that indicates the similarity between each target element and its match candidates. (3) A match selector module selects the best match candidate as the final result. iMAP also provides an explanation module that can explain the generated matches, e.g., the reason a match was selected, the implicit equivalence relationship, etc.

To the best of our knowledge, most previous work on schema matching focuses on one-to-one equivalence relationships when finding semantic correspondences between two schemas. Little work has been done on identifying multiple types of

complex relationships. In the following chapters, we present SeMap, a prototype schema matching system designed to find generic semantic correspondences.
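The three-step structure described for iMAP above (searchers propose candidates, an evaluator scores them, a selector picks the best) can be illustrated with a toy pipeline. This is a sketch under stated assumptions, not iMAP's implementation: the two searchers, the token-overlap score and all element names are hypothetical stand-ins chosen only to show how 1-to-1 and n-to-1 candidates flow through the steps.

```python
from itertools import combinations


def tokens(name):
    """Split an underscore-separated element name into lowercase tokens."""
    return set(name.lower().split("_"))


def single_searcher(source_elements):
    """Step 1a: propose every single source element (1-to-1 candidates)."""
    return [(s,) for s in source_elements]


def pair_searcher(source_elements):
    """Step 1b: propose every pair of source elements (n-to-1 candidates)."""
    return list(combinations(source_elements, 2))


def evaluate(target, candidate):
    """Step 2: Jaccard overlap between target tokens and candidate tokens."""
    t = tokens(target)
    c = set().union(*(tokens(s) for s in candidate))
    return len(t & c) / len(t | c)


def match(target_elements, source_elements, searchers):
    """Run all searchers, build one similarity row per target, pick the best."""
    best = {}
    for target in target_elements:
        candidates = [c for search in searchers for c in search(source_elements)]
        row = {c: evaluate(target, c) for c in candidates}  # one matrix row
        best[target] = max(row, key=row.get)  # Step 3: select
    return best


best = match(
    ["street_city", "phone"],
    ["street", "city", "phone"],
    [single_searcher, pair_searcher],
)
```

In this toy setting the target `street_city` is matched to the pair `("street", "city")`, an n-to-1 match, while `phone` is matched one-to-one. All selected matches are still equivalence matches, which is the limitation noted above.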

Chapter 3

Problem Formulation

As discussed in Chapter 2, most work on schema matching so far focuses on finding one-to-one equivalence relationships between schema elements. The overall goal of our schema matching system, SeMap, is to identify a generic semantic mapping between two schemas. By generic semantic mapping we mean that (1) the matches may be non-one-to-one, e.g., one element may be mapped to multiple elements of the other schema (1-to-n matches); and (2) the relationship types may be other than equivalence, e.g., Has-a or Is-a, as classified in the Vanilla meta-meta model [21].

An example of a generic semantic mapping is shown in Figure 1.1, where two schemas represent the concept of class/course in different ways. The mapping contains complex correspondences; for example, TA of schema S is mapped to both undergrad TA and grad TA of schema T. The relationship types involved also go beyond the equivalence relation considered in most schema matching approaches; e.g., the department of schema S is considered a member of the college of schema T.

[Figure 3.1: Representation of model. Panel (a) shows a graphical representation of a model, comprised of nodes (elements) and edges (relationships). Panel (b) is a table showing the tuple representation of the edges, with columns s, p and o.]

3.1 Representation

In this thesis we consider how to form a generic semantic mapping. Because we are attempting to solve this problem in a data-model-neutral fashion that could be applied equally well to relational schemas, XML schemas or ontologies, we adopt the terminology of Model Management [17] and say that we take as input two models (in what follows, we use schema and model interchangeably). A model is a complex design artifact, such as a relational schema, an XML schema, an XML DTD, or an ontology. Technically, a model can be represented as a directed labelled graph (V, E). Specifically, V is the set of nodes, each denoting an element of the schema, e.g., an attribute in a relational database table, a type definition in an XML schema, a clause of an SQL statement, etc. E is the set of binary, directed, typed edges over V. Formally, each edge is a tuple <s, p, o>, where s is the source node, o is the target node, and p is the type of the edge, which denotes the relationship between s and o (the notation <s, p, o> follows the <subject, predicate, object> notation used in ontologies). An

example of model representation is depicted in Figure 3.1, which illustrates the concept of course.

As indicated in [19, 24], in addition to the Equivalent relationship, the concepts of generalization/specialization and part-of/whole have long been recognized as ubiquitous and essential mechanisms in object-oriented modeling techniques, with a wide range of applications, such as CAD, manufacturing, software development and computer graphics. In this thesis, we follow the relationship classification of the Vanilla meta-meta model [21], which embeds the concepts of generalization/specialization and part-of/whole. Specifically, the Vanilla meta-meta model has five relationship types, namely Associates, Is-a, Has-a, Contains, and Type-of. In this thesis, we concentrate on the first four, Associates, Is-a, Has-a and Contains, where Is-a represents the concept of generalization/specialization, Contains and Has-a represent the concept of part-of/whole, and Associates represents all other weak semantic relationships.

Strictly speaking, though both Has-a and Contains embed the concept of part-of/whole, they differ in semantics. As indicated in [19], part-of relationships can be categorized along two dimensions: (1) the degree of sharing of parts among whole objects and (2) the degree of dependence between part objects and whole object(s). Contains and Has-a differ in the second dimension: part objects are highly dependent on whole object(s) in Contains, while this dependence is not as strong in Has-a. As a consequence, in a Contains relationship the containee is a part of its container element and cannot exist on its own (delete propagation). Moreover, Contains is a transitive relationship

and must be acyclic, while Has-a is weaker than Contains in that it does not propagate deletion and can be cyclic. Since we focus on the high-level part-of/whole relationship, we treat Has-a and Contains as the same in our framework.

In addition, we also consider the equivalence relationship, which is the main focus of previous schema matching approaches. In total, our framework therefore considers four relationship types: Equivalent, Has-a, Is-a, and Associates. Their formal definitions are specified as follows, and their graphical representation is shown in Figure 3.2:

Equivalent: $E(x, y)$ means that x is semantically equivalent to y. This is a symmetric relationship type, i.e., $E(x, y) \Leftrightarrow E(y, x)$.

Has-a: $H(x, y)$ indicates that x has y as a sub-component or member. This is an asymmetric relationship, i.e., $H(x, y)$ does not imply $H(y, x)$.

Is-a: $I(x, y)$ means that x is a specialization of y. This is an asymmetric relationship.

Associates: $A(x, y)$ indicates that x is associated with y. It is the weakest relationship that can be expressed, and it has no constraints or special semantics. This is a symmetric relationship type.

This representation is complex enough to capture many of the semantic relationships that appear in models, and yet is simple enough for a reasonable initial foray into the problem.

A mapping, $\mathit{Map}_{ST}$, is a formal description of the semantic relationships between two schemas, S and T. A mapping itself is a model consisting of a

set of mapping elements E, and a set of relationships R on E. The elements of the two schemas are related through the mapping elements. Each mapping element $e \in E$ is like any other element in schemas S and T. In addition to being the origin or destination of any kind of relationship found in a model, i.e., R, each $e \in E$ can be the origin of one or more mapping relationships $M(e, s)$, where $s \in S \cup T$, which specifies that the origin element e corresponds to the destination element s. The semantics of a mapping relationship is such that for all $s_1, s_2 \in S \cup T$ such that $M(e, s_1)$ and $M(e, s_2)$, $s_1 = s_2$, and $s_1$ corresponds to $s_2$.

Given this rich mapping structure, generic semantic relationships, not just simple correspondences between the elements of S and T, can be expressed as follows: two semantically equivalent elements are represented by one mapping element, while a relationship between two mapping elements indicates the same relationship between their corresponding schema elements. For example, in Figure 3.3, the mapping element $m_1$ corresponds to the elements class and course, which represent the same concept; the relationship between $m_4$ and $m_5$ indicates that instructor is-a faculty.

[Figure 3.2: Illustration of four relationship types handled by SeMap: Associates, Equivalent, Has-a and Is-a.]
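The representation described in this section can be sketched in a few lines of code. The following is a minimal illustration, assuming hypothetical class names (`Rel`, `Model`, `Mapping`) that do not come from SeMap itself: a model is a set of <s, p, o> tuples, and a mapping is a model over mapping elements together with mapping relationships M(e, s). The example data follows the two cases mentioned in the text, with m1 corresponding to class and course, and an Is-a edge from m4 (assumed to correspond to instructor) to m5 (assumed to correspond to faculty).

```python
from dataclasses import dataclass, field
from enum import Enum
from itertools import combinations


class Rel(Enum):
    EQUIVALENT = "Equivalent"
    HAS_A = "Has-a"
    IS_A = "Is-a"
    ASSOCIATES = "Associates"


# Equivalent and Associates are symmetric; Has-a and Is-a are not.
SYMMETRIC = {Rel.EQUIVALENT, Rel.ASSOCIATES}


@dataclass
class Model:
    """A directed labelled graph (V, E) whose edges are <s, p, o> tuples."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)

    def add_edge(self, s, p, o):
        self.nodes.update((s, o))
        self.edges.add((s, p, o))

    def holds(self, s, p, o):
        """True if <s, p, o> is stated, or follows from the symmetry of p."""
        if (s, p, o) in self.edges:
            return True
        return p in SYMMETRIC and (o, p, s) in self.edges


@dataclass
class Mapping(Model):
    """A model over mapping elements, plus mapping relationships M(e, s)."""
    correspondences: set = field(default_factory=set)

    def add_correspondence(self, e, schema_element):
        self.nodes.add(e)
        self.correspondences.add((e, schema_element))

    def targets(self, e):
        """Schema elements that mapping element e corresponds to."""
        return sorted(s for (m, s) in self.correspondences if m == e)

    def implied_relationships(self):
        """Relationships between schema elements expressed by the mapping."""
        implied = set()
        # Elements sharing one mapping element are semantically equivalent.
        for e in self.nodes:
            for a, b in combinations(self.targets(e), 2):
                implied.add((a, Rel.EQUIVALENT, b))
        # An edge between mapping elements carries over to their targets.
        for (e1, p, e2) in self.edges:
            for a in self.targets(e1):
                for b in self.targets(e2):
                    implied.add((a, p, b))
        return implied


mapping = Mapping()
mapping.add_correspondence("m1", "class")
mapping.add_correspondence("m1", "course")
mapping.add_correspondence("m4", "instructor")
mapping.add_correspondence("m5", "faculty")
mapping.add_edge("m4", Rel.IS_A, "m5")
implied = mapping.implied_relationships()
```

Storing the mapping as a model in its own right is what lets non-equivalence relationships such as Is-a be recorded between mapping elements, rather than only as pairwise correspondences.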

[Figure 3.3: An example of complex mapping handled by SeMap. The figure shows schema S, the mapping $\mathit{Map}_{ST}$ with mapping elements $m_1$ to $m_9$, and schema T.]

3.2 Problem Statement

Given the definitions of model and mapping, we are now ready to formally define the goal of SeMap: given two models, S and T, find the generic semantic relationships required to create the mapping $\mathit{Map}_{ST}$ between S and T. There may be some optional inputs to the matching process, specifically (1) an initial mapping between S and T, which provides an initial set of correspondences and needs to be refined by the process; and (2) external semantic resources r used by the matching process, e.g., domain-specific thesauri, ontologies, etc. The matching process is illustrated in Figure 3.4.

[Figure 3.4: Illustration of the matching process. Schema S, schema T, an initial mapping and resource r are inputs to Matching, which outputs the mapping between S and T.]

3.3 Semantic Resources

The semantic resources used by matching techniques can be categorized as internal resources, which are contained in the input schemas or associated data instances, and external resources, which comprise the semantic information not present in the schemas or data instances.

Internal Resources

The semantic resources of the input schemas include both element-level information, which refers to the information stored at each schema element (e.g., element name, data type, structure, etc.), and structure-level information, which refers to the information contained in the relationships between schema elements (e.g., relationship type, constraints, etc.). In the following two subsections, we introduce the element-level and structure-level resources considered in our SeMap system, respectively.

Element-Level Information

We consider the following element-level information:

Element name (label). Each element name is of String type. The name (label) provides a first layer of semantic evidence about the possible meaning of the schema element.

Element type. If an element contains data, it is usually associated with a type indicating the storage format of the data. Note that in many representations, the data type of a model element is considered a separate element, which is linked to the element itself by a

Type-of relationship. In our system, we consider the data type an attribute of the model element, e.g., String is an attribute of the element professor, rather than a separate element. The element type can provide hints in the sense that similar schema elements usually have the same or compatible data types.

Element description. This is a short description of the semantic meaning of the element, which usually contains more information than the element name alone. For web interfaces, where only schema-level information is available, the element description is especially valuable in determining the exact semantics of the elements. For example, on a flight ticket booking website it is hard to tell the semantics of an element from its name people alone. However, with the help of its description, total passengers, one can conclude that people stands for the overall number of tickets bought.

Data instances. As discussed in Chapter 2, data instances can provide valuable information that cannot be found in schemas, e.g., precise data format, data distribution, statistical values, etc. In particular, the declared data type of an element may not reflect exactly how its data is stored, which can only be seen in the data instances. For example, the element phone may be of Integer type. However, looking at its data instances, one may notice that the exact format is xxx-xxx-xxxx, which is not reflected in its data type. The distribution of data instances is also useful in identifying similar schema elements, especially when the element names are obscure, e.g., $A_1$ and $B_2$ [11].
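The phone example above suggests how data instances can reveal a format that the declared type hides. The following is a small sketch of one way to extract such a format, assuming a hypothetical helper `format_signature` that replaces every digit with `x` and every letter with `a`; it is an illustration of the idea, not SeMap's actual instance analysis, and the sample values are made up.

```python
from collections import Counter


def format_signature(value):
    """Abstract a value into a pattern: digits -> 'x', letters -> 'a'."""
    return "".join(
        "x" if ch.isdigit() else "a" if ch.isalpha() else ch
        for ch in value
    )


def dominant_format(instances):
    """Return the most frequent format signature among the instances."""
    counts = Counter(format_signature(v) for v in instances)
    return counts.most_common(1)[0][0]


# Made-up instances of a hypothetical element named "phone".
phone_instances = ["555-010-0001", "555-010-0199", "5550100000"]
phone_format = dominant_format(phone_instances)
```

Two elements with obscure names but the same dominant format would then be plausible match candidates, which is the kind of evidence the declared data type alone cannot provide.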

Structure-Level Information

In addition to the element-level information discussed above, we also consider structure-level information. In SeMap, we mainly consider two types of structure-level evidence:

Relationship type. Each edge between two schema elements has a certain relationship type, which can be leveraged in the matching process. The basic intuition is that if two elements are semantically similar, the elements having the same relationship with them are also highly likely to be semantically related.

Constraints. Each edge can have constraints, including (1) cardinality in relational database tables, e.g., 1-n, 1-1, etc., and (2) key properties of elements, e.g., unique, primary, etc.

External Resources

Previous work on matching techniques has shown that internal semantic evidence is usually insufficient for achieving high-quality matching results; additional external resources should be leveraged to improve the matching quality. In SeMap, we consider two types of external resources:

Thesaurus. This is a dictionary that provides the different representations of the same concept. Element names can therefore first be expanded with their synonyms, which gives a better chance of finding similar elements. Specifically, SeMap uses WordNet as the thesaurus. WordNet is a comprehensive English lexical reference system,

which organizes more than nouns, verbs, 6000 adjectives and 3000 adverbs into synonym sets (synsets). It is considered one of the most powerful tools for computational linguistics, and has been used in several matching applications [7].

Ontology. Ontologies, especially domain-specific ontologies, are powerful tools for discovering similar elements, and even for identifying their implicit relationships. However, they are not always obtainable. The collection of ontologies we employ in SeMap is provided by OntoBuilder [6].

3.4 Approach Overview

In this section, we present an overview of our generic matching system SeMap. As an implementation of the match operator, SeMap takes as input two schemas (models) S and T, and produces their generic semantic mapping $\mathit{Map}_{ST}$. In addition, SeMap takes external semantic resources r as input. In order to identify the generic semantic relationships between schema elements, SeMap must not only identify the correspondences of complex relationships, but also extract the implicit relationship types.

Figure 3.5 shows the basic architecture of this mapping construction system. SeMap achieves this goal in three main phases. In the first phase, schema matching, the candidate matches (correspondences) of generic semantic relationships are identified. Note that most previous work focuses on finding correspondences of Equivalent relationships, while in our work we also have
