A Conflict Resolution Environment for Autonomous Mediation Among Heterogeneous Databases


A Conflict Resolution Environment for Autonomous Mediation Among Heterogeneous Databases

Jinsoo Park
Information and Decision Sciences Department, Carlson School of Management, University of Minnesota, Minneapolis, MN

Sudha Ram
Department of Management Information Systems, Eller College of Business and Public Administration, The University of Arizona, Tucson, AZ

Index Terms: Heterogeneous Database Systems, Interoperability, Mediators, Semantic Conflict Resolution

Working paper. Last revised April 2001.

2 A Conflict Resolution Environment for Autonomous Mediation Among Heterogeneous Databases Abstract Our objective in this research is to develop a formal framework and methodology to facilitate semantic interoperability among distributed and heterogeneous information systems. The primary research question is, how do we identify and resolve various data- and schematic-level conflicts among such disparate information sources? A comprehensive formal framework for managing various semantic conflicts is proposed. We employ set theory and logic to formalize our framework, which provides a unified view of the underlying representational and reasoning formalism for the semantic mediation process. This framework is then used as a basis for automating the detection and resolution of semantic conflicts among heterogeneous information sources. Our framework is based on the concept of a mediator. We define several types of semantic mediators to achieve semantic interoperability. An ontology is used to capture various semantic conflicts. A mediation-based query processing technique is developed to provide uniform and integrated access to the multiple heterogeneous databases. This framework serves as the underlying basis of a semantically interoperable system environment. A usable prototype is implemented as a proof of concept for this work. The system has been integrated with the Internet and can be accessed through any Java-enabled web browser. Finally, the usefulness of our approach is evaluated using three cases in different application domains. Various heterogeneous datasets are used during the evaluation phase. The results of the evaluation suggest that correct identification and construction of both schema and ontology-schema mapping knowledge play very important roles in achieving interoperability at both the data and schema levels. The research adopts a multimethodological approach that incorporates set theory, logic, prototyping, and case study.

1. Introduction
Enterprise data integration is the top item on every CIO's wish list (Stonebraker 1998). Data management in a heterogeneous environment has been one of the most challenging problems. In particular, establishing semantic interoperability among heterogeneous and distributed information sources has been a critical issue attracting significant attention from research and practice. Semantic interoperability is the ability of cooperating businesses to bridge semantic conflicts arising from differences in implicit meanings, perspectives, and assumptions, thus creating a semantically compatible information environment based on concepts agreed upon by different business entities. Another type of interoperability, syntactic interoperability, is the ability of multiple components to cooperate even though their implementation languages, interfaces, and execution platforms are different (Ram et al. 1999c). We believe, however, that emerging standards (e.g., CORBA, RMI, DCOM, Z39.50, XML, and KIF) can resolve most syntactic-level interoperability issues. Today's database system technologies are being challenged in virtually all areas of data management with new applications that demand ways of dealing more explicitly with the meaning and use of the data being managed. However, expressive semantic representations and processing capabilities are, unfortunately, not well handled by these technologies. The emerging application areas that require mechanisms for dealing with data semantics include, but are not limited to, data warehouses, geographic information systems (GIS), knowledge management systems (KMS), and e-commerce systems. In a recent research commentary published in Information Systems Research, March et al. (2000) discuss semantic interoperability as one of the most important research issues and technical challenges in heterogeneous and distributed environments. The design of a semantically interoperable system environment that manages various semantic conflicts among different systems is a daunting task. It should provide the capability to detect and resolve incompatibilities in data

4 semantics and structures, as well as a standard query language for accessing information on a global basis. At the same time, it should involve minimal or no changes to existing systems to preserve the local autonomy of the participating systems. The environment must be flexible enough to permit adding or removing individual systems from the integrated structure without major modifications. The work described in this article provides a solution to this challenge. The objective of this research is to provide an automated solution for managing various levels of semantic conflicts, thus facilitating semantic interoperability among autonomous and heterogeneous information systems. This objective is pursued by (1) proposing a comprehensive formal framework and methodology to provide semantic interoperability among these systems, (2) using the framework to develop and implement a system for this work, and (3) evaluating the utility of our approach. We adopt a multi-methodological approach with a combination of set theory, logic, prototyping, and case study for this research. In this sense, our research approach adheres to March and Smith s (1995) design science research framework. In order to justify our approach, we built a system and evaluated its effectiveness and usefulness. The research instruments employed in this work are prototyping and a case study. The combination of such instruments enables the researcher to counterbalance their respective strengths and weaknesses. The prototype system is used as a proof of concept to evaluate the impacts of our approach and its potential for acceptance in real world settings. The case study is exploratory because the prototype to be evaluated had no clear, single set of outcomes; the case study was used for meta-evaluation (Yin 1994). The rest of this article is organized as follows. Section 2 discusses the relevant research in semantic interoperability and argues that no single method can achieve comprehensive semantic interoperability. Section 3 presents our formal framework, which consists of local schemas, a federated schema, schema mapping knowledge, a semantic conflict resolution ontology, an ontology relationship knowledge, an ontology-schema mapping knowledge, semantic mediators, and a 2

5 query processing knowledge. A simple example is presented in section 4 to help readers understand the semantic reconciliation process. In section 5, we show the architecture of a prototype system that has been implemented based on the proposed framework and summarize the functions of each component. Section 6 provides the details of the case study and discusses the result of the study. Section 7 concludes this article with a discussion of the contributions of our research and future directions. 2. Background 2.1. Previous Work Previous research in semantic interoperability can be categorized into three broad areas. The mapping-based approach asserts an isomorphic mapping between semantically related information sources. It is usually accomplished by constructing a federated (or global) schema and by establishing mappings between the federated (or global) schema and the participating local schemas (Hayne and Ram 1990; Navathe et al. 1986; Reddy et al. 1994; Sheth et al. 1993). It is also possible to establish direct mappings between disparate information sources (Larson et al. 1989). The drawback of the federated schema method is its lack of semantic richness and flexibility (Ram and Ramesh 1999). However, mappings are not limited to schema components (i.e., entity classes, relationships, and attributes), but may be established between domains and schema components (Collet et al. 1991; DeMichiel 1989; Kashyap and Sheth 1996). Explicit representations of semantics of information sources can help resolve the problems associated with interoperability when constructing mappings between them. Another promising approach is the intermediary-based approach. This approach depends on the use of intermediary mechanisms (e.g., mediators, agents, ontologies, etc.) to achieve interoperability (Genesereth et al. 1997; Kuokka and Harada 1996; McLeod and Si 1995; Ouksel 1999; Papakonstantinou et al. 1996; Sciore et al. 1994). Such intermediaries may have domain specific knowledge, mapping knowledge, or rules specifically developed for coordinating various autono- 3

6 mous information sources. In most cases, such intermediaries use ontologies to share standardized vocabulary or protocols to communicate with each other (Goh et al. 1994; Kahng and McLeod 1998). The advantage of using ontologies is its ability to capture the tacit knowledge within a certain domain in great detail in order to provide a rich conceptualization of data objects and their relationships. Even though such an approach may be theoretically valid, the application of the ontology approach is practically infeasible due to the inherent complexities of the knowledge domain. Hence, this approach is typically applied only to a restricted application domain, which limits its general applicability in practice. Furthermore, in order to represent complex conceptualizations, the formalism used in representing the ontology also often becomes too complicated for wide application. The third approach, query-oriented approach, is based on interoperable languages, most of which are either declarative logic-based languages or extended SQL (Arens et al. 1996; Calmet et al. 1996; Czejdo et al. 1987; Krishnamurthy et al. 1991; Lakshmanan et al. 1997). They are capable of formulating queries spanning several databases. In order to resolve semantic conflicts over data structure and data semantics, it is desirable to have high order expressions that can range over both data and metadata. One of the main drawbacks of this approach is that it places too heavy a burden on the users by requiring them to understand each of the underlying local databases. This approach typically requires users to engage in the detection and resolution of semantic conflicts, since it provides little or no support for identifying semantic conflicts (Goh et al. 1994). Consequently, users are also responsible for semantic conflict resolution. Note that research approaches classified into these three categories may not be mutually exclusive. For example, the intermediary-based approach may not necessarily be achieved only through intermediaries. Some approaches based on intermediaries also rely on mapping knowledge established between a common ontology and local schemas. It is also often the case that mapping and intermediaries are involved in query-oriented approaches. 4

7 2.2. Types of Semantic Conflicts In a broad sense, semantic conflict analysis can occur at two different levels: at the data-level and at the schema-level. Data-level conflicts are differences in data domains caused by the multiple representations and interpretations of similar data. Examples of data-level conflicts are data value conflicts (e.g., in soil suitability analysis databases, the data value suitable in one database may mean that the soil in a particular area is suitable for road construction, while the same value in another database may indicate that the soil is suitable for sewage disposal), data representation conflicts (e.g., dates can be represented as a six-character string, such as or as a Julian date, such as 7-May-98 ), data unit conflicts (e.g., amount of rainfall may be represented as centimeters in one database and inches in another database), and data precision conflicts (e.g., assignment of building grades may be reported using different granularities a three-point scale with letter grades excellent, good, poor, or perhaps a four-point scale with A, B, C, and D ). Data-level conflicts can be further classified into two different levels depending on the granularity of the information unit (IU). For instance, semantic conflicts can occur at the level of objects properties and their values (attribute as IU) or objects themselves (entity as IU). The former attempts to identify semantic equivalence at the attribute level (attribute equivalence), while the latter operates at the entity level (entity equivalence). Both, however, do not necessarily try to resolve structural differences. While several studies have focused on resolving semantic conflicts at the data level (DeMichiel 1989; Goh et al. 1999; Kahng and McLeod 1998; Ventrone and Heiler 1991; Yu et al. 1991), others attempt to achieve interoperability resolving schema-level conflicts, i.e., structural differences (Batini and Lenzerini 1984; Garcia-Solaco et al. 1995; Geller et al. 1992; Navathe et al. 1986). Schema-level conflicts are characterized by differences in logical structures and/or inconsistencies in metadata (i.e., schemas) of the same application domain. Examples of such conflicts are naming conflicts, entity identifier conflicts, schema isomorphism conflicts, generalization con- 5

8 flicts, aggregation conflicts, and schematic discrepancies (Ram et al. 1999b). Naming conflicts arise when labels of schema elements (i.e., entity classes, relationships, and attributes) are somewhat arbitrarily assigned by different database designers. For example, both the relations labeled Publication in one database and Paper in another database actually capture the same information about research papers published in journals. Entity identifier conflicts are often caused by assigning different identifiers (primary keys) to the same concept in different databases. Schema isomorphism conflicts occur when the same concept (entity class) is described by a dissimilar set of attributes; i.e., the same concept is represented by a number of different attributes and, therefore, the sets of entities are not set-operation compatible (Sheth and Kashyap 1992). Generalization conflicts result from different design choices for modeling related entity classes. For instance, one database may have separate representations for Surface Water and Ground Water, whereas another database may have a Water entity class to collectively represent the two different but related entity classes (i.e., Surface Water and Ground Water ). Since the Water entity class in one database is defined as two distinct entity classes, Surface Water and Ground Water in another database, this requires not only mapping between the two databases, but also specifying rules to identify which instances (in Water ) belong to Surface Water and which to Ground Water. Aggregation conflicts arise when an aggregation is used in one database to identify a set of entities in another database. Therefore, the properties and their values in one database may aggregate corresponding properties and values of the set of entities of another database (Sheth and Kashyap 1992). Schematic discrepancies can occur when the logical structure of a set of attributes and their values belonging to an entity class in one database are organized to form a different structure in another database. Conflicts of this type are discussed in Krishnamurthy et al. (1991), Lakshmanan et al. (1997), and Sheth and Kashyap (1992). The pure schema-level approach, without data-level interoperability, however, may result in achieving interoperability between different schemas that may be semantically different but structurally similar. It is, there- 6

fore, desirable to achieve interoperability at both levels.

2.3. Our Approach and Contributions
As discussed in Section 2.1, each approach has its own limitations. For example, in the mapping-based approach, because mappings require the prior resolution of semantic conflicts, it is extremely difficult to maintain all possible mappings in a frequently changing environment. Thus, it would be desirable to provide a flexible mechanism that does not force the system integrator to identify and resolve semantic conflicts a priori. The intermediary-based approach tends to depend on a particular application domain. Therefore, its solution is usually application specific and cannot be easily extended to different domains. In addition, this approach may not be able to resolve schema-level conflicts because it focuses on reconciling semantic conflicts in context representations rather than structural representations. On the other hand, many studies in the query-oriented approach (e.g., Krishnamurthy et al. 1991; Lakshmanan et al. 1997) have focused on resolving structural representations and do not provide data-level semantic interoperability. The results of previous studies suggest that no single method can provide a comprehensive solution for achieving semantic interoperability. Our research attempts to provide a comprehensive solution by embracing the advantages of the different methods and overcoming their limitations. In this article, we present a hybrid approach to achieving semantic interoperability at both the data- and schema-levels. The specific contributions of our research are as follows:

- A generalized formal framework of an interoperable software development environment is presented, in which both data- and schema-level semantic conflicts can be automatically detected and resolved. This framework is comprehensive enough to manage various types of semantic conflicts in heterogeneous information sources while preserving the autonomy of individual sources.

- We have developed a mechanism for handling multidatabase queries using a semantic query processing technique managed by semantic mediators. Our query interface provides uniform and integrated access to a large number of underlying distributed and heterogeneous databases, based on a conceptual user interface that assists the user in formulating queries without imposing the burden of acquiring familiarity with local databases.

- The feasibility of our proposed formal framework is demonstrated in a prototype system implementation. This multi-tier architecture supports various levels of user activities in managing the semantically interoperable environment, and demonstrates the practicality and performance of our approach.

- Our framework provides the theoretical basis for defining the various components required to develop a software environment for database conflict resolution. It clearly defines the purpose and function of each component, and describes their inter-relationships.

3. CREAM: Semantically Interoperable Environment
Our proposed solution for a semantically interoperable system environment is called the Conflict Resolution Environment for Autonomous Mediation (CREAM). CREAM can identify semantically related data from different database systems and resolve semantic conflicts among them. CREAM also allows users to access a large number of autonomous information systems without prior knowledge of their information content. In this section, we provide an overview of the various features of our proposed framework. We first present a formal representation of the CREAM framework, and then highlight its various features in the subsequent subsections.

3.1. Formal Representation of the CREAM Framework
The goal of this subsection is to define the various components of the framework and their interrelationships to the extent that they can be effectively and efficiently controlled. We present a simple formal (set-theoretic) model of the CREAM framework, which provides insights into how each component interacts with the others and allows us to understand the overall behavior of the system independent of implementation details. We believe that providing a comprehensive formal framework for defining each component is necessary and important because such a generalized framework can be very useful in developing a software environment.

Definition 1. A CREAM framework is an 8-tuple (Γ, Σ, Ξ, Λ, Ω, Φ, Ψ, Θ), where:

- Γ is a set of local schemas considered for integration, given by Γ = ∪i γi, where γi is a local schema and i identifies the local schema. Each γi is represented in the form of a predicate atom that describes the structure of the schema.
- Σ is a federated schema that is an integration of Γ and represents a federated view of Γ; it is given by Σ ⊆ Γ.
- Ξ is the schema mapping knowledge between the federated schema and the participating local schemas. It is defined as a function from Σ to Γ and denoted as Ξ: Σ → Γ.
- Λ is a semantic conflict resolution ontology (SCROL), which is a tuple Λ = (OC, OI, RP, RS, RM, u), where OC, OI, RP, RS, RM, and u are the concepts, instances, parenthood relationships, sibling relationships, domain value mapping relationships, and root of Λ, respectively.
- Ω is the ontology relationship knowledge, defined as a relation on λ × σ × σ, where λ = RP ∪ RS ∪ RM in Λ and σ = OC ∪ OI in Λ. It is given by Ω ⊆ λ × σ × σ.
- Φ is the ontology-schema mapping knowledge, a set of mappings between schemas (both federated and local schemas) and SCROL. The ontology-schema mapping Φ is therefore described as a relation on Λ × (Σ ∪ Γ) and given by Φ ⊆ Λ × (Σ ∪ Γ).
- Ψ is a set of semantic mediators given by {coordinator, conflict detector, selector, query generator, data collector, {conflict resolver}, message generator}.
- Θ is the query processing knowledge of the semantic mediators.

In the following subsections, the above eight components of the CREAM framework are described in more detail.

3.2. Local Schema
A local schema is a description of the data organization stored in an autonomous, local system participating in CREAM. Since each local schema uses its own data model, it is necessary to translate each local schema into a single data model for comparison purposes in order to facilitate the identification of semantically similar data objects in different local schemas by explicitly representing all underlying assumptions in the corresponding local schema (Reddy et al. 1994). Consequently, we adopt a semantic model proposed by Ram et al. (1999a) to capture the semantics of various types of data objects. This semantic model, called USM* (Unifying Semantic Model*), is used as a canonical model (also called a common data model) to represent all local schemas and federated schemas in a uniform way. This ensures uniformity in the modeling constructs by transforming local schemas expressed in different data models into a single data model (Reddy et al. 1994). The intension (including structure, integrity rules, and meta-properties) of the local schemas is captured and stored in the form of metadata and represented by USM*. USM* is also used as a graphical query formulation tool. Set theory is used to formalize USM*; set theory is a powerful representation device that can be used to model a complex system. The discussion of USM* is beyond the scope of this paper, and its formal definitions can be found in Ram et al. (1999a). Although our approach uses USM*, it can work with any other semantic model. We use USM* because it is one of the most comprehensive semantic models and has a number of abstractions to support not only simple entity classes but also complex aggregates and composite entity classes.

3.3. Federated Schema
After identifying equivalent schema components between local schemas, a federated schema is created and mapped to the local schemas. A federated schema in CREAM is an integrated view of the sharable data from the set of participating local schemas; thus, it could be equal to Γ (i.e., the set of local schemas) or a subset of Γ if any local schema administrator decided not to share some portion of its schema components and associated data. In the past, many integration approaches have

been proposed. These approaches can be broadly categorized into two different types of systems: tightly coupled federated systems and loosely coupled federated systems (Sheth and Larson 1990). In a tightly coupled federated system, one or more federated schemas are constructed from the schemas of the participating local databases. In most cases, the schemas of the local databases (called component databases) are heterogeneous. This approach provides uniform and integrated access to the underlying local databases, because the federated schema serves as a front-end system that supports a common data model and a single global query language on top of these local database systems. In this approach, however, since semantic conflicts must be identified and resolved a priori by the administrators, it is very difficult to construct and maintain the federated schema. In a loosely coupled federated system, users can directly interact with local databases instead of being restricted to querying federated schemas. In this approach, however, due to the decentralization, users must have knowledge of the data sources to resolve the potential conflicts. End-users are responsible for detecting and resolving conflicts. Our approach to query processing is similar to the tightly coupled federated system approach, since users are only allowed to interact with one or more federated schemas, which mediate access to data stored in all local databases. Accordingly, the query processing and query answering services are accomplished through a mediation layer. The initial query requested by the user through a federated schema is referred to as a global query. The global query is then processed by a semantic mediator, the query generator. Note that both federated and local schemas are graphically depicted by the corresponding USM* constructs in our prototype system. The main advantage of this approach is that a single global query can be issued to access data from multiple underlying local databases in an integrated view without affecting any existing application program written on the basis of any local system (Yu and Meng 1998). On the other hand, although we have a federated schema, our approach requires only a reasonable amount of schema integration effort because semantic conflicts do not have to be identified and resolved a priori by a centralized administrator.

The specific context of each data object in the schema is identified through mappings to the extendable, domain-independent ontology called SCROL (see section 3.5), and the local database administrators and domain experts are responsible for encoding semantic conflict classification knowledge.

3.4. Schema Mapping Knowledge
Schema mappings are mappings between a federated schema and local schemas, and the schema mapping knowledge is the knowledge about schema mappings. The schema mapping knowledge can be established at any level (e.g., attribute, entity, or relationship level), and is defined as a relation on Σ × Γ. Thus, each component of the federated schema is mapped to the corresponding component of a local schema. The mappings are established by identifying semantically similar concepts (i.e., schema components). In CREAM, these mappings are used to generate valid local queries. The schema mapping process is not completely automated in our current system. We are developing ways to automate the schema integration process so that we can reduce the time required to construct a federated schema. However, we believe that the development of schema mapping knowledge should be based on mutual agreement between knowledgeable database administrators from both the federated schema and the local schemas. Consequently, human integrators are an essential part of the schema analysis and mapping process.

3.5. Semantic Conflict Resolution Ontology
The Semantic Conflict Resolution Ontology (SCROL) provides a dynamic mechanism for comparing and manipulating contextual knowledge about each information source, which is useful in facilitating semantic interoperability. It is used for detecting semantic conflicts between semantically equivalent data objects. Unlike other traditional ontology frameworks designed to capture domain-specific (MacGregor 1991; Mahalingam and Huhns 1997; van der Vet and Mars 1998) or commonsense knowledge (Collet et al. 1991; Lenat et al. 1990; Storey et al. 1997), SCROL is developed to encode extensible knowledge about the commonly found semantic conflicts identified in Ram et al. (1999b).

SCROL is different from existing domain ontology models in that we provide a simple formalism to capture only the domain knowledge pertaining to potential types of semantic conflicts, so it is not very complex. Another advantage of this approach is that, while existing approaches using ontologies are domain specific, our ontology is not. Therefore, our ontology can be applied to many different application domains without the burden of having to capture tacit knowledge about each domain. However, our approach does not lose any semantic richness, since it also provides a semantic model that captures the intensional description of the particular application domain. We argue that using both a common ontology and a semantic model provides a more complete understanding of the application domain. The structure of SCROL is basically a tree. A tree is a partially ordered set in which the predecessors (e.g., superconcepts) of each element are well-ordered. SCROL contains two distinct sets, i.e., concepts and instances, both of which are called nodes of SCROL. Concepts and instances are represented as terms. A concept is a more generalized term that may have several concrete instances. For example, the term Ratio is a concept, and Percent and Fraction are instances of Ratio. Concepts can have more than one child, but instances are childless. The parent of any instance is a concept. A concept that does not have any child concepts is a leaf concept; a leaf concept may have no instances or one or more instances. The formal definition and detailed description of each SCROL construct can be found in Reference-1 (2000). (Footnote 1: Author information is suppressed for the review and marked as Reference-1 and Reference-2.)
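To make the tree structure of SCROL more concrete, the following short sketch shows one plausible encoding of concepts, instances, and the Ratio example above. The class, field, and method names are our own illustrative assumptions, not the formalism or code used in the prototype.

Example (Python sketch, illustrative only):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ScrolNode:
    """A SCROL node: either a concept or an instance (both are terms)."""
    term: str
    is_concept: bool
    parent: Optional["ScrolNode"] = None
    children: List["ScrolNode"] = field(default_factory=list)

    def add_child(self, child: "ScrolNode") -> "ScrolNode":
        # Instances are childless, so only concepts may receive children.
        assert self.is_concept, "only concepts can have children"
        child.parent = self
        self.children.append(child)
        return child

    def is_leaf_concept(self) -> bool:
        # A leaf concept has no child concepts; it may still have instances.
        return self.is_concept and not any(c.is_concept for c in self.children)

# The Ratio example from the text: Ratio is a concept with two instances.
root = ScrolNode("SCROL root", is_concept=True)
ratio = root.add_child(ScrolNode("Ratio", is_concept=True))
percent = ratio.add_child(ScrolNode("Percent", is_concept=False))
fraction = ratio.add_child(ScrolNode("Fraction", is_concept=False))

print(ratio.is_leaf_concept())   # True: no child concepts, only instances
print(percent.parent.term)       # "Ratio": the parent of any instance is a concept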

3.6. Ontology Relationship Knowledge
Another important element of the CREAM framework, associated with SCROL, is the ontology relationship knowledge. This knowledge is a core part of the reasoning process for semantic reconciliation, and thus is used as an inference engine in CREAM. There are three different types of relationships: parenthood, sibling, and domain value mapping relationships. The parenthood relationship is a binary vertical relationship (parent-child) that can be defined over the hierarchical structure of SCROL. In addition, we extend and modify the basic tree structure of SCROL to define the sibling relationship, which is basically a horizontal relationship between two constructs. The sibling relationship consists of a disjoint relationship, a peer relationship, a part-of relationship, and an is-a relationship. The sibling relationship can only be established between concepts. The disjoint and peer relationships are used by semantic mediators to determine whether semantic conflicts exist (particularly data-level conflicts) and, if semantic conflicts exist, whether they are resolvable. The purpose of using the part-of and is-a relationships is to allow semantic mediators to detect schema-level conflicts between schemas. The last relationship defined in SCROL is the domain value mapping relationship. This relationship can only occur between instances. It is used by semantic mediators to determine whether the actual data values that are mapped to instances can be transformed from one value to another and vice versa. It can be one-one, one-many, many-many, or none. In addition, total and partial mappings are used in combination with the one-one, one-many, and many-many relationships to indicate whether every value in one instance has a corresponding value in the other instances.

3.7. Ontology-Schema Mapping Knowledge
After the schema mapping knowledge has been established between a federated schema and the participating local schemas, the next step is to establish a mapping between these schemas and SCROL (however, the order of mapping knowledge construction is insignificant). Contextual knowledge in a local/federated database component is captured via mappings between SCROL and the schema component. The ontology-schema mapping knowledge defines a function from Λ to Σ ∪ Γ. This is the knowledge that identifies semantically related schema components through the ontology components in SCROL. The use of ontology-schema mapping knowledge in CREAM is explained in section 4.

3.8. Semantic Mediation Service Layer and Semantic Mediators
Among the many features provided by agent technology, such as design autonomy, communication infrastructure, directory service, message protocol, mediation services, security services, re-

17 mittance services, and operations support (Huhns and Singh 1998), mediation services are the major goal of using mediators in our research. Wiederhold (1992) defined a mediator as a software module that exploits encoded knowledge about a particular dataset to bring the source information into a common form for a higher layer of applications. Therefore, a mediator can provide valueadded services (called semantic mediation services ), such as (1) accessing and retrieving relevant data from multiple heterogeneous resources and (2) abstracting and transforming retrieved data into common representations and semantics (Wiederhold and Genesereth 1997). The effective construction of mediators requires some common representation of the meanings (i.e., a common ontology) of the resources and applications they connect (Huhns and Singh 1998). The employment of a consistent ontology is useful because the users and administrators can have identical semantics for all terms; otherwise, our semantic mediation service will be incomprehensible to users (Wiederhold and Genesereth 1997). For that reason, in our approach, semantic mediators are required to use SCROL for conflict detection and resolution. Thus, a crucial element in semantic mediation services is the ability to effectively query and manipulate both schemas and SCROL. In this research, we developed a set of semantic mediators. The term semantic mediator refers to a mediator (or software agent) that is responsible for semantic reconciliation, i.e., the identification and resolution process of semantic conflicts. They also provide services, such as query generation, directory service, security service, and semantic transformation. We define seven distinct types of semantic mediators based on different functionalities and tasks, that is, semantic mediator = {coordinator, conflict detector, selector, query generator, data collector, {conflict resolver}, message generator}. The coordinator is responsible for coordinating and monitoring all kinds of activities provided by each type of semantic mediator. It also combines and filters communication messages passed by each semantic mediator. In this sense, the coordinator has the global view of a problem. The conflict detector is responsible for detecting various data- and 15

18 schema-level semantic conflicts. This is accomplished by traversing SCROL and using ontology relationship knowledge, schema mapping knowledge, and ontology-schema mapping knowledge to reason about semantic conflicts. It also determines whether semantic conflicts found can be resolved by any conflict Resolver. The conflict resolver is mainly responsible for semantic reconciliation. In CREAM, there are a number of specialized conflict resolvers, each of which can resolve a specific semantic conflict, based on semantic transformation rules. The selector is responsible for managing local queries generated by the query generator. It also maintains the addresses of all participating local databases. The query generator is responsible for parsing the global query and generating valid query statements for the corresponding local databases. The data collector actually executes each local query and assembles the query results. The message generator is an auxiliary mediator whose main responsibility is to generate message placeholders in which each mediator can put and retrieve several different types of message information during communication. The communication protocol used by our semantic mediators to process various user queries is motivated by the theory of speech acts (Austin 1962; Searle 1969) that considers language as a social action. Recently, many researchers have proposed agent communication languages based on speech act theory (Cohen and Levesque 1995; Genesereth and Ketchpel 1994; Labrou and Finin 1997). Most of these works are related with KQML (knowledge query and manipulation language). In KQML, agents can communicate with each other by exchanging performatives that provide hints for whether the content of the communication is an assertion, a request, or a query. Our current work, however, is not intended to work with KQML, and only provides a small set of conversation policies used for the semantic mediators. In our approach, since semantic mediation services add value to the data by applying the knowledge of the expert who has created the mediator, mediators are maintained by domain experts. This approach can ensure that our semantic mediators remain effective in a constantly 16

changing world. The semantic mediation service layer is illustrated in Figure 1. Strictly speaking, our mediation framework is neither purely a decentralized control system (i.e., a distributed multi-agent system) nor a centralized control system (Sikora and Shaw 1998). It is not a purely distributed multi-agent system in the sense that a semantic mediator, called the coordinator, controls message-passing between some mediators (e.g., message-passing between the conflict detector and the selector). Neither is it a centralized control system, because mediators can communicate among themselves as well (e.g., message-passing between the selector and the query generator), and take actions based on the local state of each mediator (i.e., distribution of control; Jennings and Wooldridge 1998).

Figure 1. Semantic Mediation Service Layer. (Users interact with the federated schema; the semantic mediation service layer connects it to SCROL and to the local schemas, Local Schema 1 through Local Schema n, with metadata represented by USM* and linked through schema mappings and ontology-schema mappings.)

Services in the semantic mediation layer are specialized by breaking them up based on function and domains. Partitioning a complex task into sub-tasks is desirable, particularly in a complex and heterogeneous environment. There are several advantages to this approach. First, by dividing a complex task into smaller ones, a mediator can perform a well-bounded small task (goal-orientedness). In addition, it is much easier to maintain and update a mediator as the environment evolves. The approach can also provide a better means of conceptualizing and implementing expert knowledge in a given domain.
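As an illustration of how the seven mediator types might cooperate around a shared message object, the following is a deliberately simplified, runnable sketch. The class names mirror the mediator roles described above, but the control flow, message layout, stubbed behaviors, and placeholder host names are assumptions made for exposition; they are not the prototype's implementation.

Example (Python sketch, illustrative only):

class MessageGenerator:
    def new_message(self, query):
        # Placeholder slots that the other mediators fill in during mediation.
        return {"global_query": query, "local_queries": {}, "conflicts": [],
                "results": {}}

class QueryGenerator:
    def generate(self, msg):
        # In CREAM this would rewrite names via schema mapping knowledge;
        # here the global query is simply copied per source as a stand-in.
        return {src: msg["global_query"] for src in ("county_db", "city_db")}

class ConflictDetector:
    def detect(self, msg):
        # Stand-in for reasoning over SCROL and the mapping knowledge.
        return [("data unit conflict", "Region.area")]

class Selector:
    addresses = {"county_db": "db1.example.org", "city_db": "db2.example.org"}  # placeholders
    def route(self, msg):
        return {src: self.addresses[src] for src in msg["local_queries"]}

class DataCollector:
    def execute(self, msg):
        # Would run each local query against its database; stubbed here.
        return {src: [] for src in msg["local_queries"]}

class ConflictResolver:
    def resolve(self, results, conflicts):
        return results  # a real resolver applies semantic transformation rules

class Coordinator:
    """Coordinates the other mediators and keeps the global view of the task."""
    def __init__(self):
        self.msg_gen, self.qgen = MessageGenerator(), QueryGenerator()
        self.detector, self.selector = ConflictDetector(), Selector()
        self.collector, self.resolvers = DataCollector(), [ConflictResolver()]

    def answer(self, query):
        msg = self.msg_gen.new_message(query)
        msg["local_queries"] = self.qgen.generate(msg)
        msg["conflicts"] = self.detector.detect(msg)
        msg["targets"] = self.selector.route(msg)
        msg["results"] = self.collector.execute(msg)
        for resolver in self.resolvers:
            msg["results"] = resolver.resolve(msg["results"], msg["conflicts"])
        return msg["results"]

print(Coordinator().answer("SELECT Region.name FROM Region"))

A real deployment would replace each stub with the knowledge-driven behavior described in the surrounding sections.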

3.9. Query Processing Knowledge
The major responsibilities of the semantic mediators include generating correct local queries from a global query that is created through the query-by-schema (QBS) user dialog interface, and performing various tasks to identify different levels of semantic conflicts that have been captured during the schema mapping and ontology-schema mapping processes. QBS allows the user to graphically formulate a query through a federated schema expressed in USM*. In order to process a user-specified query and handle semantic conflicts among local databases, we adopt a message-based approach. In this approach, different types of message objects are generated, processed, and updated by the various semantic mediators to answer the user's request. At each task, new message information is added to a message object by different semantic mediators. Message objects are used by the various semantic mediators during semantic mediation to answer the user-specified query. Query processing in CREAM is accomplished broadly in the following four steps:

(1) Query parsing and decomposition parses the global query and creates a query graph. If the query contains join operation(s) (i.e., a query involving more than one relation), then it decomposes the query into several simple queries (one relation per query) and generates a query graph for each relation. In CREAM, a query graph is a tree consisting of several nodes that represent a relation, operations (projection, selection, join), attributes, and selection conditions, as illustrated in Figure 2.

(2) Query modification modifies the relation and attribute names of the global query to the proper relation and attribute names of the corresponding local databases. During this phase, the query generator communicates with the conflict resolvers to modify the semantics of the global query selection conditions (i.e., the WHERE clause) so that each generated local query contains a valid query statement before being executed by the local systems. The query modification is realized by inspecting the mappings between the federated schema and the local schemas (i.e., the schema mapping knowledge).

(3) Semantic transformation is the process of resolving semantic conflicts and transforming the original context of the data sources to the target context in which the user wants to view the retrieved data. This process enables the user to view and manipulate data in the manner most appropriate for the user. This process is similar to context mediation, where conflict resolution is deferred to the time when the data is actually retrieved (Sciore et al. 1994). However, unlike the context mediation approach, which focuses solely on the semantics of individual data items, our semantic transformation can also handle semantic conflicts at the schema level.

(4) Assembly of global query results combines the query results returned by the local queries to form the answer for the user.

Figure 2. Query Graph Template. (A query graph is a tree rooted at a relation, with nodes for the projection, selection, and join operations, their attributes, and the selection conditions.)
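The following sketch illustrates step (1) and the Figure 2 template: a hypothetical decomposition of a two-relation global query (the example used in section 4) into one query graph per relation. The data structures and the decompose function are assumptions for illustration only, not the prototype's parser.

Example (Python sketch, illustrative only):

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class QueryGraph:
    relation: str
    projections: List[str] = field(default_factory=list)        # attributes to return
    selections: List[str] = field(default_factory=list)         # conditions on this relation
    joins: List[Tuple[str, str]] = field(default_factory=list)  # (local attribute, other attribute)

def decompose(select, where):
    """Split a parsed global query into per-relation query graphs."""
    graphs = {}
    def graph(rel):
        return graphs.setdefault(rel, QueryGraph(rel))
    for attr in select:                        # e.g. "Region.area"
        rel, _ = attr.split(".")
        graph(rel).projections.append(attr)
    for left, op, right in where:              # e.g. ("Region.area", ">=", "30000000")
        rel, _ = left.split(".")
        if "." in str(right):                  # join condition between two relations
            other_rel, _ = right.split(".")
            graph(rel).joins.append((left, right))
            graph(other_rel).joins.append((right, left))
        else:
            graph(rel).selections.append(f"{left} {op} {right}")
    return graphs

graphs = decompose(
    select=["Region.name", "Region.area", "Census.population", "Census.starting-date"],
    where=[("Region.area", ">=", "30000000"),
           ("Census.starting-date", ">=", "'01-JAN-1997'"),
           ("Census.region-name", "=", "Region.name")],
)
for g in graphs.values():
    print(g)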

4. Semantic Reconciliation: A Motivating Example
To illustrate the semantic reconciliation process, let us consider two very simple subsets of heterogeneous local schemas, County Schema and City Schema, for census data used by counties and cities, respectively. The description and meta-information of each attribute of the relations used by County Schema are presented in Table 1, and those used by City Schema are presented in Table 2. The federated schema that represents the integrated view of the two local schemas and the corresponding schema mapping knowledge are graphically illustrated in Figure 3. Recall that a federated schema in CREAM is an integrated view of the participating local schemas, which could be equal to the union of the local schema components or a subset of them. In the current example, the federated schema happens to be the union of both local schemas. It is worth noting that, since QBS provides a homogeneous view of heterogeneous information sources that is independent of the structure and the location of the actual data sources, the user can transparently access data from different sources without prior knowledge of their information content. The goal of this section is to demonstrate (1) how the user-specified query is manipulated by different types of semantic mediators to generate valid local queries, (2) how the semantic conflicts related to the query are detected, and (3) how the query results retrieved from the different local databases are semantically reconciled and delivered to the user.

Table 1. Relations in County Schema

COUNTY-AREA relation:
  name (string): name of county (identifier)
  area (real): total area of a county in square meters
  image (vector): county map

POPULATION relation:
  pop-id (integer): system-generated unique identifier
  county (string): foreign key referencing name in COUNTY-AREA
  size (integer): total population of a county in thousands
  census-starting-date (date): census starting date
  census-ending-date (date): census ending date

Table 2. Relations in City Schema

CITY-SIZE relation:
  name (string): name of city (identifier)
  gross-size (real): total area of a city in acres
  map (TIFF): city map

CENSUS relation:
  census-id (integer): system-generated unique identifier
  city (string): foreign key referencing name of CITY-SIZE
  population (integer): total population of a city
  census-date (string): census starting date
  duration (integer): total days taken for the census of a city
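For readers who prefer a machine-readable view, the metadata in Tables 1 and 2 could be recorded along the following lines. The dictionary layout is purely illustrative and is not the USM*-based metadata representation used in CREAM.

Example (Python sketch, illustrative only):

local_schemas = {
    "County Schema": {
        "COUNTY-AREA": {
            "name":  {"type": "string",  "desc": "name of county (identifier)"},
            "area":  {"type": "real",    "desc": "total area of a county in square meters"},
            "image": {"type": "vector",  "desc": "county map"},
        },
        "POPULATION": {
            "pop-id":               {"type": "integer", "desc": "system-generated unique identifier"},
            "county":               {"type": "string",  "desc": "FK referencing name in COUNTY-AREA"},
            "size":                 {"type": "integer", "desc": "total population of a county in thousands"},
            "census-starting-date": {"type": "date",    "desc": "census starting date"},
            "census-ending-date":   {"type": "date",    "desc": "census ending date"},
        },
    },
    "City Schema": {
        "CITY-SIZE": {
            "name":       {"type": "string", "desc": "name of city (identifier)"},
            "gross-size": {"type": "real",   "desc": "total area of a city in acres"},
            "map":        {"type": "TIFF",   "desc": "city map"},
        },
        "CENSUS": {
            "census-id":   {"type": "integer", "desc": "system-generated unique identifier"},
            "city":        {"type": "string",  "desc": "FK referencing name of CITY-SIZE"},
            "population":  {"type": "integer", "desc": "total population of a city"},
            "census-date": {"type": "string",  "desc": "census starting date"},
            "duration":    {"type": "integer", "desc": "total days taken for the census"},
        },
    },
}
print(sorted(local_schemas["County Schema"]["POPULATION"]))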

Figure 3. Federated Schema and Corresponding Schema Mappings. (The federated schema relations REGION (name, area, map) and CENSUS (id, region-name, population, starting-date, ending-date, duration) are mapped to COUNTY-AREA and POPULATION in County Schema and to CITY-SIZE and CENSUS in City Schema.)

Let's assume that the user submits the query, "Retrieve all region names, their sizes, populations, and the census start dates where the census started after January 1, 1997, and the size of the region is greater than or equal to 30,000,000 square meters." Recall that, in CREAM, users can interact only with federated schemas to submit queries, as shown in Figure 4. The QBS user dialog interface allows the user to select the desired attributes defined in each entity class, as illustrated in Figure 5. The user can also specify query conditions. The query generator then parses the global query and generates two query graphs, because the query involves two relations connected through a join operation. Note that the syntax of the global query is based on the federated schema; it will later be modified by the semantic mediators to generate valid local queries. Based on the user's specification, the query generator creates the following global query:

Example (Global Query):
SELECT Region.name, Region.area, Census.population, Census.starting-date
FROM Region, Census
WHERE Region.area >= 30000000
  AND Census.starting-date >= '01-JAN-1997'
  AND Census.region-name = Region.name

After generating the global query, corresponding temporary subqueries are generated based on the schema mapping knowledge. In this step, the entity and attribute names used in the global query are rewritten in subqueries whose names match the corresponding local schemas. These temporary subqueries will be used for generating valid local queries in a later stage.
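A minimal sketch of this renaming step is shown below, assuming a simple dictionary encoding of the schema mapping knowledge of Figure 3. The structures and the string-substitution approach are illustrative simplifications, not the prototype's query generator. Note that only names are rewritten at this stage; the selection-condition values are transformed later, during semantic transformation.

Example (Python sketch, illustrative only):

schema_mappings = {
    "County Schema": {
        "Region": "County-area", "Region.name": "County-area.name",
        "Region.area": "County-area.area",
        "Census": "Population", "Census.population": "Population.size",
        "Census.starting-date": "Population.census-starting-date",
        "Census.region-name": "Population.county",
    },
    "City Schema": {
        "Region": "City-size", "Region.name": "City-size.name",
        "Region.area": "City-size.gross-size",
        "Census": "Census", "Census.population": "Census.population",
        "Census.starting-date": "Census.census-date",
        "Census.region-name": "Census.city",
    },
}

def rewrite(global_query: str, mapping: dict) -> str:
    """Produce a temporary subquery by substituting federated names with local names."""
    # Replace longer names first so "Region.area" is rewritten before "Region".
    for fed_name in sorted(mapping, key=len, reverse=True):
        global_query = global_query.replace(fed_name, mapping[fed_name])
    return global_query

global_query = ("SELECT Region.name, Region.area, Census.population, Census.starting-date "
                "FROM Region, Census "
                "WHERE Region.area >= 30000000 AND Census.starting-date >= '01-JAN-1997' "
                "AND Census.region-name = Region.name")

for source, mapping in schema_mappings.items():
    print(source, "->", rewrite(global_query, mapping))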

The conflict detector then reasons over SCROL to identify possible semantic conflicts for each temporary subquery. This is accomplished by accessing the ontology-schema mapping knowledge. This mapping knowledge identifies semantically related schema components. The mappings between the schemas and SCROL are graphically illustrated in Figure 6. Note that, to keep the example simple, Figure 6 shows neither the complete hierarchical structure of concepts and instances of SCROL nor the schema mappings between the federated and local schemas. The dotted arrowhead lines indicate the mapping knowledge between each schema component and the corresponding SCROL component used in this example.

Figure 4. Global Query Initiation

Figure 5. Global Query Formulation

Figure 6. Graphical Illustration of Ontology-Schema Mapping Knowledge. (SCROL concepts and instances relevant to this example: Square Meter and Acre under the concept Area with a total one-one mapping; scale instances under the concept Scale with a total one-one mapping; and date-format instances such as mm/dd/yy under Julian Date Type and Month Day, Year under String Type, where the two type concepts are peers under Date. Dotted lines connect these SCROL components to attributes of the federated schema (REGION, CENSUS) and of the local schemas (COUNTY-AREA, POPULATION, CITY-SIZE, CENSUS).)
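The sketch below suggests how a conflict detector could combine the ontology-schema mapping knowledge of Figure 6 with the parenthood and domain value mapping relationships to flag conflicts, infer synonyms, and judge resolvability; the prose that follows walks through the same cases. The encodings and the detect function are assumptions for illustration, not the prototype's reasoning engine.

Example (Python sketch, illustrative only):

ontology_schema_mappings = {            # schema attribute -> SCROL instance
    "County-area.area": "Square Meter",
    "City-size.gross-size": "Acre",
    "Population.size": "10^3",
    "Census.population": "10^0",
    "Population.census-starting-date": "mm/dd/yy",
    "Census.census-date": "Month Day, Year",
}

parent_concept = {                      # parenthood relationships in SCROL
    "Square Meter": "Area", "Acre": "Area",
    "10^3": "Scale", "10^0": "Scale",
    "mm/dd/yy": "Julian Date Type", "Month Day, Year": "String Type",
}

value_mapping = {                       # domain value mapping relationships
    ("Square Meter", "Acre"): "total one-one",
    ("10^3", "10^0"): "total one-one",
    ("mm/dd/yy", "Month Day, Year"): "total one-one",
}

def detect(attr_a, attr_b):
    inst_a = ontology_schema_mappings[attr_a]
    inst_b = ontology_schema_mappings[attr_b]
    if inst_a == inst_b:
        return None                                   # same context, no conflict
    kind = value_mapping.get((inst_a, inst_b)) or value_mapping.get((inst_b, inst_a))
    return {
        "attributes": (attr_a, attr_b),
        "contexts": (inst_a, inst_b),
        "synonyms": parent_concept[inst_a] == parent_concept[inst_b],  # same parent concept
        "value mapping": kind,
        "resolvable": kind is not None and kind.startswith("total"),
    }

print(detect("County-area.area", "City-size.gross-size"))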

As shown in Figure 6, the attributes area in County-Area and gross-size in City-Size are mapped to Square Meter and Acre, respectively. This is an example of a data unit conflict. The mapping relationship between Square Meter and Acre is a total one-one. Therefore, all values of area in County-Area can be transformed to values of gross-size in City-Size. In addition, note that, since the parent of the two instances Square Meter and Acre is the concept Area, this implies that the two attributes, area and gross-size, are synonyms. Synonyms occur when the same concept is called by different names. In other approaches (e.g., Reddy et al. 1994), homonyms and synonyms are detected by external specification and resolved by providing mapping tables (or lookup tables) for translating from one value to another. In CREAM, however, naming conflicts are automatically inferred and resolved from the ontology relationship knowledge. The attribute size in Population is mapped to the instance 10^3 of the concept Scale, while the attribute population in Census is mapped to the instance 10^0 of the concept Scale. The mapping relationship between 10^3 and 10^0 is total one-one because it is possible to convert all values in 10^3 to 10^0 by multiplying values of size by 1,000 and vice versa (i.e., by dividing values of population by 1,000). This is an example of a data precision conflict. Again, since both size and population are mapped to the same concept Scale, they are automatically detected as a synonym conflict. Another example is a data representation conflict between census-starting-date in Population and census-date in Census. The census-starting-date is mapped to mm/dd/yy, which is an instance of the concept Julian Date Type, while census-date is mapped to Month Day, Year, which is an instance of the concept String Type. Both Julian Date Type and String Type are children of the concept Date and have a peer relationship. Thus, by definition, values from both attributes can be directly converted to each other.
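A conflict resolver for the three conflicts above might apply transformation rules along the following lines. The rule table, the conversion constant, and the date formats are illustrative assumptions rather than CREAM's actual semantic transformation rules.

Example (Python sketch, illustrative only):

from datetime import datetime

SQM_PER_ACRE = 4046.8564224          # 1 acre expressed in square meters

transformation_rules = {
    ("Acre", "Square Meter"): lambda v: v * SQM_PER_ACRE,
    ("Square Meter", "Acre"): lambda v: v / SQM_PER_ACRE,
    ("10^3", "10^0"): lambda v: v * 1000,       # county population stored in thousands
    ("10^0", "10^3"): lambda v: v / 1000,
    ("mm/dd/yy", "Month Day, Year"):
        lambda v: datetime.strptime(v, "%m/%d/%y").strftime("%B %d, %Y"),
    ("Month Day, Year", "mm/dd/yy"):
        lambda v: datetime.strptime(v, "%B %d, %Y").strftime("%m/%d/%y"),
}

def resolve(value, original_context, target_context):
    """Convert a value from its original context to the user's target context."""
    if original_context == target_context:
        return value
    return transformation_rules[(original_context, target_context)](value)

print(resolve(7413.16, "Acre", "Square Meter"))          # about 30,000,000 square meters
print(resolve(250, "10^3", "10^0"))                      # 250 (thousands) -> 250000
print(resolve("5/7/98", "mm/dd/yy", "Month Day, Year"))  # May 07, 1998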

Note that, although not shown in this example, one-many problems, often found in data precision conflicts, are typically resolved by one-way mapping unless specific transformation rules are specified. For instance, in one database, customer locations may be recorded by their street addresses, while they are stored as postal codes in another database. The former is described at a finer granularity (i.e., a data granularity conflict). In this case, typically, only one-way mapping (a one-to-many mapping) is possible; that is, from the more precise data scale (i.e., address) to the coarser scale (i.e., postal code). However, in other cases, such as the mapping between a three-point scale with letter grades (excellent, good, and poor) and a numeric scale from 1 to 100, rules may have to be specified to transform from the more precise data scale (i.e., the numeric scale) to the coarser scale (i.e., the three-point scale) and vice versa. Similarly, in the case of many-many problems, it may not always be possible to specify a precise transformation between the two. In CREAM, conflict resolvers can handle these types of conflicts using semantic transformation rules. After identifying semantic conflicts through the ontology-schema mapping knowledge, the conflict detector keeps them in the individual local queries. The conflict detector also identifies and maintains three new pieces of information that will be needed by the conflict resolvers in a later stage: a conflict controller, an original context, and a target context. The conflict controller is a concept that has child concepts or instances, each of which represents one of the multiple interpretations of semantically related objects. Each conflict controller has a corresponding semantic resolver that handles semantic reconciliation between the conflict controller's child concepts or instances. The original context is a concept or an instance that has been mapped to the local schema component to represent the specific context of that component. The target context represents the resulting context into which the user wants to transform the meanings of the data values from the local databases. As depicted in Figure 7, CREAM displays each attribute's corresponding conflict controller and asks the user to select a target context for each attribute. In this example, three attributes that contain semantic conflicts are presented to the user: a data unit conflict (i.e., detected from the

federated schema attribute Region.area, which is mapped to County-area.area in County Schema and City-size.gross-size in City Schema), a data precision conflict (i.e., detected from the federated schema attribute Census.population, which is mapped to Population.size in County Schema and Census.population in City Schema), and a data representation conflict (i.e., detected from the federated schema attribute Census.starting-date, which is mapped to Population.census-starting-date in County Schema and Census.census-date in City Schema). The user can select a target context for each semantic conflict in the preferred view.

Figure 7. Target Context Selection

In the next step, the user has the option to choose individual sources. The meta-information on each data source at the attribute level is visualized to help the user decide which data sources should be retrieved from the local databases. If the user decides not to retrieve a particular data source, no further query processing for that source occurs. This approach is similar to the source tagging described in Saltor and Rodríguez (1997). The next step after the target context selection is to generate valid local queries. As explained previously, the query generator has already checked the schema mapping knowledge to produce appropriate relation and attribute names for each local schema. It now examines the ontology-schema mapping knowledge to modify the selection conditions for each local subquery. This is done by comparing the conditional statements that reflect the target contexts of the global query with the conditional statements of the local subqueries in their original contexts. The original contexts

reflect the intended semantics of the local databases. This is a very important step because, without proper semantic transformation from the global query conditions to the matching local query conditions, the subqueries cannot reflect the intended semantics of the local databases. The context of a conditional statement can remain the same if the original context and the target context are identical; otherwise it should be transformed to the original context of the local database. For example, since the global query condition statement is WHERE Region.area >= 30000000 and the target context of the attribute area is Square Meter, the context of the local query condition statement for County Schema does not need to be transformed (the original and target contexts are the same: square meters). However, the context of the local query condition statement for City Schema must be transformed to reflect the original context in the database, because the original context, Acre, is different from the target context, Square Meter. Therefore, the query condition should be modified to WHERE City-size.gross-size >= 7413.16 (30,000,000 square meters is approximately 7,413.16 acres). This semantic transformation is automatically performed by the conflict resolver. The parsed local query graphs are shown in Figure 8 and, subsequently, the valid local query statements are generated from the local query graphs as follows:

Example (A Generated Query for County Schema):
SELECT County-area.name, County-area.area, Population.size, Population.census-starting-date
FROM County-area, Population
WHERE County-area.area >= 30000000
  AND Population.census-starting-date >= '1/1/97'
  AND County-area.name = Population.county

Example (A Generated Query for City Schema):
SELECT City-size.name, City-size.gross-size, Census.population, Census.census-date
FROM City-size, Census
WHERE City-size.gross-size >= 7413.16
  AND Census.census-date >= 'January 1, 1997'
  AND City-size.name = Census.city
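To summarize the two directions of semantic transformation in this example, the sketch below converts the global condition value into each source's original context before the local query runs, and converts retrieved rows back into the user's target contexts before they are assembled. The helper functions and the sample rows are hypothetical and shown only for illustration; date reformatting is omitted for brevity.

Example (Python sketch, illustrative only):

SQM_PER_ACRE = 4046.8564224

def to_original_context(area_sqm, source):
    # County Schema already stores square meters; City Schema stores acres.
    return area_sqm if source == "County Schema" else area_sqm / SQM_PER_ACRE

def to_target_context(row, source):
    name, area, population, date = row
    if source == "County Schema":
        population *= 1000                 # stored in thousands (10^3 -> 10^0)
    else:
        area *= SQM_PER_ACRE               # stored in acres -> square meters
    return (name, round(area), population, date)

threshold = 30_000_000                      # square meters, from the global query
print(to_original_context(threshold, "City Schema"))   # about 7413.16 acres for the City query

# Hypothetical rows, purely for illustration, as they might come back from the two sources.
county_rows = [("Sample County", 23_000_000_000.0, 843, "1/2/97")]
city_rows = [("Sample City", 148_000.0, 487000, "January 2, 1997")]
assembled = ([to_target_context(r, "County Schema") for r in county_rows] +
             [to_target_context(r, "City Schema") for r in city_rows])
print(assembled)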

Figure 8. Query Graphs Generated from the Local Schemas. (One query graph per relation for each local schema: County-area and Population for County Schema, and City-size and Census for City Schema; each graph has projection, selection, and join nodes corresponding to the generated local queries above.)

Finally, all the valid local queries are passed to each local database server and executed to retrieve the data. The conflict resolver then performs semantic reconciliation by applying the appropriate semantic transformation rules and converting the data to the target contexts that have already been specified by the user. The final query results are then combined by the data collector and displayed to the user, as shown in Figure 9.

Figure 9. Query Result after Semantic Reconciliation. (Footnote 2: The host names of the data sources are temporarily suppressed for the review because they could identify the authors.)
