Propagating Dependencies under Schema Mappings: A Graph-based Approach


Qing Wang
Research School of Computer Science, Australian National University, Canberra ACT 0200, Australia
qing.wang@anu.edu.au

Xi Wen
Department of Computer Science, Nanchang University, Nanchang City, China
wenxi@ncu.edu.cn

Some of the work reported in this paper was undertaken when the second author was visiting the Research School of Computer Science, Australian National University.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. IDEAS '15, July 13-15, 2015, Yokohama, Japan. Copyright © 2015 ACM.

ABSTRACT

Schema mapping plays an important role in many database-related transformation tasks, such as data exchange, data integration and data migration. In this paper, we study the dependency propagation problem in the context of schema mappings. This allows us to understand and discover logical consequences among source constraints, target constraints and mapping constraints of a schema mapping. In order to precisely characterize the relationships between source and target schemas, we consider mapping constraints as being bipartite TGDs, i.e., a class of tuple-generating dependencies (TGDs) that include both source-to-target dependencies and target-to-source dependencies. We then develop propagation graphs to represent the relationships among the attributes of different relations and, based on such propagation graphs, we propose algorithms to propagate inclusion and functional dependencies between source and target schemas. We have also designed a schema mapping reasoning tool to implement and evaluate our proposed approach.

Categories and Subject Descriptors
H.2.1 [Information Systems]: Database Management, Logical Design

Keywords
Schema Mappings, Dependency Propagation, Data Integration, Data Dependencies

1. INTRODUCTION

A schema mapping is concerned with specifying relationships between the elements of a source schema and a target schema. It plays an important role in many database-related transformation tasks, such as data exchange [12, 25], data integration [26] and data migration [32, 33]. As the relationships between two schemas are often quite complicated, and not just one-to-one correspondences at the schema level, specifying such relationships is by no means an easy task.

Generally, two lines of research exist in the area of designing schema mappings: one is to generate schema mappings from a visual specification provided by users, and the other is to derive schema mappings based on data examples. The former has been traditionally studied for many years [6, 12, 22, 25], while the latter has attracted considerable interest in recent years [1, 2, 21, 30]. Nevertheless, these works also have some limitations. For instance, a visual specification is often ambiguous, which makes it difficult to generate the desired schema mapping. Data examples may not always be available or, even if available, could be biased, so that schema mappings are derived inaccurately. To remedy these deficiencies, existing approaches either require a manual process of tuning schema mappings, which is often laborious and error-prone, or demand more data examples for improving accuracy.
Despite these attempts, the resulting design quality of schema mappings is still far from ideal. In this paper, we aim to develop an approach that helps understand how well a schema mapping is designed, including answering the following questions of interest:

- Can we ensure that certain properties of a source database are preserved in the desired target database through the design of a schema mapping?
- Can we determine whether or not a target constraint can be enforced on a target database before the target database is transformed from a given source database?
- If some target constraints cannot be enforced, can we efficiently identify which data in the source database need to be cleansed, or determine whether the schema mapping and target constraints need to be re-designed?

In many real-life applications, implementing a schema mapping to materialize a target database is an expensive undertaking, especially when the source database has a large amount of data. Therefore, it is crucial to check in advance, if possible, whether a schema mapping is designed meaningfully and effectively, before an implementation takes place.

Related work. Research on schema mappings has received a great deal of attention over the past decades [3, 6, 11, 12, 25, 31]. In relation to the universal target solution and query answering problems, high-level specification languages used for schema mappings have been investigated [3, 6]. Viewing schema mappings as metadata, the composition and inverse operations of schema mappings have been developed in [4, 11, 14, 27, 29]. Early works on generating schema mappings from a visual specification have led to the development of a good number of schema mapping design systems such as Clio [22], HePToX [7], and Altova MapForce. Recent works have focused more on deriving schema mappings from data examples, such as Eirene [2] and Muse [1]. Several works have also studied schema mappings in terms of optimization [13, 20], simplification [8], debugging [9] and learning [30]. Our work in this paper is intended to complement, not replace, these existing techniques by providing an approach for efficiently analyzing the design of a schema mapping through dependency propagation from target to source, or from source to target, under the schema mapping.

Previously, several works have proposed graphical approaches to represent TGDs and EGDs [10, 12, 28]. In [10] the authors studied the implication problem of functional and inclusion dependencies using a directed graph in which each node represents a relation and each edge represents an inclusion dependency between two relations. A set of inclusion dependencies is acyclic if such a graph has no cycle. The authors of [12] proposed a graphical approach to identify a class of TGDs that can guarantee the termination of a chase procedure, the so-called weakly acyclic sets of TGDs. Our approach in this paper generalizes the graphical representations of these works by associating a relation schema with each vertex, and labelling each edge with a function between attributes of two relation schemas.

Dependency propagation has previously been studied in the context of views by several works [18, 23, 24].
Such views may be defined over a database using different fragments of relational algebra, and the dependencies considered in these works were primarily (conditional) functional dependencies [18, 23] or join dependencies [24]. In particular, the authors of [18] generalised the results on propagating functional dependencies to conditional functional dependencies [17]. Unlike dependency propagation on views, which are unidirectional mappings, we study dependency propagation between two schemas that may have bidirectional mappings, and investigate the dependency propagation problem for improving the design quality of schema mappings. To the best of our knowledge, this area has not yet been explored in the literature.

Contributions. This paper makes the following contributions:

- We study schema mappings by allowing mapping constraints to be bipartite TGDs over source and target schemas, including both source-to-target and target-to-source TGDs. This enables us to accurately represent the relationships between two database schemas.
- We propose the notion of propagation graph, and use it as an effective model to visualize the relationship between source and target schemas. We can also navigate through a propagation graph to analyze logical connections among source, target and mapping constraints when designing a schema mapping.
- We investigate the dependency propagation problem in the context of schema mappings. Based on propagation graphs, we develop algorithms to propagate dependencies between source and target schemas under a schema mapping. It is well known that the implication problem for TGDs is undecidable [5]. Our graph-based approach provides an approximate but efficient solution to this problem.
- We have developed a bipartite schema mapping tool to visualize propagation graphs and, on top of propagation graphs, to incorporate our propagation algorithms for inclusion and functional dependencies. We have applied our schema mapping tool to two schema mapping data sets to evaluate its usability in real-world applications.

Outline. The remainder of the paper is structured as follows. Section 2 provides an example to illustrate why dependency propagation is needed in schema mappings. Section 3 presents the basic definitions related to schema mappings and dependency propagation. In Section 4 we describe a graphical model that captures the inter-relationships among TGDs, and discuss the algorithms for propagating inclusion and functional dependencies. The experimental study is presented in Section 5. We conclude the paper in Section 6 with an outlook on future work.

2. A MOTIVATING EXAMPLE

Generally, a schema mapping over a source schema S and a target schema T is associated with three kinds of constraints: source constraints over S, target constraints over T, and mapping constraints over S and T. Source and target constraints govern the integrity of data stored in source and target databases, respectively, while mapping constraints control the transformations between them. Given such a schema mapping, a natural question is: how can we discover logical consequences among its source, target and mapping constraints? An answer to this question would provide us with a conceptual view of how well a schema mapping is designed, i.e., how much of the semantics specified by the source schema is preserved by the target schema or, conversely, how much of the semantics specified by the target schema is captured by the source schema.

The following example illustrates that, in real-world applications, finding out how well a schema mapping is designed often requires comparing the source and target constraints in terms of a given set of mapping constraints.

Example 2.1. Suppose that we have the following schema mapping over a source schema S and a target schema T in an App application.
(1) S contains three relation schemas: $\mathit{Rent}$(id, name, address), $\mathit{Rent}'$(no, address, rent) and $\mathit{All}$(name, dob, gender, cid), and source constraints $\Sigma_s = \emptyset$.

(2) T also contains three relation schemas, over the attributes (no, address), (id, name) and (id, no, rent). Their names are not legible in this copy and are written $T_1$, $T_2$ and $T_3$ here. The target constraints are $\Sigma_t = \{T_3 : \text{no} \rightarrow \text{rent},\ T_3[\text{id}] \subseteq T_2[\text{id}],\ T_3[\text{no}] \subseteq T_1[\text{no}]\}$, where $T_3 : \text{no} \rightarrow \text{rent}$ is a functional dependency defined on $T_3$, and the others are inclusion dependencies.

(3) Mapping constraints between S and T are presented in Figure 3, which contains: (i) C1–C3, which specify how source instances are transformed into target instances, and (ii) C4–C6, which specify how target instances relate to source instances. For example, C2 states that whenever a tuple (i, n, a) in $\mathit{Rent}$ and a tuple (o, a, r) in $\mathit{Rent}'$ coincide on the attribute address, then there must be a tuple (i, o, r) in $T_3$; C3 states that each tuple (o, a, r) in $\mathit{Rent}'$ leads to a tuple (o, a) in $T_1$ and a tuple (i, o, r) in $T_3$, where i is an unknown value.

Figure 1: A source instance I over the source schema S

$\mathit{Rent}$
id | name | address
c1 | Tim Jenkin | 5 Jicket St, Dunedin
c2 | Linda Lee | 36 Novar St, Dunedin
c3 | Mike Carl | 2 Manor St, Dunedin

$\mathit{Rent}'$
no | address | rent
1 | 5 Jicket St, Dunedin | …
1 | 5 Jicket St, Dunedin | …
2 | 2 Manor St, Dunedin | 450

$\mathit{All}$
name | dob | gender | cid
Linda Lee | 30/Mar/1990 | f | c2
Mike Carl | 15/Jun/1884 | m | c3
Peter Wong | 01/Jan/1880 | m | c4

Figure 2: A target instance over the target schema T

$T_2$
id | name
c1 | Tim Jenkin
c2 | Linda Lee
c3 | Mike Carl

$T_1$
no | address
1 | 5 Jicket St, Dunedin
2 | 2 Manor St, Dunedin

$T_3$
id | no | rent
c1 | 1 | …
c3 | 2 | 450

Figure 3: Mapping constraints C1–C6

(C1) $\forall x, y, z.(\mathit{Rent}(x, y, z) \rightarrow T_2(x, y))$;
(C2) $\forall x, y, z, x', z'.(\mathit{Rent}(x, y, z) \wedge \mathit{Rent}'(x', z, z') \rightarrow T_3(x, x', z'))$;
(C3) $\forall x, y, z.(\mathit{Rent}'(x, y, z) \rightarrow \exists x'.(T_1(x, y) \wedge T_3(x', x, z)))$;
(C4) $\forall x, y, z.(T_3(x, y, z) \rightarrow \exists z'.\mathit{Rent}'(y, z', z))$;
(C5) $\forall x, y, z.(T_3(x, y, z) \rightarrow \exists y', z'.\mathit{Rent}(x, y', z'))$;
(C6) $T_3(x, y, z) \rightarrow \exists x', y', z'.\mathit{All}(x', y', z', x)$.

In the presence of C1–C6, it would be desirable to find a set of constraints $\Sigma_s'$ (resp. $\Sigma_t'$) that can be automatically propagated from the target schema to the source schema (resp. from the source schema to the target schema), as shown below:

$\Sigma_s' = \{\mathit{Rent}' : \text{no} \rightarrow \text{rent}\}$, $\Sigma_t' = \{T_3[\text{id}] \subseteq T_2[\text{id}],\ T_3[\text{no}] \subseteq T_1[\text{no}]\}$.
Then, by comparing $\Sigma_s'$ (i.e., the source constraints propagated under the given schema mapping) and $\Sigma_s$ (i.e., the original source constraints), we would know that some source instance might violate the constraint $\mathit{Rent}' : \text{no} \rightarrow \text{rent}$, e.g. the source instance in Figure 1. It means that either the data in $\mathit{Rent}'$ need to be cleansed, or this constraint needs to be reconsidered. Suppose that we clean up $\mathit{Rent}'$ by removing the first tuple. Then we would obtain the target instance depicted in Figure 2. Similarly, by comparing $\Sigma_t'$ (i.e., the target constraints propagated under the given schema mapping) and $\Sigma_t$ (i.e., the original target constraints), we would know that the two inclusion dependencies in $\Sigma_t$ hold on every target instance under this schema mapping. We will discuss how $\Sigma_s'$ and $\Sigma_t'$ can be obtained through dependency propagation under the given schema mapping in Section 4.

3. PRELIMINARIES

A (relational) database schema S consists of a finite, non-empty set of relation schemas. Each relation schema R has a fixed arity and a finite set attr(R) of attribute names. A relational atom is an expression of the form $R(t_1, \dots, t_n)$ for an n-ary $R \in S$, and an equality atom is an expression of the form $t_1 = t_2$, where $t_1, \dots, t_n$ are variables or constants.

A constraint (a so-called embedded dependency [16]) $\sigma$ over S is an expression of the form $\forall \bar{x}, \bar{y}.(\varphi(\bar{x}, \bar{y}) \rightarrow \exists \bar{z}.\psi(\bar{x}, \bar{z}))$, where $\varphi$ and $\psi$ are conjunctions of atoms over S, and $\bar{x}$, $\bar{y}$ and $\bar{z}$ are mutually disjoint tuples of variables. If only relational atoms occur in $\psi$, $\sigma$ is called a tuple-generating dependency (TGD). If only equality atoms occur in $\psi$, $\sigma$ is called an equality-generating dependency (EGD). For brevity, the universal quantification is often omitted in $\sigma$. We call $\varphi$ the premise of $\sigma$, i.e. $Pre(\sigma) = \varphi(\bar{x}, \bar{y})$, and $\psi$ the conclusion of $\sigma$, i.e. $Con(\sigma) = \psi(\bar{x}, \bar{z})$.

A functional dependency (FD) defined on a relation schema R is an EGD expressed as $R : X \rightarrow Y$ for $X, Y \subseteq attr(R)$ [16] (i.e., whenever two tuples of R agree on the attributes X, they must also agree on the attributes Y). An inclusion dependency (IND) defined on two relation schemas $R_1$ and $R_2$ is a TGD expressed as $R_1[X] \subseteq R_2[Y]$ for $X \subseteq attr(R_1)$, $Y \subseteq attr(R_2)$ and $|X| = |Y|$ (i.e., whenever a tuple $t_1$ occurs in $R_1$, there must exist a tuple $t_2$ in $R_2$ such that the values in the attributes X of $t_1$ are the same as the values in the attributes Y of $t_2$).

A (relational) database instance I over S assigns to each relation schema $R \in S$ a finite relation I(R). As a convention, we use inst(S) to refer to the set of all database instances over S, and $I \models \sigma$ to denote that a database instance $I \in inst(S)$ satisfies $\sigma$. We have $I \models \Sigma$ iff $I \models \sigma$ for every $\sigma \in \Sigma$.

A schema mapping is a triple $M = (S, T, \Sigma_m)$ consisting of a source schema S, a target schema T and a set $\Sigma_m$ of mapping constraints specified by some logical formalism over S and T. Instances of S are called source instances and instances of T are called target instances. Similarly, constraints over S are called source constraints and constraints over T are called target constraints.

Figure 4: A source instance I

Figure 5: Three possible target instances $J_1$, $J_2$ and $J_3$

In previous studies of data integration and data exchange, mapping constraints are typically formulated as source-to-target TGDs [3, 12, 31] or certain subclasses of source-to-target TGDs such as LAV, GAV and GLAV [26]. A source-to-target TGD is a TGD in which $\varphi$ is a conjunction of atoms over S and $\psi$ is a conjunction of atoms over T.
Conversely, a target-to-source TGD is a TGD in which $\varphi$ is a conjunction of atoms over T and $\psi$ is a conjunction of atoms over S. Target-to-source TGDs are often used in inverting schema mappings, such as Fagin's S-inverse [11] and quasi-inverse [15].

In this paper, we consider source and target constraints as the union of a set of INDs and a set of FDs, and mapping constraints as including both source-to-target and target-to-source TGDs. To express this formally, we say that each mapping constraint is a bipartite TGD over S and T, which is either a source-to-target TGD or a target-to-source TGD over S and T. Let $(I, J) \models \Sigma_m$ denote that a source instance $I \in inst(S)$ and a target instance $J \in inst(T)$ satisfy all mapping constraints in $\Sigma_m$. A model transformation under $M = (S, T, \Sigma_m)$ translates a given source instance $I \in inst(S)$ into a target instance $J \in inst(T)$ such that (I, J) satisfies every mapping constraint in $\Sigma_m$, and $inst(M) = \{(I, J) \mid (I, J) \models \Sigma_m\}$.

The co-existence of target-to-source and source-to-target TGDs in $\Sigma_m$ gives us the flexibility to precisely capture the known relationship between source and target instances. Target-to-source TGDs can constrain target instances by working simultaneously with source-to-target TGDs specified in the same $\Sigma_m$, as illustrated by the following example.

Example 3.1. Consider a source schema S = {R} and a target schema T = {P, Q}. Figure 4 depicts a source instance I over S and Figure 5 depicts three possible target instances $J_1$, $J_2$ and $J_3$ over T.

(1) If we have a schema mapping $M_1 = (S, T, \Sigma_1)$ where $\Sigma_1$ contains $R(x, y, z) \rightarrow P(x, y) \wedge Q(y, z)$, then $(I, J_i) \in inst(M_1)$ for i = 1, 2, 3.

(2) If we have a schema mapping $M_2 = (S, T, \Sigma_2)$ where $\Sigma_2$ contains $R(x, y, z) \rightarrow P(x, y) \wedge Q(y, z)$; $R(x, y, z) \leftarrow P(x, y) \wedge Q(y, z)$, then $(I, J_i) \in inst(M_2)$ for i = 2, 3, but $(I, J_1) \notin inst(M_2)$.

(3) If we have a schema mapping $M_3 = (S, T, \Sigma_3)$ where $\Sigma_3$ contains $R(x, y, z) \rightarrow P(x, y) \wedge Q(y, z)$; $P(x, y) \rightarrow \exists z.R(x, y, z)$; $Q(y, z) \rightarrow \exists x.R(x, y, z)$, then $(I, J_3) \in inst(M_3)$, but $(I, J_i) \notin inst(M_3)$ for i = 1, 2.

4. GRAPH-BASED DEPENDENCY PROPAGATION

In this section we develop a graphical model for describing the inter-relationships among TGDs. Then, based on this graphical model, we propose algorithms for propagating inclusion and functional dependencies across two schemas under a schema mapping.

4.1 Propagation Graphs

Formally, a propagation graph G = (V, E) consists of a set V of vertices and a set E of edges, where each vertex $R \in V$ is a relation schema, and each edge $R \xrightarrow{f} R' \in E$ is directed and labelled by a function $f : attr(R) \rightarrow attr(R')$. Given a set $\Sigma$ of TGDs, the propagation graph of $\Sigma$ can be constructed by applying the following rules for each $\sigma = \varphi(\bar{x}, \bar{y}) \rightarrow \exists \bar{z}.\psi(\bar{x}, \bar{z})$ in $\Sigma$:

(1) Add an edge $R \xrightarrow{f} R'$ for each relational atom $R(\bar{x}', \bar{y}')$ in $Pre(\sigma)$ and each relational atom $R'(\bar{x}'', \bar{z}')$ in $Con(\sigma)$, where $\bar{x}', \bar{x}'' \subseteq \bar{x}$, $\bar{y}' \subseteq \bar{y}$ and $\bar{z}' \subseteq \bar{z}$, and the edge is labelled by $f : attr(R) \rightarrow attr(R')$ such that t and f(t) refer to the same variable in $\bar{x}' \cap \bar{x}''$.

(2) If $Pre(\sigma)$ contains more than one relational atom, all the edges yielded by $\sigma$ are of type "approximate"; otherwise, the edges are of type "exact". An approximate edge is removed from the graph when there exists an exact edge with the same start vertex, end vertex and label.

Intuitively, an edge $R \xrightarrow{f} R'$ represents that the existence of values in some attributes of R may require the existence of values in some attributes of $R'$, where the label f specifies which attributes of R and $R'$ are related. We distinguish two types of edges: exact and approximate. For an approximate edge, the existence of values in some attributes of R does not always require the existence of the values in the corresponding attributes of $R'$. For an exact edge, in contrast, the existence of values in some attributes of R implies that the values must exist in the corresponding attributes of $R'$.

Definition 4.1. Given a propagation graph G, a propagation path in G is a sequence of edges $R_1 \xrightarrow{f_1} R_2, \dots, R_{n-1} \xrightarrow{f_{n-1}} R_n$ such that the composition of $f_1, \dots, f_{n-1}$, denoted as $f = f_{n-1} \circ \dots \circ f_1$, is a function that maps a non-empty subset of the attributes of $R_1$ into the attributes of $R_n$. We say that such a path is labelled by f.

Note that, although every propagation path is a path (i.e., a sequence of edges) in a propagation graph, not every path in a propagation graph is a propagation path. For simplicity, we consider the attributes of an n-ary relation schema as being ordered, each having a distinct position between 1 and n. Consequently, every attribute of a relation schema R can simply be represented using its distinct position.

Figure 6: A propagation graph. Here $T_1$(no, address), $T_2$(id, name) and $T_3$(id, no, rent) denote the three target relation schemas of Example 2.1. The edges and their labels are:

Edge | Label
$\mathit{Rent} \xrightarrow{f_1} T_2$ | $f_1$: 1 → 1, 2 → 2
$\mathit{Rent} \xrightarrow{f_{2,1}} T_3$ | $f_{2,1}$: 1 → 1
$\mathit{Rent}' \xrightarrow{f_{2,2}} T_3$ | $f_{2,2}$: 1 → 2, 3 → 3
$\mathit{Rent}' \xrightarrow{f_{3,1}} T_1$ | $f_{3,1}$: 1 → 1, 2 → 2
$\mathit{Rent}' \xrightarrow{f_{3,2}} T_3$ | $f_{3,2}$: 1 → 2, 3 → 3
$T_3 \xrightarrow{f_4} \mathit{Rent}$ | $f_4$: 1 → 1
$T_3 \xrightarrow{f_5} \mathit{Rent}'$ | $f_5$: 2 → 1, 3 → 3
$T_3 \xrightarrow{f_6} \mathit{All}$ | $f_6$: 1 → 4
$T_3 \xrightarrow{g_1} T_2$ | $g_1$: 1 → 1
$T_3 \xrightarrow{g_2} T_1$ | $g_2$: 2 → 1

Example 4.1. Consider the schema mapping discussed in Example 2.1. We may have the propagation graph depicted in Figure 6, where:

- C1 yields $\mathit{Rent} \xrightarrow{f_1} T_2$.
- C2 yields $\mathit{Rent} \xrightarrow{f_{2,1}} T_3$ and $\mathit{Rent}' \xrightarrow{f_{2,2}} T_3$.
- C3 yields $\mathit{Rent}' \xrightarrow{f_{3,1}} T_1$ and $\mathit{Rent}' \xrightarrow{f_{3,2}} T_3$.
- C4 yields $T_3 \xrightarrow{f_5} \mathit{Rent}'$.
- C5 yields $T_3 \xrightarrow{f_4} \mathit{Rent}$.
- C6 yields $T_3 \xrightarrow{f_6} \mathit{All}$.
- The two INDs in $\Sigma_t$ yield $T_3 \xrightarrow{g_1} T_2$ and $T_3 \xrightarrow{g_2} T_1$.

In Figure 6, only $\mathit{Rent} \xrightarrow{f_{2,1}} T_3$ is an approximate edge, which is represented by a dashed line. The other edges are exact. $\mathit{Rent}' \xrightarrow{f_{2,2}} T_3$ is removed from the propagation graph because it is approximate and identical to the exact edge $\mathit{Rent}' \xrightarrow{f_{3,2}} T_3$. The sequence $T_3 \xrightarrow{f_4} \mathit{Rent}$, $\mathit{Rent} \xrightarrow{f_1} T_2$ is a propagation path with the label $f = f_1 \circ f_4$ such that f(1) = 1. However, $\mathit{Rent}' \xrightarrow{f_{3,2}} T_3$, $T_3 \xrightarrow{f_6} \mathit{All}$ is not a propagation path because $f_6 \circ f_{3,2}$ does not map any attribute of $\mathit{Rent}'$ to the attributes of $\mathit{All}$.

In accordance with the rules for constructing propagation graphs, we have the following theorem.

Theorem 4.1. Let $\Sigma$ be a set of TGDs. The propagation graph of $\Sigma$ can be constructed in linear time in the size of $\Sigma$, i.e., the number of constraints in $\Sigma$.

4.2 Propagating Algorithms

In this section, we present our algorithms for propagating inclusion and functional dependencies across a schema mapping between source and target schemas.
The algorithm for propagating inclusion dependencies is discussed in Section 4.2.1 and the algorithm for propagating functional dependencies is discussed in Section 4.2.2. Let $M = (S, T, \Sigma_m)$ be a schema mapping, which is associated with a set $\Sigma_s$ of source constraints and a set $\Sigma_t$ of target constraints.
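The construction rules of Section 4.1 and the label composition of Definition 4.1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the BSM tool's code: the names `Edge`, `edges_of_tgd`, `build_graph` and `compose` are hypothetical, `Rent2` stands for $\mathit{Rent}'$, `T1`, `T2`, `T3` stand for the three target relation schemas of Example 2.1, and variables with a trailing digit stand for primed variables. It also assumes that no variable is repeated inside an atom and that a premise/conclusion atom pair sharing no variable yields no edge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str        # relation at the start of the edge
    dst: str        # relation at the end of the edge
    label: tuple    # sorted (position in src, position in dst) pairs, 1-based
    exact: bool     # False for an "approximate" edge


def edges_of_tgd(premise, conclusion):
    """Rules (1) and (2): one edge per premise atom / conclusion atom pair."""
    exact = len(premise) == 1
    edges = []
    for rel, rel_vars in premise:
        for rel2, rel2_vars in conclusion:
            label = tuple(sorted(
                (i + 1, j + 1)
                for i, v in enumerate(rel_vars)
                for j, w in enumerate(rel2_vars)
                if v == w
            ))
            if label:  # assumption: pairs sharing no variable yield no edge
                edges.append(Edge(rel, rel2, label, exact))
    return edges


def build_graph(tgds):
    """Collect all edges, then drop approximate edges duplicated by an exact one."""
    edges = [e for pre, con in tgds for e in edges_of_tgd(pre, con)]
    exact_keys = {(e.src, e.dst, e.label) for e in edges if e.exact}
    graph = []
    for e in edges:
        shadowed = not e.exact and (e.src, e.dst, e.label) in exact_keys
        if not shadowed and e not in graph:
            graph.append(e)
    return graph


def compose(f, g):
    """Label of 'f then g'; an empty result means no propagation path."""
    g_map = dict(g)
    return tuple((a, g_map[b]) for a, b in f if b in g_map)


# C1-C6 as listed in Figure 3 (hypothetical encoding of the atoms).
TGDS = [
    ([("Rent", ("x", "y", "z"))], [("T2", ("x", "y"))]),
    ([("Rent", ("x", "y", "z")), ("Rent2", ("x1", "z", "z1"))],
     [("T3", ("x", "x1", "z1"))]),
    ([("Rent2", ("x", "y", "z"))], [("T1", ("x", "y")), ("T3", ("x1", "x", "z"))]),
    ([("T3", ("x", "y", "z"))], [("Rent2", ("y", "z1", "z"))]),
    ([("T3", ("x", "y", "z"))], [("Rent", ("x", "y1", "z1"))]),
    ([("T3", ("x", "y", "z"))], [("All", ("x1", "y1", "z1", "x"))]),
]

graph = build_graph(TGDS)
by_ends = {(e.src, e.dst): e for e in graph}
for e in graph:
    print(e.src, "->", e.dst, dict(e.label), "exact" if e.exact else "approximate")

# T3 -> Rent -> T2 composes to a non-empty label; Rent2 -> T3 -> All does not.
f4_then_f1 = compose(by_ends[("T3", "Rent")].label, by_ends[("Rent", "T2")].label)
f32_then_f6 = compose(by_ends[("Rent2", "T3")].label, by_ends[("T3", "All")].label)
print(f4_then_f1, f32_then_f6)
```

Storing a label as sorted position pairs makes the "same start vertex, end vertex and label" test of rule (2) a plain equality check, which is why the duplicate approximate edge from C2 disappears without any special handling.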

Definition 4.2. A constraint $\sigma$ over T is said to be propagated from S to T under M if, for every model transformation (I, J) of M, whenever I satisfies $\Sigma_s$, J must satisfy $\sigma$.

Analogously, we can define constraints that are propagated from T to S under M. Since the same principles apply for propagating dependencies in either direction, we will only discuss the algorithms for propagating dependencies from S to T to avoid repetition.

4.2.1 Inclusion Dependencies

Our algorithm for propagating inclusion dependencies is built upon the notion of propagation graph. Figure 7 depicts the algorithm, which takes a schema mapping M and a set $\Sigma_s$ of source constraints as input, constructs the propagation graph of $\Sigma_m \cup \Sigma_s$, and then generates a set of propagated inclusion dependencies over T. The key idea is that each propagation path from $R_1$ to $R_2$ in the target schema corresponds to an inclusion dependency $R_1[X] \subseteq R_2[Y]$ between these two relation schemas. Note that such an inclusion dependency may be associated with more than one propagation path from $R_1$ to $R_2$. Depending on the types of the edges occurring in these propagation paths, an inclusion dependency is either exact or approximate.

Figure 7: Algorithm for propagating inclusion dependencies from S to T

Input: a schema mapping $M = (S, T, \Sigma_m)$ and a set $\Sigma_s$ of source constraints
Output: a set $\Sigma_t'$ of INDs over T
Steps:
1. Initialize $\Sigma_t' := \emptyset$;
2. Construct the propagation graph G of $\Sigma_m \cup \Sigma_s$;
3. Repeat the following for each propagation path between two $R_1, R_2 \in T$ in G with the label f, where f maps the attributes X of $R_1$ to the attributes Y of $R_2$: if all edges in the propagation path are exact, then $\Sigma_t' := \Sigma_t' \cup \{(R_1[X] \subseteq R_2[Y], \text{exact})\}$; otherwise $\Sigma_t' := \Sigma_t' \cup \{(R_1[X] \subseteq R_2[Y], \text{approximate})\}$;
4. If $\{(R_1[X] \subseteq R_2[Y], \text{exact}), (R_1[X] \subseteq R_2[Y], \text{approximate})\} \subseteq \Sigma_t'$, then $\Sigma_t' := \Sigma_t' - \{(R_1[X] \subseteq R_2[Y], \text{approximate})\}$;
5. Return $\Sigma_t'$.
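The steps of Figure 7 can be sketched as a depth-first enumeration of propagation paths. The sketch below is not the authors' implementation: `propagate_inds` and the edge encoding are hypothetical, the edge list is a hand-written copy of the mapping edges of Figure 6 (with `Rent2` for $\mathit{Rent}'$ and `T1`, `T2`, `T3` for the three target relation schemas of Example 2.1), and it assumes that only simple paths (no repeated relation) need to be explored.

```python
def propagate_inds(edges, schema):
    """Return {(R1, X, R2, Y): 'exact' | 'approximate'} for paths inside `schema`.

    edges: list of (src, dst, label, exact), label being {src_pos: dst_pos}.
    schema: set of relation names between which INDs are to be derived.
    """
    found = {}

    def walk(start, node, label, exact, visited):
        for src, dst, f, edge_exact in edges:
            if src != node or dst in visited:
                continue
            # Compose the label so far with f; the first edge is taken as is.
            if label is None:
                new_label = dict(f)
            else:
                new_label = {a: f[b] for a, b in label.items() if b in f}
            if not new_label:
                continue  # maps no attribute, so this is not a propagation path
            new_exact = exact and edge_exact
            if dst in schema:
                xs = tuple(sorted(new_label))
                key = (start, xs, dst, tuple(new_label[x] for x in xs))
                # Step 4 of Figure 7: an exact derivation overrides approximate ones.
                found[key] = found.get(key, False) or new_exact
            walk(start, dst, new_label, new_exact, visited | {dst})

    for rel in sorted(schema):
        walk(rel, rel, None, True, {rel})
    return {k: ("exact" if v else "approximate") for k, v in found.items()}


# Mapping edges of Figure 6 (the source constraints of Example 2.1 are empty).
EDGES = [
    ("Rent", "T2", {1: 1, 2: 2}, True),     # f1
    ("Rent", "T3", {1: 1}, False),          # f2,1 (approximate)
    ("Rent2", "T1", {1: 1, 2: 2}, True),    # f3,1
    ("Rent2", "T3", {1: 2, 3: 3}, True),    # f3,2
    ("T3", "Rent", {1: 1}, True),           # f4
    ("T3", "Rent2", {2: 1, 3: 3}, True),    # f5
    ("T3", "All", {1: 4}, True),            # f6
]

# From S to T: INDs between target relations.
forward = propagate_inds(EDGES, {"T1", "T2", "T3"})
# From T to S: INDs between source relations (target INDs g1, g2 omitted here).
backward = propagate_inds(EDGES, {"Rent", "Rent2", "All"})
print(forward)
print(backward)
```

On this input the sketch derives two exact INDs from positions of `T3` into `T2` and `T1`, and one approximate IND from position 1 of `Rent` into position 4 of `All`, whose only path uses the approximate edge.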
More specifically, $R_1[X] \subseteq R_2[Y]$ is exact if there exists at least one propagation path from $R_1$ to $R_2$ in which all the edges are exact, and $R_1[X] \subseteq R_2[Y]$ is approximate if all the propagation paths from $R_1$ to $R_2$ contain at least one approximate edge.

To resolve the ambiguities of approximate inclusion dependencies, we may involve human feedback using the following approach.

- For each approximate IND $\sigma$ and each of its propagation paths, generate a pair $\langle \sigma, \{\varphi_1 \rightarrow \psi_1, \dots, \varphi_n \rightarrow \psi_n\} \rangle$ where $\varphi_i \rightarrow \psi_i$ (i = 1, ..., n) correspond to the approximate edges occurring in the propagation path.
- Prompt the pair to users so that they can fine-tune the semantics of a schema mapping by confirming whether or not each $\varphi_i \rightarrow \psi_i$ ($1 \le i \le n$) holds.
- If $\varphi_i \rightarrow \psi_i$ does not hold, then mark the edge of $\varphi_i \rightarrow \psi_i$ as forbidden (i.e., remove the propagation path). If $\sigma$ has no propagation path left, then remove $\sigma$ from $\Sigma_t'$.
- If $\varphi_i \rightarrow \psi_i$ holds, then add $\varphi_i \rightarrow \psi_i$ into $\Sigma_m$. If $\sigma$ has one propagation path with all exact edges, change $\sigma$ to be exact in $\Sigma_t'$.

In doing so, $\Sigma_t'$ is refined to contain only exact inclusion dependencies, which are sound. Consequently, the specification of the schema mapping is fine-tuned to specify the desired transformations between source and target schemas.

Example 4.2. Consider the schema mapping depicted in Example 4.1 and the propagation graph depicted in Figure 6. Using the algorithm presented in Figure 7, we can obtain $\Sigma_s' = \{(\sigma_1, \text{approximate})\}$ and $\Sigma_t' = \{(\sigma_2, \text{exact}), (\sigma_3, \text{exact})\}$, where $\sigma_1$, $\sigma_2$ and $\sigma_3$, together with their propagation paths, are depicted in Figure 8. Because $\sigma_1$ is approximate, the pair $\langle \sigma_1, \{\mathit{Rent}(x, y, z) \rightarrow \exists x', z'.T_3(x, x', z')\} \rangle$ is prompted to the users, where $T_3$ denotes the target relation schema with attributes (id, no, rent) and $\mathit{Rent}(x, y, z) \rightarrow \exists x', z'.T_3(x, x', z')$ corresponds to the approximate edge $\mathit{Rent} \xrightarrow{f_{2,1}} T_3$. If it holds, then we have $\Sigma_s' = \{(\sigma_1, \text{exact})\}$. If it does not hold, we have $\Sigma_s' = \emptyset$.

4.2.2 Functional Dependencies

Now we discuss how to propagate functional dependencies under a schema mapping $M = (S, T, \Sigma_m)$.
In the same spirit as computing a propagation cover of functional dependencies in the context of view dependencies [18, 19], the set of all functional dependencies implied by $\Sigma_s$ needs to be calculated first. The following example illustrates this in more detail.

Example 4.3. Suppose that we have a relation schema $R(A_1, A_2, A_3)$ in S, a relation schema $R'(B_1, B_2, B_3)$ in T, $\Sigma_s = \{R : A_1 \rightarrow A_2,\ R : A_2 \rightarrow A_3\}$ and $\Sigma_m = \{R(x, y, z) \rightarrow \exists y'.R'(x, y', z),\ R'(x, y, z) \rightarrow \exists y'.R(x, y', z)\}$. By $\Sigma_m$, we know that the values of the attributes $B_1$ and $B_3$ in $R'$ are identical to the values of the attributes $A_1$ and $A_3$ in R. If we propagate only the FDs in $\Sigma_s$ onto $R'$, then $R' : B_1 \rightarrow B_3$ would be lost. However, if we propagate the FDs in $\Sigma_s^{+}$ onto $R'$, then we would have $R' : B_1 \rightarrow B_3$ propagated from S to T by $R : A_1 \rightarrow A_3$.

Figure 8: Three propagation paths in the propagation graph depicted in Figure 6, where $T_1$(no, address), $T_2$(id, name) and $T_3$(id, no, rent) denote the target relation schemas of Example 2.1.
(a) The propagation path for $\sigma_1$: $\mathit{Rent} \xrightarrow{f_{2,1}} T_3 \xrightarrow{f_6} \mathit{All}$
(b) The propagation path for $\sigma_2$: $T_3 \xrightarrow{f_4} \mathit{Rent} \xrightarrow{f_1} T_2$
(c) The propagation path for $\sigma_3$: $T_3 \xrightarrow{f_5} \mathit{Rent}' \xrightarrow{f_{3,1}} T_1$
$\sigma_1$: $\mathit{Rent}(x, y, z) \rightarrow \exists x', y', z'.\mathit{All}(x', y', z', x)$;
$\sigma_2$: $T_3(x, y, z) \rightarrow \exists y'.T_2(x, y')$;
$\sigma_3$: $T_3(x, y, z) \rightarrow \exists z'.T_1(y, z')$.

Let $\Sigma_s^{+}$ contain the set of all functional dependencies implied by the functional dependencies in $\Sigma_s$. Then we use the following algorithm to identify a set of FDs that are (possibly conditionally) propagated from S to T under M.

Push backward: For each $R : X \rightarrow Y \in \Sigma_s^{+}$, if there exists a propagation path $R', \dots, R$ labelled by f in the propagation graph of M, where $R' \in T$, and each attribute of XY has a preimage under f such that $X' = \{f^{-1}(x) \mid x \in X\}$ and $Y' = \{f^{-1}(y) \mid y \in Y\}$, then $R' : X' \rightarrow Y'$ is propagated from S to T under M.

Push forward: For each $R : X \rightarrow Y \in \Sigma_s^{+}$, if there exists a propagation path $R, \dots, R'$ labelled by f in the propagation graph of M, where $R' \in T$, and each attribute of XY has an image under f such that $X' = \{f(x) \mid x \in X\}$ and $Y' = \{f(y) \mid y \in Y\}$, and there does not exist a propagation path $R', \dots, R$ labelled by $f^{-1}$, then $R' : X' \rightarrow Y'$ is propagated from S to T under M with the condition that this FD can only be applied to a subset of tuples in the relation $I(R')$, i.e., $\{t' \mid t' \in I(R'),\ t \in I(R),\ \text{and } t.z = t'.f(z)\}$, where, for $z \in X \cup Y$, $t.z$ and $t'.f(z)$ denote the values of the attribute z of t and of the attribute f(z) of $t'$, respectively.

Example 4.4. Consider $\{T_3 : \text{no} \rightarrow \text{rent}\} \subseteq \Sigma_t$ and the propagation graph in Figure 6. A propagation path $v_1, v_2$ corresponds to $\mathit{Rent}' \xrightarrow{f_{3,2}} T_3$, where $v_1 = \mathit{Rent}'$, $v_2 = T_3$, $f_{3,2}$ is the label of the path, and the preimages of no and rent under $f_{3,2}$ are still no and rent, respectively. By the rule push backward, $\mathit{Rent}' : \text{no} \rightarrow \text{rent}$ is propagated from T to S.
Although there is a propagation path $v_2, v_1$ corresponding to $T_3 \xrightarrow{f_5} \mathit{Rent}'$, where $T_3$ denotes the target relation schema with attributes (id, no, rent), by the rule push forward and the fact that $f_{3,2}^{-1} = f_5$, no further FDs can be propagated from T to S based on this path.

5. EXPERIMENTS

In order to evaluate the work presented in this paper, we have developed a bipartite schema mapping (BSM) tool. In practice, this tool can help schema mapping designers in several respects: (1) visualizing propagation graphs for any given schema mapping, (2) assessing the design quality of schema mappings by propagating dependencies between source and target schemas, and (3) facilitating the data cleaning tasks of source instances in accordance with a given schema mapping and desired target constraints. Our BSM tool was written in Python.

We have conducted our experiments over two schema mapping data sets using this BSM tool. The first data set is App. The source and target schemas of App, its source and target constraints, and a schema mapping were described in Example 2.1. In Example 4.1, we have also discussed the propagation graph of the schema mapping described in Example 2.1. Based on this propagation graph, several propagation paths and the corresponding propagated dependencies are presented in Examples 4.2 and 4.4. The second data set is Amalgam, which was taken from the web page of the Clio Project at the University of Toronto.² Amalgam contains four individual database schemas $S_1$, $S_2$, $S_3$ and $S_4$ in the area of bibliographic databases. In the following, we illustrate the main features of the BSM tool based on Amalgam, and treat $S_1$ and $S_2$ as the source and target schemas, respectively.

² miller/amalgam/

Figure 9: A schema mapping $M_{S_1 S_2}$ over Amalgam ($S_1$ is the source schema and $S_2$ is the target schema)

Figure 10: The propagation graph for visualizing the schema mapping $M_{S_1 S_2}$

Table 1: Some statistics about the source schema $S_1$ and the target schema $S_2$ in Amalgam, where MCs refers to mapping constraints.

 | Source schema ($S_1$) | Target schema ($S_2$)
No. of relations | … | …
No. of INDs | 14 | 26
No. of FDs | 23 | 21
No. of MCs | 10 (between $S_1$ and $S_2$) |

Figure 9 presents the main user interface of the BSM tool for specifying the source schema, the target schema and a schema mapping between the source and target schemas of Amalgam. As can be seen from Figure 9, source and target constraints in the form of FDs and INDs can also be specified through the user interface. In our experiments, as described in Table 1, the source constraints contain 14 INDs and 23 FDs, and the target constraints are expected to have 26 INDs and 21 FDs. We manually set up 10 mapping constraints which transform the records in the relation schemas Article, ArticlePublished, TechReport, TechPublished and Author of the source schema (i.e. $S_1$) into the corresponding ones in the relation schemas of the target schema (i.e. $S_2$), including Authors, Allbibs, Titles, CitJournal, Institutions, Publisher, Journal, Volumes, Years, Months, Pages, Numbers, etc., and vice versa. Some of these mapping constraints are presented in Figure 12.

Figure 10 visualizes the propagation graph of the schema mapping $M_{S_1 S_2}$ depicted in Figure 9. Each white node represents a relation schema in the source schema, each black node represents a relation schema in the target schema, exact edges are black and approximate edges are red. From Figure 10, we can see that a schema mapping is often quite complicated in real-world applications, even when the source and target schemas are relatively small. There are many more edges between relation schemas in different schemas than between relation schemas in the same schema. Edges that cross the two schemas indicate the correspondences between the source and target schemas which are specified through $M_{S_1 S_2}$.
To evaluate how effectively the BSM tool can propagate dependencies across two different schemas through a schema mapping, we conducted an experiment on Amalgam to derive all possible INDs over the target schema S2 from the source constraints over S1 and the mapping constraints in M_S1S2. The experimental result shows that there are 20 propagation paths between relations in the target schema (i.e., both the starting and ending vertices are in S2), and correspondingly 16 non-trivial INDs are derived over the target schema S2. Among these 16 non-trivial INDs, 2 INDs are covered by the expected target constraints while the others are not. Figure 11 presents the propagation graph for deriving such target constraints over S2 (we omit the isolated vertices for simplicity).

Figure 11: The propagation graph for deriving target constraints over S2 based on the schema mapping M_S1S2 and the source constraints over S1

6. CONCLUSION

We have presented an approach for propagating dependencies under schema mappings in this paper. Mapping constraints of a schema mapping are permitted to be bipartite TGDs. This enables us to precisely specify the relationship between source and target databases. We have also developed a graphical model to represent the inter-relationships among the attributes of relation schemas and, on this basis, studied the dependency propagation problem in the context of schema mappings. Our solution to this problem enables us to develop a conceptual analysis tool that exploits the semantics of a schema mapping through propagation paths in the corresponding propagation graph. In doing so, the design quality of schema mappings can be assessed before they are actually implemented.
As future work, we will extend our work in two directions. First, we will study the dependency propagation problem in a peer-to-peer data management environment, which would require us to generalize our schema mapping tool to handle the propagation of dependencies across multiple databases. Second, we will conduct experiments to investigate how our tool for propagating dependencies can be incorporated into existing tools for designing schema mappings in order to improve mapping quality. In particular, we are interested in exploring how our approach can be used to reason about and repair the mapping constraints in a schema mapping.

(Article(x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12), ArticlePublished(x1, y), Author(y, z)) → (Allbibs(x1), Authors(x1, z), Titles(x1, x2), CitJournal(x1, z2), Journal(x3, z2), Years(x1, x4), Months(x1, x5), Pages(x1, x6), Volumes(x1, x7), Numbers(x1, x8));

Article(x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12) → (Allbibs(x1), Titles(x1, x2), CitJournal(x1, z2), Journal(x3, z2), Years(x1, x4), Months(x1, x5), Pages(x1, x6), Volumes(x1, x7), Numbers(x1, x8));

ArticlePublished(x, y) → Article(x, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12), Authors(x, z1);

(TechReport(x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12), TechPublished(x1, y), Author(y, z)) → (Allbibs(x1), Authors(x1, z), Titles(x1, x2), Institutions(x1, x3), Years(x1, x4), Months(x1, x5), Pages(x1, x6), Volumes(x1, x7), Numbers(x1, x8));

Authors(x, y) → Author(z, y);

Journal(x, y) → Article(x1, x2, x, x4, x5, x6, x7, x8, x9, x10, x11, x12).

Figure 12: Some mapping constraints used in the experiments over Amalgam
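The last two mapping constraints in Figure 12 each have a single atom on either side, and a constraint of that shape can be read as an inclusion dependency over the positions of the shared variables. The Python sketch below illustrates this reading only; it is not the propagation algorithm of the paper. It assumes that variables appearing only on the right-hand side are existentially quantified and that no variable is repeated within an atom, and the function name `ind_from_tgd` is hypothetical.

```python
def ind_from_tgd(lhs, rhs):
    """Read a single-atom TGD  R(vars) -> S(vars')  as an inclusion dependency.

    lhs and rhs are (relation_name, list_of_variables) pairs.
    Returns (R, positions_in_R, S, positions_in_S) with 1-based positions,
    meaning R[positions_in_R] is included in S[positions_in_S],
    or None when the two atoms share no variable.

    Assumptions (not taken from the paper): right-hand-side-only variables are
    existentially quantified, and no variable occurs twice in the same atom.
    """
    lhs_name, lhs_vars = lhs
    rhs_name, rhs_vars = rhs
    lhs_pos, rhs_pos = [], []
    for i, var in enumerate(lhs_vars, start=1):
        if var in rhs_vars:
            lhs_pos.append(i)
            rhs_pos.append(rhs_vars.index(var) + 1)
    if not lhs_pos:
        return None
    return (lhs_name, lhs_pos, rhs_name, rhs_pos)


# Authors(x, y) -> Author(z, y)
authors_ind = ind_from_tgd(("Authors", ["x", "y"]), ("Author", ["z", "y"]))

# Journal(x, y) -> Article(x1, x2, x, x4, ..., x12)
article_vars = ["x1", "x2", "x", "x4", "x5", "x6",
                "x7", "x8", "x9", "x10", "x11", "x12"]
journal_ind = ind_from_tgd(("Journal", ["x", "y"]), ("Article", article_vars))

print(authors_ind)   # second attribute of Authors included in second attribute of Author
print(journal_ind)   # first attribute of Journal included in third attribute of Article
```

Under these assumptions, the two constraints correspond to the INDs Authors[2] ⊆ Author[2] and Journal[1] ⊆ Article[3], both of which run from the target schema to the source schema.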
