Conjunctive queries. Many computational problems are much easier for conjunctive queries than for general first-order queries.

Conjunctive queries
Relational calculus queries without negation and disjunction.
Conjunctive queries have a normal form: $(\exists y_1) \cdots (\exists y_n)\,(p_1(x_1, \ldots, x_m, y_1, \ldots, y_n) \wedge \cdots \wedge p_k(x_1, \ldots, x_m, y_1, \ldots, y_n))$.
They can also be expressed in Datalog: $q(x_1, \ldots, x_m) \mathbin{:-} p_1(x_1, \ldots, x_m, y_1, \ldots, y_n), \ldots, p_k(x_1, \ldots, x_m, y_1, \ldots, y_n)$.
Many computational problems are much easier for conjunctive queries than for general first-order queries.
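To make the Datalog form concrete, here is a minimal sketch of evaluating one conjunctive query over a set of ground facts. The representation is an assumption made for illustration only: an atom is a pair `(predicate, args)`, variables are strings starting with an uppercase letter, and everything else is a constant. The names `is_var`, `evaluate`, `db` and `grandparent` are hypothetical.

```python
def is_var(term):
    """Assumed convention: variables are strings starting with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


def evaluate(query, db):
    """Return the answers of the conjunctive query (head, body) over the facts in db."""
    (_, head_args), body = query
    substs = [{}]  # partial variable assignments satisfying the goals seen so far
    for pred, args in body:
        extended = []
        for s in substs:
            for fact_pred, values in db:
                if fact_pred != pred or len(values) != len(args):
                    continue
                s2 = dict(s)
                # A variable must keep its earlier binding; a constant must match.
                if all(s2.setdefault(a, v) == v if is_var(a) else a == v
                       for a, v in zip(args, values)):
                    extended.append(s2)
        substs = extended
    return {tuple(s.get(a, a) for a in head_args) for s in substs}


# Hypothetical example: q(X) :- parent(X, Y), parent(Y, Z).
db = {("parent", ("ann", "bob")), ("parent", ("bob", "cy"))}
grandparent = (("q", ("X",)), [("parent", ("X", "Y")), ("parent", ("Y", "Z"))])
result = evaluate(grandparent, db)
```

The existential variables Y and Z are bound during matching but projected away when the head tuple is built, which mirrors the quantifier prefix of the normal form.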

Query containment
For every query language, a query can be viewed as a mapping from the set of all possible database instances to the set of all possible result instances.
A query $Q_1$ is contained in a query $Q_2$ ($Q_1 \subseteq Q_2$) if for every database instance $D$: $Q_1(D) \subseteq Q_2(D)$.
Query containment is:
- undecidable for arbitrary first-order queries
- decidable (NP-complete) for conjunctive queries without built-in predicates
- decidable ($\Pi_2^p$-complete) for conjunctive queries with built-in predicates
Query equivalence can be determined using containment: $Q_1 \equiv Q_2$ iff $Q_1 \subseteq Q_2$ and $Q_2 \subseteq Q_1$.

Checking conjunctive query containment
Without built-in predicates. Assume $Q_1 = A_0 \mathbin{:-} A_1, \ldots, A_n$.
Algorithm for checking $Q_1 \subseteq Q_2$:
1. to each goal $A_i$, $i = 1, \ldots, n$, in the body of $Q_1$ apply some ground substitution $h$ that maps different variables to different (arbitrary) constants
2. apply $T_{Q_2}$ (the immediate consequence operator of $Q_2$) to the canonical database: the set of facts $\{A_1 h, \ldots, A_n h\}$
3. $Q_1 \subseteq Q_2$ iff $A_0 h$ is derived.
This algorithm also works if $Q_2$ is an arbitrary Datalog program (step 2 then has to be repeated, as in bottom-up evaluation), but not if $Q_1$ is an arbitrary Datalog program.
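The three steps of the containment test can be sketched as follows for the case where both queries are single conjunctive rules without built-in predicates. The encoding is assumed for illustration: atoms are `(predicate, args)` pairs, variables are uppercase-initial strings, and the frozen constants are tagged tuples so that they cannot collide with constants already in the queries. All function names are hypothetical.

```python
def is_var(term):
    """Assumed convention: variables are strings starting with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


def answers(query, db):
    """One application of the rule `query` to the ground facts in `db`."""
    (_, head_args), body = query
    substs = [{}]
    for pred, args in body:
        extended = []
        for s in substs:
            for fact_pred, values in db:
                if fact_pred != pred or len(values) != len(args):
                    continue
                s2 = dict(s)
                if all(s2.setdefault(a, v) == v if is_var(a) else a == v
                       for a, v in zip(args, values)):
                    extended.append(s2)
        substs = extended
    return {tuple(s.get(a, a) for a in head_args) for s in substs}


def contained_in(q1, q2):
    """Check Q1 <= Q2 using the canonical database of Q1."""
    head1, body1 = q1
    # Step 1: ground substitution h, a distinct fresh constant per variable.
    h = {}
    for _, args in [head1] + list(body1):
        for a in args:
            if is_var(a):
                h.setdefault(a, ("const", a))
    freeze = lambda atom: (atom[0], tuple(h.get(a, a) for a in atom[1]))
    # Step 2: evaluate Q2 over the canonical database {A1 h, ..., An h}.
    canonical_db = {freeze(atom) for atom in body1}
    # Step 3: Q1 <= Q2 iff the frozen head A0 h is derived.
    return freeze(head1)[1] in answers(q2, canonical_db)


# Hypothetical example queries.
q1 = (("q", ("X",)), [("e", ("X", "Y")), ("e", ("Y", "X"))])
q2 = (("q", ("X",)), [("e", ("X", "Y"))])
```

With these two queries, `contained_in(q1, q2)` holds while `contained_in(q2, q1)` does not, so the queries are not equivalent. The sketch compares only head argument tuples and assumes both heads use the same predicate.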

Queries with built-in predicates
Arithmetic comparison predicates.
Multiple canonical databases:
- consider all possible orderings of the variables
- instantiate the variables to integers consistent with each ordering
For every canonical database $D$ that makes the entire body of $Q_1$ true, $T_{Q_2}$ needs to derive the corresponding head of $Q_1$.
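A rough sketch of the multiple-canonical-database test, under several simplifying assumptions that are not from the slides: queries contain no constants, a query is a triple `(head, atoms, comparisons)`, and every ordering of the n variables of $Q_1$ (ties included) is produced by assigning each variable an integer rank in 0..n-1. This enumeration is redundant but covers every ordering. All names are hypothetical.

```python
import operator
from itertools import product

OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
       "!=": operator.ne, ">": operator.gt, ">=": operator.ge}


def is_var(term):
    """Assumed convention: variables are strings starting with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


def answers(query, db):
    """Evaluate head :- atoms, comparisons over the ground facts in db."""
    (_, head_args), atoms, comps = query
    substs = [{}]
    for pred, args in atoms:
        extended = []
        for s in substs:
            for fact_pred, values in db:
                if fact_pred != pred or len(values) != len(args):
                    continue
                s2 = dict(s)
                if all(s2.setdefault(a, v) == v if is_var(a) else a == v
                       for a, v in zip(args, values)):
                    extended.append(s2)
        substs = extended
    val = lambda s, t: s[t] if is_var(t) else t
    return {tuple(val(s, a) for a in head_args) for s in substs
            if all(OPS[op](val(s, l), val(s, r)) for l, op, r in comps)}


def contained_in(q1, q2):
    """Check Q1 <= Q2 by testing one canonical database per variable ordering."""
    head1, atoms1, comps1 = q1
    variables = sorted({a for _, args in [head1] + list(atoms1)
                        for a in args if is_var(a)})
    for ranks in product(range(len(variables)), repeat=len(variables)):
        h = dict(zip(variables, ranks))
        val = lambda t: h[t] if is_var(t) else t
        if not all(OPS[op](val(l), val(r)) for l, op, r in comps1):
            continue  # this instantiation does not make the body of Q1 true
        db = {(p, tuple(val(a) for a in args)) for p, args in atoms1}
        if tuple(val(a) for a in head1[1]) not in answers(q2, db):
            return False
    return True


# Hypothetical example: panic :- r(U, V), r(V, U)  versus  panic :- r(U, V), U <= V.
q1 = (("panic", ()), [("r", ("U", "V")), ("r", ("V", "U"))], [])
q2 = (("panic", ()), [("r", ("U", "V"))], [("U", "<=", "V")])
```

In this example a single canonical database with unordered constants would not be enough, because the comparison in `q2` can only be evaluated once U and V are given concrete ordered values. Checking every ordering shows that `q1` is contained in `q2` but not the other way round.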

Integrity constraints
A query $Q_1$ is contained in a query $Q_2$ under a set of integrity constraints $F$ ($Q_1 \subseteq_F Q_2$) if for every database instance $D$ satisfying $F$: $Q_1(D) \subseteq Q_2(D)$.
Theorem: $Q_1 \subseteq_F Q_2$ iff $\mathit{chase}_F(Q_1) \subseteq \mathit{chase}_F(Q_2)$.

Chase
A versatile tool, useful also for checking query containment under integrity constraints.
Chase: apply chase steps to the body of a conjunctive query $Q$ until no changes occur. Chase steps depend on the kind of constraints used.
For functional dependencies: an FD-step using $X \to Y$ over $P$: if there are two $P$-goals in the body of $Q$ that agree on the $X$-attributes, apply to $Q$ a substitution that will make them agree on the $Y$-attributes (preferring the variables in the head).
Chase terminates for FDs.
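The FD-step can be sketched as below. Assumptions made for illustration: an FD is a triple `(predicate, X_positions, Y_positions)` with zero-based argument positions, variables are uppercase-initial strings, and a query whose chase would have to equate two distinct constants is reported as unsatisfiable by returning `None`. The handling of constants and all names are choices made for this sketch and are not stated on the slide.

```python
CLASH = object()  # sentinel: two distinct constants would have to be equated


def is_var(term):
    """Assumed convention: variables are strings starting with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


def find_step(head, body, fds):
    """Return (old, new) for one applicable FD-step, CLASH, or None if none applies."""
    head_vars = {t for t in head[1] if is_var(t)}
    for pred, xs, ys in fds:
        goals = [args for p, args in body if p == pred]
        for a in goals:
            for b in goals:
                if any(a[i] != b[i] for i in xs):
                    continue  # the two goals do not agree on X
                for j in ys:
                    s, t = a[j], b[j]
                    if s == t:
                        continue
                    if not is_var(s) and not is_var(t):
                        return CLASH
                    # Keep constants first, then head variables.
                    if not is_var(s) or (is_var(t) and s in head_vars):
                        return t, s
                    return s, t
    return None


def chase(head, body, fds):
    """Apply FD-steps until no change occurs; each step removes one variable."""
    body = list(body)
    while True:
        step = find_step(head, body, fds)
        if step is None:
            return head, body
        if step is CLASH:
            return None
        old, new = step
        rename = lambda atom: (atom[0], tuple(new if t == old else t for t in atom[1]))
        head = rename(head)
        body = list(dict.fromkeys(rename(atom) for atom in body))  # drop duplicate goals


# Hypothetical example: in emp(Name, Dept), Name -> Dept.
fds = [("emp", [0], [1])]
head = ("q", ("N", "D1"))
body = [("emp", ("N", "D1")), ("emp", ("N", "D2")), ("dept", ("D2", "M"))]
chased = chase(head, body, fds)
```

Here the two `emp` goals agree on the name, so D2 is replaced by the head variable D1 and the duplicate goal disappears. Termination follows because every step strictly reduces the number of distinct variables.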

Views in data integration
Basic model:
- many independent data sources containing all the data
- wrappers of data sources provide a single data model
- a single integrated database (virtual)
- relationships between the content of the sources and that of the integrated database: local-as-view or global-as-view
- queries asked against the integrated database

Local-as-view
Each data source is viewed as a goal $G_s$ and is defined using a query $Q_g$ over the integrated database.
Notation: $S$ is an instance of the source, $D$ is a (virtual) instance of the integrated database.
Source annotations:
- sound: $S \subseteq Q_g(D)$ (the most common)
- complete: $Q_g(D) \subseteq S$
- exact: $Q_g(D) = S$
There may be more than one instance $D$ satisfying the annotations, for a given $S$.

Global-as-view
Each relation over the integrated database is defined using a goal $G_g$ as a view $Q_s$ over the data sources.
Notation: $S$ consists of instances of all the sources, $D$ is a (virtual) instance of the integrated database.
Source annotations:
- sound: $Q_s(S) \subseteq G_g(D)$
- complete: $G_g(D) \subseteq Q_s(S)$
- exact: $G_g(D) = Q_s(S)$ (the most common)
Under the exact annotations, there is only one satisfying instance $D$, for a given $S$.

Comparison
Local-as-view:
- query evaluation: rewriting in terms of views
- cannot be composed
- scalable
Global-as-view:
- query evaluation: view materialization (for exact annotations)
- can be composed (the integrated database may be viewed as a data source)
- not scalable (adding sources requires the redefinition of the integrated database)

Query evaluation
Semantics: a tuple $t$ is a certain answer to a query $Q$, given some source instances, if it is in the answer to this query over every instance of the integrated database that satisfies all the source annotations.
Computing certain answers is in most cases computationally hard.

Inverse rules
An approach to query rewriting, used in Infomaster.
Data source: a conjunctive query. Query: a set of Datalog rules.
Rewriting produces a set of nonrecursive Datalog rules with function symbols:
- EDB predicates: source relations
- IDB predicates: database relations
Function symbols can be eliminated.

Query rewriting:
1. for every source rule $A \mathbin{:-} B_1, \ldots, B_n$, produce $n$ inverse rules $B_1' \mathbin{:-} A, \ldots, B_n' \mathbin{:-} A$
2. $B_i'$ is like $B_i$, except that each variable that occurs only in the body of the source rule is replaced by the (Skolem) term $f(X_1, \ldots, X_m)$, where:
   - $f$ is a unique function symbol
   - $X_1, \ldots, X_m$ are all the variables in the head of the source rule
3. all the occurrences of the same variable are replaced by the same term
Query evaluation:
- the query rule and the inverse rules are evaluated bottom-up
- the evaluation terminates
- only the substitutions that do not contain Skolem terms are returned to the user
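The rule-inversion step can be sketched as below. The encoding is assumed for illustration: atoms are `(predicate, args)` pairs, variables are uppercase-initial strings, and a Skolem term is represented as a tuple `(function_name, head_variables)`. The naming scheme `f_<view>_<variable>` is a hypothetical way of making each function symbol unique.

```python
def is_var(term):
    """Assumed convention: variables are strings starting with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


def inverse_rules(head, body):
    """For a source rule head :- body, return the inverse rules as (Bi', head) pairs."""
    head_vars = tuple(a for a in dict.fromkeys(head[1]) if is_var(a))

    def skolem(var):
        # The same body-only variable always maps to the same Skolem term.
        return ("f_" + head[0] + "_" + var, head_vars)

    rules = []
    for pred, args in body:
        new_args = tuple(a if (not is_var(a) or a in head_vars) else skolem(a)
                         for a in args)
        rules.append(((pred, new_args), head))
    return rules


# Hypothetical source definition: v(X, Z) :- edge(X, Y), edge(Y, Z).
rules = inverse_rules(("v", ("X", "Z")),
                      [("edge", ("X", "Y")), ("edge", ("Y", "Z"))])
```

For this source the sketch yields edge(X, f(X, Z)) :- v(X, Z) and edge(f(X, Z), Z) :- v(X, Z), where f stands for the generated symbol `f_v_Y`. Both rules share the same Skolem term for Y, which is what lets a bottom-up evaluation join the two edges again.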

Bucket algorithm
An approach to query rewriting, used in Information Manifold.
Data source: a view defined as a conjunctive query.
Query: a conjunctive query $Q$: $D_0 \mathbin{:-} D_1, \ldots, D_n$.
Rewriting of the query $Q$ produces a set $S$, initially empty, of conjunctive queries.

Buckets:
1. create a bucket for every goal $D_1, \ldots, D_n$ in $Q$
2. if $A \mathbin{:-} B_1, \ldots, B_m$ is a data source definition such that, for some $j$, $B_j$ unifies with $D_i$ and, whenever $D_i$ has a head variable in some argument, $B_j$ also has a head variable in the same argument, then $A\sigma$ (with new variables substituted for those variables that do not occur in $B_j$) is added to the bucket of $D_i$, where $\sigma$ is a most general unifier of $B_j$ and $D_i$ preferring the variables in the query
Query rewriting:
- for each candidate rewriting $Q'$ which is a conjunction of one subgoal from each bucket: if the expansion of $Q'$ is contained in $Q$, then add $Q'$ to $S$
- the final rewriting is the union of the queries in $S$.
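The bucket-filling step alone (not the later combination and containment test) can be sketched as follows. This is a deliberately simplified illustration: atoms contain only variables, and a view goal that repeats a variable where the query goal has two different variables is rejected rather than unified, which is stricter than a real most general unifier. The function name and the `_N` prefix for fresh variables are hypothetical.

```python
from itertools import count


def fill_buckets(query, views):
    """Return one bucket (list of view atoms) per goal in the body of the query."""
    (_, q_head_args), q_body = query
    fresh = count(1)
    buckets = []
    for d_pred, d_args in q_body:
        bucket = []
        for (v_name, v_head_args), v_body in views:
            for b_pred, b_args in v_body:
                if b_pred != d_pred or len(b_args) != len(d_args):
                    continue
                sigma, ok = {}, True  # sigma maps view variables to query variables
                for b, d in zip(b_args, d_args):
                    if sigma.setdefault(b, d) != d:
                        ok = False  # simplification: no equating of query variables
                    if d in q_head_args and b not in v_head_args:
                        ok = False  # a query head variable would not be exported
                if not ok:
                    continue
                # View head variables not occurring in B_j receive new variables.
                entry = (v_name, tuple(sigma.setdefault(a, f"_N{next(fresh)}")
                                       for a in v_head_args))
                if entry not in bucket:
                    bucket.append(entry)
        buckets.append(bucket)
    return buckets


# Hypothetical query and views.
query = (("q", ("X",)), [("cites", ("X", "Y")), ("same_topic", ("X", "Y"))])
views = [
    (("v1", ("A", "B")), [("cites", ("A", "B"))]),
    (("v2", ("C",)), [("cites", ("C", "D")), ("cites", ("D", "C"))]),
    (("v3", ("E", "F")), [("same_topic", ("E", "F"))]),
]
buckets = fill_buckets(query, views)
```

In this example the second `cites` goal of `v2` is rejected for the first bucket, because it would place the query's head variable X on the view variable D, which `v2` does not export. Each combination of one entry per bucket would still have to pass the containment test before being added to $S$.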

Constraints in the query and the views
To prevent irrelevant goals from being placed in the buckets: if the constraints in the data source definition together with the constraints in the query are unsatisfiable after applying $\sigma$ limited to the head variables of the view, then the source is useless for the query.
To help pass the containment test: constraints may be added to a query rewriting.

Properties
A rewriting equivalent to the original query does not always exist:
- data sources are insufficient
- data sources are incomplete.
If the query does not contain any constraints, and there are only sound annotations, then the inverse rules and bucket algorithms are guaranteed to produce the maximally contained rewriting (with respect to the given sources and the given query language). The rewriting computes exactly the certain answers.

Source limitations
Sometimes data sources impose limitations on access paths to the data: some arguments have to be provided as inputs.
This is represented by adorning each data source with a string of b's and f's:
- each argument is labelled with a b or an f
- b stands for mandatory input
- f stands for input or output
To obtain a maximally contained rewriting, a recursive Datalog program is sometimes necessary.
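As an illustration of how adornments restrict evaluation, the sketch below looks for an order in which the source goals of a rewriting can be called so that every b argument is already bound. The encoding is assumed: `adornments` maps a source name to its b/f string, variables are uppercase-initial strings, and constants count as bound. The function name and the example sources are hypothetical.

```python
def is_var(term):
    """Assumed convention: variables are strings starting with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


def executable_order(goals, adornments, bound=()):
    """Return an executable ordering of the goals, or None if none exists."""
    bound = set(bound)
    remaining = list(goals)
    order = []
    while remaining:
        for goal in remaining:
            source, args = goal
            # Every argument adorned with b must be a constant or an already bound variable.
            if all(not is_var(a) or a in bound
                   for a, mode in zip(args, adornments[source]) if mode == "b"):
                order.append(goal)
                bound.update(a for a in args if is_var(a))
                remaining.remove(goal)
                break
        else:
            return None  # no remaining goal can be called with the current bindings
    return order


# Hypothetical sources: cites needs its first argument as input, award has no restriction.
adornments = {"cites": "bf", "award": "f"}
goals = [("cites", ("P", "Q")), ("award", ("P",))]
order = executable_order(goals, adornments)
```

A greedy choice is sufficient here because calling a goal only ever adds bindings, so a goal that is callable now stays callable later. The sketch only orders the goals of one conjunctive rewriting; it does not construct the recursive program that may be needed for a maximally contained rewriting.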