Conjunctive queries. Many computational problems are much easier for conjunctive queries than for general first-order queries.


Conjunctive queries
Relational calculus queries without negation and disjunction.
Conjunctive queries have a normal form:
$(\exists y_1) \cdots (\exists y_n)\,(p_1(x_1,\dots,x_m,y_1,\dots,y_n) \wedge \cdots \wedge p_k(x_1,\dots,x_m,y_1,\dots,y_n))$.
They can also be expressed in Datalog:
$q(x_1,\dots,x_m) \;\text{:-}\; p_1(x_1,\dots,x_m,y_1,\dots,y_n),\dots,p_k(x_1,\dots,x_m,y_1,\dots,y_n)$.
Many computational problems are much easier for conjunctive queries than for general first-order queries.

Query containment
For every query language, a query can be viewed as a mapping from the set of all possible database instances to the set of all possible result instances.
A query $Q_1$ is contained in a query $Q_2$ ($Q_1 \subseteq Q_2$) if for every database instance $D$: $Q_1(D) \subseteq Q_2(D)$.
Query containment is:
- undecidable for arbitrary first-order queries
- decidable (NP-complete) for conjunctive queries without built-in predicates
- decidable ($\Pi_2^p$-complete) for conjunctive queries with built-in predicates
Query equivalence can be determined using containment: $Q_1 \equiv Q_2$ iff $Q_1 \subseteq Q_2 \wedge Q_2 \subseteq Q_1$.

Checking conjunctive query containment
Without built-in predicates. Assume $Q_1 = A_0 \;\text{:-}\; A_1,\dots,A_n$.
Algorithm for checking $Q_1 \subseteq Q_2$:
1. to each goal $A_i$, $i = 1,\dots,n$, in the body of $Q_1$ apply some ground substitution $h$ that maps different variables to different (arbitrary) constants
2. apply $T_{Q_2}$ to the canonical database: the set of facts $\{A_1 h,\dots,A_n h\}$
3. $Q_1 \subseteq Q_2$ iff $A_0 h$ is derived.
This algorithm also works if $Q_2$ is an arbitrary Datalog program (step 2 then has to be repeated as in bottom-up evaluation), but not if $Q_1$ is an arbitrary Datalog program.
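
The three steps above can be sketched in Python for the case where both $Q_1$ and $Q_2$ are single conjunctive queries without built-in predicates. The encoding is an assumption made for this illustration only: an atom is a pair (predicate, arguments), terms starting with an upper-case letter are variables, and the helper names `freeze`, `match`, `derive` and `contained_in` are hypothetical. The frozen constants `c_X` are assumed not to occur in the queries already.

```python
from typing import Dict, List, Optional, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]   # (predicate, arguments)
Query = Tuple[Atom, List[Atom]]      # (head, body)

def is_var(term: str) -> bool:
    # Assumed convention: variables start with an upper-case letter.
    return term[:1].isupper()

def freeze(query: Query) -> Tuple[Atom, Set[Atom]]:
    """Step 1: apply a substitution h mapping distinct variables to distinct constants."""
    head, body = query
    h: Dict[str, str] = {}
    def apply(atom: Atom) -> Atom:
        pred, args = atom
        return pred, tuple(h.setdefault(a, "c_" + a) if is_var(a) else a for a in args)
    return apply(head), {apply(a) for a in body}

def match(atom: Atom, fact: Atom, subst: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Extend subst so that atom becomes fact, or return None if impossible."""
    (pred, args), (fpred, fargs) = atom, fact
    if pred != fpred or len(args) != len(fargs):
        return None
    s = dict(subst)
    for a, c in zip(args, fargs):
        if is_var(a):
            if s.setdefault(a, c) != c:
                return None
        elif a != c:
            return None
    return s

def derive(query: Query, facts: Set[Atom]) -> Set[Atom]:
    """Step 2: one application of T_Q to a set of facts (Q assumed safe)."""
    head, body = query
    substs: List[Dict[str, str]] = [{}]
    for atom in body:
        substs = [s2 for s in substs for f in facts
                  if (s2 := match(atom, f, s)) is not None]
    pred, args = head
    return {(pred, tuple(s.get(a, a) for a in args)) for s in substs}

def contained_in(q1: Query, q2: Query) -> bool:
    """Step 3: Q1 is contained in Q2 iff the frozen head of Q1 is derived."""
    frozen_head, canonical_db = freeze(q1)
    return frozen_head in derive(q2, canonical_db)

# q(X) :- e(X, Y), e(Y, X)   and   q(X) :- e(X, Y)
q1: Query = (("q", ("X",)), [("e", ("X", "Y")), ("e", ("Y", "X"))])
q2: Query = (("q", ("X",)), [("e", ("X", "Y"))])
print(contained_in(q1, q2), contained_in(q2, q1))
```

The sketch applies $T_{Q_2}$ once, which suffices for a single conjunctive query; for a Datalog program $Q_2$ the derivation step would have to be iterated to a fixpoint.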

Queries with built-in predicates
Arithmetic comparison predicates.
Multiple canonical databases:
- all possible orderings for variables
- instantiate variables to integers
For every canonical database $D$ that makes the entire body of $Q_1$ true, $T_{Q_2}$ needs to derive the corresponding head of $Q_1$.
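
To illustrate "all possible orderings", the following Python sketch enumerates one integer assignment per ordering of the variables, ties included. The function name `orderings` is hypothetical, and the brute-force enumeration is chosen for brevity, not efficiency. Each assignment would then be used to build one canonical database.

```python
from itertools import product
from typing import Dict, Iterator, List

def orderings(variables: List[str]) -> Iterator[Dict[str, int]]:
    """Yield one integer assignment per ordering (with ties) of the variables."""
    n = len(variables)
    for ranks in product(range(n), repeat=n):
        # Keep only rank vectors that use 0..k without gaps, so that each
        # ordering is represented by exactly one assignment.
        if set(ranks) == set(range(max(ranks, default=-1) + 1)):
            yield dict(zip(variables, ranks))

for assignment in orderings(["X", "Y"]):
    print(assignment)
```

For two variables this yields three assignments, corresponding to X = Y, X < Y and X > Y.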

Integrity constraints
A query $Q_1$ is contained in a query $Q_2$ under a set of integrity constraints $F$ ($Q_1 \subseteq_F Q_2$) if for every database instance $D$ satisfying $F$: $Q_1(D) \subseteq Q_2(D)$.
Theorem: $Q_1 \subseteq_F Q_2$ iff $\mathrm{chase}_F(Q_1) \subseteq \mathrm{chase}_F(Q_2)$.

Chase
A versatile tool, useful also for checking query containment under integrity constraints.
Chase: apply chase steps to the body of a conjunctive query $Q$ until no changes occur.
Chase steps depend on the kind of constraints used. For functional dependencies:
FD-step using $X \to Y$ over $P$: if there are two $P$-goals in the body of $Q$ that agree on $X$-attributes, apply to $Q$ a substitution that will make them agree on $Y$-attributes (preferring the variables in the head).
Chase terminates for FDs.
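
A minimal Python sketch of the FD chase follows, assuming that all terms in the query are variables (constants, and the unsatisfiable case of two distinct constants, are not handled). The encoding of an FD as (predicate, X positions, Y positions) and the names `fd_step`, `substitute` and `chase` are hypothetical choices for this illustration.

```python
from typing import List, Optional, Tuple

Atom = Tuple[str, Tuple[str, ...]]        # (predicate, arguments)
FD = Tuple[str, List[int], List[int]]     # (predicate, X positions, Y positions)

def fd_step(head: Atom, body: List[Atom], fds: List[FD]) -> Optional[Tuple[str, str]]:
    """Find one applicable FD-step; return (variable to drop, variable to keep)."""
    head_vars = set(head[1])
    for pred, xs, ys in fds:
        goals = [a for a in body if a[0] == pred]
        for g1 in goals:
            for g2 in goals:
                if all(g1[1][i] == g2[1][i] for i in xs):
                    for j in ys:
                        s, t = g1[1][j], g2[1][j]
                        if s != t:
                            # Prefer keeping a variable that occurs in the head.
                            return (s, t) if t in head_vars else (t, s)
    return None

def substitute(atom: Atom, drop: str, keep: str) -> Atom:
    return atom[0], tuple(keep if a == drop else a for a in atom[1])

def chase(head: Atom, body: List[Atom], fds: List[FD]) -> Tuple[Atom, List[Atom]]:
    """Apply FD-steps until no changes occur."""
    while (step := fd_step(head, body, fds)) is not None:
        drop, keep = step
        head = substitute(head, drop, keep)
        # dict.fromkeys removes duplicate goals while preserving their order.
        body = list(dict.fromkeys(substitute(a, drop, keep) for a in body))
    return head, body

# q(X, Z) :- emp(X, Y), emp(X, Z), dept(Y, W), with the FD emp: 1st -> 2nd argument
head = ("q", ("X", "Z"))
body = [("emp", ("X", "Y")), ("emp", ("X", "Z")), ("dept", ("Y", "W"))]
print(chase(head, body, [("emp", [0], [1])]))
```

Each step removes one variable from the query, which is why the loop terminates.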

Views in data integration
Basic model:
- many independent data sources containing all the data
- wrappers of data sources provide a single data model
- a single integrated database (virtual)
- relationships between the content of the sources and that of the integrated database: local-as-view, global-as-view
- queries asked against the integrated database

Local-as-view
Each data source is viewed as a goal $G_s$ and is defined using a query $Q_g$ over the integrated database.
Notation: $S$ is an instance of the source, $D$ is a (virtual) instance of the integrated database.
Source annotations:
- sound: $S \subseteq Q_g(D)$ (the most common),
- complete: $Q_g(D) \subseteq S$,
- exact: $Q_g(D) = S$.
There may be more than one instance $D$ satisfying the annotations, for a given $S$.

Global-as-view
Each relation over the integrated database is defined using a goal $G_g$ as a view $Q_s$ over the data sources.
Notation: $S$ consists of instances of all the sources, $D$ is a (virtual) instance of the integrated database.
Source annotations:
- sound: $Q_s(S) \subseteq G_g(D)$,
- complete: $G_g(D) \subseteq Q_s(S)$,
- exact: $G_g(D) = Q_s(S)$ (the most common).
Under the exact annotations, there is only one satisfying instance $D$, for a given $S$.

Comparison
Local-as-view:
- query evaluation amounts to rewriting the query in terms of views
- cannot be composed
- scalable
Global-as-view:
- query evaluation amounts to view materialization (for exact annotations)
- can be composed (the integrated database may be viewed as a data source)
- not scalable (adding sources requires the redefinition of the integrated database)

Query evaluation
Semantics: a tuple $t$ is a certain answer to a query $Q$ given some source instances if it is in the answer to this query over every instance of the integrated database that satisfies all the source annotations.
Computing certain answers is in most cases computationally hard.

Inverse rules
An approach to query rewriting, used in Infomaster.
Data source: a conjunctive query.
Query: a set of Datalog rules.
Rewriting produces a set of nonrecursive Datalog rules with function symbols:
- EDB predicates: source relations
- IDB predicates: database relations
Function symbols can be eliminated.

Query rewriting:
1. for every source rule $A \;\text{:-}\; B_1,\dots,B_n$, produce $n$ inverse rules $B'_1 \;\text{:-}\; A,\ \dots,\ B'_n \;\text{:-}\; A$
2. $B'_i$ is like $B_i$, except that each variable that occurs only in the body of the source rule is replaced by the (Skolem) term $f(X_1,\dots,X_n)$ where:
   - $f$ is a unique function symbol
   - $X_1,\dots,X_n$ are all the variables in the head of the source rule
3. all the occurrences of the same variable are replaced by the same term
Query evaluation:
- the query rule and the inverse rules are evaluated bottom-up
- the evaluation terminates
- only the substitutions that do not contain Skolem terms are returned to the user
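
The rewriting steps can be sketched in Python as follows, assuming that all arguments in the source rule are variables. Skolem terms are represented as plain strings, and the naming scheme `f_<source>_<variable>` is a hypothetical way of making each function symbol unique; the function name `inverse_rules` is likewise an assumption of this sketch.

```python
from typing import List, Tuple

Atom = Tuple[str, Tuple[str, ...]]   # (predicate, arguments)

def inverse_rules(head: Atom, body: List[Atom]) -> List[Tuple[Atom, Atom]]:
    """Return one inverse rule (new head, new body atom) per goal of the source rule."""
    source, head_args = head
    head_vars = list(dict.fromkeys(head_args))   # head variables, duplicates removed

    def skolemize(var: str) -> str:
        # Head variables stay as they are; a body-only variable becomes a Skolem
        # term over all head variables, the same term for every occurrence.
        if var in head_vars:
            return var
        return f"f_{source}_{var}({', '.join(head_vars)})"

    return [((pred, tuple(skolemize(a) for a in args)), head) for pred, args in body]

# Source rule: v(X, Z) :- e(X, Y), e(Y, Z)
for new_head, new_body in inverse_rules(("v", ("X", "Z")),
                                        [("e", ("X", "Y")), ("e", ("Y", "Z"))]):
    print(new_head, ":-", new_body)
```

For the source rule in the example this produces e(X, f_v_Y(X, Z)) :- v(X, Z) and e(f_v_Y(X, Z), Z) :- v(X, Z).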

Bucket algorithm
An approach to query rewriting, used in Information Manifold.
Data source: a view defined as a conjunctive query.
Query: a conjunctive query $Q = D_0 \;\text{:-}\; D_1,\dots,D_n$.
Rewriting of the query $Q$ produces a set $S$, initially empty, of conjunctive queries.

Buckets:
1. create a bucket for every goal $D_1,\dots,D_n$ in $Q$
2. if $A \;\text{:-}\; B_1,\dots,B_n$ is a data source definition such that, for some $j$,
   - $B_j$ unifies with $D_i$, and
   - whenever $D_i$ has a head variable in some argument, $B_j$ also has a head variable in the same argument,
   then $A\sigma$ (with new variables substituted for those variables that do not occur in $B_j$) is added to the bucket of $D_i$, where $\sigma$ is a most general unifier of $B_j$ and $D_i$ preferring the variables in the query
Query rewriting:
- for each candidate rewriting $Q'$ that is a conjunction of one subgoal from each bucket: if the expansion of $Q'$ is contained in $Q$, then add $Q'$ to $S$
- the final rewriting is the union of the queries in $S$.
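
The bucket-creation phase (steps 1 and 2 only) can be sketched in Python under strong simplifying assumptions: all terms are variables, and full unification is replaced by a one-way match that maps view variables to query variables. This is more restrictive than a most general unifier when a view goal repeats a variable. The containment test of the rewriting phase is not shown. The function name `build_buckets` and the fresh-variable names `_N0`, `_N1`, ... are hypothetical.

```python
from itertools import count
from typing import Dict, List, Tuple

Atom = Tuple[str, Tuple[str, ...]]   # (predicate, arguments)
Query = Tuple[Atom, List[Atom]]      # (head, body)

def build_buckets(query: Query, views: List[Query]) -> List[List[Atom]]:
    """Return one bucket of view atoms per goal in the body of the query."""
    (_, q_head_vars), q_body = query
    fresh = count()
    buckets: List[List[Atom]] = []
    for d_pred, d_args in q_body:
        bucket: List[Atom] = []
        for (v_name, v_head_vars), v_body in views:
            for b_pred, b_args in v_body:
                if b_pred != d_pred or len(b_args) != len(d_args):
                    continue
                sigma: Dict[str, str] = {}
                ok = True
                for b, d in zip(b_args, d_args):
                    if sigma.setdefault(b, d) != d:
                        ok = False   # view variable would need two different images
                    if d in q_head_vars and b not in v_head_vars:
                        ok = False   # a head variable of the query would be lost
                if not ok:
                    continue
                for v in v_head_vars:
                    if v not in sigma:
                        sigma[v] = f"_N{next(fresh)}"   # new variable
                bucket.append((v_name, tuple(sigma[v] for v in v_head_vars)))
        buckets.append(bucket)
    return buckets

# Query: q(X, Z) :- e(X, Y), e(Y, Z)
query: Query = (("q", ("X", "Z")), [("e", ("X", "Y")), ("e", ("Y", "Z"))])
views: List[Query] = [
    (("v1", ("A", "B")), [("e", ("A", "B"))]),
    (("v2", ("A",)), [("e", ("A", "B")), ("e", ("B", "A"))]),
    (("v3", ("A", "C")), [("e", ("A", "B")), ("r", ("B", "C"))]),
]
for bucket in build_buckets(query, views):
    print(bucket)
```

Candidate rewritings would then be formed by picking one atom from each bucket and keeping those whose expansion is contained in the query.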

Constraints in the query and the views
To prevent irrelevant goals from being placed in the buckets: if the constraints in the data source definition together with the constraints in the query are unsatisfiable after applying $\sigma$ limited to the head variables of the view, then the source is useless for the query.
To help pass the containment test: constraints may be added to a query rewriting.

Properties
A rewriting equivalent to the original query does not always exist:
- data sources are insufficient
- data sources are incomplete.
If the query does not contain any constraints, and there are only sound annotations, then the inverse rules and bucket algorithms are guaranteed to produce the maximally contained rewriting with respect to the given sources and the given query language.
The rewriting computes exactly the certain answers.

Source limitations
Sometimes data sources impose limitations on access paths to the data: some arguments have to be provided as inputs.
This is represented by adorning each data source with a string of b and f:
- each argument is labelled with a b or an f
- b stands for mandatory input
- f stands for input or output
To obtain a maximally-contained rewriting sometimes a recursive Datalog program is necessary.
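
As a small illustration of adornments, the Python sketch below checks whether a sequence of source accesses can be executed left to right: every argument labelled b must be a constant or a variable bound by an earlier access. The source names `award` and `cites`, the upper-case convention for variables, and the function name `executable` are assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]   # (source name, arguments)

def is_var(term: str) -> bool:
    # Assumed convention: variables start with an upper-case letter.
    return term[:1].isupper()

def executable(plan: List[Atom], adornments: Dict[str, str]) -> bool:
    """Check that each 'b' argument is bound when its source is accessed."""
    bound: Set[str] = set()
    for source, args in plan:
        for arg, mode in zip(args, adornments[source]):
            if mode == "b" and is_var(arg) and arg not in bound:
                return False
        # After the access, every variable of the goal is bound.
        bound.update(a for a in args if is_var(a))
    return True

adornments = {"award": "f", "cites": "bf"}   # cites needs its first argument as input
print(executable([("award", ("P",)), ("cites", ("P", "Q"))], adornments))
print(executable([("cites", ("P", "Q")), ("award", ("P",))], adornments))
```

The first plan is executable because `award` binds P before `cites` needs it; the second is not.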