Database Theory: Datalog, Views

Similar documents
Database Theory: Beyond FO

Query Minimization. CSE 544: Lecture 11 Theory. Query Minimization In Practice. Query Minimization. Query Minimization for Views

Logical Query Languages. Motivation: 1. Logical rules extend more naturally to. recursive queries than does relational algebra. Used in SQL recursion.

Recap and Schedule. Till Now: Today: Ø Query Languages: Relational Algebra; SQL. Ø Datalog A logical query language. Ø SQL Recursion.

Logic As a Query Language. Datalog. A Logical Rule. Anatomy of a Rule. sub-goals Are Atoms. Anatomy of a Rule

Chapter 10 Advanced topics in relational databases

I Relational Database Modeling how to define

I Relational Database Modeling how to define

Logical Query Languages. Motivation: 1. Logical rules extend more naturally to. recursive queries than does relational algebra.

Datalog. Susan B. Davidson. CIS 700: Advanced Topics in Databases MW 1:30-3 Towne 309

Plan of the lecture. G53RDB: Theory of Relational Databases Lecture 14. Example. Datalog syntax: rules. Datalog query. Meaning of Datalog rules

Conjunctive queries. Many computational problems are much easier for conjunctive queries than for general first-order queries.

Lecture 1: Conjunctive Queries

DATABASE THEORY. Lecture 11: Introduction to Datalog. TU Dresden, 12th June Markus Krötzsch Knowledge-Based Systems

Data Warehousing and Decision Support

}Optimization. Module 11: Optimization of Recursive Queries. Module Outline

}Optimization Formalisms for recursive queries. Module 11: Optimization of Recursive Queries. Module Outline Datalog

DATABASE THEORY. Lecture 12: Evaluation of Datalog (2) TU Dresden, 30 June Markus Krötzsch

Three Query Language Formalisms

Databases-1 Lecture-01. Introduction, Relational Algebra

DATABASE THEORY. Lecture 15: Datalog Evaluation (2) TU Dresden, 26th June Markus Krötzsch Knowledge-Based Systems

CMPS 277 Principles of Database Systems. Lecture #11

Announcements. Database Systems CSE 414. Datalog. What is Datalog? Why Do We Learn Datalog? HW2 is due today 11pm. WQ2 is due tomorrow 11pm

Datalog. Susan B. Davidson. CIS 700: Advanced Topics in Databases MW 1:30-3 Towne 309

Introduction to SQL. Select-From-Where Statements Multirelation Queries Subqueries. Slides are reused by the approval of Jeffrey Ullman s

Lecture 16. The Relational Model

FOUNDATIONS OF DATABASES AND QUERY LANGUAGES

CSE 544: Principles of Database Systems

Annoucements. Where are we now? Today. A motivating example. Recursion! Lecture 21 Recursive Query Evaluation and Datalog Instructor: Sudeepa Roy

Decision Support. Chapter 25. CS 286, UC Berkeley, Spring 2007, R. Ramakrishnan 1

Announcements. What is Datalog? Why Do We Learn Datalog? Database Systems CSE 414. Midterm. Datalog. Lecture 13: Datalog (Ch

Lecture 5 Design Theory and Normalization

Data Warehousing 3. ICS 421 Spring Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa

CSE 344 JANUARY 26 TH DATALOG

Introduction to Data Management CSE 344. Lecture 14: Datalog (guest lecturer Dan Suciu)

SQL: Data Manipulation Language

Relational Model and Relational Algebra

Chapter 2 The relational Model of data. Relational algebra

CSE 344 JANUARY 29 TH DATALOG

Lecture 04: SQL. Monday, April 2, 2007

Database Systems CSE 414. Lecture 9-10: Datalog (Ch )

Negation wrapped inside a recursion makes no. separated, there can be ambiguity about what. the rules mean, and some one meaning must

Introduction to Data Management CSE 344. Lecture 14: Datalog

4/10/2018. Relational Algebra (RA) 1. Selection (σ) 2. Projection (Π) Note that RA Operators are Compositional! 3.

Introduction to SQL SELECT-FROM-WHERE STATEMENTS SUBQUERIES DATABASE SYSTEMS AND CONCEPTS, CSCI 3030U, UOIT, COURSE INSTRUCTOR: JAREK SZLICHTA

Chapter 6: Bottom-Up Evaluation

CSC 343 Winter SQL: Aggregation, Joins, and Triggers MICHAEL LIUT

Datalog. Rules Programs Negation

Lecture 04: SQL. Wednesday, October 4, 2006

CSC 261/461 Database Systems Lecture 5. Fall 2017

Introduction to SQL. Multirelation Queries Subqueries. Slides are reused by the approval of Jeffrey Ullman s

Managing Inconsistencies in Collaborative Data Management

Relational Algebra and SQL. Basic Operations Algebra of Bags

CS411 Database Systems. 04: Relational Algebra Ch 2.4, 5.1

Database Theory VU , SS Codd s Theorem. Reinhard Pichler

Introduction to Data Management CSE 344. Lecture 14: Relational Calculus

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

A Retrospective on Datalog 1.0

Views, Indexes, Authorization. Views. Views 8/6/18. Virtual and Materialized Views Speeding Accesses to Data Grant/Revoke Priviledges

CS 317/387. A Relation is a Table. Schemas. Towards SQL - Relational Algebra. name manf Winterbrew Pete s Bud Lite Anheuser-Busch Beers

FUNCTIONAL DEPENDENCIES

Encyclopedia of Database Systems, Editors-in-chief: Özsu, M. Tamer; Liu, Ling, Springer, MAINTENANCE OF RECURSIVE VIEWS. Suzanne W.

Midterm. Database Systems CSE 414. What is Datalog? Datalog. Why Do We Learn Datalog? Why Do We Learn Datalog?

Lecture 4: Advanced SQL Part II

Datalog Evaluation. Linh Anh Nguyen. Institute of Informatics University of Warsaw

SQL DATA DEFINITION LANGUAGE

Relational Algebra. Algebra of Bags

CSE 344 APRIL 11 TH DATALOG

Principles of Database Systems CSE 544. Lecture #2 SQL, Relational Algebra, Relational Calculus

SQL DATA DEFINITION LANGUAGE

Midterm. Introduction to Data Management CSE 344. Datalog. What is Datalog? Why Do We Learn Datalog? Why Do We Learn Datalog? Lecture 13: Datalog

Incomplete Information: Null Values

Schema Design for Uncertain Databases

Logical Aspects of Massively Parallel and Distributed Systems

Database Management System

L22: The Relational Model (continued) CS3200 Database design (sp18 s2) 4/5/2018

CS145 Introduction. About CS145 Relational Model, Schemas, SQL Semistructured Model, XML

RELATIONAL ALGEBRA II. CS121: Relational Databases Fall 2017 Lecture 3

Today s topics. Null Values. Nulls and Views in SQL. Standard Boolean 2-valued logic 9/5/17. 2-valued logic does not work for nulls

Introduction to Database Systems

Relative Information Completeness

Assignment 4 CSE 517: Natural Language Processing

Chapter 10 Advanced topics in relational databases

Three Query Language Formalisms

Datalog Recursive SQL LogicBlox

Database Design Theory and Normalization. CS 377: Database Systems

Range Restriction for General Formulas

ENTITY-RELATIONSHIP MODEL. CS 564- Spring 2018

Chapter 5: Other Relational Languages.! Query-by-Example (QBE)! Datalog

Relational Algebra BASIC OPERATIONS DATABASE SYSTEMS AND CONCEPTS, CSCI 3030U, UOIT, COURSE INSTRUCTOR: JAREK SZLICHTA

Notes. These slides are based on slide sets provided by M. T. Öszu and by R. Ramakrishnan and J. Gehrke. CS 640 Views Winter / 15.

Outline. q Database integration & querying. q Peer-to-Peer data management q Stream data management q MapReduce-based distributed data management

Program Analysis in Datalog

Chapter 5: Other Relational Languages

Introduction to SQL. Select-From-Where Statements Multirelation Queries Subqueries

Lecture 9: Datalog with Negation

SQL DATA DEFINITION LANGUAGE

CSC 261/461 Database Systems Lecture 13. Fall 2017

SQL: Data Definition Language

Principles of Data Management. Lecture #9 (Query Processing Overview)

Transcription:

Database Theory: Datalog, Views CS 645 Mar 8, 2006 Some slide content courtesy of Ramakrishnan & Gehrke, Dan Suciu, Ullman & Widom 1

TODAY: Coming lectures Adding recursion: datalog Summary of Containment Complexity Views 2

Expressive Power of FO Let I = {R(x,y)} represent a graph Query path(x,y) = all x,y such that there is path from x to y Theorem: path(x,y) cannot be expressed in FO.

Datalog Programs A Datalog program is a collection of rules. In a program, subgoals can be either 1. EDB = Extensional Database = stored table. 2. IDB = Intensional Database = computed table. No EDB in heads. (Recall: in a CQ, subgoals are always EDBs) 4

Non-recursive rules Graph: R(x,y) P(x,y) :- R(x,u), R(u,v), R(v,y) A(x,y) :- P(x,u), P(u,y) Can unfold it into: A(x,y) :- R(x,u), R(u,v), R(v,w), R(w,m), R(m,n), R(n,y)

Example: Datalog Program Using EDB Sells(bar, beer, price) and Beers(name, manf), find the manufacturers of beers Joe doesn t sell. JoeSells(b) <- Sells( Joes Bar, b, p) Answer(m) <- Beers(b,m) & JoeSells(b) 6

Evaluating Datalog Programs As long as there is no recursion, we can pick an order to evaluate the IDB predicates, so that all the predicates in the body of its rules have already been evaluated. If an IDB predicate has more than one rule, each rule contributes tuples to its relation. 7

Recursive example Two forms of transitive closure: Graph: R(x,y) Path(x,y) :- R(x,y) Path(x,y) :- Path(x,u), R(u,y) Path(x,y) :- R(x,y) Path(x,y) :- Path(x,u), Path(u,y)

Recursive Example EDB: Par(c,p) = p is a parent of c. Generalized cousins: people with common ancestors one or more generations back: Sib(x,y) :- Par(x,p), Par(y,p), x y Cousin(x,y) :- Sib(x,y) Cousin(x,y) :- Par(x,xp), Par(y,yp), Cousin(xp,yp) 9

Definition of Recursion Form a dependency graph whose nodes = IDB predicates. Arc X ->Y if and only if there is a rule with X in the head and Y in the body. Cycle = recursion; no cycle = no recursion. 10

Example: Dependency Graphs Cousin Answer Sib Recursive JoeSells Nonrecursive Sib(x,y) :- Par(x,p), Par(y,p), x y JoeSells(b) <- Sells( Joes Bar, b, p) Cousin(x,y) :- Sib(x,y) Answer(m) <- Beers(b,m) & JoeSells(b) Cousin(x,y) :- Par(x,xp), Par(y,yp), Cousin(xp,yp) 11

Evaluating Recursive Rules The following works when there is no negation: 1. Start by assuming all IDB relations are empty. 2. Repeatedly evaluate the rules using the EDB and the previous IDB, to get a new IDB. 3. End when no change to IDB. Fixed Point 12

The Naïve Evaluation Algorithm Start: IDB = 0 Apply rules to IDB, EDB yes Change to IDB? no done 13

Example: Evaluation of Cousin We ll proceed in rounds to infer Sib facts (red) and Cousin facts (green). Remember the rules: Sib(x,y) <- Par(x,p) & Par(y,p) & x y Cousin(x,y) <- Sib(x,y) Cousin(x,y) <- Par(x,xp) & Par(y,yp) & Cousin(xp,yp) 14

Seminaive Evaluation Since the EDB never changes, on each round we only get new IDB tuples if we use at least one IDB tuple that was obtained on the previous round. Saves work; lets us avoid rediscovering most known facts. - A fact could still be derived in a second way. 15

Par Data: Parent Above Child Sibling Cousin a d Round 1 Round 2 Round 3 Round 4 b c e f g h j k i 16

Datalog There are three equivalent meanings for a datalog rule - least fixed point (Recursion, no negation) - (unique) minimal model - set of facts derivable from EDBs 17

Recursion in SQL Graph: R(x,y) Path(x,y) :- R(x,y) Path(x,y) :- Path(x,u), R(u,y) WITH RECURSIVE Path(x,y) AS ( SELECT R1.x, R1.y FROM R as R1) UNION (SELECT P1.x, R2.y FROM R as R2, Path as P1 WHERE P1.y = R2.x ) SELECT * FROM Path 18

Variants of Datalog without recursion with recursion without with Non-recursive Datalog = UCQ Non-recursive Datalog = FO Datalog Datalog

Query language classes Algebra Logic SQL Recursive Queries Datalog Datalog Expressiveness FO queries RA nr-datalog (safe) RC UCQ UCQ CQ< CQ CQ SFW + UNION EXCEPT Conjunctive Queries RA: σ,π, single datalog rule S d FW

Containment for Datalog Theorem Containment for datalog programs is undecidable. 21

Summary of complexity of containment CQ NP-complete same expressive power UCQ nr-datalog (no ) Π p 2 NP-complete CQ Π p 2 CQ< Π p 2 FO (nr-datalog) Datalog undecidable undecidable 22

Non-recursive Datalog Union of Conjunctive Queries = UCQ Containment is decidable, and NPcomplete Non-recursive Datalog (no ) Is equivalent to UCQ Hence containment is decidable here too Is it still NP-complete?

Non-recursive Datalog A non-recursive datalog: Its unfolding as a CQ: How big is this query? T 1 (x,y) :- R(x,u), R(u,y) T 2 (x,y) :- T 1 (x,u), T 1 (u,y)... T n (x,y) :- T n-1 (x,u), T n-1 (u,y) Answer(x,y) :- T n (x,y) Answer(x,y) :- R(x,u 1 ), R(u 1, u 2 ), R(u 2, u 3 ),... R(u m, y)

Query language classes Algebra Logic SQL Recursive Queries Datalog Datalog Expressiveness FO queries RA nr-datalog (safe) RC UCQ UCQ CQ< CQ CQ SFW + UNION EXCEPT Conjunctive Queries RA: σ,π, single datalog rule S d FW

Views 26

Views A view is a relation defined by a query. The query defining the view is called the view definition For example: SQL CREATE VIEW Developers AS SELECT name, project FROM Employee WHERE department = Development CQ v1(y,z) :- P(x,y) & P(y,z) 27

Virtual and Materialized Views A view may be: virtual: the view relation is defined, but not computed or stored. Computed only on-demand slow at runtime Always up to date materialized: the view relation is computed and stored in system. Pre-computed offline fast at runtime May have stale data 28

Virtual view example Person(name, city) Purchase(buyer, seller, product, store) Product(name, maker, category) CREATE VIEW Seattle-view AS SELECT buyer, seller, product, store FROM Person, Purchase WHERE Person.city = Seattle AND Person.name = Purchase.buyer We have a new virtual table: Seattle-view(buyer, seller, product, store) 29

View Example We can use the view in a query as we would any other relation: SELECT name, store FROM Seattle-view, Product WHERE Seattle-view.product = Product.name AND Product.category = shoes 30

Querying a virtual view SELECT name, Seattle-view.store FROM Seattle-view, Product WHERE Seattle-view.product = Product.name AND Product.category = shoes SELECT name, Purchase.store FROM Person, Purchase, Product WHERE Person.city = Seattle AND Person.name = Purchase.buyer AND Purchase.poduct = Product.name AND Product.category = shoes View expansion 31

Views and query minimization Users usually don t write non-minimal queries However, non-minimal queries arise when using views intensively Good motivation for query minimization

Query Minimization for Views CREATE VIEW HappyBoaters SELECT DISTINCT E1.name, E1.manager FROM Employee E1, Employee E2 WHERE E1.manager = E2.name and E1.boater= YES and E2.boater= YES This query is minimal

Query Minimization for Views Now compute the Very-Happy-Boaters SELECT DISTINCT H1.name FROM HappyBoaters H1, HappyBoaters H2 WHERE H1.manager = H2.name This query is also minimal What happens in SQL when we run a query on a view?

Query Minimization for Views View Expansion SELECT DISTINCT E1.name FROM Employee E1, Employee E2, Employee E3, Empolyee E4 WHERE E1.manager = E2.name and E1.boater = YES and E2.boater = YES and E3.manager = E4.name and E3.boater = YES and E4.boater = YES and E1.manager = E3.name This query is no longer minimal! E4 E3 E1 E2 E2 is redundant

The great utility of views Data independence Efficient query processing materializing certain results can improve query execution esp. data warehousing Controlling access Grant access to views only to filter data Data integration Combine data sources using views 36

1. View selection View-related issues which views to materialize, given workload 2. View maintenance when base relations change, views need to be refreshed. 3. Updating virtual views can users update relations that don t exist? 4. Answering queries using views when only views are available, what queries over base relations are answerable?

Issues in View Materialization What views should we materialize? What indexes should we build? Given a query and a set of materialized views, can we use the materialized views to answer the query? When do we refresh materialized views to make them consistent with the underlying tables? (And how can we do this incrementally?)

View Maintenance Two steps: Propagate: Compute changes to view when data changes. Refresh: Apply changes to the materialized view table. Maintenance policy: Controls when we do refresh. Immediate: As part of the transaction that modifies the underlying data tables. (+ Materialized view is always consistent; - updates are slowed) Deferred: Some time later, in a separate transaction. (- View becomes inconsistent; + can scale to maintain many views without slowing updates)

Deferred Maintenance Three flavors: Lazy: Delay refresh until next query on view; then refresh before answering the query. Periodic (Snapshot): Refresh periodically. Queries possibly answered using outdated version of view tuples. Widely used, especially for asynchronous replication in distributed databases, and for warehouse applications. Event-based: E.g., Refresh after a fixed number of updates to underlying data tables.

Updating Views How can I insert a tuple into a table that doesn t exist? Employee(ssn, name, department, project, salary) CREATE VIEW Developers AS SELECT name, project FROM Employee WHERE department = Development If we make the following insertion: INSERT INTO Developers VALUES( Joe, Optimizer ) It becomes: INSERT INTO Employee(ssn, name, department, project, salary) VALUES(NULL, Joe, NULL, Optimizer, NULL) 41

Non-Updatable Views Person(name, city) Purchase(buyer, seller, product, store) CREATE VIEW City-Store AS SELECT Person.city, Purchase.store FROM Person, Purchase WHERE Person.name = Purchase.buyer How can we add the following tuple to the view? ( Seattle, Nine West ) We don t know the name of the person who made the purchase; cannot set to NULL (why?)

Troublesome examples CREATE VIEW OldEmployees AS SELECT name, age FROM Employee WHERE age > 30 INSERT INTO OldEmployees VALUES( Joe, 28) If this tuple is inserted into view, it won t appear! Allowed by default in SQL. 43

Ambiguous updates Name group Alice Bob Bob group fac fac fac cvs file foo.txt Join view Alice Alice Bob Bob foo.txt bar.txt foo.txt bar.txt fac bar.txt cvs foo.txt Delete ( Alice, foo.txt )

Updating views in practice Updates on views highly constrained: SQL-92: updates only allowed on singletable views with projection, selection, no aggregates. SQL-99: takes into account primary keys; updates on multiple table views may be allowed. SQL-99: distinguishes between updatable and insertable views 45

Data integration A data integration system provides a uniform interface to multiple autonomous, heterogeneous data sources. enterprise integration integrating sources on the WWW integrating scientific databases Uniform interface is offered by constructing a mediated schema over sources (views) Query answering means expressing query over sources using only views. 46

Data integration 47

Answering queries using views Available views: v1(y,z) :- R(x,y), R(y,z) v2(x,z) :- R(x,y), R(y,z) Query we want to answer: q(w) :- R(0,u), R(u,v), R(v,w) Great-grand parents of person 0 A(s) :- v2(0,t), v1(t,s) 48

Basic strategy Given views V 1,, V n and query Q, Guess a rewriting i.e. query A which refers to views Expand A using view definitions Check equivalence of A and Q 49

Finding a Rewriting Theorem Given views V 1,, V n and query Q, the problem whether Q has a equivalent rewriting in terms of V 1,, V n is NP complete

Rewritings Q is a maximally-contained rewriting of Q if: Q Q For language L no other rewriting contains Q 51

Certain Answers Definition. Given V 1,, V n, their answers A 1,, A n and a query Q, a tuple t is a certain tuple for Q iff for every database instance D: if A 1 =V 1 (D) and and A n =V n (D) then t Q(D) CWD (Closed World Assumption) if A 1 V 1 (D) and and A n V n (D) then t Q(D) OWD (Open World Assumption)