Logic and Databases. Lecture 4 - Part 2. Phokion G. Kolaitis. UC Santa Cruz & IBM Research - Almaden

Similar documents
Logic and Databases. Phokion G. Kolaitis. UC Santa Cruz & IBM Research - Almaden

Structural characterizations of schema mapping languages

A Retrospective on Datalog 1.0

Computability Theory

Consensus Answers for Queries over Probabilistic Databases. Jian Li and Amol Deshpande University of Maryland, College Park, USA

Efficient query evaluation on probabilistic databases

Lecture 1: Conjunctive Queries

Structural Characterizations of Schema-Mapping Languages

2SAT Andreas Klappenecker

Uncertain Data Models

1 Introduction. 1. Prove the problem lies in the class NP. 2. Find an NP-complete problem that reduces to it.

Learning Schema Mappings

Binary Decision Diagrams

Query Evaluation on Probabilistic Databases

Database Constraints and Homomorphism Dualities

Database Theory VU , SS Introduction: Relational Query Languages. Reinhard Pichler

CS 151 Complexity Theory Spring Final Solutions. L i NL i NC 2i P.

Composing Schema Mapping

Designing Views to Answer Queries under Set, Bag,and BagSet Semantics

Query Evaluation on Probabilistic Databases

Database Theory VU , SS Introduction: Relational Query Languages. Reinhard Pichler

[Ch 6] Set Theory. 1. Basic Concepts and Definitions. 400 lecture note #4. 1) Basics

The Complexity of Data Exchange

Foundations of Data Exchange and Metadata Management. Marcelo Arenas Ron Fagin Special Event - SIGMOD/PODS 2016

Relative Information Completeness

SAT-CNF Is N P-complete

Provable data privacy

PROPOSITIONAL LOGIC (2)

Integration of Multiple Uncertain Data Sources

Exact Algorithms Lecture 7: FPT Hardness and the ETH

On the Hardness of Counting the Solutions of SPARQL Queries

Data Integration with Uncertainty

Definition: A context-free grammar (CFG) is a 4- tuple. variables = nonterminals, terminals, rules = productions,,

Schema Design for Uncertain Databases

Data Exchange: Semantics and Query Answering

The Satisfiability Problem [HMU06,Chp.10b] Satisfiability (SAT) Problem Cook s Theorem: An NP-Complete Problem Restricted SAT: CSAT, k-sat, 3SAT

Data integration lecture 2

9.5 Equivalence Relations

Department of Computer Science CS-RR-15-01

A study on lower interval probability function based decision theoretic rough set models

Scalar Aggregation in Inconsistent Databases

Schema Mappings and Data Exchange

In this lecture we discuss the complexity of approximation problems, and show how to prove they are NP-hard.

Fixed Parameter Algorithms

8.1 Polynomial-Time Reductions

Complexity of unique list colorability

W4231: Analysis of Algorithms

Canonical Completeness in Lattice-Based Languages for Attribute-Based Access Control

Limits of Schema Mappings

XML Data Exchange. Marcelo Arenas P. Universidad Católica de Chile. Joint work with Leonid Libkin (U. of Toronto)

NP and computational intractability. Kleinberg and Tardos, chapter 8

PROBABILISTIC GRAPHICAL MODELS SPECIFIED BY PROBABILISTIC LOGIC PROGRAMS: SEMANTICS AND COMPLEXITY

The Relational Model

CSE 20 DISCRETE MATH. Fall

The Most Probable Database Problem

CMPS 277 Principles of Database Systems. Lecture #3

NP-Completeness. Algorithms

Decomposition of log-linear models

The temporal explorer who returns to the base 1

Efficient Query Evaluation on Probabilistic Databases

Parallel Databases C H A P T E R18. Practice Exercises

NP-Complete Reductions 2

DATABASE THEORY. Lecture 18: Dependencies. TU Dresden, 3rd July Markus Krötzsch Knowledge-Based Systems

Querying Probabilistic Data

Uncertainty in Databases. Lecture 2: Essential Database Foundations

ALGORITHMS EXAMINATION Department of Computer Science New York University December 17, 2007

Chapter 8. NP and Computational Intractability. Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.

arxiv: v1 [cs.db] 23 May 2016

Formal Verification. Lecture 7: Introduction to Binary Decision Diagrams (BDDs)

Feedback Arc Set in Bipartite Tournaments is NP-Complete

Symbolic Model Checking

Overview. CS389L: Automated Logical Reasoning. Lecture 6: First Order Logic Syntax and Semantics. Constants in First-Order Logic.

Designing and Refining Schema Mappings via Data Examples

CSE 20 DISCRETE MATH. Winter

Management of Probabilistic Data Foundations and Challenges

Data Cleaning and Query Answering with Matching Dependencies and Matching Functions

The Inverse of a Schema Mapping

4.1 Review - the DPLL procedure

Computer Science Technical Report

Exercises Computational Complexity

Randomness and Computation March 25, Lecture 5

Complexity of Answering Queries Using Materialized Views

Efficient Enumeration Algorithms for Constraint Satisfaction Problems

New Worst-Case Upper Bound for #2-SAT and #3-SAT with the Number of Clauses as the Parameter

Approximation Algorithms for Computing Certain Answers over Incomplete Databases

COMPUTER SCIENCE TRIPOS

Advanced Query Processing

Example of a Demonstration that a Problem is NP-Complete by reduction from CNF-SAT

Chapter 2 & 3: Representations & Reasoning Systems (2.2)

Computational Geometry

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18

Nondeterministic Query Algorithms

Disjunctive and Conjunctive Normal Forms in Fuzzy Logic

8 Matroid Intersection

Lecture 9: Datalog with Negation

Isomorphisms between low n and computable Boolean algebras

Conceptual modeling of entities and relationships using Alloy

CMPSCI611: The SUBSET-SUM Problem Lecture 18

NP-Completeness of 3SAT, 1-IN-3SAT and MAX 2SAT

P -vs- NP. NP Problems. P = polynomial time. NP = non-deterministic polynomial time

Transcription:

Logic and Databases Phokion G. Kolaitis UC Santa Cruz & IBM Research - Almaden Lecture 4 - Part 2

2 / 17 Alternative Semantics of Queries Bag Semantics We focused on the containment problem for conjunctive queries under bag semantics. Next, we will discuss: Probabilistic Databases Inconsistent Databases The focus will be on the data complexity of conjunctive queries in these two frameworks.

3 / 17 Probabilistic Databases So far, data stored in a database have been assumed to exist with certainty However, in modern applications, data may be uncertain: noisy, fuzzy, corrupted, or even missing. Such applications include social media, information integration, scientific data management,... Probabilistic Databases provide a framework for modeling and managing uncertain data. Probabilistic Databases extend relational databases with probabilities. Both the data and their probabilities are stored as "standard" relations, but the semantics of query answering takes probabilities into account.

4 / 17 Probabilistic Databases Definition A probabilistic database is a pair W = (D, P) such that D = {D 1,..., D k } is a finite set of databases D i over the same schema. P : D [0, 1] is a function such that k i=1 P(D k) = 1. Intuition A probabilistic database can be in one of finitely many possible states, each with some probability. D is a set of possible worlds representing the possible states of the probabilistic database.

Marginal Probabilities Definition Let W = (D, P) be a probabilistic database. Let q be a k-ary query, k 1, and let a be a k-tuple. The marginal probability Pr(q, a, W) of a is Pr(q, a, W) = a q(d i ) P(D i ). Let q be a Boolean query. The marginal probability Pr(q, W) of q is Pr(q, W) = D i =q P(D i ). 5 / 17

6 / 17 Query Evaluation over Probabilistic Databases Query Evaluation over probabilistic databases: Given a k-ary query q, a k-tuple a, and a probabilistic database W, compute the marginal probability Pr(q, a, W). Note that this is a combined complexity problem. Here, we will focus on the data complexity of Boolean conjunctive queries over probabilistic databases. Fix a Boolean conjunctive query q. Then Pr[q] is the following algorithmic problem: Given a probabilistic database W, compute the marginal probability Pr(q, W).

7 / 17 Representations of Probabilistic Databases A probabilistic database may have an arbitrarily large number of possible worlds, which implies that listing all these possible worlds may be infeasible. For this reason, several different compact representations of probabilistic databases have been introduced and investigated. Here, we will focus on tuple-independent databases, which is arguably the simplest model for probabilistic database design. Intuitively, in a tuple-independent database all tuples are independent probabilistic events.

8 / 17 Tuple-Independent Databases A tuple-independent relation is a relation R(A 1,..., A m, P) in which tuples (a 1,..., a m ) are independent events and the values of P are numbers in the interval [0, 1] denoting the marginal tuple probabilities of the tuples. Company Product P Apple iphone 6 0.95 Samsung Galaxy 7 0.96 Apple iphone 7 0.75 Microsoft Lumia 640 0.85 This table is a compact representation of 16 possible tables. For example, the table consisting of the first, the second, and the fourth tuple has probability 0.95 0.96 0.25 0.85. A tuple-independent database is a database consisting of tuple-independent relations.

9 / 17 Data Complexity in Tuple-Independent Databases Fix a Boolean query q. Pr[q] is the following problem: Given a tuple-independent database W, compute the marginal probability Pr(q, W). This is a data complexity problem because the query is fixed and the input is a tuple-independent database W. Recall that the data complexity of unions of conjunctive queries on (deterministic) databases is in LOGSPACE.

10 / 17 Data Complexity in Tuple-Independent Databases Dichotomy Theorem (Dalvi and Suciu - 2012) If q is a union of Boolean conjunctive queries, then Pr[q] is in P or Pr[q] is #P-complete.

10 / 17 Data Complexity in Tuple-Independent Databases Dichotomy Theorem (Dalvi and Suciu - 2012) If q is a union of Boolean conjunctive queries, then Pr[q] is in P or Pr[q] is #P-complete. Note #P is the class of counting problems associated with decision problems in NP. The prototypical #P-complete problem is #SAT: Given a CNF-formula ϕ, compute the number of its satisfying assignments. Valiant (1979) also showed that #POSITIVE 2SAT is #P-complete. ("easy" decision - "hard" counting phenomenon)

11 / 17 Hierarchical Queries Definition A self-join free conjunctive query is a conjunctive query in which no relation symbol appears more than once. Let q be a self-join free conjunctive query. If x is a variable of q, then at(x) is the set of all atoms of x in which x appears. We say that q is hierarchical if for every two variables x and y of q, one of the following holds: at(x) at(y), at(y) at(x), at(x) at(y) =. Example The query x y(r(x) S(x, y)) is hierarchical. The query x y(r(x) S(x, y) T (y)) is not hierarchical.

12 / 17 Data Complexity in Tuple-Independent Databases The Little Dichotomy Theorem (Dalvi and Suciu - 2004) Let q be a Boolean self-join free conjunctive query. If q is hierarchical, then Pr[q] is in P. If q is not hierarchical, then Pr[q] is #P-complete.

12 / 17 Data Complexity in Tuple-Independent Databases The Little Dichotomy Theorem (Dalvi and Suciu - 2004) Let q be a Boolean self-join free conjunctive query. If q is hierarchical, then Pr[q] is in P. If q is not hierarchical, then Pr[q] is #P-complete. Proof Idea Hierarchical queries admit safe evaluation plans. Non-hierarchical queries: Show that Pr[ x y(r(x) S(x, y) T (y))] is #P-complete. Show that if q is not hierarchical, then Pr[ x y(r(x) S(x, y) T (y))] is reducible to Pr[q].

13 / 17 Hierarchical Queries Let q be the hierarchical query x y(r(x) S(x, y) and let W be a tuple-independent database. First, write q as x(r(x) ys(x, y)). Then, using tuple-independence repeatedly, we have that: Pr[q] = 1 (1 P((R(a) ys(a, y)))) a adom(w) = 1 a adom(w) = 1 a adom(w) (1 P((R(a)) P( ys(a, y)))) (1 P((R(a)) (1 b adom(w) (1 P(S(a, b)))))) The last expression has size O(n 2 ), where n = adom(w).

14 / 17 Non-Hierarchical Queries A positive partitioned 2DNF formula (PP2DNF) is a DNF-formula of the form x i1 y j1 x ik y jk, where the x i s and the y j s form disjoint sets of variables. Theorem (Provan and Ball - 1982) #PP2DNF is #P-complete. Theorem (Dalvi and Suciu - 2004) There is a counting reduction from #PP2DNF to Pr[ x y(r(x) S(x, y) T (y))].

15 / 17 Non-Hierarchical Queries Counting reduction from #PP2DNF to Pr[ x y(r(x) S(x, y) T (y))]. Suppose ϕ is the formula x 1 y 1 x 1 y 2 x 2 y 1 Let W ϕ be the tuple-independence database R X P S X Y P T Y P x 1 0.5 x 1 y 1 1 y 1 0.5 x 2 0.5 x 1 y 2 1 y 2 0.5 x 2 y 1 1 There is a 1-1 correspondence between truth assignments for ϕ and possible worlds for W ϕ. It is easy to see that #ϕ = 2 n Pr( x y(r(x) S(x, y) T (y)), W ϕ ), where n is the number of variables of ϕ.

16 / 17 Non-Hierarchical Queries Let q be a Boolean conjunctive query that is not hierarchical. By definition, there are variables x and y of q such that at(x) at(y), at(y) at(x), at(x) at(y). Since at(x) at(y), there is an atom R (x,...) in which y does not appear. Since at(y) at(x), there is an atom T (y,...) in which x does not appear. Since at(x) at(y), there is an atom T (x, y,...) in which both x and y appear. These atoms can be used to obtain a counting reduction from Pr[ x y(r(x) S(x, y) T (y))] to Pr[q].

17 / 17 Data Complexity in Tuple-Independent Databases The Little Dichotomy Theorem (Dalvi and Suciu - 2004) Let q be a Boolean self-join free conjunctive query. If q is hierarchical, then Pr[q] is in P. If q is not hierarchical, then Pr[q] is #P-complete.

17 / 17 Data Complexity in Tuple-Independent Databases The Little Dichotomy Theorem (Dalvi and Suciu - 2004) Let q be a Boolean self-join free conjunctive query. If q is hierarchical, then Pr[q] is in P. If q is not hierarchical, then Pr[q] is #P-complete. Open Problems: Dichotomy Theorem for arbitrary conjunctive queries on the block-independent-disjoint model. Dichotomy known for self-join free conjunctive queries. Dichotomy Theorem for arbitrary conjunctive queries on the tuple-independent model in the presence of functional dependencies. Dichotomy known for self-join free conjunctive queries.