Incomplete Information: Null Values

Nulls are often ruled out: NOT NULL in SQL.
Essential when one integrates or exchanges data.
Perhaps the most poorly designed and the most often criticized part of SQL:
  "... [this] topic cannot be described in a manner that is simultaneously both comprehensive and comprehensible."
  "... those SQL features are not fully consistent; indeed, in some ways they are fundamentally at odds with the way the world behaves."
  "A recommendation: avoid nulls."
  "Use [nulls] properly and they work for you, but abuse them, and they can ruin everything."

L. Libkin, Data Integration and Exchange

Part I: theory of incomplete information
  What is incomplete information?
  Which relational operations can be evaluated correctly in the presence of incomplete information?
  What does "evaluated correctly" mean?
Part II: incomplete information in SQL
  SQL simplifies things too much.
  This leads to inconsistent answers.
  One needs to understand this to ask queries over integrated/exchanged data.

Sometimes we don't have all the information
Null is used if we don't have a value for a given attribute.
What could null possibly mean?
  Value exists, but is unknown at the moment.
  Value does not exist.
  There is no information.

Representing relations with nulls: Codd tables
In Codd tables, we put distinct variables for null values:

    T:  A    B    C
        a1   b1   c1
        a2   b2   c2
        x    b3   c3
        y    b4   c4
        a5   z    c5

The semantics of a Codd table T is the set POSS(T) of all tables without nulls that it can represent. That is, we substitute values for all variables.

Tables in POSS(T)
Closed World Assumption (CWA): simply replace each variable by a value.
Open World Assumption (OWA): replace each variable by a value and possibly add tuples.
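A minimal Python sketch of POSS(T) under the Closed World Assumption. The `Var` class, the function names, and the two-value domain are assumptions made for illustration. The real domain is typically infinite, so this only enumerates a finite slice of POSS(T). Under OWA each of these relations could also be extended with arbitrary tuples, which the sketch does not attempt.

```python
from itertools import product


class Var:
    """A null written as a named variable (hypothetical helper)."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


def variables(table):
    """Collect the distinct variables of a table in first-seen order."""
    seen = []
    for row in table:
        for cell in row:
            if isinstance(cell, Var) and cell not in seen:
                seen.append(cell)
    return seen


def poss_cwa(table, domain):
    """Yield each null-free relation obtained by substituting domain values (CWA)."""
    vs = variables(table)
    for values in product(domain, repeat=len(vs)):
        valuation = dict(zip(vs, values))
        yield frozenset(
            tuple(valuation[c] if isinstance(c, Var) else c for c in row)
            for row in table
        )


x = Var("x")
# Two rows taken from the Codd table above; the domain is an assumed toy domain.
T = [("a1", "b1", "c1"), (x, "b3", "c3")]
domain = ["a1", "a9"]
worlds = list(poss_cwa(T, domain))
print(len(worlds))  # 2
```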

Querying Codd tables
Suppose Q is a relational algebra or SQL query, and T is a Codd table. What is Q(T)?
We only know how to apply Q to usual relations, so we can find:

    $\hat{Q}(T) = \{\, Q(R) \mid R \in \mathrm{POSS}(T) \,\}$

If there were a Codd table T' such that $\mathrm{POSS}(T') = \hat{Q}(T)$, then we would say that T' is Q(T). That is,

    $\mathrm{POSS}(Q(T)) = \{\, Q(R) \mid R \in \mathrm{POSS}(T) \,\}$

Question: can we always find such a table T'?

Strong representation systems
Let L be a language (e.g. a fragment of relational algebra).
Suppose Q is a query, over n tables, in language L. If for all $T_1, \ldots, T_n$ which are inputs to Q, there exists a table T' such that

    $\mathrm{POSS}(T') = \{\, Q(R_1, \ldots, R_n) \mid R_1 \in \mathrm{POSS}(T_1), \ldots, R_n \in \mathrm{POSS}(T_n) \,\}$

then we call T' the answer to Q, that is, $Q(T_1, \ldots, T_n)$.
If we can always find $Q(T_1, \ldots, T_n)$, we say that Codd tables form a strong representation system for L.
Bad news: we may not have a strong representation system even for a small subset of relational algebra.

No strong representation system for Codd tables
Table:

    T:  A   B
        0   1
        x   2

Query: $Q = \sigma_{A=3}(T)$
Suppose there is T' such that $\mathrm{POSS}(T') = \{\, Q(R) \mid R \in \mathrm{POSS}(T) \,\}$. Consider:

    R1: A   B        R2: A   B
        0   1            0   1
        2   2            3   2

Both R1 and R2 are in POSS(T), and $Q(R_1) = \emptyset$, $Q(R_2) = \{(3, 2)\}$, so POSS(T') would have to contain both the empty relation and a nonempty one. Hence T' cannot exist, because $\emptyset \in \mathrm{POSS}(T')$ if and only if $T' = \emptyset$.
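The two possible worlds in this counterexample can be checked directly. A small Python sketch, with relations modelled as sets of (A, B) pairs; the function name is an assumption for illustration.

```python
def select_a_eq_3(relation):
    """The selection sigma_{A=3} on a null-free relation of (A, B) pairs."""
    return {row for row in relation if row[0] == 3}


# Two members of POSS(T): the variable x replaced by 2 and by 3.
R1 = {(0, 1), (2, 2)}
R2 = {(0, 1), (3, 2)}

q1 = select_a_eq_3(R1)
q2 = select_a_eq_3(R2)
print(q1)  # set()
print(q2)  # {(3, 2)}
```

A single Codd table T' would have to produce both of these answers: a table with at least one row never yields the empty relation, and a table with no rows yields nothing else.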

Weak representation systems
Idea: consider certain answers:

    $\mathrm{certain}(Q, T_1, \ldots, T_n) = \bigcap \{\, Q(R_1, \ldots, R_n) \mid R_1 \in \mathrm{POSS}(T_1), \ldots, R_n \in \mathrm{POSS}(T_n) \,\}$

certain(T): the set of tuples in T without null values.
For a query language L, Codd tables form a weak representation system if for any query Q in L,

    $\mathrm{certain}(Q(T_1, \ldots, T_n)) = \mathrm{certain}(Q, T_1, \ldots, T_n)$
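The definition of certain answers can be tried out by brute force on a toy example. This Python sketch assumes a small finite domain, CWA, and variables written as strings starting with "?"; all names are illustrative. It compares the intersection over all possible worlds with the null-free tuples obtained by running the selection on the table itself.

```python
from itertools import product


def is_var(cell):
    """Variables (nulls) are modelled as strings starting with '?' (an assumption)."""
    return isinstance(cell, str) and cell.startswith("?")


def poss_cwa(table, domain):
    """Yield every null-free relation represented by the table over a finite domain."""
    vs = sorted({c for row in table for c in row if is_var(c)})
    for values in product(domain, repeat=len(vs)):
        valuation = dict(zip(vs, values))
        yield {tuple(valuation.get(c, c) for c in row) for row in table}


def certain_answers(query, table, domain):
    """Intersect the query answers over all possible worlds."""
    return set.intersection(*(query(world) for world in poss_cwa(table, domain)))


def naive_certain(query, table):
    """Evaluate the query on the table itself, then keep only null-free tuples."""
    return {row for row in query(table) if not any(is_var(c) for c in row)}


def select_a_eq_0(relation):
    """The selection sigma_{A=0}; a variable never equals the constant 0."""
    return {row for row in relation if row[0] == 0}


T = {(0, 1), ("?x", 2)}
domain = [0, 1, 2, 3]  # assumed toy domain

exact = certain_answers(select_a_eq_0, T, domain)
naive = naive_certain(select_a_eq_0, T)
print(exact)  # {(0, 1)}
print(naive)  # {(0, 1)}
```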

Weak representation systems cont'd
Good news: Codd tables form a weak representation system for the selection-projection queries in relational algebra. That is, Codd tables form a weak representation system for SQL queries of the form SELECT-FROM-WHERE such that the FROM clause only has one relation.
Bad news: if we add either union or join (that is, allow UNION or multiple relations in the FROM clause), then Codd tables no longer form a weak representation system.
Reason: we cannot use conditions of the form x = y, where x and y are variables, and this causes problems in computing joins.

Naive tables
Naive tables are Codd tables in which some of the variables can coincide:

    A    B    C
    a1   b1   c1
    a2   b2   c2
    x    b3   c3
    y    b4   c4
    a5   x    c5

Naive tables form a weak representation system for SPJU queries (that is, π, σ, ⋈, ∪).
In SQL terms: no INTERSECT, EXCEPT, NOT IN, NOT EXISTS.
Heavily used in data exchange.
Evaluation:

    A   B         B   C         A   B   C
    1   x    ⋈    x   3    =    1   x   3
    2   y         y   4         2   y   4
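The join above can be reproduced by treating each variable as an ordinary value that is equal only to itself. A Python sketch, with variables written as "?x" and "?y" (an assumed encoding) and a hypothetical function name.

```python
def natural_join_on_b(r, s):
    """Join R(A, B) with S(B, C); variables match only themselves."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}


R = {(1, "?x"), (2, "?y")}
S = {("?x", 3), ("?y", 4)}

result = natural_join_on_b(R, S)
# result contains exactly the rows (1, '?x', 3) and (2, '?y', 4)
print(len(result))  # 2
```

With a Codd table the two occurrences of x would be different variables, so the equality test in the join could never succeed. This is the x = y problem noted for joins over Codd tables.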

Conditional tables
Naive tables do not form a weak representation system for full relational algebra. Conditional tables do.
Example:

    A    B    C    condition
    a1   b1   c1   x > 1
    a2   b2   c2
    x    b3   c3   t = 0
    y    b4   c4   t = 1
    a5   x    c5   x 5 y = 1

Query evaluation is quite complicated.

Theory of incomplete information: summary
Simple representation: Codd tables. But we cannot even evaluate simple selections over them.
If we settle for less (just certain answers must be represented correctly), then σ and π can be evaluated over Codd tables, but not ⋈, ∪, −.
If we use naive tables (variables can coincide), then SPJU queries can be evaluated.
If we use conditional tables, all relational algebra queries can be evaluated, but conditional tables are very hard to deal with.
Tradeoff: semantic correctness vs. complexity of queries.

Incomplete information in SQL
SQL approach: there is a single general-purpose NULL for all cases of missing or inapplicable information.
Nulls occur as entries in tables; sometimes they are displayed as null, sometimes as -.
They immediately lead to comparison problems.
The union of

    SELECT * FROM R WHERE R.A=1

and

    SELECT * FROM R WHERE R.A<>1

should be the same as

    SELECT * FROM R

But it is not, because if R.A is null, then neither R.A=1 nor R.A<>1 evaluates to true.

Nulls cont'd
R.A has three values: 1, null, and 2.

    SELECT * FROM R WHERE R.A=1     returns a single row with A = 1
    SELECT * FROM R WHERE R.A<>1    returns a single row with A = 2

How to check = null? New comparison: IS NULL.

    SELECT * FROM R WHERE R.A IS NULL    returns a single row with A = null

SELECT * FROM R is the union of SELECT * FROM R WHERE R.A=1, SELECT * FROM R WHERE R.A<>1, and SELECT * FROM R WHERE R.A IS NULL.
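The three selections can be run against an in-memory SQLite database from Python. This is a sketch under the assumption that SQLite's null handling matches the behaviour described here; the table R with one integer column A is set up just for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER)")
conn.executemany("INSERT INTO R VALUES (?)", [(1,), (None,), (2,)])

eq = conn.execute("SELECT * FROM R WHERE R.A = 1").fetchall()
ne = conn.execute("SELECT * FROM R WHERE R.A <> 1").fetchall()
nul = conn.execute("SELECT * FROM R WHERE R.A IS NULL").fetchall()
everything = conn.execute("SELECT * FROM R").fetchall()

print(eq)   # [(1,)]
print(ne)   # [(2,)]
print(nul)  # [(None,)]

# The first two selections alone miss the null row; adding IS NULL recovers it.
print(set(eq + ne) == set(everything))        # False
print(set(eq + ne + nul) == set(everything))  # True
```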

Nulls and other operations
What is 1+null? What is the truth value of 3 = null?
Nulls cannot be used explicitly in operations and selections: WHERE R.A=NULL or SELECT 5-NULL are illegal.
For any arithmetic, string, etc. operation, if one argument is null, then the result is null.
For R.A={1,null}, S.B={2}, the query SELECT R.A + S.B FROM R, S returns {3, null}.
What are the values of R.A=S.B?
  When R.A=1, S.B=2, it is false.
  When R.A=null, S.B=2, it is unknown.
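A quick check of null propagation, again using SQLite from Python as an assumed stand-in for a generic SQL system. SQLite reports truth values as integers, so false appears as 0 and unknown as NULL (None in Python).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER)")
conn.execute("CREATE TABLE S (B INTEGER)")
conn.executemany("INSERT INTO R VALUES (?)", [(1,), (None,)])
conn.execute("INSERT INTO S VALUES (2)")

# Arithmetic: a null argument makes the result null.
sums = {row[0] for row in conn.execute("SELECT R.A + S.B FROM R, S")}

# Comparison: 1 = 2 is false (0), null = 2 is unknown (NULL).
comparisons = {row[0] for row in conn.execute("SELECT R.A = S.B FROM R, S")}

print(sums == {3, None})         # True
print(comparisons == {0, None})  # True
```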

The logic of nulls
How does unknown interact with Boolean connectives? What is NOT unknown? What is unknown OR true?

    x         NOT x
    true      false
    false     true
    unknown   unknown

    AND       true      false     unknown
    true      true      false     unknown
    false     false     false     false
    unknown   unknown   false     unknown

    OR        true      false     unknown
    true      true      true      true
    false     true      false     unknown
    unknown   true      unknown   unknown

Problem with null values: people rarely think in three-valued logic!
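The truth tables can be written as three small Python functions, with None standing for unknown. The function names and the use of None are choices made for this sketch.

```python
def not3(x):
    """Three-valued NOT: unknown stays unknown."""
    return None if x is None else (not x)


def and3(x, y):
    """Three-valued AND: false dominates, then unknown."""
    if x is False or y is False:
        return False
    if x is None or y is None:
        return None
    return True


def or3(x, y):
    """Three-valued OR: true dominates, then unknown."""
    if x is True or y is True:
        return True
    if x is None or y is None:
        return None
    return False


print(not3(None))        # None
print(or3(None, True))   # True
print(and3(None, False)) # False
print(and3(None, True))  # None
```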

Nulls and aggregation
Be ready for big surprises!

    SELECT * FROM R

    A
    -----------
    1
    -

    SELECT COUNT(*) FROM R      returns 2
    SELECT COUNT(R.A) FROM R    returns 1

Nulls and aggregation
One would expect nulls to propagate through arithmetic expressions: SELECT SUM(R.A) FROM R is the sum $a_1 + a_2 + \ldots + a_n$ of all values in column A, so if one is null, the result should be null.
But SELECT SUM(R.A) FROM R returns 1 if R.A={1,null}.
Most common rule for aggregate functions: first, ignore all nulls, and then compute the value.
The only exception: COUNT(*).
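The aggregation results for R.A = {1, null} can be reproduced with SQLite from Python. This is a sketch assuming SQLite follows the common rule of ignoring nulls in aggregates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER)")
conn.executemany("INSERT INTO R VALUES (?)", [(1,), (None,)])

count_star = conn.execute("SELECT COUNT(*) FROM R").fetchone()[0]
count_a = conn.execute("SELECT COUNT(R.A) FROM R").fetchone()[0]
sum_a = conn.execute("SELECT SUM(R.A) FROM R").fetchone()[0]

print(count_star)  # 2: COUNT(*) counts rows, nulls included
print(count_a)     # 1: COUNT(R.A) ignores the null
print(sum_a)       # 1: SUM ignores the null instead of propagating it
```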

Nulls in subqueries: more surprises
R1.A = {1,2}, R2.A = {1,2,3,4}

    SELECT R2.A
    FROM R2
    WHERE R2.A NOT IN (SELECT R1.A FROM R1)

Result: {3,4}
Now insert a null into R1, so that R1.A = {1,2,null}, and run the same query.
The result is $\emptyset$!

Nulls in subqueries cont'd
Although this result is counterintuitive, it is correct.
What is the value of 3 NOT IN (SELECT R1.A FROM R1)?

    3 NOT IN {1,2,null}
      = NOT (3 IN {1,2,null})
      = NOT ((3 = 1) OR (3 = 2) OR (3 = null))
      = NOT (false OR false OR unknown)
      = NOT (unknown)
      = unknown

Similarly, 4 NOT IN {1,2,null} evaluates to unknown, and 1 NOT IN {1,2,null} and 2 NOT IN {1,2,null} evaluate to false.
Thus, the query returns $\emptyset$.
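The NOT IN behaviour derived above can be observed with SQLite from Python. A sketch in which the tables R1 and R2 are created just for the example, and the query is the one from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R1 (A INTEGER)")
conn.execute("CREATE TABLE R2 (A INTEGER)")
conn.executemany("INSERT INTO R1 VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO R2 VALUES (?)", [(1,), (2,), (3,), (4,)])

QUERY = "SELECT R2.A FROM R2 WHERE R2.A NOT IN (SELECT R1.A FROM R1)"

before = sorted(row[0] for row in conn.execute(QUERY))

# A single null in R1 turns every NOT IN test into false or unknown.
conn.execute("INSERT INTO R1 VALUES (NULL)")
after = [row[0] for row in conn.execute(QUERY)]

print(before)  # [3, 4]
print(after)   # []
```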

Nulls in subqueries cont'd
The result of

    SELECT R2.A FROM R2 WHERE R2.A NOT IN (SELECT R1.A FROM R1)

can be represented as a conditional table:

    A   condition
    3   x = 0
    4   x 0
    3   y = 0
    4   y = 0

Nulls could be dangerous!
Imagine a US national missile defense system, with a database of missiles targeting major cities and of missiles launched to intercept them.
Query: is there a missile targeting the US that is not being intercepted?

    SELECT M.#, M.target
    FROM Missiles M
    WHERE M.target IN (SELECT Name FROM USCities)
      AND M.# NOT IN (SELECT I.Missile
                      FROM Intercept I
                      WHERE I.Status = 'active')

Assume that a missile was launched to intercept, but its target wasn't properly entered in the database.

Nulls could be dangerous!

    Missiles             Intercept
    #    Target          I#   Missile   Status
    M1   A               I1   M1        active
    M2   B               I2   null      active
    M3   C

{A, B, C} are in USCities.
The query returns the empty set, because M2 NOT IN {M1, null} and M3 NOT IN {M1, null} evaluate to unknown, although either M2 or M3 is not being intercepted!
Highly unlikely? Probably (and hopefully). But never forget what caused the Mars Climate Orbiter to crash!
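The missile example can be replayed with SQLite from Python. This is a sketch: the column names `id` and `iid` are assumptions standing in for the slide's `#` and `I#`, which are not valid plain identifiers, and the data is the small instance shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Missiles (id TEXT, target TEXT);
    CREATE TABLE USCities (Name TEXT);
    CREATE TABLE Intercept (iid TEXT, Missile TEXT, Status TEXT);
    INSERT INTO Missiles VALUES ('M1', 'A'), ('M2', 'B'), ('M3', 'C');
    INSERT INTO USCities VALUES ('A'), ('B'), ('C');
    INSERT INTO Intercept VALUES ('I1', 'M1', 'active'), ('I2', NULL, 'active');
    """
)

QUERY = """
    SELECT M.id, M.target
    FROM Missiles M
    WHERE M.target IN (SELECT Name FROM USCities)
      AND M.id NOT IN (SELECT I.Missile
                       FROM Intercept I
                       WHERE I.Status = 'active')
"""

unintercepted = conn.execute(QUERY).fetchall()
print(unintercepted)  # []: no missile is reported, although M2 or M3 is not intercepted
```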

Complexity of nulls
Several problems are related to nulls. We shall look at two:
  recognizing relations in POSS(T)
  query answering (i.e., computing certain answers)

Recognising tables in POSS(T)
Input: a table T, a relation R
Output: 1 if $R \in \mathrm{POSS}(T)$, 0 otherwise
Complexity depends on what type of table T is:
  If T is a Codd table, there is a polynomial $O(n^2 \sqrt{n})$ algorithm (bipartite graph matching).
  If T is a naive table, the problem is NP-complete (3-colorability reduction; blackboard).
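One way to phrase the Codd-table case as a matching problem, sketched in Python. The slides only name the technique, so the formulation below is an assumption: under CWA with set semantics, R is in POSS(T) when every row of T is compatible with some tuple of R and there is a matching that gives every tuple of R its own row of T. The sketch uses simple augmenting paths rather than a faster matching algorithm, and writes variables as strings starting with "?".

```python
def is_var(cell):
    """Variables (nulls) are modelled as strings starting with '?' (an assumption)."""
    return isinstance(cell, str) and cell.startswith("?")


def compatible(t_row, r_row):
    """A Codd-table row can become r_row if all its constants agree with r_row."""
    return len(t_row) == len(r_row) and all(
        is_var(t) or t == r for t, r in zip(t_row, r_row)
    )


def in_poss_cwa(table, relation):
    """Decide whether relation is in POSS(table) for a Codd table (CWA, sets)."""
    t_rows = list(table)
    r_rows = list(set(relation))

    # Every row of T must turn into some tuple of R.
    for t in t_rows:
        if not any(compatible(t, r) for r in r_rows):
            return False

    # For each tuple of R, the rows of T that could produce it.
    producers = [
        [i for i, t in enumerate(t_rows) if compatible(t, r)] for r in r_rows
    ]
    owner = {}  # index of a T row -> index of the R tuple matched to it

    def augment(j, visited):
        """Try to give R tuple j its own T row, re-routing earlier matches."""
        for i in producers[j]:
            if i in visited:
                continue
            visited.add(i)
            if i not in owner or augment(owner[i], visited):
                owner[i] = j
                return True
        return False

    # Every tuple of R needs its own row of T: a matching saturating R.
    return all(augment(j, set()) for j in range(len(r_rows)))


T = [(0, 1), ("?x", 2)]
print(in_poss_cwa(T, {(0, 1), (3, 2)}))          # True
print(in_poss_cwa(T, {(0, 1)}))                  # False
print(in_poss_cwa(T, {(0, 1), (3, 2), (4, 2)}))  # False
```

The independence of rows is what makes this work: in a Codd table no variable occurs twice, so each row can be assigned a tuple of R without constraining any other row. With naive tables that independence is lost.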

Computing certain answers
Input: a table T, a tuple t
Output: 1 if $t \in \mathrm{certain}(Q, T)$, 0 otherwise
Complexity under CWA: coNP-complete.
  It is in coNP: just guess $R \in \mathrm{POSS}(T)$ so that $t \notin Q(R)$.
  It is complete for coNP: 3-colourability (blackboard).
Complexity under OWA: undecidable for relational algebra queries (the same as the validity problem in logic, which is undecidable), but in coNP for simpler classes of queries (e.g. conjunctive or σ, π, ⋈, ∪-queries).