Consensus Answers for Queries over Probabilistic Databases. Jian Li and Amol Deshpande University of Maryland, College Park, USA

Similar documents
Information Retrieval Rank aggregation. Luca Bondi

Logic and Databases. Lecture 4 - Part 2. Phokion G. Kolaitis. UC Santa Cruz & IBM Research - Almaden

Histograms and Wavelets on Probabilistic Data

LOCAL STRUCTURE AND DETERMINISM IN PROBABILISTIC DATABASES. Theodoros Rekatsinas, Amol Deshpande, Lise Getoor

Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University

Understanding policy intent and misconfigurations from implementations: consistency and convergence

Computer-based Tracking Protocols: Improving Communication between Databases

Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases

Semantics Representation of Probabilistic Data by Using Topk-Queries for Uncertain Data

Managing Probabilistic Data with MystiQ: The Can-Do, the Could-Do, and the Can t-do

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

31.6 Powers of an element

STA 4273H: Statistical Machine Learning

Database Applications (15-415)

Uncertain Data Models

Summary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4

Understanding Clustering Supervising the unsupervised

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395

Semantics of Ranking Queries for Probabilistic Data and Expected Ranks

Estimating the Quality of Databases

Approximation Algorithms for Clustering Uncertain Data

Query Evaluation on Probabilistic Databases

Rank Aggregation for Similar Items

High Dimensional Indexing by Clustering

EECS730: Introduction to Bioinformatics

II TupleRank: Ranking Discovered Content in Virtual Databases 2

Mining Social Network Graphs

Efficient Query Evaluation over Temporally Correlated Probabilistic Streams

Asking the Right Questions in Crowd Data Sourcing

ECLT 5810 Clustering

Data Integration with Uncertainty

Scalar Aggregation in Inconsistent Databases

3 No-Wait Job Shops with Variable Processing Times

Combinatorial Search; Monte Carlo Methods

ECLT 5810 Data Preprocessing. Prof. Wai Lam

Association Pattern Mining. Lijun Zhang

Probabilistic/Uncertain Data Management

INDEPENDENT SCHOOL DISTRICT 196 Rosemount, Minnesota Educating our students to reach their full potential

ECLT 5810 Clustering

AUTHOR COPY. Probabilistic top-k range query processing for uncertain databases. Guoqing Xiao a,fanwu a,, Xu Zhou a and Keqin Li a,b

Biclustering with δ-pcluster John Tantalo. 1. Introduction

Ranked Retrieval in Uncertain and Probabilistic Databases

Uncertainties: Representation and Propagation & Line Extraction from Range data

Value Added Association Rules

Joint Entity Resolution

9.1 Cook-Levin Theorem

Query Evaluation on Probabilistic Databases

XML Data Exchange. Marcelo Arenas P. Universidad Católica de Chile. Joint work with Leonid Libkin (U. of Toronto)

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394

Classification with Diffuse or Incomplete Information

Database Applications (15-415)

UNIT 2. DATA PREPROCESSING AND ASSOCIATION RULES

Data Preprocessing. Slides by: Shree Jaswal

Monte Carlo Methods; Combinatorial Search

H1 Spring B. Programmers need to learn the SOAP schema so as to offer and use Web services.

Ch9: Exact Inference: Variable Elimination. Shimi Salant, Barak Sternberg

Probabilistic Graph Summarization

Discovery of Genuine Functional Dependencies from Relational Data with Missing Values

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18

Collaborative Filtering using Euclidean Distance in Recommendation Engine

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp

Combining Fuzzy Information - Top-k Query Algorithms. Sanjay Kulhari

Decision Tree CE-717 : Machine Learning Sharif University of Technology

Deterministic Algorithms for Rank Aggregation and Other Ranking and Clustering Problems

Sketching Probabilistic Data Streams

DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY

2. Data Preprocessing

Structural characterizations of schema mapping languages

Implementation of Relational Operations. Introduction. CS 186, Fall 2002, Lecture 19 R&G - Chapter 12

Submodularity Reading Group. Matroid Polytopes, Polymatroid. M. Pawan Kumar

Schema Design for Uncertain Databases

Probabilistic Graphical Models

Prove, where is known to be NP-complete. The following problems are NP-Complete:

Ranking Functions. Linear-Constraint Loops

CS 664 Image Matching and Robust Fitting. Daniel Huttenlocher

Introduction to Randomized Algorithms

Managing Uncertainty in Data Streams. Aleka Seliniotaki Project Presentation HY561 Heraklion, 22/05/2013

Private Database Queries Using Somewhat Homomorphic Encryption. Dan Boneh, Craig Gentry, Shai Halevi, Frank Wang, David J. Wu

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394

Lecture 3: Motion Planning 2

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

DM545 Linear and Integer Programming. Lecture 2. The Simplex Method. Marco Chiarandini

Case-Based Label Ranking

1 Introduction. 1. Prove the problem lies in the class NP. 2. Find an NP-complete problem that reduces to it.

Probabilistic Databases: Diamonds in the Dirt

Searching non-text information objects

Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs

On Reduct Construction Algorithms

Microdata Publishing with Algorithmic Privacy Guarantees

Mixture Models and the EM Algorithm

Discrete Optimization. Lecture Notes 2

Rank Aggregation Methods for the Web

arxiv:cs/ v1 [cs.db] 5 Apr 2002

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Lecture 2: NP-Completeness

Clustering. Chapter 10 in Introduction to statistical learning

Phil 320 Chapter 7: Recursive sets and relations Note: 0. Introduction Significance of and main objectives for chapter 7:

NP-Complete Reductions 2

Creating Probabilistic Databases from Information Extraction Models

Transcription:

Consensus Answers for Queries over Probabilistic Databases Jian Li and Amol Deshpande University of Maryland, College Park, USA

Probabilistic Databases Motivation: Increasing amounts of uncertain data Sensor Networks; Information Networks Noisy input data; measurement errors; incomplete data Prevalent use of probabilistic modeling techniques Data Integration and Information Extraction Need to model reputation, trust, and data quality Increasing use of automated tools for schema mapping etc. Probabilistic databases Annotate tuples with existence probabilities, and attribute values with probability distributions Propagate probabilities through query execution Interpretation according to the "possible worlds semantics"

Semantics of Query Processing Prob DB: D should be efficient Answer(D) combine All Possible Worlds: pw1, pw2,... Exponential Size!! Query All Possible Answers: Answer(pw1), Answer(pw2),

Semantics of Query Processing How to Combine? Allow probabilistic answers. Return all possible tuples along with prob. [Dalvi, Suciu 04] Return tuples with annotations [Green et al. 06] What if we want a single deterministic answer? Probabilistic thresholding [Dalvi, Suciu 04] Return all tuples s.t. t appears in the answer w.p. >=Threshold Sampling Top-k queries?

Semantics of Top-k Queries pw 1: t 1 t 2 t 3 pw 2: t 1 t 3 t 4 pw 3: t 2 t 3 t 5 pw 4: t 2 t 4 t 5 Many prior proposals for combining them U-top-k, U-rank-k [Soliman et al. 07] Probabilistic Threshold (PT-k) [Hua et al. 08] Global-top-k [Zhang et al. 08] Expected Rank [Cormode et al. 09] Parameterized Ranking Function (PRF) [Li et al. 09] But, formal semantics are lacking.

Consensus Answers Think of each possible answer as a point in the space. Suppose d() is a distance metric between answers. Consensus Answers: A single deterministic answer where pw is the answer for the possible world pw Mean Answers: Median Answers: is the set of feasible answers is the set of possible answers

Consensus Answers the median answer pw 6: t 3 t 6 t 7 pw 1: t 1 t 2 t 7 answer 6 w.p. 0.2 answer 1 w.p. 0.1 ma: t 1 t 4 t 5 answer 5 w.p. 0.05 pw 3: t 1 t 3 t 4 answer 3 w.p. 0.15 the mean answer pw 5: t 3 t 4 t 5 answer 2 w.p. 0.3 Centroid / Center of Mass answer 4 w.p. 0.2 pw 2: t 1 t 2 t 4 pw 4: t 2 t 3 t 4

Related Work Rank Aggregation [Dwork et al. 01], [Ailon 07] Original work in voting systems [Condorcet 1785] Goal: Combine rankings provided by different experts Consensus Clustering [Ailon et al. 08] Goal: Aggregate a set of clusterings to minimize the disagreements Probabilistic Query Processing Dichotomy result: Conjunctive query evaluation is either PTIME or #P-Complete [Dalvi, Suciu 04] Finding consensus answers a much harder problem (NP-hard even if there is a safe plan)

Outline Problem Definition: Consensus Answers Models: BID, Probabilistic and/xor tree Set Distance Metrics Top-k Queries Other Types of Queries Conclusion

Probabilistic Database Models Tuple-independence Model The existence of each tuple is independent of other tuples Block-independent Disjoint (BID) Scheme Key Attr 1 Prob 1 500 0.5 1 950 0.3 2 20 0.3 2 30 0.2 3 150 0.2 3 200 0.8 Tuples with the same key are mutually exclusive.

Probabilistic Database Models Probabilistic And/Xor Trees Capture two types of correlations: mutual exclusitivity and coexistence. And node: Xor nodes: V V V 0.5 0.3 0.3 0.2 0.2 0.8 V (1,500) (1,950) (2,20) (2,30) (3,150) (3,200) Possible Worlds Pr (3,150) 0.02 (3,200) 0.08. (1,500);(2,20);(3,150) 0.03 (1,950);(2,20);(3,150) 0.018.

Probabilistic Database Models Probabilistic And/Xor Trees Capture two types of correlations: mutual exclusitivity and coexistence. And node: Xor nodes: V V V 0.5 0.3 0.3 0.2 0.2 0.8 V (1,500) (1,950) (2,20) (2,30) (3,150) (3,200) Possible Worlds Pr (3,150) 0.02 (3,200) 0.08. (1,500);(2,20);(3,150) 0.03 (1,950);(2,20);(3,150) 0.018. (1-0.5-0.3)*(1-0.3-0.2)*0.2=0.02

Probabilistic Database Models Probabilistic And/Xor Trees Xor node: V And nodes: 0.5 0.3 0.2 Possible Worlds Pr V V V (1,20);(2,50) 0.5 (2,20);(3,35) 0.3 (1,30);(3,25) 0.2 (1,20) (2,50) (2,20) (3,35) (1,30) (3,25) And/Xor trees can represent any finite set of possible worlds (not necessarily compact).

Computing Probabilities on And/Xor Trees Generating Function Method: Leaves: x y x z And Node: V F 1 (x,y, )F 2 (x,y, )F 3 (x,y, ) F 1 (x,y, ) F 2 (x,y, ) F 3 (x,y, ) Xor Node: V q+p 1 F 1 (x,y, )+p 2 F 2 (x,y, )+p 3 F 3 (x,y, ) p 1 p 2 p 3 q=1-p 1 -p 2 -p 3 F 1 (x,y, ) F 2 (x,y, ) F 3 (x,y, )

Computing Probabilities on And/Xor Trees Generating Function Method: Root: F(x,y, )= ij c ij x i y j. THM: The coefficient c ij of the term x i y j. = total prob of the possible worlds which contain i tuples annotated with x, j tuples annotated with y,

Computing Probabilities on And/Xor Trees Example: Computing the prob. dist. of the size of the pw (0.2+0.8x)(0.5+0.5x)x = 0.4 x 3 +0.5 x 2 +0.1 x V Pr( pw =3)=0.4 Pr( pw =2)=0.5 Pr( pw =1)=0.1 0.2+0.8x V 0.5+0.5x V x V 0.5 0.3 0.3 0.2 0.2 0.8 (1,500) (1,950) (2,20) (2,30) (3,150) (3,200) x x x x x x

Computing Probabilities on And/Xor Trees Example: Computing the rank distribution r(i) : the rank of tuple i. r(i)=j if and only if (1) j-1 tuples with higher scores appear (2) tuple i appears Pr(r(i)=j) = coeff of x j-1 y V (0.5+0.5x)(0.2+0.8x)(0.4+0.6y)(0.1+0.9x) 0.5+0.5x V 0.2+0.8x V 1 V 0.4+0.6y V 0.1+0.9x V 0.5 0.8 0.4 0.6 0.9 (1,500) (2,950) (3,200) (4,350) (5,550) x x 1 y x

Outline Problem Definition: Consensus Answers Models: BID, Probabilistic and/xor tree Set Distance Metrics Top-k Queries Other Types of Queries Conclusion

Set Distance Metrics Think of the relations (either existing or results of conjunctive queries) as sets. Symmetric Difference: THM: The mean answer under the symmetric difference distance is the set of all tuples with probability >0.5. THM: For conjunctive queries over tuple independent databases, finding the median answer under the symmetric difference distance is NP-Hard (even if the query has a safe plan). Reduction from MAX-2-SAT

Set Distance Metrics Jaccard Distance LM: For tuple independent databases, if the mean world contains tuple t 1 but not tuple t 2, then Pr (t 1 ) > Pr (t 2 ). Hence, suffices to sort by probabilities, and consider prefixes LM: For any fixed world W, E[d J (W,pw)] can be computed in polynomial time (using generating functions) Gives us a polynomial time algorithm

Outline Problem Definition: Consensus Answers Models: BID, Probabilistic and/xor tree Set Distance Metrics Top-k Queries Other Types of Queries Conclusion

Top-k Queries Symmetric Difference and Probabilistic Threshold Top-k (PT-k) Mean answer under - Find a k-tuple set minimizing PT-k: Find k tuples with largest THM: The two definitions are equivalent.

Top-k Queries Intersection Metric: [Fagin et al 03] i : top-i tuples of e.g. 1 : 5 4 6 3 1 d I ( 1, 2 )= 2 : 5 6 2 7 3 1/5(0 + 1/4*2 +1/6*2 + 1/8* 4 +1/10*4)

Top-k Queries Intersection Metric: [Fagin et al 03] For any fixed top-k answer, we have Thus we need to find which maximizes

Top-k Queries Intersection Metric: [Fagin et al 03] Where Reduce to the Max-weight Matching Problem: ranks: f(t 1,1) 1 2 3 i k tuples: t 1 t 2 t 3 t 4 t j t n

Top-k Queries Spearman s Footrule [Fagin et al. 03] Extension of traditional footrule distance to partial rankings Polynomial time algorithm (by reduction to min-cost matching) Kendall s tau Distance [Fagin et al. 03] Measures the number of inversions NP-hard [Dwork et al 01] Even for only four possible worlds 3/2-approximation By adapting the algorithm by [Ailon 07] Open question: The complexity for a tuple independent DBs

Outline Problem Definition: Consensus Answers Models: BID, Probabilistic and/xor tree Set Distance Metrics Top-k Queries Other Types of Queries Conclusion

Other Types of Queries Aggregate Queries SELECT groupname, count(*) FROM R GROUP BY groupname Distance: squared vector distance Mean answer is trivial: take average count for each group Median answer: 4-approximation Clustering A somewhat simplified model Distance: consensus clustering distance 4/3-approximation for finding the mean clustering

Conclusion Proposed the notion of Consensus Answers for probabilistic databases Lends precise and formal semantics to query answers Algorithms for finding consensus answers for many queries For the rich probabilistic and/xor tree model Future work: Examining utility of consensus answers in practice Handling other types of queries: range queries, frequent items, clustering Finding connections to existing query processing semantics

Thanks.