The interaction of theory and practice in database research


The interaction of theory and practice in database research Ron Fagin IBM Research Almaden 1

Purpose of This Talk: Encourage collaboration between theoreticians and system builders via two case studies, one initiated by the system builders and one by the theoreticians. For theoreticians: how to apply theory to practice, and why applying theory to practice can lead to better theory. For system builders: the value of theory, and the value of involving theoreticians. 2

System Builders / Practitioners / Engineers: Are strong software architects and programmers; understand system thinking; have a large vocabulary for software issues; work in teams to produce complex systems; have good insights into their problem domains. 3

Theoreticians: Are strong at precise, mathematical thinking; can generalize well and find insightful abstractions; can find good formal systems; know many algorithms and can come up with more; can prove properties of systems and algorithms. 4

First Case Study: Garlic (1996) 5

Garlic. Laura Haas: "Mr. Database Theoretician, we've got a problem with Garlic, our multimedia database system!" What was the problem? Example databases: the answers to queries in DB/2 are sets; the answers to queries in QBIC are sorted lists. How do you combine the results? 6

Example: Searching a CD database for Artist = 'Beatles' yields a set, via, say, DB/2. MusicBrainz has 12 million recordings in its DB. 7

Example: AlbumColor = 'Red' yields a sorted list, via, say, QBIC. [Figure: albums sorted by redness, with scores .697, .683, .670, .659, .629] 8

Example: How do we make sense of (Artist = 'Beatles') $\wedge$ (AlbumColor = 'Red')? Here it is probably a list of albums by the Beatles, sorted by how red they are. What about (Artist = 'Beatles') $\vee$ (AlbumColor = 'Red')? And what about (Color = 'Red') $\wedge$ (Shape = 'Round')? 9

What Was My Solution? These weren't just sorted lists: they were scored lists. Can view sets as scored lists (scores 0 or 1). This reminded me of fuzzy logic. In fuzzy logic, conjunction ($\wedge$) is min, and disjunction ($\vee$) is max. 10
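The idea of treating a set as a scored list and combining with min and max can be sketched in a few lines of Python. This is only an illustration: the album names and redness scores are hypothetical, and the helper names (`as_scored`, `fuzzy_and`, `fuzzy_or`) are not from the talk.

```python
def as_scored(members, universe):
    """View a plain set as a scored list: 1.0 for members, 0.0 otherwise."""
    return {obj: (1.0 if obj in members else 0.0) for obj in universe}


def fuzzy_and(a, b):
    """Fuzzy conjunction: the combined score is the min of the two scores."""
    return {obj: min(a[obj], b[obj]) for obj in a}


def fuzzy_or(a, b):
    """Fuzzy disjunction: the combined score is the max of the two scores."""
    return {obj: max(a[obj], b[obj]) for obj in a}


# Hypothetical data: redness scores as a QBIC-like system might return them,
# and the set of Beatles albums as a relational system might return it.
redness = {"album_a": 0.9, "album_b": 0.4, "album_c": 0.7}
beatles = as_scored({"album_a", "album_b"}, redness)

both = fuzzy_and(beatles, redness)      # Beatles albums, scored by redness
either = fuzzy_or(beatles, redness)     # Beatles albums or red albums
ranked = sorted(both, key=both.get, reverse=True)
print(ranked)
```

Under min, a non-Beatles album gets score 0 no matter how red it is, which matches the reading "albums by the Beatles, sorted by how red they are".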

Ron Fagin: "Use fuzzy logic." Laura Haas: "I like your solution. But we also need an efficient algorithm that can find the top k results while minimizing database accesses." Ron Fagin: "I have an algorithm that finds the top k with only $\sqrt{n}$ database accesses." Laura Haas: "Good, that beats linear! But we database people are spoiled, and are used to only log n accesses. Be smarter and get me a log n algorithm." Ron Fagin: "I proved that you can't do better than $\sqrt{n}$." 11

Time for the Accesses: Say n = 12,000,000 CDs. Assume 1000 accesses per second. Then n accesses (naïve algorithm) would take about 3 hours, while $\sqrt{n}$ accesses would take about 3 seconds. 12
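As a rough check of these figures, assuming the faster algorithm needs on the order of $\sqrt{n}$ accesses and ignoring constant factors:

```latex
\[
\frac{n}{1000\ \text{accesses/s}} = \frac{12{,}000{,}000}{1000}\ \text{s}
  = 12{,}000\ \text{s} \approx 3.3\ \text{hours},
\qquad
\frac{\sqrt{n}}{1000\ \text{accesses/s}} \approx \frac{3464}{1000}\ \text{s}
  \approx 3.5\ \text{s}.
\]
```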

Generalizing the Algorithm: The algorithm works for arbitrary monotone scoring functions, that is, functions where increasing the scores of arguments cannot decrease the overall score. 13

Influence: Algorithm implemented in Garlic. Influenced other IBM products, including Watson Bundled Search system, InfoSphere Federation Server, and WebSphere Commerce. Paper introducing my algorithm (now called "Fagin's Algorithm") has over 800 citations (Google Scholar). 14

The Threshold Algorithm (Amnon Lotem, Moni Naor, Ron Fagin): In 2001, we found the Threshold Algorithm. 15

The Problem: There are m attributes. Each object in a database has a score $x_i$ for attribute i. The objects are given in m sorted lists, one list per attribute. Goal: find the top k objects according to a monotone scoring function, while minimizing access to the lists. Can think of the attributes as voters, and the objects as candidates, where each voter assigns a score to each candidate. 16

Multimedia Example: REDNESS: 177: 0.993, 139: 0.991, 702: 0.982, ..., 235: 0.325. ROUNDNESS: 235: 0.999, 666: 0.996, 820: 0.992, ..., 177: 0.406. 17

Scoring Functions: Let f be the scoring function. Popular choices for f: min (used in fuzzy logic) and average. Let $x_1, \ldots, x_m$ be the scores of object R under the m attributes. Then $f(x_1, \ldots, x_m)$ is the overall score of object R. Sometimes write f(R) to mean $f(x_1, \ldots, x_m)$. A scoring function f is monotone if whenever $x_i \le y_i$ for every i, then $f(x_1, \ldots, x_m) \le f(y_1, \ldots, y_m)$. 18
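The monotonicity condition can be spot-checked by brute force on a small grid of scores. The sketch below is illustrative only: `is_monotone_on_grid` and `average` are hypothetical helpers, and passing the check on a grid does not prove monotonicity in general.

```python
from itertools import product


def is_monotone_on_grid(f, m, grid):
    """Check f(x) <= f(y) for all grid points x, y with x_i <= y_i for every i."""
    points = list(product(grid, repeat=m))
    for x in points:
        for y in points:
            dominated = all(xi <= yi for xi, yi in zip(x, y))
            if dominated and f(*x) > f(*y):
                return False
    return True


def average(*scores):
    return sum(scores) / len(scores)


grid = [0.0, 0.5, 1.0]
min_ok = is_monotone_on_grid(min, 2, grid)
avg_ok = is_monotone_on_grid(average, 2, grid)
# A difference of scores is not monotone: raising the second argument lowers it.
diff_ok = is_monotone_on_grid(lambda a, b: a - b, 2, grid)
print(min_ok, avg_ok, diff_ok)
```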

Modes of Access: Sorted (or sequential) access: can obtain the next object and its score for attribute i. Random access: can obtain the score of object R for attribute i. Wish to minimize total number of accesses. 19

Algorithms: Want an algorithm for finding the top k objects. Naïve algorithm retrieves every score of every object: too expensive. 20

Threshold Algorithm: (1) Do sorted access in parallel to each of the m scored lists. As each object R is seen under sorted access: do random access to retrieve all of its scores $x_1, \ldots, x_m$; compute its overall score $f(x_1, \ldots, x_m)$; if this is one of the top k answers so far, remember it. (2) For each list i, let $t_i$ be the score of the last object seen under sorted access. Define the threshold value T to be $f(t_1, \ldots, t_m)$. When k objects have been seen whose overall score is at least T, stop. (3) Return the top k answers. 21
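The steps of the Threshold Algorithm can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes every object appears in every list, simulates random access with a dictionary per attribute, and advances all lists one position per round. The six-object data set reuses the scores shown in the multimedia example, but the entries marked "hypothetical" are invented to complete the lists.

```python
import heapq


def threshold_algorithm(lists, f, k):
    """lists: m lists of (object, score), each sorted by score descending.
    Returns (top k as (object, overall score), rounds of sorted access)."""
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    seen = {}                               # object -> overall score
    depth = 0
    while depth < len(lists[0]):
        last = []
        for lst in lists:
            obj, score = lst[depth]         # sorted access
            last.append(score)
            if obj not in seen:
                # random access to all attributes of the newly seen object
                seen[obj] = f(*(table[obj] for table in lookup))
        depth += 1
        threshold = f(*last)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # halting rule: k seen objects score at least the threshold
        if len(top) == k and all(s >= threshold for _, s in top):
            return top, depth
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1]), depth


redness = [(177, 0.993), (139, 0.991), (702, 0.982),
           (666, 0.60), (820, 0.50),        # hypothetical scores
           (235, 0.325)]
roundness = [(235, 0.999), (666, 0.996), (820, 0.992),
             (139, 0.70), (702, 0.65),      # hypothetical scores
             (177, 0.406)]

top, depth = threshold_algorithm([redness, roundness], min, 1)
print(top, depth)
```

With these made-up middle entries, the threshold drops from 0.993 to 0.991 to 0.982 over the first three rounds, as in the slides, and the algorithm halts in the fourth round once the threshold falls to 0.60, below the best overall score seen.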

Threshold Algorithm: Example (using min). REDNESS: 177: 0.993. ROUNDNESS: 235: 0.999. Scoring function is min. 22

Threshold Algorithm: Example (using min). REDNESS: 177: 0.993. ROUNDNESS: 235: 0.999, ..., 177: 0.406. 23

Threshold Algorithm: Example (using min). REDNESS: 177: 0.993, ..., 235: 0.325. ROUNDNESS: 235: 0.999, ..., 177: 0.406. 24

Threshold Algorithm: Example (using min). REDNESS: 177: 0.993, ..., 235: 0.325. ROUNDNESS: 235: 0.999, ..., 177: 0.406. Overall score for 177: min(0.993, 0.406) = 0.406. Overall score for 235: min(0.325, 0.999) = 0.325. 25

Threshold Algorithm: Example (using min). REDNESS: 177: 0.993, ..., 235: 0.325. ROUNDNESS: 235: 0.999, ..., 177: 0.406. Overall score for 177: min(0.993, 0.406) = 0.406. Overall score for 235: min(0.325, 0.999) = 0.325. Threshold value: min(0.993, 0.999) = 0.993. 26

Threshold Algorithm: Example (using min). REDNESS: 177: 0.993, 139: 0.991, ..., 235: 0.325. ROUNDNESS: 235: 0.999, 666: 0.996, ..., 177: 0.406. 27

Threshold Algorithm: Example (using min). REDNESS: 177: 0.993, 139: 0.991, ..., 235: 0.325. ROUNDNESS: 235: 0.999, 666: 0.996, ..., 177: 0.406. Threshold value: min(0.991, 0.996) = 0.991. 28

Threshold Algorithm: Example (using min). REDNESS: 177: 0.993, 139: 0.991, 702: 0.982, ..., 235: 0.325. ROUNDNESS: 235: 0.999, 666: 0.996, 820: 0.992, ..., 177: 0.406. 29

Threshold Algorithm: Example (using min). REDNESS: 177: 0.993, 139: 0.991, 702: 0.982, ..., 235: 0.325. ROUNDNESS: 235: 0.999, 666: 0.996, 820: 0.992, ..., 177: 0.406. Threshold value: min(0.982, 0.992) = 0.982. 30

Correctness of the Halting Rule: Suppose the current top k objects have scores at least T (the current threshold). Assume (by way of contradiction) that R is unseen, S is in the current top k, and f(R) > f(S). R has scores $x_1, \ldots, x_m$, with $x_i \le t_i$ for every i (as R has not been seen). Then, by monotonicity of f, $f(R) = f(x_1, \ldots, x_m) \le f(t_1, \ldots, t_m) = T \le f(S)$, a contradiction! 31

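Instance Optimality: $\mathbf{A}$ = class of algorithms, $\mathbf{D}$ = class of legal inputs. For $A \in \mathbf{A}$ and $D \in \mathbf{D}$ have $\mathrm{cost}(A, D) \ge 0$. An algorithm $A \in \mathbf{A}$ is instance optimal over $\mathbf{A}$ and $\mathbf{D}$ if there are constants $c_1$ and $c_2$ s.t. for every $A' \in \mathbf{A}$ and $D \in \mathbf{D}$: $\mathrm{cost}(A, D) \le c_1 \cdot \mathrm{cost}(A', D) + c_2$. $c_1$ is called the optimality ratio. 32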

Instance Optimality of TA Intuition about why TA is instance optimal: Cannot stop any sooner, since the next object to be explored might have the threshold value. But, life is a bit more delicate... 33

Wild Guesses: A wild guess is a random access for a field i of an object R that has not been sequentially accessed before. Neither FA nor TA uses wild guesses. Subsystem might not allow wild guesses. 34

Instance Optimality of TA. Theorem: For each monotone f, let $\mathbf{A}$ be the class of algorithms that correctly find the top k answers, with scoring function f, for every database, and that do not make wild guesses, and let $\mathbf{D}$ be the class of all databases. Then TA is instance optimal over $\mathbf{A}$ and $\mathbf{D}$. The optimality ratio is $m + m(m-1)\,c_R/c_S$, which is best possible! 35
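To get a feel for the size of this ratio, here is a small worked instance. It assumes that $c_R$ and $c_S$ denote the cost of one random access and one sorted access; the choice m = 2 and the two cost ratios are picked purely for illustration.

```latex
\[
m = 2,\ c_R = c_S:\quad m + m(m-1)\frac{c_R}{c_S} = 2 + 2 \cdot 1 \cdot 1 = 4,
\qquad
m = 2,\ c_R = 10\,c_S:\quad 2 + 2 \cdot 1 \cdot 10 = 22.
\]
```

So the guarantee weakens as random access becomes more expensive relative to sorted access.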

Ron Fagin (with Amnon Lotem and Moni Naor): "Our threshold algorithm is an even better algorithm (optimal in a stronger sense)." Laura Haas: "But Ron, you told me that your algorithm is optimal!?" Ron Fagin: "Well, Laura, there is optimal, and then there is optimal." 36

PODS: PODS is the premier database theory conference. Considered by the database community and a much broader community as a top-tier conference. Strangely, the China Computer Federation (CCF) rates PODS as a B-level conference. A search of conference rankings on the web shows that every ranking except CCF's gives PODS the highest ranking. 37

Influence: We submitted the paper to PODS '01. I was worried that the Threshold Algorithm was so simple that the paper would be rejected, so I called it a "remarkably simple algorithm". The paper won the PODS Best Paper Award! The paper was very influential: over 1800 citations (Google Scholar); PODS Test of Time Award in 2011; IEEE Technical Achievement Award in 2011; Gödel Prize in 2014; Gems of PODS 2016. 38

Applications of TA: relational databases, multimedia databases, music databases, semistructured databases, text databases, uncertain databases, probabilistic databases, graph databases, spatial databases, spatio-temporal databases, web-accessible databases, XML data, web text data, semantic web, high-dimensional datasets, information retrieval, fuzzy data sets, data streams, search auctions, wireless sensor networks, distributed sensor networks, distributed networks, social-tagging networks, document tagging systems, peer-to-peer systems, recommender systems, personal information management systems, group recommendation systems, document annotation. 39

Morals: How did theory help? Resolving Laura Haas's dilemma; knowledge of the literature (fuzzy logic); abstraction (using scoring functions); devising optimal algorithms and proving optimality. Figure out the real problem: for example, there are scores, not just sorted lists. Don't stop at the original problem: for example, doing a weighted version (with Ed Wimmers). Led to a successful and influential body of work. 40

Measures of Success: Making our products better: an ultimate measure of success for practitioners. Creating a new subfield: an ultimate measure of success for theoreticians. A paper that arose by resolving a practical problem won the Gödel Prize! 41

Second Case Study: Clio (2003) 42

Clio: Clio deals with data exchange, where we convert data from one format to another. When Laura Haas started Clio, I followed her. I attended Clio meetings for a year. Phokion Kolaitis, Renee Miller, Lucian Popa, and Ron Fagin: "Let's start from scratch and lay the foundations for data exchange!" 43

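Data Exchange: Translate data from source format to target format. [Diagram: a schema mapping $\Sigma$ relates source schema S to target schema T; a source instance I is translated into a target instance J] 44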

Data Exchange: Data exchange is an old, but recurrent, database problem. Phil Bernstein (2003): "Data exchange is the oldest database problem." EXPRESS (IBM San Jose Research Lab, 1977) transforms data between hierarchical databases. Data exchange underlies data warehousing, ETL (Extract-Transform-Load), ... 45

Example: Source relation EM(EMP, MGR); target relations ED(EMP, DEPT) and DM(DEPT, MGR). Relationship between source and target (the "schema mapping") specified by tuple-generating dependencies (tgds), originally used to help specify normal forms for relational databases: $EM(e, m) \rightarrow \exists d\, (ED(e, d) \wedge DM(d, m))$. 46

Example: Source EM(EMP, MGR) contains (Fagin, Haas), (Clarkson, Haas), (Haas, Welser). Target relations ED(EMP, DEPT) and DM(DEPT, MGR) are still empty. 47

Example: 3 Possible Solutions. Source EM(EMP, MGR): (Fagin, Haas), (Clarkson, Haas), (Haas, Welser). Solution 1: ED(EMP, DEPT) = (Fagin, Haas), (Clarkson, Haas), (Haas, Welser); DM(DEPT, MGR) = (Haas, Haas), (Welser, Welser). Solution 2: ED = (Fagin, $d_1$), (Clarkson, $d_1$), (Haas, $d_2$); DM = ($d_1$, Haas), ($d_2$, Welser). Solution 3: ED = (Fagin, $d_1$), (Clarkson, $d_2$), (Haas, $d_3$); DM = ($d_1$, Haas), ($d_2$, Haas), ($d_3$, Welser). 48

Which Solution Should We Produce? We define a universal solution to be one that is as general as possible. The third solution is universal. 49
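"As general as possible" is usually made precise through homomorphisms: a universal solution can be mapped into every other solution by replacing its labeled nulls while keeping constants fixed. The sketch below checks this by brute force for the three example solutions only. The fact encoding and the helper `find_homomorphism` are assumptions made for illustration, not code from the Clio project.

```python
from itertools import product


def find_homomorphism(src, dst, nulls):
    """Look for a map h on `nulls` (constants stay fixed) sending every fact
    of `src` to a fact of `dst`. Facts are (relation, value, value) triples.
    Brute force, so only suitable for tiny instances."""
    nulls = sorted(nulls)
    dst_values = sorted({v for (_, a, b) in dst for v in (a, b)})
    for image in product(dst_values, repeat=len(nulls)):
        h = dict(zip(nulls, image))
        if all((rel, h.get(a, a), h.get(b, b)) in dst for (rel, a, b) in src):
            return h
    return None


sol1 = {("ED", "Fagin", "Haas"), ("ED", "Clarkson", "Haas"), ("ED", "Haas", "Welser"),
        ("DM", "Haas", "Haas"), ("DM", "Welser", "Welser")}
sol2 = {("ED", "Fagin", "d1"), ("ED", "Clarkson", "d1"), ("ED", "Haas", "d2"),
        ("DM", "d1", "Haas"), ("DM", "d2", "Welser")}
sol3 = {("ED", "Fagin", "d1"), ("ED", "Clarkson", "d2"), ("ED", "Haas", "d3"),
        ("DM", "d1", "Haas"), ("DM", "d2", "Haas"), ("DM", "d3", "Welser")}

to_sol1 = find_homomorphism(sol3, sol1, {"d1", "d2", "d3"})
to_sol2 = find_homomorphism(sol3, sol2, {"d1", "d2", "d3"})
back = find_homomorphism(sol2, sol3, {"d1", "d2"})
print(to_sol1, to_sol2, back)
```

The third solution maps into both of the others, while the second cannot be mapped back into the third, which is consistent with the third being the most general of the three.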

Target Constraints: Might have target constraints specified by equality-generating dependencies (egds), like $(DM(d, m) \wedge DM(d', m)) \rightarrow (d = d')$. If this egd is a target constraint, then the second solution is universal. 50

How Do We Obtain a Universal Solution? There is a well-known mechanical procedure called the chase, originally used as a tool in database design. We use the chase to generate the target from the source efficiently. Example: $EM(e, m) \rightarrow \exists d\, (ED(e, d) \wedge DM(d, m))$. From EM(Fagin, Haas), create ED(Fagin, d) and DM(d, Haas), where d is a newly introduced labeled null. The egds tell when to equate labeled nulls. 51
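For this one tgd and one egd, the chase step can be sketched as below. It is a deliberately simplified illustration rather than a general chase engine: the function name `chase`, the null names `d1, d2, ...`, and the single-pass handling of the egd (each manager keeps the first null created for it) are all assumptions that happen to suffice for this example.

```python
from itertools import count


def chase(em_facts, apply_egd=False):
    """Apply EM(e, m) -> exists d (ED(e, d) and DM(d, m)) to each source fact,
    inventing a fresh labeled null per fact. Optionally enforce the egd
    DM(d, m) and DM(d', m) -> d = d' by merging nulls with the same manager."""
    fresh = (f"d{i}" for i in count(1))
    ed, dm = [], []
    for e, m in em_facts:
        d = next(fresh)            # newly introduced labeled null
        ed.append((e, d))
        dm.append((d, m))
    if apply_egd:
        canon = {}                 # manager -> representative null
        rep = {}                   # null -> representative null
        for d, m in dm:
            rep[d] = canon.setdefault(m, d)
        ed = [(e, rep[d]) for e, d in ed]
        dm = [(rep[d], m) for d, m in dm]
    return sorted(set(ed)), sorted(set(dm))


source = [("Fagin", "Haas"), ("Clarkson", "Haas"), ("Haas", "Welser")]
plain = chase(source)
with_egd = chase(source, apply_egd=True)
print(plain)
print(with_egd)
```

Without the egd, the result has one null per source fact, as in the third example solution. With the egd, Fagin and Clarkson share a department, which is the second example solution up to the naming of nulls (the sketch keeps `d3` where the slide writes $d_2$).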

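Composing Schema Mappings. [Diagram: schema mapping $\Sigma_{12}$ from Schema $S_1$ to Schema $S_2$, schema mapping $\Sigma_{23}$ from Schema $S_2$ to Schema $S_3$, and their composition $\Sigma_{13}$ from Schema $S_1$ to Schema $S_3$] With Phokion Kolaitis, Lucian Popa, and Wang-Chiew Tan, we studied composition of schema mappings. Composition can take us out of first-order logic! We found the right language for composition ("second-order tgds"). We gave an algorithm for composition. 52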

Measures of Success: Used in DB2 Control Center, Rational Data Architect, and Content Manager: using universal solutions, using our algorithm to produce a universal solution, and our algorithm to compose schema mappings. Our initial paper won the ICDT Test of Time Award in 2013. With over 1100 citations, our paper was the 2nd most highly cited paper of the decade in the journal TCS. Our follow-up paper on composition won the PODS Test of Time Award in 2014. This work created a new subfield: special sessions on data exchange in every major database conference. 53

Morals: How did theory help? Established principles rather than ad hoc approaches. Yielded algorithms for converting data, and for composing schema mappings. Theorists need a partner to keep us honest. Never too late to lay the foundations for an area, even for existing systems: can cause essential changes and improvement. Again, don't stop at the original problem. 54

Conclusions 55

Conclusions (for System Builders): Consult with theoreticians. Explaining the problem is useful by itself. Principled approaches can improve your product. Better or new algorithms can differentiate your product. Algorithm analysis can provide performance expectations and provide product guarantees. Abstractions can expand the function of your product. 56

Conclusions (for Theoreticians): Involvement with system builders can help your theory! Novel questions will be asked. New models and new, interesting areas of study will arise. Implementation can reveal weaknesses in the theory. Theory will be relevant. Practical impact! 57