modern database systems lecture 5 : top-k retrieval


modern database systems lecture 5 : top-k retrieval Aristides Gionis Michael Mathioudakis spring 2016

announcements
problem session on Monday, March 7, 2-4pm, at T2
solutions of the problems in homework 1
homework 2 will be out on Monday, Feb 29

today's lecture

Ronald Fagin, Amnon Lotem, Moni Naor. Optimal aggregation algorithms for middleware. Journal of Computer and System Sciences 66 (2003) 614-656. doi:10.1016/s0022-0000(03)00026-6

Ronald Fagin (corresponding author): IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA, fagin@almaden.ibm.com
Amnon Lotem: Department of Computer Science, University of Maryland-College Park, College Park, MD 20742, USA, lotem@cs.umd.edu
Moni Naor: Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 76100, Israel, naor@wisdom.weizmann.ac.il (the work of this author was performed while he was a Visiting Scientist at the IBM Almaden Research Center)

Received 6 September 2001; revised 1 April 2002. Extended abstract appeared in Proceedings of the 20th ACM Symposium on Principles of Database Systems, 2001 (PODS 2001), pp. 102-113. © 2003 Elsevier Science (USA). All rights reserved.

Abstract

Assume that each object in a database has m grades, or scores, one for each of m attributes. For example, an object can have a color grade, that tells how red it is, and a shape grade, that tells how round it is. For each attribute, there is a sorted list, which lists each object and its grade under that attribute, sorted by grade (highest grade first). Each object is assigned an overall grade, that is obtained by combining the attribute grades using a fixed monotone aggregation function, or combining rule, such as min or average. To determine the top k objects, that is, k objects with the highest overall grades, the naive algorithm must access every object in the database, to find its grade under each attribute. Fagin has given an algorithm ("Fagin's Algorithm", or FA) that is much more efficient. For some monotone aggregation functions, FA is optimal with high probability in the worst case. We analyze an elegant and remarkably simple algorithm ("the threshold algorithm", or TA) that is optimal in a much stronger sense than FA. We show that TA is essentially optimal, not just for some monotone aggregation functions, but for all of them, and not just in a high-probability worst-case sense, but over every database. Unlike FA, which requires large buffers (whose size may grow unboundedly as the database size grows), TA requires only a small, constant-size buffer. TA allows early stopping, which yields, in a precise sense, an approximate version of the top k answers. We distinguish two types of access: sorted access (where the middleware system obtains the grade of an object in some sorted list by proceeding through the list sequentially from the top), and random access (where the middleware system requests the grade of object in a list, and obtains it in one step). We consider the scenarios where random access is either impossible, or expensive relative to sorted access, and provide algorithms that are essentially optimal for these cases as well.

top-k retrieval
users specify information need via a query: SQL, MongoDB, keyword search, ...
too many data objects satisfy the query
present top-k objects
assumes ranking according to a relevance score
examples
find a flat to rent according to price, location, size, ...
find a flight according to price, departure and arrival time, number of stops, ...

top-k retrieval
consider the following scenario
data objects have different attributes
given a query, we can obtain a ranking of the objects according to the different attributes
a black-box subsystem for each attribute
want to combine (aggregate) the individual rankings into a single ranking
top-k is obtained from the aggregate ranking
aggregator is built on top of the subsystems
cannot modify the black-box subsystems
subsystems are viewed as middleware

middleware aggregation examples example 1: building a meta-search engine


middleware aggregation examples
example 2: image retrieval with multiple attributes
query: a photo in flickr; assume that it is geolocated in Helsinki and contains the tag "cinnamon roll"
[figure: ranked results returned by three subsystems: text search, color search, location search]

top-k aggregation abstraction
we are given a set of n objects $X_1,\dots,X_n$
each has a set of m attributes $A_1,\dots,A_m$
object i on attribute j has score $r_{ij}$
we typically assume $0 \le r_{ij} \le 1$
the higher the value of $r_{ij}$, the better the object $X_i$ according to attribute $A_j$
object i has overall score $f_i = f(r_{i1},\dots,r_{im})$
retrieve the top-k items according to score $f_i$

top-k aggregation example
$f = \max\{r_1, r_2, r_3\}$

      A1    A2    A3    f     rank
X1    0.1   0.4   0.7   0.7   3
X2    0.7   0.8   0.2   0.8   2
X3    0.3   0.5   0.6   0.6   4
X4    0.9   0.1   0.4   0.9   1
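The table above can be reproduced with a few lines of Python. This is only an illustrative sketch: the dictionary `scores` and the helper `rank_by` are names introduced here, not part of the lecture.

```python
# Scores r_ij from the example table: object -> (A1, A2, A3).
scores = {
    "X1": (0.1, 0.4, 0.7),
    "X2": (0.7, 0.8, 0.2),
    "X3": (0.3, 0.5, 0.6),
    "X4": (0.9, 0.1, 0.4),
}

def rank_by(f, scores):
    """Return (object, overall score) pairs sorted by decreasing f."""
    overall = {obj: f(r) for obj, r in scores.items()}
    return sorted(overall.items(), key=lambda item: item[1], reverse=True)

ranking = rank_by(max, scores)
print(ranking)
# [('X4', 0.9), ('X2', 0.8), ('X1', 0.7), ('X3', 0.6)]
```

Passing `min` or `sum` instead of `max` gives the ranking under a different aggregation function.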

sorted lists
we assume that objects are available in m sorted lists
this models our assumption about middleware subsystems
the j-th list corresponds to attribute $A_j$
the j-th list ranks all objects according to values $r_{ij}$

aggregation functions
score of object i is given by aggregation function $f_i = f(r_{i1},\dots,r_{im})$
common choices for $f$: min, average or sum
we typically assume monotonicity
an aggregate function $f$ is monotone if $f(r_1,\dots,r_m) \le f(r'_1,\dots,r'_m)$ whenever $r_j \le r'_j$ for all $j$
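As a rough illustration of the monotonicity condition, the sketch below tests it on a small grid of score vectors. The function name `is_monotone_on_grid` and the grid levels are assumptions made for this example; a finite check like this can only find counterexamples, it does not prove monotonicity.

```python
from itertools import product

def is_monotone_on_grid(f, m, levels=(0.0, 0.5, 1.0)):
    """Check f(r) <= f(r') for all grid points with r_j <= r'_j for every j."""
    points = list(product(levels, repeat=m))
    for r in points:
        for r_prime in points:
            dominated = all(a <= b for a, b in zip(r, r_prime))
            if dominated and f(r) > f(r_prime):
                return False  # found a counterexample
    return True

print(is_monotone_on_grid(min, 3))                    # True
print(is_monotone_on_grid(sum, 3))                    # True
print(is_monotone_on_grid(lambda r: r[0] - r[1], 2))  # False
```

The last function is not monotone because raising the second score lowers the overall score.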

modes of access and cost model
sorted access: can get objects in each list in decreasing order
cost to get the next object in a list: $C_S$
random access: can get the value of a specific object in a list
cost for a random access: $C_R$
middleware cost: cost for s sorted accesses and r random accesses is $s\,C_S + r\,C_R$

modes of access and cost model
what is $C_S$ and $C_R$ for the web meta-search engine setting?
$C_R = \infty$
no random access (NRA): special case of the model
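The cost model is simple enough to write down directly. The sketch below uses a hypothetical helper `middleware_cost`; the only subtlety it handles is the no-random-access case, where the unit cost of a random access is infinite but no random access is ever made.

```python
import math

def middleware_cost(s, r, c_s, c_r):
    """Cost s*C_S + r*C_R of s sorted accesses and r random accesses."""
    # Skip a term whose access count is zero, so that an infinite unit cost
    # (C_R = infinity when random access is impossible) does not turn
    # 0 * inf into NaN.
    sorted_part = s * c_s if s else 0.0
    random_part = r * c_r if r else 0.0
    return sorted_part + random_part

print(middleware_cost(9, 3, 1.0, 2.0))       # 15.0
print(middleware_cost(9, 0, 1.0, math.inf))  # 9.0
```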

example
compute top-2 for sum aggregation function

R1        R2        R3
X1 1.0    X2 0.8    X4 0.8
X2 0.8    X3 0.7    X3 0.6
X3 0.5    X1 0.3    X1 0.2
X4 0.3    X4 0.2    X5 0.1
X5 0.1    X5 0.1    X2 0.0

naive algorithm
for each object i, use the aggregation function to compute the score $f_i$
get the top-k according to all computed scores
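A minimal sketch of the naive algorithm on the five-object example from these slides. The list representation (pairs of object and score, best first) and the name `naive_top_k` are choices made for this illustration.

```python
import heapq

# Example lists from the slides: (object, score) pairs, highest score first.
R1 = [("X1", 1.0), ("X2", 0.8), ("X3", 0.5), ("X4", 0.3), ("X5", 0.1)]
R2 = [("X2", 0.8), ("X3", 0.7), ("X1", 0.3), ("X4", 0.2), ("X5", 0.1)]
R3 = [("X4", 0.8), ("X3", 0.6), ("X1", 0.2), ("X5", 0.1), ("X2", 0.0)]

def naive_top_k(lists, f, k):
    """Read every list completely, score every object, keep the k best."""
    grades = {}  # object -> list of its m attribute scores
    for j, lst in enumerate(lists):
        for obj, value in lst:
            grades.setdefault(obj, [0.0] * len(lists))[j] = value
    # nlargest picks the k highest (score, object) pairs.
    return heapq.nlargest(k, ((f(g), obj) for obj, g in grades.items()))

top2 = naive_top_k([R1, R2, R3], sum, k=2)
print([obj for _, obj in top2])  # ['X3', 'X2']
```

The sketch touches all n*m entries, which is exactly the cost the following algorithms try to avoid.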

naive algorithm questions : do we need to compute the score for every object in the database? can we safely ignore some objects whose scores are lower than what we already have?

Fagin's algorithm (FA)
1. perform sorted accesses in all lists in parallel until there are k objects that have been seen in all lists
2. perform random accesses to obtain the scores of all objects seen so far
3. compute score for all objects and find the top-k
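The three steps can be sketched as follows, again on the five-object example from these slides. The function name `fagin_top_k`, the access counters, and the use of a dictionary per list to stand in for random access are assumptions of this sketch, not the lecture's implementation.

```python
import heapq

# Example lists from the slides: (object, score) pairs, highest score first.
R1 = [("X1", 1.0), ("X2", 0.8), ("X3", 0.5), ("X4", 0.3), ("X5", 0.1)]
R2 = [("X2", 0.8), ("X3", 0.7), ("X1", 0.3), ("X4", 0.2), ("X5", 0.1)]
R3 = [("X4", 0.8), ("X3", 0.6), ("X1", 0.2), ("X5", 0.1), ("X2", 0.0)]

def fagin_top_k(lists, f, k):
    m, n = len(lists), len(lists[0])
    seen = {}  # object -> {list index: value}
    depth = sorted_accesses = 0
    # Step 1: sorted access in parallel until k objects were seen in all lists.
    while depth < n and sum(len(v) == m for v in seen.values()) < k:
        for j, lst in enumerate(lists):
            obj, value = lst[depth]
            seen.setdefault(obj, {})[j] = value
            sorted_accesses += 1
        depth += 1
    # Step 2: random access for every missing value of every seen object.
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    random_accesses = 0
    for obj, known in seen.items():
        for j in range(m):
            if j not in known:
                known[j] = lookup[j][obj]
                random_accesses += 1
    # Step 3: score all seen objects and keep the k best.
    scored = ((f([known[j] for j in range(m)]), obj) for obj, known in seen.items())
    top = [obj for _, obj in heapq.nlargest(k, scored)]
    return top, sorted_accesses, random_accesses

print(fagin_top_k([R1, R2, R3], sum, k=2))  # (['X3', 'X2'], 9, 3)
```

On this input the sorted phase stops at depth 3, once X1 and X3 have been seen in all three lists; X5 is never touched.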

Fagin's algorithm example
compute top-2 for sum aggregation function

R1        R2        R3
X1 1.0    X2 0.8    X4 0.8
X2 0.8    X3 0.7    X3 0.6
X3 0.5    X1 0.3    X1 0.2
X4 0.3    X4 0.2    X5 0.1
X5 0.1    X5 0.1    X2 0.0

[figure: entries read by sorted access and by random access are highlighted step by step]
X5 cannot be in the top-2. why? monotonicity

Fagin's algorithm is correct
assume object y was not seen at all
object y has values $y_1,\dots,y_m$
assume object x is one of the k objects seen by FA in all lists during sorted access
object x has values $x_1,\dots,x_m$
for all attributes j it is: $y_j \le x_j$
therefore $f_y = f(y_1,\dots,y_m) \le f(x_1,\dots,x_m) = f_x$
so at least k seen objects score at least as high as any unseen object
for all objects seen the values of all attributes are known
thus, top-k returns the correct results

Fagin's algorithm
note: the correctness proof assumes only monotonicity
Fagin's algorithm is correct for any monotone aggregation function

can we do better?
yes! threshold algorithm
also proposed by Fagin

the threshold algorithm (TA)
1. do a sorted access in parallel to each of the m sorted lists
2. for each object x seen under sorted access:
   1. retrieve all of its values $x_1,\dots,x_m$ by random access
   2. compute $f_x = f(x_1,\dots,x_m)$
   3. if this is one of the top-k answers so far, remember it
3. for the j-th list, let $\hat{x}_j$ be the value of the last object seen under sorted access
4. define the threshold value to be $\tau = f(\hat{x}_1,\dots,\hat{x}_m)$
5. when k objects have been seen whose score is at least $\tau$, then stop
6. return the top-k answers
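The steps above can be sketched in Python on the five-object example from these slides. The name `threshold_algorithm`, the per-list dictionaries standing in for random access, and the returned trace are assumptions of this sketch. Only a buffer of at most k objects is kept, so an object that was evicted earlier may be scored again when it reappears.

```python
# Example lists from the slides: (object, score) pairs, highest score first.
R1 = [("X1", 1.0), ("X2", 0.8), ("X3", 0.5), ("X4", 0.3), ("X5", 0.1)]
R2 = [("X2", 0.8), ("X3", 0.7), ("X1", 0.3), ("X4", 0.2), ("X5", 0.1)]
R3 = [("X4", 0.8), ("X3", 0.6), ("X1", 0.2), ("X5", 0.1), ("X2", 0.0)]

def threshold_algorithm(lists, f, k):
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    top = {}     # bounded buffer: at most k objects with the best scores so far
    trace = []   # (depth, rounded threshold, current top-k) after each round
    for depth in range(len(lists[0])):
        last_seen = []
        for lst in lists:
            obj, value = lst[depth]        # sorted access
            last_seen.append(value)
            if obj not in top:
                # random accesses to fetch all values of the object
                top[obj] = f([lookup[i][obj] for i in range(m)])
                if len(top) > k:
                    del top[min(top, key=top.get)]  # evict the worst object
        threshold = f(last_seen)
        ranking = sorted(top, key=top.get, reverse=True)
        trace.append((depth + 1, round(threshold, 2), ranking))
        if len(top) == k and min(top.values()) >= threshold:
            break  # k objects with score at least the threshold
    return ranking, trace

ranking, trace = threshold_algorithm([R1, R2, R3], sum, k=2)
print(ranking)  # ['X3', 'X2']
for row in trace:
    print(row)
```

For this input the thresholds after depths 1, 2 and 3 are 2.6, 2.1 and 1.0, and the algorithm stops at depth 3.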

threshold algorithm example
compute top-2 for sum aggregation function

R1        R2        R3
X1 1.0    X2 0.8    X4 0.8
X2 0.8    X3 0.7    X3 0.6
X3 0.5    X1 0.3    X1 0.2
X4 0.3    X4 0.2    X5 0.1
X5 0.1    X5 0.1    X2 0.0

depth 1: threshold 2.6; top-k: X2 1.6, X1 1.5
depth 2: threshold 2.1; top-k: X3 1.8, X2 1.6
depth 3: threshold 1.0; top-k: X3 1.8, X2 1.6 (both scores are at least the threshold, so TA stops)

threshold algorithm is correct
assume object y was not seen at all
object y has values $y_1,\dots,y_m$
assume object x is one of the top-k objects returned by TA
object x has values $x_1,\dots,x_m$
for all attributes j it is: $y_j \le \hat{x}_j$, since y lies below the last object seen in every list
by the stopping rule, $f_x \ge f(\hat{x}_1,\dots,\hat{x}_m)$
therefore $f_y = f(y_1,\dots,y_m) \le f(\hat{x}_1,\dots,\hat{x}_m) \le f(x_1,\dots,x_m) = f_x$
for all objects seen the values of all attributes are known
thus, top-k returns the correct results

threshold algorithm properties
TA is correct for any monotone aggregation function
TA uses a bounded-size buffer, independent of the size of the database
TA is optimal in a very strong sense: it is as good as any other algorithm on every instance (instance optimal)
"any other algorithm" means: except pathological algorithms
"as good" means: within a constant factor
"pathological" means: making wild guesses

instance optimality
let $\mathcal{A}$ be a class of algorithms
let $\mathcal{D}$ be a class of legal inputs (datasets)
for $A \in \mathcal{A}$ and $D \in \mathcal{D}$ we consider the performance cost $\mathrm{cost}(A, D)$
definition: an algorithm $B \in \mathcal{A}$ is called instance optimal if for any algorithm $A \in \mathcal{A}$ and any dataset $D \in \mathcal{D}$ it is $\mathrm{cost}(B, D) = O(\mathrm{cost}(A, D))$
that is, there are constants c and d such that $\mathrm{cost}(B, D) \le c \cdot \mathrm{cost}(A, D) + d$

instance optimality
instance optimality is a very strong notion
we are comparing a deterministic algorithm against all possible nondeterministic algorithms
consider search on a sorted list
binary search is worst-case optimal; however, it is not instance optimal
there is a nondeterministic algorithm that finds an object with one probe, or finds that the object does not exist with two probes
but such a nondeterministic algorithm makes wild guesses

instance optimality of TA
assume that the aggregation function $f$ is monotone
let $\mathcal{D}$ be the class of all databases
let $\mathcal{A}$ be the class of all algorithms that correctly find the top k answers for $f$ for every database and that do not make wild guesses
then TA is instance optimal over $\mathcal{A}$ and $\mathcal{D}$

instance optimality of TA
proof sketch
let A be any algorithm that runs over a database D s.t. it returns the correct top-k and it does not make wild guesses
let $d = \max_{1 \le j \le m} d_j$ be the maximum depth of A
assume that A sees a distinct objects
then, since A makes no wild guesses, $a \ge d$
cost of A is at least $a\,C_S$

instance optimality of TA
proof sketch
[figure: execution of A on lists $A_1, A_2, \dots, A_m$, read to depths $d_1, d_2, \dots, d_m$]
maximum depth: $d = \max_{1 \le j \le m} d_j$
cost: at least $a\,C_S$, with $a \ge d$
claim: TA reaches maximum depth at most $a + k$

instance optimality of TA
proof sketch
assume claim true (TA reaches maximum depth at most $a + k$)
cost of TA is at most $(a + k)\,m\,C_S + (a + k)\,m(m-1)\,C_R$
or $a\,m\,C_S + a\,m(m-1)\,C_R + \big(k\,m\,C_S + k\,m(m-1)\,C_R\big)$
last term is a constant
optimality ratio between A and TA is
$\frac{a\,m\,C_S + a\,m(m-1)\,C_R}{a\,C_S} = m + m(m-1)\frac{C_R}{C_S}$
that is, a constant ratio
QED modulo claim
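A quick numeric check of the bound may help. The helper names and the sample values of a, k, m, C_S and C_R below are made up for illustration; the formulas are the ones from the proof sketch.

```python
def ta_cost_bound(a, k, m, c_s, c_r):
    """Upper bound (a + k) m C_S + (a + k) m (m - 1) C_R on the cost of TA."""
    depth = a + k
    return depth * m * c_s + depth * m * (m - 1) * c_r

def optimality_ratio(m, c_s, c_r):
    """Constant m + m (m - 1) C_R / C_S from the proof sketch."""
    return m + m * (m - 1) * c_r / c_s

a, k, m, c_s, c_r = 10, 2, 3, 1.0, 2.0
additive = k * m * c_s + k * m * (m - 1) * c_r  # does not depend on a
bound = ta_cost_bound(a, k, m, c_s, c_r)
# The bound equals ratio * (lower bound a*C_S on the cost of A) + constant.
print(bound, optimality_ratio(m, c_s, c_r) * a * c_s + additive)  # 180.0 180.0
```

The ratio depends only on m and C_R / C_S, not on the size of the database.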

instance optimality of TA
proof sketch
(main case: we show that TA reaches max depth at most a)
(bound a + k is shown in corner cases)
let Y be the output of A (consisting of top-k objects)
let $\hat{x}_1,\dots,\hat{x}_m$ be the values of objects at the end of each list when A terminates
define $\tau_A = f(\hat{x}_1,\dots,\hat{x}_m)$
an object y is called big if $f_y \ge \tau_A$
claim: all objects $y \in Y$ are big

instance optimality of TA
proof sketch
[figure: execution of A on database D; lists $A_1, A_2, \dots, A_m$ are read to depths $d_1, d_2, \dots, d_m$, where the values are $\hat{x}_1, \hat{x}_2, \dots, \hat{x}_m$]
consider database $D'$ with a planted object x with values $\hat{x}_1,\dots,\hat{x}_m$, placed in each list below the depth reached by A
execution of A on D and $D'$ is identical
by correctness of A we get $f_y \ge f_x = f(\hat{x}_1,\dots,\hat{x}_m)$ for all $y \in Y$

instance optimality of TA
proof sketch
when TA reaches depth $d \le a$ it has seen all objects in Y
since all objects in Y are big (they have value at least the threshold) TA will halt
QED

restricting sorted access
assume that a subset of the lists is not accessible under sorted access mode
TA can be easily modified to handle such a scenario
define $\tau = f(\hat{x}_1,\dots,\hat{x}_m)$ where $\hat{x}_j = 1$ for all inaccessible lists
all lists that are inaccessible under sorted access are accessed only under random access mode
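A sketch of this modification on the five-object example from these slides, with R3 available only through random access. The function name `ta_restricted` and the parameter `sorted_ok` (indices of the lists that allow sorted access) are hypothetical names introduced here.

```python
# Example lists from the slides: (object, score) pairs, highest score first.
R1 = [("X1", 1.0), ("X2", 0.8), ("X3", 0.5), ("X4", 0.3), ("X5", 0.1)]
R2 = [("X2", 0.8), ("X3", 0.7), ("X1", 0.3), ("X4", 0.2), ("X5", 0.1)]
R3 = [("X4", 0.8), ("X3", 0.6), ("X1", 0.2), ("X5", 0.1), ("X2", 0.0)]

def ta_restricted(lists, f, k, sorted_ok):
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    top = {}          # bounded buffer of at most k objects
    thresholds = []   # rounded threshold after each round
    for depth in range(len(lists[0])):
        last_seen = [1.0] * m  # x-hat_j = 1 for lists without sorted access
        for j in sorted_ok:
            obj, value = lists[j][depth]   # sorted access
            last_seen[j] = value
            if obj not in top:
                top[obj] = f([lookup[i][obj] for i in range(m)])
                if len(top) > k:
                    del top[min(top, key=top.get)]
        threshold = f(last_seen)
        thresholds.append(round(threshold, 2))
        if len(top) == k and min(top.values()) >= threshold:
            break
    return sorted(top, key=top.get, reverse=True), thresholds

print(ta_restricted([R1, R2, R3], sum, k=2, sorted_ok=[0, 1]))
# (['X3', 'X2'], [2.8, 2.5, 1.8, 1.5])
```

Because the threshold is more pessimistic, the run needs one more round than TA with sorted access to all three lists.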

threshold algorithm: no sorted access in R3
compute top-2 for sum aggregation function

R1        R2        R3 (random access only)
X1 1.0    X2 0.8    X4 0.8
X2 0.8    X3 0.7    X3 0.6
X3 0.5    X1 0.3    X1 0.2
X4 0.3    X4 0.2    X5 0.1
X5 0.1    X5 0.1    X2 0.0

depth 1: threshold 2.8; top-k: X2 1.6, X1 1.5
depth 2: threshold 2.5; top-k: X3 1.8, X2 1.6
depth 3: threshold 1.8; top-k: X3 1.8, X2 1.6
depth 4: threshold 1.5; top-k: X3 1.8, X2 1.6 (both scores are at least the threshold, so TA stops)

restricting random access
perform sorted access on all lists in parallel; at depth d:
maintain the worst (last seen) scores $\hat{x}_1,\dots,\hat{x}_m$
for any object x seen in lists $\{1,\dots,j\}$:
$\mathrm{best}(x) = f(x_1,\dots,x_j,\hat{x}_{j+1},\dots,\hat{x}_m)$
$\mathrm{worst}(x) = f(x_1,\dots,x_j,0,\dots,0)$
top-k contains k objects with max worst scores at depth d (break ties using best)
$\tau$ = k-th worst score in top-k
object y is viable if $\mathrm{best}(y) > \tau$
stop when at least k distinct objects have been seen and no object outside top-k is viable
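A sketch of this no-random-access procedure on the five-object example from these slides. The name `nra_top_k` is an assumption of this sketch, and for simplicity it recomputes all bounds at every depth. It also treats objects that were never seen as candidates: their best possible score is f applied to the last seen values, so they must be non-viable before stopping. Only the identities of the top-k objects are returned, because their exact scores may still be unknown.

```python
# Example lists from the slides: (object, score) pairs, highest score first.
R1 = [("X1", 1.0), ("X2", 0.8), ("X3", 0.5), ("X4", 0.3), ("X5", 0.1)]
R2 = [("X2", 0.8), ("X3", 0.7), ("X1", 0.3), ("X4", 0.2), ("X5", 0.1)]
R3 = [("X4", 0.8), ("X3", 0.6), ("X1", 0.2), ("X5", 0.1), ("X2", 0.0)]

def nra_top_k(lists, f, k):
    m, n = len(lists), len(lists[0])
    known = {}  # object -> {list index: value seen under sorted access}
    top = []
    for depth in range(n):
        bottom = []  # last value seen in each list at this depth
        for j, lst in enumerate(lists):
            obj, value = lst[depth]
            known.setdefault(obj, {})[j] = value
            bottom.append(value)
        # worst: unknown values replaced by 0; best: replaced by the bottom value
        worst = {o: f([v.get(j, 0.0) for j in range(m)]) for o, v in known.items()}
        best = {o: f([v.get(j, bottom[j]) for j in range(m)]) for o, v in known.items()}
        ranked = sorted(known, key=lambda o: (worst[o], best[o]), reverse=True)
        top, rest = ranked[:k], ranked[k:]
        if len(top) < k:
            continue
        tau = worst[top[-1]]  # k-th worst score in top-k
        # An unseen object has best score f(bottom); it must not be viable either.
        if f(bottom) <= tau and all(best[o] <= tau for o in rest):
            return top, depth + 1
    return top, n

print(nra_top_k([R1, R2, R3], sum, k=2))  # (['X3', 'X2'], 3)
```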

approximate top-k
finding top-k objects approximately
for $\epsilon > 0$, an $\epsilon$-approximation of the top-k answers is a collection of k objects $x_1,\dots,x_k$ so that for any y not among them it is $(1 + \epsilon)\,f_{x_i} \ge f_y$
TA can be easily modified to an approximation algorithm
simply change the stopping rule into: when k objects have been seen whose score is at least $\tau / (1 + \epsilon)$, then stop
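The modified stopping rule can be sketched by adding one parameter to a TA loop. The name `approx_ta` and the parameter `eps` are chosen for this illustration; the data is the five-object example from these slides.

```python
# Example lists from the slides: (object, score) pairs, highest score first.
R1 = [("X1", 1.0), ("X2", 0.8), ("X3", 0.5), ("X4", 0.3), ("X5", 0.1)]
R2 = [("X2", 0.8), ("X3", 0.7), ("X1", 0.3), ("X4", 0.2), ("X5", 0.1)]
R3 = [("X4", 0.8), ("X3", 0.6), ("X1", 0.2), ("X5", 0.1), ("X2", 0.0)]

def approx_ta(lists, f, k, eps):
    """TA that stops once k objects score at least threshold / (1 + eps)."""
    m, n = len(lists), len(lists[0])
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    top = {}  # bounded buffer of at most k objects
    for depth in range(n):
        last_seen = []
        for lst in lists:
            obj, value = lst[depth]        # sorted access
            last_seen.append(value)
            if obj not in top:
                top[obj] = f([lookup[i][obj] for i in range(m)])
                if len(top) > k:
                    del top[min(top, key=top.get)]
        threshold = f(last_seen)
        if len(top) == k and min(top.values()) >= threshold / (1 + eps):
            return sorted(top, key=top.get, reverse=True), depth + 1
    return sorted(top, key=top.get, reverse=True), n

print(approx_ta([R1, R2, R3], sum, k=2, eps=0.0))  # (['X3', 'X2'], 3)
print(approx_ta([R1, R2, R3], sum, k=2, eps=1.0))  # (['X2', 'X1'], 1)
```

With eps = 1.0 the run stops after one round and returns X2 and X1. This is a valid approximation in the sense above, since twice the score of X1 (1.5) exceeds the best score in the database (1.8), but it is not the exact top-2.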

summary
rank aggregation and top-k algorithms
Fagin's algorithm and threshold algorithm
instance optimality
algorithm variants depending on cost model
next lecture (Michael): big data platforms