modern database systems lecture 5 : top-k retrieval

Aristides Gionis, Michael Mathioudakis
spring 2016

announcements
- problem session on Monday, March 7, 2-4pm, at T2: solutions of the problems in homework 1
- homework 2 will be out on Monday, Feb 29

today's lecture

Ronald Fagin (IBM Almaden Research Center, San Jose, CA, USA), Amnon Lotem (Department of Computer Science, University of Maryland-College Park, MD, USA) and Moni Naor (Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel; work performed while a Visiting Scientist at the IBM Almaden Research Center).
Optimal aggregation algorithms for middleware.
Journal of Computer and System Sciences 66 (2003) 614-656. doi:10.1016/s0022-0000(03)00026-6
Received 6 September 2001; revised 1 April 2002. Extended abstract appeared in Proceedings of the 20th ACM Symposium on Principles of Database Systems, 2001 (PODS 2001), pp. 102-113.

Abstract
Assume that each object in a database has m grades, or scores, one for each of m attributes. For example, an object can have a color grade, that tells how red it is, and a shape grade, that tells how round it is. For each attribute, there is a sorted list, which lists each object and its grade under that attribute, sorted by grade (highest grade first). Each object is assigned an overall grade, that is obtained by combining the attribute grades using a fixed monotone aggregation function, or combining rule, such as min or average. To determine the top k objects, that is, k objects with the highest overall grades, the naive algorithm must access every object in the database, to find its grade under each attribute. Fagin has given an algorithm ("Fagin's Algorithm", or FA) that is much more efficient. For some monotone aggregation functions, FA is optimal with high probability in the worst case. We analyze an elegant and remarkably simple algorithm ("the threshold algorithm", or TA) that is optimal in a much stronger sense than FA. We show that TA is essentially optimal, not just for some monotone aggregation functions, but for all of them, and not just in a high-probability worst-case sense, but over every database. Unlike FA, which requires large buffers (whose size may grow unboundedly as the database size grows), TA requires only a small, constant-size buffer. TA allows early stopping, which yields, in a precise sense, an approximate version of the top k answers. We distinguish two types of access: sorted access (where the middleware system obtains the grade of an object in some sorted list by proceeding through the list sequentially from the top), and random access (where the middleware system requests the grade of object in a list, and obtains it in one step). We consider the scenarios where random access is either impossible, or expensive relative to sorted access, and provide algorithms that are essentially optimal for these cases as well.

top-k retrieval
- users specify information need via a query
  - SQL, mongodb, keyword search, ...
- too many data objects satisfy the query
- present top-k objects
- assumes ranking according to a relevance score
- examples
  - find a flat to rent according to price, location, size, ...
  - find a flight according to price, departure and arrival time, number of stops, ...

top-k retrieval
- consider the following scenario
  - data objects have different attributes
  - given a query, we can obtain a ranking of the objects according to the different attributes
  - a black-box subsystem for each attribute
- want to combine (aggregate) the individual rankings into a single ranking
- top-k is obtained from the aggregate ranking
- aggregator is built on top of the subsystems
  - cannot modify the black-box subsystems
  - subsystems are viewed as middleware

middleware aggregation examples
- example 1: building a meta-search engine


middleware aggregation examples
- example 2: image retrieval with multiple attributes
- query: the query is a photo in flickr; assume that it is geolocated in Helsinki and contains the tag "cinnamon roll"
- text search: [figure]

top-k aggregation abstraction
- we are given a set of $n$ objects $X_1, \ldots, X_n$
- each has a set of $m$ attributes $A_1, \ldots, A_m$
- object $i$ on attribute $j$ has score $r_{ij}$
- we typically assume $0 \le r_{ij} \le 1$
- the higher the value of $r_{ij}$, the better the object $X_i$ according to attribute $A_j$
- object $i$ has overall score $f_i = f(r_{i1}, \ldots, r_{im})$
- retrieve the top-k items according to score $f_i$


top-k aggregation example
$f = \max\{r_1, r_2, r_3\}$

        A_1    A_2    A_3    f      rank
X_1     0.1    0.4    0.7    0.7    3
X_2     0.7    0.8    0.2    0.8    2
X_3     0.3    0.5    0.6    0.6    4
X_4     0.9    0.1    0.4    0.9    1
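The ranking in the table can be reproduced in a few lines of Python. This is only a minimal sketch: the dictionary `scores` and the helper `rank_objects` are illustrative names that do not come from the lecture.

```python
# Attribute scores (A_1, A_2, A_3) of the four objects in the example table.
scores = {
    "X1": (0.1, 0.4, 0.7),
    "X2": (0.7, 0.8, 0.2),
    "X3": (0.3, 0.5, 0.6),
    "X4": (0.9, 0.1, 0.4),
}

def rank_objects(scores, f):
    """Apply the aggregation function f to every object and sort by overall score."""
    overall = {obj: f(values) for obj, values in scores.items()}
    return sorted(overall.items(), key=lambda item: item[1], reverse=True)

# f = max, as in the example; the first entry is the rank-1 object.
ranking = rank_objects(scores, max)
for rank, (obj, score) in enumerate(ranking, start=1):
    print(rank, obj, score)
```

Passing `min` or `sum` instead of `max` gives the ranking under a different aggregation function without changing anything else.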

sorted lists
- we assume that objects are available in $m$ sorted lists
- this models our assumption about middleware subsystems
- the $j$-th list corresponds to attribute $A_j$
- the $j$-th list ranks all objects according to values $r_{ij}$

aggregation functions
- score of object $i$ is given by aggregation function $f_i = f(r_{i1}, \ldots, r_{im})$
- common choices for $f$: min, average or sum
- we typically assume monotonicity
- an aggregate function $f$ is monotone if $f(r_1, \ldots, r_m) \le f(r'_1, \ldots, r'_m)$ whenever $r_j \le r'_j$ for all $j$
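The monotonicity condition can be probed numerically. The sketch below is an illustration only: `is_monotone_on_grid`, `average` and `difference` are hypothetical helper names, and a check on a finite grid can refute monotonicity but cannot prove it in general.

```python
from itertools import product

def is_monotone_on_grid(f, m=2, steps=5):
    """Check f(r) <= f(r') for all grid points with r_j <= r'_j for every j."""
    grid = [i / (steps - 1) for i in range(steps)]  # 0, 0.25, ..., 1
    points = list(product(grid, repeat=m))
    for r in points:
        for r_prime in points:
            dominated = all(a <= b for a, b in zip(r, r_prime))
            if dominated and f(r) > f(r_prime):
                return False  # found a counterexample
    return True

def average(values):
    return sum(values) / len(values)

def difference(values):
    # not monotone: increasing the second argument lowers the result
    return values[0] - values[1]

results = {
    "min": is_monotone_on_grid(min),
    "sum": is_monotone_on_grid(sum),
    "average": is_monotone_on_grid(average),
    "difference": is_monotone_on_grid(difference),
}
print(results)
```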

modes of access and cost model
- sorted access
  - can get objects in each list in decreasing order
  - cost to get the next object in a list: $C_S$
- random access
  - can get the value of a specific object in a list
  - cost for a random access: $C_R$
- middleware cost
  - cost for $s$ sorted accesses and $r$ random accesses: $s C_S + r C_R$


modes of access and cost model
- what are $C_S$ and $C_R$ for the web meta-search engine setting?
- $C_R = \infty$
- no random access (NRA): special case of the model

example
compute top-2 for sum aggregation function

depth   R_1         R_2         R_3
1       X_1  1.0    X_2  0.8    X_4  0.8
2       X_2  0.8    X_3  0.7    X_3  0.6
3       X_3  0.5    X_1  0.3    X_1  0.2
4       X_4  0.3    X_4  0.2    X_5  0.1
5       X_5  0.1    X_5  0.1    X_2  0.0

naive algorithm
- for each object $i$, use the aggregation function to compute the score $f_i$
- get the top-k according to all computed scores
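A minimal sketch of the naive algorithm on the running example (sum aggregation, top-2). The names `OBJECTS`, `total` and `naive_top_k` are illustrative; the scores are the values of the lists R_1, R_2, R_3 regrouped per object.

```python
import heapq

# Per-object values (R_1, R_2, R_3) from the running example.
OBJECTS = {
    "X1": (1.0, 0.3, 0.2),
    "X2": (0.8, 0.8, 0.0),
    "X3": (0.5, 0.7, 0.6),
    "X4": (0.3, 0.2, 0.8),
    "X5": (0.1, 0.1, 0.1),
}

def total(values):
    # sum aggregation; rounding hides floating-point noise in the sums
    return round(sum(values), 9)

def naive_top_k(objects, k, f):
    """Score every object in the database, then keep the k best."""
    scored = [(f(values), obj) for obj, values in objects.items()]
    return heapq.nlargest(k, scored)

top2 = naive_top_k(OBJECTS, 2, total)
print(top2)
```

The cost is one full pass over all n objects and all m attributes, regardless of k, which is what the following algorithms try to avoid.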

naive algorithm
questions:
- do we need to compute the score for every object in the database?
- can we safely ignore some objects whose scores are lower than what we already have?

Fagin's algorithm (FA)
1. perform sorted accesses in all lists in parallel until there are k objects that have been seen in all lists
2. perform random accesses to obtain the scores of all objects seen so far
3. compute score for all objects and find the top-k
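The three steps can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: each list is an in-memory sequence of (object, value) pairs sorted by decreasing value, random access is simulated with a dictionary lookup, and the names `LISTS`, `total` and `fagin_algorithm` are made up for this sketch.

```python
# Sorted lists R_1, R_2, R_3 of the running example (highest value first).
LISTS = [
    [("X1", 1.0), ("X2", 0.8), ("X3", 0.5), ("X4", 0.3), ("X5", 0.1)],
    [("X2", 0.8), ("X3", 0.7), ("X1", 0.3), ("X4", 0.2), ("X5", 0.1)],
    [("X4", 0.8), ("X3", 0.6), ("X1", 0.2), ("X5", 0.1), ("X2", 0.0)],
]

def total(values):
    # sum aggregation; rounding hides floating-point noise in the sums
    return round(sum(values), 9)

def fagin_algorithm(lists, k, f):
    m, n = len(lists), len(lists[0])
    index = [dict(lst) for lst in lists]  # simulated random access
    seen_in = {}                          # object -> lists where it was seen
    depth = 0
    # step 1: sorted access in parallel until k objects were seen in all lists
    while sum(len(s) == m for s in seen_in.values()) < k and depth < n:
        for j, lst in enumerate(lists):
            obj, _ = lst[depth]
            seen_in.setdefault(obj, set()).add(j)
        depth += 1
    # step 2: random access fills in the missing values of every seen object
    # step 3: compute the scores and keep the k best
    scored = {obj: f([index[j][obj] for j in range(m)]) for obj in seen_in}
    top = sorted(scored.items(), key=lambda item: item[1], reverse=True)[:k]
    return top, depth, sorted(seen_in)

top, depth, seen = fagin_algorithm(LISTS, 2, total)
print(top, depth, seen)
```

On the example the sorted phase stops at depth 3, and X5 is never touched.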


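Fagin's algorithm example
compute top-2 for sum aggregation function

depth   R_1         R_2         R_3
1       X_1  1.0    X_2  0.8    X_4  0.8
2       X_2  0.8    X_3  0.7    X_3  0.6
3       X_3  0.5    X_1  0.3    X_1  0.2
4       X_4  0.3    X_4  0.2    X_5  0.1
5       X_5  0.1    X_5  0.1    X_2  0.0

- sorted access: depths 1 to 3 of every list (at depth 3, X_1 and X_3 have been seen in all three lists)
- random access: X_2 in R_3; X_4 in R_1 and R_2
- X_5 cannot be in the top-2. why? monotonicity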

Fagin's algorithm is correct
- assume object $y$ was not seen at all
  - object $y$ has values $y_1, \ldots, y_m$
- assume object $x$ is one of the $k$ objects seen by FA in all lists during sorted access
  - object $x$ has values $x_1, \ldots, x_m$
- for all attributes $j$ it is: $y_j \le x_j$
- therefore $f_y = f(y_1, \ldots, y_m) \le f(x_1, \ldots, x_m) = f_x$
- for all objects seen, the values of all attributes are known
- thus, top-k returns the correct results

Fagin's algorithm
note
- correctness proof assumes only monotonicity
- Fagin's algorithm is correct for any monotone aggregation function


can we do better?
- yes! threshold algorithm
- also proposed by Fagin

the threshold algorithm (TA)
1. do a sorted access in parallel to each of the $m$ sorted lists
2. for each object $x$ seen under sorted access:
   1. retrieve all of its values $x_1, \ldots, x_m$ by random access
   2. compute $f_x = f(x_1, \ldots, x_m)$
   3. if this is one of the top-k answers so far, remember it
3. for the $j$-th list, let $\hat{x}_j$ be the value of the last object seen under sorted access
4. define the threshold value to be $\tau = f(\hat{x}_1, \ldots, \hat{x}_m)$
5. when $k$ objects have been seen whose score is at least $\tau$, then stop
6. return the top-k answers
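The steps above can be sketched in Python as follows. This is a sketch under assumptions: lists are in-memory sequences of (object, value) pairs of equal, non-zero length, random access is simulated with a dictionary lookup, and the names `LISTS`, `total` and `threshold_algorithm` are illustrative. The buffer `top` never holds more than k objects after an insertion is processed.

```python
# Sorted lists R_1, R_2, R_3 of the running example (highest value first).
LISTS = [
    [("X1", 1.0), ("X2", 0.8), ("X3", 0.5), ("X4", 0.3), ("X5", 0.1)],
    [("X2", 0.8), ("X3", 0.7), ("X1", 0.3), ("X4", 0.2), ("X5", 0.1)],
    [("X4", 0.8), ("X3", 0.6), ("X1", 0.2), ("X5", 0.1), ("X2", 0.0)],
]

def total(values):
    # sum aggregation; rounding hides floating-point noise in the sums
    return round(sum(values), 9)

def threshold_algorithm(lists, k, f):
    m, n = len(lists), len(lists[0])
    index = [dict(lst) for lst in lists]  # simulated random access
    top = {}          # bounded buffer: the best k objects seen so far
    thresholds = []   # threshold value after each round, kept for inspection
    for depth in range(n):
        for lst in lists:
            obj, _ = lst[depth]                   # sorted access
            if obj in top:
                continue
            top[obj] = f([index[j][obj] for j in range(m)])  # random accesses
            if len(top) > k:
                del top[min(top, key=top.get)]    # drop the current worst
        # threshold: aggregate of the last values seen in each list
        thresholds.append(f([lst[depth][1] for lst in lists]))
        if len(top) == k and min(top.values()) >= thresholds[-1]:
            break
    ranking = sorted(top.items(), key=lambda item: item[1], reverse=True)
    return ranking, thresholds

ranking, thresholds = threshold_algorithm(LISTS, 2, total)
print(ranking, thresholds)
```

An object that was evicted from the buffer may be scored again when it reappears in another list; it is evicted again because the smallest score in the buffer never decreases.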

threshold algorithm example
compute top-2 for sum aggregation function

depth   R_1         R_2         R_3
1       X_1  1.0    X_2  0.8    X_4  0.8
2       X_2  0.8    X_3  0.7    X_3  0.6
3       X_3  0.5    X_1  0.3    X_1  0.2
4       X_4  0.3    X_4  0.2    X_5  0.1
5       X_5  0.1    X_5  0.1    X_2  0.0

depth   threshold   top-k
1       2.6         X_2 (1.6), X_1 (1.5)
2       2.1         X_3 (1.8), X_2 (1.6)
3       1.0         X_3 (1.8), X_2 (1.6)

at depth 3 both top-2 scores are at least the threshold, so TA stops

threshold algorithm is correct
- assume object $y$ was not seen at all
  - object $y$ has values $y_1, \ldots, y_m$
- assume object $x$ is one of the top-k objects returned by TA
  - object $x$ has values $x_1, \ldots, x_m$
- for all attributes $j$ it is: $y_j \le \hat{x}_j$, since $y$ lies below the last object seen in list $j$
- by monotonicity, $f_y = f(y_1, \ldots, y_m) \le f(\hat{x}_1, \ldots, \hat{x}_m) = \tau$
- by the stopping rule, $\tau \le f(x_1, \ldots, x_m) = f_x$
- therefore $f_y \le \tau \le f_x$
- for all objects seen, the values of all attributes are known
- thus, top-k returns the correct results

threshold algorithm properties
- TA is correct for any monotone aggregation function
- TA uses a bounded-size buffer, independent of the size of the database
- TA is optimal in a very strong sense
  - it is as good as any other algorithm on every instance (instance optimal)
  - "any other algorithm" means: except pathological algorithms
  - "as good" means: within a constant factor
  - "pathological" means: making wild guesses

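instance optimality
- let $\mathcal{A}$ be a class of algorithms
- let $\mathcal{D}$ be a class of legal inputs (datasets)
- for $A \in \mathcal{A}$ and $D \in \mathcal{D}$ we consider the performance cost $\text{cost}(A, D)$
- definition: an algorithm $B \in \mathcal{A}$ is called instance optimal over $\mathcal{A}$ and $\mathcal{D}$ if for any algorithm $A \in \mathcal{A}$ and any dataset $D \in \mathcal{D}$ it is $\text{cost}(B, D) = O(\text{cost}(A, D))$
- that is, there are constants $c$ and $d$ such that $\text{cost}(B, D) \le c \cdot \text{cost}(A, D) + d$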

instance optimality
- instance optimality is a very strong notion
- we are comparing a deterministic algorithm against all possible nondeterministic algorithms
- consider search on a sorted list
  - binary search is worst-case optimal
  - however, it is not instance optimal
  - there is a nondeterministic algorithm that finds an object with one probe, or finds that the object does not exist with two probes
  - but such a nondeterministic algorithm makes wild guesses

instance optimality of TA
- assume that the aggregation function $f$ is monotone
- let $\mathcal{D}$ be the class of all databases
- let $\mathcal{A}$ be the class of all algorithms that correctly find the top k answers for $f$ for every database and that do not make wild guesses
- then TA is instance optimal over $\mathcal{A}$ and $\mathcal{D}$

instance optimality of TA
proof sketch
- let $A$ be any algorithm that runs over a database $D$ s.t. it returns the correct top-k and it does not make wild guesses
- let $d_j$ be the depth that $A$ reaches in list $j$, and let $d = \max_{1 \le j \le m} d_j$ be the maximum depth of $A$
- assume that $A$ sees $a$ distinct objects
- then, since $A$ makes no wild guesses, $a \ge d$
- cost of $A$ is at least $a C_S$


instance optimality of TA
proof sketch
[figure: execution of $A$ on the lists $A_1, A_2, \ldots, A_m$, reaching depths $d_1, d_2, \ldots, d_m$]
- maximum depth: $d = \max_{1 \le j \le m} d_j$
- cost: at least $a C_S$, with $a \ge d$
- claim: TA reaches maximum depth $a + k$

instance optimality of TA
proof sketch
- assume claim true (TA reaches maximum depth $a + k$)
- cost of TA is at most $(a + k) m C_S + (a + k) m (m - 1) C_R$
- or $a m C_S + a m (m - 1) C_R + (k m C_S + k m (m - 1) C_R)$
- last term is a constant
- optimality ratio between $A$ and TA is $\frac{a m C_S + a m (m - 1) C_R}{a C_S} = m + m (m - 1) \frac{C_R}{C_S}$
- that is, a constant ratio
QED modulo claim

instance optimality of TA
proof sketch
(main case: we show that TA reaches max depth $a$)
(bound $a + k$ is shown in corner cases)
- let $Y$ be the output of $A$ (consisting of top-k objects)
- let $\hat{x}_1, \ldots, \hat{x}_m$ be the values of the last objects seen by $A$ in each list when $A$ terminates
- define $\tau_A = f(\hat{x}_1, \ldots, \hat{x}_m)$
- an object $y$ is called big if $f_y \ge \tau_A$
- claim: all objects $y \in Y$ are big


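instance optimality of TA
proof sketch
[figure: execution of $A$ on database $D$ over the lists $A_1, A_2, \ldots, A_m$; $A$ stops at depth $d_j$ in list $j$, where the last value seen is $\hat{x}_j$; the planted object $x$ appears right after depth $d_j$ in every list]
- consider database $D'$ with a planted object $x$ that has values $\hat{x}_1, \ldots, \hat{x}_m$
- the execution of $A$ on $D$ and $D'$ is identical
- by correctness of $A$ we get $f_y \ge f_x = f(\hat{x}_1, \ldots, \hat{x}_m) = \tau_A$ for all $y \in Y$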

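instance optimality of TA
proof sketch
- when TA reaches depth $d \le a$ it has seen all objects in $Y$
- at depth $d \ge d_j$ the last value TA has seen in list $j$ is at most $\hat{x}_j$, so by monotonicity the threshold of TA is at most $\tau_A$
- all objects in $Y$ are big (their score is at least $\tau_A$, hence at least the threshold of TA)
- so TA will halt
QED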

restricting sorted access
- assume that a subset of the lists is not accessible under sorted access mode
- TA can be easily modified to handle such a scenario
- define $\tau = f(\hat{x}_1, \ldots, \hat{x}_m)$ where $\hat{x}_j = 1$ for all inaccessible lists
- all lists that are inaccessible under sorted access are accessed only under random access mode
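A sketch of this modification on the running example, with sorted access disabled for R_3. As before, this only illustrates the idea: lists are in-memory (object, value) sequences, random access is a dictionary lookup, and `LISTS`, `total`, `ta_restricted_sorted` and the parameter `sorted_ok` (the indices of the lists that allow sorted access) are assumed names.

```python
# Sorted lists R_1, R_2, R_3 of the running example (highest value first).
LISTS = [
    [("X1", 1.0), ("X2", 0.8), ("X3", 0.5), ("X4", 0.3), ("X5", 0.1)],
    [("X2", 0.8), ("X3", 0.7), ("X1", 0.3), ("X4", 0.2), ("X5", 0.1)],
    [("X4", 0.8), ("X3", 0.6), ("X1", 0.2), ("X5", 0.1), ("X2", 0.0)],
]

def total(values):
    # sum aggregation; rounding hides floating-point noise in the sums
    return round(sum(values), 9)

def ta_restricted_sorted(lists, k, f, sorted_ok):
    m, n = len(lists), len(lists[0])
    index = [dict(lst) for lst in lists]  # simulated random access
    top, thresholds = {}, []
    for depth in range(n):
        for j in sorted_ok:               # sorted access only where allowed
            obj, _ = lists[j][depth]
            if obj not in top:
                top[obj] = f([index[i][obj] for i in range(m)])
                if len(top) > k:
                    del top[min(top, key=top.get)]
        # inaccessible lists contribute the maximum possible value 1
        bottoms = [lists[j][depth][1] if j in sorted_ok else 1.0
                   for j in range(m)]
        thresholds.append(f(bottoms))
        if len(top) == k and min(top.values()) >= thresholds[-1]:
            break
    ranking = sorted(top.items(), key=lambda item: item[1], reverse=True)
    return ranking, thresholds

ranking, thresholds = ta_restricted_sorted(LISTS, 2, total, sorted_ok=(0, 1))
print(ranking, thresholds)
```

Because the threshold is larger than in plain TA, the run needs one more round of sorted access on this example (depth 4 instead of 3).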

threshold algorithm: no sorted access in R_3
compute top-2 for sum aggregation function

depth   R_1         R_2         R_3 (random access only)
1       X_1  1.0    X_2  0.8    X_4  0.8
2       X_2  0.8    X_3  0.7    X_3  0.6
3       X_3  0.5    X_1  0.3    X_1  0.2
4       X_4  0.3    X_4  0.2    X_5  0.1
5       X_5  0.1    X_5  0.1    X_2  0.0

depth   threshold   top-k
1       2.8         X_2 (1.6), X_1 (1.5)
2       2.5         X_3 (1.8), X_2 (1.6)
3       1.8         X_3 (1.8), X_2 (1.6)
4       1.5         X_3 (1.8), X_2 (1.6)

at depth 4 both top-2 scores are at least the threshold, so TA stops

restricting random access
- perform sorted access on all lists in parallel; at depth $d$:
  - maintain the worst (lowest) values seen so far, $\hat{x}_1, \ldots, \hat{x}_m$, one per list
  - for any object $x$ seen in lists $\{1, \ldots, j\}$:
    - $\text{best}(x) = f(x_1, \ldots, x_j, \hat{x}_{j+1}, \ldots, \hat{x}_m)$
    - $\text{worst}(x) = f(x_1, \ldots, x_j, 0, \ldots, 0)$
  - top-k contains the $k$ objects with max worst scores at depth $d$ (break ties using best)
  - $\tau$ = $k$-th worst score in top-k
- object $y$ is viable if $\text{best}(y) > \tau$
- stop when at least $k$ distinct objects have been seen and no object outside top-k is viable
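A sketch of this no-random-access scheme on the running example. It is an illustration under assumptions: the names `LISTS`, `total` and `nra` are made up, an object that has not been seen at all is treated as having best score $f(\hat{x}_1, \ldots, \hat{x}_m)$, and the function returns the top-k objects without their exact scores, since those may not be fully known when the algorithm stops.

```python
# Sorted lists R_1, R_2, R_3 of the running example (highest value first).
LISTS = [
    [("X1", 1.0), ("X2", 0.8), ("X3", 0.5), ("X4", 0.3), ("X5", 0.1)],
    [("X2", 0.8), ("X3", 0.7), ("X1", 0.3), ("X4", 0.2), ("X5", 0.1)],
    [("X4", 0.8), ("X3", 0.6), ("X1", 0.2), ("X5", 0.1), ("X2", 0.0)],
]

def total(values):
    # sum aggregation; rounding hides floating-point noise in the sums
    return round(sum(values), 9)

def nra(lists, k, f):
    m, n = len(lists), len(lists[0])
    known = {}  # object -> {list index: value seen under sorted access}
    top = []
    for depth in range(n):
        for j, lst in enumerate(lists):
            obj, value = lst[depth]
            known.setdefault(obj, {})[j] = value
        bottoms = [lst[depth][1] for lst in lists]  # last value seen per list

        def worst(obj):
            # unknown values are replaced by the smallest possible value 0
            return f([known[obj].get(j, 0.0) for j in range(m)])

        def best(obj):
            # unknown values are replaced by the last value seen in that list
            return f([known[obj].get(j, bottoms[j]) for j in range(m)])

        ranked = sorted(known, key=lambda o: (worst(o), best(o)), reverse=True)
        top, rest = ranked[:k], ranked[k:]
        if len(top) < k:
            continue
        tau = worst(top[-1])  # k-th worst score in the current top-k
        no_viable_seen = all(best(o) <= tau for o in rest)
        no_viable_unseen = f(bottoms) <= tau
        if no_viable_seen and no_viable_unseen:
            return top, depth + 1
    return top, n

top, depth = nra(LISTS, 2, total)
print(top, depth)
```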

approximate top-k
- finding top-k objects approximately
- for $\epsilon > 0$, an $\epsilon$-approximation of the top-k answers is a collection of $k$ objects $x_1, \ldots, x_k$ so that for any $y$ not among them, it is $(1 + \epsilon) f_{x_i} \ge f_y$
- TA can be easily modified to an approximation algorithm
- simply change the stopping rule into: when $k$ objects have been seen whose score is at least $\tau / (1 + \epsilon)$, then stop
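The modified stopping rule can be sketched as a small change to a TA implementation. The code below is an illustration under the same assumptions as before (in-memory lists, dictionary lookups as random access); `LISTS`, `total`, `approximate_ta` and the parameter `eps` are assumed names.

```python
# Sorted lists R_1, R_2, R_3 of the running example (highest value first).
LISTS = [
    [("X1", 1.0), ("X2", 0.8), ("X3", 0.5), ("X4", 0.3), ("X5", 0.1)],
    [("X2", 0.8), ("X3", 0.7), ("X1", 0.3), ("X4", 0.2), ("X5", 0.1)],
    [("X4", 0.8), ("X3", 0.6), ("X1", 0.2), ("X5", 0.1), ("X2", 0.0)],
]

def total(values):
    # sum aggregation; rounding hides floating-point noise in the sums
    return round(sum(values), 9)

def approximate_ta(lists, k, f, eps):
    m, n = len(lists), len(lists[0])
    index = [dict(lst) for lst in lists]  # simulated random access
    top = {}
    for depth in range(n):
        for lst in lists:
            obj, _ = lst[depth]
            if obj not in top:
                top[obj] = f([index[j][obj] for j in range(m)])
                if len(top) > k:
                    del top[min(top, key=top.get)]
        threshold = f([lst[depth][1] for lst in lists])
        # relaxed stopping rule: compare against threshold / (1 + eps)
        if len(top) == k and min(top.values()) >= threshold / (1 + eps):
            break
    ranking = sorted(top.items(), key=lambda item: item[1], reverse=True)
    return ranking, depth + 1

approx, approx_depth = approximate_ta(LISTS, 2, total, eps=1.0)
exact, exact_depth = approximate_ta(LISTS, 2, total, eps=0.0)
print(approx, approx_depth)
print(exact, exact_depth)
```

With `eps=1.0` the run stops after a single round and returns X2 and X1 instead of the exact answer X3 and X2; the guarantee still holds because twice the smallest returned score (1.5) is at least the score of X3 (1.8).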

summary
- rank aggregation and top-k algorithms
- Fagin's algorithm and threshold algorithm
- instance optimality
- algorithm variants depending on cost model

next lecture (Michael)
- big data platforms