Evaluating Top-k Selection Queries

Surajit Chaudhuri
Microsoft Research
surajitc@microsoft.com

Luis Gravano
Columbia University
gravano@cs.columbia.edu

Abstract

In many applications, users specify target values for certain attributes, without requiring exact matches to these values in return. Instead, the result to such queries is typically a ranking of the top k tuples that best match the given attribute values. In this paper, we study the advantages and limitations of processing a top-k query by translating it into a single range query that traditional relational DBMSs can process efficiently. In particular, we study how to determine a range query to evaluate a top-k query by exploiting the statistics available to a relational DBMS, and the impact of the quality of these statistics on the retrieval efficiency of the resulting scheme.

1 Introduction

Internet search engines rank the objects in the results of selection queries according to how well these objects match the original selection condition. For such engines, query results are not flat sets of objects that match a given condition. Instead, query results are ranked starting from the top object for the query at hand. Given a query consisting of a set of words, a search engine returns the matching documents sorted according to how well they match the query. For decades, the information retrieval field has studied how to rank text documents for a query both efficiently and effectively []. In contrast, much less attention has been devoted to supporting such top-k queries over relational databases.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment. Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.

As the following example illustrates, top-k queries arise naturally in many applications where the data is exact, as in a traditional relational database, but where users are flexible and willing to accept non-exact matches that are close to their specification. The answer to such a query is a ranked set of the k tuples in the database that best match the selection condition.

Example 1: Consider a real-estate database that maintains information like the Price and Number of Bedrooms of each house that is available for sale. Suppose that a potential customer is interested in houses with four bedrooms, and with a price tag of around $,. The database system should then rank the available houses according to how well they match the given user preference, and return the top houses for the user to inspect. If no houses match the query specification exactly, the system might return a house with, say, five bedrooms and a price tag close to $, as the top house for the query.

Unfortunately, despite the conceptual simplicity of top-k queries and the expected performance payoff, they are not yet supported by today's relational database systems. This support would free applications and end-users from having to add this functionality in their client code. To provide such support efficiently, we need processing techniques that do not involve full sequential scans of the underlying relations. The challenge in providing this functionality is that the database system needs to handle efficiently top-k queries for a wide variety of scoring functions.
In effect, these scoring functions might change by user, and they might also vary by application, or by database. It is also important that we are able to process such top-k queries with as few extensions to existing query engines as possible, since today's relational systems are significantly complex and performance sensitive.

As in the case of processing traditional selection queries, one must consider the problem of execution as well as optimization of top-k queries. We assume that the execution engine is a traditional relational engine that supports single as well as possibly multidimensional indexes. Therefore, the key challenge is to augment the optimization phase such that top-k selection queries may be compiled into an execution plan that can leverage the existing data structures (i.e., indexes) and statistics (e.g., histograms) that a database system maintains. Simply put, we need to develop new techniques that make it possible to map a top-k query into a traditional selection query. It is also important that any such technique preserves the following two properties: (1) it handles a variety of scoring functions for computing the top-k tuples for a query, and (2) it guarantees that there are no false dismissals (i.e., we never miss any of the top-k tuples for the given query).

In this paper, we undertake a comprehensive study of the problem of mapping top-k queries into execution plans that use traditional selection queries. In particular, we use the database histograms to map a top-k query to a suitable range that encapsulates the k best matches for the query. We also study the sensitivity of the mapping algorithms to the following parameters: types of histograms available and their memory budgets, scoring functions, data distribution, and number of query attributes.

The rest of the paper is organized as follows. Section 2 formally defines the problem of querying for top-k matches. Section 3 discusses related work. Section 4 is the core of the paper, and outlines the techniques that form the basis of our approach. Finally, Section 6 presents an experimental evaluation of our approach, using the experimental setting of Section 5.

2 Query Model

In a traditional relational system, the answer to a selection query is a set of tuples. In contrast, the answer to a top-k query is an ordered set of tuples, where the ordering reflects how closely each tuple matches the given query. This section defines our query model precisely.

Consider a relation R with attributes A_1, ..., A_n. A top-k query over R simply specifies target values for the attributes in R. Thus, a query is an assignment of values v_1, ..., v_n to the attributes A_1, ..., A_n of R. In this paper, we will focus on top-k queries on continuous attributes (e.g., age, salary). Without loss of generality, we will also assume that the values of these attributes are normalized to be real numbers between 0 and 1.

Example 2: Consider a relation S with two attributes, A_1 and A_2. These attributes have real values that range between 0 and 1. An example of a top-k query over this relation is q = (0.4, 0.3).
Such a query asks for the tuples in S that are the closest to the (0.4, 0.3) point, for some definition of proximity, as we discuss below.

Given a top-k query q, the database system with relation R uses some scoring function Score to determine how closely each tuple in R matches the target values v_1, ..., v_n specified in query q. Given a tuple t and a query q, we assume that Score(q, t) is a real number that ranges between 0 and 1. In this paper, we focus on three important scoring functions, namely Min, Euclidean, and Sum.

Definition 1: Consider a relation R = (A_1, ..., A_n). A_1, ..., A_n are real-valued attributes ranging between 0 and 1. Then, given a query q = (q_1, ..., q_n) and a tuple t = (t_1, ..., t_n) from R, we define the score of t for q using any of the following three scoring functions:

$$\mathit{Min}(q, t) = \min_{i=1}^{n} \{ 1 - |q_i - t_i| \}$$

$$\mathit{Euclidean}(q, t) = 1 - \sqrt{\frac{1}{n} \sum_{i=1}^{n} (q_i - t_i)^2}$$

$$\mathit{Sum}(q, t) = 1 - \frac{1}{n} \sum_{i=1}^{n} |q_i - t_i|$$

Example 3: Consider a tuple t = (0.3, 0.8) in our sample database S from Example 2, and query q = (0.4, 0.3). Then, t will have a score of Min(q, t) = min{1 - |0.3 - 0.4|, 1 - |0.8 - 0.3|} = 0.5 for the Min scoring function, a score of Euclidean(q, t) = 1 - sqrt(((0.3 - 0.4)^2 + (0.8 - 0.3)^2) / 2) = 0.64 for the Euclidean scoring function, and a score of Sum(q, t) = 1 - (|0.3 - 0.4| + |0.8 - 0.3|) / 2 = 0.7 for the Sum scoring function.

Figure 1(c) shows the distribution of scores for the Min scoring function and query q = (0.4, 0.3). The horizontal plane in the figure consists of the tuples with z = 0.8, so what emerges above this plane are those tuples with score 0.8 or higher. Note that the tuples with score 0.8 or higher for q are enclosed in a box around q. In contrast, the tuples with score 0.8 or higher for the Euclidean scoring function (Figure 1(b)) are enclosed in a circle around q. Finally, the top tuples according to the Sum scoring function lie within a rotated box around q (Figure 1(a)). This difference in the shape of the region enclosing the top tuples for the query will have crucial implications on query processing, as we will discuss in Section 4.

A simple variation of the definition of the scoring functions above results from letting the different attributes have different weights.
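The three scoring functions defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the query q = (0.4, 0.3) and tuple t = (0.3, 0.8) are the values assumed for the running example.

```python
import math

def min_score(q, t):
    """Min: the worst per-attribute closeness, 1 - |q_i - t_i|."""
    return min(1 - abs(qi - ti) for qi, ti in zip(q, t))

def euclidean_score(q, t):
    """Euclidean: 1 minus the root-mean-square attribute distance."""
    n = len(q)
    return 1 - math.sqrt(sum((qi - ti) ** 2 for qi, ti in zip(q, t)) / n)

def sum_score(q, t):
    """Sum: 1 minus the mean absolute attribute distance."""
    n = len(q)
    return 1 - sum(abs(qi - ti) for qi, ti in zip(q, t)) / n

# Assumed values for the query and tuple of the running example.
q = (0.4, 0.3)
t = (0.3, 0.8)
scores = {
    "Min": min_score(q, t),
    "Euclidean": euclidean_score(q, t),
    "Sum": sum_score(q, t),
}
print(scores)
```

Under these assumptions the three scores come out to roughly 0.5, 0.64, and 0.7, respectively. All three functions return 1 for an exact match and decrease as any attribute moves away from its target value.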
In general, the Min, Euclidean, and Sum functions that we use in this paper are just a few of many possible scoring functions. Our strategy for processing top-k queries can be adapted to handle a wide variety of such functions, as we will discuss. The key property that we ask from scoring functions is as follows:

Property 1: Monotonicity of Scoring Functions: Consider a relation R and a scoring function Score defined over it. Let q = (v_1, ..., v_n) be a top-k query over R, and let t = (t_1, ..., t_n) and t' = (t'_1, ..., t'_n) be two tuples in R such that |t_i - q_i| ≤ |t'_i - q_i| for i = 1, ..., n. (In other words, t is at least as close to q as t' for all attributes.) Then, Score(q, t') ≤ Score(q, t).

Intuitively, this property of scoring functions implies that if a tuple t is closer, along each attribute, to the query values than some other tuple t' is, then the score that t gets for the query cannot be worse than that of t'. Fortunately, all interesting scoring functions that we could think of satisfy our monotonicity assumptions. In particular, the Euclidean, Min, and Sum scoring functions that we defined above satisfy this property.

Figure 1: The scores (z axis) for query q = (0.4, 0.3) for the different (x, y) pairs and scoring functions Sum (a), Euclidean (b), and Min (c).

A possible SQL-like notation for expressing top-k queries is as follows []:

    SELECT *
    FROM R
    WHERE A_1 = v_1 AND ... AND A_n = v_n
    ORDER k BY Score

The distinguishing feature of the query model is in the ORDER BY clause. This clause indicates that we are interested in only the k answers that best match the given WHERE clause, according to the Score function. Section 4 discusses how we will evaluate top-k queries for different definitions of the Score function.

3 Related Work

Motro [9] emphasized the need to support approximate and ranked matches in a database query language. He extended the language Quel to distinguish between exact and vague predicates. He also suggested a composite scoring function to rank each answer. Motro's work led to further development of the idea of query relaxation that weakens a given user query to provide approximate matches using additional metadata (e.g., concept hierarchies). The querying model for top-k queries that we use in this paper is consistent with Motro's definitions. Our key focus is on exploring opportunities and limitations of efficiently mapping top-k queries into traditional relational queries.

Recently, Carey and Kossmann [, ] presented techniques to optimize queries that require only top-k matches. Their technique leverages the fact that when k is relatively small compared to the size of the relation, specialized sorting (or indexing) techniques that can produce the first few values efficiently should be used.
However, in order to apply their techniques when the scoring function is not based on column values themselves (e.g., as is the case for Min, Euclidean, and Sum as defined in Section 2), we need to first evaluate the scoring function for each database object. Thus, when a query requests the top-k values according to a scoring function like Min, their technique would need to first evaluate the Min score for every data object. Only after evaluating the score for each object are we able to use the techniques in [, ]. Hence, these strategies require a preprocessing step to compute the scoring function itself, involving one sequential scan of all the data. In contrast, in this paper we explore techniques that avoid accessing the entire data set.

In [4, 5], Fagin addresses the problem of finding top-k matches for a user query q involving several multimedia attributes. Each of these attributes (e.g., an image attribute) is assumed to have a native sub-system that answers top-k queries involving only the corresponding attribute. In the first phase of Fagin's A_0 algorithm, the query processing system obtains a stream L_i of top matches for condition c_i on attribute A_i from the corresponding sub-system. When there are at least k objects in the intersection of all the single-attribute streams L_i, the system is guaranteed to have already accessed k top objects for query q. (These top objects are not necessarily in the intersection of the streams.) The second phase of algorithm A_0 computes the score of each of the retrieved objects, and returns the best k objects. In Section 4.3, we present an adaptation of Fagin's strategy to the case when the top-k query is issued against a relational database system.

In [], we presented an algorithm for processing queries over a multimedia database. Our query model built on Fagin's to also include Boolean conditions in addition to the top-k component of the multimedia queries.

There is a large body of work on finding the nearest neighbors of a multidimensional data point.
Given an n-dimensional point p, these techniques retrieve the k objects that are nearest to p according to a given distance metric. The state-of-the-art algorithms (e.g., [7]) follow a multi-step approach. Their key step is identifying a set of points A such that p's k nearest neighbors are no further from p than a is, where a is the point in A that is furthest from p. (A more recent paper [4] further refines this idea.) This approach is conceptually similar to the approach that we follow in this paper (and also in []), where we first find a suitable score S, and then we use it to build a relational query that will return the top-k matches for the original query. Our focus in this paper is to study the practicality and limitations of using the information in the histograms kept by a relational system for query processing. In contrast, the nearest-neighbor algorithms mentioned above use the data values themselves to identify a cut-off score.

Finally, references [6, 8] study how to merge and reconcile top-k query results obtained from distributed databases when the databases use arbitrary, undisclosed scoring algorithms.

4 Mapping a Top-k Query into a Traditional Selection Query

This section shows how to map a top-k query q into a relational selection query C_q that any traditional RDBMS can execute. Our goal is to obtain k tuples from relation R that are the best tuples for q according to a scoring function Score. Our query processing strategy consists of the following steps:

1. Use statistics on relation R to find a search score S_q (Section 4.1).
2. Build a selection query C_q to retrieve all tuples in R with score S_q or higher for q (Section 4.2).
3. Evaluate C_q over R.
4. Compute Score(q, t) for every tuple t in the answer for C_q.
5. If there are at least k tuples t in the result for C_q with Score(q, t) ≥ S_q, then output the k tuples with the highest scores. Otherwise, choose a lower value for S_q and restart the process.

Section 4.3 introduces a related mapping strategy that does not follow the five steps above, and is an adaptation of Fagin's A_0 algorithm (Section 3).

4.1 Choice of Search Score S_q

The key step for evaluating a top-k query q is determining score S_q: our algorithm retrieves all tuples t such that Score(q, t) ≥ S_q. If there are at least k such tuples, then our algorithm above succeeds in finding the top k matches for q. Otherwise, our choice of S_q is too high, and hence the query needs to be restarted with a lower value for S_q. Consequently, we should choose a value of S_q that is not too low, so that we do not retrieve too many candidate tuples from the database, but that is not too high either, so that we can obtain the top-k tuples without restarting the query.

Our choice of S_q will be guided by the statistics that the query processor keeps about relation R. In particular, we will assume that we have an n-dimensional histogram H that describes the distribution of values of R. We discuss this issue further in Section 5.2. Until then, we assume that H consists of a series of nonoverlapping buckets. Each bucket has associated with it an n-rectangle [a_1, b_1] × ... × [a_n, b_n], and stores the number of tuples in R that lie within the n-rectangle, together with other information. For efficiency, our choice of S_q will be based on histogram H, and not on the underlying relation R itself. More specifically, we choose S_q as follows:

a. Create (conceptually) a small, synthetic relation R', consistent with histogram H. R' has one distinct tuple for each bucket in H, with as many instances as the frequency of the corresponding bucket.
b. Compute Score(q, t) for every tuple t in R'.
c. Let T be the set of the top-k tuples in R' for q. Output S_q = min_{t in T} Score(q, t).

We can conceptually build synthetic relation R' in many different ways. We will study two extreme query processing strategies that result from two possible definitions of R'. The first query processing strategy, NoRestarts, results in a search score S_q that is low enough to guarantee that no restarts are ever needed as long as histograms are kept up to date.
In other words, Step (5) above always finishes successfully, without ever having to reduce S_q and restart the process. For this, the NoRestarts strategy defines R' in a pessimistic way: given a histogram bucket b, the corresponding tuple t_b that represents b in R' will be as bad for query q as possible. More formally, t_b is a tuple in b's n-rectangle with the following property:

Score(q, t_b) = min_{t in T_b} Score(q, t)

where T_b is the set of all potential tuples in the n-rectangle associated with bucket b.

Example 4: Consider our example relation S, with two attributes A_1 and A_2, query q = (0.4, 0.3), and the 2-dimensional histogram H shown in Figure 2(a). Histogram H has three buckets, b_1, b_2, and b_3. Relation S has 4 tuples in bucket b_1, 5 tuples in bucket b_2, and 55 tuples in bucket b_3. As explained above, the NoRestarts strategy will build relation S' based on H by assuming that the tuple distribution in S is as bad as possible for query q. So, relation S' will consist of three tuples (one for each bucket in H) t_1, t_2, and t_3, which are as far from q as their corresponding bucket boundaries permit. Tuple t_1 will have a frequency of 4, t_2 will have a frequency of 5, and t_3 will have a frequency of 55. Assume that the user who issued query q wants to use the Min scoring function to find the top k tuples for q. Since Min(q, t_1) = ., Min(q, t_2) = .6, and Min(q, t_3) = .4, to get k tuple instances we need the top tuple, t_2 (frequency 5), and t_3 (frequency 55). Consequently, the search score S_q will be Min(q, t_3) = .4. From the way we built S', it follows that the original relation S is guaranteed to contain at least k tuples with score S_q = .4 or higher for query q. Then, if we retrieve all of the tuples with that score or higher, we will obtain a superset of the set of top-k tuples for q.

Figure 2: A 3-bucket histogram H and the choice of tuples representing each bucket that strategies NoRestarts (a) and Restarts (b) make for query q.

Figure 3: The four strategies for computing the search score S_q: NoRestarts, Inter1, Inter2, and Restarts.

Lemma 1: Let q be a top-k query over a relation R. Let S_q be the search score computed by strategy NoRestarts for q. Then, there are at least k tuples t in R such that Score(q, t) ≥ S_q.

The second query processing strategy, Restarts, results in a search score S_q that is highest among those search scores that might result in no restarts. This strategy defines R' in an optimistic way: given a histogram bucket b, the corresponding tuple t_b that represents b in R' will be as good for query q as possible. More formally, t_b is a tuple in b's n-rectangle with the following property:

Score(q, t_b) = max_{t in T_b} Score(q, t)

where T_b is the set of all potential tuples in the n-rectangle associated with bucket b.

Example 4: (cont.)
The Restarts strategy will now build relation S' based on H by assuming that the tuple distribution in S is as good as possible for query q (Figure 2(b)). So, relation S' will consist of three tuples (one per bucket in H) t_1, t_2, and t_3, which are as close to q as their corresponding bucket boundaries permit. In particular, tuple t_2 will be defined as q proper, with frequency 5, since its corresponding bucket (i.e., b_2) has 5 tuples in it. After defining the bucket representatives t_1, t_2, and t_3, we proceed as in the NoRestarts strategy to sort the tuples on their score for q. For Min, we pick the two top-scoring tuples, and define S_q as the lower of their two Min scores. This time it is indeed possible for fewer than k tuples in the original table S to have a score of S_q or higher for q, so restarts are possible.

The S_q score that Restarts computes is the highest score that might result in no restarts in Step (5) of the algorithm above. In other words, using a value for S_q that is higher than that of the Restarts strategy will always result in restarts. In practice, as we will see in Section 6, the Restarts strategy results in restarts in virtually all cases, hence its name.

Lemma 2: Let q be a top-k query over a relation R. Let S_q be the search score computed by strategy Restarts for q. Then, there are fewer than k tuples t in R such that Score(q, t) > S_q.

In addition to the two extreme score-selection strategies NoRestarts and Restarts, we will study two other intermediate strategies, Inter1 and Inter2 (Figure 3). Given a query q, let S_q^NR be the search score selected by NoRestarts for q, and let S_q^R be the corresponding score selected by Restarts. Then, the Inter1 strategy will choose a score between S_q^NR and S_q^R, while the Inter2 strategy will choose a higher score in the same interval. As our experiments will show, Inter1 and Inter2 are often the best strategies that we can follow in terms of the efficiency of the resulting techniques.

Figure 4: The circle around query q = (0.4, 0.3) contains all of the tuples with Euclidean score of 0.8 or higher for q.
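The score-selection step of this section can be sketched as follows. This is a minimal sketch under stated assumptions: the three-bucket histogram, k = 10, and all function names are hypothetical and are not the values of Example 4. The one-third and two-thirds interpolation weights used for Inter1 and Inter2 are also an assumption made only for illustration. The bucket representative is the farthest corner from q for NoRestarts and the closest point to q for Restarts, which is valid for any monotone scoring function.

```python
def min_score(q, t):
    return min(1 - abs(qi - ti) for qi, ti in zip(q, t))

def worst_point(q, box):
    """Point of the bucket farthest from q in every attribute (NoRestarts)."""
    return tuple(lo if abs(qi - lo) >= abs(qi - hi) else hi
                 for qi, (lo, hi) in zip(q, box))

def best_point(q, box):
    """Point of the bucket closest to q in every attribute (Restarts)."""
    return tuple(min(max(qi, lo), hi) for qi, (lo, hi) in zip(q, box))

def search_score(q, k, histogram, score, representative):
    """Score of the k-th best tuple of the synthetic relation built from H."""
    reps = sorted(((score(q, representative(q, box)), freq)
                   for box, freq in histogram), reverse=True)
    seen = 0
    for s, freq in reps:
        seen += freq          # each representative counts `freq` times
        if seen >= k:
            return s
    return reps[-1][0]        # fewer than k tuples overall

# Hypothetical histogram: (n-rectangle, frequency) per bucket.
histogram = [
    ([(0.0, 0.5), (0.5, 1.0)], 40),
    ([(0.0, 0.5), (0.0, 0.5)], 5),
    ([(0.5, 1.0), (0.0, 0.9)], 55),
]
q, k = (0.4, 0.3), 10

s_nr = search_score(q, k, histogram, min_score, worst_point)  # NoRestarts
s_r = search_score(q, k, histogram, min_score, best_point)    # Restarts
inter1 = (2 * s_nr + s_r) / 3   # assumed weights, closer to NoRestarts
inter2 = (s_nr + 2 * s_r) / 3   # assumed weights, closer to Restarts
print(s_nr, inter1, inter2, s_r)
```

With this hypothetical histogram, NoRestarts yields a search score of about 0.4 and Restarts about 0.9, so the two intermediate scores fall strictly between them.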

4.2 Choice of Selection Query C_q

Once we have determined the search score S_q (Section 4.1), the algorithm in Section 4 uses a query C_q to retrieve all tuples t such that Score(q, t) ≥ S_q, where q is the original top-k query, and Score is the scoring function being used. In this section we describe how to define query C_q.

Ideally, we would like to ask our database system to return exactly those tuples t such that Score(q, t) ≥ S_q. Unfortunately, indexing structures in relational DBMSs do not natively support this kind of predicate, as discussed earlier. Our approach is to build C_q as a simple selection condition defining an n-rectangle. In other words, we define C_q as a query of the form:

    SELECT *
    FROM R
    WHERE (a_1 <= A_1 <= b_1) AND ... AND (a_n <= A_n <= b_n)

The n-rectangle [a_1, b_1] × ... × [a_n, b_n] in C_q should tightly enclose all tuples t in R with Score(q, t) ≥ S_q.

Example 5: Consider our example query q = (0.4, 0.3) over relation S, with Euclidean as the scoring function. Suppose that our search score S_q is 0.8, as computed by any of the strategies in Section 4.1. Each tuple t with Euclidean(q, t) ≥ 0.8 lies in the circle around q that is shown in Figure 4. Then, the tightest n-rectangle that encloses that circle is [0.12, 0.68] × [0.02, 0.58]. Hence, the final SQL query C_q is:

    SELECT *
    FROM S
    WHERE (0.12 <= A_1 <= 0.68) AND (0.02 <= A_2 <= 0.58)

Given a search score S_q, the n-rectangle [a_1, b_1] × ... × [a_n, b_n] that determines C_q follows directly from the scoring function used, the search score S_q, and the query q.

Example 5: (cont.) Let us assume that the search score for our query q = (0.4, 0.3) is S_q = 0.8, as above. We calculate the n-rectangle that encloses all tuples with score 0.8 or higher by focusing on one attribute at a time. First, consider a tuple r = (t_1, 0.3) that has the same attribute values as query q in all attributes except for maybe attribute A_1. We will compute the range of values that t_1 can have while Euclidean(q, r) ≥ 0.8. In effect, Euclidean(q, r) = Euclidean((0.4, 0.3), (t_1, 0.3)) = 1 - sqrt((t_1 - 0.4)^2 / 2). Consequently, Euclidean(q, r) ≥ 0.8 if and only if |t_1 - 0.4| ≤ 0.2 sqrt(2), that is, approximately 0.12 ≤ t_1 ≤ 0.68. Hence, the range of values that attribute A_1 can take is [a_1, b_1] = [0.12, 0.68].
Analogously for attribute A_2, [a_2, b_2] = [0.02, 0.58]. Putting both pieces together, the final n-rectangle that encloses all tuples with score 0.8 or higher for q is [0.12, 0.68] × [0.02, 0.58] (Figure 4).

    Score       a'_i                        b'_i
    Min         q_i - (1 - S_q)             q_i + (1 - S_q)
    Sum         q_i - n (1 - S_q)           q_i + n (1 - S_q)
    Euclidean   q_i - sqrt(n) (1 - S_q)     q_i + sqrt(n) (1 - S_q)

Table 1: The n-rectangle [a_1, b_1] × ... × [a_n, b_n] for C_q's selection condition and search score S_q, for different scoring functions, where a_i = max{0, a'_i} and b_i = min{1, b'_i}.

Table 1 summarizes how to compute the n-rectangle [a_1, b_1] × ... × [a_n, b_n] for the three scoring functions from Section 2. The Min scoring function presents an interesting property: the region to be enclosed by the n-rectangle is already an n-rectangle. (See Figure 1(c).) Consequently, the query C_q that is generated for Min for query q and its associated search score S_q will retrieve only tuples with a score of S_q or higher. This property will result in efficient executions of top-k queries for Min, as we will see. Unfortunately, this property does not hold for the Sum and Euclidean scoring functions (Figures 1(a) and 1(b)).

4.3 An Alternative Mapping Strategy

This section adapts Fagin's A_0 algorithm (Section 3) to produce a new technique for mapping a top-k query into a traditional relational query. Unlike the Section 4.2 strategies, the selection query resulting from this new mapping is a disjunction, not a conjunction. Our goal is, again, to build a one-shot relational query that avoids restarts whenever possible.

We proceed as in strategy NoRestarts (Section 4.1) to build a database with one tuple representing each bucket in the available n-dimensional histogram. We find the top tuples as in the NoRestarts strategy. We then compute an n-rectangle F = [a_1, b_1] × ... × [a_n, b_n] that encloses these top tuples tightly, and that has been extended so that it is symmetric with respect to the given query q. (In other words, a_i ≤ q_i ≤ b_i and b_i - q_i = q_i - a_i, for i = 1, ..., n.) The tuples matching range [a_i, b_i] are the top tuples for q along attribute A_i. The selection query consists of the disjunction of the a_i ≤ A_i ≤ b_i conditions.
By retrieving all tuples that match at least one of these conditions, we retrieve the top tuples for each of the individual attributes. Furthermore, from the way we constructed F, there will be at least k tuples matching all conditions. As with the original A_0 algorithm, we compute the score for all the one-dimensional matches. The k retrieved tuples having the highest score for q are the final answer to the original top-k query. The correctness of this algorithm follows from that of algorithm A_0 [4]. Due to space constraints, we do not discuss this algorithm any further in this paper.
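To tie Section 4 together, the following sketch builds the enclosing n-rectangle of Section 4.2 and runs the five-step strategy with restarts over an in-memory list that stands in for relation R. It is an illustration under stated assumptions: the function names, the tiny relation, and the policy of lowering S_q by a fixed step on each restart are all hypothetical (the text only says that a lower value is chosen), and the list comprehension replaces what would be a range query evaluated by the DBMS. The half-width factors follow from the score definitions of Section 2.

```python
import math

def euclidean_score(q, t):
    n = len(q)
    return 1 - math.sqrt(sum((qi - ti) ** 2 for qi, ti in zip(q, t)) / n)

# Per-attribute half-width of the rectangle, as a multiple of (1 - S_q).
HALF_WIDTH_FACTOR = {
    "min": lambda n: 1.0,
    "sum": lambda n: float(n),
    "euclidean": lambda n: math.sqrt(n),
}

def enclosing_rectangle(q, s_q, scoring):
    """Tightest n-rectangle around all points with score >= s_q, clamped to [0, 1]."""
    r = HALF_WIDTH_FACTOR[scoring](len(q)) * (1 - s_q)
    return [(max(0.0, qi - r), min(1.0, qi + r)) for qi in q]

def top_k(relation, q, k, s_q, score, scoring, step=0.1):
    """Steps 2-5: range query, scoring, and restart with a lower s_q if needed."""
    while True:
        rect = enclosing_rectangle(q, s_q, scoring)
        # Stand-in for evaluating the selection query C_q inside the DBMS.
        candidates = [t for t in relation
                      if all(lo <= x <= hi for x, (lo, hi) in zip(t, rect))]
        scored = sorted(((score(q, t), t) for t in candidates), reverse=True)
        enough = sum(1 for s, _ in scored if s >= s_q) >= k
        if enough or s_q <= 0.0:
            return scored[:k]
        s_q = max(0.0, s_q - step)   # assumed restart policy

rect = enclosing_rectangle((0.4, 0.3), 0.8, "euclidean")
relation = [(0.4, 0.3), (0.5, 0.3), (0.9, 0.9), (0.3, 0.8)]
result = top_k(relation, (0.4, 0.3), 2, 0.95, euclidean_score, "euclidean")
print(rect, result)
```

In this run the initial search score of 0.95 is too high: only one tuple falls inside the first rectangle, so the query restarts once with a lower score before returning the two best tuples. The computed rectangle for S_q = 0.8 agrees with the one in Example 5 up to rounding.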

5 Experimental Setting

We now describe the data sets, histograms, and metrics for the experiments of Section 6.

5.1 Data Sets

Our experiments use a real-world data set as well as synthetic data. The real-world data set is a fragment of US Census Bureau data, and was obtained from the University of California, Irvine archive of machine-learning databases (ftp://ftp.ics.uci.edu/pub/machine-learning-databases). The data set has 45, rows. Each row is a record for an individual, with 4 attributes. We picked four continuous attributes that were especially well suited for our top-k query model: age, wage, education level, and hours of work per week. We also scaled down the attribute values so that the resulting values ranged between 0 and 1, to simplify our experimental setting. We refer to this database as the Census database.

In addition to the Census database, we generated a number of synthetic databases with different data distributions. For this, we wrote a seed program that is capable of generating a one-dimensional Zipfian distribution [5] with varying Z factors. When this factor is zero, it generates a uniform distribution. Higher values result in higher skew. For an n-dimensional data set, our generation program is parameterized by (1) a vector of Z values (one for each attribute), Z = <z_1, ..., z_n>; and (2) the number of tuples to be generated, N.

We created the data corresponding to a Z specification as follows. First, we generated a one-dimensional Zipfian distribution of N tuples for attribute A_1 using Z factor z_1. Let us say that for attribute A_1 the value v occurred in N_1 out of the N tuples. We now fill in the value for attribute A_2 for each of these N_1 tuples by generating N_1 values w_1, ..., w_{N_1} using a Zipfian distribution with Z factor z_2. At the end of this step, the first two attributes of the original N_1 tuples are filled in with values (v, w_1), ..., (v, w_{N_1}). Let us say that this results in N_2 tuples that have v and w as the values for attributes A_1 and A_2, respectively.
We then fill in the remaining attribute values A_3, ..., A_n for these N_2 tuples in an analogous way as above, using the Z values z_3 through z_n.

For our experiments, we generated databases of , records with n = , , and 4 attributes. The domain of each attribute is the real numbers between 0 and 1, with a spacing of . between attribute values. We varied the Zipfian vectors in the generation of the databases so we obtained databases with a spectrum of skews. More specifically, Section 6 reports experiments for three families of databases, Z, Z, and Z. Z, Z, and Z represent the skew of databases built using Zipfian vectors <,,...,>, <,,...,>, and <,,...,>, respectively. Table 2 summarizes the synthetic databases for which we report experiments in the next section.

    Data Skew    n =      n =      n = 4
    Z            ,        ,        ,
    Z            7,       5,554    66,46
    Z            79       878      74

Table 2: The number of distinct tuple values for different data skews and number of attributes.

5.2 Histograms

As outlined above, we map a top-k query over a table R into a relational selection query. To do this mapping, we exploit the statistics (e.g., histograms) kept by the relational DBMS where relation R resides. One of our goals in this paper is to study the effect on our mapping of the different n-dimensional histogram structures proposed in the literature. These structures rely on an underlying strategy for building one-dimensional histograms. In this paper we focus on the AVI, PHASED, and MHIST-p n-dimensional techniques, with MAXDIFF as the underlying one-dimensional strategy [, ]. Below we briefly describe these structures. We refer the reader to [, ] for a detailed discussion.

Constructing a MAXDIFF histogram on an attribute of a relation is logically a two-step process. First, the data values are sorted and, for each distinct value, its frequency of occurrence is calculated. Let the sorted values be v_1, ..., v_m with corresponding frequencies f_1, ..., f_m. We can then define frequencygap(i) as the difference between the frequency f_i of value v_i and the frequency of the value adjacent to it in the sorted order.
This function records the difference in frequency between attribute value v_i and its adjacent value in the sorted order. The bucket boundaries are placed at those attribute values that correspond to the highest values of the frequencygap function. The MAXDIFF histogram structure has been shown to have a good trade-off between accuracy and building cost []. For the experiments that we report in the next section, we have implemented n-dimensional variants of MAXDIFF histograms using the AVI, PHASED, and MHIST-p techniques, as described in [].

The AVI technique for constructing an n-dimensional histogram is to simply assume statistical independence of the one-dimensional attributes. Thus, to determine the fraction of data in an n-dimensional bucket, we multiply the fraction of the data in each one-dimensional projection of the bucket.

The PHASED technique for constructing an n-dimensional histogram consists of n steps. In the first step, one of the dimensions is used to partition the dataset into k_1 buckets. In the j-th step, each of the buckets obtained at the end of the previous step is divided into k_j buckets along one of the unused dimensions. The order in which dimensions are chosen is determined prior to doing any of the partitioning.
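The one-dimensional MAXDIFF construction and the AVI independence assumption described above can be sketched as follows. This is a minimal sketch under assumptions: the function names and the sample data are hypothetical, the gap is taken as the absolute difference between the frequency of each distinct value and that of its predecessor in sorted order, and a boundary at value v is read as "a new bucket starts at v".

```python
import math
from collections import Counter

def maxdiff_boundaries(values, num_buckets):
    """Values at which new buckets start, chosen at the largest frequency gaps."""
    freq = sorted(Counter(values).items())        # (value, frequency), sorted by value
    gaps = [(abs(freq[i][1] - freq[i - 1][1]), freq[i][0])
            for i in range(1, len(freq))]         # gap between adjacent values (assumed form)
    largest = sorted(gaps, key=lambda g: g[0], reverse=True)[:num_buckets - 1]
    return sorted(v for _, v in largest)

def avi_fraction(one_dim_fractions):
    """AVI: fraction of data in an n-dimensional bucket under independence."""
    return math.prod(one_dim_fractions)

# Hypothetical attribute with frequencies 5, 5, 50, 48, 2.
values = [0.1] * 5 + [0.2] * 5 + [0.3] * 50 + [0.4] * 48 + [0.5] * 2
boundaries = maxdiff_boundaries(values, 3)
fraction = avi_fraction([0.5, 0.2])
print(boundaries, fraction)
```

For this sample the two largest gaps occur when the frequency jumps from 5 to 50 and when it drops from 48 to 2, so the three buckets group values with similar frequencies: {0.1, 0.2}, {0.3, 0.4}, and {0.5}.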

For each dimension (attribute), we compute the variance in the frequency of values on that dimension. We then choose the attributes for partitioning the buckets in descending order of their variance. This order reflects the criticality for separating the values in buckets. This technique for constructing an n-dimensional histogram was first used in [] in the context of equidepth histogram structures.

The MHIST-p technique for constructing an n-dimensional histogram is an adaptation of the PHASED approach. More specifically, during the j-th step (see the description of PHASED above), we determine the bucket in most need of partitioning, and we partition it along the attribute that exhibits the highest variance in frequency within the bucket. The factor p designates the number of buckets into which each bucket is split at every step.

The performance of our mapping techniques (Section 4) depends on the accuracy of the available histograms. The accuracy of a histogram depends in turn on the technique with which it was generated, and on the amount of memory that has been allocated for it. In our experiments, in addition to trying several histogram structures, we also study the effect of varying memory on the accuracy of histograms. We assume throughout that histograms are kept up to date with the data. If histograms are not up to date, then the performance of our techniques might decrease. However, the correctness of the answers produced will remain unaffected, at the expense of a potentially higher number of restarts (Section 4).

5.3 Measuring the Efficiency of the Query Execution Strategies

A top-k query q will typically involve several attributes. We might have indexes available for a number of combinations of the query attributes, and the efficiency of processing the query will be greatly affected by the particular index configuration available. We focus on two configurations: (a) a single-column index exists for every attribute mentioned in the query; or (b) a single n-column index exists, covering all attributes mentioned in the query.
Whenever an n-dimensional index is present, we retrieve exactly as many index entries as there are tuples in the n-rectangle defining query C_q, as described in Section 4., followed by the actual retrieval of the k top tuples for q. (The index entries provide all the information that we need to decide which k tuples are the ones with the highest score for q.) Alternatively, when only one-dimensional indexes are available, we can intersect one or more indexes to determine the data tuples to be retrieved. When all necessary single-column indexes are present, this strategy results in no redundant retrieval of data tuples, as in the case when an n-dimensional index is available. However, unlike the case with n-dimensional indexes, we must now pay the overhead of the index intersection. The cost of the index intersection can be traded off against the cost of retrieving redundant data tuples (i.e., data tuples that do not belong to the n-rectangle of Section 4.).

For each top-k query q, we measure the number of objects that match the associated n-dimensional selection query C_q (Section 4.). In Section 6, we report the average over all queries of the number of tuples retrieved as the fraction of the number of (not necessarily distinct) tuples in the database (% of tuples retrieved). This metric reveals the tightness of our mapping of a top-k query into a traditional selection query. A complementary metric is % of restarts, the percentage of queries in our workload for which the associated selection query failed to contain the k best tuples, hence leading to restarts. (See Step (5) of the algorithm of Section 4.)

It is important to distinguish between the tightness of the mapping of a top-k query to a traditional selection query, and the efficiency of execution of the latter. The tightness of the mapping depends on the mapping algorithms (Section 4) and on their interaction with the quality of the available histograms. The efficiency of execution of the selection query produced by our mapping algorithm depends in turn on the indexes available on the database and on the optimizer's choice of an execution plan.
The cost estimator in an optimizer determines the best access path among the available choices. (These choices include performing a sequential scan of the data.) In this paper, we will not discuss further details of efficient execution of selection queries on databases, but rather focus on the problem of mapping top-k queries to selection queries efficiently using histogram structures.

6 Experimental Results

This section presents experimental results for our techniques of Section 4 for evaluating top-k queries. In particular, we study the role of several factors on the efficiency of our strategies, including the size and type of n-dimensional histograms available, the scoring function used in the queries, and the dimensionality and skew of the data sets. Our experiments then involve a large number of parameters, and we tried many different value assignments. For conciseness, we report results on a default setting where appropriate. This default setting uses databases built with the Z (moderate) skew (Section 5.), the PHASED technique for building n-dimensional histograms (Section 5.), and allocates 5KB per histogram. For each experiment, we generated different queries. Each query was created by picking each attribute value randomly from the [, ] range. In the default setting, these queries ask for top tuples (i.e., k = ). We report results for other settings of the parameters as well.
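The two metrics of Section 5 that are reported throughout these experiments, % of tuples retrieved and % of restarts, can be computed from a per-query log along the following lines. This is a minimal sketch: the function name and the `(tuples_retrieved, needed_restart)` record layout are assumptions made for this example, and how retrieved tuples are counted for a query that restarts is left to the caller.

```python
def workload_metrics(runs, num_tuples):
    """Summarize a query workload.

    runs: list of (tuples_retrieved, needed_restart) pairs, one per query
          (hypothetical record layout).
    num_tuples: number of tuples in the database.
    Returns (% of tuples retrieved, % of restarts).
    """
    num_queries = len(runs)
    # Average number of tuples retrieved per query, as a percentage of the database size.
    pct_tuples = 100.0 * sum(retrieved for retrieved, _ in runs) / (num_queries * num_tuples)
    # Percentage of queries whose selection query missed some of the k best tuples.
    pct_restarts = 100.0 * sum(1 for _, restarted in runs if restarted) / num_queries
    return pct_tuples, pct_restarts
```

A tight mapping shows up as a low first value, while a mapping that is too aggressive shows up as a high second value, so the two numbers are best read together.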

Validity of our General Approach

Our general approach for processing a top-k query q (Section 4.) is to find an n-rectangle that contains all the top k tuples for q, and use this rectangle to build a traditional selection query. Our first experiment studies the intrinsic limitations of our approach, i.e., whether it is possible to build a good n-rectangle around query q that contains all top k tuples and little else. To answer this first question, independent of any available histograms or search-score selection strategies (Section 4), we first scanned the database to find the actual top k tuples for a given query q, and determined a tight n-rectangle T that encloses all of these tuples. We then computed what fraction of the database tuples lies within rectangle T. Table reports these figures. As we can see from the table, the fraction of tuples that lie in this ideal rectangle is extremely low, which validates our approach: if the database statistics (i.e., histograms) are accurate enough, then we should be able to find a tight n-rectangle that encloses all the best tuples for a given query, with few extra tuples.

Table : The percentage of tuples in the database included in an n-rectangle enclosing the actual top-k tuples for a query (k = ; N = , tuples). [Table: one row per data distribution (Z) and scoring function (Min, Sum, Euclidean), one column per number of attributes; the numeric entries are not recoverable.]

Effect of Multidimensional Histograms

For this experiment, we considered the AVI, PHASED, and MHIST-p histogram structures (Section 5.). AVI proved to be significantly worse than MHIST and PHASED since it tended to require restarts in most cases, while retrieving only an extremely low fraction of the database tuples. In effect, the NoRestarts strategy of Section 4. guarantees no restarts only in the presence of an accurate n-dimensional histogram. AVI can only estimate the holdings of each n-dimensional bucket by assuming that attributes follow independent distributions. The results for AVI were so poor that we omit this histogram structure from the rest of the discussion.
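The validity experiment described earlier in this section (scan for the actual top k tuples, enclose them in a tight n-rectangle T, and count the tuples inside T) can be sketched as follows. The scoring function used here, the negated Euclidean distance to the query point, is only a stand-in assumption so that higher scores mean better matches; the scoring functions of Section 4 may be defined differently. The function names are likewise hypothetical.

```python
import heapq
import math


def euclidean_score(q, t):
    """Stand-in score (assumption): closer tuples score higher."""
    return -math.dist(q, t)


def tight_rectangle_fraction(tuples, q, k, score=euclidean_score):
    """Return the tight n-rectangle around the actual top-k tuples for q,
    and the fraction of all tuples that fall inside that rectangle."""
    # Full scan: pick the k tuples with the highest score for q.
    top = heapq.nlargest(k, tuples, key=lambda t: score(q, t))
    n = len(q)
    # Per-attribute bounding box of the top-k tuples.
    lows = [min(t[i] for t in top) for i in range(n)]
    highs = [max(t[i] for t in top) for i in range(n)]
    inside = sum(
        all(lows[i] <= t[i] <= highs[i] for i in range(n)) for t in tuples
    )
    return (lows, highs), inside / len(tuples)
```

The returned fraction is a lower bound on what any histogram-based mapping of this kind could hope to retrieve, since no n-rectangle that contains the top k tuples can be smaller than their bounding box.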
For PHASED and MHIST, we varied the amount of storage that we allocated for the histograms. Figure 5 shows the effect of this variation for the Euclidean scoring function. (The results for Min and Sum are analogous.) In this figure, we report the results for the NoRestarts and the Inter policies of Section 4.. When we increase the histogram size from KB to 5KB, there is a sharp improvement in the efficiency of our technique, as evidenced by the drop in the percentage of tuples retrieved. PHASED performs (marginally) better than MHIST and therefore for the rest of this section we report results mainly using PHASED. Although higher memory allocation clearly increases accuracy, as shown by the figures, we decided to settle on a 5KB budget for each histogram in the rest of this paper.

Figure 5: The percentage of tuples retrieved, as a function of the number of bytes dedicated to the n-dimensional histogram (Euclidean scoring function; n = ; Z data distribution). [Plot: % Tuples Retrieved versus Histogram Size (bytes); series: MHIST and PHASED, each with the NoRestarts and Inter strategies.]

Effect of Different Scoring Functions

The goal of this experiment is to measure the differences among scoring functions as the data skew and the number of dimensions are varied (Section 5.). Figure 6 shows that, as the data skew increases, the percentage of tuples retrieved decreases sharply and consistently across all scoring functions. On the other hand, as the number of attributes is increased (Figure 7), the performance of our techniques drops. Interestingly, the Min scoring function copes significantly better with the increase in n than the other scoring functions. As mentioned in Section 4., the shape of the region containing the top tuples for a query under Min matches an n-rectangle perfectly, unlike the case for Sum and Euclidean. The performance of Euclidean, though, is better than that of Sum. As can be observed from Table and Figures (a) and (b), the size of the n-rectangle enclosing the top tuples for Sum is much larger than that for Euclidean (Sections 4. and 4.).
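To make the point about region shapes concrete, the sketch below counts how many tuples fall inside a Euclidean ball around the query point and how many fall inside the smallest axis-aligned n-rectangle that encloses that ball. The difference is the set of extra tuples that a rectangle-based selection query would retrieve. Expressing the score threshold as a distance `radius` is an assumption made for this example, as is the function name; the scoring functions of Section 4 are not reproduced here.

```python
import math


def ball_and_box_counts(tuples, q, radius):
    """Count tuples within Euclidean distance `radius` of q, and tuples
    within the enclosing axis-aligned box [q_i - radius, q_i + radius]."""
    in_ball = sum(math.dist(q, t) <= radius for t in tuples)
    in_box = sum(
        all(abs(qi - ti) <= radius for qi, ti in zip(q, t)) for t in tuples
    )
    return in_ball, in_box
```

Every tuple in the ball is also in the box, so the second count is never smaller than the first. A region that is itself an n-rectangle would make the two counts coincide, which is the geometric reason a rectangle-shaped score region wastes no retrieval.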
Effect of the Number of Tuples Requested k

Figure 8 studies the effect of increasing k, the number of tuples requested in a top-k query. As k increases, the performance drops. As in the previous experiment, the percentage of tuples retrieved for Min grows the slowest, followed by Euclidean. The combination of scoring function Sum and the NoRestarts strategy performs the worst.

Figure 6: The percentage of tuples retrieved (a), and the percentage of queries that needed restarts (b), for increasing data skew (PHASED histogram of 5KB; n = ). [Plots: % Tuples Retrieved and % Restarts versus Data Skew (Z); series: Sum, Euclidean, and Min, each with the NoRestarts and Inter strategies.]

Figure 7: The percentage of tuples retrieved (a), and the percentage of queries that needed restarts (b), as a function of the number of attributes (PHASED histogram of 5KB; Z data distribution). [Plots: % Tuples Retrieved and % Restarts versus n; same series as Figure 6.]

Figure 8: The percentage of tuples retrieved (a), and the percentage of queries that needed restarts (b), for different values of k (PHASED histogram of 5KB; Z data distribution; n = ). [Plots: % Tuples Retrieved and % Restarts versus k; same series as Figure 6.]

Figure 9: The percentage of tuples retrieved (a), and the percentage of queries that needed restarts (b), for increasing data skew (Euclidean scoring function; PHASED histogram of 5KB; n = ). [Plots: % Tuples Retrieved and % Restarts versus Data Skew (Z); series: the NoRestarts, Inter, and Restarts strategies.]

Comparing Query Processing Strategies

Figure 9 compares the relative merits of the query processing strategies of Section 4.. At low data skews, the NoRestarts strategy results in a relatively larger number of matching tuples. However, as skew increases, the performance of NoRestarts improves significantly and dominates that of the other strategies, since, by definition, it incurs no query restarts with up-to-date histograms. Strategy Inter proves to be a robust technique, since it maintains good performance for all data skews.

Effect of Using n-Rectangle Queries

As explained in Section 4., we process a top-k query q by first finding a score S_q and then finding an n-rectangle that encloses all tuples with a Score of S_q or higher. Our goal is for the n-rectangle to have as few bad tuples as possible, i.e., as few tuples with Score lower than S_q as possible. Figure examines this issue by computing the actual number of tuples t with Score(q, t) ≥ S_q. In other words, we take the score S_q computed by using a histogram and a query processing strategy (Section 4.), and we count the tuples in the database with that score or higher. We can then compare these numbers against those in Figure 9(a) to conclude that using n-rectangles for retrieving the database tuples does not result in a major source of inefficiency, since the percentage of tuples in both cases is quite comparable.

Results for the Census Database

Figure shows how our query processing strategies perform on the Census data set (Section 5.).
While none of the strategies resulted in a significant number of restarts (hence we do not show the corresponding plot here), the robustness of strategy Inter for increasing histogram size can be seen clearly. The performance for the different scoring functions is consistent with the results obtained for the synthetic databases described above.

Figure : The average number of tuples (as a percentage of N) with score S_q or higher (Step () of the Section 4 algorithm) for increasing data skew (Euclidean scoring function; PHASED histogram of 5KB; n = ). [Plot: % Tuples versus Data Skew (Z); series: the NoRestarts, Inter, and Restarts strategies.]

Figure : The percentage of tuples retrieved, as a function of the number of bytes dedicated to the histogram (Census database; PHASED histogram). [Plot: % Tuples Retrieved versus Histogram Size (bytes); series: Sum, Euclidean, and Min, each with the NoRestarts and Inter strategies.]

7 Conclusions and Future Work

In this paper, we studied the problem of mapping a top-k query on a relational database to a traditional selection query such that the mapping is tight, i.e., we retrieve as few tuples as possible. Our mapping algorithms exploit the histogram structures and are able to cope with a wide variety of scoring functions. Our experiments highlighted the effect of different scoring functions, data distributions, as well as histogram-building strategies on the performance of this mapping. Our focus in this paper has been primarily on queries over continuous attributes. In the future, we will extend our techniques to handle top-k queries over discrete attributes. Another direction for future work is to explore approaches to support top-k queries with scoring functions (e.g., Max) that cannot be mapped tightly to the family of traditional selection queries that we used in this paper (Figure ).

Figure : The scores for query q = (.4, .) for scoring function Max. [Plot: Max score over the X and Y attribute values.]

Acknowledgments

We thank Eugene Agichtein and David Lomet for their useful comments.

References

[1] M. J. Carey and D. Kossmann. On saying "Enough Already!" in SQL. In Proceedings of the 1997 ACM International Conference on Management of Data (SIGMOD 97), May 1997.
[2] M. J. Carey and D. Kossmann. Reducing the braking distance of a SQL query engine. In Proceedings of the Twenty-fourth International Conference on Very Large Databases (VLDB 98), Aug. 1998.
[3] S. Chaudhuri and L. Gravano. Optimizing queries over multimedia repositories. In Proceedings of the 1996 ACM International Conference on Management of Data (SIGMOD 96), June 1996.
[4] R. Fagin. Combining fuzzy information from multiple systems. In Proceedings of the Fifteenth ACM Symposium on Principles of Database Systems (PODS 96), June 1996.
[5] R. Fagin. Fuzzy queries in multimedia database systems. In Proceedings of the Seventeenth ACM Symposium on Principles of Database Systems (PODS 98), June 1998.
[6] L. Gravano and H. García-Molina. Merging ranks from heterogeneous Internet sources. In Proceedings of the Twenty-third International Conference on Very Large Databases (VLDB 97), Aug. 1997.
[7] F. Korn, N. Sidiropoulos, C. Faloutsos, E. Siegel, and Z. Protopapas. Fast nearest neighbor search in medical image databases. In Proceedings of the Twenty-second International Conference on Very Large Databases (VLDB 96), Sept. 1996.
[8] W. Meng, K.-L. Liu, C. Yu, X. Wang, Y. Chang, and N. Rishe. Determining text databases to search in the Internet. In Proceedings of the Twenty-fourth International Conference on Very Large Databases (VLDB 98), Aug. 1998.
[9] A. Motro. VAGUE: A user interface to relational databases that permits vague queries. ACM Transactions on Office Information Systems, 6(3):187-214, July 1988.
[10] M. Muralikrishna and D. J. DeWitt. Equi-depth histograms for estimating selectivity factors for multidimensional queries. In Proceedings of the 1988 ACM International Conference on Management of Data (SIGMOD 88), June 1988.
[11] V. Poosala and Y. E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Proceedings of the Twenty-third International Conference on Very Large Databases (VLDB 97), Aug. 1997.
[12] V. Poosala, Y. E. Ioannidis, P. J. Haas, and E. J. Shekita. Improved histograms for selectivity estimation of range predicates. In Proceedings of the 1996 ACM International Conference on Management of Data (SIGMOD 96), June 1996.
[13] G. Salton and M. J. McGill. Introduction to modern information retrieval. McGraw-Hill, 1983.
[14] T. Seidl and H.-P. Kriegel. Optimal multi-step k-nearest neighbor search. In Proceedings of the 1998 ACM International Conference on Management of Data (SIGMOD 98), June 1998.
[15] G. K. Zipf. Human behaviour and the principle of least effort. Addison-Wesley, 1949.