Efficient Processing of Top-k Dominating Queries on Multi-Dimensional Data


Man Lung Yiu
Department of Computer Science
Aalborg University
DK-9220 Aalborg, Denmark

Nikos Mamoulis
Department of Computer Science
University of Hong Kong
Pokfulam Road, Hong Kong

ABSTRACT
The top-k dominating query returns the k data objects that dominate the highest number of objects in a dataset. This query is an important tool for decision support since it provides data analysts with an intuitive way of finding significant objects. In addition, it combines the advantages of top-k and skyline queries without sharing their disadvantages: (i) the output size can be controlled, (ii) no ranking functions need to be specified by users, and (iii) the result is independent of the scales at different dimensions. Despite their importance, top-k dominating queries have not received adequate attention from the research community. In this paper, we design specialized algorithms that apply on indexed multi-dimensional data and fully exploit the characteristics of the problem. Experiments on synthetic datasets demonstrate that our algorithms significantly outperform a previous skyline-based approach, while our results on real datasets show the meaningfulness of top-k dominating queries.

1 Introduction
Consider a dataset D of points in a d-dimensional space $\mathbb{R}^d$. Given a (monotone) ranking function $F : \mathbb{R}^d \to \mathbb{R}$, a top-k query [4, 9] returns the k points with the smallest F value. For example, Figure 1 shows a set of hotels modeled by points in the 2D space, where the dimensions correspond to (preference) attribute values: traveling time to a conference venue and room price. For the ranking function F = x + y, the top-2 hotels are $p_4$ and $p_6$. An obvious advantage of the top-k query is that the user is able to control the number of results (through the parameter k). On the other hand, it might not always be easy for the user to specify an appropriate ranking function.
In addition, there is no straightforward way for a data analyst to identify the most important objects using top-k queries, since different functions may infer different rankings.

(Research supported by grant HKU 76/5E from Hong Kong RGC. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM. VLDB '07, September 23-28, 2007, Vienna, Austria. Copyright 2007 VLDB Endowment, ACM /7/9.)

Besides, a skyline query [2] retrieves all points which are not dominated by any other point. Assuming that smaller values are preferable to larger ones at all dimensions, a point $p$ dominates another point $p'$ (i.e., $p \prec p'$) when

$(\exists\, i \in [1, d],\ p[i] < p'[i]) \wedge (\forall\, i \in [1, d],\ p[i] \le p'[i])$   (1)

where $p[i]$ denotes the coordinate of $p$ in the i-th dimension. Continuing with the example in Figure 1, the skyline query returns points $p_1$, $p_4$, $p_6$, and $p_7$. [2] showed that the skyline contains the top-1 result for any monotone ranking function; therefore, it can be used by decision makers to identify objects that are potentially important to some database users. A key advantage of the skyline query is that it does not require the use of a specific ranking function; its results only depend on the intrinsic characteristics of the data. Furthermore, the skyline is not affected by potentially different scales at different dimensions (monetary unit or time unit in the example of Figure 1); only the order of the dimensional projections of the objects is important. On the other hand, the size of the skyline cannot be controlled by the user and it can be as large as the data size in the worst case.
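The dominance test of Equation 1 and the skyline definition can be made concrete with a short sketch. The code below is a minimal, quadratic-time illustration and not one of the algorithms discussed in this paper; the function names `dominates` and `skyline` and the sample hotel coordinates are assumptions made up for the example.

```python
from typing import List, Sequence

Point = Sequence[float]


def dominates(p: Point, q: Point) -> bool:
    """Equation 1 with 'smaller is better': p is no worse than q in every
    dimension and strictly better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points: List[Point]) -> List[Point]:
    """Naive skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels as (travel time, price); not the data of Figure 1.
hotels = [(1, 9), (2, 4), (4, 5), (5, 2), (8, 1), (6, 6)]
print(skyline(hotels))  # [(1, 9), (2, 4), (5, 2), (8, 1)]
```

Four of the six sample points are on the skyline, which hints at why the skyline size is hard to control.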
As a result, the user may be overwhelmed, as she may have to examine numerous skyline points manually in order to identify the ones that will eventually be regarded as important.

[Figure 1: Features of hotels; x axis: time to conf. venue, y axis: price]

From an analyst's point of view, an intuitive score function for modeling the importance of a point $p \in D$ could be:

$\mu(p) = |\{\, p' \in D \mid p \prec p' \,\}|$   (2)

In words, the score $\mu(p)$ is the number of points dominated by point $p$. The following property holds for $\mu$:

$\forall\, p, p' \in D,\ p \prec p' \Rightarrow \mu(p) > \mu(p')$   (3)

The property follows because dominance is transitive: $p$ dominates $p'$ itself as well as every point that $p'$ dominates. Therefore, we can define a natural ordering of the points in the database, based on the $\mu$ function. Accordingly, the top-k dominating query returns the k points in D with the highest score. For example, the top-2 dominating query on the data of Figure 1 retrieves $p_4$ (with $\mu(p_4) = 3$) and $p_5$ (with $\mu(p_5) = 2$). This result may indicate to an analyst the most popular hotels for the conference participants (considering price and traveling time as selection factors). Normally, a participant will try to book at $p_4$ and, if this hotel is fully booked, try the next one ($p_5$). From this example, we can already see that a top-k dominating query is a powerful decision support tool, since it identifies the most significant objects in an intuitive way.

From a practical perspective, top-k dominating queries combine the advantages of top-k queries and skyline queries without sharing their disadvantages. The number of results can be controlled without specifying any ranking function. In addition, data normalization is not required; the results are not affected by different scales or data distributions at different dimensions. We are the first to recognize the importance of the top-k dominating query as a data analysis tool and its advantages over top-k and skyline queries. Papadias et al. [23] did not explore such advantages, although they introduced the top-k dominating query as an extension of the skyline query. In this paper, we identify the importance and practicability of the query and define some of its potential extensions.

A simple evaluation method for top-k dominating queries, based on skyline computation, was proposed in [23]. The basic idea is to compute the skyline, find the top-1 object o in it (note that the top-1 point must belong to the skyline), remove o from D and iteratively apply the same procedure, until k results have been output. This skyline-based approach may perform many unnecessary score countings, since the skyline could be much larger than k. In addition, we note that the R-tree (used in the solution of [23]) may not be the most appropriate index for this query; since computing $\mu(p)$ is in fact an aggregate query, we can replace the R-tree by an aggregate R-tree (aR-tree) [7, 22].

Motivated by these observations, we propose specialized algorithms that operate on aR-trees. Our technical contributions include (i) a batch counting technique for computing scores of multiple points simultaneously, (ii) a counting-guided search algorithm for processing top-k dominating queries, and (iii) a priority-based tree traversal algorithm that retrieves query results by examining each tree node at most once. We enhance the performance of (ii) with lightweight counting, which derives relatively tight upper bound scores for non-leaf tree entries at low cost.
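To make the score function of Equation 2 and the query definition concrete, the following brute-force sketch counts dominated points directly and keeps the k highest scores. It is a naive quadratic baseline on made-up data, not one of the index-based algorithms proposed in this paper; `mu` and `top_k_dominating` are hypothetical names chosen for the illustration.

```python
import heapq
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def mu(p: Point, data: List[Point]) -> int:
    """Score of Equation 2: the number of points in data that p dominates."""
    return sum(1 for q in data if dominates(p, q))


def top_k_dominating(data: List[Point], k: int) -> List[Point]:
    """Return the k points with the highest score, best first."""
    return heapq.nlargest(k, data, key=lambda p: mu(p, data))


# Hypothetical (travel time, price) pairs; not the data of Figure 1.
hotels = [(1, 9), (2, 4), (4, 8), (5, 2), (8, 1), (6, 6), (5, 7)]
print(top_k_dominating(hotels, 2))  # [(2, 4), (5, 2)]
```

In this sample, (1, 9) is a skyline point with score 0, so being on the skyline does not by itself make a point a good top-k dominating result.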
Furthermore, to our surprise, the intuitive best-first traversal order [3, 23] turns out not to be the most efficient for (iii) because of potential partial dominance relationships between visited entries. Thus, we perform a careful analysis on (iii) and propose a novel, efficient tree traversal order for it. Extensive experiments show that our methods significantly outperform the skyline-based approach. Finally, we define two interesting query variants, aggregate top-k dominating queries and bichromatic top-k dominating queries, and show how our methods can be extended to process them.

The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 discusses the properties of top-k dominating search and proposes optimizations for the existing solution in [23]. We then propose eager and lazy approaches for evaluating top-k dominating queries. Section 4 presents an eager approach that guides the search by deriving tight score bounds for encountered non-leaf tree entries immediately. Section 5 develops an alternative, lazy approach that defers score computation of visited entries and gradually refines their score bounds when more tree nodes are accessed. Section 6 introduces extensions of top-k dominating queries and discusses their evaluation. In Section 7, experiments are conducted on both real and synthetic datasets to demonstrate that the proposed algorithms are efficient and that top-k dominating queries return meaningful results to users. Section 8 discusses alternative approaches for top-k dominating queries and query processing on non-indexed data. Finally, Section 9 concludes the paper.

2 Related Work
Top-k dominating queries include a counting component, which is a case of multi-dimensional aggregation; in this section, we review related work on spatial aggregation processing. In addition, as the dominance relationship is relevant to skyline queries, we survey existing methods for computing skylines.

2.1 Spatial Aggregation Processing
R-trees [2] have been extensively used as access methods for multi-dimensional data and for processing spatial queries, e.g., range queries, nearest neighbors [3], and skyline queries [23]. The aggregate R-tree (aR-tree) [7, 22] augments each non-leaf entry of the R-tree with an aggregate measure of all data points in the subtree it points to. It has been used to speed up the evaluation of spatial aggregate queries, where measures (e.g., number of buildings) in a spatial region (e.g., a district) are aggregated.

[Figure 2: aR-tree example. (a) A set of points; (b) a COUNT aR-tree (contents of leaf nodes omitted)]

Figure 2a shows a set of points in the 2D space, indexed by the COUNT aR-tree in Figure 2b. Each non-leaf entry stores the COUNT of data points in its subtree, so the number of points below an entry is known without descending into it. Suppose that a user asks for the number of points intersecting the region W, shown in Figure 2a. To process the query, we first examine entries in the root node of the tree. Entries that do not intersect W are pruned because their subtrees cannot contain any points in W. If an entry is spatially covered by W, its count is added to the answer without accessing the corresponding subtree. Finally, if a non-leaf entry intersects W but is not contained in W, search is recursively applied to the child node pointed to by the entry, since the corresponding subtree may contain points inside or outside W. Note that the counts augmented in the entries effectively reduce the number of accessed nodes: evaluating the above example query on the COUNT aR-tree accesses fewer nodes than on an R-tree with the same node capacity.

2.2 Skyline Computation
Börzsönyi et al. [2] were the first to propose efficient external memory algorithms for processing skyline queries. The BNL (block-nested-loop) algorithm scans the dataset while employing a bounded buffer for tracking the points that cannot be dominated by other points in the buffer. A point is reported as a result if it cannot be dominated by any other point in the dataset. On the other hand, the DC (divide-and-conquer) algorithm recursively partitions the dataset until each partition is small enough to fit in memory. After the local skyline in each partition is computed, the local skylines are merged to form the global skyline. The BNL algorithm was later improved to SFS (sort-filter-skyline) [8] and LESS (linear elimination sort for skyline) [] in order to optimize the average-case running time. The above algorithms are generic and applicable to non-indexed data. On the other hand, [25, 6, 23] exploit data indexes to accelerate skyline computation. The state-of-the-art algorithm is the BBS (branch-and-bound skyline) algorithm [23], which is shown to be optimal for computing skylines on datasets indexed by R-trees.

Recently, the research focus has shifted to the study of queries based on variants of the dominance relationship. [2] propose a data cube structure for speeding up the evaluation of queries that analyze the dominance relationship of points in the dataset.

However, incremental maintenance of the data cube over updates has not been addressed in [2]. Clearly, it is prohibitively expensive to recompute the data cube from scratch for dynamic datasets with frequent updates. [6] identify the problem of computing top-k frequent skyline points, where the frequency of a point is defined by the number of dimensional subspaces. [5] study the k-dominant skyline query, which is based on the k-dominance relationship. A point $p$ is said to k-dominate another point $p'$ if $p$ dominates $p'$ in at least one k-dimensional subspace. The k-dominant skyline contains the points that are not k-dominated by any other point. When k decreases, the size of the k-dominant skyline also decreases. Observe that [2, 6, 5] cannot be directly applied to evaluate the top-k dominating queries studied in this paper.

Finally, [28, 24] study the efficient computation of skylines for every subspace; [26] propose a technique for retrieving the skyline for a given subspace; [, 5] investigate skyline computation over distributed data; [, 7] develop techniques for estimating the skyline cardinality; [2] study continuous maintenance of the skyline over a data stream; and [4] address skyline computation over datasets with partially-ordered attributes.

3 Preliminary
In this section, we discuss some fundamental properties of top-k dominating search, assuming that the data have been indexed by an aR-tree. In addition, we propose an optimized version of the existing top-k dominating algorithm [23] that operates on aR-trees.

3.1 Score Bounding Functions
Before presenting our top-k dominating algorithms, we first introduce some notation that will be used in this paper. For an aR-tree entry e (i.e., a minimum bounding box) whose projection on the i-th dimension is the interval $[e[i]^-, e[i]^+]$, we denote its lower corner $e^-$ and upper corner $e^+$ by

$e^- = (e[1]^-, e[2]^-, \ldots, e[d]^-)$
$e^+ = (e[1]^+, e[2]^+, \ldots, e[d]^+)$

Observe that $e^-$ and $e^+$ do not correspond to actual data points, but they allow us to express dominance relationships among points and minimum bounding boxes conveniently. As Figure 3 illustrates, there are three cases for a point to dominate a non-leaf entry e. Since $p_1 \prec e^-$ (i.e., full dominance), $p_1$ must also dominate all data points indexed under e. On the other hand, point $p_2$ dominates $e^+$ but not $e^-$ (i.e., partial dominance), thus $p_2$ dominates some, but not all, data points in e. Finally, as $p_3 \nprec e^+$ (i.e., no dominance), $p_3$ cannot dominate any point in e. Similarly, the cases for an entry to dominate another entry are: (i) full dominance (e.g., $e_1^+ \prec e_3^-$), (ii) partial dominance (e.g., $e_1^- \prec e_4^+ \wedge e_1^+ \nprec e_4^-$), (iii) no dominance (e.g., $e_1^- \nprec e_2^+$).

[Figure 3: Dominance relationship among aR-tree entries]

Given a tree entry e whose sub-tree has not been visited, $\mu(e^+)$ and $\mu(e^-)$ correspond to the tightmost lower and upper score bounds, respectively, for any point indexed under e. As we will show later, $\mu(e^+)$ and $\mu(e^-)$ can be computed by a search procedure that accesses only aR-tree nodes that intersect e along at least one dimension. These bounds help in pruning the search space and in defining a good order for visiting aR-tree nodes. Later in Sections 4 and 5, we replace the tight bounds $\mu(e^+)$ and $\mu(e^-)$ with loose lower and upper bounds for them ($\mu^l(e)$ and $\mu^u(e)$, respectively). Bounds $\mu^l(e)$ and $\mu^u(e)$ are cheaper to compute and can be progressively refined during search, therefore trading off computation cost against bound tightness. The computation and use of score bounds in practice will be further elaborated there.

3.2 Optimizing the Skyline-Based Approach
Papadias et al. [23] proposed a Skyline-Based Top-k Dominating Algorithm (STD) for top-k dominating queries, on data indexed by an R-tree. They noted that the skyline is guaranteed to contain the top-1 dominating point, since a non-skyline point has a lower score than at least one skyline point that dominates it (see Equation 3). Thus, STD retrieves the skyline points, computes their $\mu$ scores and outputs the point $p$ with the highest score. It then removes $p$ from the dataset, incrementally finds the skyline of the remaining points, and repeats the same process.

Consider for example a top-2 dominating query on the dataset shown in Figure 4. STD first retrieves the skyline points $p_1$, $p_2$, and $p_3$ (using the BBS skyline algorithm of [23]). For each skyline point, a range query is issued to count the number of points it dominates. After that, $p_2$ has the highest score, $\mu(p_2) = 4$; hence, $p_2$ is reported as the top-1 result. We now restrict the region of searching for the next result. First, Equation 3 suggests that the region dominated by the remaining skyline points (i.e., $p_1$ and $p_3$) need not be examined. Second, the region dominated by $p_2$ (i.e., the previous result) may contain some points which are not dominated by the remaining skyline points $p_1$ and $p_3$. It suffices to retrieve the skyline points (i.e., $p_4$ and $p_5$) in the constrained (gray) region M shown in Figure 4. After counting their scores using the tree, we have $\mu(p_4) = 2$, which is higher than $\mu(p_5)$. Finally, we compare them with the scores of the retrieved points (i.e., $p_1$ and $p_3$) and report $p_4$ as the next result.

[Figure 4: Constrained skyline]

In this section, we present two optimizations that greatly reduce the cost of the above solution by exploiting aR-trees. Our first optimization is called batch counting. Instead of iteratively applying separate range queries to compute the scores of the skyline points, we perform them in batch.
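Batch counting, together with the full and partial dominance cases of Section 3.1, can be sketched on a toy two-level COUNT aR-tree. The sketch below mirrors the recursive procedure that the paper gives as Algorithm 1, but the `Entry` and `Node` classes, the helper `leaf_entry`, and the sample points are assumptions introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


@dataclass
class Entry:
    lower: Point                     # e^-: lower corner of the bounding box
    upper: Point                     # e^+: upper corner of the bounding box
    count: int                       # COUNT(e): number of points below e
    child: Optional["Node"] = None   # None when the entry is a data point


@dataclass
class Node:
    entries: List[Entry]


def leaf_entry(points: List[Point]) -> Entry:
    """Hypothetical helper: a non-leaf entry pointing to a leaf of points."""
    lower = tuple(min(c) for c in zip(*points))
    upper = tuple(max(c) for c in zip(*points))
    return Entry(lower, upper, len(points), Node([Entry(p, p, 1) for p in points]))


def batch_count(node: Node, V: List[Point], mu: Dict[Point, int]) -> None:
    """Add to mu[p], for each p in V, the points under node that p dominates."""
    for e in node.entries:
        # Partial dominance by some p: the count cannot be decided at this level.
        partial = e.child is not None and any(
            dominates(p, e.upper) and not dominates(p, e.lower) for p in V
        )
        if partial:
            batch_count(e.child, V, mu)
        else:
            for p in V:
                if dominates(p, e.lower):   # full dominance: take COUNT(e)
                    mu[p] += e.count


root = Node([
    leaf_entry([(2, 4), (1, 9), (4, 8)]),
    leaf_entry([(5, 2), (8, 1)]),
    leaf_entry([(6, 6), (5, 7)]),
])
V = [(2, 4), (5, 2)]
mu = {p: 0 for p in V}
batch_count(root, V, mu)
print(mu)  # {(2, 4): 3, (5, 2): 2}
```

In this sample the third leaf is never read: both query points fully dominate its bounding box, so its stored count is added directly.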
Algorithm 1 shows the pseudocode of this recursive batch counting procedure. It takes two parameters: the current aR-tree node Z and the set of points V, whose $\mu$ scores are to be counted. Initially, Z is set to the root node of the tree and $\mu(p)$ is set to 0 for each $p \in V$. Let e be the current entry in Z to be examined. As illustrated in Section 3.1, if e is a non-leaf entry and there exists some point $p \in V$ such that $p \prec e^+ \wedge p \nprec e^-$, then $p$ may dominate some (but is not guaranteed to dominate all) points indexed under e. Thus, we cannot immediately decide the number of points in e dominated by $p$. In this case, we have to invoke the algorithm recursively on the child node pointed to by e. Otherwise, for each point $p \in V$, its score is incremented by COUNT(e) when it dominates $e^-$. BatchCount correctly computes the $\mu$ score for all $p \in V$ in a single tree traversal.

Algorithm 1 Batch Counting
algorithm BatchCount(Node Z, Point set V)
1: for all entries $e \in Z$ do
2:   if Z is non-leaf and $\exists p \in V,\ p \prec e^+ \wedge p \nprec e^-$ then
3:     read the child node Z' pointed to by e;
4:     BatchCount(Z', V);
5:   else
6:     for all points $p \in V$ do
7:       if $p \prec e^-$ then
8:         $\mu(p) := \mu(p) +$ COUNT(e);

Algorithm 2 is the pseudocode of the Iterative Top-k Dominating Algorithm (ITD), which optimizes the STD algorithm of [23]. Like STD, ITD computes the top-k dominating points iteratively. In the first iteration, ITD computes in V' the skyline of the whole dataset, while in subsequent iterations, the computation is constrained to a region M. M is the region dominated by the point q reported in the previous iteration, but not by any point in the set V of points retrieved in past iterations. At each loop, Lines 6-8 compute the scores for the points in V' in batches of B points each ($B \le |V'|$). By default, the value of B is set to the number of points that can fit into a memory page. Our second optimization is that we sort the points in V' by a space-filling curve (Hilbert ordering) [3] before applying batch counting, in order to increase the compactness of the MBR of a batch. After merging the constrained skyline with the global one, the object q with the highest $\mu$ score is reported as the next dominating object, removed from V and used to compute the constrained skyline at the next iteration. The algorithm terminates after k objects have been reported.

For instance, in Figure 4, q corresponds to point (0, 0) and $V = \emptyset$ in the first loop, thus M corresponds to the whole space and the whole skyline $\{p_1, p_2, p_3\}$ is stored in V'; the points there are sorted and split in batches and their $\mu$ scores are counted using the BatchCount algorithm.
In the beginning of the second loop, $q = p_2$, $V = \{p_1, p_3\}$, and M is the gray region in the figure. V' now becomes $\{p_4, p_5\}$ and the corresponding scores are batch-counted. The next point is then reported (e.g., $p_4$) and the algorithm continues as long as more results are required.

Algorithm 2 Iterative Top-k Dominating Algorithm (ITD)
algorithm ITD(Tree R, Integer k)
1: $V := \emptyset$; q := origin point;
2: for i := 1 to k do
3:   M := region dominated by q but by no point in V;
4:   V' := skyline points in M;
5:   sort the points in V' by Hilbert ordering;
6:   for all batches $V_c$ of (B) points in V' do
7:     initialize all scores of points in $V_c$ to 0;
8:     BatchCount(R.root, $V_c$);
9:   $V := V \cup V'$;
10:  q := the point with maximum score in V;
11:  remove q from V;
12:  report q as the i-th result;

4 Counting-Guided Search
The skyline-based solution becomes inefficient for datasets with large skylines, as the $\mu$ scores of many points are computed. In addition, not all skyline points have large $\mu$ scores. Motivated by these observations, we study algorithms that solve the problem directly, without depending on skyline computations. This section presents an eager approach for the evaluation of top-k dominating queries, which traverses the aR-tree and computes tight upper score bounds for encountered non-leaf tree entries immediately; these bounds determine the visiting order for the tree nodes. We discuss the basic algorithm, develop optimizations for it, and investigate the improvements of these optimizations by an analytical study.

4.1 The Basic Algorithm
Recall from Section 3.1 that the score of any point indexed under an entry e is upper-bounded by $\mu(e^-)$. Based on this observation, we can design a method that traverses aR-tree nodes in descending order of their (upper bound) scores. The rationale is that points with high scores can be retrieved early and accesses to aR-tree nodes that do not contribute to the result can be avoided.
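The idea of visiting nodes in descending order of their upper bound scores can be sketched with a priority queue. This is only a simplified illustration under several assumptions: nodes are nested Python lists (a leaf is a list of point tuples), a brute-force count stands in for BatchCount, the root is enqueued with the trivial bound |D|, and the threshold `gamma` starts at -1 (a choice made here so that zero-score points can still fill the result). All names are hypothetical.

```python
import heapq
from itertools import count


def dominates(p, q):
    """p dominates q: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def is_leaf(node):
    return isinstance(node[0], tuple)      # a leaf is a list of point tuples


def points_under(node):
    if is_leaf(node):
        return list(node)
    return [p for child in node for p in points_under(child)]


def lower_corner(node):
    """e^-: component-wise minimum of the points below the node."""
    return tuple(min(c) for c in zip(*points_under(node)))


def counting_guided_top_k(root, k):
    data = points_under(root)

    def mu(p):                             # brute-force stand-in for BatchCount
        return sum(dominates(p, q) for q in data)

    tie = count()                          # tie-breaker so nodes are never compared
    H = [(-len(data), next(tie), root)]    # max-heap via negated upper bounds
    W = []                                 # min-heap of (score, point), size <= k
    gamma = -1                             # k-th best score found so far
    visited_leaves = 0
    while H and -H[0][0] > gamma:
        _, _, node = heapq.heappop(H)
        if is_leaf(node):
            visited_leaves += 1
            for p in node:
                s = mu(p)
                if len(W) < k:
                    heapq.heappush(W, (s, p))
                elif s > W[0][0]:
                    heapq.heapreplace(W, (s, p))
            if len(W) == k:
                gamma = W[0][0]
        else:
            for child in node:             # bound each child by mu(e^-)
                heapq.heappush(H, (-mu(lower_corner(child)), next(tie), child))
    return sorted(W, reverse=True), visited_leaves


root = [[(2, 4), (1, 9), (4, 8)], [(5, 2), (8, 1)], [(6, 6), (5, 7)]]
print(counting_guided_top_k(root, 2))  # ([(3, (2, 4)), (2, (5, 2))], 2)
```

Only two of the three leaves are visited in this sample: once the second-best score reaches 2, the last leaf, whose upper bound is also 2, cannot improve the result.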
Algorithm 3 shows the pseudocode of the Simple Counting-Guided Algorithm (SCG), which directs search by counting upper bound scores of examined non-leaf entries. A max-heap H is employed for organizing the entries to be visited in descending order of their scores. W is a min-heap for managing the top-k dominating points as the algorithm progresses, while $\gamma$ is the k-th score in W (used for pruning). First, the upper bound scores $\mu(e^-)$ of the aR-tree root entries are computed in batch (using the BatchCount algorithm) and these are inserted into the max-heap H. While the score $\mu(e^-)$ of H's top entry e is higher than $\gamma$ (implying that points with scores higher than $\gamma$ may be indexed under e), the top entry is deheaped, and the node Z pointed to by e is visited. If Z is a non-leaf node, its entries are enheaped, after BatchCount is called to compute their upper score bounds. If Z is a leaf node, the scores of the points in it are computed in batch and the top-k set W (also $\gamma$) is updated, if applicable.

Algorithm 3 Simple Counting Guided Algorithm (SCG)
algorithm SCG(Tree R, Integer k)
1: H := new max-heap; W := new min-heap;
2: $\gamma := 0$;   (the k-th highest score found so far)
3: BatchCount(R.root, $\{e^- \mid e \in$ R.root$\}$);
4: for all entries $e \in$ R.root do
5:   enheap(H, $\langle e, \mu(e^-) \rangle$);
6: while $|H| > 0$ and H's top entry's score $> \gamma$ do
7:   e := deheap(H);
8:   read the child node Z pointed to by e;
9:   if Z is non-leaf then
10:    BatchCount(R.root, $\{e_c^- \mid e_c \in Z\}$);
11:    for all entries $e_c \in Z$ do
12:      enheap(H, $\langle e_c, \mu(e_c^-) \rangle$);
13:  else   (Z is a leaf)
14:    BatchCount(R.root, $\{p \mid p \in Z\}$);
15:    update W and $\gamma$, using $\langle p, \mu(p) \rangle$, $\forall p \in Z$
16: report W as the result;

As an example, consider the top-1 dominating query on the set of points in Figure 5. There are 3 leaf nodes and their corresponding entries in the root node are $e_1$, $e_2$, and $e_3$. First, upper bound scores for the root entries (i.e., $\mu(e_1^-) = 3$, $\mu(e_2^-) = 7$, $\mu(e_3^-) = 3$) are computed by the batch counting algorithm, which incurs 3 node accesses (i.e., the root node and the leaf nodes pointed to by $e_1$ and $e_3$).
Since $e_2$ has the highest upper bound score, the leaf node pointed to by $e_2$ will be accessed next. Scores of entries in $e_2$ are computed in batch and we obtain $\mu(p_1) = 5$ and $\mu(p_3) = 2$, while $p_2$ scores lower. Since $p_1$ is a point and $\mu(p_1)$ is higher than the scores of the remaining entries ($p_2$, $p_3$, $e_1$, $e_3$), $p_1$ is guaranteed to be the top-1 result.

[Figure 5: Computing upper bound scores]

4.2 Optimizations
Now, we discuss three optimizations that can greatly reduce the cost of the basic SCG. First, we utilize encountered data points to strengthen the pruning power of the algorithm. Next, we apply a lazy counting method that delays the counting for points, in order to form better groups for batch counting. Finally, we develop a lightweight technique for deriving upper score bounds of non-leaf entries at low cost.

The pruner set. SCG visits nodes and counts the scores of points and entries, based only on the condition that the upper bound score of their parent entry is greater than $\gamma$. However, we observe that points which have been counted, but have scores at most $\gamma$, can also be used to prune early other entries or points which are dominated by them.¹ Thus, we maintain a pruner set F, which contains points that (i) have been counted exactly (i.e., at Line 15), (ii) have scores at most $\gamma$, and (iii) are not dominated by any other point in F. The third condition ensures that only minimal information is kept in F.² We perform the following changes to SCG in order to use F. First, after deheaping an entry e (Line 7), we check whether there exists a point $p \in F$ such that $p \prec e^-$. If yes, then e is pruned and the algorithm goes back to Line 6. Second, before applying BatchCount at Lines 10 and 14, we eliminate any entries or points that are dominated by a point in F.

Lazy counting. The performance of SCG is negatively affected by executions of BatchCount for a small number of points. A batch may have few points if many points in a leaf node are pruned with the help of F. In order to avoid this problem, we employ a lazy counting technique, which works as follows. When a leaf node is visited (Line 13), instead of directly performing batch counting for the points, those that are not pruned by F are inserted into a set L, with their upper bound score $\mu(e^-)$ from the parent entry. If, after an insertion, the size of L exceeds B (the size of a batch), then BatchCount is executed for the contents of L, and all of W, $\gamma$, F are updated.
Just before reporting the final result set (Line 16), batch counting is performed for potential results $p \in L$ that are not dominated by any point in F and have an upper bound score greater than $\gamma$. We found that the combined effect of the pruner set and lazy counting reduces the cost of SCG in practice.

¹ Suppose that a point $p$ satisfies $\mu(p) \le \gamma$. Applying Equation 3, if a point $p'$ is dominated by $p$, then we have $\mu(p') < \gamma$.
² Note that F is the skyline of a specific data subset.

Lightweight upper bound computation. As mentioned in Section 3.1, the tight upper score bound $\mu(e^-)$ can be replaced by a looser, cheaper to compute, bound $\mu^u(e)$. We propose an optimized version of SCG, called the Lightweight Counting Guided Algorithm (LCG). Line 10 of SCG (Algorithm 3) is replaced by a call to LightBatchCount, which is a variation of BatchCount. Specifically, when bounds for a set V of non-leaf entries are counted, the algorithm avoids expensive accesses at aR-tree leaf nodes, and instead uses entries at non-leaf nodes to derive looser bounds. LightBatchCount is identical to Algorithm 1, except that the recursion of Line 2 is applied when Z is at least two levels above the leaf nodes and there is a point in V that partially dominates e; thus, the else statement at Line 5 now refers to nodes one level above the leaves. In addition, the condition at Line 7 is replaced by $p \prec e^+$; i.e., COUNT(e) is added to $\mu^u(p)$, even if $p$ partially dominates entry e.

As an example, consider the three root entries of Figure 5. We can compute loose upper score bounds for $V = \{e_1^-, e_2^-, e_3^-\}$, without accessing the leaf nodes. Since $e_2^-$ fully dominates $e_2$ and partially dominates $e_1$ and $e_3$, we get $\mu^u(e_2) = 9$. Similarly, we get $\mu^u(e_1) = 3$ and $\mu^u(e_3) = 3$. Although these bounds are looser than the respective tight ones, they still provide a good order of visiting the entries and they can be used for pruning and checking for termination.
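The effect of lightweight counting can be illustrated on a toy tree whose root sits one level above the leaves, so that the recursion of LightBatchCount never triggers and every root entry contributes its COUNT as soon as its upper corner is dominated. This is a simplified sketch of that single case only; the names `light_upper_bounds` and `exact_scores` and the sample points are assumptions made for the illustration.

```python
def dominates(p, q):
    """p dominates q: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def corners(points):
    """Lower and upper corner of the bounding box of a leaf."""
    lower = tuple(min(c) for c in zip(*points))
    upper = tuple(max(c) for c in zip(*points))
    return lower, upper


def light_upper_bounds(entries, V):
    """Add COUNT(e) to the bound of p whenever p dominates e^+ (full or partial)."""
    mu_u = {p: 0 for p in V}
    for _lower, upper, cnt in entries:
        for p in V:
            if dominates(p, upper):
                mu_u[p] += cnt
    return mu_u


def exact_scores(data, V):
    """Brute-force exact scores, used here only for comparison."""
    return {p: sum(dominates(p, q) for q in data) for p in V}


leaves = [[(1, 6), (3, 2)], [(2, 5), (4, 3)], [(5, 7), (6, 6)]]
entries = [(*corners(leaf), len(leaf)) for leaf in leaves]   # (e^-, e^+, COUNT)
V = [lower for lower, _upper, _cnt in entries]               # lower corners e^-
data = [p for leaf in leaves for p in leaf]

print(light_upper_bounds(entries, V))  # {(1, 2): 6, (2, 3): 6, (5, 6): 2}
print(exact_scores(data, V))           # {(1, 2): 6, (2, 3): 4, (5, 6): 2}
```

The bound for the second entry is looser than its exact value (6 versus 4) because the first leaf is only partially dominated, yet no leaf node had to be read to obtain it.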
In Section 7, we demonstrate the significant computation savings of this lightweight counting (of $\mu^u(e)$) over exact counting (of $\mu(e^-)$) and show that it has very little effect on the pruning power of the algorithm. Next, we investigate its effectiveness by a theoretical analysis.

4.3 Analytical Study
Consider a dataset D with N points, indexed by an aR-tree whose nodes have an average fanout f. Our analysis is based on the assumption that the data points are uniformly and independently distributed in the domain space $[0, 1]^d$, where d is the dimensionality. Then, the tree height h and the number of nodes $n_i$ at level i (let the leaf level be 0) can be estimated by $h = 1 + \log_f (N/f)$ and $n_i = N / f^{i+1}$. Besides, the extent (i.e., length of any 1D projection) $\lambda_i$ of a node at the i-th level can be approximated by $\lambda_i = (1/n_i)^{1/d}$ [27].

We now discuss the trade-off of lightweight counting over exact counting for a non-leaf entry e. Recall that the exact upper bound score $\mu(e^-)$ is counted as the number of points dominated by its lower corner $e^-$. On the other hand, lightweight counting obtains $\mu^u(e)$, an upper bound of $\mu(e^-)$. For a given $e^-$, Figure 6 shows that the space can be divided into three regions, with respect to nodes at level i. The gray region $M_2$ corresponds to the maximal region covering nodes (at level i) that are partially dominated by $e^-$. While computing $\mu(e^-)$, only the entries which are completely inside $M_2$ need to be further examined (e.g., $e_A$). Other entries are pruned after either disregarding their aggregate values (e.g., $e_B$, which intersects $M_1$), or adding these values to $\mu(e^-)$ (e.g., $e_C$, which intersects $M_3$).

[Figure 6: Cost of computing upper bound]

Thus, the probability of accessing an (i-th level) node can be approximated by the area of $M_2$, assuming that tree nodes at the same level do not overlap.
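As a rough numeric illustration of the estimates for h, $n_i$ and $\lambda_i$ given above, the following sketch evaluates them for one hypothetical configuration. The function name `tree_estimates` and the values N = 1,000,000, f = 100, d = 2 are assumptions chosen only to make the arithmetic easy to follow.

```python
import math


def tree_estimates(N: int, f: int, d: int):
    """Estimate the tree height h and, per level i (leaf level = 0), the
    number of nodes n_i = N / f^(i+1) and the node extent (1/n_i)^(1/d)."""
    h = 1 + math.log(N / f, f)
    levels = []
    for i in range(round(h)):
        n_i = N / f ** (i + 1)
        lam_i = (1 / n_i) ** (1 / d)
        levels.append((i, n_i, lam_i))
    return h, levels


h, levels = tree_estimates(N=1_000_000, f=100, d=2)
print(f"estimated height: {h:.1f}")
for i, n_i, lam_i in levels:
    print(f"level {i}: about {n_i:.0f} nodes, extent about {lam_i:.2f}")
```

Under these assumed values the tree has three levels: roughly 10,000 leaf nodes of extent 0.01, 100 nodes of extent 0.1 one level up, and a single root spanning the whole domain.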
To further simplify our analysis, suppose that all coordinates of $e^-$ are of the same value v. Hence, the aR-tree node accesses required for computing the exact $\mu(e^-)$ can be expressed as³:

$NA_E(e^-) = \sum_{i=0}^{h-1} n_i \cdot \left[ (1 - v + \lambda_i)^d - (1 - v - \lambda_i)^d \right]$   (4)

In the above equation, the quantity in the square brackets corresponds to the volume of $M_2$ (at level i) over the volume of the universe (which equals 1), thus capturing the probability of a node at level i being completely inside $M_2$. The node accesses of lightweight computation can also be captured by the above equation, except that no leaf nodes (i.e., at level 0) are accessed. As there are many more leaf nodes than non-leaf nodes, lightweight computation incurs significantly lower cost than exact computation.

Now, we compare the scores obtained by exact computation and lightweight computation. The exact score $\mu(e^-)$ is determined by the area dominated by $e^-$:

$\mu(e^-) = N \cdot (1 - v)^d$   (5)

In addition to the above points, lightweight computation also counts all points in $M_2$ for the leaf level into the upper bound score:

$\mu^u(e) = N \cdot (1 - v + \lambda_0)^d$   (6)

Summarizing, three factors N, v, and d affect the relative tightness of the lightweight score bound over the exact bound. When N is large, the leaf node extent $\lambda_0$ is small and thus the lightweight score is tight. If v is small, i.e., $e^-$ is close to the origin and has high dominating power, then $\lambda_0$ becomes less significant in Equation 6 and the ratio of $\mu^u(e)$ to $\mu(e^-)$ is close to 1 (i.e., the lightweight score becomes relatively tight). As d increases (decreases), $\lambda_0$ also increases (decreases) and the lightweight score gets looser (tighter).

In practice, during counting-guided search, entries close to the origin have a higher probability of being accessed than other entries, since their parent entries have higher upper bounds and they are prioritized by the search. As a result, we expect that the second case above will hold for most of the upper bound computations and lightweight computation will be effective.

5 Priority-Based Traversal
In this section, we present a lazy alternative to the counting-guided method.
Instead of comuting uer bounds of visited entries by exlicit counting, we defer score comutations for entries, but maintain lower and uer bounds for them as the tree is traversed. Score bounds for visited entries are gradually refined when more nodes are accessed, until the result is finalized with the hel of them. For this method to be effective, the tree is traversed with a carefullydesigned riority order aiming at minimizing cost. We resent the basic algorithm, analyze the issue of setting an aroriate order for visiting nodes, and discuss its imlementation. 5. The Basic Algorithm Recall that counting-guided search, resented in the revious section, may access some ar-tree nodes more than once due to the alication of counting oerations for the visited entries. For instance in Figure 5, the node ointed by e may be accessed twice; once for counting the scores of oints under e 2 and once for counting 3 For simlicity, the equation does not consider the boundary effect (i.e., v is near the domain boundary). To cature the boundary effect, we need to bound the terms ( v + λ i ) and ( v λ i ) within the range [, ]. the scores of oints under e. We now roose a to-k dominating algorithm which traverses each node at most once and has reduced cost. Algorithm 4 shows the seudo-code of this Priority-Based Tree Traversal Algorithm (PBT). PBT browses the tree, while maintaining (loose) uer µ u (e) and lower µ l (e) score bounds for the entries e that have been seen so far. The nodes of the tree are visited based on a riority order. The issue of defining an aroriate ordering of node visits will be elaborated later. During traversal, PBT maintains a set S of visited ar-tree entries. An entry in S can either: (i) lead to a otential result, or (ii) be artially dominated by other entries in S that may end u in the result. W is a min-hea, emloyed for tracking the to-k oints (in terms of their µ l scores) found so far, whereas γ is the lowest score in W (used for runing). 
First, the root node is loaded, and its entries are inserted into $S$ after upper score bounds have been derived from information in the root node. Then (Lines 8-18), while $S$ contains non-leaf entries, the non-leaf entry $e_z$ with the highest priority is removed from $S$, the corresponding tree node $Z$ is visited and (i) the $\mu^u$ ($\mu^l$) scores of existing entries in $S$ (partially dominating $e_z$) are refined using the contents of $Z$, (ii) $\mu^u$ ($\mu^l$) values for the contents of $Z$ are computed and the contents are, in turn, inserted into $S$. Note that for operations (i) and (ii), only information from the current node and $S$ is used; no additional accesses to the tree are required. Updates and computations of $\mu^u$ scores are performed incrementally with the information of $e_z$ and the entries in $S$ that partially dominate $e_z$. $W$ is updated with points/entries of higher $\mu^l$ than $\gamma$. Finally (Line 20), entries are pruned from $S$ if (i) they cannot lead to points that may be included in $W$, and (ii) they are not partially dominated by entries leading to points that can reach $W$.
Algorithm 4 Priority-Based Tree Traversal Algorithm (PBT)
algorithm PBT(Tree R, Integer k)
1: $S$ := new set; ▷ entry format in $S$: $\langle e, \mu^l(e), \mu^u(e) \rangle$
2: $W$ := new min-heap; ▷ $k$ points with the highest $\mu^l$
3: $\gamma$ := 0; ▷ the $k$-th highest $\mu^l$ score found so far
4: for all $e_x \in R.root$ do
5:     $\mu^l(e_x) := \sum_{e \in R.root \,\wedge\, e_x^+ \prec e^-} COUNT(e)$;
6:     $\mu^u(e_x) := \sum_{e \in R.root \,\wedge\, e_x^- \prec e^+} COUNT(e)$;
7:     insert $e_x$ into $S$ and update $W$;
8: while $S$ contains non-leaf entries do
9:     remove $e_z$: non-leaf entry of $S$ with the highest priority;
10:    read the child node $Z$ pointed by $e_z$;
11:    for all $e_y \in S$ such that $e_y^+ \nprec e_z^- \wedge e_y^- \prec e_z^+$ do
12:        $\mu^l(e_y) := \mu^l(e_y) + \sum_{e \in Z \,\wedge\, e_y^+ \prec e^-} COUNT(e)$;
13:        $\mu^u(e_y) := \mu^l(e_y) + \sum_{e \in Z \,\wedge\, e_y^+ \nprec e^- \,\wedge\, e_y^- \prec e^+} COUNT(e)$;
14:    $S_z := Z \cup \{ e \in S \mid e_z^+ \nprec e^- \wedge e_z^- \prec e^+ \}$;
15:    for all $e_x \in Z$ do
16:        $\mu^l(e_x) := \mu^l(e_z) + \sum_{e \in S_z \,\wedge\, e_x^+ \prec e^-} COUNT(e)$;
17:        $\mu^u(e_x) := \mu^l(e_x) + \sum_{e \in S_z \,\wedge\, e_x^+ \nprec e^- \,\wedge\, e_x^- \prec e^+} COUNT(e)$;
18:    insert all entries of $Z$ into $S$;
19:    update $W$ (and $\gamma$) by $e \in S$ whose score bounds changed;
20:    remove entries $e_m$ from $S$ where $\mu^u(e_m) < \gamma$ and $\neg\exists e \in S, (\mu^u(e) \geq \gamma) \wedge (e^+ \nprec e_m^- \wedge e^- \prec e_m^+)$;
21: report $W$ as the result;

It is important to note that, at Line 21 of PBT, all non-leaf entries have been removed from the set $S$, and thus the (result) points in $W$ have their exact scores found.

To comprehend the functionality of PBT, consider again the top-1 dominating query on the example of Figure 5. For ease of discussion, we denote the score bounds of an entry $e$ by the interval $\mu^*(e) = [\mu^l(e), \mu^u(e)]$. Initially, PBT accesses the root node and its entries are inserted into $S$ after their lower/upper bound scores are derived (see Lines 5-6); $\mu^*(e_1)$=[, 3], $\mu^*(e_2)$=[, 9], $\mu^*(e_3)$=[, 3]. Assume for now that visited nodes are prioritized (Lines 9-10) based on the upper bound scores $\mu^u(e)$ of entries $e \in S$. Entry $e_2$, which has the highest score $\mu^u$ in $S$, is removed and its child node $Z$ is accessed. Since $e_1^- \nprec e_2^+$ and $e_3^- \nprec e_2^+$, the upper/lower score bounds of the remaining entries $\{e_1, e_3\}$ in $S$ will not be updated (the condition of Line 11 is not satisfied). The score bounds for the points $p_1$, $p_2$, and $p_3$ in $Z$ are then computed; $\mu^*(p_1)$=[, 7], $\mu^*(p_2)$=[, 3], and $\mu^*(p_3)$=[, 3]. These points are inserted into $S$, and $W = \{p_1\}$ with $\gamma = \mu^l(p_1)$=. No entry or point in $S$ can be pruned, since their upper bounds are all greater than $\gamma$. The next non-leaf entry to be removed from $S$ is $e_1$ (the tie with $e_3$ is broken arbitrarily). The score bounds of the existing entries $S = \{e_3, p_1, p_2, p_3\}$ are in turn refined; $\mu^*(e_3)$ remains [, 3] (unaffected by $e_1$), whereas $\mu^*(p_1)$=[3, 6], $\mu^*(p_2)$=[, ], and $\mu^*(p_3)$=[, 3]. The scores of the points indexed by $e_1$ are computed; $\mu^*(p_4)$=[, ], $\mu^*(p_5)$=[, ], and $\mu^*(p_6)$=[, ], and $W$ is updated to $\{p_1\}$ with $\gamma = \mu^l(p_1) = 3$. At this stage, all points except $p_1$ are pruned from $S$, since their $\mu^u$ scores are at most $\gamma$ and they are not partially dominated by non-leaf entries that may contain potential results. Although no point from $e_3$ can have a higher score than $p_1$, we still have to keep $e_3$, in order to compute the exact score of $p_1$ in the next round.

5.2 Traversal Orders in PBT

An intuitive method for prioritizing entries at Line 9 of PBT, hinted at by the upper bound principle of [9] or the best-first ordering of [3, 23], is to pick the entry $e_z$ with the highest upper bound score $\mu^u(e_z)$; such an order would visit early the points that have a high probability of being in the top-k dominating result. We denote this instantiation of PBT by UBT (for Upper-bound Based Traversal).
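The root-level bounds of Lines 5-6 of Algorithm 4 and the upper-bound-first choice used by UBT can be illustrated with a minimal Python sketch. The `Entry` class, its field names, and the helper names are hypothetical and introduced only for illustration; smaller coordinate values are assumed to be preferable, and the sample entries are made up rather than taken from Figure 5.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]

@dataclass
class Entry:
    # Hypothetical in-memory view of an aR-tree entry (illustration only).
    lo: Point          # lower corner e^-
    hi: Point          # upper corner e^+
    count: int         # COUNT(e): number of points under the entry
    leaf: bool = False # True for data points
    mu_l: int = 0      # lower score bound
    mu_u: int = 0      # upper score bound

def dominates(p: Point, q: Point) -> bool:
    """p dominates q: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def init_bounds(root: List[Entry]) -> None:
    """Derive loose bounds from the root entries only (Lines 5-6)."""
    for ex in root:
        # Lower bound: entries that ex dominates completely (ex^+ before e^-).
        ex.mu_l = sum(e.count for e in root if dominates(ex.hi, e.lo))
        # Upper bound: entries that ex may dominate at least partially
        # (ex^- before e^+); this typically includes ex itself.
        ex.mu_u = sum(e.count for e in root if dominates(ex.lo, e.hi))

def pick_ubt(S: List[Entry]) -> Optional[Entry]:
    """UBT priority: the non-leaf entry with the highest upper bound."""
    candidates = [e for e in S if not e.leaf]
    return max(candidates, key=lambda e: e.mu_u) if candidates else None

e1 = Entry(lo=(0.1, 0.1), hi=(0.3, 0.3), count=3)
e2 = Entry(lo=(0.5, 0.5), hi=(0.8, 0.8), count=3)
e3 = Entry(lo=(0.2, 0.6), hi=(0.4, 0.9), count=3)
root = [e1, e2, e3]
init_bounds(root)
next_entry = pick_ubt(root)
```

With these made-up entries, `e1` fully dominates `e2` and may dominate points in all three entries, so it receives the widest interval and is visited first under UBT.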
Nevertheless, a closer look into PBT (Algorithm 4) reveals that the upper score bounds alone may not offer the best priority order for traversing the tree. Recall that the pruning operation (at Line 20) eliminates entries from $S$, saving significant cost and leading to the early termination of the algorithm. The effectiveness of this pruning depends on the lower bounds of the best points (stored in $W$). Unless these bounds are tight enough, PBT will not terminate early and $S$ will grow very large. For example, consider the application of UBT to the tree of Figure 2. The first few nodes accessed are in the order: root node, e 8, e, e 9, e 2. Although e has the highest upper bound score, it partially dominates high-level entries (e.g., e 7 and e 2), whose child nodes have not been accessed yet. As a result, the best-k score $\gamma$ (i.e., the current lower bound score of e ) is small, few entries can be pruned, and the algorithm does not terminate early. Thus, the objective of the search is not only to (i) examine the entries with large upper bounds early, which leads to early identification of candidate query results, but also to (ii) eliminate partial dominance relationships between entries that appear in $S$, which facilitates the computation of tight lower bounds for these candidates.

We now investigate the factors affecting the probability that one node partially dominates another and link them to the traversal order of PBT. Let $a$ and $b$ be two random nodes of the tree such that $a$ is at level $i$ and $b$ is at level $j$. Using the same uniformity assumptions and notation as in Section 4.3, we can infer that the two nodes $a$ and $b$ do not intersect along dimension $t$ with probability:

$$Pr(a[t] \cap b[t] = \emptyset) = 1 - (\lambda_i + \lambda_j)$$

(This equation is simplified for readability. The probability equals 0 when $\lambda_i + \lambda_j > 1$.) $a$ and $b$ have a partial dominance relationship when they intersect along at least one dimension. The probability of this event is:

$$Pr\Big(\bigvee_{t \in [1,d]} a[t] \cap b[t] \neq \emptyset\Big) = 1 - \big(1 - (\lambda_i + \lambda_j)\big)^d$$

The above probability is small when the sum $\lambda_i + \lambda_j$ is minimized (e.g., $a$ and $b$ are both at low levels). This analysis leads to the conclusion that, in order to minimize the number of partially dominating entry pairs in $S$, we should prioritize the visited nodes based on their level in the tree. In addition, among entries at the highest level in $S$, we should choose the one with the highest upper bound, in order to find the points with high scores early. Accordingly, we propose an instantiation of PBT, called Cost-Based Traversal (CBT). CBT corresponds to Algorithm 4, such that, at Line 9, the non-leaf entry $e_z$ with the highest level is removed from $S$ and processed; if there are ties, the entry with the highest upper bound score is picked. In Section 7, we demonstrate the advantage of CBT over UBT in practice.

5.3 Implementation Details

A straightforward implementation of PBT may lead to very high computational cost. At each loop, the burden of the algorithm is the pruning step (Line 20 of Algorithm 4), which has worst-case cost quadratic in the size of $S$; entries are pruned from $S$ if (i) their upper bound scores are below $\gamma$ and (ii) they are not partially dominated by any other entry with upper bound score above $\gamma$. If an entry $e_m$ satisfies (i), then a scan of $S$ is required to check (ii). In order to check condition (ii) efficiently, we use a main-memory R-tree $I(S)$ to index the entries in $S$ having upper bound score above $\gamma$. When the upper bound score of an entry drops below $\gamma$, it is removed from $I(S)$. When checking for pruning of $e_m$ at Line 20 of PBT, we only need to examine the entries indexed by $I(S)$, as only these have upper bound scores above $\gamma$. In particular, we may not even have to traverse the whole index $I(S)$. For instance, if a non-leaf entry $e'$ in $I(S)$ does not partially dominate $e_m$, then we need not check the subtree of $e'$. As we verified experimentally, maintaining $I(S)$ enables the pruning step to be implemented efficiently.
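The pruning rule of Line 20 and the level-first ordering of Section 5.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described above: it scans a plain list instead of the main-memory R-tree $I(S)$, and the `Entry` class, the `level` encoding (0 for data points, larger values for higher non-leaf entries), and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]

@dataclass
class Entry:
    lo: Point   # lower corner e^-
    hi: Point   # upper corner e^+ (equal to lo for a data point)
    level: int  # assumption: 0 for points, >0 for non-leaf entries
    mu_u: int   # current upper score bound

def dominates(p: Point, q: Point) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def partially_dominates(e: Entry, f: Entry) -> bool:
    """e may dominate some of f but not all of it: e^- before f^+ and not e^+ before f^-."""
    return dominates(e.lo, f.hi) and not dominates(e.hi, f.lo)

def prune(S: List[Entry], gamma: int) -> List[Entry]:
    """Keep an entry if its bound reaches gamma, or if a still-promising
    entry partially dominates it (its contents are needed to refine that
    entry's score). Linear scan; an index would avoid the quadratic cost."""
    promising = [e for e in S if e.mu_u >= gamma]
    return [em for em in S
            if em.mu_u >= gamma
            or any(partially_dominates(e, em) for e in promising)]

def pick_level_first(S: List[Entry]) -> Optional[Entry]:
    """Highest level first; ties broken by the highest upper bound."""
    candidates = [e for e in S if e.level > 0]
    return max(candidates, key=lambda e: (e.level, e.mu_u)) if candidates else None

p1 = Entry((0.2, 0.2), (0.2, 0.2), 0, 6)
p2 = Entry((0.6, 0.6), (0.6, 0.6), 0, 1)
p4 = Entry((0.9, 0.95), (0.9, 0.95), 0, 0)
e3 = Entry((0.1, 0.5), (0.4, 0.8), 1, 3)
kept = prune([p1, p2, p4, e3], gamma=3)
```

In this made-up instance `p2` survives although its bound is below `gamma`, because `e3` still partially dominates it, whereas `p4` is fully dominated by every promising entry and is dropped.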
In addition to $I(S)$, we tried further data structures for accelerating the operations of PBT (e.g., a priority queue for popping the next entry from $S$ at Line 9); however, the maintenance cost of these data structures (as the upper bounds of entries in $S$ change frequently at Lines 11-13) did not justify the performance gains.

6 Extensions

This section discusses interesting extensions to the basic form of top-k dominating queries we have studied so far. We note that the query types discussed here are original; to our knowledge, they have not been mentioned or studied in the literature before.

6.1 Generic Aggregate Functions and Point Significance

We can generalize the top-k dominating query to include any aggregate function $agg$ (i.e., instead of COUNT) and weights $w(p)$ of significance on points (i.e., instead of all points having the same significance $w(p) = 1$). The generalized scoring function is defined as:

$$\mu_{agg}(p) = agg \{\, w(p') \mid p' \in D \wedge p \prec p' \,\} \qquad (7)$$

It is not hard to see that our proposed techniques can be directly used for a generalized top-k dominating query, for distributive and monotone aggregate functions (like SUM, MAX, MIN) and weights of importance on the points. For this purpose, we can use an aggregate R-tree, where entries are augmented with the aggregate score of $w(p)$, for all points $p$ under them.
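The generalized score of Equation 7 can be illustrated by a brute-force Python sketch that ignores indexing altogether. The function names, the sample points and weights, and the choice to assign score 0 to a point that dominates nothing are assumptions made for illustration; smaller coordinate values are assumed to be preferable.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]

def dominates(p: Point, q: Point) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def generalized_score(p: Point, data: Sequence[Point], weights: Sequence[float],
                      agg: Callable[[List[float]], float]) -> float:
    """Aggregate the weights of all points dominated by p (Equation 7)."""
    dominated = [w for q, w in zip(data, weights) if dominates(p, q)]
    # Assumption: a point that dominates nothing gets score 0.
    return agg(dominated) if dominated else 0.0

def top_k(data: Sequence[Point], weights: Sequence[float], k: int,
          agg: Callable[[List[float]], float]) -> List[Tuple[float, Point]]:
    """Quadratic-time reference answer: score every point, keep the best k."""
    scored = [(generalized_score(p, data, weights, agg), p) for p in data]
    return sorted(scored, key=lambda t: -t[0])[:k]

data = [(1.0, 1.0), (2.0, 2.0), (3.0, 1.0), (2.0, 3.0)]
weights = [0.6, 0.9, 0.1, 0.3]
best_sum = top_k(data, weights, 1, sum)
best_max = top_k(data, weights, 1, max)
plain_count = generalized_score(data[0], data, [1.0] * len(data), sum)
```

Passing `sum` with unit weights reproduces the default COUNT-based score, which makes such a reference routine convenient for checking an index-based implementation on small inputs.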

Only slight modifications have to be made to our algorithms, because the fundamental property of score dominance (in Equation 3) holds not only for COUNT (i.e., the default top-k dominating query), but also for SUM and MAX. The case for SUM can be directly solved by our algorithms. Regarding MAX, the counting operations (in ITD, LCG) and the incremental refinement of score bounds (in PBT) need to be modified for MAX correspondingly. Interestingly, MAX provides us with an opportunity to further optimize such counting operations and score refinements. As an example, Figure 7a shows the locations of the points with their weights in brackets. The points are indexed by a MAX aR-tree and the non-leaf entries $e_2$ and $e_3$ are augmented with the weights 0.9 and 0.7 respectively. Suppose that we need to compute $\mu_{max}(p_1)$, the score of $p_1$. We first access the child node of $e_2$ and update $\mu_{max}(p_1)$ to 0.9. Now, even though $p_1$ partially dominates $e_3$, we need not access the node of $e_3$ as it cannot further improve $\mu_{max}(p_1)$.

Note that query results for MIN can be obtained by evaluating a query for MAX. Specifically, assuming that the interval $[0, 1]$ is the domain of possible weights $w(p)$, our algorithms can be adapted as follows: (i) for each visited point (and entry), convert its weight $w(p)$ to $1 - w(p)$, (ii) evaluate the query for MAX to retrieve results, and (iii) at the end, transform each result value $v$ to $1 - v$ to obtain the final results.

Figure 7: Variants of top-k dominating queries. (a) Dominating MAX query; (b) Bichromatic query (x: time to conf. venue, y: price).

6.2 Bichromatic Top-k Dominating Queries

Given a provider dataset $D_P$ and a consumer dataset $D_A$, the score of an object $p \in D_P$ is defined as:

$$\mu_A(p) = |\{\, a \in D_A \mid p \prec a \,\}| \qquad (8)$$

A bichromatic top-k dominating query retrieves the $k$ data objects in $D_P$ with the highest $\mu_A$ score.
As an example of the applicability of this query, consider the points in Figure 7b, where $D_P = \{p_1, p_2, p_3\}$ stores the feature values of different hotels (shown as white points) and $D_A = \{a_1, a_2, a_3, a_4\}$ records the requirements for a hotel specified by different customers (shown as black points). For example, customer $a_1 = (0.55, 0.73)$ will only stay in a hotel whose $x$ (time to the conference venue) and $y$ (room price) values are at most 0.55 and 0.73 respectively. The bichromatic top-k dominating query could be used to find the most popular hotel; i.e., the one that fulfills the requirements of the largest number of customers. In this example, we have $\mu_A(p_1) = 2$, $\mu_A(p_2) = 3$, and $\mu_A(p_3)$ =. Thus, the bichromatic top-1 point is $p_2$.

Algorithms ITD and LCG can be adapted for bichromatic queries with slight modifications. In particular, candidate points are accessed from the aR-tree on $D_P$ while their scores are counted using the aR-tree on $D_A$. The extensions of PBT for bichromatic queries are more complex. Two sets $S_P$ and $S_A$ are employed for managing visited entries in $D_P$ and $D_A$ respectively, and initially they contain the root entries of the corresponding trees. First, a non-leaf entry $e_A$ (e.g., according to the CBT order) is removed from $S_A$. After accessing the child node of $e_A$, its entries are inserted into $S_A$ in order to refine the score bounds of entries in $S_P$. Second, a non-leaf entry $e_P$ (e.g., according to the CBT order) is removed from $S_P$. After accessing the child node of $e_P$, its entries are inserted into $S_P$ and their score bounds are refined by entries in $S_A$. Whenever score bounds of entries in $S_P$ change, the result set $W$ and the best-k score $\gamma$ are updated. In addition, an entry $e_x \in S_P$ is pruned when its upper bound score $\mu^u_A(e_x)$ is below $\gamma$. On the other hand, an entry in $S_A$ is pruned if it is not partially dominated by any entry $e_x \in S_P$ with $\mu^u_A(e_x) \geq \gamma$. The above procedure repeats until $S_A$ becomes empty and $S_P$ contains the same objects as $W$ (i.e., all other entries in $S_P$ have been eliminated).
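The bichromatic score of Equation 8 can be checked on small inputs with a brute-force Python sketch. The function names and the coordinates below are hypothetical (only the first customer requirement, (0.55, 0.73), comes from the text; the rest of Figure 7b is not reproduced), and the sketch scans both datasets instead of traversing aR-trees.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]

def dominates(p: Point, q: Point) -> bool:
    """p dominates q: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def bichromatic_scores(providers: Dict[str, Point],
                       consumers: Sequence[Point]) -> Dict[str, int]:
    """mu_A(p): number of consumer requirements dominated by provider p."""
    return {name: sum(1 for a in consumers if dominates(p, a))
            for name, p in providers.items()}

def bichromatic_top_k(providers: Dict[str, Point],
                      consumers: Sequence[Point], k: int) -> List[Tuple[str, int]]:
    scores = bichromatic_scores(providers, consumers)
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

# Made-up hotels (time to venue, price) and customer requirements.
hotels = {"h1": (0.2, 0.8), "h2": (0.4, 0.3), "h3": (0.9, 0.9)}
customers = [(0.55, 0.73), (0.3, 0.9), (0.5, 0.5), (0.8, 0.2)]
scores = bichromatic_scores(hotels, customers)
most_popular = bichromatic_top_k(hotels, customers, 1)
```

Here hotel `h2` satisfies two of the four made-up requirements and is returned as the most popular provider; a hotel that exactly matches a requirement in every dimension is not counted, since Equation 8 uses strict dominance.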
7 Experimental Evaluation

In this section, we experimentally evaluate the performance of the proposed algorithms. All algorithms in Table 1 were implemented in C++ and experiments were run on a Pentium D 2.8GHz PC with 1GB of RAM. For fairness to the STD algorithm [23], it is implemented with the spatial aggregation technique (discussed in Section 2.1) for optimizing counting operations on aR-trees. In Section 7.1 we present an extensive experimental study of the efficiency of the algorithms with synthetically generated data. Section 7.2 studies the performance of the algorithms on real data and demonstrates the meaningfulness of top-k dominating points.

Name    Description
STD     Skyline-Based Top-k Dominating Algorithm [23]
ITD     Optimized version of STD (Sec. 3.2)
SCG     Simple Counting Guided Algorithm (Sec. 4)
LCG     Lightweight Counting Guided Algorithm (Sec. 4)
UBT     Upper-bound Based Traversal Algorithm (Sec. 5)
CBT     Cost-Based Traversal Algorithm (Sec. 5)
Table 1: Description of the algorithms

7.1 Experiments With Synthetic Data

Data generation and query parameter values. We produced three categories of synthetic datasets to model different scenarios, according to the methodology in [2]. UI contains datasets where point coordinates are random values uniformly and independently generated for different dimensions. CO contains datasets where point coordinates are correlated. In other words, for a point $p$, its $i$-th coordinate $p[i]$ is close to $p[j]$ in all other dimensions $j \neq i$. Finally, AC contains datasets where point coordinates are anti-correlated. In this case, points that are good in one dimension are bad in one or all other dimensions. Table 2 lists the range of parameter values and their default values (in bold type). Each dataset is indexed by an aR-tree with a page size of 4K bytes. We used an LRU memory buffer whose default size is set to 5% of the tree size.
Parameter                  Values
Buffer size (%)            1, 2, 5, 10, 20
Data size, N (million)     0.25, 0.5, 1, 2, 4
Data dimensionality, d     2, 3, 4, 5
Number of results, k       1, 4, 16, 64, 256
Table 2: Range of parameter values

Lightweight counting optimization in Counting-Guided search. In the first experiment, we investigate the performance savings when using the lightweight counting heuristic in the counting-guided algorithm presented in Section 4. Using a default uniform dataset, for different locations of a non-leaf entry $e$ (after fixing all coordinates of $e$ to the same value $v$), we compare (i) node accesses
