To appear in IEEE TKDE Title: Efficient Skyline and Top-k Retrieval in Subspaces Keywords: Skyline, Top-k, Subspace, B-tree

Size: px

Start display at page:

Download "To appear in IEEE TKDE Title: Efficient Skyline and Top-k Retrieval in Subspaces Keywords: Skyline, Top-k, Subspace, B-tree"

Marilyn McLaughlin
5 years ago
Views:

1 To aear in IEEE TKDE Title: Efficient Skyline and To-k Retrieval in Subsaces Keywords: Skyline, To-k, Subsace, B-tree Contact Author: Yufei Tao Deartment of Comuter Science and Engineering Chinese University of Hong Kong Sha Tin, Hong Kong Tel: Fax:

2 Efficient Skyline and To-k Retrieval in Subsaces Yufei Tao Xiaokui Xiao Jian Pei Deartment of Comuter Science and Engineering School of Comuting Chinese University of Hong Kong Simon Fraser University Sha Tin, Hong Kong Burnaby, BC Canada V5A S6 {taoyf, Abstract Skyline and to-k queries are two oular oerations for reference retrieval. In ractice, alications that require these oerations usually rovide numerous candidate attributes, whereas, deending on their interests, users may issue queries regarding different subsets of the dimensions. The existing algorithms are inadequate for subsace skyline/to-k search because they have at least one of the following defects: they (i) require scanning the entire database at least once; (ii) are otimized for one subsace but incur significant overhead for other subsaces; (iii) demand exensive maintenance cost or sace consumtion. In this aer, we roose a technique, SUBSKY, which settles both tyes of queries using urely relational technologies. The core of SUBSKY is a transformation that converts multidimensional data to D values. These values are indexed by a simle B-tree, which allows us to answer subsace queries by accessing a fraction of the database. SUBSKY entails low maintenance overhead, which equals the cost of udating a traditional B-tree. Extensive exeriments with real data confirm that our technique outerforms alternative solutions significantly in both efficiency and scalability. Introduction A multidimensional oint dominates another, if the coordinate of on each axis does not exceed that of, and is strictly smaller on at least one dimension. Given a set of oints, the skyline consists of all the oints that are not dominated by others. Figure shows a dataset with dimensionality d = 2. The x-dimension reresents the rice of a hotel, and the y-axis catures its distance to the beach. Hotel dominates 2, because the former is cheaer, and closer to the beach. The skyline includes, 4, and 5, which offer various tradeoffs between rice and distance: 4 is the nearest to the beach, 5 is the cheaest, and may be a good comromise of the two factors. The notion of skyline is generalized to skyband in [22]. Secifically, the k-skyband of a dataset includes all the oints that are dominated by less than k oints. For instance, the 2-skyband of the dataset in Figure contains all the objects excet 7 and 8. Clearly, the skyline is the -skyband. In general, for any k and k satisfying k < k, the k -skyband is a subset of the k-skyband.

3 distance to the beach (y) O.2.4 rice (x).6.8 Figure : A dataset with hotel records Skylines have been extensively studied in the literature, due to their close relationshi to reference search. A reference is usually formulated through a monotone reference function g, which returns a score g() for every oint. Given such a function, a to-k query retrieves the k oints in a dataset with the lowest scores. For examle, for g() = 3[x] + [y], the to- hotel in Figure is (score.8). Regardless of the choice of g, the to- object always lies in the skyline. Furthermore, every skyline oint is the to- object for a certain function (i.e., a skyline does not contain any redundant oint for to- search [5]). Similarly, the k-skyband contains all and only the objects retrieved by to-k queries. The motivation of this work is that, in ractice, a skyline/reference search alication tyically rovides numerous candidate attributes, whereas a user chooses only a small number of them in her/his query. Assume that, in addition to the dimensions in Figure, the database also stores the distances of each hotel to several other locations (e.g., the town center, the nearest suermarket, subway station, etc.), the ratings of security, air-quality, traffic-status in the neighborhood, and so on. It is unlikely that a customer would consider all the attributes in selecting her/his hotel. Instead, s/he would take into account only some of them, i.e., a subsace of the universe. Alternative customers may have different concerns. Therefore, the system must be reared to erform skyline/to-k retrieval in a variety of subsaces. Unfortunately, this observation has been ignored by the revious research. As discussed later, the existing skyline/to-k algorithms are otimized for the whole universe, but entail exensive cost for subsace queries. This aer resents the first study on indexes for efficient skyline and to-k comutation in arbitrary subsaces. We develo SUBSKY, a novel technique that settles both roblems using urely Monotonicity means g() > g( ) for two arbitrary oints and, which share the same coordinates on d dimensions, and has a larger coordinate on the remaining dimension. 2

4 relational technologies, and hence, can be incororated into a conventional database system immediately 2. The core of SUBSKY is a transformation that converts each multidimensional oint to a D value. The converted values are indexed by a B-tree, which can be used to handle all tyes (i.e., skyline, skyband, to-k) of queries effectively. In the resence of tule insertions/deletions, the tree can be maintained at the same cost of udating a traditional B-tree. Extensive exeriments confirm that the roosed solutions significantly outerform the state-of-the-art skyline/to-k algorithms. The rest of the aer is organized as follows. Section 2 reviews the revious work related to ours. Section 3 adats the existing algorithms for subsace skyline/to-k rocessing, and elaborates their deficiencies. Section 4 resents the basic SUBSKY otimized for skyline search on uniform data, and Section 5 generalizes the technique to arbitrary data distributions. Section 6 discusses skyband and to-k rocessing. Section 7 contains an exerimental evaluation that demonstrates the efficiency of SUBSKY. Section 8 concludes the aer with directions for future work. 2 Related Work Section 2. surveys the algorithms for comuting skylines in the whole universe. Then, Section 2.2 discusses the sky-cube that is highly relevant to subsace skylines. Finally, Section 2.3 reviews the revious work on to-k search. 2. Skyline Retrieval in the Universe The existing algorithms can be classified in two categories. The first one involves solutions that do not assume any rerocessing on the underlying dataset, but they retrieve the skyline by scanning the entire database at least once. The second category reduces query cost by utilizing an index structure. In the sequel, we survey both categories, focusing on the second one, since it also involves our solutions. Algorithms Requiring No Prerocessing. The first skyline algorithm in the database context is BNL (block-nested-loo) [5], which simly insects all airs of oints, and returns an object if it is not dominated by any other object. SFS [] (sort-filter-skyline) is based on the same rationale, but imroves the erformance by sorting the data according to a monotone function. The erformance 2 Including a non-relational method into a commercial DBMS is difficult, because it requires fixing comlex issues related to concurrency control, recovery, etc. 3

5 List (x) 5 :. 6 :.3 2 :.4 7 :.6 List 2 (y) 4 :. :.2 3 :.3 8 :.5 (a) The sorted lists used by Index List (x) 5 :.2 :.2 6 :.3 2 :.4 3 :.5 7 :.6 8 :.9 9 :.9 List 2 (y) 4 :. :.2 3 :.3 2 :.4 8 :.5 6 :.7 7 :.8 5 :.9 (b) The sorted lists used by TA Figure 2: Illustration of algorithms leveraging sorted lists of BNL and SFS is analyzed in [25]. D&C [5] (divide-and-conquer) divides the universe into several regions, calculates the skyline in each region, and roduces the final skyline from the regional skylines. When the entire dataset fits in memory, this algorithm roduces the skyline in O(n log d 2 n + n log n) time, where n is the dataset cardinality and d its dimensionality. Bitma [26] converts each oint to a bit string, which encodes the number of oints having a smaller coordinate than on every dimension. The skyline is then obtained using only bit oerations. LESS (linear-elimination-sort for skyline) [4] is an algorithm that has good worst-case asymtotical erformance. Secifically, when the data distribution is uniform and no two oints have the same coordinate on any dimension, LESS comutes the skyline in O(d n) time in exectation. Algorithms Based on Sorted Lists. Index [26] organizes the dataset into d lists. The i-th list ( i d) contains oints with the following roerty: [i] = min d j= [j], where [i] is the i-th coordinate of. Figure 2a shows the d = 2 lists for the dataset of Figure. For examle, 5 is assigned to List because its x-coordinate. is smaller than its y-coordinate.9. In case a oint has identical coordinates on both dimensions, the list containing it is decided arbitrarily (in Figure 2a, 2 and are randomly assigned to Lists and 2, resectively). The entries in List (2) are sorted in ascending order of their x- (y-) coordinates (e.g., entry 5 :. indicates the sorting key. of 5 ). To comute the skyline, Index scans the two lists in a synchronous manner. At the beginning, the algorithm initializes ointers tr and tr 2 referencing the first entries 5, 4, resectively. Then, at each ste, Index rocesses the referenced entry with a smaller sorting key. Since both 5 and 4 have the same key., Index randomly icks one for rocessing. Assume that 5 is selected; it is added to the skyline S sky, after which tr is moved to 6. As 4 has a smaller key (than 6 ), it is the second oint rocessed. 4 is not dominated by any oint in S sky, and hence, is inserted in S sky. Pointer tr 2 is then shifted to. Similarly, is rocessed next, and included in the skyline, 4

6 y O the dominant region of N N N N N N x N N N N Figure 3: Illustration of BBS N N after which tr 2 is set to 3. At this stage, S sky = {, 4, 5 }. Both coordinates of are smaller than the x-coordinate.3 of 6 (referenced by tr ), in which case all the not-yet insected oints in List can be runed. To understand this, observe that both coordinates of are at least.3, indicating that is dominated by. Due to the same reasoning, List 2 is also eliminated because both coordinates of are lower than the y-coordinate of 3 (referenced by tr 2 ). The algorithm finishes with {, 4, 5 } as the result. Borzsonyi et al. [5] develo an algorithm TA 3 that deloys a different set of sorted lists. For a d-dimensional dataset, the i-th list ( i d) enumerates all the objects in ascending order of their i-th coordinates. Figure 2b demonstrates the two lists for the dataset in Figure. TA scans the d lists synchronously, and stos as soon as the same object has been encountered in all lists. For instance, assume that TA accesses the two lists in Figure 2b in a round-robin manner. It terminates the scanning after seeing in both lists. At this moment, it has retrieved 5, 4 and. Clearly, if a oint has not been fetched so far, must be dominated by, and thus, can be safely removed from further consideration. On the other hand, 5, 4 and may or may not be in the skyline. To verify this, TA obtains the y-coordinate of 5 (notice that the scanning discovered only its x- coordinate), and x-coordinate of 4. Then, it comutes the skyline from { 5, 4, }, which is returned as the final skyline. Algorithms Based on R-trees. NN (nearest-neighbor) [8] and BBS (branch-and-bound skyline) [22] find the skyline using an R-tree [2]. The difference is that NN issues multile NN queries [5] while BBS erforms only a single traversal of the tree. It has been roved [22] that BBS is I/O otimal, i.e., it accesses the least number of disk ages among all algorithms based on R-trees 3 This algorithm is called B-tree in [5]. We refer to it as TA to emhasize its connection to Fagin s threshold algorithm [2]. 5

7 (including NN). Hence, the following discussion concentrates on this technique. Figure 3 shows the R-tree for the dataset of Figure, together with the minimum bounding rectangles (MBR) of the nodes. BBS rocesses the (leaf/intermediate) entries in ascending order of their mindist (minimum distance) to the origin of the universe. At the beginning, the root entries are inserted into a min-hea H (= {N 5, N 6 }) using their mindist as the sorting key. Then, the algorithm removes the to element N 5 of H, accesses its child node, and en-heas all the entries there. H now becomes {N, N 2, N 6 }. Similarly, the next node visited is leaf N, where the data oints are added to H (= {, 2, N 2, N 6 }). Since tos H, it is taken as the first skyline oint, and used for runing in the subsequent execution. Point 2 is de-heaed next, but is discarded because it falls in the dominant region of (the shaded area). BBS then visits N 2, and inserts only 4 into H = {N 6, 4 } ( 3 is not inserted as it is dominated by ). Likewise, accessing N 6 adds only one entry N 3 to H (= {N 3, 4 }) because N 4 lies comletely in the shaded area. Following the same rationale, the remaining entries rocessed are N 3 (en-heaing 5 ), 4, 5, at which oint H becomes emty and BBS terminates. Retrieval of Skyline Variants. Balke et al. [] consider skyline comutation in distributed environments. Also in the distributed framework, the work of Huang et al. [7] studies skyline search on mobile objects. Lin et al. [9] investigate continuous skyline monitoring on data streams. In [6], skylines are extended to artially-ordered domains. Chan et al. [8, 7] roose various aroaches to imrove the usefulness of skylines in high-dimensional saces. The above methods, however, are restricted to their secific scenarios, and cannot be adated to the roblem of this aer. 2.2 The Sky-Cube Pei et al. [23] and Yuan et al. [3] indeendently roose the sky-cube, which consists of the skylines in all ossible subsaces. In the sequel, we exlain this concet, assuming that no two oints have the same coordinate on any axis (see [23, 3] for a general discussion overcoming the assumtion). Suose that the universe has d = 3 dimensions x, y, and z. Figure 4 shows the 7 ossible nonemty subsaces. All the oints in the skyline of a subsace belong to the skyline of a subsace containing additional dimensions. For instance, the skyline of subsace xy is a subset of the skyline 6

8 xyz xy xz yz x y z Figure 4: The lattice of skycube in the universe, reresented by an edge between xy and xyz in Figure 4. Secially, if a subsace involves only a single dimension, its skyline consists of the oint having the smallest coordinate on this axis. The skycube can be comuted in a to-down manner. First, we retrieve the skyline of the universe. Then, a child skyline can be found by alying a conventional algorithm on a arent skyline (instead of the original database). For examle, the skyline of xyz can roduce those of xy, xz, and yz, while the skyline of x can be obtained from that of either xy or xz. To reduce cost, several heuristics are roosed in [23, 3] to avoid the common comutation in different subsaces. Xia and Zhang [29] exlain how to dynamically maintain a skycube, after the underlying database has been udated. 2.3 To-k Search in the Universe There is a bulk of research on distributed to-k rocessing (see [2] and the references therein). In that scenario, the data on each dimension is stored at a different server; the goal is to find the to-k objects with the least network communication. Our work falls in the category of centralized to-k search, where all the dimensions are retained at the same server, and the objective is to minimize queries CPU and I/O cost. Next, we concentrate on this category. Chang et al. [9] develo ONION, which answers only to-k queries with linear reference functions. Hristids and Paakonstantinou [6] roose the PREFER system, which suorts a broader class of reference functions, but requires dulicating the database several times. Yi et al. [3] suggest a similar aroach with lower maintenance cost. Tsaaras et al. [28] resent a technique that can handle arbitrary reference functions. This technique, however, is limited to two dimensions, and suorts only to-k queries whose k does not exceed a certain constant. The state-of-the-art solution [27, 28] is based on best-first traversal [5] on an R-tree. It enables to-k queries with any k and monotone reference function, on data of arbitrary dimensionality. 7

9 3 Extending the Previous Algorithms to Subsaces Without loss of generality, we assume a d-dimensional universe where each axis has domain [, ]. Given a oint, [i] denotes its coordinate on the i-th dimension ( i d). BNL, SFS, D&C, Bitma, and LESS (in general, any algorithm that does not demand rerocessing) can be trivially extended to comute the skyline in a subsace, by ignoring the irrelevant coordinates of each oint. However, they entail exensive cost by scanning the entire dataset multile times. Index can be adated for subsace skyline retrieval, by re-constructing the underlying data structure for every query. Assume, for examle, d =. As mentioned in Section 2., the rerocessing of Index creates ten sorted lists L,..., and L, where L i ( i ) includes all oints satisfying [i] = min{[],...,[]}. Consider a query that aims at finding the skyline in the first two dimensions. To aly Index, we must organize the database into two sorted lists L and L 2, where L i ( i 2) contains all oints such that [i] = min{[],[2]}. Notice that, the recomuted L,..., L rovide little hel for deriving L and L 2. The most efficient way to calculate L and L 2 would ignore the re-comuted lists, scan the database once to assign each object to L or L 2, and then erform two external sorts to obtain the roer order in the two lists. TA is directly alicable to subsace comutation, by oerating on only the sorted lists corresonding to the axes of the target subsace. Unfortunately, this technique is inaroriate for dynamic datasets, because its sorted lists are costly to maintain. Recall that, every object has an entry in each of the d sorted lists. Since a list is organized with a B-tree [5], a single tule insertion/deletion requires modifying d B-trees, which is rohibitively exensive for large d. NN and BBS can find a subsace skyline by ignoring the extents of an MBR along the irrelevant dimensions. It suffices to discuss only BBS since NN is always slower. Imagine the rectangles in Figure 3 as the rojections, in the 2D subsace demonstrated, of the MBRs in a d-dimensional R- tree for any d > 2. BBS retrieves the skyline of the subsace in exactly the same way as described in Section 2.. The erformance of BBS, however, severely degrades when the dimensionality d increases, due to two reasons. First, as d grows, MBRs become considerably larger, which significantly decreases the robability that an MBR falls in the dominant regions of the skyline oints (recall that this is the runing condition of BBS). The other (less obvious) reason is the emergence of the random grouing henomenon, after 8

10 d exceeds a certain threshold. Consider, for examle, a 5D uniform dataset with k oints. Given a node caacity of entries, the R-tree would contain aroximately leaf nodes. Since 2, each leaf MBR has been slit once along roughly dimensions, while the remaining 5 axes are not considered at all in the R-tree construction [3, 4]. Assume that we use the R-tree to retrieve the skyline in a 2D subsace including any two of those 5 dimensions. The exected erformance is as oor as deloying a athological 2D R-tree, which grous the oints (in that 2D subsace) into leaf nodes in a comletely random manner, regardless of their satial roximity. Finally, the skycube algorithm discussed in Section 2.2 is not suitable for retrieving the skyline in one subsace, because it erforms unnecessary work by fetching skylines in many non-requested subsaces. An alternative aroach is to re-comute the entire skycube. In that case, although the skyline of any subsace can be obtained immediately, the skycube occuies significant sace, and incurs exensive udate cost (whenever the original database is udated). In summary, the existing aroaches are inadequate for subsace skyline search because they have at least one of the following defects: they (i) require scanning the entire database at least once; (ii) are otimized for one subsace but incur significant overhead for other subsaces; (iii) demand exensive maintenance cost or sace consumtion. In the following sections, we remedy all defects by roosing a new technique. 4 The Basic SUBSKY We use the term maximal corner for the corner A C of the universe having coordinate on all dimensions. Each data oint is converted to a D value f(), equal to the L distance between and A C : f() = L (,A C ) = max d i= Point dominates all oints satisfying the following inequality: ( [i]). () f( ) < min d ( [i]) (2) i= Figure 5 illustrates a 2D examle. Points obeying the inequality constitute the shaded square, whose side length equals min 2 i=( [i]). Obviously, no such can aear in the skyline because the square is entirely contained in the dominant region of. 9

11 y region runed by A O Figure 5: Illustration of Inequality 2 x Inequality 2 alies to the whole universe, while a similar result exists in any subsace. Reresenting a subsace as a set SU B caturing the relevant dimensions (e.g., if the subsace involves the st and 3rd axes of the universe, then SUB = {, 3}), we have: Proerty. Given an arbitrary oint, no oint qualifying the following condition can belong to the skyline of SUB: f( ) < min ( [i]) (3) i SUB For examle, assume d = 3, and that the goal is to retrieve the skyline in SUB = {, 2}. If has coordinates (.5,., ) (the 3rd coordinate is irrelevant), no oint with f( ) < min(.5,.) =.9 can be in the target skyline. This is correct because the coordinates of must be at least. on all dimensions; hence, is dominated by in SUB. Proerty leads to an algorithm, referred to as the basic SUBSKY, for comuting the skyline in a subsace SUB. Secifically, we access the data oints in descending order of their f(). Meanwhile, we maintain (i) the current set S sky of skyline oints (among the data already examined), and (ii) a value U equal to the largest min i SUB ( [i]) of the objects S sky. The algorithm terminates when U exceeds the f() of the next to be rocessed. The seudocode of the basic SUBSKY is given in Figure 6. We illustrate the basic SUBSKY using the 8 three-dimensional oints in Figure 7 (the x- and y- coordinates are the same as in Figure ), where the last row indicates the f-value of each oint. The query subsace SUB is {, 2}. Objects are rocessed in this order: { 3, 4, 5,, 6, 2, 8, 7 }. After examining 3, SUBSKY initializes S sky as { 3 }, and U as min i SUB ( 3 [i]) =.5. After the second oint 4 is insected, S sky becomes { 3, 4 }, since 4 is not dominated by 3 ; U remains.5, because.5 > min i SUB ( 4 [i]) =.. Similarly, next 5 is added to S sky without

12 Algorithm BASIC-SUBSKY (SUB) /* SUB includes the dimensions relevant to the query subsace */. DB = the given set of data oints sorted in descending order of their f-values 2. U = ; S sky = 3. = the first oint in DB 4. while and U f() do 5. if is not dominated by any oint in DB 6. add to S sky 7. remove from S sky the oints dominated by 8. U = the maximum min i SUB ( [i]) of all S sky 9. set to the next oint in DB. return S sky Figure 6: The basic SUBSKY algorithm dimension (x) (y) (z) f( i ) Figure 7: An examle dataset affecting U. To handle the 4th oint, we insert it in S sky, but remove 3 from S sky, as 3 is dominated by. Moreover, U is increased to min i SUB ( [i]) =.8. The algorithm roceeds to insect 6, which does not change S sky and U. Since the f-values of the remaining oints are smaller than the current U =.8, the algorithm finishes and reorts {, 4, 5 } as the final skyline. As shown in the exeriments, desite its simlicity, the basic SUBSKY is highly efficient in comuting subsace skylines, when the data distribution is uniform. Next we rovide the intuition, under the same settings as used in Section 3 to illustrate the defects of BBS. Consider a 5D uniform dataset with cardinality k. Assume that we want to retrieve the skyline in a subsace SUB containing any two dimensions, as shown in Figure 8. There is a high chance that a skyline dimension 2 in SUB (l, l) O dimension in SUB Figure 8: Illustration of the analysis

13 y A A O x Figure 9: Pruning effects of different anchors oint lies very close to the origin in SUB. Secifically, let us examine the square in Figure 8 whose lower-left corner is the origin, and its side length equals some small λ [, ]. The robability of not having any object in the square equals ( λ 2 ), which is less than % for λ =.. In other words, with at least 9% chance, we can find a oint in the square such that min i SUB ( [i]).999. In this case, (by Proerty ) all oints with f( ) <.999 are eliminated by SUBSKY. The exected ercentage of such oints in the whole dataset equals the volume of a 5-dimensional square with side length.999. The volume evaluates to = 98.5%, that is, we only need to access.5% of the dataset! 5 The General SUBSKY In the basic SUBSKY, f() is always comuted using one anchor, i.e., the maximal corner A C. This works fine for uniform data. In ractice where data is clustered, however, the f() of various should be calculated with resect to different anchors to achieve greater runing ower. To illustrate this, Figure 9 shows a 2D dataset, where all objects gather around the uer-left corner. In the basic SUBSKY, (by Proerty ) oint runes the right shaded square, which is useless since the square does not cover any object. Alternatively, let us comute f() as the L distance from to another anchor A. As a direct corollary of Proerty, we can eliminate all oints whose f( ) is smaller than min 2 i=(a [i] [i]). These oints form the left shaded square, which encloses a significant ortion of the dataset. Namely, A offers stronger runing ower than A C. Based on this idea, in Sections , we develo the general version of SUBSKY. Finally, Section 5.5 discusses issues related to udates and other reference directions. 2

14 5. Pruning with Multile Anchors Given a dataset, we will comute a set S anc of anchors A, A 2,..., A m. Then, every data oint is converted to a D value as follows. Definition. Each data oint is converted to a value f() = L (,A), where A belongs to S anc, and is called the assigned anchor of. Next, we formalize our runing heuristic based on multile anchors. Proerty 2. Let be an arbitrary object, and S anc the set of anchors whose rojections in susace SUB are dominated by. Then, for each anchor A S anc, an object assigned to A cannot be in the skyline if f( ) < min (A[i] [i]) (4) i SUB The above result degenerates to Proerty when A = A C. To exlain the case where A A C, consider d = 3, and A = (.8,.7,.). In subsace SUB = {, 2}, an object = (.2,.2, ) eliminates all oints assigned to A with f( ) < min(a[] [],A[2] [2]) =.5. Note that the first and second coordinates of must be larger than.3 and.2 resectively, indicating that is dominated by in SUB. The runing effectiveness of Proerty 2 deends on (i) how data is assigned to anchors, and (ii) how the anchors are selected. We analyze these issues in the next two subsections, resectively. 5.2 Assigning Points to Anchors We introduce the concet of effective region. Definition 2. Given a data oint and an anchor A dominated by, the effective region (ER) of with resect to A is a d-dimensional rectangle whose oosite corners are the origin and the oint having coordinate A[i] L (,A) on the i-th dimension ( i d). The ER does not exist, if A[i] < L (,A) for any i [,d]. ERs are closely related to the benefit of assigning a oint to an anchor. To understand this, consider Figure a where d = 2, and oint is assigned to A C. As a result, in finding the skyline in the 3

15 y f() A y f() A the ER of the ER of O x O x (a) Assigning to A C (b) Assigning to A Figure : The concet of effective region universe, can be eliminated with Proerty 2, if and only if another oint is discovered in the shaded square, which is the ER of with resect to A C. For examle, since ( 2 ) is inside (outside) the ER, we can (cannot) eliminate after encountering ( 2 ). Let us assign to an alternative anchor A in Figure b. The shaded area demonstrates the new ER of, which covers both and 2. This means that can be eliminated as long as either or 2 has been discovered. Comared with assigning to A C, the new assignment increases the chance of runing. Motivated by this, we assign to the anchor that roduces the largest ER of. Secifically, this is the anchor that is dominated by A and maximizes: d max(,a[i] L (,A)) (5) 5.3 Finding the Anchors i= An anchor that leads to a large ER for one data oint may roduce a small ER for another. When we are allowed to kee only m anchors (where m is a small integer, set as a system arameter), how should they be selected in order to maximize the ER volumes of as many oints as ossible? Notice that the largest ER of a oint corresonds to its anti-dominant region, consisting of all the oints in the universe dominating. These oints form a rectangle that has the origin and as its oosite corners. In other words, the maximum value of Formula 5 equals Π d i=[i], which is achieved when A[i] L (,A) = [i] (6) holds on all dimensions i [,d]. We refer to an anchor A satisfying the above equation as a erfect anchor for. 4

16 y a good anchor A ' 2 ' A C r r 2 2 major erendicular lane l A O anti-dominant region of x (a) Perfect rays Figure : Finding anchors B (b) Deciding the anchor It turns out that each oint has infinite erfect anchors. Let us shoot a ray from that is in its dominant region, and arallel to the major diagonal of the data sace (i.e., the diagonal connecting the origin and the maximal corner). Every oint A on this erfect ray is a erfect anchor of. This is because the coordinate difference between A and is equivalent on all axes, i.e., A[i] [i] = L (,A) for any i [,d], thus establishing Equation 6. In Figure a, for examle, the erfect ray of oint is r, and any anchor on r will result in the ER of that is the shaded rectangle (i.e., the anti-dominant region of ). Similarly, r 2 is the ray for 2. Since r and r 2 are very close to each other, if we can kee only a single anchor A, it would lie between the two rays as in Figure a. Although A is not the erfect anchor of and 2, it is a good anchor as it leads to large ERs for both oints. The imortant imlication of the above discussion is that oints with close erfect rays may share the same anchor. This observation naturally leads to an algorithm for finding anchors based on clustering. Secifically, we first roject all objects onto the major erendicular lane, i.e., the d-dimensional lane that asses the maximal corner, and is erendicular to the major diagonal of the universe. In Figure a (d = 2), for instance, the lane is line l, and the rojections of and 2 are and 2, resectively. Then, we artition the rojected oints into m clusters using the k-means algorithm [, 24], and formulate an anchor for each cluster. It remains to clarify how to decide an anchor A for a cluster S. We aim at guaranteeing that A should roduce a non-emty ER for every oint S (i.e., A[i] > L (,A) on every dimension i, as suggested in Definition 2); otherwise, cannot be assigned to A. We illustrate the algorithm using a 2D examle, but the idea generalizes to arbitrary dimensionality in a straightforward manner. 5

17 Assume that S consists of 5 objects, 2,..., 5. The algorithm examines them in the original universe (i.e., not in the major erendicular lane), as shown in Figure b. We first obtain oint B, whose coordinate on each dimension equals the lowest coordinate of the oints in S on this axis. Note that B necessarily falls inside the data sace, and dominates all the oints. Then, we comute the smallest square that covers all the oints in S (see Figure b). The anchor A for S is the corner of the square oosite to B. 5.4 The Data Structure and Query Algorithm We are ready to clarify the details of our SUBSKY technique. Given a small number m (less than in our exeriments), SUBSKY first obtains m anchors, by alying the method in Section 5.3 on a random subset of the database. Then, the f() of each oint is set to the L distance between and its assigned anchor (which maximizes the volume of ER among the anchors dominated by ). We guarantee the existence of such an anchor by always including the maximal corner in the anchor set. SUBSKY manages the resulting f() with a single B-tree that searates the oints assigned to various anchors. We achieve this by indexing a comosite key (j, f()), where j [,m] is the id of the anchor to which is assigned. Thus, an intermediate entry e of the B-tree has the form (e.id, e.f), which means that (i) each oint in the subtree of e has been assigned to the j-th anchor with j e.id, and (ii) in case j = e.id, the value of f() is at least e.f. We illustrate the above rocess using the 3D dataset of Figure 7 and m = 2 anchors: the maximal corner A (= A C ), and A 2 = (,,.8). The second row of Figure 3a illustrates the ER volume of each data oint with resect to A, calculated by Equation 5. For instance, the volume 25 ( 3 ) of 8 is derived from Π 3 i=(a [i] L (A, 8 )) = (.5) 3. Similarly, the third row contains the ER volumes with resect to A 2. A means that the corresonding ER does not exist. For examle, the ER of 2 is undefined because 2 does not dominate A 2, while there is no ER for 4 since A 2 [3] =.8 is smaller than L (A 2, 4 ) =.9 (review Definition 2). The white cells of the table indicate each oint s ER-volume with resect to its assigned anchor. For examle, 3 is assigned to A 2 since this anchor roduces a larger ER than A. Figure 3b shows the B-tree indexing the transformed f-values, e.g., the leaf entry 3 :(2,.7) in node N 4 catures the fact that f( 3 ) equals the L distance.7 between A 2 and 3. 6

18 Algorithm SUBSKY (SUB, {A, A 2,... A m }) /* SUB includes the dimensions relevant to the query subsace; A,..., A m are the anchors */. for j = to m 2. use a B-tree to find the oint tr j with the maximum f(tr j ) among all the oints assigned to A j 3. tr j.er = d i= (A j[i] L (tr j, A j )) //Equation 5 4. S sky = //the set of skyline oints 5. while (tr j for any j [, m]) 6. t = the value of j giving the smallest tr j.er among all j [, m] such that tr j 7. if tr t is not dominated by any oint in S sky 8. remove from S sky the oints dominated by tr t 9. S sky = S sky {tr t }. for j = to m, and j t. if tr j and f(tr j ) < min i SUB (A j [i] tr t [i]) 2. tr j = /* no oint assigned to A j can belong to the skyline (Proerty 2) */ 3. tr t = the oint with the next largest f() among the data assigned to A t (this oint lies in either the same leaf as the revious tr t, or a neighboring node) 4. if tr t 5. for every oint sky S sky 6. if f(tr t ) < min i SUB (A t [i] sky [i]) 7. tr t = /* no oint assigned to A t can belong to the skyline (Proerty 2) */ 8. tr t.er = d i= (A t[i] L (tr t, A t )) 9. return S sky Figure 2: The algorithm of finding a subsace skyline Figure 2 formally describes the query algorithm of SUBSKY. At a high level, SUBSKY divides the dataset into m lists, such that the i-th ( i m) list contains all the oints assigned to anchor A i, sorted in descending order of their f-values. Given a query subsace SUB, the algorithm scans the m lists in a synchronous manner. Initially, ointers tr, tr 2,..., tr m are ositioned at the first elements of the m lists, resectively. The subsequent execution runs in iterations, until all the ointers have become. In each iteration, SUBSKY rocesses the oint that has the smallest ER, among all the oints currently referenced by the m ointers. Secifically, it first udates the skyline set S sky (i.e., whenever necessary, add to S sky and remove the oints from S sky dominated by ). Then, the algorithm checks whether a list can be eliminated with, according to Proerty 2. Once a list is runed, its corresonding ointer is set to. Afterwards, the ointer referencing is advanced to the next oint, and another iteration starts. As an examle, assume that we want to comute the skyline in the subsace SUB = {, 2}. As the first ste, the algorithm identifies, for each anchor, the assigned data oint with the maximum f(). In Figure 3b, the oint for A (A 2 ) is 6 ( 5 ), which is the right-most oint assigned to this 7

19 ! " # $ ER-vol w.r.t. A % ER-vol w.r.t. A& (a) ER volumes with resect to A, A 2 (unit 3 ) (,.4) (,.9) N 5 N 6 (,.4) (,.6) (,.9) (2,.7) (,.4) (,.5) (,.6) (,.8) (,.9) (,.9) (2,.7) (2,.7) N N 2 N 3 N 4 Figure 3: Illustration of the skyline algorithm (b) The B-tree on the transformed f-values anchor at the leaf level, and can be easily found by accessing a single ath of the B-tree. Then, the algorithm scans the oints assigned to each anchor in descending order of their f-values, i.e., the ordering is { 5, 4,, 2, 8, 7 } for A, and { 6, 3 } for A 2. Initially, tr and tr 2 reference the heads 5 and 6 of the two lists, resectively. At each iteration, we rocess the referenced oint with a smaller ER (in case of a tie, the next rocessed oint is randomly decided). Continuing the examle, since the ER-volume of 5 is smaller than that 9 of 6 (imlying that 6 has a larger robability of being runed by a future skyline oint), the algorithm adds 5 to the skyline set S sky, and advances tr to the next oint 4 in the list of A. Since 4 has a lower ER-volume (than 6 ointed to by tr 2 ) and is not dominated by 5, it is also added to S sky (={ 5, 4 }). Pointer tr now reaches, which is rocessed next, and is included in S sky, too. According to Proerty 2, runes all the oints assigned to A whose f() are smaller than min i SUB (A [i] [i]) =.8. Since the next oint 2 in the list of A qualifies the condition, none of the remaining data in the list can be a skyline oint. Similarly, also runes the data assigned to A 2 satisfying f() < min i SUB (A 2 [i] [i]) =.8. Thus, the head 6 in the list of A 2 is eliminated (f( 6 ) =.7 <.8), and no oint in the list belongs to the skyline either. Hence, the algorithm terminates with S sky ={ 5, 4, }. 5.5 Discussion We kee the anchor set in memory since it is small (occuying only several k-bytes) and is needed for erforming queries and udates. Secifically, to insert/delete a oint, we decide its assigned anchor A as described in Section 5.2, and set f() = L (,A), after which the insertion/deletion roceeds as in a normal B-tree. The anchor set is never modified after its initial comutation. Query efficiency remains unaffected as long as the data distribution does not incur significant changes. For a dynamic dataset, all the data must be retained because a non-skyline oint may aear in the 8

20 skyline after a skyline oint is deleted. On the other hand, if the dataset is static, oints that are not in the skyline of the whole universe can be discarded since, as mentioned in Section 2., they will not aear in the skyline of any subsace 4. When d is large, the size of the full-sace skyline may still be comarable to the dataset cardinality [3]. Hence, the oints (of the skyline) should be managed by a disk-oriented technique (such as SUBSKY) to enable efficient retrieval in subsaces. So far our definition of dominance refers small coordinates on all dimensions, whereas in general a oint may be considered dominating another only if its coordinates are larger on some axes. For examle, given attributes rice and size of houses, a reasonable skyline would seek to minimize the rice but maximize the size (i.e., a customer is tyically interested in large houses with low rices). Deending on its semantics, a dimension usually has only one reference direction, e.g., skylines involving rice (size) would most likely refer the negative (ositive) direction of this axis. SUBSKY easily suorts a ositive reference direction by converting it to a negative direction, which can be achieved by subtracting (from ) all coordinates on the corresonding dimension (e.g., rice). It is worth mentioning that, sometimes a dimension may have two reference directions. For examle, consider the attribute nearest-subway-station-distance of roerties. Peole, who travel with the subway frequently, may refer to minimize this attribute. Others, who are seeking quiet neighborhoods, may refer to maximize it. In this case, two instances of SUBSKY may be maintained, each suorting one reference direction. 6 Extensions of SUBSKY In the sequel, we show that SUBSKY can be adated to erform other tyes of search in subsaces efficiently. Section 6. first elaborates this for skyband queries, and then Section 6.2 discusses to-k rocessing. 4 Strictly seaking, this is correct only if all the data oints have distinct coordinates on each dimension. If this is not true, the oints that need to be retained include those sharing common coordinates with a oint in the full-sace skyline. Retrieval of such oints is discussed in [23]. 9

21 6. Subsace Skyband Retrieval As mentioned in Section, the k-skyband of a dataset consists of all the data oints that are dominated by less than k other oints. SUBSKY can be easily modified to find the k-skyband in any subsace SU B. In terms of theoretical reasoning, the modification lies in Proerty 2: a oint cannot aear in the k-skyband, if there are k oints,..., k, such that each j ( j k) satisfies Inequality 4, relacing with j. This observation imlies that a subsace k-skyband can be extracted using the B-tree deloyed by SUBSKY, in a way similar to finding a subsace skyline. Intuitively, the only difference is that, here, a sorted list can be eliminated, only after its remaining oints are guaranteed to be dominated by k oints already seen. Based on this idea, Figure 4 demonstrates the SUB-SKYBAND algorithm. Next, we illustrate the algorithm by using the index in Figure 3b to extract the 2-skyband in SUB = {, 2} of the dataset in Figure 7. SUB-SKYBAND scans two sorted lists { 5, 4,, 2, 8, 7 } and { 6, 3 } in a synchronous manner. Recall that the oints in the first (second) list are assigned to anchor A (A 2 ) and sorted in descending order of their f-values. Following the rocess discussed in Section 5.4, we access, in this order, 5, 4, and of the first list, after which tr and tr 2 are referencing 2 and 6 resectively, and S sky = { 5, 4, } (which should be interreted as the 2-skyband set now). As exlained in Section 5.4, definitely dominates all the un-insected oints in both lists the reason for terminating the skyline search in the examle in Section 5.4. To obtain the comlete 2-skyband, we continue to rocess 2, because its ER has a smaller volume than that of 6. Since 2 is dominated by only a single oint in S sky, it is added to S sky ; then, tr is moved to 8. 2 must dominate all the un-examined oints in the first list, which can be understood by combining Proerty 2 with the fact: f() f( 8 ) =.5 < min i SUB (A [i] 2 [i]) =.6. Now that we have found two oints ( and 2 ) that dominate the un-insected art of the first list, the list is eliminated from further consideration. The discovery of 2 does not rune the second list. Hence, 6 is examined, and added to S sky because it is dominated by only one oint in S sky. Pointer tr now references 3, which is also checked, and included in the 2-skyband. Since the second list has been exhausted, the algorithm finishes, reorting S sky = { 5, 4,, 2, 6, 3 } as the final result. 2

22 Algorithm SUB-SKYBAND (SUB, k, {A, A 2,... A m }) /* The meanings of SUB, A,..., and A m follow those in Figure 2; k is the arameter in k-skyband */. for j = to m 2. use a B-tree to find the oint tr j with the maximum f(tr j ) among all the oints assigned to A j 3. tr j.er = d i= (A j[i] L (tr j, A j )) //Equation 5 4. S sky = //the set of oints in the k-skyband in SUB 5. for j = to m 6. S rune [j] = //S rune [j] is the set of oints that dominate the un-insected oints assigned to A j 7. while (tr j for any j [, m]) 8. t = the value of j giving the smallest tr j.er among all j [, m] such that tr j 9. if tr t is dominated by less than k oints in S sky. for every oint S sky dominated by tr t..cnt + + //.cnt is the number of data oints found dominating 2. if.cnt = k then remove from S sky 3. S sky = S sky {tr t }; tr t.cnt = 4. for j = to m, and j t 5. if tr j and f(tr j ) < min i SUB (A j [i] tr t [i]) 6. S rune [j] = S rune [j] {tr t } 7. if S rune [j] = k then tr j = /* no oint assigned to A j can belong to the k-skyband */ 8. tr t = the oint with the next largest f() among the data assigned to A t (this oint lies in either the same leaf as the revious tr t, or a neighboring node) 9. if tr t 2. for every oint sky (S sky S rune [j]) 2. if f(tr t ) < min i SUB (A t [i] sky [i]) 22. S rune [t] = S rune [t] { sky } 23. if S rune [t] = k then tr t = /* no oint assigned to A t can belong to the k-skyband */ 24. tr t.er = d i= (A t[i] L (tr t, A t )) 25. return S sky Figure 4: The algorithm of finding a subsace skyband 6.2 Subsace To-k Retrieval Given a monotone reference function g (concerning a subsace SU B), a to-k query returns the k data oints with the lowest scores. Since, in any SUB, any to-k result is always included in the corresonding k-skyband, an obvious solution to answering the query is to extract the k- skyband, comute the scores of the retrieved oints, and reort the k ones with the lowest scores. This aroach, however, may erform considerable unnecessary work, if the k-skyband is sizable. Here, we roose a faster algorithm, which emloys exactly the same B-tree used by our SUBSKY methodology, and is alicable to any monotone reference function g. The algorithm requires the notion of ER max-corner, defined as follows: Definition 3. Given a oint and an anchor A dominated by, let R be the ER of with resect 2

23 Algorithm SUB-TOPK (SUB, k, g {A, A 2,... A m }) /* The meanings of SUB, A,..., and A m follow those in Figure 2; k is the arameter in to-k ; g is the reference function of the to-k query */. for j = to m 2. use a B-tree to find the oint tr j with the maximum f(tr j ) among all the oints assigned to A j 3. S to = //the to-k set 4. while (tr j for any j [, m]) 5. t = the value of j giving the smallest g(tr j ) among all j [, m] such that tr j 6. if g(tr t ) is smaller than the score of some oint in S to 7. remove from S to the oint with the largest score 8. S to = S to {tr t } 9. for j = to m, and j t. if tr j. tr jc is the ER max-corner of tr j 2. if g(tr jc ) is larger than the scores of all oints in S to 3. tr j = /* no oint assigned to A j can belong to the to-k set */ 4. tr t = the oint with the next largest f() among the data assigned to A t 5. if tr t 6. tr tc is the ER max-corner of tr t 7. if g(tr tc ) is larger than the scores of all oints in S to 8. tr t = /* no oint assigned to A t can belong to the to-k set */ 9. return S to Figure 5: The algorithm of answering a subsace to-k query to A (formulated in Definition 2). Then, the ER max-corner of with resect to A is the corner of R oosite to the origin (which is also a corner of R). For examle, in Figure a, the ER max-corner of is the uer-right corner of the shaded region. Such corners have two imortant roerties: Proerty 3. A data oint cannot be in the result of a to-k query (which secifies a reference function g concerning subsace SUB), if there exist k oints,..., k such that, for all j [,k], g( j ) < g( C) (7) where C is the ER max-corner of with resect to its assigned anchor. Proerty 4. Let and 2 be two oints assigned to the same anchor A. If f( ) f( 2 ), then g( C ) g( 2C ), where g is any monotone reference function, and C and 2C are the ER max-corners of and 2 with resect to A, resectively. Figure 7 formally describes the roosed algorithm SUB-TOPK for subsace to-k retrieval. As with subsace skyline/skyband search, SUB-TOPK leverages m lists, where the i-th ( i m) 22

24 list juxtaoses the oints assigned to the i-th anchor in descending order of their f-values. Given a query subsace SUB, the algorithm again uses m ointers tr, tr 2,..., tr m to scan the m lists synchronously, and maintains the set S to of k objects that have the smallest scores among all the objects scanned so far. In each iteration, SUB-TOPK rocesses the referenced oint with the smallest score, and attemts to rune a list according to Proerties 3 and 4. In the sequel, we exlain the algorithm using the dataset in Figure 3, assuming a to-2 query in SUB = {, 2} with g() = 3[] + [2]. SUB-TOPK examines two sorted lists { 5, 4,, 2, 8, 7 } and { 6, 3 }. At the beginning, ointers tr and tr 2 reference the to elements of the two lists, resectively. At each ste, we rocess the referenced oint with a smaller score. Since g( 5 ) =.2 < g( 6 ) =.6, 5 is added to S to, and tr is moved to the next element 4. As g( 6 ) < g( 4 ) = 2.8, the algorithm adds 6 to S to (which becomes { 5, 6 }, sorted in ascending order of their scores), and shifts ointer tr 2 nto 3. Similarly, 3 is the third oint insected, but is discarded, because g( 3 ) =.8 is larger than the score of the current to-2 object 6. The second list has been exhausted; hence, the subsequent execution focuses on the first list. We continue to rocess 5 and 4, which are also ignored, due to the same reason for discarding 3. Next, is examined, and included in S to, whereas 6 is removed from S to, because it has a higher score than and 5. The algorithm terminates here with S to = {, 5 } as the final result. To exlain this, notice that the element 2 referenced by tr now has an ER max-corner 2C = (.4,.4,.4). The score g( 2C ) =.6 of 2C is greater than those (.8 and.2) of and 5. Hence, by Proerty 3, 2 cannot belong to the to-2 result. Furthermore, let be any oint in the first list that has not been examined, and C the ER max-corner of. As f( 2 ) f(), according to Proerty 4, g( C ) must be at least g( 2C ), and hence, larger than the scores of both and 5. Therefore, cannot be in the to-2 result, either. The comlete algorithm is formally described in Figure 5. 7 Exeriments In this section, we exerimentally evaluate the efficiency of the roosed techniques. We deloy three real datasets NBA, Household, and Color 5. Secifically, NBA contains 7k 8-dimensional 5 These datasets can be downloaded at htt:// htt:// and htt://kdd.ics.uci.edu, resectively. 23

Efficient Processing of Top-k Dominating Queries on Multi-Dimensional Data

Efficient Processing of Top-k Dominating Queries on Multi-Dimensional Data Efficient Processing of To-k Dominating Queries on Multi-Dimensional Data Man Lung Yiu Deartment of Comuter Science Aalborg University DK-922 Aalborg, Denmark mly@cs.aau.dk Nikos Mamoulis Deartment of