Efficient Evaluation of Probabilistic Advanced Spatial Queries on Existentially Uncertain Data


Man Lung Yiu, Nikos Mamoulis, Xiangyuan Dai, Yufei Tao, Michail Vaitis

Abstract — We study the problem of answering spatial queries in databases where objects exist with some uncertainty and they are associated with an existential probability. The goal of a thresholding probabilistic spatial query is to retrieve the objects that qualify the spatial predicates with probability that exceeds a threshold. Accordingly, a ranking probabilistic spatial query selects the objects with the highest probabilities to qualify the spatial predicates. We propose adaptations of spatial access methods and search algorithms for probabilistic versions of range queries, nearest neighbors, spatial skylines, and reverse nearest neighbors, and conduct an extensive experimental study which evaluates the effectiveness of the proposed solutions.

Index Terms — H.2.4.h Query processing, H.2.4.k Spatial databases

1 INTRODUCTION

Conventional spatial databases manage objects located on a thematic map with 100% certainty. In real-life cases, however, there may be uncertainty about the existence of spatial objects or events. As an example, consider a satellite image, where interesting objects (e.g., vessels) have been extracted (e.g., by a human expert or an image segmentation tool). Due to low image resolution and/or color definitions, the data extractor may not be 100% certain about whether a pixel formation corresponds to an actual object o; a probability E_o could be assigned to o, reflecting the confidence of o's existence. We call such objects existentially uncertain, since their exact locations are known and the uncertainty refers only to their existence. As another example of existentially uncertain data, consider emergency calls to a police calling center, which are dialed from various map locations. Depending on the caller's voice, for each call we can generate a spatial event associated with a potential emergency and a probability that the emergency is actual. Events generated from sensors (e.g., smoke detection) can also be regarded as existentially uncertain data, because each sensor is associated with a certain location and the existence of each detected event depends on the sensor's sensitivity and the background noise. Existential probabilities are also a natural way to model fuzzy classification [1]. In this case, the class label of a particular object is uncertain; each label has an existential probability and the sum of all probabilities is 1.

M. L. Yiu is with the Department of Computer Science, Aalborg University, DK-9220 Aalborg, Denmark. mly@cs.aau.dk
N. Mamoulis and X. Dai are with the Department of Computer Science, University of Hong Kong, Pokfulam Road, Hong Kong. {nikos,dai}@cs.hku.hk
Y. Tao is with the Department of Computer Science and Engineering, Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong. taoyf@cse.cuhk.edu.hk
M. Vaitis is with the Department of Geography, University of the Aegean, Mytilene, Greece. vaitis@aegean.gr
Manuscript received XXXX XX, 200X; revised XXXX XX, 200X.

We can naturally define probabilistic versions of spatial queries that apply on collections of existentially uncertain objects. We identify two types of such probabilistic spatial queries. Given a confidence threshold t, a thresholding query returns the objects (or object pairs, in case of a join) which qualify some spatial predicates with probability at least t.
E.g., given a segmented satellite image with uncertain objects, consider a port officer who wishes to find a set of vessels S such that every o in S is the nearest ship to the port with confidence at least 30%. Another example is a police station asking for the emergencies in its vicinity which have high confidence. A ranking spatial query returns the objects which qualify the spatial predicates of the query, in order of their confidence. Ranking queries can also be thresholded (in analogy to nearest neighbor queries) by a parameter m. For instance, the port officer may want to retrieve the m ships with the highest probability to be the nearest neighbor of the port. Previous work on managing spatial data with uncertainty [2], [3], [4], [5], [6], [7] focuses on locationally uncertain objects; i.e., objects which are known to exist, but whose (uncertain) location is described by a probability density function. The rationale is that the managed objects are actual moving objects with unknown exact locations due to GPS errors or transmission delays. In Section 2, we elaborate on the fundamental differences (e.g., location, existence, storage, and probability computation) between existentially uncertain objects and locationally uncertain objects, and explain why existing solutions for managing and searching locationally uncertain data are inappropriate for existentially uncertain data. To our knowledge, there is no prior work on existentially uncertain spatial data. Our contributions are summarized as follows: We identify the class of existentially uncertain spatial data and define two intuitive probabilistic query types on them: thresholding and ranking queries. Assuming that the spatial attributes of the objects are indexed by 2-dimensional R-trees, we propose search

algorithms for probabilistic variants of spatial range queries, nearest neighbor (NN) search, spatial skyline (SS) queries, and reverse nearest neighbor (RNN) queries. Regarding different variants of R-trees, we derive appropriate lower/upper probabilistic bounds for effectively reducing the search cost. Our search algorithms for NN, SS, and RNN queries are carefully designed to handle disqualified entries in such a way that their removal is guaranteed not to influence the probabilistic bounds of any potential result object. The rest of the paper is organized as follows. Section 2 provides background on querying spatial objects with uncertain locations and extents. Section 3 defines existentially uncertain data and query types on them. In Section 4 we study the evaluation of probabilistic spatial queries, when they are primarily indexed on their spatial attributes, or when considering existential probability as an additional dimension. Section 5 addresses probabilistic variants of interesting advanced spatial queries. Section 6 is a comprehensive experimental study of the performance of the proposed methods. Section 7 discusses the case where the existential probabilities of objects are correlated. Finally, Section 8 concludes the paper with a discussion about future work.

2 BACKGROUND AND RELATED WORK

In this section, we review popular spatial query types and show how they can be processed when the spatial objects are indexed by R-trees. In addition, we provide related work on modeling and querying spatial objects of uncertain location and/or extent.

2.1 Spatial Query Processing

The most popular spatial access method is the R-tree [8], which indexes minimum bounding rectangles (MBRs) of objects. R-trees can efficiently process the main spatial query types, including spatial range queries, nearest neighbor queries, and spatial joins. Figure 1 shows a collection R = {p1, ..., p8} of spatial objects (e.g., points) and an R-tree structure that indexes them. Given a spatial region W, a spatial range query retrieves from R the objects that intersect W. For instance, consider a range query that asks for all objects within distance 3 from q, corresponding to the shaded area in Figure 1. Starting from the root of the tree, the query is processed by recursively following entries having MBRs that intersect the query region.

Fig. 1. Spatial queries on R-trees

A nearest neighbor (NN) query takes as input a query object q and returns the closest object in R to q. For instance, the NN of q in Figure 1 is p7. If R is indexed by an R-tree, then the best-first (BF) algorithm of [9] is the most efficient solution for processing NN queries. A best-first priority queue PQ, which organizes R-tree entries based on the (minimum) distance of their MBRs to q, is initialized with the root entries. The top entry e of the queue is then retrieved; if e is a leaf node entry, the corresponding object is returned as the NN (assuming point objects). Otherwise, the node pointed to by e is accessed and all its entries are inserted into PQ. In order to find the NN of q in Figure 1, BF first inserts into PQ the entries e1, e2, e3, together with their distances to q. Then the nearest entry e2 is retrieved from PQ and the objects p1, p7, p8 are inserted into PQ. The next nearest entry in PQ is p7, which is the NN of q. In Section 4, we will extend BF for processing probabilistic versions of NN search on existentially uncertain data.
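For concreteness, the following is a minimal sketch of best-first NN search over a simplified in-memory R-tree. The Entry/Node classes and the mindist function are illustrative assumptions, not the structures used in the paper.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]
MBR = Tuple[Point, Point]  # ((x1, y1), (x2, y2)): lower-left and upper-right corners

@dataclass
class Entry:
    mbr: MBR
    obj: Optional[Point] = None        # set for leaf entries (point objects)
    child: Optional["Node"] = None     # set for directory entries

@dataclass
class Node:
    entries: List[Entry] = field(default_factory=list)

def mindist(q: Point, mbr: MBR) -> float:
    # Minimum Euclidean distance from query point q to a rectangle.
    (x1, y1), (x2, y2) = mbr
    dx = max(x1 - q[0], 0.0, q[0] - x2)
    dy = max(y1 - q[1], 0.0, q[1] - y2)
    return (dx * dx + dy * dy) ** 0.5

def best_first_nn(q: Point, root: Node) -> Optional[Point]:
    # Best-first search [9]: pop the entry with the smallest mindist;
    # the first popped *object* is the nearest neighbor of q.
    pq, tie = [], 0
    for e in root.entries:
        heapq.heappush(pq, (mindist(q, e.mbr), tie, e)); tie += 1
    while pq:
        _, _, e = heapq.heappop(pq)
        if e.obj is not None:
            return e.obj
        for c in e.child.entries:
            heapq.heappush(pq, (mindist(q, c.mbr), tie, c)); tie += 1
    return None
```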
2.2 Locationally Uncertain Spatial Data

Recently, there is increasing interest in the modeling, indexing, and querying of objects with uncertain location and/or extent. For instance, consider a collection of moving objects whose positions are tracked by GPS devices. Exact locations are unknown due to GPS errors and transmission delays; e.g., if the object is in motion, its location might be outdated when reaching the listening server. As a result, the set of possible locations of an object is captured by a probability density function (PDF), which combines the GPS measurement error, the last reported object location, and the object velocity [2]. Figure 2a exemplifies a locationally uncertain object o, modeled by a Gaussian PDF, with the regions of higher probability marked in darker color. According to [], [7], an arbitrary PDF can be approximated by a spatial histogram (e.g., 3x3 bins in Figure 2a), where each bin stores the probability to include the object, and their sum equals 1.

Fig. 2. Locationally and Existentially Uncertain Objects: (a) a locationally uncertain object o modeled by a PDF; (b) the PCR of o at g = 0.2; (c) NN search over locationally uncertain objects o1, ..., o4; (d) existentially uncertain objects o1(0.1), o2(0.2), o3(0.3), o4(0.9) and a range W.

TABLE 1
Fundamental Differences between the Two Notions of Uncertainty

Property                  | Locationally Uncertain Object     | Existentially Uncertain Object
Location                  | uncertain                         | certain
Existence                 | certain                           | uncertain
Storage                   | a spatial histogram               | a point o with existence probability E_o
Probability computation   | expensive numerical integration   | constant time

Given a locationally uncertain object o and a query range W (see Figure 2b), the probability that o intersects W is formally defined by:

P_rng(o, W) = ∫_{o ∩ W} o.pdf(v) dv

where o.pdf(v) denotes the probability that o coincides with point v. Probabilistic threshold range queries [], [7] retrieve result pairs (o, P_rng(o, W)) such that P_rng(o, W) ≥ t, where t is a user-specified threshold. The filter-refinement framework is adopted to accelerate their evaluation. An inexpensive filter step is applied to determine fast whether an object o can belong to the result. Only when o may potentially become a result, the refinement step is executed to compute the P_rng(o, W) value. In the state-of-the-art method of [7], the probabilistically constrained rectangle (PCR) is used for the filter step of the queries. Given a system parameter g, modeling a minimum value for t, the PCR of an object o is pre-computed by sliding each axis-parallel line inwards until the swept area over the PDF of o equals g. Figure 2b illustrates the PCR of an object o, for g = 0.2; o appears in the region on the left of line l_x^- with probability 0.2. Similarly, o appears in the regions on the right/bottom/top of lines l_x^+/l_y^-/l_y^+, respectively, each with probability 0.2. To answer the threshold range query W (with t = 0.5), we first compare W with the lines l_x^-/l_x^+/l_y^-/l_y^+. Since W does not intersect the PCR of o (i.e., it is above line l_y^+), we can immediately infer that P_rng(o, W) ≤ 0.2 < t. Thus, o is discarded during the filter step of query W, saving the expensive computation of the exact probability P_rng(o, W).

Table 1 summarizes the fundamental differences between locationally uncertain objects and existentially uncertain objects. As depicted in Figure 2d, an existentially uncertain object o3 has a certain location (i.e., a point) but its existence is associated with a probability E_{o3} = 0.3. The probability of o3 satisfying a range query W is P_rng(o3, W) = E_{o3} if o3 intersects W, or 0 otherwise. Thus, P_rng(o3, W) can be computed in constant time. One may argue that an existentially uncertain point o with existence probability E_o could be modeled as a locationally uncertain object with a PDF consisting of exactly 2 locations: the point o with probability E_o, and a point at infinity with probability (1 − E_o). This model encumbers the application of existing techniques for locationally uncertain data [], [7], because they assume multiple locations with probabilities and the continuity of the PDF in the space. Consider, for instance, the probabilistic NN search algorithm for locationally uncertain data proposed in [6]. Given a query point q and a set R of locationally uncertain objects, we can derive λ = min_{o∈R} maxd(q, o), i.e., the minimum over all objects of the maximum possible distance of o from q. For instance, in Figure 2c, the object o1 leads to the minimum λ. Since the PDF of o1 sums to 1 within the circle (centered at q with radius λ), it is clear that any object o' (e.g., o4) with mind(q, o') > λ has no chance of being the NN of q. For any remaining object o (e.g., o1, o2, o3), its probability of being the NN of q is denoted by P_nn(o, q).
Assuming independent PDFs between different objects, [6] defines P_nn(o, q) as follows:

P_nn(o, q) = ∫_{mind(q,o)}^{λ} o.pdf(○_q(r)) · ∏_{o'∈R, o'≠o, mind(q,o')≤λ} (1 − o'.pdf(●_q(r))) dr

where ○_q(r) and ●_q(r) represent the hollow ring and the solid circle, respectively, centered at q with radius r. The evaluation of the above probability is expensive for arbitrary PDFs, so [6] focuses on basic PDFs and develops efficient computation techniques for P_nn(o, q). Note that the above probabilistic NN search technique is inapplicable to existentially uncertain data. Figure 2d depicts a set of existentially uncertain objects, with a similar spatial configuration as in Figure 2c. In this case, o1 is still the object causing the minimum λ value. However, since its existence probability is not 1, it cannot be used to bound the search space. For instance, the object o4 now has non-zero probability of being the NN of q; this happens with probability (1 − 0.1)(1 − 0.2)(1 − 0.3) · 0.9 ≈ 0.454, when o4 exists but o1, o2, o3 do not exist. Other work on locationally uncertain data includes indexing the trajectory of an object (e.g., tracked by a GPS) as a cylindrical volume around the recorded polyline, capturing uncertainty up to a certain distance from the polyline []. A similar approach is followed in [3], where recorded trajectories are converted to sequences of locations connected by elliptical volumes. [5] also models the uncertain locations of spatial objects by (circular) uncertainty regions and discusses how to process simple and aggregate spatial range queries using the fuzzy representations. [4] studies the evaluation of spatial joins between two sets of objects, for the case where the object extents are floating according to uncertainty distance bounds. An extension of the R-tree that captures uncertainty in directory node entries is proposed, and R-tree join techniques are adapted to process the join efficiently. Cheng et al. [2], [] study a problem related to probabilistic spatial range queries. The uncertain data are not spatial, but ordinal 1D values (e.g., temperature values recorded from sensors). [] indexes such uncertain data for efficient evaluation of probabilistic range queries. [2] classifies queries on such data into entity-based queries, asking for the set of objects satisfying a query predicate, and value-based queries, asking for a PDF describing the distribution of a query result when it is a single aggregate value (e.g., the sum of values, the maximum value, etc.). Finally, [3] studies the evaluation of queries over uncertain or summarized data, where the user specifies thresholds (precision, recall, laxity) regarding the quality (i.e., accuracy) of the desired result.

3 EXISTENTIALLY UNCERTAIN SPATIAL DATA

An object x is existentially uncertain if its existence is described by a probability E_x, 0 < E_x ≤ 1. We refer to E_x as the existential probability or confidence of x. Note that since we

can have E_x = 1, we (trivially) regard a 100% known object as existentially uncertain. This allows us to model object collections which are mixtures of uncertain and certain data. On the other hand, E_x = 0 corresponds to an object that definitely does not exist, so there is no need to store it in a database. We take the existential independence assumption that the confidence values of two different objects are independent of each other. This assumption is reasonable for the applications mentioned in the Introduction (e.g., satellite image extraction, emergency calls). We will relax this assumption in Section 7 and handle existentially uncertain objects whose confidence values are correlated.

Fig. 3. NN search example (point confidences: p1 = 0.2, p2 = 0.5, p3 = 0.3, p4 = 0.5, p5 = 0.5, p6 = 0.1, p7 = 0.1, p8 = 0.2)

Figure 3 shows a collection R = {p1, p2, ..., p8} of existentially uncertain points. Next to each point label p_i is its existential probability E_{p_i}, enclosed in parentheses (e.g., E_{p1} = 0.2). We are interested in answering spatial queries that take uncertainty into account. Let R be a collection of existentially uncertain objects. We then define probabilistic versions of basic spatial query types:

Definition 1: A probabilistic spatial range query takes as input a spatial region W and returns all (x, P_x) pairs, such that x ∈ R and x intersects W with probability P_x = E_x, where P_x > 0.

Definition 2: A probabilistic nearest neighbor query takes as input an object q and returns all (x, P_x) pairs, such that x ∈ R and x is the nearest neighbor of q, with probability P_x = E_x · ∏_{y∈R, y≠x, d(q,y)<d(q,x)} (1 − E_y), where P_x > 0 and d(q, x) denotes the distance between q and x.

The output of a probabilistic query is a conventional query result coupled with a positive probability that the item satisfies the query. The case of probabilistic range queries is simple; P_x = E_x for each object x that qualifies the spatial predicate. Consider, for instance, the shaded window W shown in Figure 3. Two objects p1 and p2 intersect W, with confidences E_{p1} = 0.2 and E_{p2} = 0.5, respectively. Similar to locationally uncertain data, the probability of an object to qualify a spatial range query does not depend on the locations and confidences of other objects. On the other hand, the probability of an object to be the nearest neighbor depends on the locations and probabilities of other objects. Consider again Figure 3 and assume that we want to find the potential nearest neighbors of q. The nearest point to q (i.e., p7) is the actual NN iff p7 exists. Thus, (p7, E_{p7}) is a query result. In order for the second nearest point p6 to be the NN of q, (i) p7 must not exist and (ii) p6 must exist. Thus, (p6, (1 − E_{p7}) · E_{p6}) is another result. By continuing this way, we can explore the whole set of points in R and assign to each of them a probability to be the NN of q. This nearest neighbor query example not only shows the search complexity in uncertain data, but also unveils that the result of probabilistic queries may be arbitrarily large. For instance, the result of an NN query is as large as |R|, if E_x < 1 for all x ∈ R. We can define practical versions of probabilistic queries with controlled output by either thresholding out the results with low probability to occur, or ranking them and selecting the most probable ones:

Definition 3: Let (x, P_x) be an output item of a probabilistic spatial query Q. The thresholding version of Q takes as additional input a threshold t, 0 < t ≤ 1, and returns the results for which P_x ≥ t.
The ranking version of Q takes as additional input a positive integer m and returns the m results with the highest P_x. For example, a thresholding range (window) query W with t = 0.6 on the objects of Figure 3 returns the empty result, whereas a ranking range query W with m = 1 returns (p2, 0.5).

4 EVALUATION OF BASIC PROBABILISTIC QUERIES

Like spatial queries on exact data, probabilistic spatial queries can be efficiently processed with the use of appropriate access methods. In this section, we explore alternative indexing schemes and propose algorithms for probabilistic queries on them. We focus on the most important spatial query types; namely, range queries and nearest neighbor queries.

4.1 Algorithms for R-trees

The most straightforward way to index a set R of existentially uncertain spatial data is to create a 2-dimensional R-tree on their spatial attribute. The confidences of the spatial objects are stored together with their geometric representation or approximation (for complex objects) at the leaves of the tree. We now study the evaluation of probabilistic queries on top of this indexing scheme.

4.1.1 Range Queries

Probabilistic range queries can be easily processed in two steps: a standard depth-first search algorithm is applied on the R-tree to retrieve the objects that qualify the spatial predicate of the query. For each retrieved object x, P_x = E_x. If the query Q is a thresholding query, the threshold t is used to filter out objects with P_x < t. If Q is a ranking query, a priority queue maintains the m results with the highest P_x during search, and outputs them at the end of query processing.

1. Especially for thresholding range queries with very large thresholds t, a viable alternative could be to use a B+-tree that indexes objects based on their probability, in order to efficiently access the objects with E_x ≥ t and then filter them using the spatial query predicate.
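Before turning to the index-based algorithms, the following brute-force sketch evaluates Definition 2 directly on a small in-memory point set (assuming independent confidences and objects given as (point, E) pairs); it is a reference implementation, not the R-tree method developed next.

```python
from math import dist, prod

def probabilistic_nn(q, objects, threshold=0.0):
    # objects: list of ((x, y), E) pairs with existential probabilities E.
    # Returns (point, P) pairs with P > threshold, per Definition 2:
    # P_x = E_x * product over objects y closer to q than x of (1 - E_y).
    results = []
    for x, e_x in objects:
        p = e_x * prod(1.0 - e_y for y, e_y in objects
                       if y != x and dist(q, y) < dist(q, x))
        if p > threshold:
            results.append((x, p))
    return sorted(results, key=lambda r: r[1], reverse=True)

# Hypothetical coordinates; only the confidences follow Figure 3.
# pts = [((1, 1), 0.2), ((2, 5), 0.5), ((4, 3), 0.3)]
# print(probabilistic_nn((3, 3), pts))
```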

4.1.2 Nearest neighbor search

NN search is more complex than range queries, because the probability of an object to qualify the query depends on the locations and confidences of other objects. Algorithm 1 elegantly and efficiently computes the probability P_x of x being the nearest neighbor of q, for all x having P_x > 0.

Algorithm 1 Probabilistic NN on an R-tree
Algorithm PNN(Query point q, R-tree on R)
1: P_first := 1;  ▷ prob. that no object before x exists
2: while P_first > 0 and more objects in R do
3:   x := next NN of q in R;
4:   P_x := P_first · E_x;
5:   output (x, P_x);
6:   P_first := P_first · (1 − E_x);

Algorithm PNN applies best-first NN search [9] on the R-tree to incrementally retrieve the nearest neighbors of q, without considering confidences. It also incrementally maintains a variable P_first, which captures the probability that no object retrieved before the current object x is the actual NN. P_first equals ∏(1 − E_y), over all objects y seen before x. Thus the probability of x being the nearest neighbor of q is P_first · E_x. In the example of Figure 3, PNN gradually computes P_{p7} = 0.1, P_{p6} = (1−0.1)·0.1 = 0.09, P_{p8} = (1−0.1)(1−0.1)·0.2 = 0.162, P_{p4} = (1−0.1)(1−0.1)(1−0.2)·0.5 = 0.324, etc. Note that all objects of R in this example are retrieved and inserted into the response set. In other words, PNN does not terminate until an object with E_x = 1 is found; if no such object exists, all objects have a positive probability to be the nearest neighbor.

Thresholding and ranking: As discussed in Section 3, the user may want to restrict the response set by thresholding or ranking. Algorithm 2 is the thresholding version of PNN, which returns only the objects with P_x ≥ t. The only differences from the non-thresholding version are the termination condition at Line 2 and the filtering of results having P_x < t (Line 5). As soon as P_first < t, we know that the next objects, even with 100% confidence, cannot be the NN of q with probability at least t, so we can safely terminate. For example, assume that we wish to retrieve the points in Figure 3 which are the NN of q with probability at least t = 0.23. First, p7 with P_{p7} = E_{p7} = 0.1 is retrieved, which is filtered out at Line 5, and P_first is set to 0.9 ≥ t. Then we retrieve p6 with P_{p6} = P_first · E_{p6} = 0.09 (also disqualified) and set P_first = 0.81 ≥ t. Next, p8 is retrieved with P_{p8} = 0.162 (also disqualified) and P_first = 0.648 ≥ t. The next object p4 satisfies P_{p4} = 0.324 ≥ t, thus (p4, 0.324) is output. Then P_first = 0.324 ≥ t and we retrieve p3 with P_{p3} = 0.0972 (disqualified). Finally, P_first = 0.2268 < t and the algorithm terminates, having produced only (p4, 0.324).

Algorithm 2 Probabilistic NN on an R-tree with thresholding
Algorithm PTNN(Query point q, R-tree on R, Threshold t)
1: P_first := 1;  ▷ prob. that no object before x exists
2: while P_first ≥ t and more objects in R do
3:   x := next NN of q in R;
4:   P_x := P_first · E_x;
5:   if P_x ≥ t then
6:     output (x, P_x);
7:   P_first := P_first · (1 − E_x);

PRNN (Algorithm 3), the ranking version of PNN, maintains a heap H of the m objects with the largest P_x found so far. Let P_m be the m-th largest P_x in H; as soon as P_first ≤ P_m, we know that the next objects, even with 100% confidence, cannot enter the set of the m most probable NNs of q, so we can safely terminate. For example, assume that we wish to retrieve the point with the highest probability (m = 1) of being the NN of q in Figure 3. PRNN progressively
maintains the object with the highest P_x. After each of the first 4 object accesses, P_m becomes 0.1, 0.1, 0.162, and 0.324. The algorithm terminates after the 4th loop, when P_first = 0.324 and P_m = P_{p4} = 0.324; this indicates that the next object can have P_x at most P_{p4}, thus p4 has the highest chance among all objects to be the NN of q.

Algorithm 3 Probabilistic NN on an R-tree with ranking
Algorithm PRNN(Query point q, R-tree on R, Integer m)
1: P_first := 1;  ▷ prob. that no object before x exists
2: H := ∅;  ▷ heap of the m objects with the highest P_x
3: P_m := 0;  ▷ P_x of the m-th object in H
4: while P_first > P_m and more objects in R do
5:   x := next NN of q in R;
6:   P_x := P_first · E_x;
7:   if P_x > P_m then
8:     update H to include x;
9:     P_m := m-th probability in H;
10:  P_first := P_first · (1 − E_x);

4.2 Query Evaluation using Augmented R-trees

We can enhance the efficiency of the probabilistic search algorithms by augmenting some statistical information to the R-tree directory node entries. A simple and intuitive method is to store with each directory node entry e a value e^maxE: the maximum E_x over all objects indexed under e. This value can be used to prune R-tree nodes while processing thresholding or ranking queries. Similar augmentation techniques are proposed in [4], [] for locationally uncertain data.

TABLE 2
Checking disqualified entries in augmented R-trees

query type    | range search    | NN search
thresholding  | e^maxE < t      | P_first · e^maxE < t
ranking       | e^maxE ≤ P_m    | P_first · e^maxE ≤ P_m

Table 2 summarizes the conditions for pruning R-tree entries (and the corresponding subtrees) which do not point to any results, during range or NN thresholding and ranking queries. For range queries, we can directly prune an entry e when (i) e.MBR does not intersect the query range, or (ii) its e^maxE satisfies the condition in the table. On the other hand, for NN search, a disqualified entry cannot be directly pruned, because the confidences of objects in the pointed subtree may be needed for computing the probabilities of objects with greater distances to q but high enough probabilities to be included in the result.
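A minimal sketch of the entry-checking tests of Table 2 is given below; e_max_e, p_first, t, and p_m are the quantities defined above, and the non-strict comparisons in the ranking tests are an assumption.

```python
def prune_range_thresholding(e_max_e: float, t: float) -> bool:
    # A range-query subtree whose best confidence is below t cannot yield results.
    return e_max_e < t

def prune_range_ranking(e_max_e: float, p_m: float) -> bool:
    # The subtree cannot beat the current m-th best probability.
    return e_max_e <= p_m

def disqualify_nn_thresholding(p_first: float, e_max_e: float, t: float) -> bool:
    # True means the subtree cannot contain an answer; for NN search the entry
    # is moved to the list L instead of being discarded, since its objects may
    # still lower the probabilities of farther objects.
    return p_first * e_max_e < t

def disqualify_nn_ranking(p_first: float, e_max_e: float, p_m: float) -> bool:
    return p_first * e_max_e <= p_m
```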

6 IEEE TRANSACTIONS OF KNOWLEDGE AND DATA ENGINEERING, VOL. X, NO. X, XXXXX 2X 6 Let us assume for the moment that for each non-leaf entr e we know the eact number of objects e num in its subtree. Algorithm 4 is the thresholding NN procedure for the augmented R tree. BF is etended as follows: If a non-leaf entr e is de-heaped for which P first e mae < t, the node where e points is not immediatel loaded (as in PTNN) but e is inserted into a set L of deleted entries. For objects retrieved later from the Best-First heap, we use entries in L to compute P min and P ma ; lower and upper bounds for P. If P min t, we know that is definitel a result. If P ma < t, we know that is definitel not a result. On the other hand, if P min < t P ma (Lines 6 2), we must refine the probabilit range for. For this purpose, we pick the entr e with the minimum mind(q, e) in L. 2 Observe that an entries with mind(q, e) > d(q, ) cannot contribute to the probabilit of. As P min < P ma (at Line 6), the entr e selected at Line 7 must satisf mind(q, e) d(q, ). If e is an object, then q must be nearer to e than and we update P first with the confidence of e. Otherwise, its confidence does not affect P first, we access its child node n e and insert all entries of n e into L. In either case, the probabilit range of shrinks. The process is repeated while the range covers t. Algorithm 4 Probabilistic NN on an augmented R tree with thresholding Algorithm PTNNaug(Quer point q, Augmented R tree on R, Threshold t) : P first := ; Prob. of no object before 2: L := ; list of disqualified entries 3: while P first t and more objects in R do 4: := net NN of q in R; during BF-search, each non-leaf entr with P first e mae < t is removed from Best-First heap and inserted into L 5: compute P min and P ma b using P first, L and E ; 6: while P min < t P ma do 7: pick the entr e with the smallest mind(q, e) in L; remove e from L; 8: if e is an object then e is an object closer to q than is 9: P first := P first ( E e); : else : read node n e pointed b e and insert all entries of n e into L; 2: compute P min 3: if P min t then 4: output (,P min,p ma ); 5: P first := P first ( E ); and P ma b using P first, L and E ; It remains to clarif how P min and P ma for an object are computed. Note that L onl contains entries whose minimum distance to q are smaller than d(q, ). For an entr e in the list L, the confidence of each object in its subtree is in the range (, e mae ]. In addition, there eists at least one object in e whose confidence is eactl e mae. Thus, P min corresponds to the case where for all objects under all entries in L are closer to q than is and the all have the maimum possible confidences. P ma corresponds to the case, where for all e L, with maimum distance from q greater than d(q, ), there is onl one object with e mae confidence (for all other objects 2. Throughout the paper, we use d(q, ) to denote the distance between two points q and ; and use mind(q, e) (mad(q, e)) to denote the minimum (maimum) possible distance between q and an data point indeed b the sub-tree pointed b e. under e the confidence converges to ): P min = P first E P ma = P first E e L mind(q,e) d(q,) e L mad(q,e) d(q,) ( e mae ) enum () ( e mae ) (2) So far, we have assumed that for each non-leaf entr e the number of objects e num in its subtree is known (e.g., this information is augmented, or the tree is packed). 
We can still appl the algorithm for the case where this information is not known, b using an upper bound for e num : f level(e), where level(e) is the level of the entr e (leaves are at level ) and f is the maimum R tree node fanout. This upper bound replaces e num in Equation. 5 5 p 2 (.5) p (.2) e.mbr p 8 (.2) e.mbr p3 (.3) p (.) 7 q e.mbr 3 p (.5) 4 p (.) p (.5) 5 e e 2 e 3 (.5)(.2)(.5) p 2 p 3 p p 8 p 7 p 4 p 5 p 6 (.5)(.3) (.2) (.2)(.) (.5)(.5)(.) Fig. 4. Eample of augmented R tree Let us now show the functionalit of the PTNNaug algorithm b an eample. Consider the augmented R tree of Figure 4 that indees the pointset of Figure 3 and assume that we want to find the points that are the NN of q with probabilit at least t =.23. First, the entries in the root are enheaped in the Best-First heap. Net, the entr e 2 is dequeued. Since it disqualifies the quer (P first e mae 2 =.2 < t), it is inserted into the list L. Then, the entr e 3 is dequeued. Its objects p 4, p 5, p 6 are enheaped in the Best-First Queue. The nearest object p 6 is dequeued. From Equations and 2, we derive a probabilit range for P p6 b using P first and L. p 6 is disqualified as Pp ma 6 = E p6 =. < t. Then, P first =.9 t and we retrieve p 4. Since Pp min 4 =.9.5 (.2) 3 =.234 t, p 4 is a result. Net, P first =.45 t and the net entr retrieved from the priorit queue of the BF algorithm is e. We do not access the node pointed b e, since we know that for each object indeed under e, P e mae P first =.225 < t. Thus, e is inserted into L. Net, p 5 is dequeued and discarded as Pp ma 5 =.45.5 (.2) (.5) < t. Now, the Best- First heap becomes empt and the algorithm terminates. Note that PTNN accesses all nodes of the tree in this eample, whereas PTNNaug saves two leaf node accesses. Ranking NN retrieval on the augmented R tree is performed b Algorithm 5. PRNNaug has several differences from the thresholding NN algorithm. A heap H is emploed to organize objects o b their Po min. P m denotes the m-th highest Po min in the heap. Observe that more complicated techniques are used for updating H, as the accesses to L ma affect the order of objects in H. Each object o in H maintains Po first, which is the value of P first when o is enheaped (Line 8). At Lines 2

7 IEEE TRANSACTIONS OF KNOWLEDGE AND DATA ENGINEERING, VOL. X, NO. X, XXXXX 2X 7 Algorithm 5 Probabilistic NN on an augmented R tree with ranking Algorithm PRNNaug(Quer point q, Augmented R tree on R, Integer m) : P first := ; Prob. of no object before 2: L := ; list of disqualified entries 3: H := ; heap of objects, organized b Po min 4: P m := ; P min of m-th object in H 5: while P first > P m and more objects in R do 6: := net NN of q in R; during BF-search, each non-leaf entr with P first e mae < t is removed from Best-First heap and inserted into L 7: compute P min and P ma b using P first, L and E ; 8: while P min < P m P ma do 9: pick the entr e with the smallest mind(q, e) in L; remove e from L; : if e is an object then e is an object closer to q than is : P first := P first ( E e); 2: for all entr o H such that d(q, e) d(q, o) do 3: Po first 4: else := P first o ( E e); 5: read node n e pointed b e and insert all entries of n e into L; 6: compute P min and P ma b using P first, L and E ; 7: if P min > P m then 8: enheap(h,(,p first :=P first,p min,p ma )); 9: if H is changed or L is changed then 2: recompute, for each o H, Po min and Po ma b using Po first, L and E o; 2: P m := m-th P min in H; 22: remove entries o from H with P ma o < P m ; 23: P first := P first ( E ); 24: while H > m and L > do 25: appl Lines 9 6; 26: appl Lines 2 22; 27: remove e from L with mind(q, e) > ma{d(q, o) : o H}; 3, Po first (for some entries in H) is updated for each object e found no further than o from q. The new Po first value is used to update Po min and potentiall the order of objects in H at Lines 2 2. Note that H ma store more than m entries, since there ma be objects o in it satisfing Po ma P m. However, entries o are removed from H once P ma P min o o < P m. The algorithm does not need to access an more objects from the Best-First heap as soon as P first < P m. In case H has more than m objects at that point, we need to refine the probabilit ranges of the objects in H (b processing entries in L) until we have the best m objects. In this case, entries e are removed from L once mind(q, e) > ma{d(q, o) : o H} because such entries cannot be used to refine the probabilit ranges of the objects in H. 4.3 Quer Evaluation using R trees An alternative method for indeing eistentiall uncertain data is to model the confidences E of objects as an additional dimension and use a R tree to inde the objects. Now, each non-leaf entr e in the tree, apart from the spatial dimensions, has a range [e mine, e mae ] within which the eistential probabilities of all objects in its subtree fall. Figure 5 illustrates the differences between the augmented R tree and the R tree. Figure 5a depicts the structure of the augmented R tree for the points p, p 2,, p 6. The R* tree insertion algorithm [4] aims at grouping the points into leaf nodes such that the their MBR areas are minimized. As such, the (non-leaf) entr e points to a leaf node containing the points p, p 2, p 3 ; whereas the entr e 2 points to a leaf node containing the points p 4, p 5, p 6. The spatial X, ranges, and the augmented probabilit, for these two entries in the augmented R tree are listed in Figure 5c. Note that each entr consists of 6 values (including its child node pointer). Figure 5b shows the structure of the R tree, for the same set of points. The R* tree insertion optimizes the bounding rectangles of nodes defined b three dimensions: spatial dimensions X and, as well as the probabilit dimension E. 
Hence, the entr e points to a leaf node containing the points p, p 2, p 5 ; whereas the entr e 2 points to a leaf node containing the points p 3, p 4, p 6. The values stored in these entries in the R tree are also listed in Figure 5c. Now, each entr consists of 7 values (including its child node pointer), impling that the fanout of the R tree is slightl smaller than the augmented R tree. The methods for processing the probabilistic range and NN queries over the augmented R tree (in Section 4.2) are applicable for the R tree, since each tree entr still stores an e mae value. In particular, for the NN quer, we utilize e mine to derive tighter probabilit ranges: P min = P first E (3) e L mind(q,e) d(q,) ( e mine )( e mae ) (enum ) P ma = P first E (4) e L mad(q,e) d(q,) ( e mine ) (enum ) ( e mae ) If the eact number e num of objects in the subtree pointed b e is not known, we can use the fanout f and the minimum node utilization (.4 for R* trees) and replace e num b f level(e) in Equation 3 and b (.4 f) level(e) in Equation 4. p 4 (.7) p (.3) p6 (.9) e (.8) p 3 e 2 p 2 (.6) p 5 (.5) (a) Augmented R tree p (.3) p 4 (.7) e 2 p6 (.9) (.8) p e 3 p 5 (.5) p 2 (.6) (b) R tree Name Entr e Entr e 2 Grouping Fields Aug. X=[.5,.4] X=[.55,.85] Spatial 6 R tree =[.2,.6] =[.3,.8] onl (+4+) mae=.8 mae=.9 R tree X=[.5,.7] X=[.4,.85] Both 7 =[.2,.6] =[.4,.8] spatial and (+4+2) E=[.3,.6] E=[.7,.9] probabilities (c) Comparison between the two trees Fig. 5. Structures of different R tree variants Interestingl, the quer performance of the R tree is not necessaril better than the augmented R tree. A careful eamination of Equations 3,4 reveals that these probabilit bounds are determined b both the spatial and probabilistic intervals of the entries. Even the e mine values in the R tree are helpful for tightening the bounds, this effect is

8 IEEE TRANSACTIONS OF KNOWLEDGE AND DATA ENGINEERING, VOL. X, NO. X, XXXXX 2X 8 counteracted b the large spatial bounding rectangles in the tree. Thus, more (disqualified) entries e L satisf the mind(q, e) d(q, ) condition in Equation 3 and fewer entries e L satisf the mad(q, e) d(q, ) condition in Equation 4. Hence, the final probabilit bounds for the R tree ma indeed become looser. Besides, the R tree has a slightl smaller fanout, which ma lead to more page accesses. 5 ADVANCED SPATIAL QUERIES In this section, we discuss probabilistic variants of spatial skline queries [5] and reverse nearest neighbor queries [6], due to their applications in spatial decision support sstems. For each quer tpe, we first present its background, then define its probabilistic variant, and finall develop corresponding quer algorithms for the thresholding and ranking versions. 5. Spatial Skline Queries Given a set Q of quer points (e.g., user locations) and two points p and p (e.g., two facilities), p spatiall dominates [5] p when all quer points in Q are closer to p than to p : q Q, d(q, p) d(q, p ) (5) Given a point dataset R, its spatial skline [5] (with respect to Q) contains the objects p R that are not spatiall dominated b an other object in R. As an eample, consider the distances of the stations p i R from a group of 2 users Q = {q, q 2 } in Figure 6a. The spatial skline contains p, p 2, and p 3. The main application of spatial skline queries is to discover facilities that are not farther than other facilities, for all users. distance to q 2 p (.3) p 7 (.7) p6 (.9) p 4 (.8) p 5 (.5) p 2 (.6) p 3 (.4) (a) a point set distance to q distance to q 2 ψ (e ) ψ (e ) 2 ψ (e ) ψ (e ) 3 ψ + (e ) distance to q (b) dominance relationship Fig. 6. Feature space defined b the distances from quer points To ease our discussion, we first introduce some notation. The spatial skline quer is formulated in a feature space in which each dimension captures the distance to a quer point. Given a set Q = {q, q 2,, q z } of quer points, a spatial location (i.e., data point) p (or a MBR e) can be mapped to a point ψ(p) (or a MBR ψ(e)) in a z-dimensional feature space, where the i-th dimension captures the distances of the points to q i (for i [, z]). Table 3 illustrates the mapping of a data point or an MBR (corresponding to a non-leaf R tree entr, assuming that the data points are indeed b an R tree) to this feature space. As a shorthand notation, we use ψ(p) ψ(p ) to mean that p spatiall dominates p. Let ψ (e) and ψ + (e) be the lower TABLE 3 Mapping from the original space to the feature space Original space Feature space ψ( ), i-th dimension Point p d(q i, p) MBR e [mind(q i, e), mad(q i, e)] and upper bound corners of the MBR ψ(e) respectivel (see Figure 6b). Since ψ + (e 2 ) ψ (e ), each point in e 2 must spatiall dominate all points in e. On the other hand, onl some point in e 3 ma spatiall dominate some points in e as ψ (e 3 ) ψ + (e ) and ψ + (e 3 ) ψ (e ). With the above mapping technique, [7] propose an R tree based algorithm for computing the dnamic skline in the feature space. The idea is to appl the best-first search algorithm [9] on the R tree to visit the entries e, from the origin z in the feature space, in ascending order of the value: ω Q (e) = q Q mind(q, e) (6) [7] proved that a point must be discovered earlier than the points it dominates (if an). Hence, a point p is reported as a result if it cannot be dominated b an eamined points. 
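As a small illustration of the mapping in Table 3 and the dominance test of Equation 5, the following sketch handles point data; Q is a list of query points, and the function names are assumptions.

```python
from math import dist

def psi(p, Q):
    # Table 3: map a data point to the feature space of distances to the
    # query points in Q (one dimension per query point).
    return tuple(dist(q, p) for q in Q)

def spatially_dominates(p1, p2, Q):
    # Equation 5: p1 spatially dominates p2 if no query point is closer to p2
    # than to p1.
    return all(dist(q, p1) <= dist(q, p2) for q in Q)

def omega(p, Q):
    # Equation 6 for a point (its mindist equals its distance): the best-first
    # key used to visit entries in ascending order from the origin of the
    # feature space.
    return sum(dist(q, p) for q in Q)
```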
We then adapt the above algorithm for the probabilistic spatial skline quer Probabilistic spatial skline quer and its properties: For eistentiall uncertain data, a point is a quer result with probabilit: P = E ( E ) (7) R ψ( ) ψ() which corresponds to the case that eists and the points dominating do not eist. A probabilistic spatial skline quer takes as input a set Q of quer points and returns all (, P ) pairs, such that R and belongs to the spatial skline of Q with probabilit P >. For instance, in Figure 6a, we have P p4 =.8 (.6) =.32 because p 2 dominates p 4. Since no points dominate p 2, we derive P p2 =.6 =.6. The probabilit of other points can be computed in a similar wa. In Section 4..2, we used a single variable P first to incrementall compute the upper bound probabilit for the remaining objects to be eamined. This technique is inapplicable to the spatial skline quer, since the points visited in decreasing order from the origin do not necessaril influence the points that will be visited net. For instance, the eistence of point p 2 in Figure 6a does not influence the probabilit that p (which is further than p 2 from the origin and will be visited net) is in the skline. However, it influences p 4, since p 4 is dominated b p 2. In general, given a set S R of alread eamined points, in order of their distance to the origin, an upper bound P first (S) of the probabilit that point is in the skline with respect to S can be computed b: P first (S) = ( E ) (8) S ψ( ) ψ() For a MBR e, the upper bound probabilit Pe first (S) of an

9 IEEE TRANSACTIONS OF KNOWLEDGE AND DATA ENGINEERING, VOL. X, NO. X, XXXXX 2X 9 point in e to be in the skline can be computed as follows: Pe first (S) = ( E ) (9) S ψ( ) ψ (e), since ψ (e) dominates an point in e. Net, we discuss how thresholding and ranking versions of the quer are evaluated on a R tree Thresholding and ranking: Assume that we want to find the points with probabilit at least t to be in the skline. Algorithm 6 describes the procedure to retrieve these points from a R tree. At Line 3, objects are incrementall retrieved from the tree in increasing order of their ω Q () value, which is defined in Equation 6. Set S is used for storing objects eamined so far, in order to derive the probabilit P first (S) of remaining objects (using Equation 8). The probabilit derivation of P as P first (S) E is correct because [7] proved that all the points dominating must have been eamined before (and stored into S). When P t, is reported as a result. In case P first (S) < t, an (remaining) point dominated b must be at least dominated b the same subset of points in S such that P first (S) < t. Thus, is inserted into S onl when P first (S) t. Following the above logic, we can optimize the algorithm at Line 3 b removing non-leaf entries with Pe first (S) < t from the Best-First heap. Algorithm 6 Probabilistic spatial skline on a R tree with thresholding Algorithm PTSK(Quer set Q, R tree on R, Threshold t) : S := ; set of eamined objects 2: while more objects in R do 3: := net point in R with minimum ω Q (); during BF-search, non-leaf entries e with Pe first (S) < t are removed from Best-First heap 4: P := P first (S) E ; 5: if P t then 6: output (,P ); 7: if P first (S) t then 8: insert into S; Threshold-based retrieval (of Algorithm 6) can be etended to retrieve the m points with the highest probabilit to be in the skline (i.e., the ranking probabilistic variant of the quer). The general idea is to maintain a heap H of m objects with the highest P found so far. In addition, we replace the fied threshold t b a floating bound P m, which indicates the m- th highest P in H. If P is found to be greater than P m, then the result heap H and the bound P m are updated. As P m increases, (unnecessar) objects with Po first (S) < P m are removed from S in order to save space Etensions for augmented R trees: As discussed in Section 4.2, augmented R trees can be used to improve the quer efficienc. Algorithm 7 generalizes Algorithm 4 to utilize information from an augmented R tree. During the BF-search at Line 4, each non-leaf entr e with Pe first (S) e mae < t is removed from Best-First heap because the cannot contain an results. If the removed entr has Pe first (S) at least t, then it ma influence the remaining points and e is inserted into the list L for further processing. At Line 5, the lower bound P min and upper bound P ma probabilities of a point are computed from S, L and E using Equations and respectivel. P min = P first (S) E P ma = P first (S) E e L ψ (e) ψ() e L ψ + (e) ψ() ( e mae ) enum () ( e mae ) () When P min < t P ma, we need to refine the probabilit range for (Lines 6-3). After that, is reported as a result if P min t. In case P first (S) t, is inserted into S because it influences the probabilit of other points that ma end up in the result. 
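For reference, Equation 7 can also be checked with a direct quadratic scan; the sketch below reuses spatially_dominates from the earlier sketch and assumes independent confidences, whereas PTSK/PTSKaug avoid examining all pairs.

```python
from math import prod

def skyline_probabilities(objects, Q, threshold=0.0):
    # objects: list of ((x, y), E) pairs; returns (point, P) pairs with
    # P >= threshold, where P = E_x * product over dominators x' of (1 - E_x')
    # as in Equation 7.
    results = []
    for x, e_x in objects:
        p = e_x * prod(1.0 - e_y for y, e_y in objects
                       if y != x and spatially_dominates(y, x, Q))
        if p >= threshold:
            results.append((x, p))
    return results
```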
Algorithm 7 Probabilistic spatial skline on an augmented R tree with thresholding Algorithm PTSKaug(Quer point q, Augmented R tree on R, Threshold t) : S := ; set of eamined objects 2: L := ; list of disqualified entries 3: while more objects in R do 4: := net point in R with minimum ω Q (); during BF-search, each non-leaf entr e with Pe first (S) e mae < t is removed from Best-First heap, and inserted into L if Pe first (S) t 5: compute P min and P ma b using P first (S), L and E ; 6: while P min < t P ma do 7: pick the entr e with the smallest ω Q (e) in L; remove e from L; 8: if Pe first (S) t then e ma influence the probabilit of potential results 9: if e is an object then : insert e into S; : else 2: read node n e pointed b e and insert all entries of n e into L; 3: compute P min 4: if P min t then 5: output (,P min 6: if P first (S) t then 7: insert into S; and P ma,p ma ); b using P first (S), L and E ; Similarl, we can generalize the algorithm for evaluating ranking spatial skline queries on an augmented R tree. Regarding R trees, Equations 2 and 3 are applied to compute the P min and P ma values of a point respectivel. In case the eact number e num of objects in the subtree pointed b e is not known, we can use the fanout f and the minimum node utilization (.4) and replace e num b f level(e) in Equations,2, and b (.4 f) level(e) in Equation 3. P min = P first (S) E (2) P ma e L ψ (e) ψ() ( e mine )( e mae ) (enum ) = P first (S) E (3) e L ψ + (e) ψ() ( e mine ) (enum ) ( e mae ) 5.2 Reverse Nearest Neighbor Queries Given a point dataset R and a quer point q, a reverse nearest neighbor (RNN) quer [6] retrieves the objects p R having q as their nearest neighbor. This quer has applications in decision support and resource allocation. [6], [8] develop R tree based algorithms for reverse nearest neighbor queries. In this section, we etend the geometric partitioning method of [6] to solve probabilistic versions of this problem. According

10 IEEE TRANSACTIONS OF KNOWLEDGE AND DATA ENGINEERING, VOL. X, NO. X, XXXXX 2X to [6], an RNN quer can be answered in two steps. In the filter step, the data space (shown in Figure 7a) is divided into si equal sectors A, A 2,, A 6 around the quer point q. The NN of q in each sector (if an) is included into the candidate set. In the eample, the candidates are the points p (in A 2 ), p 2 (in A 5 ), and p 4 (in A 3 ). [6] proved that the candidate set is a superset of the result set. During the refinement step, each candidate is verified b retrieving its NN. A candidate (e.g., p 2 ) is reported as a result if its NN is q. Otherwise, the candidate (e.g., p 3 ) is a false hit and it is discarded. A 3 p 4 (.5) A 4 A 2 p 3 (.7) p (.6) q p 2 (.8) A 5 A A 6 (a) 6-sector partitioning p 3 A 5 (.7) A 2 p p 4 A 6 (.5) (.6) A A 7 A8 A 4 A 3 q p2 (.8) A 9 A A A 2 (b) 2-sector partitioning Fig. 7. Reverse nearest neighbor quer eample Probabilistic reverse nearest neighbor quer and its properties: For eistentiall uncertain data, a point belongs to the RNN set of q with probabilit: P = E ( E ) (4) R d(, )<d(q,) which corresponds to the case that eists and the points that are closer to than to q do not eist. For instance, in Figure 7a, we have P p3 =.7 (.6) (.5) =.4 because p 3 is closer to p and p 4 than to q. Since p 2 is closer to q than to other points, we derive P p2 =.8 =.8. The probabilit of other points can be computed in a similar wa. Similarl to the skline quer and unlike the NN quer of Section 4..2, we cannot define an order of visiting the points around q, such that the upper bound probabilit of remaining points to be in the RNN result, can be maintained b incrementall updating a single P first value. To elaborate this, suppose that we first eamined the point p in Figure 7a. Note that p onl influences the probabilities of p 3, p 4 but not that of p 2. The eample also demonstrates that the upper bound probabilit of a point can be computed b using eamined points. Given a set S R of (eamined) points, the upper bound probabilit P first of a point with respect to q and S is defined as: P first = S d(, )<d(q,) ( E ) (5) Geometric properties of reverse nearest neighbors can be eploited to derive the upper bound probabilit of remaining points in a specific sector (see Figure 7a). [6] proved that, if two points p and p are in the same sector and q is closer to p than to p, then p must be closer to p than to q. Based on this propert, a natural solution for the quer is to retrieve the points in ascending order of their distances from q. For each sector A j, its P first A j value is used as the upper bound probabilit of an remaining point in A j. P first A j is set to initiall and it is multiplied b the factor ( E ) when a new point is discovered in A j. We observe that introducing additional sectors ma help deriving tighter probabilit bounds for uneamined points. Consider the 2-sector partitioning shown in Figure 7b. When a point (sa, p ) is discovered in the sector A 4, it is used to update P first A j for the sectors (i.e., A 3, A 4, A 5 ) that are within (maimum) 6 angular range from A 4. Conversel, the probabilit bound of a sector is contributed b the points within (3 3) = 9 angular range. Recall that, for the original 6-sector partitioning in [6], a sector is onl affected b the points within 6 angular range. In general, given a positive integer V, in the (6V )-sector partitioning scheme, (2V ) sectors need to be eamined per visited point. 
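The two ingredients above, the probability of Equation 14 and the equal-angle sector assignment, can be sketched as follows; this is a brute-force reference with objects given as (point, E) pairs, not the index-based algorithm presented next.

```python
from math import atan2, pi, dist, prod

def rnn_probability(x, e_x, q, objects):
    # Equation 14: x is a reverse NN of q if x exists and every object that is
    # closer to x than q is to x does not exist (independence assumed).
    return e_x * prod(1.0 - e_y for y, e_y in objects
                      if y != x and dist(x, y) < dist(q, x))

def sector_of(p, q, kappa):
    # Index in [0, kappa) of the equal-angle sector around q containing p;
    # with kappa = 6V this corresponds to the partitioning discussed above.
    angle = atan2(p[1] - q[1], p[0] - q[0]) % (2 * pi)
    return int(angle // (2 * pi / kappa)) % kappa
```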
The more sectors we have, the tighter probabilit bounds are derived for (uneamined points in) the sectors, and the earlier unqualified sectors can be pruned. On the other hand, the computational overhead of updating probabilit bounds for the sectors is proportional to V. In Section 6, we will determine an appropriate number of sectors that achieves significant cost reduction and adds little computational overhead for updating probabilit bounds for the sectors. Net, we discuss how this partitioning scheme can be used to evaluate probabilistic reverse nearest neighbor queries Thresholding and ranking: Algorithm 8 shows how thresholding RNN queries are evaluated on a tree. The sstem parameter κ specifies the number of sectors to be used. First, the space is divided into κ sectors A i and their probabilit bounds P first A i are set to. The algorithm maintains candidate objects (i.e., potential results) in a set C and delas computing the actual probabilit of a candidate until all objects influencing it have been eamined. Eamined objects are stored in the set S and the are used to compute upper bound probabilities for candidate objects. Both C and S are initialized to empt sets. At Line 6, we appl Best-First search [9] to incrementall retrieve the net NN (i.e., the object ) of q from the tree R. Suppose that A() denotes the sector containing. If the upper bound probabilit E P first A() is greater than the threshold t, then a tighter upper bound probabilit E P first is computed, b eamining the objects in S (see Equation 5). When the above probabilit is at least t, the object is inserted into C. After that, is inserted into S and P first A i is updated for each sector A i within 6 angular range from A(). In turn, is used to update the Po first value for objects in C, and those satisfing E o Po first < t are removed from C. If the last deheaped distance d last (from the Best-First heap) is greater than 2 d(q, o) for a candidate object o C, then all entries in Best-First heap cannot affect the probabilit of o. At Line 7, we compute the actual probabilit P o as Po first E o and report o as a result when P o t. The loop (Lines 5-2) continues while some sector ma contain potential results to be discovered or C is not empt. The above algorithm can be etended to retrieve the topm ranked reverse nearest neighbors from the R tree. It

11 IEEE TRANSACTIONS OF KNOWLEDGE AND DATA ENGINEERING, VOL. X, NO. X, XXXXX 2X Algorithm 8 Probabilistic reverse nearest neighbor on a R tree with thresholding Algorithm PTRNN(Quer point q, R tree on R, Threshold t) κ: number of sectors (sstem parameter) : divide the space into κ equal sectors around q: A i, i [, κ]; 2: for all i [, κ] do 3: P first A :=; i Upper prob. bound of remaining objects in sector A i 4: C:= ; S:= ; 5: while i [, κ], P first A t or C > do i 6: := net NN of q in R; d last := d(q, ); 7: let A() be the sector of ; 8: if E P first first t and E P A() t then appl cheap filter first, and then epensive filter 9: C := C {}; : S := S {}; : for all i [, κ] such that A i is within (maimum) 6 angular range from A() do 2: P first A i := P first A i ( E ); 3: for all o C such that d(, o) < d(q, o) do update Po first 4: Po first :=Po first ( E ); 5: remove objects o from C with E o Po first < t; filter false hits 6: for all o C such that d last 2 d(q, o) do entries in Best-First heap cannot affect the probabilit of o 7: P o := Po first E o; 8: if P o t then 9: output (o,p o); 2: remove o from C; Algorithm 9 Probabilistic reverse nearest neighbor on an augmented R tree with thresholding Algorithm PTRNNaug(Quer point q, Augmented R tree on R, Threshold t) κ: number of sectors (sstem parameter) : divide the space into κ equal sectors around q: A i, i [, κ]; 2: for all i [, κ] do 3: P first A :=; i Upper prob. bound of remaining objects in sector A i 4: C:= ; S:= ; L:= ; 5: while more objects in R do 6: := net NN of q in R; d last := d(q, ); during BF-search, each non-leaf entr e intersecting onl sector(s) with P first A e mae < t is removed from Best-First heap and inserted i into L 7: appl Lines 7 5 of Algorithm 8; 8: for all o C such that d last 2 d(q, o) do entries in Best-First heap cannot affect the probabilit of o 9: while Po first E o t and e L, mind(e, o) < d(q, o) do refinement step : remove the entr e with the smallest mind(e, o) in L; : if e is an object then 2: appl Lines 5 of Algo. 8, but b replacing with e; 3: else 4: read the node n e pointed b e and insert all entries of n e into L; 5: P o := Po first E o; 6: if P o t then 7: output (o,p o); 8: remove o from C; 9: for all o C do verif remaining candidates in C 2: appl Lines 9 8 of this Algorithm; Besides, the above thresholding 3 algorithm can be adapted to Algorithm 9, for augmented R trees and R trees. At Line 6, each non-leaf entr e intersecting onl sector(s) with P first A i e mae < t is removed from Best-First heap because the cannot contain an results. However, such entries ma affect the probabilit of other points so the are inserted into L. Lines 9-4 compute the actual probabilit for such an object, b refining its Po first value with the entries in L. For this, we check whether the upper bound probabilit Po first E o is above t and q is closer to some entries in L than to q. If so, the entr closest to o is removed from L and its child node n e is accessed. In case n e points to a tree node, all its entries are inserted into L. Otherwise, entries in n e are used to update the set S, Po first values of candidate objects, and P first A i values of sectors. At Lines 9-2, the remaining candidates in C are verified b accessing entries in L that ma influence their probabilities. 6 EXPERIMENTAL EVALUATION In this section, we evaluate the efficienc of the proposed techniques. 
6 EXPERIMENTAL EVALUATION

In this section, we evaluate the efficiency of the proposed techniques. We compare the performance of five indexes and their corresponding algorithms for the thresholding and ranking versions of range queries, nearest neighbor search, skyline queries, and reverse nearest neighbor retrieval. The five indexes are: (i) a simple R-tree; (ii) an R-tree in which each non-leaf entry e is augmented with e.max, the maximum existential probability of the objects in its subtree; (iii) an R-tree in which each non-leaf entry e is augmented with both e.max and e.num, the number of objects in the subtree indexed by it; (iv) an R-tree that indexes the existential probability as an additional dimension; and (v) the index of (iv) with each non-leaf entry e augmented with e.num. For indexes (iv) and (v), all (spatial/probability) dimensions are normalized to the same domain interval. Note that index (i) captures minimum information in its non-leaf entries and occupies the least space, whereas index (v) is at the other end (its entries capture maximum information and it occupies the most space).

All algorithms were implemented in C++, and the experiments were run on a PC with a 2.8GHz Pentium D CPU. The page size of the indexes was set to 1Kb; the same relative performance of the above methods was observed for other page sizes (up to 8Kb). No memory buffers are used for caching disk pages between different queries, so the number of node accesses directly reflects the I/O cost. In each experiment, the measured cost is the average cost of queries with the same parameter values (but with different locations, randomly chosen from the dataset). For range queries, nearest neighbor search, and reverse nearest neighbor retrieval, the I/O time is over 90% of the total execution cost, so the CPU time is not reported.
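For illustration, the sketch below shows what the augmented directory entries of indexes (ii) and (iii) could look like and how they support pruning. The structure AugEntry and the functions may_contain_threshold_results and none_exists_lower_bound are our own; the fanout-based bound only mirrors the role of e.num discussed above and does not reproduce the paper's Equations 3 and 4.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MBR = Tuple[float, float, float, float]          # (x1, y1, x2, y2)

@dataclass
class AugEntry:
    mbr: MBR
    child_page: int                              # page id of the child node
    max_E: float                                 # max existential probability in the subtree (index ii)
    num: Optional[int] = None                    # number of objects in the subtree (index iii)

def may_contain_threshold_results(e: AugEntry, t: float) -> bool:
    """A thresholding query can skip e whenever even the most probable object
    in its subtree cannot reach the threshold t."""
    return e.max_E >= t

def none_exists_lower_bound(e: AugEntry, fanout: int, height: int) -> float:
    """Lower bound on the probability that no object under e exists: each of
    at most n objects exists with probability at most e.max_E, so the product
    of (1 - existence probability) is at least (1 - e.max_E) ** n. Here n is
    e.num when stored (index iii) and a fanout-based upper bound otherwise
    (index ii)."""
    n = e.num if e.num is not None else fanout ** height
    return (1.0 - e.max_E) ** n
```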

6.1 Description of Data

For our experiments, we used several real datasets of different sizes and object distributions, described in Table 4. The datasets TG and SF are obtained from [19], while the other datasets are obtained from the R-tree Portal.

TABLE 4
I/O cost of thresholding/ranking NN queries on different datasets (t = 0.5 for thresholding)

Dataset                     Size    (i)    (ii)      (iii)     (iv)      (v)
San Joaquin roads (TG)              /      6.49/     9.75/     28.6/     3./
Greece roads (GR)                   /      7.23/     2.79/     24.58/    25.2/
Long Beach roads (LB)               /      9.7/      2.49/     24.7/     25.8/
LA streets (LA)                     /      9.9/      2.89/     32.83/    36.7/
San Francisco roads (SF)            /      2.39/     22.5/     32.7/     36.7/
Tiger streams (TS)                  /      8.36/     9.69/     27.37/    32.94/

Due to the lack of a real spatial dataset whose objects have existential probabilities, we generated probabilities for the objects using the following methodology. First, we generated a set of anchor points randomly on the map, following the data distribution. These points model locations around which there is high certainty about the existence of data (e.g., they could be antennas of receivers close to which information is accurate). For each point x of the dataset, we (i) find the closest anchor a and (ii) assign an existential probability proportional to 1/(c · dist(x, a))^θ. Thus, the distribution of probabilities around the anchors is Zipfian. The probabilities are normalized (using c) with respect to the maximum probability (1), which corresponds to the anchor point itself. The default skew value is θ = 1; experiments with different skew values can be found in our preliminary work [20].
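As a concrete illustration of this data-generation methodology, the following Python sketch (our own; the treatment of points that coincide with an anchor and the exact choice of the normalization constant c are assumptions) assigns Zipfian existential probabilities around anchors sampled from the data.

```python
import math
import random

def zipf_probabilities(points, num_anchors, theta=1.0, seed=0):
    """Sample anchors from the data, then give each point a probability
    proportional to 1 / (c * dist(p, a))**theta for its closest anchor a,
    with c chosen so that the largest probability equals 1. Points that
    coincide with an anchor simply receive probability 1 (our choice)."""
    rng = random.Random(seed)
    anchors = rng.sample(points, num_anchors)     # anchors follow the data distribution

    def d(p, a):
        return math.hypot(p[0] - a[0], p[1] - a[1])

    nearest = [min(d(p, a) for a in anchors) for p in points]
    positive = [v for v in nearest if v > 0]
    c = 1.0 / min(positive) if positive else 1.0  # normalization: max probability = 1
    probs = [1.0 if v == 0 else min(1.0, 1.0 / (c * v) ** theta) for v in nearest]
    return anchors, probs
```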
6.2 Experimental Results

Table 4 shows the performance of the five indexes for thresholding and ranking NN queries on the different datasets. We fix t = 0.5 for thresholding NN queries and use a fixed default m for ranking NN queries. (A small value of t is necessary in order to observe differences between the indexes; larger values of t are tested in a subsequent experiment.) Observe that the augmented indexes (ii)–(iii) and the probability-dimension indexes (iv)–(v) perform better than the simple R-tree, even though they are larger in size: Algorithms 4 and 5 manage to prune a large number of nodes that do not contain query results, which are otherwise visited in the simple R-tree. The cost of the augmented variants (ii) and (iii) does not change much with the database size, whereas the cost of indexes (iv) and (v) increases slowly as the database size grows. This is due to the fact that indexes (iv) and (v) group entries using both the spatial and the probability dimensions, while the query algorithms mainly search for objects based on the spatial dimensions. In the subsequent experiments, we compare the performance of the indexes on the SF dataset; unless otherwise stated, thresholding queries use t = 0.5 and ranking queries use a fixed default value of m.

Figure 8 shows the performance of the indexes for thresholding and ranking NN queries. The four advanced indexes (ii)–(v) perform much better than the simple R-tree for all tested values of t and m. For t ≥ 0.2, fewer than 5 accesses are required to find the query result when using the four advanced indexes with Algorithms 4 and 5. When comparing these indexes, we observe that augmenting e.num is not a good idea; using the fanout f already gives accurate enough estimations of P^min and P^max, so the extra space (translated into extra accesses) required for augmenting e.num does not pay off. In addition, the augmented R-tree (ii) performs better than index (iv). First, index (iv) occupies more space (the capacity of each non-leaf node is smaller) and thus incurs more accesses, since the extra space is not compensated by tighter P^min and P^max bounds (see Equations 3 and 4). Second, since index (iv) groups entries into nodes using the existential probabilities as well as the spatial dimensions, it does not achieve as good a spatial partitioning as one that uses the spatial dimensions only, even though search is performed primarily on the spatial dimensions.

Fig. 8. NN queries on the SF dataset, θ = 1: (a) thresholding queries vs. t; (b) ranking queries vs. m.

Next, we examine the performance of range queries on the indexes. The parameter len denotes the extent of the query window (in each dimension); its default value is 5% of the domain length. Figures 9a and 9b show the cost of thresholding and ranking queries as a function of t and m, respectively. Except for the simple R-tree, all indexes follow trends similar to those for probabilistic nearest neighbor queries. The cost of range queries on the simple R-tree is independent of t and m, as all points within the spatial range are retrieved. Observe that for very small t, the advanced indexes may perform worse than the simple R-tree because (i) they prune no or very few directory entries with e.max lower than t, and (ii) they are larger in size than the simple R-tree. Similarly, P_m decreases with m, affecting the costs of the advanced methods. Index (iv) also performs worse than the augmented R-tree for range queries. Figure 9c shows the cost of thresholding queries as a function of len; the costs of all methods increase with len.

Fig. 9. Range queries on the SF dataset: (a) thresholding queries vs. t; (b) ranking queries vs. m; (c) thresholding queries vs. len (t = 0.5).

We proceed to compare the performance of spatial skyline queries on the indexes. For each query, a set Q of query points is randomly generated in a query window with side length len, such that the window follows the data distribution. The default values of |Q| and len are 6 and 5% of the domain range, respectively. Figure 10 shows the I/O and CPU time breakdown of thresholding and ranking queries as a function of t and m, respectively; each page fault is charged a fixed number of milliseconds of I/O time. Observe that the augmented R-tree outperforms its competitors for a wide range of parameters.

Fig. 10. Spatial skyline queries on the SF dataset (|Q| = 6, len = 5%): I/O and CPU time breakdown for (a) thresholding queries vs. t; (b) ranking queries vs. m.
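The skyline query workload described above can be generated, for instance, as in the sketch below (our own reading of the setup; the function name, the clamping at the domain border, and the interpretation that the window center is drawn from the data are assumptions).

```python
import random

def skyline_query_workload(points, q_size, len_frac, domain=1.0, seed=0):
    """Generate one spatial skyline query: a window with side length
    len_frac * domain, centered at a location drawn from the data (so windows
    follow the data distribution), containing q_size random query points."""
    rng = random.Random(seed)
    side = len_frac * domain
    cx, cy = rng.choice(points)                   # window center follows the data
    x_lo = min(max(cx - side / 2.0, 0.0), domain - side)
    y_lo = min(max(cy - side / 2.0, 0.0), domain - side)
    return [(x_lo + rng.random() * side, y_lo + rng.random() * side)
            for _ in range(q_size)]
```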

In terms of I/O, the trends are similar to those in Figure 8; however, the CPU time of the four advanced indexes becomes high at low t values and high m values.

Figure 11 plots the cost of the indexes when varying the number |Q| of query points. In general, when |Q| increases, a point is spatially dominated by fewer points, and thus the probability that the point belongs to the skyline increases. Hence, more points need to be examined by a thresholding query and its cost increases rapidly. On the other hand, P_m increases with |Q|, strengthening the pruning power of the advanced indexes, so the cost of ranking queries increases at a slower rate.

Fig. 11. Spatial skyline queries on the SF dataset, varying |Q|: I/O and CPU time breakdown for (a) thresholding queries (t = 0.5); (b) ranking queries.

Finally, we study the performance of the indexes for reverse nearest neighbor queries. Figure 12 shows the effect of the number of sectors on performance. When more sectors are used, tighter probability bounds are derived for the sectors and hence the algorithm terminates earlier. In particular, the 96-sector partitioning achieves a substantial cost reduction over the basic 6-sector partitioning for both thresholding and ranking queries. Observe that the cost starts converging to its final value with as few as 24 partitions.

Fig. 12. Reverse nearest neighbor queries on the SF dataset, varying the number of sectors: (a) thresholding queries (t = 0.5); (b) ranking queries.

Figure 13 plots the cost of the methods as a function of t and m, respectively, when the 24-sector partitioning is used. For thresholding queries, the performance gap between the simple R-tree and the other indexes widens as t increases, because of the increased pruning power of the advanced indexes. On the other hand, the cost differences among the indexes are not sensitive to the value of m. As with the previous query types, the augmented R-tree prevails.

Fig. 13. Reverse nearest neighbor queries on the SF dataset, using the 24-sector partitioning: (a) thresholding queries vs. t; (b) ranking queries vs. m.

7 DISCUSSION: RELAXING THE INDEPENDENCE ASSUMPTION

Our analysis so far assumes that the existential probabilities of the objects are independent. This assumption is valid in a large number of applications (e.g., those mentioned in Section 1); hence, our solutions have significant value in practice. However, there are also applications where the existential probabilities of different objects are correlated. For example, consider a collection of sensors distributed in a forest for detecting wildfire: when a sensor detects smoke, sensors in its neighborhood are likely to sense it as well. A thorough solution for this scenario falls outside the scope of this paper. Nevertheless, in the sequel, we point out a direction towards extending the proposed algorithms and indexing schemes to support correlated existential probabilities.

We now elaborate on how to evaluate the thresholding probabilistic NN query with a simple R-tree under the correlated probability model. Instead of Definition 2, we define the probability of an object x to be the NN of q as

  JP_x = JPr( Γ(x) ∧ ⋀_{x' ∈ R, x' ≠ x, d(q,x') < d(q,x)} ¬Γ(x') ),

where JPr denotes the joint probability that x exists (i.e., the event Γ(x)) and all objects closer to q do not exist. This probability can be computed from a Bayesian network modeling the dependencies among the objects. The value JPr( ⋀_{x' ∈ R, x' ≠ x, d(q,x') < d(q,x)} ¬Γ(x') ) serves as an upper bound of JP_x, regardless of how the probabilities are correlated. Based on this property, we modify Algorithm 2 as follows: (i) we maintain the set S of visited objects; (ii) at Line 7, we insert the object x into S and compute P^first_x = JPr( ⋀_{x' ∈ S} ¬Γ(x') ); and (iii) at Line 4, we compute JP_x = JPr( Γ(x) ∧ ⋀_{x' ∈ S} ¬Γ(x') ). The same idea also works for spatial skylines (Equation 7, Algorithm 6) and reverse nearest neighbors (Equation 14, Algorithm 8), after replacing each multiplication by a conjunction, each E_x by Γ(x), each (1 − E_{x'}) by ¬Γ(x'), and the final probability by the corresponding joint probability JPr(·). Extending the other R-tree solutions (e.g., the augmented trees and the trees that index probability as an extra dimension) raises non-trivial research issues, because (i) the number of possible joint probabilities is enormous (exponential in the data cardinality), and (ii) it remains unclear how to augment a non-leaf entry so that it effectively captures the joint probabilities of the objects in its subtree. In the future, we will develop efficient solutions for augmented trees that are applicable under the correlated probability model.
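The following brute-force Python illustration (our own, not the paper's algorithm) makes the definition of JP_x concrete on a toy joint distribution; the explicit table of existence vectors stands in for whatever Bayesian network models the correlations, and real data would require proper inference instead of enumeration.

```python
import math

def joint_nn_probability(q, objects, joint):
    """objects: list of (id, point); joint: dict mapping a tuple of 0/1
    existence flags (one per object, in listed order) to its probability.
    Returns {id: JP}, the joint probability of being the NN of q."""
    def d(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    result = {}
    for i, (oid, p) in enumerate(objects):
        closer = [j for j, (_, pj) in enumerate(objects)
                  if j != i and d(q, pj) < d(q, p)]
        jp = 0.0
        for world, prob in joint.items():
            if world[i] == 1 and all(world[j] == 0 for j in closer):
                jp += prob            # x exists and every closer object does not
        result[oid] = jp
    return result

# Tiny example with two positively correlated objects a and b:
objects = [("a", (1.0, 0.0)), ("b", (2.0, 0.0))]
joint = {(1, 1): 0.5, (1, 0): 0.1, (0, 1): 0.1, (0, 0): 0.3}
print(joint_nn_probability((0.0, 0.0), objects, joint))
# a is closer to q, so JP_a = P(a exists) = 0.6, and JP_b = P(b exists, a absent) = 0.1;
# note JP_b <= P(a absent) = 0.4, the upper bound used by the modified algorithm.
```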
8 CONCLUSIONS

In this paper, we presented the interesting problem of evaluating spatial queries on existentially uncertain data. Variants of common spatial queries, like range and nearest neighbor search, have probabilistic versions for this data model. We proposed algorithms for these probabilistic versions and several extensions of spatial access methods (i.e., R-trees) on which these algorithms are applied. In addition, we discussed how complex spatial queries, such as spatial skyline queries and reverse nearest neighbor queries, can be processed in our framework. Finally, we conducted extensive experiments to evaluate the search algorithms and the corresponding spatial indexes. In most of the tested cases, the data structure that performs best is an R-tree whose non-leaf entries are augmented with the maximum existential probabilities of the subtrees they point at. In the future, we plan to study more advanced query types in detail and to extend our methods to data that are both existentially and locationally uncertain, as well as to the results of fuzzy classifiers [1].

ACKNOWLEDGMENTS

This work was supported by grant HKU 749/7E from Hong Kong RGC. The work of Yufei Tao was supported by grants CUHK 22/6 and CUHK 46/7 from Hong Kong RGC. A preliminary version of this work appeared in [20].

REFERENCES

[1] P. M. Atkinson and N. J. Tate, Eds., Advances in Remote Sensing and GIS Analysis. John Wiley & Sons, 1999.
[2] O. Wolfson, A. P. Sistla, S. Chamberlain, and Y. Yesha, "Updating and Querying Databases that Track Mobile Units," Distributed and Parallel Databases, vol. 7, no. 3, 1999.
[3] D. Pfoser and C. S. Jensen, "Capturing the Uncertainty of Moving-Object Representations," in Proc. of SSD, 1999.
[4] J. Ni, C. V. Ravishankar, and B. Bhanu, "Probabilistic Spatial Database Operations," in Proc. of SSTD, 2003.
[5] X. Yu and S. Mehrotra, "Capturing Uncertainty in Spatial Queries over Imprecise Data," in Proc. of DEXA, 2003.
[6] R. Cheng, D. V. Kalashnikov, and S. Prabhakar, "Querying Imprecise Data in Moving Object Environments," IEEE TKDE, vol. 16, no. 9, 2004.
[7] Y. Tao, X. Xiao, and R. Cheng, "Range Search on Multidimensional Uncertain Data," ACM TODS, vol. 32, no. 3, 2007.
[8] A. Guttman, "R-trees: A Dynamic Index Structure for Spatial Searching," in Proc. of ACM SIGMOD, 1984.
[9] G. R. Hjaltason and H. Samet, "Distance Browsing in Spatial Databases," ACM TODS, vol. 24, no. 2, 1999.
[10] R. Cheng, Y. Xia, S. Prabhakar, R. Shah, and J. S. Vitter, "Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data," in Proc. of VLDB, 2004.
[11] G. Trajcevski, O. Wolfson, F. Zhang, and S. Chamberlain, "The Geometry of Uncertainty in Moving Objects Databases," in Proc. of EDBT, 2002.
[12] R. Cheng, D. V. Kalashnikov, and S. Prabhakar, "Evaluating Probabilistic Queries over Imprecise Data," in Proc. of ACM SIGMOD, 2003.
[13] I. Lazaridis and S. Mehrotra, "Approximate Selection Queries over Imprecise Data," in Proc. of ICDE, 2004.
[14] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles," in Proc. of ACM SIGMOD, 1990.
[15] M. Sharifzadeh and C. Shahabi, "The Spatial Skyline Queries," in Proc. of VLDB, 2006.
[16] I. Stanoi, D. Agrawal, and A. Abbadi, "Reverse Nearest Neighbor Queries for Dynamic Databases," in Proc. of the SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2000.
[17] D. Papadias, Y. Tao, G. Fu, and B. Seeger, "Progressive Skyline Computation in Database Systems," ACM TODS, vol. 30, no. 1, 2005.
[18] Y. Tao, D. Papadias, and X. Lian, "Reverse kNN Search in Arbitrary Dimensionality," in Proc. of VLDB, 2004.
[19] T. Brinkhoff, "A Framework for Generating Network-Based Moving Objects," GeoInformatica, vol. 6, no. 2, 2002.
[20] X. Dai, M. L. Yiu, N. Mamoulis, Y. Tao, and M. Vaitis, "Probabilistic Spatial Queries on Existentially Uncertain Data," in Proc. of SSTD, 2005.

Man Lung Yiu received the Bachelor's degree in computer engineering and the PhD degree in computer science from the University of Hong Kong in 2002 and 2006, respectively. He is currently an assistant professor at the Department of Computer Science, Aalborg University. His research focuses on the management of complex data, in particular query processing on spatio-temporal and multidimensional data.

Nikos Mamoulis received a diploma in computer engineering and informatics in 1995 from the University of Patras, Greece, and a PhD in computer science from the Hong Kong University of Science and Technology. He is currently an associate professor at the Department of Computer Science, University of Hong Kong. In the past, he worked as a research and development engineer at the Computer Technology Institute, Patras, Greece, and as a post-doctoral researcher at the Centrum voor Wiskunde en Informatica (CWI), the Netherlands. His research focuses on the management and mining of complex data types. He has served on the program committees of numerous international conferences and workshops on data management and data mining. He was the general chair of SSDBM 2008 and a co-organizer of SSTDM 2006. He is an editorial board member of the GeoInformatica journal and a field editor of the Encyclopedia of Geographic Information Systems.

Xiangyuan Dai received the Bachelor of Engineering degree from the Department of Computer Science and Technology of the University of Science and Technology of China in 2004, and the MPhil degree in computer science from the University of Hong Kong in 2006. His research interests include query processing problems on spatial data.

Yufei Tao is engaged in research on database systems. He is particularly interested in index structures and query algorithms on multidimensional data, and has published primarily on temporal databases, spatial databases, and privacy preservation. He received the Hong Kong Young Scientist Award in 2002. He has served on the program committees of the most prestigious database conferences, such as SIGMOD, VLDB, and ICDE, and is currently an associate editor of ACM Transactions on Database Systems (TODS). He joined the Chinese University of Hong Kong in September 2006; before that, he held positions at Carnegie Mellon University and the City University of Hong Kong. He is a member of the ACM.

Michail Vaitis holds an engineering diploma (1992) and a PhD degree in computer engineering and informatics from the University of Patras, Greece. Since 2003 he has been a faculty member of the Department of Geography at the University of the Aegean, Greece, where he is now an assistant professor. In the past, he worked for five years at the Research Academic Computer Technology Institute (RA-CTI), Greece, on hypertext and database systems. His research interests include geographical databases, spatial data infrastructures, geographic hypermedia, and the geo-spatial semantic web. He is a member of the ACM and the Technical Chamber of Greece.
