Finding both Aggregate Nearest Positive and Farthest Negative Neighbors


I-Fang Su 1, Yuan-Ko Huang 2, Yu-Chi Chung 3,**, and I-Ting Shen 4
1 Dept. of IM, Fotech, Kaohsiung, Taiwan
2 Dept. of IC, KYU, Kaohsiung, Taiwan
3 Dept. of CSIE, CJCU, Tainan, Taiwan
4 Dept. of CSIE, NCKU, Tainan, Taiwan

*This work is supported by the National Science Council of Taiwan (R.O.C.) under Grants NSC 100-2221-E-268-007, NSC 100-2221-E-309-011, and NSC 100-2221-E-244-018.
**Corresponding Author

Abstract

Recently, researchers have used the aggregate nearest neighbor (ANN) search for users at different locations (query points) who want to find one restaurant (data point) that minimizes the sum of the distances they must travel in order to meet. Users can also use the aggregate farthest neighbor (AFN) search to choose the building location of a new hotel (data point) so as to maximize the aggregate distance to all the existing hotels (query points) and thereby reduce competition. These works focus on finding either the aggregate nearest neighbors or the aggregate farthest neighbors. In reality, a user may need both: a decision often depends on the aggregate nearest neighbors with respect to a set of query points the user prefers, and on the aggregate farthest neighbors with respect to another set of query points the user dislikes. To distinguish these two sets of query points, we call the locations the user prefers the Positive query set and the locations the user dislikes the Negative query set. Motivated by these observations, we propose a novel query that meaningfully combines the aggregate nearest positive neighbor search and the aggregate farthest negative neighbor search. We name this query the Aggregate Nearest Positive and Farthest Negative Neighbors (ANPFNN) query. In this paper, we propose a round-robin algorithm to retrieve the first aggregate nearest positive and farthest negative neighbors. Further, we use a pruning rule to efficiently filter out unqualified candidates. Our extensive evaluation results validate the effectiveness and efficiency of our algorithm on both uniformly distributed and clustered data.

Keywords: aggregate nearest positive, farthest negative neighbor search, nearest neighbor, dominate

1. Introduction

Nearest neighbor queries (NN queries) [1], [2], [3], [4] have been widely discussed in recent years. Given an n-dimensional data set D and a query point q, an NN query finds the data point d (d ∈ D) that is nearest to q. For example, in Figure 1, the point data set P = {p1, p2, p3, p4, p5} represents the locations of parking lots, and the point q is the current position of the user. If the user issues a query to a server for the location of the nearest parking lot, the server returns p2, as it is the closest point to q within the data set P.

Fig. 1: Example of a Nearest Neighbor search.

Recently, the aggregate nearest neighbor (ANN) query [5], [6] has been proposed. ANN queries are a variant of NN queries. The conventional NN query finds a data point pi that is close to a given query point q, where pi belongs to a point data set P = {p1, p2, p3, ...}. The ANN query problem involves a point data set P = {p1, p2, p3, ...}, a point query set Q = {q1, q2, q3, ...}, and a given aggregate function F such as MIN, MAX, or SUM. An ANN query finds the point pi within P that has the shortest aggregate distance to all the query points.
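To make the aggregate-distance idea concrete, the following is a minimal brute-force sketch in C++ (the language used for the paper's experiments). It only illustrates the definition above; it is not one of the algorithms proposed here or in [5], and the names Point, sumDist, and bruteForceANN are ours. The MAX and MIN variants are obtained by replacing the running sum with a running maximum or minimum.

```cpp
#include <cmath>
#include <limits>
#include <vector>

struct Point { double x, y; };

static double dist(const Point& a, const Point& b) {
    return std::hypot(a.x - b.x, a.y - b.y);   // Euclidean distance
}

// SUM aggregate distance from one data point d to the whole query set Q.
static double sumDist(const Point& d, const std::vector<Point>& Q) {
    double total = 0.0;
    for (const Point& q : Q) total += dist(d, q);
    return total;
}

// Brute-force ANN with F = SUM: return the data point in P whose total
// distance to all query points in Q is smallest (assumes P is non-empty).
Point bruteForceANN(const std::vector<Point>& P, const std::vector<Point>& Q) {
    Point best = P.front();
    double bestTotal = std::numeric_limits<double>::max();
    for (const Point& p : P) {
        double t = sumDist(p, Q);
        if (t < bestTotal) { bestTotal = t; best = p; }
    }
    return best;
}
```

In the hotel/restaurant example that follows, P would hold the restaurants, Q the hotels, and the SUM variant would return R2.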
For example, in Figure 2, suppose a university holds an international conference and accommodations are arranged for the attendees at several hotels (Q = {H1, H2, ..., H7}) near the university. The conference is looking for a venue to hold a banquet among the given restaurants P = {R1, R2, R3, R4}. To minimize the total transportation cost of all the attendees (supposing that the transportation rates are the same), the host must find the one restaurant in P that has the shortest total distance to Q. The table in Fig. 3 depicts the shortest distances from the restaurants to each of the hotels.

To minimize transportation costs, the host chooses to hold the banquet at R2, because, of all the restaurants, R2 has the shortest total distance (14.2 km) to the hotels.

Fig. 2: Example of an Aggregate Nearest Neighbor search.

In addition to the SUM function, MAX and MIN can also be employed as aggregate functions in ANN queries. We use Figure 2 to explain the meanings of the other functions. If MAX is adopted, the ANN query returns R2, because the maximum distance of R2 to the hotels is the shortest among all restaurants. This ANN query shows that selecting R2 as the venue minimizes the time any attendee spends getting there (supposing that all attendees move at the same speed). If MIN is adopted, the ANN query returns R4, since the minimum distance of R4 to the hotels is the shortest among all restaurants. The primary objective of this query is to shorten the vacancy time of the venue (because R4 can be reached the most quickly).

Another variant of the NN query is the aggregate farthest neighbor (AFN) query [7]. Given a point data set P, a point query set Q, and a function F, the AFN query retrieves the data point p (p ∈ P) with the maximum aggregate distance to Q. For example, suppose a financial group wishes to build a resort at a popular tourist destination; it must find a location p among the potential areas P that is as far as possible from all the other hotels Q. Similar to ANN queries, AFN queries also support MAX, MIN, and SUM as aggregate functions.

From the above observations, we find that ANN queries and AFN queries each give only one-sided information to users. In reality, users may need both sides to explore a data space and make an objective decision. For instance, when purchasing a house, people usually hope for high functionality in the surroundings but want to stay away from dangerous or noisy locations. Presented as options, home buyers may hope for the presence of (1) schools, (2) MRT stations, and (3) parks within a 500-meter radius. At the same time, hazardous locations such as (1) power plants, (2) gas stations, and (3) industrial zones are not desired in the neighboring area. For example, in Figure 4, suppose that X and Y are distance coordinates, di (i = 1 to 4) are the houses for sale in the area, P = {p1, p2, p3} is the query set of locations that the user wishes to be close to (such as parks, schools, and MRT stations), and N = {n1, n2, n3} is the query set of locations that the user wishes to be far away from (such as gas stations, power plants, and chemical plants). To distinguish these two types of query sets, we use the terms Positive query set and Negative query set to denote the locations the user wishes to be close to and far away from, respectively. In this example, the user wants to select from the given data points a house that is close to the positive query set and simultaneously far away from the negative query set. As seen in Figure 4, under the user's preference conditions, d2 and d3 are the two houses recommended to the user; the maximum distance of d2 to the positive query set is the shortest of all the houses.
In addition to d2, d3 is also a recommended option for the user. Although the maximum distance of d3 to the positive query set is not the shortest, its minimum distance to the negative query set is the farthest; that is, d3 is the house farthest away from the locations the user dislikes. Hence, d2 and d3 are both answers to this query. Apart from the above example, there are a large number of similar queries in real life. From these observations, we believe that this kind of query is important for users making a decision. However, existing methods cannot be applied directly to this type of query. Although ANN and AFN queries are related to it, each considers only the aggregate nearest or only the aggregate farthest neighbor search. The answers provided by these two kinds of algorithms can satisfy only a single user demand and cannot simultaneously account for the two different query sets. Besides, there may be more than one result that meets the user's demands, and ANN as well as AFN queries may satisfy only a portion of the query conditions. Therefore, we formally define the problem of finding the Aggregate Nearest Positive and Farthest Negative Neighbors (ANPFNN) query in this paper. We propose an algorithm to execute the ANPFNN query and use a pruning rule to efficiently filter out unqualified candidates so as to reduce the computation cost. We also apply a dominance test [8] in ANPFNN in order to retrieve all significant objects for users to make a good decision. Our extensive evaluation results validate the effectiveness and efficiency of our algorithm on both uniformly distributed and clustered data.
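To check this numerically, the short program below is a sketch of ours (not part of the proposed algorithm). It takes the house-to-query distances listed later in Fig. 5, scores every house by its maximum distance to the positive query set and its minimum distance to the negative query set, and keeps the houses that no other house beats on both scores; these notions are formalized in Section 3.

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    // Distances (km) from houses d1..d4 to the positive points p1..p3 and the
    // negative points n1..n3, taken from the table in Fig. 5.
    const double toP[4][3] = {{12, 28, 32}, {25, 28, 18}, {21, 41, 47}, {38, 5, 58}};
    const double toN[4][3] = {{16, 32, 45}, {20, 38, 57}, {26, 25, 31}, {28, 11, 17}};

    double pos[4], neg[4];
    for (int i = 0; i < 4; ++i) {
        pos[i] = *std::max_element(toP[i], toP[i] + 3);  // max distance to the positive set
        neg[i] = *std::min_element(toN[i], toN[i] + 3);  // min distance to the negative set
    }

    // A house is an answer if no other house is both closer to the positive
    // set (smaller pos) and farther from the negative set (larger neg).
    for (int j = 0; j < 4; ++j) {
        bool dominated = false;
        for (int i = 0; i < 4; ++i)
            if (i != j && pos[i] < pos[j] && neg[i] > neg[j]) dominated = true;
        if (!dominated)
            std::printf("d%d: max-to-P = %.0f, min-to-N = %.0f -> answer\n",
                        j + 1, pos[j], neg[j]);
    }
    return 0;  // prints d2 (28, 20) and d3 (47, 25), as argued above
}
```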

Distance (km)   H1    H2    H3    H4    H5    H6    H7    Total
R1              2     4     5.5   7     5.5   4     2     30
R2              2     1.5   2     1.5   3     2     2.2   14.2
R3              5.5   7     6     4.5   2.5   1.5   3.5   30.5
R4              6     4.5   2.2   1     2.5   3     7     26.2
Fig. 3: The distances between restaurants and hotels of Figure 2.

Fig. 4: Example of an aggregate nearest positive and farthest negative neighbors search.

The rest of the article is organized as follows. Section 2 reviews related work on ANN and AFN searches. Section 3 provides preliminaries on the ANPFNN query and the distance metrics used in this paper. Section 4 presents our main algorithm for processing ANPFNN queries efficiently. Results of our experimental study are reported in Section 5. Finally, Section 6 concludes the paper.

2. Related Work

In this section, we briefly review previous work related to ANPFNN queries. Papadias et al. [5] proposed three algorithms, named the Multiple Query Method (MQM), the Single Point Method (SPM), and the Minimum Bounding Method (MBM), for processing aggregate nearest neighbor searches. In [7], Gao et al. propose the minimum bounding (MB) and best-first (BF) algorithms for processing aggregate farthest neighbor queries; the MB algorithm extends the idea of MBM for ANN queries by using a threshold to filter the possible results and then efficiently retrieve the answers. Intuitively, an ANPFNN query can be decomposed into an ANN problem and an AFN problem. However, adopting these existing methods for the ANPFNN problem raises the following issues. To accelerate query processing, the ANN and AFN methods both employ R-trees to index the query sets and the data points. Hence, the root nodes of the R-trees are visited frequently while processing a query that finds the points nearest to the positive query set and those farthest from the negative query set. This step incurs a considerable amount of computation cost, and many redundant query results appear during processing. Moreover, the retrieved results may fit only a one-sided query, so a further process is required to obtain the final results. Therefore, we apply a different pruning technique that quickly filters impossible answers to further reduce the computation cost involved in processing queries.

3. Preliminaries

In this section, we first give the definitions of the aggregation distance, the dominance relation, and the aggregate nearest positive as well as farthest negative neighbors, and then formally define the aggregate nearest positive and farthest negative neighbors (ANPFNN) query. We further describe the underlying indexing structure of our algorithm.

3.1 Definitions

Given a set of data points D = {d1, d2, ..., di}, a set of positive query points P = {p1, p2, ..., px}, and a set of negative query points N = {n1, n2, ..., ny}, the aggregation distance (AggD) is defined in Definition 3.1. Building on it, the ANPFNN query is formulated in Definition 3.7.

Definition 3.1: Aggregation distance (AggD). Given a data point d, the aggregation distance between d and a query set P is AggD(d, P) = F_{j=1..x} |d, p_j|, where |d, p_j| denotes the distance between d and p_j and F can be SUM, MAX, or MIN. In this paper, we use F_sum, F_max, and F_min to denote the respective aggregation functions.

Definition 3.2: Nearest Positive distance. Given a data point d and a set of positive query points P = {p1, p2, ..., px}, the nearest positive distance of d is its aggregation distance to P under the MAX function:
Nearest Positive distance = F_max_{j=1..x} |d, p_j|.
Definition 3.3: Farthest Negative distance. Given a data point d and a set of negative query points N = {n1, n2, ..., ny}, the farthest negative distance of d is its aggregation distance to N under the MIN function:
Farthest Negative distance = F_min_{j=1..y} |d, n_j|.
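Definitions 3.1 through 3.3 translate almost directly into code. The sketch below is our own reading of them, assuming Euclidean distance; the helper names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

struct Point { double x, y; };
enum class Agg { SUM, MAX, MIN };                  // the functions F of Definition 3.1

static double dist(const Point& a, const Point& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Definition 3.1: AggD(d, Q) = F over the distances from d to every q in Q.
double aggD(const Point& d, const std::vector<Point>& Q, Agg F) {
    double sum = 0.0, mx = 0.0, mn = std::numeric_limits<double>::max();
    for (const Point& q : Q) {
        double l = dist(d, q);
        sum += l; mx = std::max(mx, l); mn = std::min(mn, l);
    }
    return F == Agg::SUM ? sum : (F == Agg::MAX ? mx : mn);
}

// Definition 3.2: the nearest positive distance uses F_max over the positive set P.
double nearestPositiveDistance(const Point& d, const std::vector<Point>& P) {
    return aggD(d, P, Agg::MAX);
}

// Definition 3.3: the farthest negative distance uses F_min over the negative set N.
double farthestNegativeDistance(const Point& d, const std::vector<Point>& N) {
    return aggD(d, N, Agg::MIN);
}
```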

We can then use the above two distances to define the aggregate nearest positive and the aggregate farthest negative distances.

Distance (km)   Positive Query Set     Negative Query Set
                p1    p2    p3         n1    n2    n3
d1              12    28    32         16    32    45
d2              25    28    18         20    38    57
d3              21    41    47         26    25    31
d4              38    5     58         28    11    17
Fig. 5: The distances among data points, positive and negative query sets of Figure 4.

Definition 3.4: Aggregate Nearest Positive distance (PAgg). Given a set of data points D = {d1, d2, ..., di} and a set of positive query points P = {p1, p2, ..., px}, the aggregate nearest positive distance is the minimum AggD(d, P) among all data points in D:
PAgg = MIN_{t=1..i} {AggD(d_t, P)} = MIN_{t=1..i} {F_max_{j=1..x} |d_t, p_j|}.

Definition 3.5: Aggregate Farthest Negative distance (NAgg). Given a set of data points D = {d1, d2, ..., di} and a set of negative query points N = {n1, n2, ..., ny}, the aggregate farthest negative distance is the maximum AggD(d, N) among all data points in D:
NAgg = MAX_{t=1..i} {AggD(d_t, N)} = MAX_{t=1..i} {F_min_{j=1..y} |d_t, n_j|}.

Definition 3.6: Dominate. A data point d_i dominates another data point d_j on both PAgg and NAgg if and only if the PAgg of d_i is less than that of d_j and the NAgg of d_i is greater than that of d_j.

Definition 3.7: Aggregate Nearest Positive and Farthest Negative Neighbor (ANPFNN) query. Given a set of data points D = {d1, d2, ..., di}, a set of positive query points P = {p1, p2, ..., px}, and a set of negative query points N = {n1, n2, ..., ny}, the ANPFNN query finds every data point d (d ∈ D) that is not dominated by any other data point of D on both PAgg and NAgg.

3.2 Indexing Structure

To reduce the I/O and computation cost of processing an ANPFNN query, a good underlying indexing structure is indispensable. In this paper, we use the R-tree [9] as our indexing structure. In an R-tree, objects are recursively grouped in a bottom-up manner according to their locations. For instance, Figure 6(a) gives a two-dimensional example with eight data points p1 to p8; the corresponding R-tree is shown in Figure 6(b). Each entry of a leaf node has the structure (o.ptr, (o.x, o.y)), where o.ptr is a pointer to the actual data point in the database and o.x and o.y are the X and Y coordinates of data point o. Each entry of an internal node has the structure (MBR_E, (xl, yd, xr, yu), E.ptr), where MBR_E is the minimum bounding rectangle (MBR) that encloses all the data points in the child node E, (xl, yd, xr, yu) are the lower and upper bounds of node E in the X and Y coordinates, and E.ptr is a pointer to node E. In this paper, we use three R-trees, the R_P-tree, the R_N-tree, and the R_D-tree, to index the positive query set, the negative query set, and the data set, respectively.

Fig. 6: Representations of entries in the R-tree.

4. Aggregate Nearest Positive and Farthest Negative Neighbor Search Algorithm

A naive solution for processing an ANPFNN query is to scan all data points, compute the PAgg and NAgg of each data point with respect to the positive and negative query sets, and finally apply the dominance test on both PAgg and NAgg to retrieve the final answers. However, computing the dominance relationships, the PAgg, and the NAgg of every data point is time consuming. Thus, we propose an ANPFNN algorithm that greatly reduces the computation cost. The ANPFNN algorithm consists of two main phases: (1) the filtering phase and (2) the refinement phase.
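The naive baseline just described can be sketched as follows. Assuming each data point's pair of scores from Definitions 3.2 and 3.3 has already been computed (for instance with the helpers sketched after Definition 3.3), a pairwise dominance test in the spirit of Definitions 3.6 and 3.7 keeps the non-dominated points. This is only the brute-force reference point, not the proposed two-phase algorithm, and the function name is ours.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// scores[t] = (nearest positive distance, farthest negative distance) of data point d_t.
// Returns the indices of the ANPFNN answers, i.e., the points not dominated by any
// other point (Definitions 3.6 and 3.7).
std::vector<std::size_t> naiveANPFNN(const std::vector<std::pair<double, double>>& scores) {
    std::vector<std::size_t> answers;
    for (std::size_t j = 0; j < scores.size(); ++j) {
        bool dominated = false;
        for (std::size_t i = 0; i < scores.size() && !dominated; ++i) {
            if (i == j) continue;
            // d_i dominates d_j: strictly closer to the positive set
            // and strictly farther from the negative set.
            dominated = scores[i].first < scores[j].first &&
                        scores[i].second > scores[j].second;
        }
        if (!dominated) answers.push_back(j);
    }
    return answers;
}
```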
4.1 Filtering Phase

We first employ a branch-and-bound traversal on the R_P-tree to find the data point d_i in D that has the minimum PAgg. Then, a branch-and-bound traversal is also applied on the R_N-tree to find the data point d_j in D that has the maximum NAgg. This phase retrieves data points from the R_P-tree and the R_N-tree in a round-robin fashion until a data point d_t shows up in both traversals. The way of finding the minimum PAgg and the maximum NAgg follows Lu [10]. While traversing the nodes of the R_P-tree, a heap is used to keep the entries ordered by MinDist(d, P) in descending order, where MinDist(d, P) is the minimum possible distance from d to an object in the R_P-tree. We then retrieve the first entry of the heap for further processing; this entry has the maximum MinDist and may contain the final result. If the entry is a data point, it is a possible answer for the minimum PAgg, and we continue to the second step of the filtering phase; otherwise the entry is repeatedly decomposed to reach the data points under this internal node.
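In spirit, this is the classic best-first branch-and-bound traversal of an R-tree driven by a priority queue of distance bounds. The sketch below is a generic illustration with our own types (Node, MBR, minDist), written for a single query point and a min-ordering; the actual filtering phase uses the aggregate bounds of Section 3 over the whole query sets, and the R_N-tree traversal described next mirrors it with MaxDist and the opposite ordering.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

struct Point { double x, y; };
struct MBR { double xl, yd, xr, yu; };             // lower-left / upper-right corners

// R-tree-like node: either a leaf holding points or an internal node holding children.
struct Node {
    MBR box;
    bool leaf = false;
    std::vector<Point> points;                     // used if leaf
    std::vector<const Node*> children;             // used if internal
};

// Minimum possible distance from point q to any point inside the MBR.
static double minDist(const Point& q, const MBR& r) {
    double dx = std::max({r.xl - q.x, 0.0, q.x - r.xr});
    double dy = std::max({r.yd - q.y, 0.0, q.y - r.yu});
    return std::hypot(dx, dy);
}

static double dist(const Point& a, const Point& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Generic best-first traversal: repeatedly pop the entry with the smallest bound,
// expand internal nodes, and stop once no entry can improve on the best point found.
Point bestFirstNearest(const Node* root, const Point& q) {
    using Entry = std::pair<double, const Node*>;
    auto cmp = [](const Entry& a, const Entry& b) { return a.first > b.first; };
    std::priority_queue<Entry, std::vector<Entry>, decltype(cmp)> heap(cmp);
    heap.push({minDist(q, root->box), root});

    Point best{};
    double bestD = std::numeric_limits<double>::max();
    while (!heap.empty()) {
        auto [bound, node] = heap.top();
        heap.pop();
        if (bound >= bestD) break;                 // remaining entries cannot improve
        if (node->leaf) {
            for (const Point& p : node->points) {
                double d0 = dist(q, p);
                if (d0 < bestD) { bestD = d0; best = p; }
            }
        } else {
            for (const Node* c : node->children)
                heap.push({minDist(q, c->box), c});
        }
    }
    return best;
}
```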

While traversing the nodes of the R_N-tree, we use another heap to keep the entries ordered by MaxDist(d, N) in ascending order, where MaxDist(d, N) is the maximum possible distance from d to an object in the R_N-tree. Following the same procedure as for the minimum PAgg, the first entry of the heap is retrieved for further processing. If the entry is a data point, it is a possible answer for the maximum NAgg, and we continue to the second step of the filtering phase; otherwise the entry is repeatedly decomposed to reach the data points under this internal node.

4.2 Refinement Phase

When the filtering phase terminates, many results have been retrieved. However, these results may not meet both requirements of the user (for example, a retrieved data point may be near the positive query points but also near the negative query points); each may excel in only one of the two conditions. Thus, we have to provide users with results that no other data point can beat in both conditions. We adopt the B-tree indexing method of [8] to find the final results. We build two indexes, for the aggregate nearest positive and the aggregate farthest negative distances, and sort the retrieved data points by their PAgg in descending order and by their NAgg in ascending order. We then scan the two indexes simultaneously to find the first matching result d_t. Any result listed after d_t is definitely not part of the answer because it is dominated by d_t, while the points listed before d_t are the final results for users to make a decision. Due to space limitations, we refer readers to [11] for more detailed examples of these two phases.

5. Performance

In this section we evaluate the efficiency of the proposed algorithm against the naive algorithm. The performance is measured by the average CPU time. The algorithms were developed in C++ and executed on a PC with an Intel i7 2.8 GHz CPU. The default numbers of positive objects, negative objects, and query points are all 110k, and we vary each of these three cardinalities from 10k to 210k. The coordinates of the objects are uniformly normalized in the domain [0, 10000]^2. We compare ANPFNN with the naive algorithm under different numbers of negative objects, positive objects, and data points in Figures 7, 8, and 9, respectively. Under all three settings, the average CPU time of ANPFNN increases smoothly, whereas that of the naive algorithm increases dramatically. The results show that the ANPFNN algorithm outperforms the naive algorithm over the different object distributions. Due to space limitations, the detailed experimental evaluations are listed in [11].

Fig. 7: The average CPU time when varying the negative query set size.
Fig. 8: The average CPU time when varying the positive query set size.
Fig. 9: The average CPU time when varying the data set size.

6. Conclusion

In this paper, we proposed the design and implementation of an algorithm for processing aggregate nearest positive and farthest negative neighbor queries in spatial databases. Our design applies a filtering phase to efficiently prune impossible results and a refinement phase to retrieve the final answers. Our performance study showed that this design exhibits superior performance in terms of computation cost. The potential of the ANPFNN query has not been fully exploited yet.
Currently, we are extending the capability of this design to deal with queries in road networks. An efficient technique for processing ANPFNN queries in this type of network is also under design.

References

[1] B. Cui, B. C. Ooi, J. Su, and K.-L. Tan, Indexing high-dimensional data for efficient in-memory similarity search, IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 3, pp. 339-353, 2005.

[2] L. Hong, B. C. Ooi, H. T. Shen, and X. Xue, Hierarchical indexing structure for efficient similarity search in video retrieval, IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 11, pp. 1544-1559, 2006.
[3] B. Zheng, J. Xu, W.-C. Lee, and D. L. Lee, Energy conserving air indexes for nearest neighbor search, in Proceedings of the 9th International Conference on Extending Database Technology, 2004, pp. 48-66.
[4] X. Xiong, M. F. Mokbel, and W. G. Aref, SEA-CNN: Scalable processing of continuous k-nearest neighbor queries in spatio-temporal databases, in Proceedings of the International Conference on Data Engineering, 2005, pp. 643-654.
[5] D. Papadias, Y. Tao, K. Mouratidis, and C. K. Hui, Aggregate nearest neighbor queries in spatial databases, ACM Transactions on Database Systems, vol. 30, no. 2, pp. 529-576, 2005.
[6] M. L. Yiu, N. Mamoulis, and D. Papadias, Aggregate nearest neighbor queries in road networks, IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, pp. 820-833, 2005.
[7] Y. Gao, L. Shou, K. Chen, and G. Chen, Aggregate farthest-neighbor queries over spatial data, in International Conference on Database Systems for Advanced Applications (DASFAA), 2011, pp. 149-163.
[8] S. Börzsönyi, D. Kossmann, and K. Stocker, The skyline operator, in Proceedings of the 17th International Conference on Data Engineering (ICDE), 2001, pp. 421-430.
[9] A. Guttman, R-trees: A dynamic index structure for spatial searching, in Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, 1984, pp. 47-57.
[10] H. Lu, On computing farthest dominated locations, IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 6, pp. 928-941, 2011.
[11] I.-F. Su, Y.-K. Huang, Y.-C. Chung, and I.-T. Shen, ANPFNN query, 2012, http://140.116.247.159/anpfnn.docx.