An index structure for efficient reverse nearest neighbor queries

Congjun Yang, Division of Computer Science, Department of Mathematical Sciences, The University of Memphis, Memphis, TN 38152, USA
King-Ip Lin, Division of Computer Science, Department of Mathematical Sciences, The University of Memphis, Memphis, TN 38152, USA

Abstract
The Reverse Nearest Neighbor (RNN) problem is to find all points in a given data set whose nearest neighbor is a given query point. Just like Nearest Neighbor (NN) queries, RNN queries arise in many practical situations such as marketing and resource management, so efficient methods for answering RNN queries in databases are required. This paper introduces a new index structure, the Rdnn-tree, that answers both RNN and NN queries efficiently. A single index structure is employed for a dynamic database, in contrast to the use of multiple indexes in previous work; this enables significant savings in dynamically maintaining the index structure. The Rdnn-tree outperforms existing methods in various aspects. Experiments on both synthetic and real-world data show that our index structure outperforms the previous method by a significant margin (more than 90% reduction in the number of leaf nodes accessed) on RNN queries. It also shows improvement over standard techniques on NN queries. Furthermore, insertion and deletion performance is significantly enhanced by the ability to combine multiple queries (NN and RNN) into one traversal of the tree. These facts make our index structure preferable in both the static and the dynamic case.
1 Introduction
Indexing is an indispensable tool in database systems. Various kinds of indexes are used to speed up query execution. Moreover, new applications and queries continue to demand new and improved indexes and associated algorithms.
One type of query that has recently received attention is the Reverse Nearest Neighbor (RNN) query: given a data set S and a query point q, an RNN query finds all the points in S having q as their nearest neighbor. This problem corresponds to a class of problems we call influence problems. For instance, suppose a bank is to open a new branch at some location. It may want to know which existing branches will be affected by the new branch, assuming people choose the nearest branch to conduct business. Moreover, a rival bank may also want to assess the influence of putting a new branch at that location and the effect it would have on existing branches of other banks. Also, with the advance of the Internet and the Web, people expect systems to deliver (or push) interesting and relevant information to them. While users do not want to be inundated with a large volume of junk messages, it is crucial for them to receive the information most relevant to them. One way to achieve this balance is to push only the information most pertinent to the interests of each user. For instance, a company can send advertisements about a new product only to those customers who will find this product more relevant than any of the existing products. This allows users to receive the information they actually need, and at the same time spares them from sorting through the junk cluttering their mailboxes, making the advertisement more effective. Hence reverse nearest neighbor queries are a very practical and important class of queries; Korn and Muthukrishnan [8] provide more examples. A naive solution of the problem requires O(n^2) time with no preprocessing, as the nearest neighbors of all the points in S have to be found. Thus more efficient algorithms are required. One approach, described by Korn and Muthukrishnan [8], is to pre-compute the nearest neighbors of every point in S.
Then, given the query point q, one can compare it with the stored nearest-neighbor information of the points in S. For each point x in S, one can compute and store a spherical region with x as the center and the distance from x to its nearest neighbor as the radius. It can be seen that if a query point q falls into this region, then x is an RNN of q ([8] provides a proof).
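This precompute-and-query scheme can be sketched as follows. It is a minimal in-memory sketch of the idea (without the index structure that follows), and the function names are illustrative rather than taken from [8]:

```python
import math

def dist(p, q):
    return math.dist(p, q)

def precompute_regions(S):
    """For each point x in S, store the circle B(x, dnn_S(x)):
    x together with the distance to its nearest neighbor in S."""
    regions = []
    for x in S:
        dnn = min(dist(x, p) for p in S if p != x)
        regions.append((x, dnn))
    return regions

def rnn_query(regions, q):
    """x is a reverse nearest neighbor of q iff q falls inside B(x, dnn_S(x))."""
    return [x for (x, dnn) in regions if dist(q, x) <= dnn]
```

For the collinear points (0, 0), (1, 0) and (4, 0) with query point (2, 0), the query reports (1, 0) and (4, 0): each falls at least as close to the query as to its own nearest neighbor.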

All the regions can be organized into a multi-dimensional index structure (for instance, one of the R-tree family [1, 5, 13]) for effective storage and query performance. This method, while pioneering, has some drawbacks. For instance, it requires two indexes in the dynamic case, where insertions into and deletions from the data set occur. Moreover, the stored regions tend to overlap significantly, hampering performance. In this paper we present a new structure, called the Rdnn-tree (R-tree containing Distance of Nearest Neighbors), which is well suited for RNN queries in both the static and the dynamic case. The Rdnn-tree differs from the standard R-tree structure by storing, in each node, extra information about the nearest neighbors of the points below it. This piece of information yields significant improvements in all algorithms. The Rdnn-tree has many advantages, including:
- It significantly outperforms the index structures in [8], typically requiring only 1-2 leaf accesses to locate the RNNs.
- It can perform NN queries efficiently. As a result, only one tree is required in the dynamic case, for both NN and RNN queries.
- It enables one to execute multiple NN and RNN queries in one traversal of the tree, further enhancing performance in the dynamic case.
The rest of the paper is organized as follows: Section 2 outlines previous work on multi-dimensional indexes and queries. Section 3 describes the previous RNN algorithms in more detail and outlines the potential for improvement. The proposed Rdnn-tree is presented in Section 4. Section 5 provides experimental results, and Section 6 summarizes our work and discusses future directions.
2 Related work
There has been a large body of work on multidimensional index structures. For instance, the index structure we propose is based on the popular R-tree family [5, 13, 1], which generalizes the B-tree to multiple dimensions by storing minimum bounding regions (hyperrectangles) instead of numbers representing 1-D intervals.
Interested readers are referred to [12] and [4] for full surveys of multi-dimensional index structures. Early work on multi-dimensional index structures focused on range queries. Recently the nearest neighbor query problem has received substantial attention. In addition to the work in computational geometry (e.g., see [11]), many algorithms have been proposed to search for nearest neighbors using tree-based indexes like the R-trees. Many such algorithms take a branch-and-bound approach: the tree is traversed from the root, and at each step a certain heuristic is used to determine which branch to traverse next and which branches can be pruned from the search. The various algorithms differ in the order of the search. For instance, Roussopoulos et al. [10] use a depth-first approach, while Hjaltason and Samet [6] propose a distance-browsing algorithm, using a priority queue to order the branches to be traversed. Other approaches have been proposed as well. One is to modify the index structure to enhance the branch-and-bound algorithms; two examples are the SS-tree [15] and the SR-tree [7]. An alternative approach, proposed by Berchtold et al. [2], indexes an approximation of the Voronoi diagram associated with the data set.
3 Definitions and existing algorithms
This section presents existing algorithms for reverse nearest neighbor search and discusses potential improvements. We first provide formal definitions of the nearest neighbor and reverse nearest neighbor search problems. In what follows, we assume that S is a set of points in some d-dimensional space; d(p, q) is the distance between two points p and q; if T is a subset of S, d(p, T) denotes the minimum distance between p and any point in T; and B(p, r) is the circle centered at p with radius r.
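In code, this notation, together with the NN and RNN sets described informally in the introduction, reads as follows (a throwaway brute-force sketch; the function names are illustrative):

```python
import math

# d(p, q): the distance between two points p and q
def d(p, q):
    return math.dist(p, q)

# d(p, T): the minimum distance between p and any point in T
def d_set(p, T):
    return min(d(p, t) for t in T)

# NN_S(q): all points of S at minimum distance from the query point q
def nn_set(S, q):
    m = d_set(q, S)
    return [r for r in S if d(q, r) == m]

# RNN_S(q): all points of S that have q as their nearest neighbor,
# i.e. r is at least as close to q as to any other point of S
def rnn_set(S, q):
    return [r for r in S
            if d(r, q) <= d_set(r, [p for p in S if p != r])]
```

On the points (0, 0), (1, 0), (4, 0) with query (2, 0), nn_set returns only (1, 0), while rnn_set returns both (1, 0) and (4, 0), illustrating that the NN and RNN sets of a query point need not coincide.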
DEFINITION (Nearest Neighbor Search, NN search): Given a set S of points in some d-dimensional space and a query point q, the nearest neighbor search problem is to find the subset NN_S(q) of S defined as follows:
NN_S(q) = { r ∈ S : d(q, r) ≤ d(q, p) for all p ∈ S }.
DEFINITION (Reverse Nearest Neighbor Search, RNN search): Given a set S of points in some d-dimensional space and a query point q, the reverse nearest neighbor search problem is to find the subset RNN_S(q) of S defined as follows:
RNN_S(q) = { r ∈ S : d(q, r) ≤ d(r, p) for all p ∈ S with p ≠ r }.
Notice that d(p, NN_S(p)) is the distance between p and its nearest neighbors in S; for simplicity we denote it by dnn_S(p). S will be omitted from the above notations where the context is clear. In general, there is no natural relationship between NN_S(q) and RNN_S(q): r ∈ NN_S(q) does not imply r ∈ RNN_S(q), and vice versa. The general RNN search approach is presented by Korn and Muthukrishnan [8]. Let S be a given set of points and q a query point. For any point p in S, p takes q as its nearest
neighbor if and only if d(p, q) ≤ dnn_S(p), i.e., p is at least as close to q as to its nearest neighbors in S. Since S is known, we can pre-compute NN_S(p) for every point p in S and store it in a suitable way. Korn and Muthukrishnan used an RNN-tree, which is essentially an R*-tree. For every point p in the data set S, the RNN-tree stores the minimum bounding rectangle of the circle B(p, dnn_S(p)) in a leaf node. With such an index structure, the RNN search problem becomes a simple point query problem: for any given query point q, p is in RNN_S(q) only if q falls inside the circle, and hence inside the minimum bounding rectangle of the circle. Complications arise when points are inserted into or deleted from the tree; in such cases, the RNN-tree has to be updated. Consider first the case of insertion. When a point p' is inserted into S, we need to make two kinds of adjustments: for every point in RNN_S(p'), we need to update the region stored in the RNN-tree, since p' is its new nearest neighbor; also, the region corresponding to p' (i.e., B(p', dnn_S(p'))) needs to be computed and inserted into the RNN-tree. This implies that the insertion algorithm needs to find both NN_S(p') and RNN_S(p'). One would like to use the RNN-tree to find the nearest neighbors. However, the leaf nodes of an RNN-tree contain geometric objects (the regions) instead of the points themselves. This makes the higher-level bounding regions larger and the tree sub-optimal for standard nearest neighbor queries. Thus it is proposed that a second tree, the NN-tree (a plain R*-tree), be created to ensure efficient nearest neighbor search. However, this implies that the second tree also needs to be maintained during insertion. The insertion algorithm can be summarized as follows:
Algorithm 1 RNN-Insert(RNN-tree, NN-tree, p')
1) Perform an RNN search on the RNN-tree for p' to find RNN_S(p').
2) For each p in RNN_S(p'), shrink B(p, dnn_S(p)) to B(p, d(p, p')).
3) Call the standard NN search algorithm on the NN-tree to find NN_S(p').
4) Insert p' into the RNN-tree using the R*-tree insertion algorithm.
5) Insert p' into the NN-tree using the R*-tree insertion algorithm.
A similar situation arises when a point p'' is deleted. Again, we need to make two kinds of adjustments: deleting the region corresponding to p'' (i.e., B(p'', dnn_S(p''))), as well as finding the new nearest neighbors of all the points in RNN_S(p'') and adjusting their corresponding regions in the RNN-tree. Once again both NN and RNN queries are needed, and notice that we might have to perform multiple NN queries in the second step. The deletion algorithm is listed as follows.
Algorithm 2 RNN-Delete(RNN-tree, NN-tree, p'')
1) Delete p'' from the RNN-tree using the R*-tree deletion algorithm.
2) Delete p'' from the NN-tree using the R*-tree deletion algorithm.
3) Perform an RNN search on the RNN-tree for p'' to find RNN_S(p'').
4) For each p in RNN_S(p''), call the standard NN search algorithm on the NN-tree to find NN_S(p) in the updated set, and enlarge B(p, d(p, p'')) to B(p, dnn_S(p)).
Thus, in the dynamic case one needs to update two trees to maintain the index structures, leading to inefficiency in both time and space. While the technique above is a general approach, there are other techniques that work for lower-dimensional points. One such approach is to take advantage of the geometric properties of the problem. Stanoi, Agrawal and El Abbadi [14] introduced an algorithm that works directly on an R*-tree. It transforms the RNN problem into a set of constrained nearest neighbor queries. An interesting fact about RNN queries is that the maximum number of RNNs of a query point is bounded, and if multiple RNNs exist, they have to be distributed fairly evenly around the query point. Thus, upon receiving the query point q, the algorithm divides the entire space into a number of regions based on q, where the number of regions equals the maximum possible number of RNNs.
For each region, the algorithm finds the nearest neighbors of q within it. It can be shown that the true RNNs are among these points, and finding the correct solutions among them can be done easily. The main drawback of the algorithm is that the number of regions to be searched grows very fast as the dimensionality increases; for the L1 norm, for instance, the growth is exponential. This renders the algorithm ineffective in higher dimensions. Moreover, every region has to be searched whether an RNN resides in it or not, so there can be a lot of wasted effort during the search.
4 The Rdnn-tree
4.1 Motivation
We have discussed the limitations of the RNN-tree approach in the last section. While storing the spherical region B(p, dnn_S(p)) is necessary, the RNN-tree suffers from the following:
- Large overlap between the regions causes increased overlap in the parent nodes' MBRs (minimum bounding rectangles), hampering RNN search performance.

- Storing the spherical regions themselves renders the index structure ineffective for NN queries, so a second tree is needed in the dynamic case. This severely adds to the cost of maintaining the index.
Thus we want a structure that supports point-location and NN queries, while maintaining the dnn_S(p) information so that RNN queries are supported properly. We therefore propose the Rdnn-tree (R*-tree with Distance of Nearest Neighbors) to kill two birds with one stone: we use the R*-tree to store the data points themselves, but enhance the nodes with information about the nearest-neighbor distances of the points they contain.
4.2 The Rdnn-tree structure
In an Rdnn-tree, a leaf node contains entries of the form (pt, dnn), where pt refers to a d-dimensional point in the data set and dnn is the distance from the point to its nearest neighbor in the data set. A non-leaf node contains an array of branches of the form (ptr, Rect, max_dnn). ptr is the address of a child node in the tree. If ptr points to a leaf node, Rect is the minimum bounding rectangle of all points in the leaf node; if ptr points to a non-leaf node, Rect is the minimum bounding rectangle of all rectangles that are entries in the child node. In both cases max_dnn = max { dnn_S(p) : p is a point contained in the subtree rooted at this branch }.
4.3 Algorithms
We first present the NN and RNN search algorithms for the Rdnn-tree, as both are needed by the insertion and deletion algorithms.
RNN search
The reverse nearest neighbor search on the Rdnn-tree is similar to a point-location search. The only difference is the criterion used to decide which branch(es) to follow down the tree. Assume that q is the query point. For a leaf node, we examine each point p in the node: if d(q, p) ≤ dnn_S(p), i.e., p is at least as close to q as to its nearest neighbor, then p is one of the reverse nearest neighbors. For an internal node, we compare the query point q with each branch (ptr, Rect, max_dnn). Here max_dnn plays a crucial role.
By definition, all points in the subtree rooted at a branch B are contained in Rect, and the distance from each of them to its nearest neighbor is at most max_dnn (max_dnn is the largest such distance). Hence if d(q, Rect) > max_dnn, branch B need not be visited, because no point in B can be closer to q than to its nearest neighbor in S. Our experiments (cf. Section 5) show that this criterion is very efficient in pruning the search path. To summarize the above description, we have the following formal algorithm.
Algorithm 3 RNN-Search(Node n, Point q)
Input: a node n to start the search and a query point q
Output: the reverse nearest neighbors of q
If n is a leaf node, then for each entry (pt, dnn): if d(q, pt) ≤ dnn, output pt as one of the RNNs of q.
If n is an internal node, then for each branch (ptr, Rect, max_dnn): if d(q, Rect) ≤ max_dnn, call RNN-Search(ptr, q).
NN search
As the Rdnn-tree has all the properties of the R*-tree, we can apply a standard nearest neighbor search technique (e.g., [10]) for the NN search. Moreover, the dnn_S(p) information can help us prune extra branches during the branch-and-bound search, due to the following lemma:
LEMMA 4.1 Let q be a query point and p any point from the data set S. If d(p, q) ≤ dnn_S(p)/2, then p is a nearest neighbor of q in S.
The correctness of the lemma is easy to see. B(p, dnn_S(p)) is a circle that contains no other point of the data set S. If d(p, q) ≤ dnn_S(p)/2, then by the triangle inequality d(x, q) ≥ dnn_S(p)/2 for any point x outside B(p, dnn_S(p)). This means that the distance from the query point q to any point of S other than p is at least d(q, p); hence p is a nearest neighbor of q. When we search a leaf node for the nearest neighbor of a query point q, we can therefore stop the search as soon as a point p satisfying d(p, q) ≤ dnn_S(p)/2 is found.
Therefore we have the following improved NN search algorithm.
Algorithm 4 NN-Search(Node n, Point q)
Input: a node n to start the search and a query point q
Output: the nearest neighbor of q
1) Initialize the candidate nearest neighbor c.
2) If n is a leaf node, then for each data point p do: if d(p, q) ≤ dnn_S(p)/2, output p and stop the search; if d(p, q) < d(q, c), replace c by p.
3) If n = (B_1, ..., B_k) is a non-leaf node, where B_i = (ptr_i, Rect_i, max_dnn_i): let d_i = d(q, Rect_i) and sort the branches in increasing order of d_i. For each i in that order, if d_i < d(q, c), call NN-Search(ptr_i, q).
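Under the node layout of Section 4.2, Algorithms 3 and 4 can be sketched as follows. This is a minimal in-memory sketch: the Leaf/Internal classes and the mindist helper stand in for the R*-tree machinery and are illustrative assumptions, not the paper's implementation:

```python
import math

class Leaf:
    def __init__(self, entries):
        self.entries = entries            # list of (pt, dnn)

class Internal:
    def __init__(self, branches):
        self.branches = branches          # list of (child, rect, max_dnn)
                                          # rect = ((lo coords), (hi coords))

def d(p, q):
    return math.dist(p, q)

def mindist(q, rect):
    """Minimum distance from point q to rectangle rect."""
    lo, hi = rect
    return math.sqrt(sum(max(l - c, 0, c - h) ** 2
                         for c, l, h in zip(q, lo, hi)))

def rnn_search(n, q, out):
    """Algorithm 3: report every point pt with d(q, pt) <= dnn(pt)."""
    if isinstance(n, Leaf):
        for pt, dnn in n.entries:
            if d(q, pt) <= dnn:
                out.append(pt)
    else:
        for child, rect, max_dnn in n.branches:
            if mindist(q, rect) <= max_dnn:   # otherwise the branch is pruned
                rnn_search(child, q, out)

def nn_search(n, q, best):
    """Algorithm 4: branch-and-bound NN search with the Lemma 4.1 shortcut.
    best = [point, distance], initialized to [None, inf];
    returns True once the shortcut fires."""
    if isinstance(n, Leaf):
        for pt, dnn in n.entries:
            dq = d(pt, q)
            if dq <= dnn / 2:                 # Lemma 4.1: pt must be the NN
                best[:] = [pt, dq]
                return True
            if dq < best[1]:
                best[:] = [pt, dq]
        return False
    for child, rect, _ in sorted(n.branches, key=lambda b: mindist(q, b[1])):
        if mindist(q, rect) < best[1]:
            if nn_search(child, q, best):
                return True
    return False
```

A two-leaf tree over (0,0), (1,0), (4,0), (5,0), each with dnn = 1, prunes the right leaf entirely for the query (2, 0): its rectangle lies at distance 2 > max_dnn = 1, so no RNN can reside there. In nn_search, the Lemma 4.1 test lets the search stop as soon as a point lies within half its own nearest-neighbor distance of the query.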

Insertion and Deletion
Insertion and deletion are similar to those of the RNN-tree. The main difference is that we have only one tree, and in it we maintain a number, max_dnn, carrying nearest neighbor information, instead of a rectangle. We first look at insertion. When a point p' is to be inserted into an Rdnn-tree containing a data set S, we first perform an NN search and an RNN search to find NN_S(p') and RNN_S(p'), respectively. With NN_S(p') we can compute dnn(p') to create the entry for p'. RNN_S(p') gives us the points that are affected: the dnn fields of those points need to be recomputed, and the max_dnn fields of their ancestor nodes need to be adjusted accordingly. This can be done in a way very similar to the RNN-Search algorithm; the only difference is that we adjust the dnn field whenever we find a new RNN point of p' in a leaf node, and propagate the changes to the parent nodes on the way back up. Since we have one index structure for both NN and RNN search, we can combine the two steps into one: we search for the nearest neighbors of p' at the same time as we search for the affected points (RNN_S(p')) and adjust the max_dnn fields of the corresponding nodes. Our experiments show that this combined NN-RNN search has virtually the same cost as the RNN search alone, saving us one NN search while maintaining the correctness of the index. Calling this step the pre-insertion phase, we have the following formal Pre-insert algorithm.
Algorithm 5 Pre-insert(Node n, Point p')
Input: the root node n of the tree and a point p'
Output: the adjusted tree and RNN_S(p'), together with the candidate nearest neighbor c of p'
1) Initialize the candidate nearest neighbor c.
2) If n is a leaf node, then for each entry (pt, dnn) do: if d(p', pt) < d(p', c), let c = pt; if d(p', pt) ≤ dnn, replace dnn by d(p', pt) and output pt as one of the points of RNN_S(p').
3) If n is a non-leaf node, then for each branch (ptr, Rect, max_dnn) do: if d(p', Rect) ≤ max_dnn or d(p', Rect) < d(p', c), call Pre-insert(ptr, p'); if the subtree under ptr was adjusted, adjust max_dnn of the branch accordingly.
With the above preparation, we can present the insertion algorithm.
Algorithm 6 Insert(Node n, Point p')
Input: the root node n and the point p' to be inserted
Output: the tree with p' inserted
1) Pre-insert(n, p').
2) Call the R*-tree insertion algorithm to insert the entry (p', dnn_S(p')) into n.
Now we turn our attention to deletion. Just as in the RNN-tree, deleting a point from the Rdnn-tree affects the reverse nearest neighbors of the deleted point. In order to maintain the integrity of the Rdnn-tree while deleting a point p'', an NN search needs to be done for each point in RNN(p''). This is an expensive step. Observe, however, that the points in RNN(p'') should be physically close to each other in the data set, because they are all reverse nearest neighbors of the single point p''. Moreover, the number of points in RNN(p'') is upper-bounded (the bound depends on the dimensionality). Hence we can do a batch NN search, finding the nearest neighbors of multiple query points in one pass; let us call it Batch-NN-Search. To delete the point physically, the standard R*-tree deletion algorithm suffices.
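The Pre-insert traversal (Algorithm 5) can be sketched as follows. This is a minimal in-memory sketch under assumed node classes with mutable entries; the physical insertion of the new entry (step 2 of Algorithm 6) is left to the standard R*-tree routine and is not shown:

```python
import math

class Leaf:
    def __init__(self, entries):
        self.entries = entries            # list of [pt, dnn] (mutable)

class Internal:
    def __init__(self, branches):
        self.branches = branches          # list of [child, rect, max_dnn]

def d(p, q):
    return math.dist(p, q)

def mindist(q, rect):
    lo, hi = rect
    return math.sqrt(sum(max(l - c, 0, c - h) ** 2
                         for c, l, h in zip(q, lo, hi)))

def pre_insert(n, p, best, affected):
    """Algorithm 5: one traversal that tracks the NN candidate of p (best),
    shrinks the dnn of every reverse nearest neighbor of p (affected),
    and repairs max_dnn on the way back up. Returns the node's new max_dnn."""
    if isinstance(n, Leaf):
        for e in n.entries:
            pt = e[0]
            dp = d(p, pt)
            if dp < best[1]:
                best[:] = [pt, dp]        # better NN candidate for p
            if dp <= e[1]:
                e[1] = dp                 # p becomes pt's nearest neighbor
                affected.append(pt)
        return max(e[1] for e in n.entries)
    for b in n.branches:
        child, rect, max_dnn = b
        # descend if the branch may hold an RNN of p or a closer NN candidate
        if mindist(p, rect) <= max_dnn or mindist(p, rect) < best[1]:
            b[2] = pre_insert(child, p, best, affected)
    return max(b[2] for b in n.branches)
```

After the call, best holds the nearest neighbor candidate of p and its distance (which form the new entry to insert), affected lists the points whose dnn fields were shrunk, and every max_dnn on the visited paths has been repaired.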
Algorithm 7 Delete(Node n, Point p'')
Input: a tree rooted at n and the point p'' to be deleted
Output: the tree with p'' deleted
1) Call the R*-tree deletion algorithm to delete p'' from n.
2) Call RNN-Search(n, p'') to find RNN_S(p'').
3) Call Batch-NN-Search(n, RNN_S(p'')).
4) Adjust the dnn field of each point in RNN_S(p'') and propagate the changes up to the root.
The Batch-NN-Search procedure is a slight modification of the NN-Search algorithm. Formally, it looks as follows.
Algorithm 8 Batch-NN-Search(Node n, Points q_1, ..., q_k)
Input: a tree rooted at node n and an array of query points q_1, ..., q_k
Output: the nearest neighbors of q_1, ..., q_k
1) Initialize the candidate nearest neighbors c_1, ..., c_k for q_1, ..., q_k.
2) If n is a leaf node, update each c_j if the leaf contains a better candidate for q_j.
3) If n is a non-leaf node, let n = (B_1, ..., B_m), where B_i = (ptr_i, Rect_i, max_dnn_i). Let d_ij = d(q_j, Rect_i) and d_i = max_j d_ij, and sort the branches according to d_i. For each i, if d_ij < d(q_j, c_j) for any j in {1, ..., k}, call Batch-NN-Search(ptr_i, q_1, ..., q_k).
Comparing our insertion and deletion algorithms with those presented in Section 3, we need only a single index, as opposed to the combined NN-tree and RNN-tree approach. Considering the insertion algorithms, inserting one point into the Rdnn-tree and into the RNN-tree are almost equivalent: both have a pre-insertion phase followed by a call to the standard R*-tree insertion algorithm. However, employing one index makes it possible for us to perform a combined NN-RNN search in the pre-insertion phase. Our experiments show
that the combined search saves us one NN search. Better yet, we do not need to insert the point into a second index. Regarding the deletion algorithm, we have the same situation; in addition, we propose to do batch NN searches in the post-deletion phase, which provides further savings.
5 Experimental results
This section presents the results of our experiments. We compare the Rdnn-tree with the RNN-tree method of Korn and Muthukrishnan on reverse nearest neighbor (RNN) queries. We also measure the performance of the Rdnn-tree on nearest neighbor (NN) queries and compare it to standard NN algorithms. Furthermore, we look at two other kinds of queries, combined NN-RNN queries and batch NN queries; these results have a significant impact on performance in the dynamic case. We implemented both structures in C++ and ran our tests on a machine with two Pentium II processors and 512 MB of RAM under SCO UNIX. For the RNN-tree, we use the code provided by Korn and Muthukrishnan. We obtained a large real data set from the US National Mapping Information web site; it contains populated places in the USA, represented by latitude and longitude coordinates. We sample different numbers of items from this data set to create the various data sets to be indexed, and then sample items from the rest of the data set to form the query set. For higher-dimensional experiments we generate random points for both the data and the query sets.
Static performance: RNN search
The first set of experiments compares RNN search performance; Figure 1 shows the results. We measure both the number of leaf nodes and the total number of nodes accessed. The Rdnn-tree provides significantly better performance than the RNN-tree approach. For instance, in the 2-D case the RNN-tree approach takes more than 20 leaf accesses on average, while in our case fewer than 2 leaf accesses are required on average, an improvement of more than 90%.
Significant improvement can also be seen in the total number of disk accesses: the Rdnn-tree is consistently 4 to 5 times better than the RNN-tree in the 2-D case, and even better in the 4-D case. This establishes the effectiveness of the Rdnn-tree.
Dynamic performance: NN queries
One of the main advantages of the Rdnn-tree is the elimination of the second tree in the dynamic case, since the Rdnn-tree itself can perform NN queries effectively. To verify this, we implemented the standard NN search algorithm of Roussopoulos et al. [10] and compared it to the Rdnn-tree approach. Table 1 shows the results for the total number of pages accessed (the results for leaf accesses are similar). The Rdnn-tree performs slightly better than the standard R*-tree, because the nearest neighbor information in the Rdnn-tree increases the pruning power of the algorithm. More importantly, this demonstrates the feasibility of the Rdnn-tree for NN queries, enabling us to eliminate the extra index and significantly cut down the maintenance cost.
Dynamic performance: combined NN-RNN queries
Inserting a data point into the index requires the algorithm to locate both the NNs and the RNNs of the point for update purposes. If one can combine the NN and RNN queries for the point into one pass, there will be significant savings. We therefore ran experiments measuring the costs of NN queries, RNN queries, and combined NN-RNN queries. Figure 2 shows the results; we show only the 4-D results, as the 2-D results are similar. The cost of a combined NN-RNN query is essentially the same as that of an RNN query alone, and much less than the combined cost of a separate NN and RNN query. This shows that we get the NN of the query point nearly for free when we run the RNN query.
Dynamic performance: batch NN queries
Recall that batch NN queries can be used to speed up deletions. We ran experiments to test their effectiveness by measuring the cost of the NN queries involved in the deletions.
In the experiments, we simulate the delete procedure by picking sample points from the data set, finding their RNNs, and doing the NN queries for the RNNs of each point. Observe that for any point p we have |RNN(p)| ≥ 0, where |S| denotes the cardinality of a set S. If |RNN(p)| ≤ 1, batch NN and regular NN queries for RNN(p) are the same; only when |RNN(p)| ≥ 2 is a batch NN query necessary. For each such data point p, we compare the cost of running the NN queries separately for each point in RNN(p) with that of running one batch NN query for all points in RNN(p). Figure 3 shows the results averaged over the sampled points. Batch NN queries significantly reduce the number of disk accesses. Not shown in the figure is that the cost of a batch NN query is comparable to that of a single NN query; this means that if |RNN(p)| = k, the batch NN query for RNN(p) reduces the cost by a factor of about k. Our experiments show that k is usually in the range of 2 to 5. The importance of batch NN queries increases with the dimensionality: in 2-D only 2-3% of the deletions require a batch NN query (i.e., have |RNN| > 1), while in 4-D over 6% of the deletions require one.
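Algorithm 8 (Batch-NN-Search), which these deletion experiments exercise, can be sketched as follows. This is a minimal in-memory sketch under assumed node classes; sorting branches by their distance to the closest query point is just one reasonable reading of the sort step, and a branch is descended only if it can still improve some candidate:

```python
import math

class Leaf:
    def __init__(self, entries):
        self.entries = entries            # list of (pt, dnn)

class Internal:
    def __init__(self, branches):
        self.branches = branches          # list of (child, rect, max_dnn)

def d(p, q):
    return math.dist(p, q)

def mindist(q, rect):
    lo, hi = rect
    return math.sqrt(sum(max(l - c, 0, c - h) ** 2
                         for c, l, h in zip(q, lo, hi)))

def batch_nn_search(n, qs, best):
    """Algorithm 8: one traversal maintaining a candidate NN for every
    query point. best[j] = [point, distance] for query qs[j]."""
    if isinstance(n, Leaf):
        for pt, _ in n.entries:
            for j, q in enumerate(qs):
                if d(pt, q) < best[j][1]:
                    best[j][:] = [pt, d(pt, q)]
        return
    # visit branches closest to some query point first
    order = sorted(n.branches,
                   key=lambda b: min(mindist(q, b[1]) for q in qs))
    for child, rect, _ in order:
        # descend only if the branch can still beat some candidate
        if any(mindist(q, rect) < best[j][1] for j, q in enumerate(qs)):
            batch_nn_search(child, qs, best)
```

One traversal serves all query points at once; a subtree is read at most once even when several of the clustered query points need it, which is where the savings over separate NN searches come from.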

Figure 1. Comparison of performance for (static) RNN queries: leaf and total node accesses of the RNN-tree and the Rdnn-tree, for a 2-D (real) data set and a 4-D (uniform) data set.
Table 1. Comparison of NN query performance (total pages accessed) between the Rdnn-tree and the R*-tree, on 2-D and 4-D data sets of various sizes.
6 Conclusion and future work
In this paper we presented the Rdnn-tree, an R*-tree enhanced by storing nearest neighbor distance information. We demonstrated that this structure is much more efficient in answering RNN queries, eliminates the need for a second index, and provides superior performance in both the static and the dynamic case. Our focus in this paper is the monochromatic reverse nearest neighbor problem. A future direction for us is to adapt the Rdnn-tree to the bichromatic reverse nearest neighbor problem, in which the data are divided into two types: given a query point q of one type, the system is required to find all the points of the second type that have q as their nearest neighbor. It will be interesting to see what effect different constraints (such as a single index for both types, or a separate index for each type) have on the algorithms, and how well the Rdnn-tree adapts to the problem. Finally, the Rdnn-tree is based on the R*-tree; while it works well in lower dimensions, its performance degrades in high dimensions. We plan to explore how to adapt the Rdnn-tree techniques to high-dimensional indexing techniques such as the TV-tree [9] and the X-tree [3].
Acknowledgments
We would like to thank Flip Korn for providing the RNN-tree code, and Flip Korn and Ioana Stanoi for their comments. We would also like to thank Diane Mittelmeier for proofreading the manuscript.
References
[1] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles.
In Proc. of the 1990 ACM SIGMOD International Conference on Management of Data, Atlantic City, NJ, May 1990.
[2] S. Berchtold, B. Ertl, D. A. Keim, H.-P. Kriegel, and T. Seidl. Fast nearest neighbor search in high-dimensional spaces. In Proc. of the 14th IEEE International Conference on Data Engineering, Feb. 1998.
[3] S. Berchtold, D. A. Keim, and H.-P. Kriegel. The X-tree: an index structure for high-dimensional data. In Proc. of the 22nd International Conference on Very Large Data Bases, pages 28-39, Sept. 1996.
[4] V. Gaede and O. Günther. Multidimensional access methods. ACM Computing Surveys, 30(2):170-231, June 1998.
[5] A. Guttman. R-trees: a dynamic index structure for spatial searching. In Proc. of the 1984 ACM SIGMOD International Conference on Management of Data, pages 47-57, Boston, MA, June 1984.

Figure 2. Performance of combined NN-RNN queries (RNN query, combined NN-RNN query, and separate RNN + NN queries) on the 4-D uniform data set: leaf nodes and total nodes accessed.
Figure 3. Comparison of batch NN and non-batch NN queries for 4-D data (uniform data set): leaf nodes and total nodes accessed.
[6] G. R. Hjaltason and H. Samet. Distance browsing in spatial databases. ACM Transactions on Database Systems, 24(2), June 1999.
[7] N. Katayama and S. Satoh. The SR-tree: an index structure for high-dimensional nearest neighbor queries. In Proc. of the 1997 ACM SIGMOD International Conference on Management of Data, June 1997.
[8] F. Korn and S. Muthukrishnan. Influence sets based on reverse nearest neighbor queries. In Proc. of the 2000 ACM SIGMOD International Conference on Management of Data, May 2000.
[9] K.-I. Lin, H. Jagadish, and C. Faloutsos. The TV-tree: an index structure for high-dimensional data. The VLDB Journal, 3, Oct. 1994.
[10] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In Proc. of the 1995 ACM SIGMOD International Conference on Management of Data, pages 71-79, San Jose, CA, May 1995.
[11] J. Sack and J. Urrutia, editors. Handbook of Computational Geometry. North-Holland, 2000.
[12] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1990.
[13] T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-tree: a dynamic index for multi-dimensional objects. In Proc. of the 13th International Conference on Very Large Data Bases, England, Sept. 1987.
[14] I. Stanoi, D. Agrawal, and A. El Abbadi. Reverse nearest neighbor queries for dynamic databases. In Proc. of the 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 44-53, May 2000.
[15] D. A. White and R. Jain. Similarity indexing with the SS-tree. In Proc. of the 12th International Conference on Data Engineering, Feb. 1996.


Organizing Spatial Data Organizing Spatial Data Spatial data records include a sense of location as an attribute. Typically location is represented by coordinate data (in 2D or 3D). 1 If we are to search spatial data using the

More information

1 The range query problem

1 The range query problem CS268: Geometric Algorithms Handout #12 Design and Analysis Original Handout #12 Stanford University Thursday, 19 May 1994 Original Lecture #12: Thursday, May 19, 1994 Topics: Range Searching with Partition

More information