Evaluation of R-trees for Nearest Neighbor Search


Evaluation of R-trees for Nearest Neighbor Search

A Thesis Presented to The Faculty of the Department of Computer Science, University of Houston

In Partial Fulfillment of the Requirements for the Degree Master of Science

By Mei-kang Wu

December, 2006

Evaluation of R-trees for Nearest Neighbor Search

Mei-kang Wu

APPROVED:

Dr. Christoph F. Eick, Advisor
Dr. Shuhab D. Khan, Committee member
Dr. Shishir Shah, Committee member

Dean, College of Natural Sciences and Mathematics

Acknowledgements

First, I would like to express sincere appreciation to Dr. Christoph F. Eick for providing guidance and insights throughout the research. I would also like to thank the committee members, Dr. Shuhab D. Khan and Dr. Shishir Shah, for providing instructive comments. In addition, I would like to express my gratitude to Mr. Christophe Picard, Mr. Shaun Loether and Miss Geji George for giving me valuable suggestions on the thesis content. Last but not least, I would like to thank all my friends and family members for their support and encouragement.

Evaluation of R-trees for Nearest Neighbor Search

An Abstract of a Thesis Presented to The Faculty of the Department of Computer Science, University of Houston

In Partial Fulfillment of the Requirements for the Degree Master of Science

By Mei-kang Wu

December, 2006

Abstract

Nearest neighbor search (k-NN search) is essential in many applications, for instance, spatial data queries, information processing in sensor networks and instance-based classification algorithms. k-NN search can be very expensive for large datasets. R-trees are data-partitioning multidimensional index structures that partition the data into bounding rectangles; neighboring objects are grouped into the same tree node, thus speeding up k-NN search. This thesis implements the quadratic R-tree, the packed R-tree and preprocessing bulk loading algorithms, such as the space filling curve ordering (Hilbert ordering) and the spectral locality preserving mapping (Spectral LPM), as well as different types of approximate R-tree k-NN search algorithms, namely, the ε-approximation and the probabilistic approximation. Furthermore, precomputation methods and parallel computing (OpenMP) techniques are investigated to reduce the time for repetitive k-NN search. Experiments using large spatial datasets are presented to compare and evaluate the above algorithms. Our results show that the performance of the approximate k-NN search methods depends heavily on the R-tree structures and datasets used. R-trees constructed by preprocessing bulk loading algorithms speed up k-NN search significantly, and Hilbert ordering outperforms Spectral LPM for both the quadratic R-tree and the packed R-tree. In addition, the use of OpenMP parallelism decreases the execution time of repetitive R-tree k-NN search.

Contents

Chapter 1  Introduction
Chapter 2  Related Work
    2.1 Introduction of Multidimensional Index Structures
    2.2 Data-partitioning Index Structures (R-tree Variants)
        2.2.1 Guttman's R-trees
        2.2.2 R*-tree
        2.2.3 X-tree
        2.2.4 cR-tree
        2.2.5 Index Structures using Bounding Spheres
    2.3 Space-partitioning Index Structures
    2.4 Object-relational Spatial Indexing Frameworks in ORDBMS
Chapter 3  Algorithms for R-tree Nearest Neighbor Search
    3.1 Quadratic R-tree
    3.2 Packed R-tree
    3.3 Preprocessing Bulk Loading Algorithms for R-tree Construction
        3.3.1 Hilbert Ordering and Spectral LPM
        3.3.2 Examples of Using Preprocessing Bulk Loading Algorithms for Real Datasets
    3.4 R-tree Search Heuristics
        3.4.1 Definition of Metrics MINDIST, MINMAXDIST and MAXDIST
        3.4.2 Nearest Neighbor Search on R-trees
        3.4.3 Approximate Nearest Neighbor Search on R-trees
    3.5 Experiments and Results
        Evaluation Methods
        Relationship between Node Utilization and Nearest Neighbor Search Efficiency
        Nearest Neighbor Search on Quadratic R-tree
        Bulk Loading for Packed R-tree
        Problems of using Euclidean Distance for Spherical Datasets
Chapter 4  Repetitive Nearest Neighbor Search
    4.1 N-fold Cross Validation
    4.2 Parallel Computing Concepts
        OpenMP
    Experiments and Results
        Evaluation Measures
        Brute-force Computation and R-tree Search
        Parallel Nearest Neighbor Search
        Approximate Nearest Neighbor Classification
Chapter 5  Conclusion and Future Work
Appendix I  Proof of the Optimality of Spectral LPM
Appendix II  Datasets and Computation Resources

List of Figures

Figure 1.1 A k-nearest neighbor classification example
Figure 2.1 A simple graphical representation of the R-tree index structure
Figure 2.2 Results of different ways of splitting in R-trees
Figure 2.3 Data structure of an X-tree
Figure 2.4 Splitting an overflow node into two (b) or three (c) groups
Figure 2.5 An overlap-free split in bounding spheres is not possible
Figure 2.6 Data structure of an SR-tree
Figure 2.7 Nodes in the SR-tree are defined by both bounding spheres and bounding rectangles
Figure 2.8 Data structure of a KD-tree
Figure 2.9 Data structure of a hybrid tree
Figure 2.10 Different implementation schemes of index structures in the DBMS system
Figure 2.11 Building the index structure on a relational DBMS schema
Figure 3.1 Illustration of the R-tree data structure
Figure 3.2 The Insertion algorithm of the quadratic R-tree
Figure 3.3 The ChooseLeaf algorithm of the quadratic R-tree
Figure 3.4 The AdjustTree algorithm of the quadratic R-tree
Figure 3.5 The SplitNode algorithm of the quadratic R-tree
Figure 3.6 The packed R-tree algorithm
Figure 3.7 Basic component of the Hilbert curve (first order)
Figure 3.8 Different orders of the Hilbert curve
Figure 3.9 The algorithm for constructing the n-th order Hilbert curve
Figure 3.10 The Locality Preserving Mapping (Spectral LPM) algorithm
Figure 3.11 Multidimensional points can be represented by a graph G(V,E)
Figure 3.12 Adjacency matrix and degree matrix of the graph in Figure 3.11
Figure 3.13 The Laplacian matrix of the graph in Figure 3.11
Figure 3.14 Different orderings obtained from the Spectral LPM algorithm
Figure 3.15 Partitioning the Earthquakes dataset into a grid of size 64 x 64
Figure 3.16 Examples of computing the order for the Wyoming Poverty Distribution dataset using either Spectral LPM (left) or Hilbert ordering (right)
Figure 3.17 Definition of MINDIST
Figure 3.18 Definition of MINMAXDIST
Figure 3.19 An example of MINMAXDIST calculation
Figure 3.20 Definition of MAXDIST
Figure 3.21 The recursive nearest neighbor search algorithm
Figure 3.22 Using a stack to keep the nodes to be visited
Figure 3.23 ε-approximation for nearest neighbor search on R-trees
Figure 3.24 The radius of the circle centered on q (a query point) is the distance of the current solution
Figure 3.25 Approximate nearest neighbor search algorithm using the probabilistic method
Figure 3.26 Time measurement using the clock() function
Figure 3.27 The graph shows the minimum filled requirement
Figure 3.28 The relationship between node utilization and the percentage of nodes being processed during exact nearest neighbor search
Figure 3.29 The relationship between node utilization
Figure 3.30 Comparison of NN search on the quadratic R-tree
Figure 3.31 The average time needed and the average NN agreement rate
Figure 3.32 Comparison of Spectral LPM and Hilbert ordering applied on the packed R-tree
Figure 4.1 N-fold cross validation for nearest neighbor classification
Figure 4.2 The distance matrix stores the pairwise distances of all data points
Figure 4.3 Pseudocode for distance matrix computation
Figure 4.4 Pseudocode for the pre-computed NN-list
Figure 4.5 Data structure of the NN-list (the linked list that stores the first m nearest neighbors)
Figure 4.6 Von Neumann architecture: the central processing unit (CPU) gets instructions and data from memory, and then sequentially processes them
Figure 4.7 SMP systems: multiple processors share the same memory
Figure 4.8 Pseudocode for N-fold cross validation
Figure 4.9 Time measurement using the omp_get_wtime() function in OpenMP
Figure 4.10 Comparison of time complexity for different data sizes
Figure 4.11 Time measurement of parallel computation of nearest neighbor search on R-trees (left) and brute-force computation
Figure 4.12 Overhead measurement of parallel computation of nearest neighbor search on R-trees (left) and brute-force computation
Figure 4.13 The imbalance overhead measure for R-tree search and brute-force computation
Figure 4.14 Approximate nearest neighbor classification results

List of Tables

Table 3.1 Frequently-used notations
Table 4.1 Results of different computation approaches for 10-fold cross validation

Chapter 1 Introduction

The problem of finding k nearest neighbors, also called k-NN search, is to identify the k points from a set of points that are nearest to a given query point according to some distance measure. In real life, we often encounter such problems; for instance, finding the nearest gas station or finding the nearest active volcano with respect to an earthquake center. In order to answer this type of query, data have to be searched and distances have to be computed and compared. Moreover, nearest neighbor classification relies on k-NN search; this type of classification has been proven to be a simple but practical non-parametric method []. It classifies an unknown object y based on the class labels of its k nearest neighbors, using a majority vote or more complicated weighting algorithms. An example of nearest neighbor classification is shown in Figure 1.1. In this example, we wish to find out whether an unknown person y, represented as a question mark in Figure 1.1, has a tendency to maintain a good credit score. We can find the nearest neighbors of this person among a set of pre-classified people, based on a distance function over attributes such as income, loan status or age. One problem that nearest neighbor classification faces is that the value of k cannot be determined in advance; often, N-fold cross validation is used to determine the best value for k. Therefore, the amount of computation is large.
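To make the cost of the naive approach concrete, the following is a minimal C++ sketch of brute-force k-NN classification by majority vote. It is an illustrative sketch under assumed names and types, not the thesis implementation; every query scans all stored records, which is exactly what the index structures discussed later try to avoid.

#include <algorithm>
#include <map>
#include <utility>
#include <vector>

struct LabeledPoint {
    std::vector<double> coords;  // attribute values (e.g., income, age)
    int label;                   // class label (e.g., good/bad credit)
};

// Squared Euclidean distance; the square root is not needed for ranking neighbors.
double sqDist(const std::vector<double>& a, const std::vector<double>& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// Brute-force k-NN classification by majority vote: O(n) distance
// computations per query (k <= data.size() is assumed).
int classify(const std::vector<LabeledPoint>& data,
             const std::vector<double>& query, std::size_t k) {
    std::vector<std::pair<double, int>> dist;  // (distance, label)
    dist.reserve(data.size());
    for (const LabeledPoint& p : data)
        dist.push_back({sqDist(p.coords, query), p.label});
    std::partial_sort(dist.begin(), dist.begin() + k, dist.end());
    std::map<int, int> votes;
    for (std::size_t i = 0; i < k; ++i) ++votes[dist[i].second];
    int best = -1, bestCount = -1;
    for (const auto& v : votes)
        if (v.second > bestCount) { best = v.first; bestCount = v.second; }
    return best;
}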

Figure 1.1 A k-nearest neighbor classification example shows how we can classify unknown data using pre-labeled data.

Nowadays, data collection is a very common task; for example, financial institutions collect personal information about their customers in order to classify new credit card applicants. If there are 50,000 known records and we want to evaluate the classification models for ten different values of k using 10-fold cross validation, more than 2 billion distance computations will be necessary in total. Therefore, many indexing techniques have been proposed in previous studies to speed up k-NN search. Among them, R-trees have gained popularity for low-dimensional datasets. R-trees are data-partitioning multidimensional index structures that have been implemented in several popular database management systems (DBMS), e.g., Oracle, IBM DB2, MySQL and PostgreSQL, and are widely used in various applications.
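As a rough check of this estimate (using the record count and fold count reconstructed above, which are assumptions of this note rather than figures I can verify against the original text): each fold compares 5,000 held-out points against the remaining 45,000 training points, so

    10 folds x 5,000 test points x 45,000 training points = 2.25 x 10^9

distance computations, i.e., more than 2 billion, even before the ten candidate values of k are taken into account.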

Moreover, many different variants of R-trees and search algorithms associated with them have been proposed in the literature. This thesis implements and investigates the performance and the usefulness of R-trees. Different sort-based bulk loading algorithms are applied in both the quadratic R-tree and the packed R-tree. Furthermore, repetitive k-NN search between known objects, such as in N-fold cross validation, is also discussed. Precomputation methods and parallel computing are incorporated in order to speed up repetitive k-NN search, and the performance and the scalability are analyzed.

Chapter 2 reviews commonly used index structures for k-NN search. These index structures are divided into two categories, namely, data-partitioning index structures and space-partitioning index structures. Chapter 3 describes algorithms for implementing R-tree index structures, including the quadratic R-tree [2] and the packed R-tree [3, 4]. The nearest neighbor search traversal methods [5, 6, 7] are discussed, and different approximate k-NN search algorithms [8, 9] are also explained in detail. The spectral locality preserving mapping (Spectral LPM) [] is applied prior to the creation of R-trees, and its performance is compared with another popular bulk loading algorithm that uses Hilbert ordering [3]. Chapter 4 centers on repetitive k-NN search; N-fold cross validation of nearest neighbor classification is a typical example. The uses of R-trees, OpenMP technology, and precomputation techniques are evaluated. Finally, Chapter 5 gives the conclusion and identifies areas of future research.

Chapter 2 Related Work

2.1 Introduction of Multidimensional Index Structures

In order to efficiently handle queries involving multidimensional datasets or spatial data objects comprised of more complex data types (e.g., points, lines or polygons), a specific index structure which takes advantage of locality features can be used to improve performance. Traditional index structures, such as hash tables or B+ trees, are not suitable for spatial queries. Hash tables are based on exact matching and do not support range queries, whereas B+ trees rely on single-attribute ordering, which is not suitable for nearest neighbor search.

Multidimensional index structures can be roughly categorized into two different types; one is the data-partitioning type and the other is the space-partitioning type. The former uses bounding rectangles or bounding spheres as the index type, that is, the index stores the information of bounded intervals (maximum and minimum) in all dimensions. R-tree variants are the most well-known representatives of this type. The latter decomposes the space recursively into disjoint partitions, whose nodes split in only one dimension. The KD-tree and Quadtrees belong to this type of index structure.

Guttman's original R-tree [2] allows overlapping of tree nodes. When adding a new entry to an overflow node, node splitting criteria must be carefully considered in order to reduce the overlapping effect. In his paper, Guttman proposed three algorithms

to minimize the total area covered by two split nodes: the exhaustive, quadratic and linear-cost algorithms.

Subsequent modified versions of the R-tree try to improve the performance and the node utilization. The R*-tree [] not only reduces the area covered by two split nodes but also considers the area of overlap and the margin of nodes. The forced reinsertion of p entries when adding a new object (where p can be a tuning parameter) moves entries between neighboring nodes, thus dynamically decreasing the overlap. The X-tree [2] utilizes the concept of supernodes; instead of splitting, directory nodes are extended over the usual block size. Supernodes are created during insertion when there is no possibility to avoid the overlap, and they can be read sequentially in order to reduce the number of extra block accesses. The cR-tree [3] uses clustering techniques upon splitting and introduces a multi-way splitting method. The Hilbert R-tree uses a space filling curve (the Hilbert curve) to impose a linear ordering on the data entries. Many more R-tree variants intend to solve the curse-of-dimensionality problem, as R-tree performance usually drops significantly with increasing dimensionality.

Other index structures employ space-partitioning methods that use a single dimension and a single position in that dimension to split nodes. When this approach is used, the sub-trees are completely disjoint and do not overlap. Some well-known structures include the KD-tree [4], Quadtrees [5] and the hybrid tree [6]. A KD-tree is a balanced binary tree in which each node is a decision point on a specific dimension and splits that dimension into two regions. Quadtrees recursively divide the space into up to four regions and are most commonly used in 2-dimensional image representation.

Space-partitioning index structures guarantee that the fan-out is independent of dimensionality, but their mutually disjoint splits may cause cascading splits, resulting in low utilization of nodes, or even creating empty nodes. The hybrid tree allows overlap when the utilization constraint is violated and has been proposed for higher-dimensional datasets.

As described above, spatial data and their index structures are quite complex, so it is difficult for database management systems (DBMS) to satisfy every need in different domains. Thus, an extensible index framework is a good choice for users to implement their special needs. An example is the Generalized Search Tree (GiST) [7], which serves as a high-level framework that provides full ACID support. (The ACID model is a fundamental concept in database management systems; the four letters stand for atomicity, consistency, isolation and durability.) Another approach is to map the spatial index structure to a relational schema organized by the built-in access methods (e.g., a common SQL interface). It is easily supported by any relational database system, but the performance is still a challenging issue. As an example, Oracle developers built relational R-trees physically using tables inside the database. The search involves recursive SQL for traversing the tree from root to leaves. It is more efficient for answering search queries, but uncommitted node splits may lock entire sub-trees against concurrent updates [8]. Consequently, it is only suitable for read-only or single-user environments. These facts reveal that different approaches have different

trade-offs; it is important to consider our needs when choosing an implementation schema.

2.2 Data-partitioning Index Structures (R-tree Variants)

Spatial data objects such as country boundaries, urban utilization regions or earthquake zones cannot all be represented as simple point objects. To retrieve these complex data efficiently, index structures based on spatial locality are desirable. Data-partitioning index structures define bounding rectangles (or bounding spheres, bounding diamonds) as nodes in a tree-like structure, thus facilitating range search and nearest neighbor search. The R-tree family is the most classical representative of this type of index. We briefly summarize some previous work in the following sections.

2.2.1 Guttman's R-trees

Guttman's R-trees form the fundamental backbone of the subsequent R-tree algorithms [2]. In general, objects in R-trees are represented by tuples, and each tuple has a unique identifier for retrieval. Leaf nodes contain index record entries of the form (I, tuple-identifier), where I is a d-dimensional rectangle, the so-called MBR (Minimum Bounding Rectangle). Note that I = (I0, I1, ..., In-1), where Ii is the closed bounded

interval [a, b] describing the extent of the object along dimension i. Non-leaf nodes contain entries of the form (I, child-pointer), where child-pointer contains the address of a lower-level node in the R-tree and I covers all rectangles in the lower node's entries. A graphical representation is shown in Figure 2.1 [9].

Figure 2.1 A simple graphical representation of the R-tree index structure.

As Figure 2.1 depicts, the non-leaf node R2 contains an MBR and two child-pointers that point to R6 and R7. The leaf entry H (H is the object id) stores an MBR that has degenerated from a rectangle to a point, since this is an example of a 2-dimensional point dataset.

Let M be the maximum number of entries that will fit in one node and let m <= M/2 be a parameter specifying the minimum required number of entries in a node. Under this assumption, Guttman states that R-trees must satisfy the following properties:

1. Every leaf node contains between m and M index records unless it is the root, and every non-leaf node has between m and M children unless it is the root.
2. For each leaf node, I is the MBR that spatially contains the d-dimensional data objects. For each non-leaf node, I is the MBR that spatially contains the rectangles in the child node.
3. The root node has at least two children unless it is a leaf.
4. All leaves appear on the same level.

As a consequence of the above properties, the hierarchical directory structure of R-trees allows the MBRs to overlap. By tuning the parameter m, the height of the tree is kept at most log_m(N) - 1, where N is the total number of index records. Guttman considers that node splitting should avoid the situation where both new nodes need to be examined in the search process. Thus, the total area of the two covering rectangles after a split must be minimized. Figure 2.2 depicts an example of different split methods. The left one is marked as a bad split because the larger covering area of the nodes will increase the search space.

Figure 2.2 Results of different ways of splitting in R-trees.

Three algorithms are proposed by Guttman for splitting nodes in R-trees:

1. Exhaustive algorithm: A straightforward method that generates all possible groupings of the MBRs and chooses the best one. However, it takes exponential time and is therefore not a feasible approach when datasets are large.
2. Quadratic-cost algorithm: This method picks two of the M+1 entries to be the first elements (seeds) of the two new nodes by choosing the pair that would waste the most area if both were put in the same node; the remaining entries are assigned one at a time, greedily, according to the minimum area expansion policy.
3. Linear-cost algorithm: Different from the quadratic-cost algorithm, it selects the pair with the greatest normalized separation along any dimension as seeds.

2.2.2 R*-tree

N. Beckmann et al. proposed the R*-tree [], which is capable of dynamically reorganizing its structure. In addition to minimizing the area after node splits, the R*-tree considers the following parameters to be equally important:

1. The area covered by an MBR should be minimized.
2. The overlap between MBRs should be minimized.
3. The margin (perimeter) of an MBR should be minimized.
4. Storage utilization should be optimized (the height of the tree should be kept low).

Upon insertion, the R*-tree adopts forced reinsertion to reorganize the existing structure. When an overflow happens, the M+1 entries of the overflow node N are sorted according to the distance between the center of their MBR and the center of N. The p entries (where p is a tunable parameter) with the greatest distance are removed from N and reinserted into the tree following the normal R-tree insertion algorithm. Forced reinsertion reorganizes the tree, thus preventing future splitting and enhancing node utilization. However, the creation cost is higher due to the reinsertion process.

2.2.3 X-tree

Berchtold's X-tree provides a solution for high-dimensional data. Since the overlapping problem deteriorates as dimensionality increases, the X-tree extends the node to a larger block size instead of using node splitting, which would result in a high degree of overlap. The resulting X-tree has a special structure named the supernode, as depicted in Figure 2.3.

Figure 2.3 Data structure of an X-tree.

The X-tree consists of three different types of nodes: data nodes, normal directory nodes and supernodes. Supernodes are large nodes of a multiple of the normal block size; they are created when the overlap cannot be avoided. Linear scanning of a supernode is more efficient than several accesses to overlapping nodes. The height of the tree also decreases with increasing dimensionality, as the number and size of supernodes increase.

Once supernodes are created, they are kept in main memory if feasible; otherwise, the replacement policy depends on a priority function considering the level, the type (normal node or supernode) and the size of the nodes. Although the X-tree experimentally outperforms some R-tree variants for point queries and nearest-neighbor queries, it incurs the overhead of disk (memory) management operations for creating and maintaining the variable-sized nodes.

2.2.4 cR-tree

The cR-tree adopts clustering algorithms in its splitting procedure [3]. Clustering maximizes the similarity of spatial objects so that objects in the same cluster have a higher probability of simultaneous access. The original R-tree splitting problem is turned into a typical Cluster(N,k) problem that tries to find optimal k clusters of N objects, with k = 2 and N = M+1 (M is the maximum number of entries in a node). However, k is not restricted to two; a multi-way splitting is also introduced in the cR-tree, as illustrated in Figure 2.4. Figure 2.4(a) represents the overflow node and Figure 2.4(b) shows the result after splitting (a) into two nodes. However, the resulting node on the top has much dead space. By splitting into three nodes instead of two, as shown in Figure 2.4(c), the dead space is eliminated and the nodes are more compact.

Figure 2.4 Splitting an overflow node into two (b) or three (c) groups.

2.2.5 Index Structures using Bounding Spheres

The SS-tree [2] is designed for multidimensional point data; it employs bounding spheres instead of bounding rectangles. The center of a sphere is the centroid of all entries. When inserting a new entry, the sub-tree whose centroid is closest to the new entry is chosen. When a node split is necessary, the coordinate variance of each dimension from the centroid of its children is calculated, and the dimension with the highest variance is chosen for splitting the node, regardless of the volume or the enlargement of the overlap. Since a sphere is determined by its center and radius, it requires less storage than a bounding rectangle with a lower bound and an upper bound in every dimension. Spheres also dominate their volume-equivalent MBRs because the Minkowski sum is smaller. However, bounding spheres are not amenable to an easy, overlap-free split, as depicted in Figure 2.5 [2].

Figure 2.5 An overlap-free split in bounding spheres is not possible.

The SR-tree [22] (Sphere/Rectangle tree) integrates both bounding shapes. The region of an SR-tree node is specified by the intersection of a bounding sphere and a bounding rectangle. The rectangles allow the neighborhoods to be partitioned into smaller regions than in the SS-tree and decrease overlapping. The leaf structure of the SR-tree is like that of the SS-tree: each leaf node contains data objects, and a non-leaf node entry has the structure (S, R, w, child-pointer), where S is the bounding sphere, R is the bounding rectangle, and w is the total number of entries contained in the sub-tree whose top is the child pointed to by child-pointer. The SR-tree structure is shown in Figure 2.6:

Figure 2.6 Data structure of an SR-tree.

Regions in the SR-tree are specified by the intersection of a bounding sphere and a bounding rectangle, as shown in Figure 2.7:

Figure 2.7 Nodes in the SR-tree are defined by both bounding spheres and bounding rectangles.

When inserting, the SR-tree needs to update both the bounding spheres and the rectangles. The way to update the rectangle is the same as in R-trees [22], but to determine the bounding sphere of a parent node, it must utilize both the bounding spheres and the bounding rectangles of its children. Thus, it is relatively complicated and expensive to create and update. Besides, since the SR-tree contains both spheres and rectangles, its node size is considerably larger, which leads to a fan-out reduction problem: more nodes may need to be read to answer queries, which obviously affects performance.

2.3 Space-partitioning Index Structures

Space-partitioning index structures recursively decompose the space into disjoint partitions; they are used extensively in feature-based similarity search. The KD-tree is one example, which can be viewed as a binary search tree that stores objects in d-dimensional space. Figure 2.8 [23] illustrates a simple 2-dimensional point dataset and its KD-tree structure. If a node N has an n-discriminator value, then all nodes having their n-coordinate value less than that of N are located under N's left child, while all nodes with an n-coordinate value greater than or equal to that of N are located under N's right child. In Figure 2.8, even levels of the tree are split by the x-value (the root is considered level 0) and odd levels of the tree are split by the y-value.

Figure 2.8 Data structure of a KD-tree.

The drawback of the KD-tree is that its structure depends on the order of insertion; deletion will also cause reorganization of the tree. Worst of all, since the division hyperplanes are defined by the positions of the points, the KD-tree may be highly

imbalanced [23, 24]. An adaptive solution is to divide the points into two subgroups with equal numbers of points or to employ optimal discretization algorithms.

The hybrid tree is another index structure that combines the data-partitioning and space-partitioning techniques. It splits a node using only one dimension to guarantee that the fan-out is independent of dimensionality, but relaxes the constraint that indexed subspaces must be mutually disjoint. Figure 2.9 shows a hybrid tree structure.

Figure 2.9 Data structure of a hybrid tree.

The hybrid tree is very similar to the KD-tree, except that each internal node has a second split position field. The first split position represents the right boundary of the left partition (denoted by lsp) and the second split position represents the left boundary of the right partition (denoted by rsp). If lsp is equal to rsp, this indicates a non-overlapping partition, and an overlapping partition occurs if lsp is greater than rsp. Also, the hybrid tree maps the KD-based representation to an array of bounding rectangles. This mapping is defined recursively. The root of the hybrid tree is the entire data space ((0,0),(6,6)). The bounding rectangle of its left child is defined as R_root intersected with (dim_value <= lsp), which is ((0,0),(6,6)) ∩ (x <= 3) = ((0,0),(3,6)). Similarly, the right child is defined as R_root intersected with (dim_value >= rsp), which is ((0,0),(6,6)) ∩ (x >= 3) = ((3,0),(6,6)). The children of an internal node with lsp greater than rsp will have overlapping bounding rectangles, as the grey area shown on the top.

The hybrid tree takes advantage of both data-partitioning (DP) and space-partitioning (SP) methods so that:

1. The fan-out is independent of dimensionality.
2. The SP structure speeds up intra-node search.
3. The DP structure guarantees storage utilization and avoids cascading splits.

Unfortunately, the hybrid tree does not improve search performance significantly according to previous research and studies [6].

2.4 Object-relational Spatial Indexing Frameworks in ORDBMS

There are several ways to implement index structures in a DBMS. In general, we can categorize them into three different approaches, as illustrated in Figure 2.10 [8].

Figure 2.10 Different implementation schemes of index structures in the DBMS system.

The integrating approach, as shown in Figure 2.10 (b), tries to hard-wire the spatial access method into the existing database system. Thus, full support of ACID properties must be implemented, such as concurrency control or recovery services.

When developing a new indexing method, one has to deal with the storage, buffer and log managers. Examples are the standard hash or B+ tree indexes of the database kernel, the R-LinkTree in Informix IDS/UDO for spatially extended objects and the UB-tree in TransBase/HC for multidimensional point databases. However, integrating the new index structures into the database system may require more effort than building the index itself.

The generic approach, shown in Figure 2.10 (c), uses a high-level framework in which developers can simply overload predefined functions. GiST (Generalized Search Tree), proposed by Hellerstein, Naughton and Pfeffer, is a famous example. SP-GiST is another extensible index framework that especially focuses on space-partitioning unbalanced trees [24]. It defines common functionalities such as insertion, deletion, updating, concurrency control, recovery and I/O access optimization. However, these frameworks are not available in most database systems; they are only implemented in the open source system PostgreSQL. Another interesting concept is to connect file-based generalized search trees with the indexing interface of a commercial ORDBMS. OraGist [25] is an example which operates on the LibGist library and Oracle 9i, but it only has prototype implementations that are restricted to rather simple objects.

The relational approach, shown in Figure 2.10 (d), tries to map the new index structure to a relational schema; that is, the index is built on top of the relational query language. Thus, the implementation and maintenance require less effort, and if we use a standardized DDL and DML such as SQL:1999, the resulting code can be platform-independent.

One example is the Oracle Spatial R-tree, implemented by a recursive version of SQL (SQL:1999). The relational mapping is illustrated in Figure 2.11 [8].

Figure 2.11 Building the index structure on a relational DBMS schema.

However, if a transaction inserts a polygon which induces an enlargement of bounding boxes in the root node, the entire sub-tree will be locked until commit or rollback. This causes overhead when subsequent operations try to insert polygons that cause enlargement in the root region.

Chapter 3 Algorithms for R-Tree Nearest Neighbor Search

In this chapter, different algorithms that use R-tree index structures for speeding up k-NN search are introduced and compared. Since our main interest is in the field of data mining applications, we focus on index structures that are suitable for large and static datasets. Section 3.1 presents the algorithm for creating the quadratic R-tree. We choose to implement Guttman's quadratic R-tree because of its popularity, as it is also used in the PostgreSQL and MySQL DBMS systems. Section 3.2 introduces the packed R-tree, which can be created quickly from scratch and has more compact nodes. It is observed that the performance of k-NN search depends heavily on the insertion order when the index is created. Thus, we apply sort-based preprocessing ordering techniques prior to index construction and evaluate the improvements compared to random insertion. Section 3.3 describes two ordering algorithms, namely, Hilbert ordering and Spectral LPM. Section 3.4 investigates approximate k-NN search; different nearest neighbor search pruning heuristics are introduced and compared. Finally, Section 3.5 presents and analyzes the experimental results regarding the above algorithms. Table 3.1 lists some notations and definitions that will be referred to in later sections.

Notation: Definition

MBR: Abbreviation of Minimum Bounding Rectangle.
R = (s,t): A particular n-dimensional bounding rectangle defined by two endpoints s = (s1, s2, ..., sn) and t = (t1, t2, ..., tn), where si <= ti for 1 <= i <= n.
p = (p1, p2, ..., pn): An n-dimensional point p; pi (1 <= i <= n) is the coordinate in the i-th dimension.
Inside(p, R): A boolean predicate that returns true if the n-dimensional point p lies inside the n-dimensional MBR R = (s,t), i.e., si <= pi <= ti for 1 <= i <= n; otherwise it returns false.
Current_best NN: The best nearest neighbors that have been found so far (during the search process).
Current_best distance: The greatest distance among the k Current_best NNs.
I: The index entry of a point or a rectangle.
M: The maximum number of entries allowed in a node of an R-tree (page size).
m: The minimum number of entries required in a node (minimum filled requirement).

Table 3.1 Frequently-used notations.

We implemented the quadratic R-tree and the packed R-tree according to [, 3] in C++. The characteristics of the R-tree variants were explained in Chapter 2. Our implementation of the node structure is shown in Figure 3.1. The tree node is an object-oriented class which contains the following class members: the upperbound, the lowerbound, the number of entries, and a union of child pointers (for intermediate nodes) or data identifiers (for leaf nodes). The details of the algorithms for quadratic R-tree construction are described in Figure 3.2 through Figure 3.5.
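The following is a minimal C++ sketch of one plausible reading of this node layout (the node's own MBR bounds, an entry count, and a union of child pointers or data identifiers). The constants, the isLeaf flag and all names are illustrative assumptions, not the thesis code.

#include <cstddef>

const int DIM = 2;   // dimensionality of the data (illustrative)
const int M   = 50;  // maximum number of entries per node, i.e. the page size (illustrative)

struct RTreeNode {
    double upperbound[DIM];     // upper corner of this node's MBR
    double lowerbound[DIM];     // lower corner of this node's MBR
    int    numEntries;          // number of entries currently stored in the node
    bool   isLeaf;              // assumption: distinguishes leaf from intermediate nodes
    // For intermediate nodes each entry is a pointer to a child node; for
    // leaf nodes each entry is the identifier of a data object.
    union {
        RTreeNode* child[M];    // valid when isLeaf == false
        long       dataId[M];   // valid when isLeaf == true
    } entries;
};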

Figure 3.1 Illustration of the R-tree data structure.

3.1 Quadratic R-tree

The quadratic R-tree is constructed by inserting entries in a top-down approach starting from the root node. Nodes are split as needed to form new nodes. When an entry is inserted, the ChooseLeaf algorithm is invoked to select the best node such that the enlargement of the MBR caused by the new entry is minimized. If the selected node is full, the SplitNode algorithm is invoked to create a new node to accommodate the new entry along with all other existing entries. The goal of the SplitNode algorithm is to properly redistribute the entries and reduce overlapping between MBRs after the split.

After the node splitting, the AdjustTree algorithm is invoked to propagate the split up if necessary. The Insertion algorithm is described in Figure 3.2.

Algorithm R-tree_Insertion(I): Insert a new index entry I into the R-tree
    /* Invoke algorithm ChooseLeaf() to select a leaf node N to place the new entry I */
    N = ChooseLeaf(I, root node)
    If (N has space for the new entry I)
        Add I to N
        Invoke algorithm AdjustTree(N, N)
    Else
        Invoke algorithm SplitNode(N) to obtain new nodes N and NN, where N and NN contain all entries of N and I
        Invoke algorithm AdjustTree(N, NN)
    End If
    Create a new root whose children are the two resulting nodes if necessary

Figure 3.2 The Insertion algorithm of the quadratic R-tree.

The ChooseLeaf algorithm is shown in Figure 3.3. It tries to find a candidate leaf node for inserting the new entry; this candidate node needs the least enlargement to accommodate the new entry.

Algorithm ChooseLeaf(I, current node): Recursively select a leaf node to place a new index entry I
    Least_enlargement = infinity
    Candidate = Φ
    Num_I = number of existing index entries in current node
    If (current node is a leaf node)
        Return current node
    Else
        For i = 1 : Num_I
            If (child-pointer[i]'s enlargement needed to include I < Least_enlargement)
                Least_enlargement = child-pointer[i]'s enlargement needed to include I
                Candidate = child-pointer[i]
            End If
        End For
        Return ChooseLeaf(I, Candidate)
    End If

Figure 3.3 The ChooseLeaf algorithm of the quadratic R-tree.

The AdjustTree algorithm (Figure 3.4) updates MBRs during the insertion. If the node is not full, it simply adjusts the bounds and propagates the change up; otherwise, the SplitNode algorithm is called.

Algorithm AdjustTree(A, A2): Adjust the bounding rectangles and propagate node splits as necessary
    /* A != A2 means there was a previous split and A2 is the newly created node */
    P = A's parent node
    If (A != A2)
        If (P has space)
            Add a new entry for A2 in P
            Adjust the upperbound and lowerbound of P
            AdjustTree(P, P)
        Else
            Invoke algorithm SplitNode to produce P and P2, which contain A2 and all entries of P
            AdjustTree(P, P2)
        End If
    Else
        /* If no split occurred, simply update the bounds up to the root */
        While (A is not at the root level)
            Adjust the upperbound and lowerbound of P
            AdjustTree(P, P)
        End While
    End If

Figure 3.4 The AdjustTree algorithm of the quadratic R-tree.

The SplitNode algorithm (Figure 3.5) for the quadratic R-tree first selects two seeds that would waste the most space if grouped together. The rest of the entries are assigned according to their preference level. Assuming two seeds S1 and S2 are selected, the preference level of an entry I is defined as the difference between the area needed to include S1 and I and the area needed to include S2 and I.

Algorithm SplitNode(): Divide M+1 entries into two nodes, where M is the maximum number of entries that fit into a node (page size)
    Pick two seeds S1 and S2 from the M+1 entries that would waste the most space if grouped together
    Node A = { S1 }; Node B = { S2 }
    Num_unassigned = M - 1
    While (not all entries have been assigned)
        /* Assign the other entries one at a time to S1 or S2 according to the order of preference level */
        MaxPL = 0
        Candidate = Φ
        For i = 1 : Num_unassigned
            /* The entry with the greatest preference for S1 or S2 is the one with the greatest
               difference in area enlargement requirements when grouped with S1 or S2 */
            Preference level (PL_i) = | S1's enlargement for I_i - S2's enlargement for I_i |
            If (MaxPL < PL_i)
                MaxPL = PL_i
                Candidate = I_i
            End If
        End For
        Assign Candidate to its preferred node
        Num_unassigned = Num_unassigned - 1
        /* Each node must have at least m entries; m is the minimum filled requirement */
        If (one group (e.g. node A) already has M-m+1 entries)
            Assign the rest of the entries to the other node (e.g. node B)
        End If
    End While
    Return node A, node B

Figure 3.5 The SplitNode algorithm of the quadratic R-tree.

3.2 Packed R-tree

If the dataset that needs to be indexed is already known and does not change often, using the packed R-tree may be a better choice, since it uses bulk loading techniques that lead to more compact trees. The packed R-tree is a special type of R-tree that is tightly packed so that the node utilization reaches almost 100%. The basic assumption is that, when the index is created for the first time, the entries must be organized in some optimal order [4] to guarantee a compact tree. The packed R-tree algorithm itself is straightforward [3]; the tree is created in a bottom-up fashion starting from the leaf nodes. However, we need to apply a preprocessing step that orders the entries to be inserted prior to tree construction. For example, we can sort the entries using a single attribute, using a space filling curve ordering, or using the spectral locality preserving mapping (Spectral LPM) []. The packed R-tree algorithm is described in Figure 3.6. The randomly ordered data entries are sorted based on some preprocessing algorithm and the tree is created by inserting them in a bottom-up approach.

Algorithm Packed R-tree
    /* We refer to this preprocessing sorting step as Order(MBR); MBR(i) is the ordered MBR in the
       i-th position, where i = 1, 2, ..., n and n is the total number of MBRs. M is the maximum
       number of entries allowed in each node. */
    Order(MBR)
    /* Create leaf nodes */
    j = 0    // number of nodes at the current level
    h = 1    // current height (level)
    i = 1
    While (there are more MBRs to insert)
        Generate a new leaf node N_h
        For (1 : M)
            Insert MBR(i) into N_h
            i = i + 1
        End For
        j = j + 1
    End While
    /* Create intermediate nodes at level h+1 */
    While (j > 1)
        j = 0
        While (there are more nodes at level h to insert)
            Generate a new intermediate node N_{h+1}
            For (1 : M)
                Insert the address of the next node at level h into N_{h+1} as a child pointer
            End For
            j = j + 1
        End While
        h = h + 1
    End While

Figure 3.6 The packed R-tree algorithm.
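A minimal C++ sketch of the bottom-up packing step is shown below, assuming the leaf nodes have already been built from the pre-sorted entries (e.g. sorted by Hilbert ordering or Spectral LPM). The Node layout and all names are illustrative assumptions, not the thesis implementation.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Rect { double lo[2], hi[2]; };

struct Node {
    Rect mbr;                      // MBR covering all entries of this node
    std::vector<Node*> children;   // empty for leaf nodes
    std::vector<int> dataIds;      // data identifiers, used only by leaf nodes
};

// Smallest rectangle enclosing a non-empty set of rectangles.
Rect enclose(const std::vector<Rect>& rs) {
    Rect r = rs.front();
    for (const Rect& x : rs)
        for (int d = 0; d < 2; ++d) {
            r.lo[d] = std::min(r.lo[d], x.lo[d]);
            r.hi[d] = std::max(r.hi[d], x.hi[d]);
        }
    return r;
}

// Group M consecutive (pre-sorted) nodes per parent, level by level, until a
// single root remains; node utilization is therefore close to 100%.
// 'level' must initially hold the leaf nodes in their preprocessing order.
Node* packLevelByLevel(std::vector<Node*> level, std::size_t M) {
    while (level.size() > 1) {
        std::vector<Node*> next;
        for (std::size_t i = 0; i < level.size(); i += M) {
            Node* parent = new Node();
            std::vector<Rect> covered;
            for (std::size_t j = i; j < std::min(i + M, level.size()); ++j) {
                parent->children.push_back(level[j]);
                covered.push_back(level[j]->mbr);
            }
            parent->mbr = enclose(covered);
            next.push_back(parent);
        }
        level = next;
    }
    return level.front();   // the root of the packed R-tree
}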

3.3 Preprocessing Bulk Loading Algorithms for R-tree Construction

In this section we introduce two different algorithms that give a linear order to multidimensional data. These ordering techniques are categorized as sort-based bulk loading algorithms [26] when applied to R-tree construction. The Hilbert curve ordering (Hilbert ordering) has been used in the packed R-tree [3]. It preserves a locally optimal mapping but suffers from a boundary effect if two entries are not in the same fragment. Thus, we propose using a well-established globally optimized ordering technique (Spectral LPM) and compare the usefulness of the two techniques for R-tree construction.

3.3.1 Hilbert Ordering and Spectral LPM

Hilbert ordering is obtained from a space filling curve that fills a square. The basic pattern is a curve which starts near the bottom left corner of a square and terminates near the bottom right corner, as shown in Figure 3.7.

Figure 3.7 Basic component of the Hilbert curve (first order).

Figure 3.8 shows different orders (2 to 5) of the Hilbert curve; a higher-order Hilbert curve can be constructed recursively from the previous order and fills up the entire square in a more refined fashion. The space is thus divided into many fragments. Hilbert ordering visits all points in one fragment in an optimal order that preserves locality. However, the locality relationship is not preserved if two points are in different fragments.

Figure 3.8 Different orders of the Hilbert curve.

The algorithm for constructing the Hilbert curve in a 1 x 1 area is described in Figure 3.9; x and y represent the vectors of coordinates generated from the previous iteration.

Algorithm Hilbert(n)
    If n <= 0
        x = 0
        y = 0
    Else
        [x0, y0] = Hilbert(n-1)
        x = 1/2 * (-1/2 + y0,  -1/2 + x0,  1/2 + x0,  1/2 - y0)
        y = 1/2 * (-1/2 + x0,   1/2 + y0,  1/2 + y0, -1/2 - x0)
    End If

Figure 3.9 The algorithm for constructing the n-th order Hilbert curve in a 1 x 1 area.
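For reference, one common way to compute the Hilbert position of a single grid cell is the iterative bit-manipulation routine below. It is an illustrative C++ sketch and not necessarily the construction used in the thesis, which builds the whole curve recursively as in Figure 3.9; sorting cells by the returned index gives the same Hilbert ordering.

#include <cstdint>

// Rotate/flip a quadrant appropriately (helper for the Hilbert index).
static void rot(uint32_t n, uint32_t& x, uint32_t& y, uint32_t rx, uint32_t ry) {
    if (ry == 0) {
        if (rx == 1) {           // reflect
            x = n - 1 - x;
            y = n - 1 - y;
        }
        uint32_t t = x;          // swap x and y
        x = y;
        y = t;
    }
}

// Map a cell (x, y) of an n x n grid (n a power of two) to its position d
// along the Hilbert curve.
uint32_t hilbertIndex(uint32_t n, uint32_t x, uint32_t y) {
    uint32_t d = 0;
    for (uint32_t s = n / 2; s > 0; s /= 2) {
        uint32_t rx = (x & s) > 0;
        uint32_t ry = (y & s) > 0;
        d += s * s * ((3 * rx) ^ ry);
        rot(n, x, y, rx, ry);
    }
    return d;
}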

The second ordering technique presented is the spectral locality preserving mapping (Spectral LPM) []. Although it has not been used for R-trees in previous research, it gives a globally optimal ordering, which uses the eigenvalues and eigenvectors of the matrix representation of a graph. Multidimensional points can be treated as vertices in a graph where edges connect near neighbors (Figure 3.11). The algorithm is summarized in Figure 3.10. Readers should refer to Appendix I for the symbols used in this section.

Algorithm Spectral_LPM(DP, g_x, g_y)   // where DP is a set of data points
    1. Partition the DP space into a grid G of size g_x x g_y, where G_jk represents a grid cell,
       1 <= j <= g_x and 1 <= k <= g_y. The center of each grid cell G_jk represents an imaginary
       point p_i, where 1 <= i <= g_x * g_y.
    2. Model the set of points P as a graph G(V,E) such that each point p_i in P is represented by
       a vertex v_i in V, and there is an edge (v_i, v_j) in E if and only if |p_i - p_j| = 1.
    3. Compute the Laplacian matrix L(G) of the graph G(V,E) by L(G) = D(G) - A(G).
    4. Compute the second smallest eigenvalue λ2 and its corresponding eigenvector X2 (the Fiedler vector).
    5. Assign the value x_i in X2 to v_i in V and to the corresponding grid cell G_jk.
    6. Sort the G_jk by their values in ascending (or descending) order; this linear order gives the
       Spectral LPM ordering.

Figure 3.10 The Locality Preserving Mapping (Spectral LPM) algorithm.

Figure 3.11 Multidimensional points can be represented by a graph G(V,E).

An example is given in Figure 3.11. First, we model a set of points P as a graph G(V,E) such that each point p_i in P is represented by a vertex v_i in V, and there is an edge (v_i, v_j) in E if and only if |p_i - p_j| = 1. For instance, in Figure 3.11, |p_1 - p_2| = 1 and |p_1 - p_3| = 1, so there are two edges from vertex v_1, i.e., (v_1, v_2) and (v_1, v_3). Then we compute the Laplacian matrix L(G) = D(G) - A(G), where A(G) is the adjacency matrix of G, with A(G)_ij = 1 if and only if there is an edge (i, j) in E, and D(G) is the degree matrix of G, with D(G)_ii = degree of vertex v_i; D(G) has a diagonal structure. The matrices corresponding to Figure 3.11 are shown in Figure 3.12.

Figure 3.12 Adjacency matrix and degree matrix of the graph in Figure 3.11.

Thus, the Laplacian matrix of this graph can be computed by L(G) = D(G) - A(G):

Figure 3.13 The Laplacian matrix of the graph in Figure 3.11.

The next step is to compute the second smallest eigenvalue λ2 and its corresponding eigenvector X2 of L(G), also known as the Fiedler vector. By assigning the coordinate value x_i of X2 to v_i in V and sorting the values of the v_i in ascending (or descending) order, we obtain a linear ordering of P. For example, sorting the vertices of the graph in Figure 3.11 by their Fiedler-vector values yields the Spectral LPM order of the vertices. We implement the algorithm using the MATLAB ARPACK library for finding the Fiedler vector. However, when the algebraic multiplicity of the Fiedler value is greater than 1, different spectral orderings may be obtained. Since all of them are optimal orderings, we can randomly select one. The proof of the optimality of Spectral LPM has been given in [], and we summarize and extend the proof by showing that, theoretically, there are infinitely many optimal orders satisfying the objective function (Appendix I).
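The thesis computes the Fiedler vector with MATLAB/ARPACK; purely as an illustration, the following dense C++ sketch derives the Spectral LPM order of a small grid graph from its adjacency matrix. It assumes the Eigen library is available (an assumption of this sketch; any eigensolver would do) and all names are illustrative.

#include <Eigen/Dense>
#include <algorithm>
#include <numeric>
#include <vector>

// Given the adjacency matrix A of the grid graph, build the Laplacian
// L = D - A, take the eigenvector of the second smallest eigenvalue
// (the Fiedler vector), and return the vertex order obtained by sorting
// the vertices on their Fiedler-vector values.
std::vector<int> spectralOrder(const Eigen::MatrixXd& A) {
    const int n = static_cast<int>(A.rows());
    Eigen::VectorXd deg = A.rowwise().sum();          // vertex degrees
    Eigen::MatrixXd D = deg.asDiagonal();             // degree matrix
    Eigen::MatrixXd L = D - A;                        // Laplacian L = D - A

    // SelfAdjointEigenSolver returns eigenvalues in increasing order,
    // so column 1 of the eigenvector matrix is the Fiedler vector.
    Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> solver(L);
    Eigen::VectorXd fiedler = solver.eigenvectors().col(1);

    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return fiedler(a) < fiedler(b); });
    return order;   // order[0] is the first vertex in the Spectral LPM ordering
}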

Figure 3.14 shows examples of different results obtained when the second smallest eigenvalue is not unique.

Figure 3.14 Different orderings obtained from the Spectral LPM algorithm.

3.3.2 Examples of Using Preprocessing Bulk Loading Algorithms for Real Datasets

In order to apply the ordering algorithms described in Section 3.3.1 to real datasets, we need to partition the data points into grid cells. This is because the edges between neighboring grid cells can be easily computed. Figure 3.15 shows an example of using a grid size of 64 x 64 on the Earthquakes dataset (65,536 points). The dots represent the center of each cell. Ideally, this graph should have 4096 vertices, but some of them have no data representation.

Figure 3.15 Partitioning the Earthquakes dataset into a grid of size 64 x 64.

After data partitioning, we apply the locality ordering techniques on the grid, as shown in Figure 3.16. The original data are then sorted based on their locality ordering and inserted into R-trees according to this ordering.

Figure 3.16 Examples of computing the order for the Wyoming Poverty Distribution dataset (495 points) using either Spectral LPM (left) on an 8 x 8 grid or Hilbert ordering (right) on a 16 x 16 grid.
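A minimal C++ sketch of this step is given below, assuming the per-cell order (produced by Hilbert ordering or Spectral LPM) is already available as an array cellRank. The names, the row-major cell indexing and the square grid are illustrative assumptions, not the thesis implementation.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Sort data points by the rank of the g x g grid cell they fall into.
// cellRank[cellIndex] gives the position of that cell in the chosen
// locality-preserving order.
void sortByCellOrder(std::vector<Point>& pts,
                     const std::vector<std::size_t>& cellRank,
                     std::size_t g,
                     double minX, double minY, double maxX, double maxY) {
    const double cw = (maxX - minX) / g;   // cell width
    const double ch = (maxY - minY) / g;   // cell height
    auto cellOf = [&](const Point& p) {
        std::size_t cx = std::min<std::size_t>(g - 1, static_cast<std::size_t>((p.x - minX) / cw));
        std::size_t cy = std::min<std::size_t>(g - 1, static_cast<std::size_t>((p.y - minY) / ch));
        return cy * g + cx;                // row-major cell index
    };
    std::sort(pts.begin(), pts.end(), [&](const Point& a, const Point& b) {
        return cellRank[cellOf(a)] < cellRank[cellOf(b)];
    });
}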

3.4 R-tree Search Heuristics

To examine nearest neighbor search performance, we need search algorithms that traverse the tree nodes in order to answer nearest neighbor queries. One common nearest neighbor search method on R-trees is a branch and bound search algorithm, as described in [5]. However, for a large dataset it may be quite expensive to find an exact nearest neighbor because the search space is huge. Thus, we can consider approximate nearest neighbor search based on different pruning metrics in order to save running time. Based on the R-tree characteristic that neighboring data entries are grouped together in the same node, the goal of the approximation is to prune nodes that are unlikely to contain the answers to the query. The metrics needed for nearest neighbor search and their role in the search pruning process are explained in Section 3.4.1. The three important metrics MINDIST, MINMAXDIST and MAXDIST measure the distance between a point and a rectangle, relying on the Euclidean distance.

3.4.1 Definition of Metrics MINDIST, MINMAXDIST and MAXDIST

The three distance metrics between a point p = (p1, p2, ..., pn) and an MBR R = (s,t) are defined below. Since the metrics are only used to measure the nearest neighbor

relationship, we do not need to compute the square root of the Euclidean distance, thus saving some computation time.

1. MINDIST(p,R): The minimum distance from a query point p = (p1, p2, ..., pn) to the MBR R = (s,t); if p lies within R, then MINDIST(p,R) is zero.

If Inside(p,R)
    MINDIST(p,R) = 0
Else
    MINDIST(p,R) = sum over i = 1..n of | p_i - r_i |^2
    where p = (p1, p2, ..., pn) is the query point, R = (s,t) is the n-dimensional bounding
    rectangle defined by two endpoints s = (s1, s2, ..., sn) and t = (t1, t2, ..., tn) with
    s_i <= t_i for 1 <= i <= n, and
        r_i = s_i  if p_i < s_i
        r_i = t_i  if p_i > t_i
        r_i = p_i  otherwise
End If

Figure 3.17 Definition of MINDIST.

52 If Inside(p, R) MINMAXDIST (p,r) = Else MINMAXDIST 2 ( p, R) min( Pk rmk + Pi rm i i n i k i n = 2 Where End If ( sk + tk ) rmk = sk if pk ( si + ti ) 2 rm i = si if pi 2 and rmk = tk otherwise rm = i ti otherwise Figure 3.8 Definition of MINMAXDIST. Figure 3.9 An example of MINMAXDIST calculation. An example of calculation of MINMAXDIST(p,R) is given in Figure 3.9. The query point p lies outside the rectangle R, thus, according to the definition, we will compute 2 2 P rm + P rm in all three dimensions and find the minimum of k k i k i n i i it: 4

53 In the x dimension, In the y dimension, In the z dimension, rm x = 4, rm y = 2, rm z = 4, rm y =8, rm x =3, rm x =3, rm z =2; MINMAXDIST(p,R) x = 85 rm z =2; MINMAXDIST(p,R) y = 33 rm y =8; MINMAXDIST(p,R) z = 2 Thus, the minimum of the three (85) is the MINMAXDIST(p,R): it is the minimum of the maximum possible distance to p for points in R that guarantees a nearest point to p in R exists, and is depicted as a red dotted line in Figure MAXDIST (p,r): The maximum distance from a query point p ( p p,..., ) =, 2 p n to the MBR R(s,t) ; it is the distance from p to the farthest vertex of R. If p lies within R, then MAXDIST(p,R) is zero. If Inside(p, R) Else MAXDIST(p, R) = MAXDIST(p,R)= p i r i n i= 2 Where: ri ri ri = = = t i s i p i if if pi < si p > t i otherwise i End If Figure 3.2 Definition of MAXDIST. 4
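The following is a minimal C++ sketch of the three metrics exactly as defined above (squared Euclidean distances, no square root). The function and variable names are illustrative; this is a sketch, not the thesis code.

#include <algorithm>
#include <cstddef>
#include <vector>

// The MBR is given by its lower corner s and upper corner t.
bool inside(const std::vector<double>& p,
            const std::vector<double>& s, const std::vector<double>& t) {
    for (std::size_t i = 0; i < p.size(); ++i)
        if (p[i] < s[i] || p[i] > t[i]) return false;
    return true;
}

double minDist(const std::vector<double>& p,
               const std::vector<double>& s, const std::vector<double>& t) {
    if (inside(p, s, t)) return 0.0;
    double sum = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        double r = std::max(s[i], std::min(p[i], t[i]));  // nearest face coordinate r_i
        sum += (p[i] - r) * (p[i] - r);
    }
    return sum;
}

double minMaxDist(const std::vector<double>& p,
                  const std::vector<double>& s, const std::vector<double>& t) {
    if (inside(p, s, t)) return 0.0;
    const std::size_t n = p.size();
    double total = 0.0;                    // sum of the "far" terms |p_i - rM_i|^2
    std::vector<double> far(n), near(n);
    for (std::size_t i = 0; i < n; ++i) {
        double mid = (s[i] + t[i]) / 2.0;
        far[i]  = (p[i] >= mid) ? s[i] : t[i];   // rM_i
        near[i] = (p[i] <= mid) ? s[i] : t[i];   // rm_i
        total += (p[i] - far[i]) * (p[i] - far[i]);
    }
    // For each dimension k, replace the far term by the near term and minimize.
    double best = -1.0;
    for (std::size_t k = 0; k < n; ++k) {
        double cand = total - (p[k] - far[k]) * (p[k] - far[k])
                            + (p[k] - near[k]) * (p[k] - near[k]);
        if (best < 0.0 || cand < best) best = cand;
    }
    return best;
}

double maxDist(const std::vector<double>& p,
               const std::vector<double>& s, const std::vector<double>& t) {
    if (inside(p, s, t)) return 0.0;
    double sum = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        double r = (p[i] < s[i]) ? t[i] : (p[i] > t[i]) ? s[i] : p[i];  // r_i as defined above
        sum += (p[i] - r) * (p[i] - r);
    }
    return sum;
}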

3.4.2 Nearest Neighbor Search on R-trees

The nearest neighbor search algorithm based on [5] is described in Figure 3.21; this is a best-first branch and bound search that recursively visits the nodes until reaching a leaf node.

Algorithm RecursiveNearestNeighborSearch(current_search_node): Given a query point, find its k nearest neighbors based on a particular distance metric; current_search_node is the tree node that is currently processed. The algorithm initializes current_search_node to the root node when first called.
    If (current_search_node is a leaf node)
        Compute the distance from p to all points contained in current_search_node
        Update the Current_best NNs and Current_best distance if a better solution is found
    Else
        Sort the MBRs contained in current_search_node by MINDIST in ascending order;
            this sorted list is denoted the Active Branch List (ABL)
        Compute the minimum of the MINMAXDIST over all MBRs, denoted MM
        For i = 1 : length of ABL      // Downward pruning
            If MINDIST of MBR[i] > MM
                Discard MBR[i] and all other nodes with greater MINDIST from the ABL
                Break the For loop
            End If
        End For
        For i = 1 : length of ABL
            RecursiveNearestNeighborSearch(MBR[i])
        End For
        For i = 1 : length of ABL      // Upward pruning
            If MINDIST of MBR[i] > Current_best distance
                Discard MBR[i] and all other nodes with greater MINDIST from the ABL
            End If
        End For
    End If
End Algorithm

Figure 3.21 The recursive nearest neighbor search algorithm.

Our implementation of the above nearest neighbor search algorithm is slightly different in that we use a stack data structure to store pointers to the candidate nodes and iteratively search and prune through them. The algorithm is described in Figure 3.22.

Algorithm IterativeNearestNeighborSearch(current_search_node): Given a query point, find its k nearest neighbors based on a particular distance metric; a stack is maintained in order to search in an iterative fashion. N is the tree node that is currently processed; each stack element stores two items: the address of the node and the MINDIST of the node to the query point p. Initially, the Current_best NNs is an empty set and the Current_best distance is equal to infinity.
    Push the root node onto the stack
    While (stack is not empty)
        N = stack.pop()
        If (N's MINDIST <= Current_best distance)
            If (N is a leaf node)
                Compute the distance from p to all points contained in N
                Update the Current_best NNs and Current_best distance if a better solution is found
            Else
                Set the minimum of the MINMAXDIST (denoted MM) = infinity
                For i = 1 : number of MBRs contained in N
                    Compute the MINMAXDIST of each MBR n
                    If n.MINMAXDIST < MM
                        MM = n.MINMAXDIST
                    End If
                End For
                For i = 1 : number of MBRs contained in N
                    Compute the MINDIST of each MBR n
                    If n.MINDIST <= MM and n.MINDIST <= Current_best distance
                        Add n to the Active Branch List (ABL)
                    End If
                End For
                Sort the ABL in ascending order
                For i = 1 : length of ABL
                    Push all remaining nodes in the ABL onto the stack
                    // The node with the greatest MINDIST is pushed first
                End For
            End If
        End If
    End While
End Algorithm

Figure 3.22 Using a stack to keep the nodes to be visited.

3.4.3 Approximate Nearest Neighbor Search on R-trees

For large datasets, it is very time-consuming to find the exact nearest neighbors due to the large search space. An approximate result may not be the best answer but can be computed in a reasonable amount of time. Several approximate methods that can be applied to R-tree nearest neighbor search are presented in this section. The methods listed below aim to reduce the search space, i.e., more nodes are pruned during the search process.

1. ε-approximation [6]: The ε-approximation returns the (1+ε)-approximate nearest neighbor, whose distance to the query point q is guaranteed to be within a relative error ε of the optimal solution. That is, the distance to q from the approximate solution p', denoted d(p',q), and that from the actual solution p, denoted d(p,q), have the following relationship:

    ( d(p',q) - d(p,q) ) / d(p,q) <= ε     (ε >= 0)

We can apply the ε-approximation on R-trees such that, for a query point p, an MBR x is pruned if MINDIST(p,x) is greater than c/(1+ε), where c is the Current_best distance. We can simply modify the iterative R-tree search algorithm described in Section 3.4.2 by changing the following pseudocode:

    If n.MINDIST <= MM and n.MINDIST <= Current_best distance
        Add n to the Active Branch List (ABL)
    End If

to:

    If (1+ε) * n.MINDIST <= MM and (1+ε) * n.MINDIST <= Current_best distance
        Add n to the Active Branch List (ABL)
    End If

Figure 3.23 ε-approximation for nearest neighbor search on R-trees. (A short C++ sketch of this pruning test is given at the end of this section.)

2. α-allowance method [7]: The pruning rule is as follows: an MBR x will not be searched if MINDIST(x) + α(c) > c, where c is the Current_best distance. α(c) is called an allowance function and α(c) >= 0.

3. N-consider method [7]: This approximate algorithm considers only n percent (0 < n <= 100) of the total number of elements as the search space; e.g. the elements in an internal node are ordered according to their MINDIST and only the top n percent of them are visited.

4. Time-consider method [27]: This technique simply follows a best-first search scheme and tries to obtain the best possible result within a given time limit. For the exact search, the time limit is infinity.

Note that in Method 3 and Method 4 many sub-optimal solutions may be found, and there is no guarantee that the distance error is bounded by some constant as in Method 1 and Method 2. However, Method 4 puts an upper bound on the running time, which is not the case for the other methods.

5. Probabilistic approximation: It is suggested in [9] that we can predict the probability that a given node (which represents an MBR) contains a point o such that, for a query point p = (p_1, p_2, ..., p_n), distance(p, o) < current_best distance. We can use this probability as a metric for pruning nodes during the search. Intuitively, if node x has less than a certain probability (for instance, 70%) of containing a better solution point o, we do not need to search x. A Monte Carlo approach (generating random points inside the MBR) is suggested to estimate the volume of the intersection between the MBR and the current solution sphere (shown by the green area in Figure 3.24). If the fraction of random points that fall in this area is less than β, the MBR is said to be unlikely to contain better solutions and is pruned.

According to the statement above, each pruned MBR has at most probability β of actually containing a better solution, i.e. of being pruned incorrectly. If n such pages are pruned during nearest neighbor search, there is a (1-β)^n chance that the correct solution has not been pruned; this value (1-β)^n is defined as the pessimistic probability of the approximate algorithm. For example, if β = 0.001 and we have pruned 500 pages (nodes), the pessimistic probability that the nearest neighbor found is the exact NN is (0.999)^500 ≈ 0.6 (60%). That is, at least 60% accuracy should be guaranteed by this algorithm. However, the distance error is not guaranteed to be bounded when using probabilistic approximation.

Figure 3.24 The radius of the circle centered on q (a query point) is the distance of the current solution.

We can see in Figure 3.24 that the MINDIST of both nodes (the rectangles) is smaller than the current_best distance c. However, the node on the right has only a very small intersection with the circle; this indicates that the probability of this MBR containing a

nearer neighbor of q may not be very high. Using the Monte Carlo approach to estimate the area of intersection is very expensive; thus, other heuristics were proposed, one of which simply considers the Probability Factor (PF), where

    PF = (current_best distance - MINDIST) / (MAXDIST - MINDIST)

PF is an approximation of the probability of finding an exact nearest neighbor in the corresponding MBR and can be used as a pruning metric. Thus, a node is pruned if its PF value is smaller than a fraction α. Here α acts as the degree of approximation: using a greater α results in a higher degree of approximation. We can simply modify the exact algorithm by changing the pseudocode of the iterative R-tree search algorithm described in Section 3.4.2, as shown in Figure 3.25:

For i = 1 : number of MBRs contained in N
    Compute the MINDIST of each MBR n
    If n.MINDIST <= MM and n.MINDIST <= current_best distance
        Add n to the Active Branch List (ABL)
    End If
End For

to:

For i = 1 : number of MBRs contained in N
    Compute the MINDIST of each MBR n
    Compute the MAXDIST of each MBR n
    If (current_best distance - n.MINDIST) / (n.MAXDIST - n.MINDIST) >= α    // α is a pruning parameter
        Add n to the Active Branch List (ABL)
    End If
End For

Figure 3.25 Approximate nearest neighbor search using the probabilistic method.
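To make the pruning tests above concrete, the following C++ fragment writes them as small predicates; the function and parameter names are illustrative assumptions rather than part of our implementation, and each predicate returns true when the MBR should still be considered.

// Hypothetical pruning predicates for approximate k-NN search on R-trees.

// ε-approximation (Figure 3.23): inflate MINDIST by (1 + eps) before
// comparing it against the bound (MM or the current best distance).
bool keep_epsilon(double mindist, double bound, double eps) {
    return (1.0 + eps) * mindist <= bound;
}

// α-allowance method: prune the MBR when MINDIST plus the allowance
// alpha(c) exceeds the current best distance c.
bool keep_alpha(double mindist, double current_best, double allowance) {
    return mindist + allowance <= current_best;
}

// Probability-factor heuristic (Figure 3.25): keep the MBR only when
// PF = (c - MINDIST) / (MAXDIST - MINDIST) reaches the threshold alpha.
bool keep_pf(double mindist, double maxdist, double current_best, double alpha) {
    if (maxdist <= mindist) return mindist <= current_best;   // degenerate MBR
    double pf = (current_best - mindist) / (maxdist - mindist);
    return pf >= alpha;
}

Larger values of eps and alpha prune more aggressively and therefore trade accuracy for speed.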

3.5 Experiments and Results

In this section we present experimental results on different R-tree index construction methods and their impact on nearest neighbor search performance. We first examine R-tree k-NN search performance for different minimum filled requirement values (m) in Section 3.5.2; we then apply different bulk loading techniques to both the quadratic R-tree (Section 3.1) and the packed R-tree (Section 3.2). We evaluate nearest neighbor query performance by measuring both the running time and the number of nodes accessed. The datasets used in our experiments are described in Appendix II.

3.5.1 Evaluation Measures

1. Time Complexity

For measuring the running time of the different implementations, i.e. the elapsed processor time, we use the clock function declared in the standard header <time.h>. The clock function returns the number of clock ticks of elapsed processor time, counted from program startup, or returns -1 if the target environment cannot measure elapsed processor time. Its usage is illustrated in Figure 3.26.

#include <time.h>

clock_t start, end;
double elapsed;

start = clock();
/* The application runs here */
end = clock();
elapsed = ((double) (end - start)) / CLOCKS_PER_SEC;

Figure 3.26 Time measurement using the clock() function.

In addition to measuring running time for comparing exact and approximate k-NN search on R-trees, we also measure the number of tree nodes visited (page accesses) in order to know what percentage of the tree nodes must be accessed. For all experiments the tree node size is set to 2 KB, except for small datasets, where it is set to 1 KB in order to obtain a tree of reasonable height.

2. Nearest Neighbor Accuracy

This measure applies to the approximate nearest neighbor search algorithms. In order to compare the accuracy of approximate k-NN search, we measure the NN agreement rate, which is defined as:

    agreement rate = (number of approximate NNs that agree with the exact k-NN) / k

For 1-NN problems, an approximate nearest neighbor agrees with the exact solution if their distances are equal. For k-NN problems, we measure the percentage of approximate NNs whose distance falls within the distance of the exact k-th solution. For

example, if the exact 5th NN's distance is 6.0 and we get a set of approximate 5-NNs with the distances {3.0, 4.9, 6.0, 8.0, 8.0}, then the agreement rate is 3/5, which is equal to 60%. We also measure the distance error as:

    distance error = (average approximate NN distance - average exact NN distance) / average exact NN distance

For example, if the distances of the exact 5-NN are {2.0, 3.9, 4.0, 5.0, 6.0} while those of the approximate 5-NN are {3.0, 4.9, 6.0, 8.0, 8.0}:

    the average exact k-NN distance is (2.0 + 3.9 + 4.0 + 5.0 + 6.0) / 5 = 4.18,
    the average approximate k-NN distance is (3.0 + 4.9 + 6.0 + 8.0 + 8.0) / 5 = 5.98,
    and the distance error is (5.98 - 4.18) / 4.18 ≈ 0.43.

3.5.2 Relationship between Node Utilization and Nearest Neighbor Search Efficiency

The minimum filled requirement (m) is used to control the node utilization; it prevents nodes from being too empty, which would result in too much dead space. The value of m ranges over 0 < m <= M/2. A node is split when it contains M+1 points, so m = M/2 (50% of M) is the maximum number of entries that we can guarantee after a split occurs. However, in the

quadratic split algorithm, in order to fulfill this requirement, some points may be forced to be grouped with outlier points that are far away from them. We conduct experiments on three large datasets to observe the relationship between node utilization and 1-NN search performance. The node size is set to 2 KB. The results are shown below.

1. Relationship between m, actual node utilization, and the number of nodes created: Figure 3.27 shows that a greater m value does not guarantee better node utilization. It is observed that for the Modified Wyoming Poverty Status dataset and the Northeast dataset, the actual node utilization of the constructed R-tree decreases while the minimum filled requirement m increases.

[Figure: actual node utilization (y-axis) versus minimum filled requirement (x-axis) for the Wyoming, Earthquake, and Northeast datasets.]

Figure 3.27 The minimum filled requirement behaves differently on the three datasets.

2. Relationship between actual node utilization and 1-NN search performance: We conduct experiments to understand the relationship between nearest neighbor query performance and node utilization. In theory, better node utilization should benefit the search, since there is less empty space. However, there are exceptions, as shown in Figure 3.28. The algorithm behaves differently on the Earthquakes dataset: the 1-NN search needs fewer node traversals in a lower-utilization tree (38%) than in a higher-utilization tree (49%). This indicates that k-NN search does not necessarily benefit from a more compact tree structure.

[Figure: percentage of nodes processed in the R-tree (y-axis) versus node utilization in the R-tree (x-axis) for the Wyoming, Earthquake, and Northeast datasets.]

Figure 3.28 The relationship between node utilization and the percentage of nodes processed during exact nearest neighbor search.

3.5.3 Nearest Neighbor Search on Quadratic R-tree

R-tree index structures are very sensitive to the insertion order. In order to observe this effect, the sort-based bulk loading algorithms are applied before construction of the quadratic R-tree index; in particular, Spectral LPM and Hilbert ordering were used. The minimum filled requirement (m) is set to 40% and the node size is set to 2 KB, except for smaller datasets, for which it is set to 1 KB in order to maintain a reasonable tree height. Data objects are first divided into 64 x 64 grid cells and the order of each cell along the chosen curve is calculated (a sketch of one way to compute a cell's Hilbert index is given below). Data objects are then inserted into the R-tree according to the order of the cell to which they are mapped. R-trees created with these preprocessing bulk loading algorithms are compared to R-trees created by random insertion of the objects.
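One way to compute the Hilbert index of a grid cell is the standard bit-manipulation formulation sketched below in C++; this is an illustrative reconstruction rather than the exact routine used in our implementation, and the orientation of the curve may differ.

#include <utility>

// Rotate/flip a quadrant so that the Hilbert curve's recursion pattern is
// preserved at the next level.
static void rot(int n, int& x, int& y, int rx, int ry) {
    if (ry == 0) {
        if (rx == 1) {
            x = n - 1 - x;
            y = n - 1 - y;
        }
        std::swap(x, y);
    }
}

// Map cell (x, y) of an n x n grid (n a power of two, e.g. 64) to its
// one-dimensional position along the Hilbert curve.
int hilbert_index(int n, int x, int y) {
    int d = 0;
    for (int s = n / 2; s > 0; s /= 2) {
        int rx = (x & s) > 0 ? 1 : 0;
        int ry = (y & s) > 0 ? 1 : 0;
        d += s * s * ((3 * rx) ^ ry);
        rot(n, x, y, rx, ry);
    }
    return d;
}

Sorting the data objects by hilbert_index(64, cx, cy) of the cell (cx, cy) into which they fall, and then inserting (or packing) them in that order, corresponds to the Hilbert preprocessing step described above.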

Figure 3.29 shows the node utilization rate of R-trees created with the different preprocessing methods. The results indicate that neither Spectral LPM nor Hilbert ordering improves the utilization rate over random insertion. For the Modified Wyoming Poverty Status dataset and the Northeast dataset, random insertion actually achieves the highest node utilization rate.

[Figure: node utilization in the R-tree (y-axis) for random insertion, Spectral LPM, and Hilbert ordering (x-axis), shown for the Earthquake, Wyoming, and Northeast datasets.]

Figure 3.29 The relationship between node utilization and preprocessing methods for three different large datasets.

[Figure: percentage of nodes processed during NN search (y-axis) for random insertion, Spectral LPM, and Hilbert ordering (x-axis), shown for the Earthquake, Wyoming, and Northeast datasets.]

Figure 3.30 Comparison of NN search on the quadratic R-tree created by random insertion and by the pre-ordering methods.

We also compare the performance of k-NN search queries on the quadratic R-tree with and without the bulk loading preprocessing. The results show a drastic decrease

of running time when the preprocessing methods are applied. In Figure 3.30, the Y-axis represents the average percentage of nodes processed over the random queries. For the Earthquakes dataset, on average about 20% of the nodes have to be accessed in order to retrieve the nearest neighbor in a randomly inserted R-tree, whereas less than 5% need to be accessed if the bulk loading techniques are applied. We can see that both Spectral LPM and Hilbert ordering reduce the page accesses severalfold, and Hilbert ordering seems to perform better than Spectral LPM on the three large datasets. We also found that the tree structure matters more to nearest neighbor search than the search heuristics per se: preprocessing the entries with a spatial ordering alleviates the overlapping problem and thus reduces the time spent searching redundant nodes. The experimental results show that probabilistic approximation of k-NN search on a randomly inserted R-tree needs more page accesses and running time than exact search on R-trees created by ordered insertion. Figure 3.31 shows the result of using probabilistic approximation to find the approximate nearest neighbor. With the same value of the pruning parameter α, R-trees created by ordered insertion return results in a shorter period of time and achieve a better NN agreement rate. R-trees constructed by Hilbert ordering clearly give the best result.

[Figure: average time needed (sec) and average NN agreement rate (%) for R-trees built by random insertion, Spectral LPM, and Hilbert ordering.]

Figure 3.31 The average time needed and the average NN agreement rate for several datasets of different sizes on R-trees.

3.5.4 Bulk Loading for Packed R-tree

In this section we conduct experiments on the packed R-tree, the static R-tree algorithm explained in Section 3.2. Spectral LPM and Hilbert ordering are applied before the tree is created. Since the node utilization is nearly 100%, the k-NN search requires fewer page accesses than on the quadratic R-tree (a minimal sketch of the leaf-packing step follows Figure 3.32). Figure 3.32 shows the comparison of the two preprocessing methods; the Y-axis shows the average percentage of nodes accessed during 1-NN search. We can see that Hilbert ordering outperforms Spectral LPM for all three datasets.

[Figure: percentage (%) of nodes processed during exact NN search in the packed R-tree, for Hilbert ordering and Spectral LPM, on the Earthquake, Wyoming, and Northeast datasets.]

Figure 3.32 Comparison of Spectral LPM and Hilbert ordering applied on the packed R-tree.
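For illustration, a minimal C++ sketch of the leaf-level packing step is given below, assuming every entry has already been assigned a Hilbert index; computing each node's MBR and building the upper levels recursively are omitted, and the names are assumptions rather than our actual code.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Entry { int hilbert; /* plus the object's MBR or point */ };

// One level of bottom-up packing: entries are sorted by Hilbert index and
// grouped into runs of at most M per node, so every node except possibly
// the last one is completely full (utilization close to 100%).
std::vector<std::vector<Entry>> pack_level(std::vector<Entry> entries, std::size_t M) {
    std::sort(entries.begin(), entries.end(),
              [](const Entry& a, const Entry& b) { return a.hilbert < b.hilbert; });
    std::vector<std::vector<Entry>> nodes;
    for (std::size_t i = 0; i < entries.size(); i += M) {
        std::size_t end = std::min(i + M, entries.size());
        nodes.emplace_back(entries.begin() + static_cast<std::ptrdiff_t>(i),
                           entries.begin() + static_cast<std::ptrdiff_t>(end));
    }
    return nodes;
}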

3.6 Problems of using Euclidean Distance for Spherical Datasets

Although Euclidean distance is suitable for most spatial datasets, using it to compute the distance between two points on Earth from their longitude and latitude is not very precise. Euclidean distance is a frequently used approximate distance computation: it measures the distance along a straight line. However, if the points are scattered around the world or range over a large area, the Euclidean computation may introduce errors, since the true distances between the points are no longer straight lines. There are other distance formulas which provide more accurate solutions. For example, the Haversine formula [28] is suitable for great-circle distance computation (the shortest distance between any two points on the surface of a sphere, measured along a path on the surface of the sphere). First, the longitude and latitude of the two points a and b are converted to radians (multiplying by π/180). Then let

    d_x = longitude_a - longitude_b
    d_y = latitude_a - latitude_b

    h = sin²(d_y / 2) + cos(latitude_a) * cos(latitude_b) * sin²(d_x / 2)

    distance(a, b) = 2 * R * arcsin( min(1, sqrt(h)) )
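A small C++ sketch of this great-circle computation is given below; it assumes coordinates in degrees and an Earth radius of roughly 6371 km, and it is an illustration rather than the exact code used for the experiments.

#include <algorithm>
#include <cmath>

// Haversine (great-circle) distance between two (latitude, longitude)
// pairs given in degrees; R is the sphere radius in kilometres.
double haversine_km(double lat_a, double lon_a, double lat_b, double lon_b,
                    double R = 6371.0) {
    const double deg2rad = 3.14159265358979323846 / 180.0;
    lat_a *= deg2rad; lon_a *= deg2rad;
    lat_b *= deg2rad; lon_b *= deg2rad;
    double dlat = lat_a - lat_b;
    double dlon = lon_a - lon_b;
    double h = std::sin(dlat / 2) * std::sin(dlat / 2)
             + std::cos(lat_a) * std::cos(lat_b)
             * std::sin(dlon / 2) * std::sin(dlon / 2);
    return 2.0 * R * std::asin(std::sqrt(std::min(1.0, h)));
}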

For points near the poles, the parallels of latitude are not only shorter than great circles but also appreciably curved. There we need to use the following polar coordinate flat-Earth formula:

    x = π/2 - latitude_a
    y = π/2 - latitude_b

    distance(a, b) = R * sqrt( x² + y² - 2 * x * y * cos(longitude_a - longitude_b) )

However, both formulas are ill-conditioned when the two points are nearly antipodal (on opposite sides of the earth). Moreover, if the R-tree index is created using longitude and latitude as the two dimensions, the nearest neighbor search heuristics mentioned in this chapter may not perform well: because of the nature of R-trees, objects located at (-179, 23) and (179, 23) will be placed in different nodes despite the fact that they are actually very close to each other on the earth's sphere.

Chapter 4

Repetitive Nearest Neighbor Search

Frequently we encounter situations where we have to repeatedly find nearest neighbors for a given set of fixed objects. For example, in k-nearest neighbor classification we often use N-fold cross validation to measure classifier accuracy, and this process is usually repeated many times for the same dataset. Instead of using exhaustive brute-force computation, we can use precomputation methods to speed up this process, for instance computing a distance matrix, or keeping a linked list of precomputed nearest neighbors in memory for future lookups. However, these methods do not work well for large datasets because the space needed exceeds the main memory size. As multiprocessor computer systems with increasing physical memory have become more widely available, it is worthwhile to explore parallel nearest neighbor search for speeding up repetitive k-NN queries. It is predictable that future database sizes and query complexity will increase; thus, parallelism will be a key to realizing high-performance computing and supporting simultaneous query processing [27]. In this chapter we use OpenMP parallel programming to implement both brute-force k-NN search and R-tree k-NN search. The parallel computing concepts are explained and the performance of our experiments is analyzed.

4.1 N-fold Cross Validation

Computing nearest neighbors of multiple objects is very common, especially in distance-based data mining algorithms. Cross validation [29] is a method used to measure the quality of a model in such algorithms. It partitions the data into subsets, such that one subset (the training set) is used to construct the model and the other subsets (test sets) are used to validate the accuracy of this model. When cross validation is applied to a distance-based algorithm, we need to repeatedly compute nearest neighbors over the entire dataset. The simplest form is holdout cross validation, in which the data points are subdivided into two disjoint subsets (a training set and a test set). This method requires less computation time, but the evaluation depends heavily on how the data are divided and therefore has high variance. N-fold cross validation solves this problem by grouping the data points into N subsets and repeating the holdout method N times. At each iteration, one subset is used as the test data and the union of the other N-1 subsets is used as the training data. The N results are then combined for the evaluation. The leave-one-out method is the extreme version of N-fold cross validation, where N is equal to the number of instances in the original sample.

Performing N-fold cross validation in instance-based learning such as k-nearest neighbor classification can be very expensive. The naïve way is to repeatedly compute the distances between each test object and all training objects; the time complexity is O(n²), as shown in Figure 4.1.

75 For fold=,n (N-fold cross validation) For i=,m (m :test data size) i s k-nn candidates = //empty set current distance of i s k-nn candidates = For j=,l (l: training data size) If (number of k-nn <= k) Add j to k-nn candidates and keep k-nn candidates sorted Else Compute distance(i,j) If (distance(i,j)< current distance of k-nn candidates) Update k-nn candidates by using binary search End If End If End For /*test data i can now be classified by k-nn candidates*/ End For End For Figure 4. N-fold cross validation for the nearest neighbor classification. If cross validation is performed multiple times, a distance matrix that contains the pairwise distances of all objects may be helpful. An example of a distance matrix is shown in Figure 4.2. Each cell in the matrix stores the distance between two objects. Figure 4.2 The distance matrix stores the pairwise distance of all data points. 63

Figure 4.3 presents the pseudocode for constructing and querying the distance matrix. The time cost of constructing the distance matrix is O(n²) and the space needed is also O(n²), where n is the total number of data objects. Thus, it is not feasible to store such a matrix for large datasets. For a dataset with 100,000 objects, if a double (8 bytes) is used to store each distance, the space required will be:

    (100,000)² × 8 / (1024)³ ≈ 74.5 GBytes

Construct phase:
For i = 1, n    (n: total data size)
    For j = 1, n
        Compute the Euclidean distance(i, j)
        Store it in matrix[i][j]
    End For
End For

Query phase:
/* A flags array that indicates the group identity of each object must be
   maintained in order to distinguish the training set from the test set */
For i = 1, m    (m: test data size)
    i's k-NN candidates = Ø    // empty set
    current distance of i's k-NN candidates = ∞
    For j = 1, n    (n: total data size)
        If (j belongs to the training set)
            If (number of k-NN candidates <= k)
                Add j to i's k-NN candidates and keep the candidates sorted
            Else
                Look up matrix[i][j]
                If (matrix[i][j] < current distance of the k-NN candidates)
                    Update the k-NN candidates using binary search

                End If
            End If
        End If
    End For
    /* test object i can now be classified by its k-NN candidates */
End For

Figure 4.3 Pseudocode for distance matrix computation.

Another approach is to precompute the nearest neighbors of each object and store them in an array of lists (NN-list). This requires extra computation, dominated by the O(n²) pairwise distance calculations plus the cost of maintaining each object's sorted candidate list (the k term can be neglected if k is small). Although it provides faster lookups for k-NN search, it is only feasible when n is small (a small dataset), N is large (e.g. the leave-one-out method), and the computation is repeated frequently. This approach is described in Figure 4.4.

Construct phase:
For i = 1, n    (n: total data size)
    i's k-NN candidates (NN-List[i]) = Ø    // empty set
    /* NN-List[i] contains the list of nearest neighbors of the i-th instance, 1 <= i <= n */
    current distance of i's k-NN candidates = ∞
    For j = 1, n
        If (number of k-NN candidates <= k)
            Add j to the k-NN candidates
        Else
            Compute distance(i, j)
            If (distance(i, j) < current distance of the k-NN candidates)
                Update NN-List[i] by popping the last (farthest) element out

                and inserting j using linear search, in order to keep the candidates sorted
            End If
        End If
    End For
End For

Query phase:
For i = 1, m    (m: test data size)
    j = 1
    While (j <= length(NN-List[i]) and fewer than k candidates have been collected)
        If (NN-List[i][j] belongs to the training set)
            Add NN-List[i][j] to i's k-NN candidates
        End If
        j = j + 1
    End While
    /* test object i can now be classified by its k-NN candidates */
End For

Figure 4.4 Pseudocode for the precomputed NN-list.

The idea of the precomputation is simply to use an array of NN-lists to store a list of nearest neighbors for every instance. However, in the N-fold cross validation example the length of each list must be n/N + k, where n is the size of the dataset and k is the number of nearest neighbors, in order to ensure that the k nearest neighbors of a test-set object can be retrieved at any fold. Thus, the precomputation method also requires a lot of space unless N is large. For example, if we wish to perform 100-fold cross validation on a dataset of 1,000,000 points, we need to store about 10,000 nearest neighbor addresses for each point (10,000 × 4 bytes). The total space needed will be:

    (1,000,000) × (10,000) × 4 / (1024)³ ≈ 37.25 GBytes

An example of an NN-list is shown in Figure 4.5. For one run of N-fold cross-validated k-nearest neighbor classification applied to n data points, n/N + k nearest neighbors must be computed and stored for each point. In addition, a group array which identifies each object's group must be maintained. For example, Object 6 is the first nearest neighbor of Object 1, but since the two are placed in the same group, the nearest neighbor in the training set for Object 1 will be Object 4. This method helps when we need to repeat the k-NN search for the same data many times, but it may be very inefficient when the data size is large because excess nearest neighbors have to be stored.

Figure 4.5 Data structure of the NN-list (the linked list that stores the first m nearest neighbors).

Both the distance matrix and the precomputed NN-list approach require too much memory; therefore they are not scalable. Thus, we explored parallel computation for k-

NN search in order to divide the exhaustive distance computations among different processors and obtain results faster.

4.2 Parallel Computing Concepts

Traditional serial computation runs on a single central processing unit (CPU), where only one instruction can be executed at a time. Most computer architectures follow the Von Neumann model, as depicted in Figure 4.6.

Figure 4.6 Von Neumann architecture: the central processing unit (CPU) gets instructions and data from memory and then processes them sequentially.

Parallel computing divides the computation among multiple processors working at the same time, thus decreasing the response time. A commonly available architecture is the shared-memory multiprocessor system (SMP system), in which multiple processors

have direct access to the same memory via a memory bus (Figure 4.7). A single program running on this architecture can distribute its work to several processors using a multithreading approach. No explicit communication calls are required, since all threads share the same program address space.

Figure 4.7 SMP systems: multiple processors share the same memory.

4.2.1 OpenMP

OpenMP is the most popular multithreading standard that utilizes SMP systems. It provides a set of specifications and interfaces for parallelizing programs [30]. Compilers that support OpenMP interpret the pragmas in the source program and generate parallelized executables. An OpenMP program starts with a single master thread. When it enters a parallel region, the master thread forks several new threads that run simultaneously; when the parallel region is exited, all threads synchronize and join the master thread again [30].
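To make the fork-join model concrete, the following C++/OpenMP sketch parallelizes a brute-force 1-NN pass over a batch of query points, the kind of repetitive search performed during cross validation; the data layout and function names are illustrative assumptions rather than the thesis code.

#include <omp.h>
#include <cfloat>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Squared Euclidean distance between two points.
static double dist2(const Point& a, const Point& b) {
    double dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// For every query point, find the index of its nearest neighbor in 'data'
// by brute force; the outer loop over queries is embarrassingly parallel,
// so each iteration is handed to one of the OpenMP worker threads.
std::vector<std::size_t> parallel_1nn(const std::vector<Point>& data,
                                      const std::vector<Point>& queries) {
    std::vector<std::size_t> nn(queries.size());
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < static_cast<long>(queries.size()); ++i) {
        double best = DBL_MAX;
        std::size_t best_j = 0;
        for (std::size_t j = 0; j < data.size(); ++j) {
            double d = dist2(queries[i], data[j]);
            if (d < best) { best = d; best_j = j; }
        }
        nn[i] = best_j;
    }
    return nn;
}

Compiling with an OpenMP-enabled compiler (for example with -fopenmp) and setting the OMP_NUM_THREADS environment variable controls how many threads are forked at the parallel region.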
