A Cost Model For Nearest Neighbor Search in High-Dimensional Data Space


Stefan Berchtold, University of Munich, Germany, berchtold@informatik.uni-muenchen.de
Daniel A. Keim, University of Munich, Germany, keim@informatik.uni-muenchen.de
Christian Böhm, University of Munich, Germany, boehm@informatik.uni-muenchen.de
Hans-Peter Kriegel, University of Munich, Germany, kriegel@informatik.uni-muenchen.de

Abstract

In this paper, we present a new cost model for nearest neighbor search in high-dimensional data space. We first analyze different nearest neighbor algorithms, present a generalization of an algorithm which was originally proposed for Quadtrees [13], and show that this algorithm is optimal. Then, we develop a cost model which, in contrast to previous models, takes boundary effects into account and therefore also works in high dimensions. The advantages of our model are in particular: it works for data sets with an arbitrary number of dimensions and an arbitrary number of data points, is applicable to different data distributions and index structures, and provides accurate estimates of the expected query execution time. To show the practical relevance and accuracy of our model, we perform a detailed analysis using synthetic and real data. The results of applying our model to Hilbert and X-tree indices show that it provides a good estimation of the query performance, which is considerably better than the estimates by previous models, especially for high-dimensional data.

Key Words: Nearest Neighbor Search, Cost Model, Multidimensional Searching, Multidimensional Index Structures, High-Dimensional Data Space

1 Introduction

In this paper, we describe a cost model for nearest neighbor queries in high-dimensional space. Nearest neighbor queries are very important for many applications. Examples include multimedia indexing [9], CAD [17], molecular biology (for the docking of molecules) [24], string matching [1], etc. Most applications use some kind of feature vector for an efficient access to the complex original data. Examples of feature vectors are
color histograms [23], shape descriptors [16, 18], Fourier vectors [26], text descriptors [15], etc. Nearest neighbor search on the high-dimensional feature vectors may be defined as follows: Given a data set $DB$ of $d$-dimensional points, find the data point $NN$ from the data set which is closer to the given query point $Q$ than any other point in the data set. More formally:

$$NN(Q) = \{\, e \in DB \mid \forall e' \in DB : \|e - Q\| \le \|e' - Q\| \,\}$$

Usually, nearest neighbor queries are executed using some kind of multidimensional index structure such as k-d-trees, R-trees, Quadtrees, etc. In section 2, we discuss the different nearest neighbor algorithms proposed in the literature. We present a generalization of an algorithm which was originally proposed for Quadtrees [13] and show that this algorithm is optimal.

A problem of index-based nearest neighbor search is that it is difficult to estimate the time which is needed for executing the nearest neighbor query. The estimation of the time, however, is important not only for a theoretical complexity analysis of the average query execution time, but it is also crucial for optimizing the parameters of the index structures (e.g., the block size) and for query optimization. An adequate cost model should work for data sets with an arbitrary number of dimensions and an arbitrary number of data points, it should be applicable to different data distributions and index structures, and, most importantly, it should provide accurate estimates of the expected query execution time.

Unfortunately, existing models fail to fulfill these requirements. In particular, none of the models provides accurate estimates for nearest neighbor queries in high-dimensional space, and most models pose awkward and unrealistic requirements on the number of necessary data points, preventing the models from being practically applicable. One of the reasons for the problems of existing models is that basically none of them accounts for boundary effects, i.e., effects that occur if the query processing reaches the border of the data space. As we will show later, boundary effects play an important role in processing nearest neighbor queries in high-dimensional space.

Our model determines the expected number of page accesses when performing a nearest neighbor query by intersecting all pages with the minimal sphere around the query point containing the nearest neighbor. In contrast to previous approaches, our cost model considers boundary effects and therefore also provides accurate estimates for the high-dimensional case. Furthermore, our model works for an arbitrary number of data points and is applicable to a wide range of index structures such as k-d-trees, R-trees, quadtrees, etc. Besides describing our cost model, we provide a detailed experimental evaluation showing the accuracy and practical relevance of our model. In our experiments, we use artificial as well as real data and compare the model estimates with the actually measured page counts obtained from two different index structures: the Hilbert-index and the X-tree.

2 Algorithms for Nearest Neighbor Search

In the last decade, a large number of algorithms and index structures have been proposed for nearest neighbor search. In the following, we give an overview of these algorithms.

2.1 Known Algorithms

Algorithms for nearest neighbor search may be divided into two major groups: partitioning algorithms and graph-based algorithms. Partitioning algorithms partition the data space (or the actual data set) recursively and store information about the partitions in the nodes. Graph-based algorithms precalculate some nearest neighbors of points, store the distances in a graph and use the precalculated information for a more efficient search. Examples for such algorithms are the RNG* algorithm of Arya [2] and algorithms
using Voronoi diagrams [20]. Although in this paper we concentrate our discussion on partitioning algorithms, we believe that our results are applicable to graph-based algorithms as well.

    initialize PartitionList with the subpartitions of the root-partition;
    sort PartitionList by MINDIST;
    while (PartitionList is not empty)
        if (top of PartitionList is a leaf)
            find nearest point NNC in leaf;
            remove the leaf from PartitionList;
            if (NNC closer than NN)
                prune PartitionList with NNC;
                let NNC be the new NN;
            endif
        else
            replace top of PartitionList with its son nodes;
        endif
        resort PartitionList by MINDIST;
    endwhile
    output NN;

Figure 1: Algorithm NN-opt

A rather simple partitioning algorithm is the bucketing algorithm of Welch [27]. The algorithm divides the data space into identical cells and stores the data objects inside a cell in a list which is attached to the cell. During nearest neighbor search, the cells are visited in order of their distance to the query point. The search terminates if the nearest point which has been determined so far is nearer than any cell not visited yet. Unfortunately, the algorithm is not efficient for high-dimensional or real data.

A more practical approach is the k-d-tree algorithm of Friedman, Bentley and Finkel [12]. In contrast to Welch's algorithm, the order in which the k-d-tree algorithm visits the partitions of the data space is determined by the structure of the k-d-tree. Ramasubramanian and Paliwal [21] propose an improvement of the algorithm by optimizing the structure of the k-d-tree.

Roussopoulos et al. [22] propose a different approach using the R*-tree [4] for nearest neighbor search. The algorithm traverses the R*-tree and stores for every visited partition a list of subpartitions ordered by their minmaxdist. The minmaxdist of a partition is the maximal possible distance from the query point to the nearest data point inside the partition. If a point is found having a distance smaller than the nearest point determined so far, all partition lists can be pruned because all nodes with a larger minmaxdist cannot contain the nearest neighbor. A
problem of the R*-tree algorithm is that it traverses the index in a depth-first fashion. Subnodes are sorted before descent, but once a branch has been chosen, its processing has to be completed, even if sibling branches appear more likely to contain the NN. The algorithm therefore accesses more partitions than actually necessary.

In [13], Hjaltason and Samet propose an algorithm using PMR-Quadtrees. In contrast to the algorithm of Roussopoulos et al., partitions are visited ordered by their mindist. The mindist of a partition is the minimal distance from the query point $Q$ to any point $p$ inside the partition $P$. More formally:

$$MINDIST(P, Q) = \min_{p \in P} \|p - Q\|$$

The algorithmic principle of the method of Hjaltason and Samet can be applied to any hierarchical index structure which uses recursive and conservative partitioning. In Figure 1, we present a generalization of the algorithm which works for any hierarchical index structure. Pruning the partition list with a point NNC means that all partitions in the list which

have a mindist larger than the distance of NNC to the query point are removed from the list.

2.2 Optimality of Algorithm NN-opt

In this section, we show that the algorithm NN-opt (cf. Figure 1) is optimal. For this purpose, we need to define the minimal sphere around the query point containing the nearest neighbor:

Definition 1: (NN-sphere) Let $Q$ be a query point and $NN$ be the nearest neighbor of $Q$. Then $\text{NN-dist} = \|Q - NN\|$ is the distance between the nearest neighbor and the query point. The NN-sphere $SP^d(Q, r)$ of a query point $Q$ is defined as the sphere with center $Q$ and radius $r = \text{NN-dist}$.

Definition 2: (Optimality) An algorithm for nearest neighbor search is optimal if the pages accessed by the algorithm during the nearest neighbor search are exactly the pages that intersect the NN-sphere.

Note that we use the term optimality relative to an underlying index structure and not relative to the nearest neighbor problem itself.

Lemma 1: Algorithm NN-opt is an optimal algorithm according to Definition 2, i.e., algorithm NN-opt accesses exactly the partitions which intersect the NN-sphere but no other partitions.

Proof: From the correctness of algorithm NN-opt as provided in [13], it follows that any partition intersecting the NN-sphere is accessed during the search process. To show the minimality of the accessed partitions, let us assume that algorithm NN-opt accesses a partition $NA$ which does not intersect the NN-sphere, i.e., $mindist(NA) > r$. Let $NP_0$ be the partition (data page) containing the nearest neighbor, $NP_1$ be the partition containing $NP_0$, ..., and $NP_k$ be the partition in the root-page containing $NP_0, \dots, NP_{k-1}$. Thus, $r \ge mindist(NP_0) \ge \dots \ge mindist(NP_k)$. Consequently, $mindist(NA) > r \ge mindist(NP_0) \ge \dots \ge mindist(NP_k)$. Since $NP_k$ is in the root-page, $NP_k$ is replaced during the search process by $NP_{k-1}$ and so on, until $NP_0$ is loaded. If, as assumed, the algorithm accesses $NA$, $NA$ has to be on top of the partition list at some point during the search. Since $mindist(NA)$ is larger than the mindist of any partition containing the nearest
neighbor, $NA$ cannot be loaded until $NP_0$ has been loaded. If $NP_0$ is loaded, however, the algorithm prunes all partitions which have a mindist larger than $r$. Therefore, $NA$ is pruned and not accessed, which is in contradiction to the assumption.

3 The Cost Model

    $d$                        number of dimensions
    $N$                        number of data points
    $C_{eff}$                  average number of data points per index page
    $a$                        edge length of a data page
    $NP_i$                     partition of the index structure containing partitions $NP_0, \dots, NP_{i-1}$
    $Q$                        query point
    $DS$                       data space
    $SP^d(E, r)$               $d$-dimensional hypersphere with center $E$ and radius $r$
    $Vol^d_{Sp}(r)$            volume of a $d$-dimensional hypersphere
    $Vol^d_{avg}(r)$           average volume of a $d$-dimensional hypersphere, boundary effects considered
    $Vol^d_{Mink}(r)$          Minkowski sum of an index page and a query sphere with radius $r$
    $P(r)$, $p(r)$             distribution function of the radius, density function of the radius
    NN-dist, E(NN-dist)        nearest neighbor distance, expected nearest neighbor distance
    #pages, E(#pages)          number of page accesses, expected number of page accesses

The objective of our cost model is to provide accurate estimates of the execution time of nearest neighbor queries, including those on high-dimensional data. It is a well-known fact that simple queries, including nearest neighbor queries, are I/O-bound and only complex queries such as the spatial join may be CPU-bound. Therefore, it is justified to take the number of page accesses as a measure for the query performance. Our cost model may be used for optimizing the parameters of the index structures such as the block size as well as for query optimization.

3.1 Previous Approaches and their Problems

Due to the high practical relevance of nearest neighbor queries, cost models for estimating the number of necessary page accesses were already proposed several years ago. The first approach is the well-known cost model proposed by Friedman, Bentley and Finkel [12]. The assumptions of the

model, however, are unrealistic for nearest neighbor queries on high-dimensional data, since $N$ is assumed to converge to infinity and boundary effects are not considered. The model by Cleary [7] extends the Friedman, Bentley and Finkel model by allowing non-rectangular-bounded pages, but still does not account for boundary effects. Sproull [25] uses the existing models for optimizing the nearest neighbor search in high dimensions and shows that the number of data points must be exponential in the number of dimensions for the models to provide accurate estimates. According to [25], boundary effects significantly contribute to the costs unless the following condition holds:

$$N \gg C_{eff} \cdot \left( \sqrt[d]{\frac{1}{C_{eff} \cdot Vol^d_{Sp}(\tfrac{1}{2})}} + 1 \right)^d$$

where $Vol^d_{Sp}(r)$ is the volume of a hypersphere with radius $r$, which can be computed as

$$Vol^d_{Sp}(r) = \frac{\sqrt{\pi^d}}{\Gamma(d/2 + 1)} \cdot r^d$$

with $\Gamma(x + 1) = x \cdot \Gamma(x)$, $\Gamma(1) = 1$ and $\Gamma(\tfrac{1}{2}) = \sqrt{\pi}$.

Unfortunately, the assumptions made in the existing models do not hold in the high-dimensional case. The main reason for the problems of the existing models is that they do not account for boundary effects. The term boundary effects refers to the exceptional performance behavior that occurs when the query reaches the boundary of the data space. As we show later, boundary effects occur frequently in high-dimensional data spaces and lead to a pruning of major amounts of empty search space, which is not considered by the existing models. To examine these effects, we performed experiments to compare the necessary page accesses with the model estimates. Figure 2 shows the real page counts versus the estimates of the Friedman, Bentley and Finkel model. For high-dimensional data, the model completely fails to estimate the number of page accesses.

In a very recent work [19], Papadopoulos and Manolopoulos present an analysis of nearest neighbor queries using R-trees. In a recent paper [3], Arya, Mount, and Narayan develop a model that is capable of accounting for boundary effects. The problem of the Arya approach, however, is that the
model still assumes $N$ to be growing exponentially with the dimension and it also uses the $L_\infty$ metric, which is not suitable for most database applications. Note that our model also confirms the earlier results of Yao and Yao [28].

3.2 Overview of our Cost Model

The main objective of this paper is to present a new cost model for nearest neighbor queries in high dimensions. In contrast to existing models, our cost model provides accurate estimates of the number of page accesses in the high-dimensional case since it accounts for boundary effects. Furthermore, our model is based on the optimal algorithm for nearest neighbor search (cf. subsection 2.2) and works for an arbitrary number of data points. For the presentation of our cost model, we first assume that the data is uniformly distributed and that the split is performed in a k-d-tree fashion. We will show later that our model is also applicable to arbitrary data distributions and a wide range of index structures such as k-d-trees, R-trees, quadtrees, Z-indices, etc.

The goal of our model is to determine the expected number of pages which have to be accessed in performing a nearest neighbor query. The number of data pages which have to be accessed can be determined by intersecting all pages with the minimal sphere around the query point containing the nearest neighbor. The first step in developing the cost function is to determine the average portion of a query sphere with a given query radius which is inside the data space. Note that the data space is assumed to be normalized to the unit hypercube $[0,1]^d$. Then, we determine the expected radius of the sphere, which can be described as a stochastic variable. Taking boundary effects into account, we derive the distribution function, probability density, and expected value of the nearest neighbor distance (cf. subsection 3.3). In the next step, we have to determine the number of pages intersected by the query sphere. For this purpose, we require the Minkowski sum of the query sphere and the shape of an index page (e.g., the bounding box of the page in case
of the R-tree). Due to boundary effects, portions of the volume of the Minkowski sum are outside of the data space, and therefore we have to introduce some modifications to the standard Minkowski sum (cf. subsection 3.4). The last step is the integration of the separate steps into the cost function. For determining the expected number of page accesses, we have to form the weighted average of the costs associated with the nearest neighbor distances, weighted by the probability of their occurrence. The details are provided in subsection 3.5.

Figure 2: Real Page Counts versus Estimates by Model [12] (x-axis: dimension; y-axis: page counts; series: measured page accesses, Bentley model)

3.3 Expected Nearest Neighbor Distance

The goal of this subsection is to determine the expected distance between a query point and its nearest neighbor in a database of $N$ points. Before we are able to solve this problem, however, we first consider a simpler problem, namely the expected distance of two uniformly distributed points (one query point and one data point) in the data space.

Let us first assume that the data point (data entry $E$) has a fixed position $E = [e_1, e_2, \dots, e_d]$. Then, the probability that the distance from the query point $Q = [q_1, q_2, \dots, q_d]$ is less than $r$ can be modeled as the volume of the hypersphere around $E$ with radius $r$. If point $E$ is close to the border of the data space [$\exists i \in \{1, \dots, d\} : (r > e_i) \lor (e_i > 1 - r)$], we have to consider that part of the hypersphere volume is outside of the data space and does not contribute to the probability. The volume of the intersection of the data space and the hypersphere can be expressed as the integral of a piecewise defined function integrated over all possible positions of $Q$:

$$Vol(SP^d(E, r) \cap DS) = \int_{DS} \begin{cases} 1 & \text{if } \|E - Q\| \le r \\ 0 & \text{otherwise} \end{cases} \; dQ$$

where

$$\|E - Q\| \le r \iff \sum_{i=1}^{d} (e_i - q_i)^2 \le r^2 \quad \text{and} \quad \int_{DS} f(X)\, dX = \int_0^1 \cdots \int_0^1 f(x_1, \dots, x_d)\, dx_1 \cdots dx_d$$

If we assume that the data point is also randomly taken from the data space, the above formula has to be averaged over all possible positions of $E$ (as $DS$ is $[0,1]^d$, the denominator of the average is 1):

$$Vol^d_{avg}(r) = \int_{DS} Vol(SP^d(E, r) \cap DS)\, dE$$

Note that $Vol^d_{avg}(r)$ corresponds to the probability $P(\|E - Q\| \le r)$.

To determine the expected distance between a query point and its nearest neighbor in a database of $N$ points, we have to determine the probability distribution of the minimum distance between query and data points. The probability that the nearest neighbor distance is at most $r$ can also be described by the opposite: none of the $N$ data points is in the intersection of $DS$ and the NN-sphere. The corresponding distribution function $P(r)$ is therefore:

$$P(r) = 1 - (1 - Vol^d_{avg}(r))^N$$

The density function $p(r)$ of $P(r)$ can be derived by determining the derivative of this function:

$$p(r) = \frac{dP(r)}{dr} = \frac{d\, Vol^d_{avg}(r)}{dr} \cdot N \cdot (1 - Vol^d_{avg}(r))^{N-1}$$

From this, we obtain the expected nearest neighbor distance by the integral

$$E(\text{NN-dist}) = \int_0^\infty r \cdot p(r)\, dr = N \int_0^\infty r \cdot \frac{d\, Vol^d_{avg}(r)}{dr} \cdot (1 - Vol^d_{avg}(r))^{N-1}\, dr$$

In section 4, we will show that this formula may be used to accurately predict the expected nearest neighbor distance.

3.4 Number of Pages Intersected by the Query Sphere

In this subsection, we now determine the number of pages intersected by a query sphere with a given radius $r$. For this purpose, we have to determine the Minkowski sum of the query sphere and the index pages. As can be seen in Figure 3, the concept of the Minkowski sum transforms a spherical query on a set of boxes into an equivalent point query on a set of enlarged objects. The Minkowski sum directly corresponds to the volume of the intersected pages. Graphically presented, the Minkowski sum describes the volume which results from moving the center of the query sphere over the surface of the bounding box of the index page (cf. Figure 4 for an example of the two-dimensional Minkowski sum).

Figure 3: Transforming a Spherical Query into a Point Query by the Concept of Minkowski Sum

Figure 4: Example of the Minkowski Sum in Two Dimensions

For calculating the Minkowski sum, we have to consider volumes of each dimension between 1 and $d$ which result from the different faces of the bounding box. If the index page is a bounding box with an extension $a$ in all dimensions, the Minkowski sum may be calculated as

$$Vol^d_{Mink}(r) = \sum_{i=0}^{d} \binom{d}{i} \cdot a^{d-i} \cdot Vol^i_{Sp}(r)$$

The Minkowski sum is the expected value of the hyper-volume of the bounding boxes of the data pages which are intersected by the NN-sphere. The expected value of the number of data pages can easily be determined by normalizing the Minkowski sum using the volume of the bounding box:

$$\#Pages(r) = \frac{Vol^d_{Mink}(r)}{a^d}$$

The Minkowski sum, however, does not consider boundary effects, which occur in high-dimensional space because $r$ becomes large and portions of the volume of the Minkowski sum are outside of the data space. To obtain a more realistic model for the high-dimensional case, we have to introduce some modifications to the Minkowski sum. Similar to the case described in the previous subsection, we integrate over the data space and determine the intersection of partition $B$ with the query sphere around $Q$:

$$Vol_{Mink}(r) = \int_{DS} \begin{cases} 1 & \text{if } MINDIST(B, Q) \le r \\ 0 & \text{otherwise} \end{cases} \; dQ$$

If $B$ is a rectilinear bounding box with a lower corner $[b^l_1, \dots, b^l_d]$ and an upper corner $[b^u_1, \dots, b^u_d]$, MINDIST may be computed as

$$MINDIST(B, Q) = \sqrt{\sum_{i=1}^{d} \begin{cases} 0 & \text{if } b^l_i \le q_i \le b^u_i \\ (b^l_i - q_i)^2 & \text{if } q_i < b^l_i \\ (b^u_i - q_i)^2 & \text{otherwise} \end{cases}}$$

To determine the Minkowski sum according to this formula, we would need a stochastic model for the parameters $b^l_i$ and $b^u_i$ of the index pages. In practical experiments, we observe that in high-dimensional space usually one of the two parameters, $b^l_i$ or $b^u_i$, coincides with one of the borders of the data space, which results from the fact that each dimension has been split at most once. If all dimensions are of about the same significance, the
split algorithm has to use all dimensions as split axes in order to obtain a high selectivity. In this case, it is practically impossible in a high-dimensional space to obtain more than one split per dimension, since the number of data points does not increase exponentially with the dimension. In general, the number of data points is not even high enough for all dimensions to be split once. Therefore, without loss of generality, we may assume that only the first $d'$ dimensions have been split, at position $s_i$ in dimension $i$ ($1 \le i \le d'$). $d'$ may be determined as

$$d' = \log_2 \frac{N}{C_{eff}}$$

The Minkowski sum over all index pages, which directly corresponds to the average number of pages intersected by the query sphere, can be determined as

$$\#Pages(r) = \sum_{k=0}^{d'} \; \sum_{\{i_1, \dots, i_k\} \in \mathcal{P}(\{1, \dots, d'\})} Vol(SP^k([s_{i_1}, \dots, s_{i_k}], r) \cap DS)$$

For each $k$, the partitions have some $(d-k)$-dimensional faces inside $DS$. At these faces, a hyper-cylinder arises which is spherical in $k$ dimensions (with radius $r$) and cubical in the remaining dimensions (with side-length 1). The spherical part may be intersected with $DS$, and only this intersection is relevant. The second sum iterates over all elements of the power set of $\{1, \dots, d'\}$ and thus selects exactly all possible $k$-dimensional projections of the split dimensions, encountering all possible cylinders.

For uniformly distributed data, the $s_i$ are all at the same position ($s_{i_j} = \tfrac{1}{2}$ for all $j$). In this case, the formula becomes

$$\#Pages(r) = \sum_{k=0}^{d'} \; \sum_{\{i_1, \dots, i_k\} \in \mathcal{P}(\{1, \dots, d'\})} Vol(SP^k([\tfrac{1}{2}, \dots, \tfrac{1}{2}], r) \cap DS)$$

As the volume of all $k$-dimensional cylinders is identical now, we may simplify the formula to:

$$\#Pages(r) = \sum_{k=0}^{d'} \binom{d'}{k} \cdot Vol(SP^k([\tfrac{1}{2}, \dots, \tfrac{1}{2}], r) \cap DS)$$
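The simplified formula for uniformly distributed data can be evaluated numerically. The following Python sketch is an illustration under stated assumptions, not the authors' implementation: it estimates the clipped sphere volume by random sampling in the $k$-dimensional unit cube and sums the terms weighted by the binomial coefficients. The function names `clipped_sphere_volume`, `split_dimensions` and `pages_intersected`, the sample count, the fixed seed, and the rounding of $d'$ up to an integer are choices made for this example only.

```python
import math
import random


def clipped_sphere_volume(k, r, samples=20000, seed=0):
    """Sampling estimate of Vol(SP^k([1/2,...,1/2], r) ∩ [0,1]^k)."""
    if k == 0:
        return 1.0  # the 0-dimensional term contributes one page
    rng = random.Random(seed)
    r2 = r * r
    hits = 0
    for _ in range(samples):
        # squared distance of a uniform point in [0,1]^k to the cube center
        dist2 = sum((rng.random() - 0.5) ** 2 for _ in range(k))
        if dist2 <= r2:
            hits += 1
    return hits / samples


def split_dimensions(n_points, c_eff):
    """d' = log2(N / C_eff); rounding up to an integer is an assumption."""
    return max(0, math.ceil(math.log2(n_points / c_eff)))


def pages_intersected(r, n_points, c_eff, samples=20000):
    """#Pages(r) = sum over k of C(d', k) * Vol(SP^k([1/2,...,1/2], r) ∩ DS)."""
    d_split = split_dimensions(n_points, c_eff)
    return sum(
        math.comb(d_split, k) * clipped_sphere_volume(k, r, samples)
        for k in range(d_split + 1)
    )
```

Two limiting cases are a quick plausibility check: for very small $r$ only the $k = 0$ term survives, so about one page is intersected, and for $r$ larger than the cube diameter every clipped volume is 1, so the sum becomes $2^{d'} = N / C_{eff}$, the total number of data pages.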

3.5 Expected Number of Page Accesses

In the previous section, we developed a model to determine the number of page accesses for a query sphere with a given radius. The goal of this section is to determine the expected number of page accesses for a nearest neighbor query. To do so, we have to integrate the number of page accesses over the radius, multiplied by the probability with which the radius occurs. More formally, the expected number of page accesses for a nearest neighbor query, E(#Pages), may be determined as

$$E(\#Pages) = \int_0^\infty \#Pages(r) \cdot p(r)\, dr$$

If we integrate the partial results from subsections 3.3 and 3.4, we obtain the formula presented in Figure 5.

$$E(\#Pages) = N \int_0^\infty \frac{d\, Vol^d_{avg}(r)}{dr} \cdot (1 - Vol^d_{avg}(r))^{N-1} \cdot \sum_{k=0}^{d'} \; \sum_{\{i_1, \dots, i_k\} \in \mathcal{P}(\{1, \dots, d'\})} Vol(SP^k([s_{i_1}, \dots, s_{i_k}], r) \cap DS)\, dr$$

Figure 5: Cost Formula for the Expected Number of Page Accesses

4 Experimental Evaluation

In this section, we first describe the implementation of our cost model presented in section 3. Then, we describe the experiments conducted to show the practical applicability of our cost model and provide a short interpretation of the experimental results.

4.1 Implementation of the Cost Model

In subsection 3.3, we presented an integral formula to determine the volume of the intersection between the data space and a query sphere with radius $r$. This volume integral can be evaluated easily using numerical integration. Among the various methods, the so-called Monte Carlo integration is best suited in the high-dimensional case. Monte Carlo integration [14] is based on the principle of randomization and can be concisely described as follows: The volume of a complex object corresponds directly to the probability that a point, randomly selected from the data space, is inside this object. Therefore, an approximation of the volume can be gained by selecting a number of points and measuring the fraction of points inside the object. Note that Monte Carlo integration may be used for arbitrary data distributions. We use a variation of this technique to
determine the volume functions $Vol_{avg}(r)$ and $Vol\left(SP^k((\tfrac{1}{2}, \ldots, \tfrac{1}{2}), r)\right)$ as well as the corresponding derivative for the required ranges of $d$ and $r$. These functions are independent of individual parameters such as the number of points in the database or the capacity or geometrical shape of the data pages, and are thus universally applicable for all subsequent cost computations. The expected value of the NN-distance can then be efficiently integrated from the precomputed function $Vol_{avg}(r)$ by the extended trapezoidal rule. The same applies to the cost function.

4.2 Experiments

To show the accuracy of our model, we made several experiments on both synthetic and real data. We integrated the algorithm NN-opt into an implementation of the well-known Hilbert index [11] and into the original implementation of the X-tree [5]. The Hilbert index maps $d$-dimensional points to a one-dimensional space, which is then indexed by a B+-tree. According to subsection 2.1, the algorithm NN-opt first examines the partition (given by a range of Hilbert values) with the lowest MINDIST during the search process. The X-tree is an R-tree-like multidimensional index structure which has been especially designed for indexing high-dimensional data.

Our cost model is based on an estimation of the radius of the NN-sphere. To show the accuracy of our model, we compared the average nearest neighbor distance of a uniformly distributed data set with the radius estimated by our model. For the experiments, we varied $d$ from 2 to 16 using up to 369,000 data points. We averaged the radius over 100 NN-queries and found our expected nearest neighbor distance perfectly confirmed (cf. Figure 6).

To evaluate the accuracy of our cost function and its applicability to various index structures, we performed several experiments. In the first experiment, we fixed the dimension to 16 and varied the number of uniformly distributed data points from 93,000 to 2,976,000. In this experiment, we used the Hilbert index with a B+-tree page size of 32 KBytes, which implies an effective capacity of 360
data objects per data page.

[Plot: NN-distance vs. dimension (2 to 16); curves: measured distance, our model]
Figure 6: Expected NN-Distance Depending on the Dimension

[Plot: page accesses vs. number of data points N; curves: Hilbert, our model]
Figure 7: Expected Number of Page Accesses and Hilbert Index Performance Depending on the Number of Data Points

The experiment confirmed our cost model up to a relative error of 5-8% (cf. Figure 7). This remaining error is due to the impact of the specific split behavior, which is difficult to include in any formal model.

In the experiment shown in Figure 8, we compared our cost model to the performance of the X-tree with a fixed number of data pages and varying dimensionality ($2 \le d \le 50$). The performance of the X-tree is slightly better than the estimate of our cost model. The reason for the better performance is that the X-tree ignores dead space, i.e. parts of the data space which are not covered by any partition. As the experiments show, however, the estimates of our model are sufficiently close to the real performance of the X-tree. Even for low and medium dimensions, the accuracy of our model is much better than that of the model of Friedman, Bentley and Finkel. Note that in general our model is also applicable to R-tree-like index structures, especially in higher dimensions.

To show the practical relevance of our approach, we also performed experiments using real data. The test data used for the experiments originate from a real database consisting of high-dimensional Fourier points. Each 16-dimensional Fourier point corresponds to a region of a CAD model describing an industrial part. We stored the Fourier points in the Hilbert index and performed 100 random nearest neighbor queries. Since in general the actual dimensionality of a real data set is lower than the formal dimensionality [10], we have to use the fractal dimension of the Fourier database for $d$ in our model.

[Plot: page accesses vs. dimension; curves: X-tree, FBF-model, our model]
Figure 8: Expected Number of Page Accesses and Measured X-tree
Performance Depending on the Dimension

[Plot: page accesses for N = 46,000, N = 92,000 and N = 184,500; series: model, Hilbert]
Figure 9: Application of the Cost Model to Real Data

We therefore determined the fractal dimension of the Fourier data set, which is 10.56. Using 10 as the dimension in our model, we get an accurate estimation of the page accesses. Figure 9 shows the result of some experiments using different numbers N of data items.

5 Conclusion

In this paper, we presented a new cost model for nearest neighbor queries in high-dimensional data space using conservative recursive index structures such as the R-tree, k-d-B-tree or quadtree. Our cost model is accurate even in high dimensions, where other models completely fail, because our model considers boundary effects. As a further advantage, our model uses the Euclidean metric, which is relevant to many database applications. We showed the applicability and accuracy of our model by presenting the results of various experiments on both synthetic and real data sets, comparing our predictions with the performance of the X-tree and Hilbert-based indices. Whereas previous models such as the model by Friedman, Bentley and Finkel overestimate the cost by orders of magnitude in high dimensions, our model is accurate up to a moderate relative error.

Our further research will focus on the extension of our model to k-nearest neighbor queries. In addition, we plan to perform a theoretically well-founded analysis of various index structures for high-dimensional data.

References

[1] Altschul S F, Gish W, Miller W, Myers E W, Lipman D J: A Basic Local Alignment Search Tool, Journal of Molecular Biology, Vol. 215, No. 3, 1990, pp. 403-410.
[2] Arya S: Nearest Neighbor Searching and Applications, PhD thesis, University of Maryland, College Park, MD, 1995.
[3] Arya S, Mount D M, Narayan O: Accounting for Boundary Effects in Nearest Neighbor Searching, Proc. 11th Annual Symposium on Computational Geometry, Vancouver, Canada, 1995, pp. 336-344.
[4] Beckmann N, Kriegel H-P, Schneider R, Seeger B:

The R*-tree: An Efficient and Robust Access Method for Points and Rectangles, Proc. ACM SIGMOD Int. Conf. on Management of Data, Atlantic City, NJ, 1990, pp. 322-331.
[5] Berchtold S, Keim D, Kriegel H-P: The X-tree: An Index Structure for High-Dimensional Data, 22nd Conf. on Very Large Databases, 1996, Bombay, India.
[6] Berchtold S, Keim D, Kriegel H-P: Fast Searching for Partial Similarity in Polygon Databases, accepted for publication: VLDB Journal, 1996.
[7] Cleary J G: Analysis of an Algorithm for Finding Nearest Neighbors in Euclidean Space, ACM Transactions on Mathematical Software, Vol. 5, No. 2, June 1979, pp. 183-192.
[8] Eastman C M: Optimal Bucket Size for Nearest Neighbor Searching in k-d Trees, Information Processing Letters, Vol. 12, No. 4, 1981.
[9] Faloutsos C, Barber R, Flickner M, Hafner J, et al.: Efficient and Effective Querying by Image Content, Journal of Intelligent Information Systems, 1994, Vol. 3, pp. 231-262.
[10] Faloutsos C, Gaede V: Analysis of n-dimensional Quadtrees Using the Hausdorff Fractal Dimension, Proc. ACM SIGMOD Int. Conf. on Management of Data, 1996.
[11] Faloutsos C, Roseman S: Fractals for Secondary Key Retrieval, Proc. 8th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 1989, pp. 247-252.
[12] Friedman J H, Bentley J L, Finkel R A: An Algorithm for Finding Best Matches in Logarithmic Expected Time, ACM Transactions on Mathematical Software, Vol. 3, No. 3, September 1977, pp. 209-226.
[13] Hjaltason G R, Samet H: Ranking in Spatial Databases, Proc. 4th Int. Symp. on Large Spatial Databases, Portland, ME, 1995, pp. 83-95.
[14] Kalos M H, Whitlock P A: Monte Carlo Methods, Wiley, New York, 1986.
[15] Kukich K: Techniques for Automatically Correcting Words in Text, ACM Computing Surveys, Vol. 24, No. 4, 1992, pp. 377-440.
[16] Jagadish H V: A Retrieval Technique for Similar Shapes, Proc. ACM SIGMOD Int. Conf. on Management of Data, 1991, pp. 208-217.
[17] Mehrotra R, Gary J E: Feature-Based Retrieval of Similar Shapes, Proc. 9th Int. Conf. on Data Engineering, Vienna,
Austria, 1993, pp. 108-115.
[18] Mehrotra R, Gary J E: Feature-Index-Based Similar Shape Retrieval, Proc. of the 3rd Working Conf. on Visual Database Systems, March 1995.
[19] Papadopoulos A, Manolopoulos Y: Performance of Nearest Neighbor Queries in R-Trees, Proc. of the 6th International Conference on Database Theory, Delphi, Greece, 1997, LNCS 1186, pp. 394-408.
[20] Preparata F P, Shamos M I: Computational Geometry, Chapter 5 ("Proximity: Fundamental Algorithms"), Springer Verlag New York, 1985, pp. 185-225.
[21] Ramasubramanian V, Paliwal K K: Fast k-dimensional Tree Algorithms for Nearest Neighbor Search with Application to Vector Quantization Encoding, IEEE Transactions on Signal Processing, Vol. 40, No. 3, March 1992, pp. 518-531.
[22] Roussopoulos N, Kelley S, Vincent F: Nearest Neighbor Queries, Proc. ACM SIGMOD Int. Conf. on Management of Data, 1995, pp. 71-79.
[23] Shawney H, Hafner J: Efficient Color Histogram Indexing, Proc. Int. Conf. on Image Processing, 1994, pp. 66-70.
[24] Shoichet B K, Bodian D L, Kuntz I D: Molecular Docking Using Shape Descriptors, Journal of Computational Chemistry, Vol. 13, No. 3, 1992, pp. 380-397.
[25] Sproull R F: Refinements to Nearest Neighbor Searching in k-dimensional Trees, Algorithmica, 1991, pp. 579-589.
[26] Wallace T, Wintz P: An Efficient Three-Dimensional Aircraft Recognition Algorithm Using Normalized Fourier Descriptors, Computer Graphics and Image Processing, Vol. 13, pp. 99-126, 1980.
[27] Welch T: Bounds on the Information Retrieval Efficiency of Static File Structures, Technical Report 88, MIT, June 1971.
[28] Yao A C, Yao F F: A General Approach to D-Dimensional Geometric Queries, Proc. of the ACM Symposium on Theory of Computing, 1985.