Finding the k-closest pairs in metric spaces

Hisashi Kurasawa, The University of Tokyo, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan (kurasawa@nii.ac.jp)
Atsuhiro Takasu, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan (takasu@nii.ac.jp)
Jun Adachi, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan (adachi@nii.ac.jp)

ABSTRACT

We investigated the problem of reducing the cost of searching for the k closest pairs in metric spaces. In general, a k-closest pair search method initializes the upper bound distance to the kth closest pair as infinity and repeatedly tightens it whenever it finds a pair of objects whose distance is shorter than the current bound. It prunes dissimilar pairs whose distances, estimated from the distances between a pivot and the objects via the triangle inequality, exceed the upper bound distance. The cost of a k-closest pair query is therefore smaller when the upper bound distance is shorter and when the distribution of distances between the pivot and the objects is sparser. We propose a new divide-and-conquer-based k-closest pair search method for metric spaces, called Adaptive Multi-Partitioning (AMP). AMP repeatedly divides and conquers objects starting from the part of the space whose distance distribution is sparser, which speeds up the convergence of the upper bound distance before the denser part is partitioned. As a result, AMP can prune many more dissimilar pairs than an ordinary divide-and-conquer-based method. We compare our method with other partitioning methods and show that AMP reduces the number of distance computations.

Categories and Subject Descriptors

H.2.4 [Database Management]: Systems - Multimedia databases; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - Indexing methods

1. INTRODUCTION

It is important to be able to enumerate similar pairs of objects in a data set. This has various applications such as record linkage, data mining, multimedia databases, and geographical information systems. There are two similar-object-pair enumeration problems: similarity join and k-closest pair finding. The former finds pairs of objects whose distances are shorter than a specified upper bound, and the latter finds the top-k closest object pairs in a data set for a given number k. The similarity join query potentially returns no pairs, or very many pairs, when it is difficult to define an appropriate upper bound distance. Even in such a case, the k-closest pair query answers a fixed number of closest pairs. We propose a fast method for the k-closest pair query in metric spaces.

The k-closest pair query can be solved using a nested loop method. However, this naive method requires $\binom{N}{2}$ distance computations, where $N$ is the number of objects. Such a high search cost leads to poor scalability.
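For concreteness, here is a minimal sketch of this nested-loops baseline (illustrative Python, not the paper's C implementation; the function and parameter names are ours). It keeps the k shortest distances seen so far in a max-heap, whose root is exactly the upper bound distance u used throughout the paper.

import heapq

def nested_loops_kcp(objs, d, k):
    """Naive k-closest pair search over all C(N, 2) pairs.

    Distances are negated so Python's min-heap acts as a max-heap;
    the heap root is then the current kth closest distance, i.e.,
    the upper bound distance u.
    """
    heap = []  # entries are (-distance, i, j)
    for i in range(len(objs)):
        for j in range(i + 1, len(objs)):
            dist = d(objs[i], objs[j])
            if len(heap) < k:
                heapq.heappush(heap, (-dist, i, j))
            elif dist < -heap[0][0]:  # shorter than the current bound
                heapq.heapreplace(heap, (-dist, i, j))
    return sorted((-nd, i, j) for nd, i, j in heap)

For example, nested_loops_kcp([(0, 0), (1, 0), (3, 0)], lambda a, b: abs(a[0] - b[0]), 2) returns [(1, 0, 1), (2, 1, 2)], i.e., the two closest pairs with their distances.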
A divide-and-conquer method reduces the cost to $O(N (\log N)^{d-1})$ in $d$-dimensional Euclidean spaces. Such a method consists of the following steps.

Divide: Partition a region into two sub-regions using a hyperplane that perpendicularly crosses an axis at the mean value.
Conquer: Recursively search for closest pairs in each sub-region and update the upper bound distance to the kth closest pair.
Combine: Find closest pairs that have one object in each sub-region, considering only objects that lie within the upper bound distance of the partitioning boundary.

Unfortunately, the divide-and-conquer method is more complicated in metric spaces than in Euclidean spaces. Coordinates are not defined in metric spaces and cannot be used for partitioning a region. Instead, ball partitioning is generally used [8, 9]. Ball partitioning selects one object as a pivot and divides a region based on the distance from the pivot. It requires $N - 1$ distance computations, where $N$ is the number of objects. Thus, we need to reduce the computational cost in the divide step as well as in the conquer step. Furthermore, the distribution of distances between the pivot and the objects is often skewed: many objects tend to reside near the partitioning boundary when ball partitioning divides a region into two sub-regions at the median distance. The distance distribution should therefore also be taken into account. However, existing methods that use ball partitioning do not address these issues. We developed a new partitioning method that, by exploiting the distance distribution, can prune more objects with fewer pivots than existing methods.

We propose a new divide-and-conquer-based k-closest pair search method in metric spaces, called Adaptive Multi-Partitioning (AMP). AMP is based on the following observations:

- The divide-and-conquer method works well with multi-partitioning in which the intervals exceed the upper bound distance to the kth closest pair.
- It is more difficult to prune dissimilar objects in regions where the distances from the pivot are dense than in regions where they are sparse.
- The upper bound distance u is updated, and decreases, whenever pairs of objects with distances shorter than u are found. A shorter upper bound distance can prune more dissimilar objects.

AMP iteratively and recursively divides a region using a sparse-region-first strategy. For a sparse region, it calculates the k closest pairs. It then sets the interval of the next region to the maximum distance among the currently obtained k closest pairs. As a result, the space is partitioned multiple times for each pivot, and the upper bound distance is expected to decrease before the dense regions are searched.

The contributions of this paper are as follows. We propose a novel k-closest pair search method that reduces the number of distance computations. The method uses only the triangle inequality, so it can handle every distance function that satisfies the metric space postulates. We conducted experiments on several real datasets and demonstrated that our method outperforms other methods.

The rest of this paper is organized as follows. Section 2 formally defines the problem we focus on in this paper. Section 3 overviews related work. Section 4 describes AMP in detail. Section 5 shows the experimental results, and Section 6 concludes the paper.

2. PROBLEM STATEMENT

Our k-closest pair search method deals with metric spaces, which are defined as follows.

Definition 1 (Metric space). Let $M = (D, d)$ be a metric space defined over a domain of objects $D$ with a distance function $d : D \times D \to \mathbb{R}$. $M$ satisfies the following postulates [17]:

  $\forall x, y \in D,\ d(x, y) \ge 0$  (non-negativity)
  $\forall x, y \in D,\ d(x, y) = d(y, x)$  (symmetry)
  $\forall x, y \in D,\ x = y \Leftrightarrow d(x, y) = 0$  (identity)
  $\forall x, y, z \in D,\ d(x, z) \le d(x, y) + d(y, z)$  (triangle inequality)   (1)

Examples of the distance function $d$ are the Minkowski distances, the Jaccard coefficient, the Hausdorff distance, and the edit distance.

A k-closest pair query is defined as follows.

Definition 2 (k-closest pair query). Given a set of objects $S$ in $D$, a k-closest pair query with threshold $k$ returns the object pair set $A$ and the upper bound distance $u$, where $A$ and $u$ satisfy

  $A \subseteq S \times S$
  $|A| = k$
  $\forall (x, y) \in A,\ \forall (a, b) \in (S \times S) \setminus A,\ d(x, y) \le d(a, b)$
  $u = \max\{d(x, y) \mid (x, y) \in A\}$   (2)

That is, $A$ consists of the k most similar object pairs in $S$. We focus on the self k-closest pair query case. The k-closest pair query is sometimes called the top-k similarity join query or kCPQ. Table 1 summarizes the symbols used in this article.

Table 1: Notation

  Notation         Meaning
  $M$              metric space
  $D$              domain of objects
  $d$              distance function
  $k$              query parameter of the k-closest pair method
  $u$              upper bound distance to the kth closest pair
  $q$              query object
  $o, a, b, x, y$  objects
  $p$              pivot
  $S, X, Y$        object sets
  $A$              (temporary) result set

3. RELATED WORK

We start with pruning techniques in metric spaces. Because coordinates are not explicitly defined in metric spaces, all pruning techniques use the distances from a pivot to the objects together with the triangle inequality. A simple partitioning technique is ball partitioning, which uses only one pivot and divides a region into two sub-regions based on the distance from each object to the pivot [15]. It is used to construct similarity search indexes for range searches and k-nearest neighbor searches [7, 5].
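The pruning power of a pivot comes from a direct corollary of the triangle inequality: for any pivot p and objects x, y, we have |d(p, x) - d(p, y)| <= d(x, y), so once the distances from p are known, any pair whose lower bound reaches the current upper bound distance u can be discarded without computing d(x, y). A minimal sketch (our own illustration, in Python):

def can_prune(d_px, d_py, u):
    """Pivot-based pruning test.

    d_px = d(p, x) and d_py = d(p, y) are precomputed distances
    from a pivot p. By the triangle inequality,
    |d(p, x) - d(p, y)| <= d(x, y), so the pair (x, y) cannot be
    closer than u whenever this lower bound is at least u.
    """
    return abs(d_px - d_py) >= u

The partitioning techniques below are ways of bucketing objects by d(p, o) so that this test eliminates whole groups of pairs at once.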
Quickjoin [9] modifies ball partitioning for similarity join searches as follows.

Definition 3 (Modified ball partitioning). For a metric space $M = (D, d)$, suppose a pivot $p$ divides a set $X$ of objects in $D$ into two regions:

  $X_1 = \{o \in X \mid d(o, p) \le r_p\}$   (3)
  $X_2 = \{o \in X \mid d(o, p) > r_p\}$   (4)

Furthermore, consider the following subsets near the partitioning boundary:

  $X'_1 = \{o \in X_1 \mid d(o, p) > r_p - u\}$   (5)
  $X'_2 = \{o \in X_2 \mid d(o, p) \le r_p + u\}$   (6)

where $u$ is an upper bound distance. When searching for object pairs in $X$ whose distance is shorter than $u$, it is sufficient to check object pairs within $X_1$, within $X_2$, and in $X'_1 \times X'_2$. Similarly, suppose $p$ also divides an object set $Y$ in $D$:

  $Y_1 = \{o \in Y \mid d(o, p) \le r_p\}$   (7)
  $Y_2 = \{o \in Y \mid d(o, p) > r_p\}$   (8)
  $Y'_1 = \{o \in Y_1 \mid d(o, p) > r_p - u\}$   (9)
  $Y'_2 = \{o \in Y_2 \mid d(o, p) \le r_p + u\}$   (10)

When searching for object pairs in $X \times Y$ whose distance is shorter than $u$, it is sufficient to check object pairs in $X_1 \times Y_1$, $X_2 \times Y_2$, $X'_1 \times Y'_2$, and $X'_2 \times Y'_1$. Modified ball partitioning can prune more dissimilar object pairs when the upper bound distance is smaller and when fewer objects lie near the partitioning boundary. A sketch of this partitioning rule is given below.
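As an illustration of Definition 3 (our Python, with precomputed pivot distances passed in as a dictionary; the names are ours), the following sketch splits one set and extracts the two boundary windows:

def modified_ball_partition(X, d_p, r_p, u):
    """Modified ball partitioning for a single object set X.

    d_p maps each object to its precomputed distance from the
    pivot p. Pairs at distance < u can only occur within X1,
    within X2, or between the boundary windows X1w (= X'_1) and
    X2w (= X'_2); all remaining cross pairs are pruned.
    """
    X1 = [o for o in X if d_p[o] <= r_p]
    X2 = [o for o in X if d_p[o] > r_p]
    X1w = [o for o in X1 if d_p[o] > r_p - u]   # X'_1, Eq. (5)
    X2w = [o for o in X2 if d_p[o] <= r_p + u]  # X'_2, Eq. (6)
    return X1, X2, X1w, X2w

The recursion then descends into X1 and X2 and additionally checks X1w against X2w, which is exactly where a skewed distance distribution hurts: when many objects crowd the boundary r_p, the windows stay large and little is pruned.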
Now let us briefly survey k-closest pair search methods. Because the similarity join is very similar to the k-closest pair query, we review studies on both problems. Previously proposed similarity join methods for metric spaces fall into two categories.

The first is the index-based method. This method constructs an index and inserts all the objects in the dataset into it. For each object, it then searches for other objects whose distances are within the query range, using the same procedure as a range search on the index. All existing index-based studies focus on improving the index. While general metric indexes are designed to handle arbitrary distances in range queries [10, 18, 11], the indexes for similarity join queries assume fixed-distance range queries. eD-index [8] constructs an index that extends D-index [7]. It differs from D-index in that, while dividing a region into sub-regions as in modified ball partitioning, it replicates objects whose distances from the partitioning boundary are within the query range. This replication technique lets eD-index execute the similarity join query independently within each separated region. The list of twin clusters (LTC) [13] is another extension of similarity search indexes. LTC is based on the list of clusters (LC) [5]. It is designed to resolve range queries, similarity joins, and k-closest pair queries between two datasets. It builds two lists of overlapping clusters and a distance matrix between the cluster centers of the two datasets, and it uses the matrix to prune objects based on the triangle inequality. For searching the k closest pairs, LTC creates a heap to store the temporary k closest object pairs and their distances. It also maintains a variable upper bound distance, initialized to infinity. When the size of the heap exceeds k, it removes the longest-distance pair from the heap and sets the upper bound distance to the distance of the kth closest pair. It uses the upper bound distance as the similarity join query range and finds the closest object pairs by updating the heap and decreasing the upper bound distance. Unlike eD-index, it does not tune the query. LTC is a general-purpose index for similarity searches between two datasets.

The second category is the divide-and-conquer-based method. Quickjoin [9] recursively divides and conquers an object set into subsets based on the distances from a pivot by using modified ball partitioning. It prunes all the object pairs across two subsets if the distance between the subsets exceeds the query range. The partitioning boundary of its modified ball partitioning is the distance between the pivot and a randomly selected object. As a result, many objects tend to reside near the partitioning boundary, which makes it difficult to prune dissimilar object pairs.

Most k-closest pair methods are designed for a specific data structure, and they mainly focus on pruning and search ordering. The top-k set similarity join [16] handles sets; it uses token-based pruning and avoids repeated verification. Approximate k-closest pairs with SPace-filling curves (ASP) [3] is an approximate k-closest pair method for high-dimensional vectors; it uses Hilbert space-filling curves. An index-based method for spatial databases [6] prunes object pairs by using an R-tree. These methods rely on properties of their specific spaces and cannot be extended to general metric spaces.

Only a few papers discuss k-closest pair queries in metric spaces, and they do not exploit the distribution of distances between the pivot and the objects when determining the partitioning distance. By using the distance distribution in the partitioning technique, it may be possible to reduce the number of pivots and improve the pruning effect, which reduces both the cost of distance computations between the pivots and the objects and that among the objects. Therefore, we developed a new partitioning technique that uses the distance distribution. Compared with the index-based method, the divide-and-conquer-based method is better suited to self-join queries because, as the authors of LTC note, there is no reason to build an index for a single query. Thus, we adopted the divide-and-conquer-based approach. To the best of our knowledge, LTC is the only previously proposed method that supports the k-closest pair query in metric spaces.

4. ADAPTIVE MULTI-PARTITIONING

AMP is a divide-and-conquer-based k-closest pair search method for metric spaces. It uses multi-ball partitioning to reduce the computational cost in the divide step. Moreover, it exploits the convergence of the upper bound distance and the distance distribution in the partitioning procedure, especially for determining the interval size of the partitioning.

4.1 Multi-ball partitioning

The idea of multi-partitioning has been used in various studies for specific spaces, such as the ε-kdB tree [14]. Multi-ball partitioning is defined as follows.

Definition 4 (Multi-ball partitioning). For a metric space $M = (D, d)$, suppose a pivot $p$ divides an object set $X$ in $D$ into regions

  $X_0 = \{o \in X \mid 0 \le d(o, p) < t_0\}$
  $X_1 = \{o \in X \mid t_0 \le d(o, p) < t_1\}$
  $\dots$
  $X_i = \{o \in X \mid t_{i-1} \le d(o, p) < t_i\}$   (11)
  $\dots$
  $X_n = \{o \in X \mid t_{n-1} \le d(o, p) < t_n\}$

where the $t_i$ ($0 \le i \le n$) are the partitioning distances of $p$. For an upper bound distance $u$, suppose the inequality

  $t_{i+1} - t_i \ge u$   (12)

holds. When searching for object pairs in $X$ whose distance is shorter than $u$, it is then sufficient to check the object pairs $\{(a, b) \mid a \in X_i,\ b \in X_j,\ |i - j| \le 1\}$. Similarly, suppose $p$ divides an object set $Y$ in $D$ into regions

  $Y_0 = \{o \in Y \mid 0 \le d(o, p) < t_0\}$
  $Y_1 = \{o \in Y \mid t_0 \le d(o, p) < t_1\}$
  $\dots$
  $Y_i = \{o \in Y \mid t_{i-1} \le d(o, p) < t_i\}$   (13)
  $\dots$
  $Y_n = \{o \in Y \mid t_{n-1} \le d(o, p) < t_n\}$

[Figure 1: Adaptive Multi-Partitioning]

When searching for object pairs in $X \times Y$ whose distance is shorter than $u$, it is sufficient to check the object pairs $\{(a, b) \mid a \in X_i,\ b \in Y_j,\ |i - j| \le 1\}$.

Multi-ball partitioning can be adapted to a k-closest pair search method if the upper bound distance $u$ exceeds the distance of the kth closest pair. Thus, we must consider the intervals between the partitioning distances. In general, a k-closest pair search method initializes the upper bound distance $u$ as infinity and repeatedly updates it whenever it finds k pairs of objects whose distances are shorter than $u$; that is, $u$ converges to the distance of the kth closest pair during the search. Our idea is to adjust the partitioning distances $t_i$ by taking this convergence into consideration.

4.2 Partitioning Procedure

Because pivots prune dissimilar object pairs based on the triangle inequality, the density of distances with respect to a pivot and the upper bound distance both affect the pruning performance. Objects in a region that is sparse with respect to the distance from the pivot can be pruned more effectively, and a shorter upper bound distance can prune more dissimilar objects. We therefore focus on the convergence of the upper bound distance and on the distance distribution. We believe that dissimilar object pairs in a dense part of the distance distribution should be pruned using an upper bound distance that has already converged. Thus, AMP searches for closest pairs in the sparse part of the distance distribution before the dense part.

To detect the shape of the distance distribution, AMP calculates the skewness $s$ of the distance density. Skewness is a measure of the asymmetry of a distribution, defined as

  $s = E\left[\left(\frac{\chi - \mu}{\sigma}\right)^3\right]$   (14)

where $\chi$ is a random variable, $\mu$ is the mean, $\sigma$ is the standard deviation, and $E$ is the expectation operator. A negative skew of the distance density indicates that objects near the pivot are sparse, so AMP applies its divide-and-conquer operations from the near side of the pivot to the far side. Conversely, a positive skew indicates that objects near the pivot are dense, so AMP divides and conquers in the opposite direction. A small sketch of this skewness test follows.
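The paper does not specify the skewness estimator, so as an assumption we use the standard sample-moment estimate of Eq. (14) (illustrative Python; the names are ours):

def sample_skewness(dists):
    """Sample-moment estimate of the skewness s in Eq. (14).

    s < 0 suggests that the region near the pivot is sparse, so
    AMP sweeps from the near side; s > 0 suggests it is dense,
    so AMP sweeps from the far side.
    """
    n = len(dists)
    mu = sum(dists) / n
    sigma = (sum((x - mu) ** 2 for x in dists) / n) ** 0.5
    if sigma == 0:
        return 0.0  # degenerate case: all distances are equal
    return sum(((x - mu) / sigma) ** 3 for x in dists) / n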
Figure 1 shows the concept of AMP. AMP searches for the k closest pairs in two given object sets $X$ and $Y$ ($|X| \ge |Y|$) by performing the following steps. For a metric space $M = (D, d)$, AMP first creates a heap $A$ to store the temporary k closest object pairs and their distances. It also maintains a variable upper bound distance $u$, initialized to $\infty$. When the size of the heap exceeds k, AMP removes the longest-distance pair from the heap and sets $u$ to the distance of the kth closest pair. AMP also manages a reference distance $Ref_{o,S}$ for each object $o$ in an object set $S$; the initial value of $Ref_{o,S}$ is nil. Let $t_i$ denote the ith partitioning distance. AMP then recursively divides and conquers the given object sets $X$ and $Y$, which we denote AMP($X$, $Y$). In the following procedure, "expr1 ? expr2 : expr3" evaluates to expr2 if expr1 is true and to expr3 otherwise, as in the C language.

1. Remove from $X$ and $Y$ every object $o$ for which $Ref_{o,S} \ne \mathrm{nil} \wedge Ref_{o,S} > u$ ($S = X, Y$).
2. If $\min\{|X|, |Y|\} \le 3$, search for the closest pairs by nested loops and return.
3. Randomly choose a pivot object $p$ from $X$, and remove $p$ from $X$.
4. Calculate the distance from $p$ to each object in $X \cup Y$.
5. Set $d_{min}$, $d_{max}$, $d_{mean}$, and $s$ to the minimum, maximum, mean, and skewness of these distances, respectively.
6. Update $A$ and $u$ whenever an object pair $(o_i, p)$ with $o_i \in Y$ and $d(o_i, p) < u$ is found.
7. Set $t_0$ = ($s < 0$ ? $d_{min}$ : $d_{max}$).
8. Set $t_1$ = ($s < 0$ ? $\min\{t_0 + u, d_{mean}\}$ : $\max\{t_0 - u, d_{mean}\}$).
9. Set $S_1$ = ($s < 0$ ? $\{o \in S \mid t_0 \le d(o, p) < t_1\}$ : $\{o \in S \mid t_0 \ge d(o, p) > t_1\}$) ($S = X, Y$).
10. Call AMP($X_1$, $Y_1$).
11. While ($s < 0$ ? $t_i < d_{max}$ : $t_i > d_{min}$) holds:
  (a) Set $t_{i+1}$ = ($s < 0$ ? $t_i + u$ : $t_i - u$).
  (b) Set $S_{i+1}$ = ($s < 0$ ? $\{o \in S \mid t_i \le d(o, p) < t_{i+1}\}$ : $\{o \in S \mid t_i \ge d(o, p) > t_{i+1}\}$) ($S = X, Y$).
  (c) Call AMP($X_{i+1}$, $Y_{i+1}$).
  (d) Set $S'_i = \{o \in S_i \mid |t_i - d(p, o)| < u\}$ ($S = X, Y$).
  (e) For all $o \in S'_i$, set $Ref_{o,S'_i} = \min\{Ref_{o,S},\ |t_i - d(p, o)|\}$ ($S = X, Y$).
  (f) Set $S'_{i+1} = \{o \in S_{i+1} \mid |t_i - d(p, o)| < u\}$ ($S = X, Y$).
  (g) For all $o \in S'_{i+1}$, set $Ref_{o,S'_{i+1}} = \min\{Ref_{o,S},\ |t_i - d(p, o)|\}$ ($S = X, Y$).
  (h) Call AMP($X'_i$, $Y'_{i+1}$) and AMP($X'_{i+1}$, $Y'_i$).
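To make the control flow concrete, here is a much-simplified sketch of the sweep for the self-join case (our own Python, under strong simplifying assumptions: it omits the Ref bookkeeping of steps 1, (e), and (g), omits step 8's data-dependent first interval, assumes hashable and mutually comparable objects such as tuples of floats, and reuses sample_skewness from the sketch above). It is meant to convey the sparse-first ring sweep, not to reproduce the authors' implementation.

import heapq
import random

def update(heap, u, k, dist, x, y):
    """Insert a candidate pair and tighten the upper bound u[0]."""
    if len(heap) < k:
        heapq.heappush(heap, (-dist, x, y))
    elif dist < -heap[0][0]:
        heapq.heapreplace(heap, (-dist, x, y))
    if len(heap) == k:
        u[0] = -heap[0][0]  # distance of the current kth closest pair

def amp_self(X, d, k, heap, u):
    """Much-simplified AMP sweep for the self-join case (X = Y)."""
    if len(X) <= 3:
        for i in range(len(X)):               # step 2: nested-loops base case
            for j in range(i + 1, len(X)):
                update(heap, u, k, d(X[i], X[j]), X[i], X[j])
        return
    p = random.choice(X)                      # step 3: random pivot
    pool = [o for o in X if o is not p]
    dp = {o: d(p, o) for o in pool}           # step 4: pivot distances
    for o in pool:                            # step 6: pairs involving p
        update(heap, u, k, dp[o], p, o)
    lo, hi = min(dp.values()), max(dp.values())
    near_first = sample_skewness(list(dp.values())) < 0   # steps 5 and 7
    t = lo if near_first else hi
    prev_ring = []
    while pool:
        t_next = t + u[0] if near_first else t - u[0]     # step 11(a)
        if t_next == t:                       # u collapsed to 0: stop slicing
            ring, pool = pool, []
        elif near_first:                      # step 11(b): ring of width u
            ring = [o for o in pool if dp[o] < t_next]
            pool = [o for o in pool if dp[o] >= t_next]
        else:
            ring = [o for o in pool if dp[o] > t_next]
            pool = [o for o in pool if dp[o] <= t_next]
        amp_self(ring, d, k, heap, u)         # step 11(c): conquer the ring
        # steps 11(d)-(h), simplified: combine across the boundary at t
        left = [o for o in prev_ring if abs(dp[o] - t) < u[0]]
        right = [o for o in ring if abs(dp[o] - t) < u[0]]
        for x in left:
            for y in right:
                update(heap, u, k, d(x, y), x, y)
        prev_ring, t = ring, t_next

A call looks like: heap, u = [], [float('inf')]; amp_self(points, dist, k, heap, u); the k closest pairs are then in heap. Non-adjacent rings need no combine step because every ring is cut with width u, so any pair spanning an intervening ring is at least u apart.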

We can solve the k-closest pair problem in the case $X = Y$ with a minor modification, which we omit for brevity.

4.3 Search Cost

The computational cost is the number of distance computations between objects. Let $X$ and $Y$ be the given object sets. The computational cost of AMP($X$, $Y$) is

  $\mathrm{AMP}(X, Y) = \underbrace{|X| + |Y| - 1}_{\text{divide step}} + \underbrace{\sum_i \mathrm{AMP}(X_i, Y_i)}_{\text{conquer step}} + \underbrace{\sum_{\{(i,j) \mid |i-j|=1\}} \mathrm{AMP}(X'_i, Y'_j)}_{\text{combine step}}$   (15)

where AMP($\cdot$, $\cdot$) denotes the cost of the k-closest pair query. When $X$ is equal to $Y$, the cost of AMP($X$) is

  $\mathrm{AMP}(X) = |X| - 1 + \sum_i \mathrm{AMP}(X_i) + \sum_{\{(i,j) \mid |i-j|=1\}} \mathrm{AMP}(X'_i, X'_j)$   (16)

5. EVALUATION

We evaluated the computational cost of finding the k closest pairs on real datasets.

5.1 Outline of Experiments

We compared the following methods in the experiment.

- AMP is our k-closest pair search method. It uses a sparse-region-first strategy.
- AMP in reverse order is a comparative variant of AMP. It uses a dense-region-first strategy.
- Binary partitioning is a k-closest pair search method based on modified ball partitioning.
- Nested loops is the naive k-closest pair method; it does not prune any objects and computes the distances of all pairs of objects in the dataset.

We used the following real datasets, all of which are available on the Web. These datasets have been used in many recent related studies [12, 9, 4]. We removed duplicate objects from each dataset.

- NASA [1] is a set of feature vectors made by NASA. It consists of 40,150 vectors in a 20-dimensional feature space. The vectors were compared using the Euclidean distance.
- Corel image features [2] consists of color histogram vectors generated from the Corel image collection. It consists of 68,040 vectors in a 32-dimensional space. The vectors were compared using the Euclidean distance.
- Color histogram [1] consists of the color histograms of 112,544 images, represented by vectors in a 112-dimensional space. The vectors were compared using the Euclidean distance.

Figure 3 shows the distance density of each dataset, and Table 2 lists the properties of the distances between the objects in each dataset. Note that NASA is in the lowest-dimensional feature space, that the skewness of Color histogram is the highest, and that NASA has a wide range of distances and the lowest skewness.

[Figure 3: Distance density. (a) NASA, (b) Corel image features, (c) Color histogram]

Table 2: Real datasets

             NASA       Corel image features   Color histogram
  Distance   Euclidean  Euclidean              Euclidean
  Dimension  20         32                     112
  Average    1.48       0.564                  0.415
  Variance   0.211      0.0332                 0.0310
  Skewness   0.0447     0.444                  0.828
  Kurtosis   2.39       3.08                   3.57

We implemented AMP and the comparative methods on the Metric Space Library [1], which is written in C. We conducted the experiment on a Linux PC equipped with an Intel(R) Quad Core Xeon(TM) X5492 3.40 GHz CPU and 64 GB of memory. The library and our code were compiled with GCC 4.4. All datasets were processed in memory for all examined methods.

5.2 Computational Cost

We evaluated how AMP reduces the search cost with respect to the query size. We measured the number of distance computations for k-closest pair queries with k ranging from 1 to 100,000. Each result is the average over 500 queries on its dataset. Figure 2 shows the computational cost. The vertical axis represents the number of distance computations for a k-closest pair query divided by the value for Nested loops, i.e., $N(N-1)/2$, where $N$ is the number of objects in the dataset. The horizontal axis is k. None of the methods require an index, so this result shows the total computational cost of the search.

[Figure 2: Computational cost. (a) NASA, (b) Corel image features, (c) Color histogram]
A lower percentage indicates a lower computational cost. We can see that AMP reduces the computational cost. These results show that multi-partitioning prunes more dissimilar object pairs than binary partitioning. Furthermore, AMP outperforms AMP in reverse order in all results, which indicates that the sparse-region-first strategy is better than the dense-region-first strategy. In particular, AMP requires far fewer distance computations than the other methods on Color histogram. This suggests that the skewness of Color histogram is large and that the sparse-region-first strategy works well for skewed datasets.

6. CONCLUSION

We investigated the problem of the k-closest pair query in metric spaces. We proposed an efficient k-closest pair search method that prunes dissimilar object pairs based on the triangle inequality. The method repeatedly divides and conquers the objects starting from the sparser space, which speeds up the convergence of the upper bound distance before the denser space is partitioned. We are currently conducting experiments on synthetic datasets and theoretically analyzing the performance of our method in more detail.

7. REFERENCES

[1] Metric Space Library. http://www.sisap.org/metric_space_library.html.
[2] UCI KDD Archive. http://kdd.ics.uci.edu/.
[3] F. Angiulli and C. Pizzuti. Approximate k-closest-pairs in large high-dimensional data sets. Journal of Mathematical Modelling and Algorithms, 4(2):149–179, 2005.

[4] B. Bustos and G. Navarro. Improving the space cost of k-NN search in metric spaces by using distance estimators. Multimedia Tools and Applications, 41(2):215–233, 2009.
[5] E. Chávez and G. Navarro. A compact space decomposition for effective metric indexing. Pattern Recognition Letters, 26(9):1363–1376, 2005.
[6] A. Corral, Y. Manolopoulos, Y. Theodoridis, and M. Vassilakopoulos. Algorithms for processing k-closest-pair queries in spatial databases. Data & Knowledge Engineering, 49(1):67–104, 2004.
[7] V. Dohnal, C. Gennaro, P. Savino, and P. Zezula. D-Index: Distance searching index for metric data sets. Multimedia Tools and Applications, 21(1):9–33, 2003.
[8] V. Dohnal, C. Gennaro, and P. Zezula. Similarity join in metric spaces using eD-index. In DEXA, 2003.
[9] E. H. Jacox and H. Samet. Metric space similarity joins. ACM Transactions on Database Systems, 33(2):1–38, 2008.
[10] H. V. Jagadish, B. C. Ooi, K.-L. Tan, C. Yu, and R. Zhang. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems, 30(2):364–397, 2005.
[11] H. Kurasawa, D. Fukagawa, A. Takasu, and J. Adachi. Maximal metric margin partitioning for similarity search indexes. In CIKM, 2009.
[12] G. Navarro and N. Reyes. Dynamic spatial approximation trees. Journal of Experimental Algorithmics, 12:1–68, 2008.
[13] R. Paredes and N. Reyes. Solving similarity joins and range queries in metric spaces with the list of twin clusters. Journal of Discrete Algorithms, 7(1):18–35, 2009.
[14] K. Shim, R. Srikant, and R. Agrawal. High-dimensional similarity joins. In ICDE, 1997.
[15] J. K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40(4):175–179, 1991.
[16] C. Xiao, W. Wang, X. Lin, and H. Shang. Top-k set similarity joins. In ICDE, 2009.
[17] P. Zezula, G. Amato, V. Dohnal, and M. Batko. Similarity Search: The Metric Space Approach (Advances in Database Systems). Springer-Verlag, 2005.
[18] Y. Zhuang, Y. Zhuang, Q. Li, L. Chen, and Y. Yu. Indexing high-dimensional data in dual distance spaces: a symmetrical encoding approach. In EDBT, 2008.