A Mixed Hierarchical Algorithm for Nearest Neighbor Search


Carlo del Mundo
Virginia Tech
222 Kraft Dr., Knowledge Works II Building
Blacksburg, VA

Mariam Umar
Virginia Tech
222 Kraft Dr., Knowledge Works II Building
Blacksburg, VA
mariam.umar@vt.edu

ABSTRACT

The k nearest neighbor (knn) search is a computationally intensive application critical to fields such as image processing, statistics, and biology. Recent works have demonstrated the efficacy of k-d tree based implementations on multi-core CPUs. It is unclear, however, whether such tree-based implementations are amenable to execution on high-density processors typified today by the graphics processing unit (GPU). This work seeks to map and optimize knn to massively parallel architectures such as the GPU. Our approach synthesizes a clustering technique, k-means, with traditional brute force methods to prune the search space while taking advantage of data-parallel execution of knn on the GPU. Overall, our general-case GPU version outperforms a single-threaded CPU by factors as high as 18.

1. INTRODUCTION & MOTIVATION

knn is a fundamental algorithm for classifying objects. It works by finding the nearest neighbors of one or several query points in a metric space. Figure 1 depicts an example of knn for a 2D Euclidean metric space. In the context of this paper, the input data set is referred to as the reference points (shown as circles), and the targets are referred to as query points (shown as an X). The two closest neighbors (K = 2) of the query point are shown in green.

Figure 1: Nearest neighbor search for N = 12 and K = 2. The red X represents the query point, q, and its nearest neighbors are shown in green.

Computing knn presents a prohibitive cost for large inputs and dimensionalities. This work seeks to capitalize on the rich parallel resources of GPUs to accelerate knn calculations. Traditional techniques focus on k-d tree based data structures to achieve O(log N) searches by pruning the search space at each level of the tree. To the best of our knowledge, no works have focused on search-space pruning techniques for knn on the GPU. We explore the use of a clustering technique, known as k-means, to perform offline groupings of like-coordinate points. Our approach takes advantage of the properties of clusters. Points are clustered together based on a convergence criterion, and each cluster of near-proximity points has a center, known as the centroid, calculated as the average of the coordinates of the points within the cluster. We assume that the nearest neighbor of a query point, q, belongs to the cluster closest to q. This fundamental assumption prunes the search space by discarding points that do not belong to the closest cluster. This tree-like pruning behavior can significantly cut down on the number of data points to test while avoiding branch-divergence penalties.

In this work, we characterize the brute force (BF) linear knn method for both the CPU and GPU. We then demonstrate the efficacy of our hierarchical algorithm on the GPU against the BF CPU. To that end, our contributions are as follows.

1. A characterization of the data-parallel brute force algorithm on CPU and GPU
2. The design, implementation, and characterization of a mixed hierarchical knn algorithm that prunes the search space via clustering

The rest of the document is outlined as follows. Section 2 discusses related work, Section 3 presents our hierarchical approach, and finally, Section 4 summarizes and discusses our results.

2. RELATED WORK

2.1 K-means

Clustering is a widely studied problem, of which k-means is the canonical clustering method. Pelleg et al. [8] discuss implementation issues in k-means, such as poor scaling and convergence to local minima, and propose several solutions to these problems. Alsabti et al. [1] explored a k-d tree based implementation of k-means and claim that computing k-means with a k-d tree improves performance by two orders of magnitude. Bradley et al. [4] show that the choice of initial points is very important to performance and argue that a better initialization helps k-means converge to a better minimum; finding better initial points improves solutions for both continuous and discrete data sets. Kanungo et al. [7] use Lloyd's algorithm for k-means. Their approach differs from the conventional one in that they construct a k-d tree over the data points rather than the query points, and they claim that their implementation performs better on both synthetically generated and real data sets.

Although k-means has been a popular clustering algorithm for many decades, its theoretical bounds have only recently been established. Arthur et al. [2] show that, while the method is simple and fast in practice, its worst-case running time is superpolynomial, even when the initial clusters and their corresponding centers are chosen uniformly at random. They establish this lower bound using "reset widgets," gadgets introduced to force the k-means computation to run much longer. Farivar et al. [5] implemented k-means on GPUs, structuring their implementation around the architecture of the GPU and taking performance and power efficiency into account to demonstrate that data-intensive tasks run well on the GPU. Their speedups reach factors as high as 68 on the NVIDIA 8800 Ultra GTX. A serious concern about their speedup is that it does not account for data-transfer times between CPU and GPU, which may become a bottleneck for larger data sets.

2.2 knn

One prominent tree-based nearest neighbor algorithm is based on the work of Arya et al. [3]. This work focuses on a CPU tree data structure that cuts search and space complexity down to O(log N) and O(N), respectively. The authors discuss the implications of nearest neighbor search in higher dimensionalities (d > 2) and how to avoid common pitfalls. Their methodology involves a modified k-d tree, also known as a bounding box decomposition tree. Finally, they relax the constraints of the nearest neighbor problem in order to gain substantial speedup with respect to algorithmic complexity.

Garcia et al. propose a GPU-based implementation of the knn algorithm [6]. They implemented a brute-force approach to the knn problem by composing the computation as a series of matrix and sorting operations. By leveraging CUDA and CUBLAS, the authors show substantial speedups, by factors as high as 64 and 189, over Arya et al.'s implementation on multi-core CPUs.

3. APPROACH

There are two common approaches to knn: (1) a linear brute force (BF) approach and (2) the k-d tree approach. Instead, we propose a mixed hierarchical algorithm that uses a combination of BF and clustering. In the BF algorithm, a simple Euclidean distance kernel is applied to an array of points. This approach requires no data structure other than an array.
The BF approach is relatively slow, on the order of O(N), where N is the number of points in the list. More efficient partitioning techniques use spatial data structures such as k-d trees, reducing the search complexity to O(log N) [3]. Unfortunately, k-d tree based implementations of nearest neighbor search have only been widely studied on the CPU.

We propose a mixed hierarchical algorithm that first compresses the data via a clustering scheme. Since points are clustered by a distance metric, we can assume that a query point and its neighbors fall within the same cluster. The cluster is identified by computing the Euclidean distance between the query point and each cluster's centroid. Finally, a brute force, data-parallel pass traverses the remaining data points within the identified cluster. This approach effectively prunes the number of reference points by a factor related to the cluster size.

Figure 2 compares and contrasts the following algorithms: (1) brute force, (2) the k-d tree, and (3) our proposed hierarchical algorithm. The brute force algorithm, shown in (a), calculates distances between the query point and every other point (O(N)). In the k-d tree method, the coordinate space is subdivided into a set of tiles. Though this yields an algorithmic complexity of O(log N) for search, it is unclear whether such a data structure maps well onto the GPU. Finally, our hierarchical algorithm, shown in (c), first partitions data points into clusters. Distance calculations are first performed between each cluster's centroid and the query point; once the closest cluster is determined, a brute force approach is applied to the points within that cluster. Our implementations are based on the pseudocode outlined in Sections 3.1, 3.2, and 3.3 for the brute force knn, k-means clustering, and our mixed algorithm, respectively.

3.1 Brute force knn algorithm

Given a set of query points, Q, and a set of input data points, I, for each query point q_i:

1. Compute the Euclidean distance between q_i and each point in I.
2. Sort the distances in ascending order. The k nearest neighbors of q_i are the first k entries in the sorted array.
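As a concrete illustration, here is a minimal NumPy sketch of this brute-force procedure. It is our own illustration, not the authors' CUDA implementation, and the function name is hypothetical; the paper's GPU version realizes the same two stages as a distance kernel followed by a sort.

```python
import numpy as np

def brute_force_knn(query, points, k):
    """Brute-force kNN: distance to every reference point, then a sort.

    query:  (d,) coordinate vector
    points: (n, d) array of reference points
    Returns the k nearest points and their distances.
    """
    # Stage 1: Euclidean distance from the query to every point in I.
    dists = np.linalg.norm(points - query, axis=1)
    # Stage 2: sort ascending; the first k entries are the k nearest neighbors.
    order = np.argsort(dists)[:k]
    return points[order], dists[order]
```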

Figure 2: Approaches to knn. In (a), the brute force nearest neighbor algorithm is shown: for a query point, q, a distance calculation is performed against every other point. In (b), the k-d tree nearest neighbor algorithm subdivides the coordinate space into equally spaced tiles; the search complexity is of order O(log N). Finally, in (c), an example of our proposed hierarchical algorithm is shown. Instead of traversing a tree structure on the GPU, clustering is performed on the CPU and the clusters are transferred to the GPU. To narrow a query point down to its nearest neighbors, a distance calculation is performed from the query point to each cluster's centroid, and the brute force approach is then applied to the points in the closest cluster. Our assumption is that the nearest neighbor is contained within the closest cluster.

3.2 K-means clustering algorithm

The k-means clustering algorithm takes a parameter, C, the total number of clusters into which the reference data points are grouped.

1. Choose C initial points as the initial clusters; the centroid of each such cluster is the coordinate of its initial point.
2. Calculate the Euclidean distance between each point and each current centroid, and assign each point to its nearest cluster.
3. Recalculate the centroid of each cluster as the average of all points belonging to that cluster.
4. Repeat the previous two steps until the centroids converge.
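The same steps in a compact NumPy sketch, again as our own illustration (the paper runs k-means offline on the CPU; the tolerance and iteration cap below are assumed defaults, not values from the paper):

```python
import numpy as np

def kmeans(points, c, tol=1e-4, max_iter=100, seed=0):
    """Lloyd-style k-means: returns (centroids, labels) for `points`."""
    rng = np.random.default_rng(seed)
    # Step 1: choose c reference points as the initial centroids.
    centroids = points[rng.choice(len(points), size=c, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members
        # (keeping the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(c)
        ])
        # Step 4: stop once the centroids have converged.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```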
3.3 Mixed Algorithm

1. Calculate a set of C clusters using the algorithm in Section 3.2.
2. For each query point, q_i:
   (a) Determine the closest cluster to q_i by calculating the Euclidean distance between each cluster's centroid and the query point.
   (b) Apply the brute force algorithm of Section 3.1 to the points in that closest cluster.
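Putting the pieces together, a sketch of the mixed search follows, reusing the hypothetical brute_force_knn and kmeans helpers from the sketches above; the cluster count in the usage example mirrors the setting described in Section 4.1.

```python
import numpy as np

def mixed_knn(query, points, centroids, labels, k):
    """Mixed hierarchical kNN: prune to the closest cluster, then brute force."""
    # (a) Closest cluster: distance from the query to each centroid.
    nearest_cluster = np.linalg.norm(centroids - query, axis=1).argmin()
    # (b) Brute-force search restricted to the points of that cluster.
    members = points[labels == nearest_cluster]
    return brute_force_knn(query, members, k)

# Example usage (2-D points; K = 1 and 10 clusters, as in Section 4.1):
pts = np.random.default_rng(1).random((100_000, 2))
centroids, labels = kmeans(pts, c=10)
neighbors, dists = mixed_knn(np.array([0.5, 0.5]), pts, centroids, labels, k=1)
```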
3.4 Limitations

Boundary cases. Like the k-d tree implementation, boundary cases are an issue. Suppose a query point lies on the boundary between two clusters. In this situation, both clusters must be traversed in order to determine the nearest neighbor. The problem is further exacerbated for query points on the boundaries between N clusters. This could be alleviated by creating a bounding volume for each cluster, thereby identifying all potential clusters that may contain the nearest neighbor.

Costs of creating the clusters. The start-up cost of clustering can be prohibitive for large sample sizes. We assume that the cost of creating the clusters beforehand is amortized by fast query searches.

Number and size of clusters. Empirical testing must be performed to determine the optimal number of clusters and the sizes of the respective clusters. Many clusters increase the overhead of determining which cluster a query point belongs to; similarly, a large cluster increases the overhead of the brute force phase.

4. RESULTS AND DISCUSSION

Here, we outline our experimental setup, results, and discussion.

4.1 Experimental Testbed

Table 1: Experimental Testbed.
  OS/Kernel: Debian Wheezy 7.0, 64-bit
  Software:  CUDA, driver v313.3
  CPU:       Intel Celeron E3300 (2 cores, 2.50 GHz)
  GPU:       NVIDIA Tesla C2070 (448 cores, 1.15 GHz)
  Compiler:  nvcc -O3 (optimizes only the CPU code)

Our experimental testbed is listed in Table 1. Throughout our experiments, we compile with nvcc using the -O3 flag to amortize the cost of having a slow CPU; this flag improved CPU performance by a factor of five over the unoptimized build. For our dataset, we use the USA-Central road network nodes and vary the input size from 1 to 64 MB. We fix the number of query points (Q) and neighbor points (K) to one, and we fix the number of clusters to 10.

Our implementations are broken down into three kernels: (1) distance computation, (2) sort, and (3) k-means. We run these kernels on the CPU, the GPU, or both. Our experiments are as follows.

1. BF CPU: distance computation (CPU), sort (CPU).
2. BF GPU: distance computation (GPU), sort (GPU).
3. BF GPU + k-means: k-means (CPU), distance computation (GPU), sort (GPU).

Since the number of neighbor points (K) is fixed to one, a reduction operation can be substituted for the sorting operation, and we demonstrate the efficacy of reduction vs. sorting (a minimal sketch of the substitution follows Section 4.2). We implemented the distance computation for both CPU and GPU and k-means for the CPU; we used std::sort and thrust::sort for the CPU and GPU sorts, respectively, and Thrust's reduction for the GPU reduction operation. We do not include the execution time of k-means in the measurements for our mixed algorithm; that is, we assume the cost of cluster creation is negligible.

4.2 Results

Figure 3: Results for BF CPU, BF GPU, and BF GPU with k-means. Execution time (ms) vs. number of reference points (MB) for CPU BF, GPU BF, GPU BF + KM, and GPU BF + KM + Reduction.

Our primary results are shown in Figure 3; note that both axes are on a logarithmic scale. In all cases, our GPU versions outperform the CPU version, with performance improving with each successive GPU implementation.

Figure 4 depicts the execution time at N = 64 MB for {GPU BF + KM} and {GPU BF + KM + Reduction}, broken down into their constituent stages: distance/sort and distance/reduction, respectively. Recall that the sorting operation in the BF algorithm can be replaced by a reduction when the number of neighbor points (K) is one. Substituting a reduction for the sort improves performance by a factor of 2; in addition, the execution time is then no longer dominated by the sorting stage, but by the distance stage.

Figure 4: Percentage of execution time spent in the distance and sort/reduction stages for N = 64 MB. In (a), the breakdown of {BF GPU + KM} is shown (distance 8%, sort 91%), and in (b), that of {BF GPU + KM + Reduction} (distance 65.3%, reduction 34.7%). Sorting is the dominant component of the GPU BF algorithm with k-means, comprising 91% of the execution time at 64 MB. When a reduction is substituted for the sort, the distance component of BF GPU becomes the dominant factor.
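As a minimal sketch of the K = 1 substitution discussed above (ours; the paper's GPU version uses Thrust's reduction): when only the single nearest neighbor is needed, a running-minimum reduction replaces the sort entirely.

```python
import numpy as np

def nearest_neighbor(query, points):
    """K = 1 special case: a single O(N) min-reduction instead of a sort."""
    dists = np.linalg.norm(points - query, axis=1)
    i = dists.argmin()  # min-reduction over the distances
    return points[i], dists[i]
```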
4.3 Discussion

The linear growth seen in all experiments is expected, since the BF algorithm requires O(N) search time. The differences in execution time among the GPU versions can be attributed to differences in algorithmic design. The naive {GPU BF} version computes both distance and sort over all points in the data set. {GPU BF + KM} computes distance and sort over only a subset of the points (those within the selected cluster). Finally, {GPU BF + KM + Reduction} is similar to {GPU BF + KM}, but performs a reduction in lieu of sorting. Overall, our fastest GPU version, {GPU BF + KM + Reduction}, outperforms our CPU implementation by factors as high as 822. We note, however, that this is a special corner case of the knn computation (K = 1). Therefore, for the general case, {GPU BF + KM} outperforms our CPU implementation by factors as high as 18.

5. REFERENCES

[1] Khaled Alsabti. An efficient k-means clustering algorithm. In Proceedings of the IPPS/SPDP Workshop on High Performance Data Mining, 1998.

[2] David Arthur and Sergei Vassilvitskii. How slow is the k-means method? In Nina Amenta and Otfried Cheong, editors, Proceedings of the Symposium on Computational Geometry, ACM, 2006.

[3] Sunil Arya, David M. Mount, Nathan S. Netanyahu, Ruth Silverman, and Angela Y. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. Journal of the ACM, 45(6):891-923, November 1998.

[4] P. S. Bradley and Usama M. Fayyad. Refining initial points for k-means clustering. In Proceedings of the International Conference on Machine Learning, Morgan Kaufmann, 1998.

[5] Reza Farivar, Daniel Rebolledo, Ellick Chan, and Roy H. Campbell. A parallel implementation of k-means clustering on GPUs. In PDPTA, 2008.

[6] V. Garcia, E. Debreuve, F. Nielsen, and M. Barlaud. K-nearest neighbor search: fast GPU-based implementations and application to high-dimensional feature matching. In Proceedings of the 17th IEEE International Conference on Image Processing (ICIP), 2010.

[7] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine Piatko, Ruth Silverman, and Angela Y. Wu. The analysis of a simple k-means clustering algorithm, 2000.

[8] Dan Pelleg and Andrew Moore. X-means: extending k-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning, Morgan Kaufmann, 2000.
