An Accelerated Nearest Neighbor Search Method for the K-Means Clustering Algorithm
Proceedings of the Twenty-Sixth International Florida Artificial Intelligence Research Society Conference

Adam Fausett and M. Emre Celebi
Department of Computer Science, Louisiana State University in Shreveport, Shreveport, LA, USA

Abstract

K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, the nearest neighbor search step of this algorithm can be computationally expensive, as the distance between each input vector and all cluster centers needs to be calculated. To accelerate this step, a computationally inexpensive distance estimation method can be tried first, resulting in the rejection of candidate centers that cannot possibly be the nearest center to the input vector under consideration. This way, the computational requirements of the search can be reduced, as most of the full distance computations become unnecessary. In this paper, a fast nearest neighbor search method that rejects impossible centers to accelerate the k-means clustering algorithm is presented. Our method uses geometrical relations among the input vectors and the cluster centers to reject many unlikely centers that are not typically rejected by similar approaches. Experimental results show that the method can reduce the number of distance computations significantly without degrading the clustering accuracy.

1 Introduction

Clustering, the unsupervised classification of patterns into groups, is one of the most important tasks in exploratory data analysis (Jain, Murty, and Flynn 1999). Primary goals of clustering include gaining insight into data (detecting anomalies, identifying salient features, etc.), classifying data, and compressing data. Clustering has a long and rich history in a variety of scientific disciplines including anthropology, biology, medicine, psychology, statistics, mathematics, engineering, and computer science.
As a result, numerous clustering algorithms have been proposed since the early 1950s (Jain 2010). Clustering algorithms can be broadly classified into two groups: hierarchical and partitional (Jain 2010). Hierarchical algorithms recursively find nested clusters either in a top-down (divisive) or bottom-up (agglomerative) fashion. In contrast, partitional algorithms find all the clusters simultaneously as a partition of the data and do not impose a hierarchical structure. Most hierarchical algorithms have quadratic or higher complexity in the number of data points (Jain, Murty, and Flynn 1999) and therefore are not suitable for large data sets, whereas partitional algorithms often have lower complexity.

Given a data set X = {x_1, x_2, ..., x_N} ⊂ R^D, i.e., N vectors each with D attributes, hard partitional algorithms divide X into K exhaustive and mutually exclusive clusters P = {P_1, P_2, ..., P_K}, with ∪_{i=1}^{K} P_i = X and P_i ∩ P_j = ∅ for 1 ≤ i ≠ j ≤ K. These algorithms usually generate clusters by optimizing a criterion function. The most intuitive and frequently used criterion function is the Sum of Squared Error (SSE) given by:

    SSE = Σ_{i=1}^{K} Σ_{x_j ∈ P_i} ||x_j − y_i||²₂    (1)

where ||·||₂ denotes the Euclidean norm and y_i = (1/|P_i|) Σ_{x_j ∈ P_i} x_j is the centroid of cluster P_i, whose cardinality is |P_i|. The number of ways in which a set of N objects can be partitioned into K non-empty groups is given by the Stirling numbers of the second kind:

    S(N, K) = (1/K!) Σ_{i=0}^{K} (−1)^{K−i} (K choose i) i^N    (2)

which can be approximated by K^N / K!. It can be seen that a complete enumeration of all possible clusterings to determine the global minimum of (1) is clearly computationally prohibitive. In fact, this non-convex optimization problem is proven to be NP-hard even for K = 2 (Aloise et al. 2009) or D = 2 (Mahajan, Nimbhorkar, and Varadarajan 2012).

Copyright © 2013, Association for the Advancement of Artificial Intelligence. All rights reserved.
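As a quick illustration (a sketch for small instances, not from the paper), the SSE criterion in (1) and the Stirling count in (2) can be computed directly:

```python
from math import comb, factorial

def sse(points, labels, K):
    """Sum of Squared Error (1): squared Euclidean distance of each
    point to the centroid of its assigned cluster."""
    D = len(points[0])
    total = 0.0
    for k in range(K):
        cluster = [p for p, l in zip(points, labels) if l == k]
        if not cluster:
            continue
        centroid = [sum(p[d] for p in cluster) / len(cluster) for d in range(D)]
        total += sum(sum((p[d] - centroid[d]) ** 2 for d in range(D)) for p in cluster)
    return total

def stirling2(N, K):
    """Stirling number of the second kind (2): the number of ways to
    partition N objects into K non-empty groups."""
    return sum((-1) ** (K - i) * comb(K, i) * i ** N for i in range(K + 1)) // factorial(K)

# Even for modest N and K the number of candidate clusterings is huge,
# which is why exhaustive minimization of (1) is hopeless:
print(stirling2(25, 5))
```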
Consequently, various heuristics have been developed to provide approximate solutions to this problem (Tarsitano 2003). Among these heuristics, Lloyd's algorithm (Lloyd 1982), often referred to as the k-means algorithm, is the simplest and most commonly used one (Celebi and Kingravi 2012; Celebi, Kingravi, and Vela 2013). This algorithm starts with K arbitrary centers, typically chosen uniformly at random from the data points. Each point is assigned to the nearest center, and then each center is recalculated as the mean of all points assigned to it. These two steps are repeated until a predefined termination criterion is met. Algo. 1 shows the pseudocode of this procedure.

Despite its linear time complexity, k-means can be time consuming for large data sets due to its computationally expensive nearest neighbor search step. In this step, the nearest center to each input vector is determined using a full search. That is, given an input vector x, the distance from this vector to each of the K centers is calculated, and then the center with the smallest distance is determined. The dissimilarity of two vectors x = (x_1, x_2, ..., x_D) and y = (y_1, y_2, ..., y_D) is measured using the squared Euclidean distance:

    d²(x, y) = ||x − y||²₂ = Σ_{d=1}^{D} (x_d − y_d)²    (3)

Equation (3) shows that each distance computation requires D multiplications and 2D − 1 additions/subtractions. It is therefore necessary to perform KD multiplications, K(2D − 1) additions/subtractions, and K − 1 comparisons to determine the nearest center to each input vector. In this paper, a novel method is proposed to minimize the number of distance computations necessary to find the nearest center by utilizing the geometrical relationships among the input vector and the cluster centers.

The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 describes the proposed accelerated nearest neighbor search method. Section 4 provides the experimental results. Finally, Section 5 gives the conclusions.

input : X = {x_1, x_2, ..., x_N} ⊂ R^D (N × D input data set)
output: C = {c_1, c_2, ..., c_K} ⊂ R^D (K cluster centers)
Select a random subset C of X as the initial set of cluster centers;
while termination criterion is not met do
    for (i = 1; i ≤ N; i = i + 1) do
        /* Assign x_i to the nearest cluster */
        m[i] = argmin_{k ∈ {1,2,...,K}} ||x_i − c_k||₂;
    /* Recalculate the cluster centers */
    for (k = 1; k ≤ K; k = k + 1) do
        /* Cluster P_k contains the set of points x_i that are nearest to the center c_k */
        P_k = {x_i | m[i] = k};
        /* Calculate the new center c_k as the mean of the points that belong to P_k */
        c_k = (1/|P_k|) Σ_{x_i ∈ P_k} x_i;
Algorithm 1: K-means algorithm

2 Related Work

Numerous methods have been proposed for accelerating the nearest neighbor search step of the k-means algorithm.
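The two alternating steps of Algorithm 1 can be sketched in a few lines of Python (an illustrative sketch that uses unchanged assignments, capped by a fixed iteration limit, as the termination criterion; it is not the authors' code):

```python
import random

def kmeans(X, K, max_iter=100):
    """Lloyd's algorithm: alternate nearest-center assignment and
    center recomputation until the assignments stop changing."""
    D = len(X[0])
    centers = random.sample(X, K)  # K arbitrary initial centers drawn from the data
    labels = [-1] * len(X)
    for _ in range(max_iter):
        # Assignment step: full nearest neighbor search for every input vector.
        new_labels = [min(range(K),
                          key=lambda k: sum((x[d] - centers[k][d]) ** 2
                                            for d in range(D)))
                      for x in X]
        if new_labels == labels:  # termination criterion: assignments unchanged
            break
        labels = new_labels
        # Update step: each center becomes the mean of its assigned points.
        for k in range(K):
            members = [x for x, l in zip(X, labels) if l == k]
            if members:
                centers[k] = tuple(sum(x[d] for x in members) / len(members)
                                   for d in range(D))
    return centers, labels
```

The assignment step is where all KD multiplications per vector are spent, and it is the step the methods surveyed below try to accelerate.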
Note that k-means is also known as the Generalized Lloyd Algorithm (GLA) or the Linde-Buzo-Gray (LBG) algorithm (Linde, Buzo, and Gray 1980) in the vector quantization literature. In this section, we briefly describe several methods that perform well compared to full search.

Triangle Inequality Elimination (TIE) Method

Let x be an input vector and y_l be a known nearby center. Center y_j is a candidate to be the nearest center only if it satisfies d(x, y_j) ≤ d(x, y_l). This inequality constrains the centers which need to be considered; however, testing the inequality requires evaluating d(x, y_j) for each center y_j, which is computationally expensive. To circumvent this, the TIE method bounds d(y_l, y_j) rather than d(x, y_j). Using the triangle inequality, center y_j is a candidate to be the nearest center only if the following condition is satisfied (Huang and Chen 1990; Chen and Hsieh 1991; Orchard 1991; Phillips 2002):

    d(y_j, y_l) ≤ d(x, y_j) + d(x, y_l) ≤ 2d(x, y_l)    (4)

With d(x, y_l) evaluated, (4) makes it possible to eliminate center y_j from consideration by evaluating the distance between y_j and another center, y_l, rather than between y_j and the input vector x. This means that these distance computations do not involve the input vector, and thus they can be precomputed, thereby allowing a large set of centers to be rejected by evaluating the single distance d(x, y_l). TIE stores the distance of each center y_i to every other center in a table. Therefore, the method requires K tables, each with K − 1 entries. The entries in each table are then sorted in increasing order.

Mean-Distance-Ordered Search (MOS) Method

Given an input vector x = (x_1, x_2, ..., x_D) and a center y_i = (y_{i1}, y_{i2}, ..., y_{iD}), the MOS method (Ra and Kim 1993) uses the following inequality between the mean and the distortion:

    (Σ_{d=1}^{D} (x_d − y_{id}))² ≤ D d²(x, y_i)    (5)

In this method, the centers are ordered according to their Euclidean norms prior to nearest neighbor search.
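Inequality (5) turns a D-dimensional distance bound into a comparison of component sums, which is the basis of the MOS rejection test; a minimal sketch (illustrative only, with hypothetical names):

```python
def mos_reject(x, y_j, d2_min):
    """MOS rejection test: by inequality (5), (sum of component
    differences)^2 <= D * d^2(x, y_j). So if that squared sum already
    reaches D * d2_min, then d^2(x, y_j) >= d2_min and y_j cannot beat
    the current nearest center. d2_min is the current minimum *squared*
    Euclidean distance."""
    D = len(x)
    s = sum(x[d] - y_j[d] for d in range(D))
    return s * s >= D * d2_min  # True => reject without a full distance computation
```

The two component sums can each be precomputed once per vector and per center, so after that the test costs O(1) per candidate rather than O(D).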
For a given input vector x and an initial center, the search procedure moves up and down from the initial center alternately in the pre-ordered list. Let y_i be the center with the current minimum distance, d(x, y_i), among the centers examined so far. If a center y_j satisfies (6), we do not need to calculate d(x, y_j) because d(x, y_j) ≥ d(x, y_i):

    D d²(x, y_i) ≤ (Σ_{d=1}^{D} (x_d − y_{jd}))²    (6)

Consequently, (6) does not have to be checked for a center y_k that is higher, or lower, in the pre-ordered list than y_j, as d(x, y_k) ≥ d(x, y_j) ≥ d(x, y_i) (Choi and Chae 2000).

Poggi's (PGG) Method

This method takes advantage of distance estimations to reject potential centers from consideration as the nearest center for an input vector x. Let y_l be the current nearest center to x and let y_j be any other center. The center y_j can be rejected if the following condition is satisfied (Poggi 1993):

    |(1/D) Σ_{d=1}^{D} x_d − (1/D) Σ_{d=1}^{D} y_{jd}| > d(x, y_l) / √D    (7)

If the center y_j is not rejected due to (7), then d(x, y_j) must be computed. The rejection procedure for an input vector x is shown in Algo. 2.

for j = 1 to K do
    r = d(x, y_l) / √D;
    α = β = 0;
    for d = 1 to D do
        α = α + x_d;
        β = β + y_{jd};
    α = α/D; β = β/D;
    if |α − β| > r then
        Reject y_j as a potential nearest center for x;
    else
        Calculate d(x, y_j);
Algorithm 2: Rejection procedure for Poggi's method

Activity Detection (AD) Method

It can be observed that the only changes made by k-means are local changes in the set of centers, and the amount of change differs from center to center. Therefore, some centers will stabilize much faster than others. This information is useful because the activity of the centers can be monitored, indicating whether a center was changed during the previous iteration. The centers can then be classified into two groups: active and static. The cluster of a static center is called a static cluster. The number of static centers increases as the iterations progress (Kaukoranta, Fränti, and Nevalainen 2000).

This activity information is useful for reducing the number of distance computations in the case of vectors in static clusters. Consider an input vector x in a static cluster P_i, and denote the center of this cluster by y_i. It is not difficult to see that if another static center y_j was not the nearest center for x in the previous iteration, it cannot be the nearest center in the current iteration either. A vector in a static cluster therefore cannot move to another static cluster. For these vectors, it is sufficient to calculate the distances only to the active centers, because only they may become closer than the center of the current cluster. This method is also useful for certain input vectors in an active cluster. Consider an input vector x in an active cluster α.
If the center c_α moves closer to x than it was before the update, then x can be treated as if it were in a static cluster. Note that for each input vector x, its distance to its current nearest center must be stored. For the remaining input vectors (whose distance to the current center has increased), a full search must be performed by calculating the distances to all centers.

This method requires that a set of the active centers be maintained. This is done by comparing the new and the previous centers, with at most O(KD) extra work and O(KD) extra space. For each input vector x, the method checks to see if its current nearest center y_i is in the set of active centers. If this is not the case, we perform a search for the nearest center from the set of the active centers by using any fast search method. Otherwise, when x is mapped to an active center y_i, we calculate the distance d(x, y_i). If this distance is greater than the distance in the previous iteration, a full search is necessary. If, however, the distance does not increase, only the set of the active centers is searched. To store the distances of the previous iteration, O(N) extra space is needed.

3 Proposed Accelerated Nearest Neighbor Search Method

Many centers can be omitted from computations based on their geometrical relation to the input vector. Consider an input vector x, its nearest center y_i, and a new potential center y_j. If y_j is located outside the circle centered at x with a radius equal to the Euclidean distance between x and y_i, then it should not be considered as the nearest center, since the distance from x to y_j must be strictly larger than the current nearest center distance. For y_j to be a nearer center for x, it must be inside the described circle. Evaluating whether or not a new center is inside this circle normally requires comparing the Euclidean distance between x and y_j with the radius. Such computations are expensive.
However, they can be largely avoided if the circle is inscribed in a square whose sides are tangent to the circle. Any center that lies outside of the square is also outside of the circle and thus cannot be nearer to x than y_i. Fig. 1 illustrates this in two dimensions.

[Figure 1: Geometrical relations among an input vector x and various cluster centers in two dimensions (Tai and Lin 1996)]

To see this mathematically, let d_min be the Euclidean distance between the input vector x and its nearest center y_i. Any center whose projection on axis d is outside the range [x_d − d_min, x_d + d_min] can be firmly discarded, as it has no chance of being nearer to x. This rule is applied to each axis (attribute) d ≤ D, eliminating any center that lies outside the square in Fig. 1 from further consideration.

The focus can now be narrowed to only those centers that lie inside the square. There exists an in-between region near the corners of the square where it is possible to be inside the square and yet outside the circle. Consider a circle of radius 10, centered at the origin O. The point A(9, 9) is inside the square; however, d(O, A) = √(9² + 9²) = √162 ≈ 12.73, placing it outside of the circle as √162 > 10. A potential center located in this region should be rejected. Determining whether a point is in the in-between region in this manner requires a distance calculation. It is possible to make this decision with a less expensive operation by comparing the City-block distance (8) between O and A to the radius of the circle multiplied by the square root of D (the number of dimensions). Using this approach, T(O, A) = 9 + 9 = 18 > 10√2 ≈ 14.14, so A is rejected as expected.

    T(x, y) = Σ_{d=1}^{D} |x_d − y_d|    (8)

Therefore, if T(x, y_j) ≥ d_min √D, then y_j lies in the region between the circle and the enclosing square and thus can be firmly discarded. If, at this point, the potential center has not been rejected, it becomes the new nearest center, as it lies inside the circle with a radius equal to the distance from x to y_i. The exact distance to the newly found nearest center, y_j, must now be calculated.

The absolute differences calculated in (8) may be stored in an array: V[d] = |x_d − y_d| for d ∈ {1, 2, ..., D}. These values can then be reused in the event that the potential center is not rejected, resulting in the need for a full distance calculation. Eq. (3) then becomes (9), thereby not wasting any of the calculations performed thus far:

    d²(x, y) = Σ_{d=1}^{D} (x_d − y_d)² = Σ_{d=1}^{D} V[d]²    (9)

Implementation

Implementation of the proposed method is fairly straightforward, as shown in Algo. 3.
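The radius-10 example above can be checked numerically (a throwaway sketch, not part of the paper's implementation):

```python
import math

# Circle of radius 10 centered at the origin O; candidate point A(9, 9).
d_min, A = 10.0, (9.0, 9.0)
D = len(A)

# Per-axis test: both coordinates of A lie within [-d_min, +d_min],
# so the enclosing square alone cannot reject A.
assert all(abs(a) <= d_min for a in A)

# City-block test: T(O, A) = 18 exceeds d_min * sqrt(D) ~= 14.14,
# so A is rejected without an exact Euclidean distance computation.
T = sum(abs(a) for a in A)
print(T, d_min * math.sqrt(D))  # 18.0 followed by ~14.14

# The exact distance confirms the rejection: sqrt(162) ~= 12.73 > 10.
assert math.dist((0.0, 0.0), A) > d_min
```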
Here, d_min is the current nearest neighbor distance found at any point in time, and index is the index of x's current nearest center. The RejectionTest procedure is given in Algo. 4. Note that a) it is necessary to set the variable radius equal to the square root of the current minimum distance, i.e., √d_min, rather than simply d_min, as we are using the squared Euclidean distance to measure the dissimilarity of two vectors, and b) the value of D does not change during the course of nearest neighbor search. Therefore, the square root of D is calculated only once, outside of this procedure.

d_min = d(x, y_index);
for j = 1 to K do
    if RejectionTest(x, y_j, d_min) == false then
        dist = 0;
        for d = 1 to D do
            dist = dist + V[d]²;
        d_min = dist;
        index = j;
Algorithm 3: Proposed accelerated nearest neighbor search method

radius = √d_min;
T = 0;
for d = 1 to D do
    V[d] = |x_d − y_d|;
    if V[d] ≥ radius then
        return true;
    T = T + V[d];
if T ≥ radius √D then
    return true;
else
    return false;
Algorithm 4: Rejection test for the proposed method

4 Experimental Results

Eight data sets from the UCI Machine Learning Repository (Frank and Asuncion 2012) were used in the experiments. The data set descriptions are given in Table 1.

Table 1: Descriptions of the Data Sets (N: # points, D: # attributes)
1. Concrete Compressive Strength
2. Covertype
3. Landsat Satellite (Statlog)
4. Letter Recognition
5. Parkinsons
6. Person Activity
7. Shuttle (Statlog)
8. Steel Plates Faults
[The N and D values are not legible in this copy.]

Each method was executed 100 times, with each run initialized using a set of centers randomly chosen from the input vectors. Tables 2 and 4 give the average percentage of distance computations and the average percentage of centers rejected per iteration for K ∈ {2, 4, 8, 16, 32}, respectively. Note that in the former table smaller values are better, whereas in the latter one larger values are better. Tables 3 and 5 summarize Tables 2 and 4, respectively, by averaging each method across the eight data sets for each K value.
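Algorithms 3 and 4 combine into a compact sketch (an illustration, not the authors' code; it adds an explicit final comparison of the exact distance against the current minimum, since the two cheap tests only prove that a candidate is outside the enclosing square or the corner region, not that it is inside the circle):

```python
import math

def nearest_center(x, centers):
    """Accelerated nearest neighbor search: per-axis and City-block
    rejection tests avoid most full squared-Euclidean computations."""
    D = len(x)
    sqrt_D = math.sqrt(D)  # computed once, outside the rejection test
    # Start from an arbitrary candidate (index 0) as the current nearest.
    index = 0
    d2_min = sum((x[d] - centers[0][d]) ** 2 for d in range(D))
    for j in range(1, len(centers)):
        radius = math.sqrt(d2_min)  # d2_min is a *squared* distance
        y = centers[j]
        V = []           # absolute per-axis differences, reused for the full distance
        T = 0.0          # running City-block distance
        rejected = False
        for d in range(D):
            v = abs(x[d] - y[d])
            if v >= radius:          # outside the enclosing square: discard
                rejected = True
                break
            V.append(v)
            T += v
        if rejected or T >= radius * sqrt_D:   # corner region between square and circle
            continue
        dist = sum(v * v for v in V)           # full distance (9), reusing V
        if dist < d2_min:                      # final exact check
            d2_min, index = dist, j
    return index, d2_min
```

For example, nearest_center((9.0, 1.0), [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]) identifies the third center, computing full distances only for candidates that survive both cheap tests.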
It can be seen that, when compared to the other methods, the proposed method reduces the number of distance computations significantly and rejects a substantial number of centers in each iteration. Furthermore, the advantage of our method is more prominent for larger K values.

[Table 2: % of full distance computations per iteration; Table 3: Summary of Table 2; Table 4: % of rejected centers per iteration; Table 5: Summary of Table 4. The numeric entries for the PGG, AD, MOS, TIE, and Proposed methods are not legible in this copy.]

5 Conclusions

In this paper, we presented a novel approach to accelerate the k-means clustering algorithm based on an efficient nearest neighbor search method. The method exploits the geometrical relations among the input vectors and the cluster centers to reduce the number of necessary distance computations without compromising the clustering accuracy. Experiments on a diverse collection of data sets from the UCI Machine Learning Repository demonstrated the superiority of our method over four well-known methods.

6 Acknowledgments

This publication was made possible by a grant from the National Science Foundation ( ).

References

Aloise, D.; Deshpande, A.; Hansen, P.; and Popat, P. 2009. NP-Hardness of Euclidean Sum-of-Squares Clustering. Machine Learning 75(2).
Celebi, M. E., and Kingravi, H. 2012. Deterministic Initialization of the K-Means Algorithm Using Hierarchical Clustering. International Journal of Pattern Recognition and Artificial Intelligence 26(7).
Celebi, M. E.; Kingravi, H.; and Vela, P. A. 2013. A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm. Expert Systems with Applications 40(1).
Chen, S. H., and Hsieh, W. M. 1991. Fast Algorithm for VQ Codebook Design. IEE Proceedings I: Communications, Speech and Vision 138(5).
Choi, S. Y., and Chae, S. I. 2000. Extended Mean-Distance-Ordered Search Using Multiple l1 and l2 Inequalities for Fast Vector Quantization. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing 47(4).
Frank, A., and Asuncion, A. 2012. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences.
Huang, S. H., and Chen, S. H. 1990. Fast Encoding Algorithm for VQ-based Image Coding. Electronics Letters 26(19).
Jain, A. K.; Murty, M. N.; and Flynn, P. J. 1999. Data Clustering: A Review. ACM Computing Surveys 31(3).
Jain, A. K. 2010. Data Clustering: 50 Years Beyond K-means. Pattern Recognition Letters 31(8).
Kaukoranta, T.; Fränti, P.; and Nevalainen, O. 2000. A Fast Exact GLA Based on Code Vector Activity Detection. IEEE Transactions on Image Processing 9(8).
Linde, Y.; Buzo, A.; and Gray, R. 1980. An Algorithm for Vector Quantizer Design. IEEE Transactions on Communications 28(1).
Lloyd, S. 1982. Least Squares Quantization in PCM. IEEE Transactions on Information Theory 28(2).
Mahajan, M.; Nimbhorkar, P.; and Varadarajan, K. 2012. The Planar k-Means Problem is NP-hard. Theoretical Computer Science 442.
Orchard, M. T. 1991. A Fast Nearest-Neighbor Search Algorithm. In Proceedings of the 1991 International Conference on Acoustics, Speech, and Signal Processing, volume 4.
Phillips, S. 2002. Acceleration of K-Means and Related Clustering Algorithms. In Proceedings of the 4th International Workshop on Algorithm Engineering and Experiments.
Poggi, G. 1993. Fast Algorithm for Full-Search VQ Encoding. Electronics Letters 29(12).
Ra, S. W., and Kim, J. K. 1993. A Fast Mean-Distance-Ordered Partial Codebook Search Algorithm for Image Vector Quantization. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing 40(9).
Tai, S. C.; Lai, C. C.; and Lin, Y. C. 1996. Two Fast Nearest Neighbor Searching Algorithms for Image Vector Quantization. IEEE Transactions on Communications 44(12).
Tarsitano, A. 2003. A Computational Study of Several Relocation Methods for K-Means Algorithms. Pattern Recognition 36(12).
Consensus clustering by graph based approach Haytham Elghazel 1, Khalid Benabdeslemi 1 and Fatma Hamdi 2 1- University of Lyon 1, LIESP, EA4125, F-69622 Villeurbanne, Lyon, France; {elghazel,kbenabde}@bat710.univ-lyon1.fr
More information9.1. K-means Clustering
424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific
More informationRecent Progress on RAIL: Automating Clustering and Comparison of Different Road Classification Techniques on High Resolution Remotely Sensed Imagery
Recent Progress on RAIL: Automating Clustering and Comparison of Different Road Classification Techniques on High Resolution Remotely Sensed Imagery Annie Chen ANNIEC@CSE.UNSW.EDU.AU Gary Donovan GARYD@CSE.UNSW.EDU.AU
More informationCompression of Image Using VHDL Simulation
Compression of Image Using VHDL Simulation 1) Prof. S. S. Mungona (Assistant Professor, Sipna COET, Amravati). 2) Vishal V. Rathi, Abstract : Maintenance of all essential information without any deletion
More informationEfficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points
Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points Dr. T. VELMURUGAN Associate professor, PG and Research Department of Computer Science, D.G.Vaishnav College, Chennai-600106,
More informationCHAPTER 4: CLUSTER ANALYSIS
CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis
More informationRoad map. Basic concepts
Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?
More informationCHAPTER 6 INFORMATION HIDING USING VECTOR QUANTIZATION
CHAPTER 6 INFORMATION HIDING USING VECTOR QUANTIZATION In the earlier part of the thesis different methods in the spatial domain and transform domain are studied This chapter deals with the techniques
More informationFigure (5) Kohonen Self-Organized Map
2- KOHONEN SELF-ORGANIZING MAPS (SOM) - The self-organizing neural networks assume a topological structure among the cluster units. - There are m cluster units, arranged in a one- or two-dimensional array;
More informationA SURVEY ON CLUSTERING ALGORITHMS Ms. Kirti M. Patil 1 and Dr. Jagdish W. Bakal 2
Ms. Kirti M. Patil 1 and Dr. Jagdish W. Bakal 2 1 P.G. Scholar, Department of Computer Engineering, ARMIET, Mumbai University, India 2 Principal of, S.S.J.C.O.E, Mumbai University, India ABSTRACT Now a
More informationHard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering
An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other
More informationA CSP Search Algorithm with Reduced Branching Factor
A CSP Search Algorithm with Reduced Branching Factor Igor Razgon and Amnon Meisels Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, 84-105, Israel {irazgon,am}@cs.bgu.ac.il
More informationBinary vector quantizer design using soft centroids
Signal Processing: Image Communication 14 (1999) 677}681 Binary vector quantizer design using soft centroids Pasi FraK nti *, Timo Kaukoranta Department of Computer Science, University of Joensuu, P.O.
More informationSeismic regionalization based on an artificial neural network
Seismic regionalization based on an artificial neural network *Jaime García-Pérez 1) and René Riaño 2) 1), 2) Instituto de Ingeniería, UNAM, CU, Coyoacán, México D.F., 014510, Mexico 1) jgap@pumas.ii.unam.mx
More informationClustering: K-means and Kernel K-means
Clustering: K-means and Kernel K-means Piyush Rai Machine Learning (CS771A) Aug 31, 2016 Machine Learning (CS771A) Clustering: K-means and Kernel K-means 1 Clustering Usually an unsupervised learning problem
More informationHierarchical Ordering for Approximate Similarity Ranking
Hierarchical Ordering for Approximate Similarity Ranking Joselíto J. Chua and Peter E. Tischer School of Computer Science and Software Engineering Monash University, Victoria 3800, Australia jjchua@mail.csse.monash.edu.au
More information2. CNeT Architecture and Learning 2.1. Architecture The Competitive Neural Tree has a structured architecture. A hierarchy of identical nodes form an
Competitive Neural Trees for Vector Quantization Sven Behnke and Nicolaos B. Karayiannis Department of Mathematics Department of Electrical and Computer Science and Computer Engineering Martin-Luther-University
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster
More informationClustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York
Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity
More informationA LOSSLESS INDEX CODING ALGORITHM AND VLSI DESIGN FOR VECTOR QUANTIZATION
A LOSSLESS INDEX CODING ALGORITHM AND VLSI DESIGN FOR VECTOR QUANTIZATION Ming-Hwa Sheu, Sh-Chi Tsai and Ming-Der Shieh Dept. of Electronic Eng., National Yunlin Univ. of Science and Technology, Yunlin,
More informationCluster Analysis. Jia Li Department of Statistics Penn State University. Summer School in Statistics for Astronomers IV June 9-14, 2008
Cluster Analysis Jia Li Department of Statistics Penn State University Summer School in Statistics for Astronomers IV June 9-1, 8 1 Clustering A basic tool in data mining/pattern recognition: Divide a
More informationData Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation
Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization
More informationEfficient Method for Half-Pixel Block Motion Estimation Using Block Differentials
Efficient Method for Half-Pixel Block Motion Estimation Using Block Differentials Tuukka Toivonen and Janne Heikkilä Machine Vision Group Infotech Oulu and Department of Electrical and Information Engineering
More information26 The closest pair problem
The closest pair problem 1 26 The closest pair problem Sweep algorithms solve many kinds of proximity problems efficiently. We present a simple sweep that solves the two-dimensional closest pair problem
More informationAPPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE
APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE Sundari NallamReddy, Samarandra Behera, Sanjeev Karadagi, Dr. Anantha Desik ABSTRACT: Tata
More informationAssociative Cellular Learning Automata and its Applications
Associative Cellular Learning Automata and its Applications Meysam Ahangaran and Nasrin Taghizadeh and Hamid Beigy Department of Computer Engineering, Sharif University of Technology, Tehran, Iran ahangaran@iust.ac.ir,
More informationHartigan s Method for K-modes Clustering and Its Advantages
Proceedings of the Twelfth Australasian Data Mining Conference (AusDM 201), Brisbane, Australia Hartigan s Method for K-modes Clustering and Its Advantages Zhengrong Xiang 1 Md Zahidul Islam 2 1 College
More informationAccelerating Unique Strategy for Centroid Priming in K-Means Clustering
IJIRST International Journal for Innovative Research in Science & Technology Volume 3 Issue 07 December 2016 ISSN (online): 2349-6010 Accelerating Unique Strategy for Centroid Priming in K-Means Clustering
More informationPattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition
Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant
More informationIntroduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering
Introduction to Pattern Recognition Part II Selim Aksoy Bilkent University Department of Computer Engineering saksoy@cs.bilkent.edu.tr RETINA Pattern Recognition Tutorial, Summer 2005 Overview Statistical
More informationThis blog addresses the question: how do we determine the intersection of two circles in the Cartesian plane?
Intersecting Circles This blog addresses the question: how do we determine the intersection of two circles in the Cartesian plane? This is a problem that a programmer might have to solve, for example,
More informationClustering in Data Mining
Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,
More informationA Graph Theoretic Approach to Image Database Retrieval
A Graph Theoretic Approach to Image Database Retrieval Selim Aksoy and Robert M. Haralick Intelligent Systems Laboratory Department of Electrical Engineering University of Washington, Seattle, WA 98195-2500
More informationParticle Swarm Optimization applied to Pattern Recognition
Particle Swarm Optimization applied to Pattern Recognition by Abel Mengistu Advisor: Dr. Raheel Ahmad CS Senior Research 2011 Manchester College May, 2011-1 - Table of Contents Introduction... - 3 - Objectives...
More informationFast nearest-neighbor search algorithms based on approximation-elimination search
Pattern Recognition 33 (2000 1497}1510 Fast nearest-neighbor search algorithms based on approximation-elimination search V. Ramasubramanian, Kuldip K. Paliwal* Computer Systems and Communications Group,
More informationMultilevel Compression Scheme using Vector Quantization for Image Compression
Multilevel Compression Scheme using Vector Quantization for Image Compression S.Vimala, B.Abidha, and P.Uma Abstract In this paper, we have proposed a multi level compression scheme, in which the initial
More informationRelative Constraints as Features
Relative Constraints as Features Piotr Lasek 1 and Krzysztof Lasek 2 1 Chair of Computer Science, University of Rzeszow, ul. Prof. Pigonia 1, 35-510 Rzeszow, Poland, lasek@ur.edu.pl 2 Institute of Computer
More informationClustering. Unsupervised Learning
Clustering. Unsupervised Learning Maria-Florina Balcan 04/06/2015 Reading: Chapter 14.3: Hastie, Tibshirani, Friedman. Additional resources: Center Based Clustering: A Foundational Perspective. Awasthi,
More informationA note on quickly finding the nearest neighbour
A note on quickly finding the nearest neighbour David Barber Department of Computer Science University College London May 19, 2014 1 Finding your nearest neighbour quickly Consider that we have a set of
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.
More informationDigital Signal Processing
Digital Signal Processing 20 (2010) 1494 1501 Contents lists available at ScienceDirect Digital Signal Processing www.elsevier.com/locate/dsp Fast k-nearest neighbors search using modified principal axis
More informationHierarchical Minimum Spanning Trees for Lossy Image Set Compression
Hierarchical Minimum Spanning Trees for Lossy Image Set Compression Anthony Schmieder, Barry Gergel, and Howard Cheng Department of Mathematics and Computer Science University of Lethbridge, Alberta, Canada
More informationDecision Making. final results. Input. Update Utility
Active Handwritten Word Recognition Jaehwa Park and Venu Govindaraju Center of Excellence for Document Analysis and Recognition Department of Computer Science and Engineering State University of New York
More informationBipartite Graph Partitioning and Content-based Image Clustering
Bipartite Graph Partitioning and Content-based Image Clustering Guoping Qiu School of Computer Science The University of Nottingham qiu @ cs.nott.ac.uk Abstract This paper presents a method to model the
More informationINF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering
INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,
More informationChapter 2 Basic Structure of High-Dimensional Spaces
Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,
More informationClustering and Visualisation of Data
Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some
More informationGeometric Computations for Simulation
1 Geometric Computations for Simulation David E. Johnson I. INTRODUCTION A static virtual world would be boring and unlikely to draw in a user enough to create a sense of immersion. Simulation allows things
More informationEfficient and optimal block matching for motion estimation
Efficient and optimal block matching for motion estimation Stefano Mattoccia Federico Tombari Luigi Di Stefano Marco Pignoloni Department of Electronics Computer Science and Systems (DEIS) Viale Risorgimento
More informationK-modes Clustering Algorithm for Categorical Data
K-modes Clustering Algorithm for Categorical Data Neha Sharma Samrat Ashok Technological Institute Department of Information Technology, Vidisha, India Nirmal Gaud Samrat Ashok Technological Institute
More informationA Study of Hierarchical and Partitioning Algorithms in Clustering Methods
A Study of Hierarchical Partitioning Algorithms in Clustering Methods T. NITHYA Dr.E.RAMARAJ Ph.D., Research Scholar Dept. of Computer Science Engg. Alagappa University Karaikudi-3. th.nithya@gmail.com
More informationBBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler
BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for
More informationUnsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing
Unsupervised Data Mining: Clustering Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 1. Supervised Data Mining Classification Regression Outlier detection
More informationCharacter Recognition
Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches
More informationModule 1 Lecture Notes 2. Optimization Problem and Model Formulation
Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization
More informationLecture 5 Finding meaningful clusters in data. 5.1 Kleinberg s axiomatic framework for clustering
CSE 291: Unsupervised learning Spring 2008 Lecture 5 Finding meaningful clusters in data So far we ve been in the vector quantization mindset, where we want to approximate a data set by a small number
More informationAn Approach to Polygonal Approximation of Digital CurvesBasedonDiscreteParticleSwarmAlgorithm
Journal of Universal Computer Science, vol. 13, no. 10 (2007), 1449-1461 submitted: 12/6/06, accepted: 24/10/06, appeared: 28/10/07 J.UCS An Approach to Polygonal Approximation of Digital CurvesBasedonDiscreteParticleSwarmAlgorithm
More informationThe Application of K-medoids and PAM to the Clustering of Rules
The Application of K-medoids and PAM to the Clustering of Rules A. P. Reynolds, G. Richards, and V. J. Rayward-Smith School of Computing Sciences, University of East Anglia, Norwich Abstract. Earlier research
More informationToday. Lecture 4: Last time. The EM algorithm. We examine clustering in a little more detail; we went over it a somewhat quickly last time
Today Lecture 4: We examine clustering in a little more detail; we went over it a somewhat quickly last time The CAD data will return and give us an opportunity to work with curves (!) We then examine
More information