Fast, single-pass K-means algorithms

Fredrik Farnstrom
Computer Science and Engineering
Lund Institute of Technology, Sweden
arnstrom@ucsd.edu

James Lewis
Computer Science and Engineering
University of California, San Diego
lewis@cs.ucsd.edu

Abstract

We discuss the issue of how well K-means scales to large databases. We evaluate the performance of our implementation of a scalable variant of K-means, from Bradley, Fayyad and Reina (1998b), that uses several, fairly complicated, types of compression to fit points into a fixed-size buffer, which is then used for the clustering. The running time of the algorithm and the quality of the resulting clustering are measured, for a synthetic dataset and the database from the KDD '98 Cup data mining contest. Lesion studies are performed to see if all types of compression are needed. We find that a simple special case of the algorithm, where all points are represented by their sufficient statistics after being used to update the model once, produces almost as good clusterings as the full scalable K-means algorithm in much less time. Setting all the parameters of the scalable algorithm was difficult, and we were unable to make it cluster as well as the regular K-means operating on the whole dataset.

Introduction

Clustering is a method for grouping together similar data items in a database. Similar data items can be seen as being generated from a probability distribution which is an unobserved feature of the data elements. The clustering problem involves determining the parameters of the probability distributions which generated the observable data elements. The K-means algorithm was developed as a solution to the clustering problem. K-means is an iterative refinement algorithm. It assumes data points are drawn from a fixed number (K) of clusters, and it attempts to determine the parameters of each cluster by making a hard assignment of each data point to one of the K clusters.
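This iterative hard-assignment scheme can be sketched as follows (a minimal Python sketch; the function and variable names are ours, not from the paper):

```python
import numpy as np

def kmeans(points, k, rng, max_iter=100):
    """Plain K-means: hard-assign each point to its nearest mean,
    then re-estimate each mean; stop when assignments are stable."""
    means = points[rng.choice(len(points), size=k, replace=False)]
    assign = np.full(len(points), -1)
    for _ in range(max_iter):
        # Hard assignment: index of the nearest cluster mean per point.
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break  # no point changed membership: converged
        assign = new_assign
        # Re-estimate each cluster mean from its assigned points.
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:
                means[j] = members.mean(axis=0)
    return means, assign
```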
The algorithm is an iterative process of assigning cluster membership and re-estimating the cluster parameters. It terminates when the data points no longer change membership under the re-estimated cluster parameters. When a data point changes cluster membership, it can only move to a neighbor of its current cluster. Therefore, only the distances from the data point to the neighboring cluster means need to be checked to determine whether the point will change membership. This insight can make the K-means algorithm more efficient than checking the distance from a point to each of the K cluster means. However, in high dimensions, datasets tend to be sparse and the number of neighboring clusters may be large. The time complexity of the algorithm is driven by the access time for each data element, with each element needing to be accessed on every iteration. The K-means algorithm can therefore become inefficient for large databases. A modified version of the algorithm which addresses these inefficiencies is investigated here.

The scalable K-means algorithm

Bradley, Fayyad and Reina (1998b) describe a method of scaling clustering algorithms, in particular K-means, to large datasets. The idea is to use a buffer where points from the database are saved in compressed form. First, the means of the clusters are initialized as in ordinary K-means. Then, all available space in the buffer is filled up with points from the database. The current model is updated on the buffer contents in the usual way. The buffer contents are then compressed in two steps. The first step, called primary compression, finds points that are close to the cluster they are currently assigned to. These points are unlikely to end up in a different cluster. There are two methods to do this. The first measures the Mahalanobis distance from the point to the cluster mean it is associated with, and discards the point if it is within a certain radius. For the second method, confidence intervals are set up for the cluster means.
Then, for each point, a worst-case scenario is set up by perturbing the cluster means within the confidence intervals. The cluster mean that is associated with the point is moved away from the point, and the cluster means of all other clusters are moved towards the point. If the point is still closest to the same cluster mean after the perturbations, it is unlikely to change cluster membership. Points that are unlikely to change cluster membership are removed from the buffer and placed in one of the discard sets. Each of the main clusters has a discard set, represented by the sufficient statistics for all points belonging to that cluster that have been discarded. On the rest of the points in the buffer, a secondary K-means clustering is performed, with a larger number of clusters than for the main clustering. The reason for doing this is to try to save buffer space by storing some of these clusters instead of the points. In order to replace points in the buffer with a secondary cluster, the cluster must fulfill a tightness criterion, meaning that its standard deviation in each dimension must be below a certain threshold. Clusters are joined with each other using hierarchical agglomerative clustering, as long as the merged clusters are tight enough. The rest of the points in the buffer are called the retained set. The space in the buffer that has become available because of the compression is now filled up with new points, and the whole procedure is repeated. The algorithm ends after one scan of the dataset, or if the main cluster means do not change much as more points are added.

We implemented the algorithm in C++, without reusing any existing code. The platform for our experiments was a dual Pentium II workstation with 256 MB of RAM, running Linux. Our program is not multi-threaded, so only one of the processors is directly used for the experiments. The program is compiled with full optimization turned on. Some details of the implementation of the scalable K-means algorithm are not given by Bradley et al. (1998b). During primary compression, a Mahalanobis radius that discards a certain fraction of the newly sampled points belonging to a cluster must be determined. Our implementation computes the distance between each newly sampled point and the cluster it is assigned to, and, for each cluster, sorts the list. Then it is easy to find a radius such that a certain fraction of points is discarded. Note that this approach may change the time complexity of the whole algorithm.
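The radius selection by sorting just described can be sketched like this (a simplified sketch assuming a diagonal covariance, so the Mahalanobis distance reduces to a per-dimension scaled Euclidean distance; the names are ours):

```python
import numpy as np

def discard_radius(points, mean, std, fraction):
    """Find a Mahalanobis radius such that `fraction` of the newly
    sampled points assigned to this cluster fall within it.
    Assumes a diagonal covariance, i.e. per-dimension std. devs."""
    # With a diagonal covariance, the Mahalanobis distance is a
    # Euclidean distance on per-dimension standardized coordinates.
    d = np.sqrt((((points - mean) / std) ** 2).sum(axis=1))
    d.sort()  # ascending distances; the sort dominates: O(n log n)
    idx = max(0, int(fraction * len(d)) - 1)
    return d[idx]  # points with distance <= radius are discarded
```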
It may be possible to implement this more efficiently, at least when the fraction of discarded points is small. For all main and secondary clusters, our implementation stores the sufficient statistics (sum of elements, squared sum of elements, number of points) as well as the mean and standard deviation of that cluster. The mean is stored so that the distance between old and new means (the new mean is computed from the sum of the elements) can be computed when doing K-means clustering, and the standard deviation is stored to speed up the primary compression. This makes each cluster occupy about four times as much space as a point. One problem that may occur during secondary compression is that a tight cluster does not fit in the buffer. Therefore, each time a new tight cluster is found (after joining), the space occupied by that cluster is compared to the space occupied by the points it will replace, and the points are compressed only if this reduces the occupied space in the buffer.

For our purposes, the sufficient statistics are represented by two vectors, Sum and SumSq, and one integer n. The vectors store the sum and the sum of squares of the elements of the points in the cluster, and the integer keeps track of the number of points in the cluster. From these statistics, the mean and variance along each dimension can be calculated. Let the sufficient statistics of a cluster A be given by (Sum(A), SumSq(A), n(A)). If a point x is added to the cluster, the sufficient statistics are updated according to

  Sum(A)   := Sum(A) + x
  SumSq(A) := SumSq(A) + x^2
  n(A)     := n(A) + 1

where x^2 denotes the element-wise square. If clusters A and B are merged, the sufficient statistics for the resulting cluster C are

  Sum(C)   = Sum(A) + Sum(B)
  SumSq(C) = SumSq(A) + SumSq(B)
  n(C)     = n(A) + n(B)

Synthetic Dataset

Synthetic datasets are generated to experiment with the modified K-means algorithms.
This allows the cluster means determined by the algorithms to be compared with the known probability distributions of the synthetic data. The data points are drawn from a fixed number of Gaussian distributions. Each Gaussian is assigned a random weight which determines the probability of generating a data point from that distribution. The mean and variance of each Gaussian distribution are uniformly sampled, for each dimension, from the intervals [-5, 5] and [0.7, 1.5] respectively.

To determine the accuracy of the clustering, the true cluster means must be compared with the estimated cluster means. This requires discovering which true cluster mean corresponds to which estimated mean. As long as the number of clusters K is small, it is feasible to use the one of the K! permutations that yields the smallest total distance. This is what we do, and it is the reason why only a few clusters are used in our experiments.

Lesion Studies

To evaluate the contribution of each data compression method, lesion studies have been conducted. The experiments are run using the synthetic datasets with 20 dimensions and 100000 data points. Comparisons are made between four variations of the scalable K-means algorithm, the regular K-means algorithm, and a naive scalable K-means algorithm described in the next section. Each method clusters the same synthetic dataset and, except for regular K-means, works within a limited buffer large enough to hold approximately 10% of the data points. The number of clusters is fixed at five for these lesion studies.

The four variations of the scalable K-means algorithm are obtained by adding the data compression methods one at a time. The first variation does not use any of the data compression techniques proposed in the scalable K-means algorithm: the K-means algorithm runs until convergence on the first fill of the buffer, and no points are added or removed after this initial fill. This variation is similar to clustering a 10% sample of the data points. For the next variation, the first primary compression technique is used: points within a certain Mahalanobis distance from their associated cluster mean are moved to the discard set. Next, the second primary compression technique is added: confidence intervals are used to discard data points unlikely to change cluster membership. The fourth variation includes the secondary data compression technique of determining secondary clusters. For secondary compression the number of secondary clusters is fixed at 25. All other constants used in these experiments are shown in Table 1.

  Parameter                            Value
  Confidence level for cluster means   95%
  Max std. dev. for tight clusters     1.1
  Number of secondary clusters         25
  Fraction of points discarded (p)     10%

Table 1: The parameters used for the lesion studies of the scalable K-means algorithm.

Naive scalable K-means

A special case of the scalable K-means algorithm arises when all points in the buffer are discarded each time. The algorithm is:

1. Randomly initialize the cluster means. Let each cluster have a discard set in the buffer. The discard set keeps track of the sufficient statistics for all points from previous iterations.
2. Fill the buffer with points.
3. Perform K-means iterations on the points and discard sets in the buffer, until convergence.
For this clustering, the discard sets are treated like regular points placed at the means of the discard sets, but weighted with the number of points in the discard sets.
4. For each cluster, update the sufficient statistics of the discard set with the points assigned to the cluster. Remove all points from the buffer.
5. If the database is empty, end. Otherwise, repeat from step 2.

In the rest of this paper this algorithm is called naive scalable K-means. Like the scalable algorithm, it performs only one scan over the dataset and works on a fixed-size buffer, but it requires much less computation each time the buffer is filled, and the whole buffer can be filled with new points for each iteration.

Results

The experiments were run on 100 different synthetic datasets. The results are shown in Figure 1 and Figure 2. For each dataset, five different clustering models are generated from different initial starting positions. The best model is retained for a comparison of the accuracy of the clustering. Some of the synthetic datasets could not be effectively clustered by any of the methods used in the experiments. These datasets distort the quality measurements and are removed before comparing the methods. We believe that using a larger number of initial starting positions may reduce the number of datasets which do not converge to a solution. Of the 100 synthetic datasets, 24 are not used. To compare the running times, all of the 100 trials are used. The running times over all initial starting positions are accumulated for each clustering method.

The results of the lesion studies show no increase in accuracy in identifying the clusters as each data compression technique is added, but a significant increase in running time. Except for the first scalable K-means variation, each method used all of the points in the datasets. Therefore, none of the experiments produced a model which converged using less than the full dataset.
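For concreteness, the naive scalable K-means loop (steps 1-5 above) can be sketched as follows (a Python sketch under our own naming; convergence on each buffer fill is approximated by a fixed number of iterations):

```python
import numpy as np

def weighted_kmeans_step(items, weights, means):
    """One K-means pass over weighted items: hard-assign each item to
    its nearest mean, then re-estimate means as weighted averages."""
    dists = np.linalg.norm(items[:, None, :] - means[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    for j in range(len(means)):
        w = weights[assign == j]
        if w.sum() > 0:
            means[j] = np.average(items[assign == j], axis=0, weights=w)
    return assign

def naive_scalable_kmeans(stream, k, buffer_size, rng, iters=20):
    """Naive scalable K-means: fill the buffer, cluster the buffered
    points together with the per-cluster discard sets (each treated as
    one point at the discard set's mean, weighted by its count), then
    fold every buffered point into its cluster's discard set."""
    dim = stream.shape[1]
    means = stream[rng.choice(len(stream), size=k, replace=False)].copy()
    disc_sum = np.zeros((k, dim))   # sufficient statistics: sums
    disc_n = np.zeros(k)            # sufficient statistics: counts
    for start in range(0, len(stream), buffer_size):   # one scan
        buf = stream[start:start + buffer_size]
        for _ in range(iters):  # "until convergence" on the buffer
            have = disc_n > 0
            items = np.vstack([buf, disc_sum[have] / disc_n[have][:, None]])
            weights = np.concatenate([np.ones(len(buf)), disc_n[have]])
            assign = weighted_kmeans_step(items, weights, means)
        # discard all buffered points into their clusters' discard sets
        for i, x in enumerate(buf):
            disc_sum[assign[i]] += x
            disc_n[assign[i]] += 1
    return means
```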
The cluster quality is about the same for each of the methods which used the whole dataset. The running time is directly related to the complexity of the algorithms: each additional data compression technique adds to the running time. The regular K-means algorithm converges to a solution faster than any of the scalable K-means algorithms which used data compression. The naive approach runs faster than the other methods while producing the same level of accuracy. It should be noted that the variation using the secondary compression method found no points tight enough to be merged as secondary clusters. The inability of the lesion studies to show any significant differences between the quality of the clusterings may have resulted from the use of the synthetic datasets.

[Figures 1 and 2: bar charts comparing cluster quality (distance between estimated and true means) and running time [s] for R10, S10--, S10-, S10, N10 and K.]

Figure 1: The graph shows the sum of the distances between the estimated and true cluster means, for a synthetic dataset of 100000 points and five clusters. The algorithms are random sampling K-means (no compression) (R10), scalable K-means with primary 1 compression only (S10--), with primary 1 and 2 compression (S10-), with all types of compression (S10), the naive scalable K-means (N10), and the regular K-means operating on the whole dataset (K). Error bars show the standard errors.

KDD '98 Dataset

To experiment with a real-world database which is sufficiently large to evaluate the scalability of the proposed algorithm, the dataset from the 1998 KDD (Knowledge Discovery and Data Mining) Cup data mining contest is used. This database contains statistical information about people who have made charitable donations in response to direct mailing requests. Clustering can be used to identify groups of donors who may require specialized requests in order to maximize donation profits. The database contains 95412 records, each of which has 481 statistical fields. To test the clustering algorithms, we selected 56 features from these fields. The features are represented by vectors of floating point numbers. Numerical data (e.g. donation amounts, income, age) are stored as floating point numbers. Date values (e.g. donation dates, date of birth) are stored as the number of months from a fixed minimum date. Binary features, such as belonging to a particular income category, are also stored as floating point numbers. Of the 56 features that are used, 18 are binary. To give equal weight to each feature, each of the fields is normalized to have a mean of zero and a variance of one. The records in the database are converted to this format and saved to a separate file which is used to retrieve data points during the clustering process.

The purpose of this experiment is to compare the running time and cluster quality of the regular K-means, operating on the whole dataset or on resamples, the scalable K-means algorithm using all types of compression, and the naive scalable K-means where every point is put in the discard set after being used to update the model. The experiment is performed with resamples and buffers of 10% and 1% of the size of the whole dataset. The number of clusters is 10. First, the dataset is randomly reordered. Then it is clustered five times by all algorithms, each time with new randomly chosen initial conditions (but with all algorithms starting from the same initial conditions). The running time is measured. The quality of the clustering is measured by the sum of the squared distances between each point and the cluster mean it is associated with.
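This quality measure (often called the distortion) can be computed as follows (a small sketch; the names are ours):

```python
import numpy as np

def distortion(points, means):
    """Sum of squared distances between each point and the mean of
    the cluster it is assigned to (lower is better)."""
    dists = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return dists.min(axis=1).sum()
```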
Of these five clusterings, the one with the best quality is used, because K-means is known to be sensitive to initial conditions. This whole procedure is repeated 14 times with different random orders of the dataset.

Figure 2: The graph shows the average running times for the modified versions of the scalable K-means algorithm. Error bars show the standard errors.

  Parameter                            Value
  Confidence level for cluster means   95%
  Max std. dev. for tight clusters     1.1
  Number of secondary clusters         40
  Fraction of points discarded (p)     20%

Table 2: The parameters used for the scalable K-means algorithm.

For the scalable clustering, we found it hard to set the parameters, especially for the secondary clustering. If the number of secondary clusters is low, many points are assigned to each cluster. This means that in order to get any tight clusters at all, the maximum allowed standard deviation must be quite high. If, on the other hand, the number of secondary clusters is too high, it takes too much time to run the secondary K-means clustering and merge the resulting clusters, with only slightly better clusterings. The parameter values we used are given in Table 2.

Figure 3 shows the average quality of the best of the five clusterings, and Figure 4 shows the average running time of all the clusterings. From Figure 3 it can be seen that the quality of the random sampling K-means operating on a 1% resample is much worse than everything else, and that ordinary K-means produces the best clusterings, as can be expected. Larger buffers produce better clusterings. The scalable K-means produces slightly better clusterings than the naive scalable K-means, but the difference is not significant. The running time is by far the highest for the ordinary K-means, but scalable K-means takes much time as well, compared to the naive scalable and random sampling K-means.
[Figures 3 and 4: bar charts comparing cluster distortion (squared point-cluster distance, x 10^6) and running time [s] for S10, S1, N10, N1, R10, R1 and K.]

Figure 3: The graph shows the sum of the squared distances between each point in the dataset and the cluster mean it is associated with, on the KDD Cup '98 dataset of 95412 points and 10 clusters. The algorithms are the scalable K-means (S10 and S1), the "naive" scalable K-means (N10 and N1), random sampling K-means (R10 and R1), and the ordinary K-means working on the whole dataset (K). Names ending with 10 use a buffer or resample with size 10% of the whole dataset; names ending with 1 use a 1% buffer or resample. Error bars show the standard errors.

Complexity

Performing one K-means iteration over a set of n entities has time complexity O(n). Here, it is assumed that O(log n) iterations over the set are needed for the clusters to converge. Table 3 shows the time, space and disk I/O complexity of regular, scalable and naive scalable K-means, where m is the size of the RAM buffer. The O(n log m) time complexity of the scalable K-means algorithm comes from the K-means iterations over the buffer, the sorting on Mahalanobis distance in our implementation, and the secondary K-means clustering. Our experiments show that the constant in the time complexity of the scalable K-means is large compared to the constant for regular K-means, and that the gain from using the scalable method comes from reducing the number of disk accesses. This can be seen from the running times on the synthetic and real-world datasets: for the synthetic datasets, regular K-means runs faster than scalable K-means, but for the real-world dataset the scalable K-means is faster.
                      Time        Space  Disk I/O
  Regular K-means     O(n log n)  O(1)   O(n log n)
  Scalable K-means    O(n log m)  O(m)   O(n)
  Naive K-means       O(n log m)  O(m)   O(n)

Table 3: The table shows the time, RAM and disk complexity of three different K-means variations.

Figure 4: This figure shows the average running time for performing one clustering, for the same runs as those shown in Figure 3. Error bars show the standard errors.

Conclusions

We have implemented a scalable K-means algorithm and performed lesion studies on it. A special, but much simplified, case of the scalable K-means algorithm is when all points are discarded after being used to update the model. This variant has been shown to produce good clusterings of large synthetic and real-world datasets, in much less time than the full scalable K-means algorithm. Also, we have shown that the scalable K-means does not produce as good clusterings as the regular K-means operating on the full dataset, even if the parameters are set to give it a running time of about 40% of the full clustering. It may be that we did not find the right parameters, but we argue that an algorithm which has many parameters, without any guidelines for how to set them, is probably not worth the extra effort of implementing. It is worth noting one more thing about the scalable K-means algorithm: it is possible to update several models, each starting from different initial conditions, sharing the same buffer and performing only one scan of the database. We did not use that for our experiments, so it is possible that the running time would have decreased if we had. Also, the scalable algorithm is not restricted to K-means clustering, but can be extended to work with more expensive clustering methods, such as expectation maximization (EM) (Bradley, Fayyad, & Reina 1998a). Our main conclusion is that a complicated scalable K-means algorithm is only worthwhile if the database is very large or the time it takes to access a record is high, and a high-quality clustering is highly desirable.
References

Bradley, P.; Fayyad, U.; and Reina, C. 1998a. Scaling EM (expectation-maximization) clustering to large databases. Technical Report MSR-TR-98-35, Microsoft Research, Redmond, WA.

Bradley, P. S.; Fayyad, U.; and Reina, C. 1998b. Scaling clustering algorithms to large databases. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 9-15. AAAI Press.