
Fast, single-pass K-means algorithms

Fredrik Farnstrom, Computer Science and Engineering, Lund Institute of Technology, Sweden, ffarnstrom@ucsd.edu
James Lewis, Computer Science and Engineering, University of California, San Diego, lewis@cs.ucsd.edu

Abstract

We discuss the issue of how well K-means scales to large databases. We evaluate the performance of our implementation of a scalable variant of K-means, from Bradley, Fayyad and Reina (1998b), that uses several fairly complicated types of compression to fit points into a fixed-size buffer, which is then used for the clustering. The running time of the algorithm and the quality of the resulting clustering are measured, for a synthetic dataset and for the database from the KDD '98 Cup data mining contest. Lesion studies are performed to see whether all types of compression are needed. We find that a simple special case of the algorithm, in which all points are represented by their sufficient statistics after being used to update the model once, produces almost as good clusterings as the full scalable K-means algorithm in much less time. Setting all the parameters of the scalable algorithm was difficult, and we were unable to make it cluster as well as the regular K-means operating on the whole dataset.

Introduction

Clustering is a method for grouping together similar data items in a database. Similar data items can be seen as being generated from a probability distribution which is an unobserved feature of the data elements. The clustering problem involves determining the parameters of the probability distributions which generated the observable data elements. The K-means algorithm was developed as a solution to this problem.

K-means is an iterative refinement algorithm. It assumes that the data points are drawn from a fixed number (K) of clusters, and it attempts to determine the parameters of each cluster by making a hard assignment of each data point to one of the K clusters. The algorithm alternates between assigning cluster membership and re-estimating the cluster parameters, and it terminates when no data point changes membership under the re-estimated cluster parameters. When a data point changes cluster membership, it can only move to a cluster neighboring its current one, so only the distances from the point to the neighboring cluster means need to be checked to decide whether it will change membership. This insight can make K-means more efficient than checking the distance from each point to each of the K cluster means; however, in high dimensions datasets tend to be sparse and the number of neighboring clusters may be large. The running time of the algorithm is dominated by the time needed to access each data element, and every element must be accessed on every iteration, so K-means can become inefficient for large databases. A modified version of the algorithm which addresses these inefficiencies is investigated here.

The scalable K-means algorithm

Bradley, Fayyad and Reina (1998b) describe a method for scaling clustering algorithms, in particular K-means, to large datasets. The idea is to use a buffer in which points from the database are saved in compressed form. First, the cluster means are initialized as in ordinary K-means. Then, all available space in the buffer is filled with points from the database, and the current model is updated on the buffer contents in the usual way. The buffer contents are then compressed in two steps.
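As a baseline for what follows, here is a minimal sketch of the ordinary K-means iteration described in the introduction: hard-assign each point to its nearest mean, re-estimate the means, and stop when memberships no longer change. This is a generic textbook version in Python/numpy, not the authors' C++ implementation.

```python
import numpy as np

def kmeans(points, k, rng=None, max_iter=100):
    """Plain K-means with hard assignments; stops when no point changes cluster."""
    rng = rng or np.random.default_rng(0)
    points = np.asarray(points, dtype=float)
    means = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Hard assignment: nearest cluster mean for every point.
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # no membership changes: converged
            break
        labels = new_labels
        # Re-estimate each cluster mean from the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                means[j] = points[labels == j].mean(axis=0)
    return means, labels
```

The scalable variant keeps this inner loop but runs it on a bounded buffer rather than the whole dataset; the two compression steps applied to that buffer are described next.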
The first step, called the primary compression, finds points that are close to the cluster they are currently assigned to; such points are unlikely to end up in a different cluster. There are two methods for doing this. The first measures the Mahalanobis distance from the point to the cluster mean it is associated with, and discards the point if it is within a certain radius. For the second method, confidence intervals are set up for the cluster means. Then, for each point, a worst-case scenario is constructed by perturbing the cluster means within their confidence intervals: the cluster mean associated with the point is moved away from the point, and the means of all other clusters are moved towards the point. If the point is still closest to the same cluster mean after these perturbations, it is unlikely to change cluster membership. Points that are unlikely to change cluster membership are removed from the buffer and placed in one of the discard sets. Each of the main clusters has a discard set, represented by the sufficient statistics for all points belonging to that cluster that have been discarded.
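The two primary-compression tests might look like the following sketch, assuming diagonal covariances and a numpy representation; the radius and the confidence-interval half-widths are taken as inputs here, not values prescribed by the paper.

```python
import numpy as np

def can_discard(x, assigned, means, stds, half_widths, radius):
    """Return True if point x passes either primary-compression test and can
    be folded into the discard set of its assigned cluster.
    means, stds, half_widths have shape (k, d); x has shape (d,)."""
    m = means[assigned]

    # Test 1: Mahalanobis distance (diagonal covariance) within the radius.
    if np.sum(((x - m) / stds[assigned]) ** 2) <= radius ** 2:
        return True

    # Test 2: worst-case perturbation of the means within their confidence
    # intervals -- push the assigned mean away from x, pull all other means
    # towards x -- then check whether x is still closest to its own cluster.
    perturbed = means + np.clip(x - means, -half_widths, half_widths)
    perturbed[assigned] = np.where(x >= m,
                                   m - half_widths[assigned],
                                   m + half_widths[assigned])
    dists = np.linalg.norm(x - perturbed, axis=1)
    return dists.argmin() == assigned
```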

On the rest of the points in the buffer, a secondary K-means clustering is performed, with a larger number of clusters than for the main clustering. The reason for doing this is to try to save buffer space by storing some of these secondary clusters instead of the points. In order to replace points in the buffer with a secondary cluster, the cluster must fulfill a tightness criterion, meaning that its standard deviation in each dimension must be below a certain threshold. Clusters are joined with each other using hierarchical agglomerative clustering, as long as the merged clusters are still tight enough. The points remaining in the buffer are called the retained set. The space in the buffer that has become available because of the compression is then filled with new points, and the whole procedure is repeated. The algorithm ends after one scan of the dataset, or when the main cluster means do not change much as more points are added.

We implemented the algorithm in C++, without reusing any existing code. The platform for our experiments was a dual Pentium II workstation with 256 MB of RAM, running Linux. Our program is not multi-threaded, so only one of the processors is directly used in the experiments. The program is compiled with full optimization turned on.

Some details of the implementation of the scalable K-means algorithm are not given by Bradley et al. (1998b). During the primary compression, a Mahalanobis radius that discards a certain fraction of the newly sampled points belonging to a cluster must be determined. Our implementation computes the distance between each newly sampled point and the cluster it is assigned to and, for each cluster, sorts the resulting list; it is then easy to find a radius such that a given fraction of points is discarded. Note that this approach may change the time complexity of the whole algorithm. It may be possible to implement this step more efficiently, at least when the fraction of discarded points is small.

For all main and secondary clusters, our implementation stores the sufficient statistics (sum of elements, squared sum of elements, number of points) as well as the mean and standard deviation of the cluster. The mean is stored so that the distance between old and new means (the new mean is computed from the sum of the elements) can be computed when doing K-means clustering, and the standard deviation is stored to speed up the primary compression. This makes each cluster occupy about four times as much space as a point. One problem that may occur during secondary compression is that a tight cluster may not actually save space in the buffer. Therefore, each time a new tight cluster is found (after joining), the space occupied by that cluster is compared to the space occupied by the points it would replace, and the points are compressed only if this reduces the occupied space in the buffer.

For our purposes, the sufficient statistics are represented by two vectors, Sum and SumSq, and one integer n. The vectors store the sum and the sum of squares of the elements of the points in the cluster, and the integer keeps track of the number of points in the cluster. From these statistics, the mean and variance along each dimension can be calculated.
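As a concrete illustration of this bookkeeping (the update and merge rules it implements are spelled out formally below), a minimal sketch in Python/numpy:

```python
import numpy as np

class SufficientStats:
    """Sufficient statistics (Sum, SumSq, n) of a set of d-dimensional points,
    from which the mean and per-dimension variance can be recovered."""

    def __init__(self, d):
        self.sum = np.zeros(d)     # componentwise sum of the points
        self.sumsq = np.zeros(d)   # componentwise sum of squares
        self.n = 0                 # number of points summarized

    def add_point(self, x):
        x = np.asarray(x, dtype=float)
        self.sum += x
        self.sumsq += x ** 2
        self.n += 1

    def merge(self, other):
        """Fold another discard set or secondary cluster into this one."""
        self.sum += other.sum
        self.sumsq += other.sumsq
        self.n += other.n

    def mean(self):
        return self.sum / self.n

    def variance(self):
        # E[x^2] - E[x]^2, per dimension.
        return self.sumsq / self.n - self.mean() ** 2
```

The mean() and variance() accessors recover exactly the per-dimension quantities needed for the Mahalanobis test and the tightness criterion.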
Let the sufficient statistics of a cluster A be given by (Sum(A), SumSq(A), n(A)). If a point x is added to the cluster, the sufficient statistics are updated according to

Sum(A) := Sum(A) + x
SumSq(A) := SumSq(A) + x^2
n(A) := n(A) + 1

where x^2 denotes the componentwise square. If clusters A and B are merged, the sufficient statistics of the resulting cluster C are

Sum(C) = Sum(A) + Sum(B)
SumSq(C) = SumSq(A) + SumSq(B)
n(C) = n(A) + n(B)

Synthetic Dataset

Synthetic datasets are generated to experiment with the modified K-means algorithms. This allows the cluster means determined by the algorithms to be compared with the known probability distributions of the synthetic data. The data points are drawn from a fixed number of Gaussian distributions. Each Gaussian is assigned a random weight which determines the probability of generating a data point from that distribution. The mean and variance of each Gaussian are sampled uniformly, for each dimension, from the intervals [-5, 5] and [0.7, 1.5] respectively.

To determine the accuracy of a clustering, the true cluster means must be compared with the estimated cluster means, which requires deciding which true cluster mean corresponds to which estimated mean. As long as the number of clusters K is small, it is feasible to try all K! permutations and use the one that yields the smallest total distance. This is what we do, and it is the reason why only a few clusters are used in our experiments.
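A sketch of the exhaustive matching just described: try all K! one-to-one assignments of estimated means to true means and keep the one with the smallest total distance, which is feasible only because K is small.

```python
import itertools
import numpy as np

def best_match_distance(true_means, est_means):
    """Sum of Euclidean distances between true and estimated cluster means,
    minimized over all K! one-to-one assignments (tractable for small K)."""
    true_means = np.asarray(true_means, dtype=float)
    est_means = np.asarray(est_means, dtype=float)
    k = len(true_means)
    best = np.inf
    for perm in itertools.permutations(range(k)):
        total = sum(np.linalg.norm(true_means[i] - est_means[perm[i]])
                    for i in range(k))
        best = min(best, total)
    return best
```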

Lesion Studies

To evaluate the contribution of each data compression method, lesion studies were conducted. The experiments are run on synthetic datasets with 20 dimensions and 100000 data points. Comparisons are made between four variations of the scalable K-means algorithm, the regular K-means algorithm, and the naive scalable K-means algorithm described in the next section. Each method clusters the same synthetic dataset and, except for regular K-means, works within a limited buffer large enough to hold approximately 10% of the data points. The number of clusters is fixed at five for these lesion studies.

The four variations of the scalable K-means algorithm are obtained by adding the data compression methods one at a time. The first variation does not use any of the data compression techniques proposed in the scalable K-means algorithm: K-means runs until convergence on the first fill of the buffer, and no points are added or removed after this initial fill, which makes this variation similar to clustering a 10% sample of the data points. For the next variation, the first primary compression technique is used, which moves to the discard sets those points within a certain Mahalanobis distance of their associated cluster mean. Next, the second primary compression technique is added: confidence intervals are used to discard data points that are unlikely to change cluster membership. The fourth variation adds the secondary data compression technique of determining secondary clusters; for secondary compression, the number of secondary clusters is fixed at 25. All other constants used in these experiments are shown in Table 1.

  Parameter                            Value
  Confidence level for cluster means   95%
  Max. std. dev. for tight clusters    1.1
  Number of secondary clusters         25
  Fraction of points discarded (p)     10%

Table 1: The parameters used for the lesion studies of the scalable K-means algorithm.

Naive scalable K-means

A special case of the scalable K-means algorithm is obtained when all points in the buffer are discarded each time. The algorithm is:

1. Randomly initialize the cluster means. Let each cluster have a discard set in the buffer; the discard set keeps track of the sufficient statistics of all points from previous iterations.
2. Fill the buffer with points.
3. Perform K-means iterations on the points and discard sets in the buffer, until convergence. For this clustering, the discard sets are treated like regular points placed at the means of the discard sets, but weighted with the number of points in the discard sets.
4. For each cluster, update the sufficient statistics of the discard set with the points assigned to the cluster. Remove all points from the buffer.
5. If the database is empty, end. Otherwise, repeat from step 2.

This algorithm will be called naive scalable K-means in the rest of this paper. Like the scalable algorithm, it performs only one scan over the dataset and works on a fixed-size buffer, but it requires much less computation each time the buffer is filled, and the whole buffer can be filled with new points for each iteration (see the sketch below).
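A compact sketch of the loop in steps 1-5, assuming the data arrives as an iterable of buffer-sized numpy chunks. Only per-cluster sums and counts are kept here, since the weighted means are all this variant needs to run K-means; the full sufficient statistics of the paper would simply add the sums of squares.

```python
import numpy as np

def naive_scalable_kmeans(chunks, init_means, max_iter=100):
    """Naive scalable K-means: after each buffer fill, every point is folded
    into the discard set of its cluster and the buffer is emptied.
    `chunks` is an iterable of (buffer_size, d) arrays; one pass over the data."""
    means = np.array(init_means, dtype=float)
    k = len(means)
    disc_sum = np.zeros_like(means)   # per-cluster sum of discarded points
    disc_n = np.zeros(k)              # per-cluster count of discarded points

    for buf in chunks:                                    # step 2: fill the buffer
        buf = np.asarray(buf, dtype=float)
        labels = np.zeros(len(buf), dtype=int)
        for _ in range(max_iter):                         # step 3: K-means until convergence
            dists = np.linalg.norm(buf[:, None, :] - means[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            new_means = means.copy()
            for j in range(k):
                # Discard sets act as weighted points located at their means.
                total = disc_sum[j] + buf[labels == j].sum(axis=0)
                count = disc_n[j] + np.sum(labels == j)
                if count > 0:
                    new_means[j] = total / count
            if np.allclose(new_means, means):
                break
            means = new_means
        for j in range(k):                                # step 4: update discard sets
            disc_sum[j] += buf[labels == j].sum(axis=0)
            disc_n[j] += np.sum(labels == j)
    return means                                          # step 5: data exhausted
```

Passing, for example, np.array_split(data, 10) as `chunks` corresponds to a buffer holding roughly 10% of the data.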
Results

The experiments were run on 100 different synthetic datasets; the results are shown in Figures 1 and 2. For each dataset, five clustering models are generated from different initial starting positions, and the best model is retained for comparing clustering accuracy. Some of the synthetic datasets could not be clustered effectively by any of the methods used in the experiments; these datasets distort the quality measurements and are removed before comparing the methods. We believe that using a larger number of initial starting positions might reduce the number of datasets that do not converge to a solution. Of the 100 synthetic datasets, 24 are not used. To compare running times, all 100 trials are used, and the runs from all initial starting positions are combined to give a cumulative running time for each clustering method. The results of the lesion studies show no increase in cluster-identification accuracy from each added data compression technique, but a significant increase in running time.

Except for the first variation of the scalable K-means algorithm, each method used all of the points in the datasets; therefore, none of the experiments produced a model that converged using less than the full dataset. The cluster quality is about the same for each of the methods that used the whole dataset. The running time is directly related to the complexity of the algorithms: each additional data compression technique adds to the running time. The regular K-means algorithm converges to a solution faster than any of the scalable K-means variations that use data compression, and the naive approach runs faster than the other methods while producing the same level of accuracy. It should be noted that the variation using the secondary compression method found no points tight enough to be merged into secondary clusters. The inability of the lesion studies to show any significant differences in clustering quality may have resulted from the use of the synthetic datasets.

Figure 1: The sum of the distances between the estimated and true cluster means, for a synthetic dataset of 100000 points and five clusters. The algorithms are random sampling K-means with no compression (R10), scalable K-means with primary-1 compression only (S10--), with primary-1 and primary-2 compression (S10-), and with all types of compression (S10), the naive scalable K-means (N10), and the regular K-means operating on the whole dataset (K). Error bars show the standard errors.

Figure 2: The average running times for the modified versions of the scalable K-means algorithm. Error bars show the standard errors.

KDD '98 Dataset

To experiment with a real-world database that is sufficiently large to evaluate the scalability of the proposed algorithm, the dataset from the 1998 KDD (Knowledge Discovery and Data Mining) Cup data mining contest is used. This database contains statistical information about people who have made charitable donations in response to direct mailing requests; clustering can be used to identify groups of donors who may require specialized requests in order to maximize donation profits. The database contains 95412 records, each of which has 481 statistical fields. To test the clustering algorithms, we selected 56 features from these fields. The features are represented by vectors of floating point numbers. Numerical data (e.g., donation amounts, income, age) are stored as floating point numbers. Date values (e.g., donation dates, date of birth) are stored as the number of months from a fixed minimum date. Binary features, such as belonging to a particular income category, are also stored as floating point numbers. Of the 56 features that are used, 18 are binary. To give equal weight to each feature, each of the fields was normalized to have a mean of zero and variance of one. The records in the database are converted to this format and saved to a separate file which is used to retrieve data points during the clustering process.
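A small illustration of the kind of preprocessing described above: dates encoded as months since a minimum date, then each field normalized to zero mean and unit variance. The field layout, the example values and the minimum date are made up for the illustration; the paper does not specify them.

```python
import numpy as np
from datetime import date

def months_since(d, minimum=date(1900, 1, 1)):
    """Encode a date as the number of whole months since a fixed minimum date
    (the actual minimum date used in the paper is not specified)."""
    return (d.year - minimum.year) * 12 + (d.month - minimum.month)

def zscore(features):
    """Normalize each column to zero mean and unit variance
    (assumes no constant columns)."""
    features = np.asarray(features, dtype=float)
    return (features - features.mean(axis=0)) / features.std(axis=0)

# Hypothetical example: two records with a donation amount, a donation date
# and a binary income-category flag, all stored as floating point numbers.
records = np.array([
    [25.0, months_since(date(1997, 3, 1)), 1.0],
    [10.0, months_since(date(1996, 11, 1)), 0.0],
])
normalized = zscore(records)
```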

The purpose of this experiment is to compare the running time and cluster quality of the regular K-means (operating on the whole dataset or on resamples), the scalable K-means algorithm using all types of compression, and the naive scalable K-means in which every point is put in the discard set after being used to update the model. The experiment is performed with resamples and buffers of 10% and 1% of the size of the whole dataset, and the number of clusters is 10. First, the dataset is randomly reordered. Then it is clustered five times by all algorithms, each time with new randomly chosen initial conditions (but with all algorithms starting from the same initial conditions), and the running time is measured. The quality of a clustering is measured by the sum of the squared distances between each point and the cluster mean it is associated with. Of these five clusterings, the one with the best quality is used, because K-means is known to be sensitive to initial conditions. This whole procedure is repeated 14 times with different random orderings of the dataset.

For the scalable clustering, we found it hard to set the parameters, especially for the secondary clustering. If the number of secondary clusters is low, many points are assigned to each cluster, which means that in order to get any tight clusters at all, the maximum allowed standard deviation must be quite high. If, on the other hand, the number of secondary clusters is too high, it takes too much time to run the secondary K-means clustering and merge the resulting clusters, with only slightly better clusterings. The parameter values we used are given in Table 2.

  Parameter                            Value
  Confidence level for cluster means   95%
  Max. std. dev. for tight clusters    1.1
  Number of secondary clusters         40
  Fraction of points discarded (p)     20%

Table 2: The parameters used for the scalable K-means algorithm.

Figure 3 shows the average quality of the best of the five clusterings, and Figure 4 shows the average running time of all the clusterings. From Figure 3 it can be seen that the quality of the random sampling K-means operating on a 1% resample is much worse than everything else, and that ordinary K-means produces the best clusterings, as can be expected. Larger buffers produce better clusterings.
The scalable K-means produces slightly better clusterings than the naive scalable K-means, but the difference is not significant. The running time is by far the highest for the ordinary K-means, but the scalable K-means also takes considerably more time than the naive scalable and random sampling K-means.
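The quality measure used in this experiment, the sum of squared distances from each point to its assigned (nearest) cluster mean, can be computed as in the following sketch.

```python
import numpy as np

def distortion(points, means):
    """Sum of squared Euclidean distances from each point to the nearest
    cluster mean; lower is better."""
    points = np.asarray(points, dtype=float)
    means = np.asarray(means, dtype=float)
    sq_dists = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.min(axis=1).sum()
```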

Figure 3: The sum of the squared distances between each point in the dataset and the cluster mean it is associated with, on the KDD Cup '98 dataset of 95412 points and 10 clusters. The algorithms are the scalable K-means (S10 and S1), the naive scalable K-means (N10 and N1), random sampling K-means (R10 and R1), and the ordinary K-means working on the whole dataset (K). Names ending in 10 use a buffer or resample of 10% of the whole dataset; names ending in 1 use a 1% buffer or resample. Error bars show the standard errors.

Figure 4: The average running time for performing one clustering, for the same runs as those shown in Figure 3. Error bars show the standard errors.

Complexity

Performing one K-means iteration over a set of n entities has time complexity O(n). Here, it is assumed that O(log n) iterations over the set are needed for the clusters to converge. Table 3 shows the time, space and disk I/O complexity of regular, scalable and naive scalable K-means, where m is the size of the RAM buffer. The O(n log m) time complexity of the scalable K-means algorithm comes from the K-means iterations over the buffer, the sorting on Mahalanobis distance in our implementation, and the secondary K-means clustering. Our experiments show that the constant in the time complexity of the scalable K-means is large compared to the constant for regular K-means, and that the gain from using the scalable method comes from reducing the number of disk accesses. This can be seen from the running times on the synthetic and real-world datasets: for the synthetic datasets, regular K-means runs faster than scalable K-means, but for the real-world dataset the scalable K-means is faster.

  Algorithm          Time        Space  Disk I/O
  Regular K-means    O(n log n)  O(1)   O(n log n)
  Scalable K-means   O(n log m)  O(m)   O(n)
  Naive K-means      O(n log m)  O(m)   O(n)

Table 3: The time, RAM and disk complexity of three different K-means variations.

Conclusions

We have implemented a scalable K-means algorithm and performed lesion studies on it. A special, but much simplified, case of the scalable K-means algorithm is the one in which all points are discarded after being used to update the model once. This variant has been shown to produce good clusterings of large synthetic and real-world datasets, in much less time than the full scalable K-means algorithm. We have also shown that the scalable K-means does not produce as good clusterings as the regular K-means operating on the full dataset, even when its parameters are set to give it a running time of about 40% of the full clustering. It may be that we did not find the right parameters, but we argue that an algorithm which has many parameters, without any guidelines for how to set them, is probably not worth the extra effort of implementing.

One more property of the scalable K-means algorithm is worth noting: it is possible to update several models, each starting from different initial conditions, sharing the same buffer and performing only one scan of the database. We did not use this in our experiments, so the running time might have decreased if we had. Also, the scalable algorithm is not restricted to K-means clustering, but can be extended to work with more expensive clustering methods, such as expectation maximization (EM) (Bradley, Fayyad, & Reina 1998a).
Our main conclusion is that a complicated scalable K-means algorithm is only worthwhile if the database is very large or the time it takes to access a record is high, and a high-quality clustering is highly desirable.

References

Bradley, P. S.; Fayyad, U.; and Reina, C. 1998a. Scaling EM (expectation-maximization) clustering to large databases. Technical Report MSR-TR-98-35, Microsoft Research, Redmond, WA.

Bradley, P. S.; Fayyad, U.; and Reina, C. 1998b. Scaling clustering algorithms to large databases. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 9-15. AAAI Press.