Parallel K-Means Clustering with Triangle Inequality


Rachel Krohn and Christer Karlsson
Mathematics and Computer Science Department, South Dakota School of Mines and Technology, Rapid City, SD, 57701, USA
© ISCA, CAINE 2016, September 26-28, 2016, Denver, Colorado, USA

Abstract

Clustering divides data objects into groups to minimize the variation within each group. This technique is widely used in data mining and other areas of computer science. K-means is a partitional clustering algorithm that produces a fixed number of clusters through an iterative process. The relative simplicity and obvious data parallelism of the K-means algorithm make it an excellent candidate for distributed-memory optimization, particularly as datasets grow beyond the size of a single machine. The triangle inequality, when applied to the K-means algorithm, allows unnecessary distance calculations between data objects and cluster centroids to be avoided. Various parallel implementations of the K-means algorithm exist, but no example comparing a standard parallel implementation to one utilizing the triangle inequality could be located. This paper seeks to fill this gap by presenting experimental results demonstrating the performance of both standard and improved parallel K-means implementations compared to a sequential implementation.

Keywords: K-means, clustering, data mining, parallel, triangle inequality

1 Introduction

The field of data mining seeks to extract useful information from large datasets drawn from real-world situations. As big data becomes more popular and data storage capabilities increase, demand for data mining analysis increases while datasets grow in size. One common data mining task is clustering, which partitions the data into discrete categories. The K-means algorithm clusters data objects into k separate clusters using some distance measure, with each data object belonging to the cluster with the nearest cluster centroid [6]. Although the K-means algorithm is simple to implement, growing datasets present unique challenges. Larger datasets require more time to analyze, and if the entire dataset does not fit into the memory of a single machine, costly disk I/O operations further slow the analysis process. One solution is to use parallel processing to speed up the analysis and, in the case of distributed-memory solutions, reduce disk I/O. This paper introduces the basic K-means algorithm and summarizes current parallel implementations before introducing a new flexible distributed-memory solution that utilizes the triangle inequality for improved speedup.

2 Basic K-Means Algorithm

K-means is a simple prototype-based, partitional clustering algorithm. Each cluster is represented by a prototype, or centroid, which is typically defined as the mean of all data points in that cluster. Because K-means is partitional, each data point will only belong to one cluster, and clusters will not be nested. The number of clusters, k, is specified by the user before the algorithm begins processing. Initial centroids can be selected in many different ways; the exact method does not impact the main algorithm. K-means then relies on an iterative procedure to refine the centroids, as described in Algorithm 1 [6].

Algorithm 1 Basic K-Means Algorithm
1: Select k points as initial centroids
2: repeat
3: Form k clusters by assigning each point to its closest centroid
4: Recompute the centroid of each cluster
5: until Centroids do not change

Initial centroid selection varies depending on the specific K-means implementation being used. Some common methods include randomly selecting k data points, or selecting k points with maximum separation distance.
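As a concrete illustration of Algorithm 1, a minimal sequential sketch in C follows. The flat-array data layout, the squared-distance helper, and all identifiers are our own illustration, not the authors' code.

```c
#include <stdlib.h>

/* Squared Euclidean distance between two d-dimensional points. Comparing
   squared distances selects the same nearest centroid without the sqrt. */
static double dist2(const double *a, const double *b, int d) {
    double s = 0.0;
    for (int j = 0; j < d; j++) {
        double t = a[j] - b[j];
        s += t * t;
    }
    return s;
}

/* Algorithm 1: points holds n*d attributes, centroids holds k*d values and
   is pre-seeded with the initial centroids; assign[i] receives the final
   cluster of point i. */
void kmeans(const double *points, int n, int d, int k,
            double *centroids, int *assign) {
    double *sums  = malloc((size_t)k * d * sizeof *sums);
    int    *count = malloc((size_t)k * sizeof *count);
    int changed;

    for (int i = 0; i < n; i++) assign[i] = -1;
    do {
        changed = 0;
        for (int c = 0; c < k * d; c++) sums[c] = 0.0;
        for (int c = 0; c < k; c++) count[c] = 0;

        /* Step 3: assign each point to its closest centroid. */
        for (int i = 0; i < n; i++) {
            const double *x = &points[(size_t)i * d];
            int best = 0;
            double best_d = dist2(x, centroids, d);
            for (int c = 1; c < k; c++) {
                double dc = dist2(x, &centroids[(size_t)c * d], d);
                if (dc < best_d) { best_d = dc; best = c; }
            }
            if (best != assign[i]) { assign[i] = best; changed++; }
            for (int j = 0; j < d; j++) sums[(size_t)best * d + j] += x[j];
            count[best]++;
        }

        /* Step 4: recompute each centroid as the mean of its members. */
        for (int c = 0; c < k; c++)
            if (count[c] > 0)
                for (int j = 0; j < d; j++)
                    centroids[(size_t)c * d + j] = sums[(size_t)c * d + j] / count[c];
    } while (changed > 0);   /* Step 5: centroids stable once no point moves */

    free(sums);
    free(count);
}
```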
Initial centroids can impact the final clustering results, as poor initial centroids may produce a sub-optimal clustering. At each K-means iteration, all points are reassigned to the nearest centroid based on some distance metric. Note that finding the nearest centroid requires calculating the distance between the current data object and all cluster centroids. Euclidean distance is often used, since it is simple to implement and produces good results if the data is normalized. Based on the new cluster assignments, the centroids for each cluster are recalculated to reflect the new cluster membership. Often a cluster centroid is defined as the average of all data objects belonging to that cluster. The data object assignment and centroid adjustment process is repeated until the cluster centroids do not move, indicating a stable final clustering. In practice, particularly for large datasets, the number of data points that switch clusters is tracked at each iteration, and the algorithm terminates once this count falls below a certain threshold [6]. Since most centroid movement occurs in the first few iterations, stopping early does not greatly undermine the final results. There are also cases where a small number of points will repeatedly oscillate between two clusters; the threshold termination condition prevents this. One of the greatest advantages of the K-means algorithm is its simplicity and efficiency. Despite these advantages, K-means is not suitable for all datasets: it does not handle non-globular clusters, or clusters of differing sizes and densities, and other clustering algorithms exist for these situations. The algorithm is also not robust to outliers; detecting and removing outliers before running K-means is a viable solution.

3 Parallel K-Means Algorithm

Examining the K-means algorithm reveals an obvious data parallelism [7]. To reassign each point to the nearest cluster centroid, the distance from the point to every cluster centroid must be calculated; these distances are then compared to determine the nearest centroid. Because this process must be repeated for every data object in the set, and the results for one data object do not impact other data objects, it is appealing to parallelize the K-means algorithm to reduce overall execution time.

3.1 Existing Work

Following the theoretical development of a parallel K-means algorithm [7], various parallel implementations have been created. Some rely on shared memory [4] for simplicity, but these programs cannot handle extremely large datasets. Other researchers, desiring a distributed-memory solution, use MPI [1][8] or Apache Hadoop [9] for parallelization. More recent work focuses on GPU-based solutions [2].

Various algorithm modifications have been researched and implemented to improve the running time of the basic K-means algorithm. Improvements seek to reduce either the number of algorithm iterations or the number of distance calculations at each iteration; both strategies reduce the overall work performed by the K-means algorithm, thereby achieving speedup. Implementing the triangle inequality to reduce the number of Euclidean distance calculations greatly improves the scalability of the K-means algorithm [5]. Applying the triangle inequality to an object's previous cluster assignment and a table of inter-centroid distances eliminates unnecessary distance calculations. Further improvement is achieved by sorting the centroids in order of distance from the previous cluster centroid. The main advantage of this improvement strategy is its simplicity, as the triangle inequality does not alter the foundation of the basic K-means algorithm.
Another improvement strategy is to avoid looping over the entire dataset at each centroid adjustment iteration by removing data points close to the centroid from consideration [3]. Using statistical analysis, the dataset can be reduced to a subset of boundary points, which are points near the edges of a cluster. If centroids do not move significantly, only the boundary points must be reassigned, decreasing the cost of each algorithm iteration. This strategy produces speedup by reducing the number of passes over the entire dataset, which reduces the total number of distance calculations. Although the boundary point approach offers significant speedups, it is much more difficult to implement than the triangle inequality.

3.2 Implementation Details

OpenMPI is a library implementation of the Message Passing Interface for distributed-memory computing. Typically, OpenMPI provides greater control to the programmer than MapReduce, allowing for more complex and frequent communication between processes. For this reason, OpenMPI was selected as the platform for this implementation. Unfortunately, more frequent communication generally causes larger communication overhead, so careful algorithm design is required to minimize communication between processes. Once the desired number of MPI processes is initialized, the root process retrieves the data object attributes from the user-specified input file. The data is partitioned between the different processes; each process is then responsible for clustering only part of the entire dataset. k data objects are randomly selected to serve as the initial cluster centroids. The number of desired clusters and the initial centroids are disseminated to all processes before the main clustering procedure begins.

The K-means algorithm was converted to an MPI procedure to exploit data parallelism. Each process is responsible for a subset of the data objects, but knows the location of all cluster centroids. At each iteration, a process assigns each of its own data objects to the nearest cluster centroid. The attributes of all data objects assigned to each cluster are summed as they are assigned, and the size of each cluster is tracked. After all data objects are assigned, the processes exchange the local cluster sizes and attribute sums. Each process then computes updated centroid locations independently before repeating the data object assignment procedure. Using this strategy, our implementation produces the same results as the sequential K-means algorithm while limiting communication and reducing overall computation time. Because a few data objects may oscillate between cluster centroids, preventing the algorithm from terminating, our implementation utilizes a stopping threshold: once fewer than 1% of data objects change clusters, the program stops. Two results files are created, one giving the exact location of each cluster centroid and the other indicating the membership of each cluster. The communication pattern of one iteration is sketched below.
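A minimal sketch of this per-iteration communication pattern follows, assuming each rank already holds its local points in a flat array and a replicated copy of the centroids; dist2 is the squared-distance helper from the Section 2 sketch, and all identifiers are our own illustration rather than the authors' code.

```c
#include <mpi.h>
#include <stdlib.h>

/* One iteration of the distributed K-means loop. The caller initializes
   assign[] to -1 before the first call. Returns the global number of
   reassigned points; iteration stops once this falls below the threshold. */
int kmeans_iteration(const double *local_pts, int n_local, int d, int k,
                     double *centroids, int *assign) {
    double *sums  = calloc((size_t)k * d, sizeof *sums);
    int    *count = calloc((size_t)k, sizeof *count);
    int changed = 0;

    /* Assign each local point; accumulate attribute sums and cluster sizes. */
    for (int i = 0; i < n_local; i++) {
        const double *x = &local_pts[(size_t)i * d];
        int best = 0;
        double best_d = dist2(x, centroids, d);   /* helper from Section 2 sketch */
        for (int c = 1; c < k; c++) {
            double dc = dist2(x, &centroids[(size_t)c * d], d);
            if (dc < best_d) { best_d = dc; best = c; }
        }
        if (best != assign[i]) { assign[i] = best; changed++; }
        for (int j = 0; j < d; j++) sums[(size_t)best * d + j] += x[j];
        count[best]++;
    }

    /* Exchange local cluster sizes, attribute sums, and the change count;
       every rank then recomputes identical centroids independently. */
    MPI_Allreduce(MPI_IN_PLACE, sums, k * d, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(MPI_IN_PLACE, count, k, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(MPI_IN_PLACE, &changed, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    for (int c = 0; c < k; c++)
        if (count[c] > 0)
            for (int j = 0; j < d; j++)
                centroids[(size_t)c * d + j] = sums[(size_t)c * d + j] / count[c];

    free(sums);
    free(count);
    return changed;
}
```

Three collectives per iteration keep the communication volume proportional to k times d rather than to the dataset size, which is what limits the overhead as the data grows.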

4 The Triangle Inequality

The triangle inequality can be applied to the K-means algorithm to reduce the number of distance calculations required. By comparing the distance between a data object X_i and its current cluster centroid C_curr to the distance between C_curr and any potential cluster centroid C_pot, unnecessary distance calculations are eliminated. Taking d(a, b) as the Euclidean distance between points a and b, the triangle inequality states that

d(C_curr, C_pot) ≤ d(X_i, C_curr) + d(X_i, C_pot),

which means that

d(X_i, C_pot) ≥ d(C_curr, C_pot) - d(X_i, C_curr).

Considering the case d(C_curr, C_pot) ≥ 2 d(X_i, C_curr), the triangle inequality therefore allows us to conclude that d(X_i, C_pot) ≥ d(X_i, C_curr). This indicates that object X_i will not be assigned to cluster C_pot, and d(X_i, C_pot) does not need to be calculated. A prerequisite for these comparisons is a table of distances between all pairs of centroids, which must be recalculated following each centroid adjustment. Each process computes this table independently after receiving the new centroid location data. Because the number of cluster centroids is much smaller than the number of data objects, the cost of computing this table is negligible in relation to the number of distance calculations saved by this improvement. For our implementation the centroids are not sorted, because the extra speedup did not outweigh the additional complexity, particularly when k is small.
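A sketch of this pruning test during point assignment follows, assuming ccdist is the k-by-k table of (non-squared) inter-centroid Euclidean distances recomputed once per iteration and dist2 is the squared-distance helper from the earlier sketches; all identifiers are our own illustration.

```c
#include <math.h>
#include <stddef.h>

/* Assign point x given its current cluster curr. Any candidate C_pot with
   d(C_curr, C_pot) >= 2 * d(x, C_curr) is skipped: by the triangle
   inequality, d(x, C_pot) >= d(x, C_curr), so C_pot cannot be closer. */
int assign_with_triangle(const double *x, const double *centroids,
                         const double *ccdist, int k, int d, int curr) {
    double d_curr = sqrt(dist2(x, &centroids[(size_t)curr * d], d));
    double best_d = d_curr;
    int best = curr;
    for (int c = 0; c < k; c++) {
        if (c == curr) continue;
        if (ccdist[(size_t)curr * k + c] >= 2.0 * d_curr)
            continue;                 /* pruned: d(x, C_pot) never computed */
        double dc = sqrt(dist2(x, &centroids[(size_t)c * d], d));
        if (dc < best_d) { best_d = dc; best = c; }
    }
    return best;
}
```

Note that the pruning test needs true Euclidean distances rather than squared ones, since the inequality is stated for the metric itself.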
5 Experimental Results

To evaluate the value of parallelization, we conducted extensive experiments on the sequential, basic parallel, and optimized parallel K-means implementations. All experiments were performed on a cluster of computers; each machine has 16 GB of memory and eight cores running at 3.6 GHz. All timing measurements rely on the MPI_Wtime call. Since the goal of this paper is to compare the performance of the clustering algorithm, I/O time is ignored for these measurements. Sequential testing consisted of a single process on a single machine, while parallel programs used eight processes on two nodes. This configuration was selected to include inter-process communication both across and within network nodes. Each graph datapoint represents an average of several test runs.

The datasets used for all test runs were synthetically generated. Each dataset contains k equal-sized, well-separated, globular clusters, with theoretical cluster centers following a normal distribution. Within each cluster, individual data items also follow a normal distribution. Attributes are floating-point numbers ranging from zero to one with five digits of precision. If not specifically mentioned, the default dataset for each run consists of 5, items, each with 3 attributes, with k set to clusters. All individual test runs use the same initial centroids for consistency.

By comparing the outputs of the programs, we verified that all three versions of the K-means algorithm produced the same final clustering. Centroid locations and object cluster assignments were identical across all test runs, regardless of which program performed the clustering.
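Timing in these experiments relies on the MPI_Wtime call; a minimal sketch of one way to wrap the clustering phase follows, reusing the kmeans_iteration sketch from Section 3.2 and the 1% stopping threshold. The driver, its reporting of the slowest rank's time, and all identifiers are our own illustration.

```c
#include <mpi.h>
#include <stdio.h>

/* Time the clustering phase only (I/O excluded) and report the slowest
   rank's elapsed time from the root process. */
void timed_clustering(const double *local_pts, int n_local, long n_total,
                      int d, int k, double *centroids, int *assign) {
    double t0 = MPI_Wtime();
    while (kmeans_iteration(local_pts, n_local, d, k, centroids, assign)
           >= 0.01 * n_total)    /* stop once < 1% of objects change clusters */
        ;
    double elapsed = MPI_Wtime() - t0, slowest;
    MPI_Reduce(&elapsed, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) printf("clustering time: %.3f s\n", slowest);
}
```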

[Figure 1: Effect of dataset size on runtime.]

Figure 1 shows the runtimes of all three program versions for datasets with varying numbers of objects, ranging from , to 2 million objects. As expected, the sequential program takes much longer to produce the same results as the parallelized versions. The K-means version with the triangle inequality optimization performs better than the standard parallel algorithm. The runtime of both parallel programs grows more slowly than the runtime of the sequential version, indicating that the parallel algorithm is scalable.

[Figure 2: Effect of number of clusters on runtime.]

The effect of k on runtime was also considered; Figure 2 shows these results. As with the number of data objects, the runtime of the sequential K-means algorithm grows quickly as k increases, while the runtime of the parallel programs increases more slowly. The speedup from the triangle inequality optimization is more apparent here, with the triangle inequality consistently reducing clustering runtime by half.

[Figure 3: Effect of number of attributes on runtime.]

Finally, the number of attributes of the data objects was adjusted and the impact on runtime examined. Figure 3 shows that once again, the parallel implementations significantly outperform the sequential K-means algorithm. The triangle inequality optimization appears to offer some benefit in these cases, although the improvement is not as significant as for larger numbers of clusters.

Overall, the experiments show that a parallel implementation of the K-means algorithm greatly reduces execution time compared to a sequential implementation, particularly as the problem size grows. The triangle inequality optimization does improve upon the standard parallel implementation, with the greatest benefit occurring for large numbers of clusters.

6 Future Work

Because the size of the test datasets in this study was limited by the need to time sequential execution, the benefits of the triangle inequality optimization do not appear significant when compared with the improvements achieved by a standard parallel implementation. Further testing with larger datasets and more attributes is required to better assess the effectiveness of the triangle inequality. Experiments using real-world data instead of synthetically generated objects are also necessary.

The standard K-means algorithm does not specify a method for selecting the initial centroids, so many implementations rely on randomly selected initial centroids. Because poor initial centroids can result in sub-optimal clustering results [6], clustering is often repeated so that the best result can be used. Additional research into initial centroid selection methods is required to address this problem. Other K-means algorithm improvements, such as boundary point tracking [3], may allow for even greater speedups.

7 Conclusion

In this paper, a survey of the basic K-means algorithm and current related work is presented as an introduction to clustering. Then, a distributed-memory implementation that exploits the data parallelism present in the K-means algorithm is discussed. The triangle inequality is also introduced as an optimization of the basic algorithm, including implementation details. Finally, experimental results are presented for all three program versions.

Our OpenMPI implementation of K-means clustering partitions the dataset between multiple processes to speed up program execution. The algorithm is designed to limit communication between nodes and maximize data parallelism. A second variation of this program utilizes the triangle inequality to prevent unnecessary Euclidean distance calculations. We evaluated our programs' performance through detailed experimentation. The results show that parallelization of the K-means algorithm does improve running time, and that the triangle inequality optimization offers further benefits. More testing with larger datasets is required to more clearly demonstrate the additional speedup gained through the triangle inequality optimization. To the best of our knowledge, a performance comparison between basic parallel K-means and a version implementing the triangle inequality does not exist in the literature. The K-means algorithm is a relatively simple and flexible clustering mechanism for many datasets. Exploiting the data parallelism present in this algorithm greatly reduces overall runtime, and the triangle inequality provides an opportunity for further optimization, particularly for large datasets.

References

[1] Inderjit S. Dhillon and Dharmendra S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD. Springer-Verlag, 2000.

[2] Reza Farivar, Daniel Rebolledo, Ellick Chan, and Roy Campbell. A parallel implementation of k-means clustering on GPUs. In Parallel and Distributed Processing Techniques and Applications. CSREA Press, 2008.

[3] Ruoming Jin, Anjan Goswami, and Gagan Agrawal. Fast and exact out-of-core and distributed k-means clustering. Knowledge and Information Systems, 10(1):17-40, 2006.

[4] Tayfun Kucukyilmaz. Parallel k-means algorithm for shared memory multiprocessors. Journal of Computer and Communications, 2(11):15-23, 2014.

[5] Jitendra Kumar, Richard T. Mills, Forrest M. Hoffman, and William W. Hargrove. Parallel k-means clustering for quantitative ecoregion delineation using large data sets. In Procedia Computer Science. Elsevier, 2011.

[6] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Pearson Addison Wesley, 1st edition, 2005.

[7] Jinlan Tian, Lin Zhu, Suqin Zhang, and Lu Liu. Improvement and parallelism of k-means clustering algorithm. Tsinghua Science and Technology, 10(3), 2005.

[8] Jing Zhang, Gongqing Wu, Xuegang Hu, Shiying Li, and Shuilong Hao. A parallel clustering algorithm with MPI - MKmeans. Journal of Computers, 8(1):10-17, 2013.

[9] Weizhong Zhao, Huifang Ma, and Qing He. Parallel k-means clustering based on MapReduce. In Cloud Computing. Springer, 2009.
