Parallelization of K-Means Clustering Algorithm for Data Mining


Hao JIANG a, Liyan YU b
College of Computer Science and Engineering, Southeast University, Nanjing, China
a hjiang@seu.edu.cn, b yly.sunshine@qq.com

Abstract: In this paper we study the parallelization of the K-Means clustering algorithm, propose a parallel scheme, design the corresponding algorithm, and implement it in a GPU environment. The experimental results show that the GPU-based parallel algorithm achieves a good speedup over the CPU-based serial algorithm.

1 Introduction

Cluster analysis is an important research topic in the field of data mining [1]. Clustering is the process of partitioning a set of physical or abstract objects into classes of similar objects: objects in the same cluster are highly similar, while objects in different clusters differ greatly. Automatic clustering can identify dense and sparse regions in the object space and thus uncover interesting correlations between the global distribution pattern and the data attributes [2].

With the rise of big data and ever-growing data volumes, data sizes have reached the TB or even PB level, which places higher demands on clustering. Massive data sets and their processing workloads cannot be completed by an ordinary computer within the required time. To improve the capacity for handling massive data and to enhance real-time processing, parallelizing the clustering algorithm becomes an attractive choice. Today's widely used approaches are distributed computing systems such as Hadoop [3] and Spark [4], which combine multiple computers into a unified distributed system in which each machine processes a share of the user data to improve overall efficiency. To fully exploit the computing power of a single machine, a suitable parallel algorithm can also be ported to the GPU platform, whose powerful parallel computing capability raises the machine's data-processing throughput. Just as with CPUs, several GPUs can easily be added to the same computer to further improve the processing efficiency of a single machine. It is not hard to imagine that organizing such powerful multi-GPU computers into a distributed cluster would greatly increase data-processing capacity compared with a CPU-only cluster of the same scale. The parallelization of a GPU-based K-Means algorithm therefore has wide practical application value.

2 Overview of the K-Means Algorithm

The K-Means algorithm [5], proposed by MacQueen, is one of the most famous and most commonly used clustering algorithms; it is simple, fast, and easy to implement. K-Means takes K as an input parameter and divides N objects into K clusters so that similarity within a cluster is high while similarity between different clusters is low. Cluster similarity is measured with respect to the mean of the objects in the cluster, which can be regarded as the cluster's centroid or center of gravity.
The K-Means algorithm can be described as follows. Given a sample set D = {x_1, x_2, ..., x_N}, where each x_i ∈ χ ⊆ R^n is the feature vector of an instance, K-Means maps the N samples to k (k ≤ N) cluster centers C = {c_1, c_2, ..., c_k} such that the sum of the squared distances between each sample and its nearest cluster center is minimized. This sum is called the Squared Error Function, denoted E, as shown in equation (1):

E = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - c_j \rVert^2    (1)

The processing flow of the K-Means algorithm:

Input: data set D = {x_1, x_2, ..., x_N}; k, the number of clusters.
Output: a collection of k clusters.
Steps:
1) Arbitrarily select k samples from D as the initial cluster centers;
2) Repeat:
3) Calculate the distance between every data sample and every cluster center;
4) Assign each object to the most similar cluster according to these distances;
5) Calculate the mean of all objects in each cluster and update the cluster center;
6) Calculate the Squared Error Function;
7) Until the criterion function satisfies the convergence threshold;
8) The algorithm terminates.
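To make the flow above concrete, here is a minimal serial (CPU-side) sketch of steps 3) through 6) in C++; the row-major layouts of T and C match the matrices introduced in Section 3, and all function and variable names are illustrative rather than taken from the paper.

```cuda
// Minimal serial sketch of one K-Means iteration (steps 3-6 above).
// Hypothetical illustration: names and layout are ours, not the paper's.
#include <cfloat>
#include <vector>

// T: N x D samples (row-major), C: k x D centers (row-major).
// Assigns each sample to its nearest center, recomputes the centers,
// and returns the Squared Error Function E of equation (1).
double kmeansIteration(const std::vector<double>& T, std::vector<double>& C,
                       std::vector<int>& label, int N, int D, int k) {
    double E = 0.0;
    // Steps 3-4: nearest-center assignment.
    for (int i = 0; i < N; ++i) {
        double best = DBL_MAX;
        for (int j = 0; j < k; ++j) {
            double d = 0.0;
            for (int t = 0; t < D; ++t) {
                double diff = T[i * D + t] - C[j * D + t];
                d += diff * diff;
            }
            if (d < best) { best = d; label[i] = j; }
        }
        E += best;  // step 6: accumulate the squared error
    }
    // Step 5: each center becomes the mean of its assigned samples.
    std::vector<double> sum(k * D, 0.0);
    std::vector<int> cnt(k, 0);
    for (int i = 0; i < N; ++i) {
        ++cnt[label[i]];
        for (int t = 0; t < D; ++t) sum[label[i] * D + t] += T[i * D + t];
    }
    for (int j = 0; j < k; ++j)
        if (cnt[j] > 0)
            for (int t = 0; t < D; ++t) C[j * D + t] = sum[j * D + t] / cnt[j];
    return E;
}
```

An outer loop (steps 2 and 7) calls this repeatedly until the change in E drops below the threshold or an iteration cap is reached; the O(NKt) cost discussed next is dominated by the triple-nested distance loop.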

The K-Means algorithm attempts to find the k-way partition that minimizes the Squared Error Function. The clustering result is ideal when the clusters are compact, clearly separated from one another, and of similar size. The time complexity of K-Means is O(NKt), where N is the sample set size, K is the number of clusters, and t is the number of iterations [6]. Within each iteration, most of the time is spent calculating the distances between the data objects and the cluster centers and computing the Squared Error Sum over all objects. Although the assignment of data objects and the update of all k cluster centers must be executed many times, the heaviest computation in each iteration is the repeated evaluation of squared distances between data objects and the different center points, followed by finding the new center of each cluster. These two tasks can be separated into two individual kernel functions and dispatched to the GPU processing cores for parallel computation.

3 Parallel Design of the K-Means Algorithm

This paper presents a GPU-based parallel K-Means algorithm, called G-K-Means. Its main idea is to improve performance by moving the data-independent, computation-intensive parts of the traditional K-Means algorithm from the host to the GPU. Since K-Means is an iterative convergence process that computes the distance between every data object and every cluster center in each iteration, the parallel scheme covers the following two aspects.

(1) Parallel calculation of all distances between data objects and cluster centers. To facilitate computation on the GPU, we construct the data set matrix T[N][D], the center matrix C[k][D], and the distance matrix Dis[N][k], where N is the size of the data set, k is the number of clusters, and D is the dimension of a data sample. Dis[N][k] is computed from T[N][D] and the transposed center matrix C^T[D][k]; each row of Dis[N][k] holds the distances between one data object and the k cluster centers. This step can be ported to the GPU and parallelized in the manner of block matrix multiplication, which improves the efficiency of the distance calculation between the data samples.

(2) Parallel calculation of the Squared Error Sum over all objects in the data set. During the iterative convergence of K-Means, every iteration must compute the Squared Error Sum of all data objects to decide whether the algorithm has converged. Computing the squared distance from an object to its cluster center is independent across objects, so this step parallelizes naturally, and accumulating the per-cluster sums is likewise independent across clusters. In the implementation, the GPU grid is divided into N/1024 one-dimensional blocks of 1024 one-dimensional threads each. Each thread computes the squared error between one object and its cluster center, and the per-thread results are combined by reduction: the threads within each block are summed first, yielding the partial Squared Error Sum of the objects in that block, and the same method then reduces the per-block results to the final value.
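For aspect (1), the block-matrix-multiplication analogy holds because ||x - c||^2 = ||x||^2 - 2 x^T c + ||c||^2, so the cross terms for all N x k pairs are exactly the matrix product T C^T and can be produced by a tiled GEMM. As a plainer illustration (our sketch, not the paper's code), the following CUDA kernel fills Dis[N][k] directly with one thread per (sample, center) pair:

```cuda
// Sketch: each thread computes the squared Euclidean distance between
// one sample and one center, filling the N x k distance matrix.
__global__ void distanceKernel(const float* T,   // N x D data set matrix
                               const float* C,   // k x D center matrix
                               float* Dis,       // N x k distance matrix
                               int N, int D, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // sample index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // center index
    if (i < N && j < k) {
        float d = 0.0f;
        for (int t = 0; t < D; ++t) {
            float diff = T[i * D + t] - C[j * D + t];
            d += diff * diff;
        }
        Dis[i * k + j] = d;
    }
}
```

Staging tiles of T and C in shared memory, as in block matrix multiplication, would further cut the global-memory traffic of the inner loop.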
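For aspect (2), the described grid of N/1024 one-dimensional blocks of 1024 threads maps naturally onto a standard shared-memory tree reduction. Here is a sketch under those launch parameters (label is assumed to hold each sample's nearest-center index; a second launch over the per-block sums, or a short host-side loop, produces the final E):

```cuda
// Sketch of the two-level reduction described above: each thread reads
// one sample's squared distance to its own cluster center, then the
// block tree-reduces in shared memory; blockSum holds one partial sum
// per block for the final reduction pass.
__global__ void sseReduceKernel(const float* Dis,  // N x k distance matrix
                                const int* label,  // nearest-center index per sample
                                float* blockSum,   // one partial sum per block
                                int N, int k) {
    __shared__ float s[1024];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < N) ? Dis[i * k + label[i]] : 0.0f;
    __syncthreads();
    // Tree reduction; blockDim.x is assumed to be a power of two (1024).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) blockSum[blockIdx.x] = s[0];
}
```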
Although the GPU's SIMD computation model excels at parallel computing, a GPU-based K-Means algorithm must observe three important principles [7]. First, GPU branch control and data caching are very weak, because a large number of processing units occupies most of the space on the chip. Second, the data transfer rate between the GPU and its global memory is much slower than that between the CPU and the CPU cache, so the GPU exhibits its full computing speed only with appropriately sized thread blocks and warps. Third, compared with the traditional K-Means algorithm, a GPU-based version adds the time spent transferring data between GPU global memory and CPU memory. To optimize the performance of the algorithm, the responsibilities of the host and the device must therefore be allocated sensibly, and the data storage and parallel computing model designed accordingly.
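To show how these principles shape the host code, here is a hypothetical driver for the flow charted in Fig. 1 below; it is our sketch, not the paper's implementation. The large N x D matrix T crosses the PCIe bus exactly once, while each iteration transfers only the small center matrix, the labels, and the per-block partial sums. The center recomputation is done on the host here for brevity; moving it into a kernel would also eliminate the per-iteration label copy.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Kernels from the sketches above, assumed to live in the same .cu file.
__global__ void distanceKernel(const float*, const float*, float*, int, int, int);
__global__ void sseReduceKernel(const float*, const int*, float*, int, int);

// Small helper for step 4 of Fig. 1: per-sample nearest center.
__global__ void argminKernel(const float* Dis, int* label, int N, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    int best = 0;
    for (int j = 1; j < k; ++j)
        if (Dis[i * k + j] < Dis[i * k + best]) best = j;
    label[i] = best;
}

// Hypothetical G-K-Means driver: T is uploaded once; each iteration moves
// only C (k x D), the labels, and nBlocks partial sums across the bus.
float gkMeans(const float* hT, float* hC, int N, int D, int k,
              float eps, int maxIter) {
    int nBlocks = (N + 1023) / 1024;
    float *dT, *dC, *dDis, *dPart;
    int *dLabel;
    cudaMalloc(&dT, sizeof(float) * N * D);
    cudaMalloc(&dC, sizeof(float) * k * D);
    cudaMalloc(&dDis, sizeof(float) * N * k);
    cudaMalloc(&dPart, sizeof(float) * nBlocks);
    cudaMalloc(&dLabel, sizeof(int) * N);
    cudaMemcpy(dT, hT, sizeof(float) * N * D, cudaMemcpyHostToDevice);  // once

    std::vector<float> part(nBlocks);
    std::vector<int> label(N);
    float prevE = -1.0f, E = 0.0f;
    for (int it = 0; it < maxIter; ++it) {
        cudaMemcpy(dC, hC, sizeof(float) * k * D, cudaMemcpyHostToDevice);
        dim3 bs(32, 32), gs((N + 31) / 32, (k + 31) / 32);
        distanceKernel<<<gs, bs>>>(dT, dC, dDis, N, D, k);             // step 3
        argminKernel<<<nBlocks, 1024>>>(dDis, dLabel, N, k);           // step 4
        sseReduceKernel<<<nBlocks, 1024>>>(dDis, dLabel, dPart, N, k); // step 5
        cudaMemcpy(part.data(), dPart, sizeof(float) * nBlocks, cudaMemcpyDeviceToHost);
        cudaMemcpy(label.data(), dLabel, sizeof(int) * N, cudaMemcpyDeviceToHost);
        E = 0.0f;
        for (float p : part) E += p;  // last level of the reduction
        if (prevE >= 0.0f && std::fabs(prevE - E) < eps) break;  // convergence test
        prevE = E;
        // Center update on the host in this sketch (step 4's recomputation).
        std::vector<float> sum(k * D, 0.0f);
        std::vector<int> cnt(k, 0);
        for (int i = 0; i < N; ++i) {
            ++cnt[label[i]];
            for (int t = 0; t < D; ++t) sum[label[i] * D + t] += hT[i * D + t];
        }
        for (int j = 0; j < k; ++j)
            if (cnt[j]) for (int t = 0; t < D; ++t)
                hC[j * D + t] = sum[j * D + t] / cnt[j];
    }
    cudaFree(dT); cudaFree(dC); cudaFree(dDis); cudaFree(dPart); cudaFree(dLabel);
    return E;
}
```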

The main flow chart of the GPU-based G-K-Means algorithm is shown in Fig. 1.

Figure 1. G-K-Means algorithm flow.

The algorithm proceeds as follows:
1) Initialize the convergence threshold ε, the data set matrix T[N][D], and the center matrix C[k][D] of k randomly selected samples, where D is the data sample dimension, and compute the initial Squared Error Sum;
2) Transfer the matrix T and the transposed center matrix C^T into the GPU;
3) On the GPU, use the sample matrix T and C^T to compute in parallel, by matrix operations, the distance matrix between every sample and every center point;
4) According to the distance matrix, mark each data sample with the cluster of smallest distance, recalculate the center point of each cluster, and update the transposed center matrix;
5) From the sample matrix and the center matrix, compute the Squared Error Sum and test for convergence: if the criterion is satisfied, go to 6), otherwise return to 3);
6) Output the clustering result and terminate.

4 Experiment and Analysis

The experimental data for comparing the GPU-based G-K-Means with the CPU-based K-Means were randomly generated. The experiment is divided into two groups. In the first group, the data set T has size N = 100000 and data dimension D = 100; in the second group, several data sets range from hundreds of KB to hundreds of MB. Since the efficiency of the K-Means algorithm is affected by the value of k, the first group varies k from 2 to 1024, while the second group fixes k = 128 in order to observe the speedup of the parallel algorithm over the serial one. In the experiments, the convergence threshold ε is 0.001 and the number of iterations is capped at 500. Table 1 reports the results of the first group, including the convergence time of the two algorithms, the number of iterations, and the speedup; Table 2 reports the results of the second group.

Table 1. Clustering time of G-K-Means vs. K-Means (N = 100000, D = 100).

k      G-K-Means time/s   K-Means time/s   Iterations   Speedup
2      11.386             11.186           133          0.98
4      15.446             24.914           177          1.62
8      19.089             55.645           222          2.92
16     12.339             67.049           141          5.43
32     13.236             136.298          148          10.30
64     12.580             268.150          149          21.32
128    8.307              330.629          93           39.80
256    5.61               417.006          58           74.33
512    3.607              527.611          36           146.27
1024   3.212              689.322          23           214.61

Table 2. Clustering time of G-K-Means vs. K-Means on data sets of different sizes (k = 128).

Size/MB   G-K-Means time/s   K-Means time/s   Iterations   Speedup
0.3       2.013              1.357            33           0.67
0.8       0.907              0.437            14           0.48
1.8       1.472              1.420            24           0.96
5.0       1.656              4.259            26           2.57
11.2      2.798              18.159           47           6.49
23.5      5.61               90.714           93           17.58
50.0      8.396              234.688          121          27.95
116.0     7.804              316.495          87           40.56
232.0     8.834              516.551          73           58.47

From the experimental results above, we can see that the clustering time of both algorithms is influenced by the value of k and the number of iterations. The G-K-Means parallel algorithm improves the clustering convergence time significantly compared with the K-Means algorithm; in particular, the speedup becomes more pronounced as k grows and as the clustering data grow. It can be seen from Fig. 2 that when k is small the clustering times of the two algorithms differ very little and the advantage of G-K-Means is not obvious, but as k increases, the clustering time of G-K-Means grows far more slowly than that of K-Means. When k is 1024, G-K-Means is 214.61x faster than K-Means; the parallel running time on the GPU is only weakly dependent on the value of k.

Figure 2. Clustering time comparison between G-K-Means and K-Means (N = 100000, D = 100).

Figure 3. Algorithm speedup (N = 100000, D = 100).

Figure 4. Algorithm speedup for different data set sizes (k = 128).

Fig. 4 depicts the clustering time comparison between the G-K-Means algorithm and the K-Means algorithm over a variety of data sets with k = 128. The acceleration of G-K-Means on the smaller data sets is not ideal, but as the data set grows, G-K-Means becomes increasingly faster than K-Means; when the data set reaches 232 MB, the speedup reaches 58.47x. The experimental results show that the proposed G-K-Means algorithm moves the repeated per-iteration computation and the update of the k cluster centers to the GPU to improve the convergence speed of the algorithm, and the larger the value of k, the more obvious the speedup. To illustrate the scalability of the algorithm, experiments were carried out on data sets of various sizes; the results show that the speedup of the G-K-Means parallel algorithm becomes more and more obvious as the data set scale increases.

5 Conclusion

In this paper, we studied in depth the parallelization of the K-Means clustering algorithm for data mining. The experimental results show that the GPU-based parallel algorithm greatly improves efficiency compared with the CPU-based serial algorithm, and the acceleration becomes more obvious as the data grow.

Acknowledgements

This work was supported by the Collaborative Innovation Center of Wireless Communications Technology and the National Natural Science Foundation of China (Grant No. 6157060279).

References

1. Rabiner L R. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 1989, 77(2): 257-286.
2. Rabiner L R, Juang B H. An introduction to hidden Markov models. IEEE ASSP Magazine, 1986, 3(1): 4-16.
3. Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 2008, 51(1): 107-113.
4. Zaharia M, Chowdhury M, Franklin M J, Shenker S, Stoica I. Spark: Cluster computing with working sets. Technical Report No. UCB/EECS-2010-53, 2010.
5. Han Jiawei, Kamber Micheline. Data Mining: Concepts and Techniques. Second Edition. Ming Fan, Xiaofeng Meng, translators. Beijing: Machinery Industry Press, 2007: 1-449.
6. Zechner M, Granitzer M. Accelerating K-Means on the graphics processor via CUDA. Apr. 2009: 7-15.
7. Anguo Ma. Research on Key Technologies of High Performance GPGPU Architecture. National University of Defense Science and Technology, 2011.