An improved MapReduce Design of Kmeans for clustering very large datasets


Amira Boukhdhir, Laboratoire SOIE, Higher Institute of Management of Tunis, Tunis, Tunisia. Boukhdhir_amira@yahoo.fr
Oussama Lachiheb, Laboratoire SOIE, Higher Institute of Management of Tunis, Tunis, Tunisia. Oussama.Lachiheb@gmail.com
Mohamed Salah Gouider, Laboratoire SOIE, Higher Institute of Management of Tunis, Tunis, Tunisia. Ms.gouider@yahoo.fr

Abstract: Clustering is an important data analysis technique used to group data into sets of similar items, and it has broad applicability across many fields. It becomes very challenging, however, with the sharp increase in the volume of data generated by modern applications. K-means is a simple and widely used algorithm for cluster analysis, but the traditional k-means is computationally expensive, sensitive to outliers, and produces unstable results, which makes it inefficient on very large datasets. Solving these issues is the subject of many recent research works. MapReduce is a simplified programming model designed to process data-intensive applications in a parallel environment. In this paper, we propose an improved MapReduce-based design of k-means that adapts it to large-scale datasets by reducing its execution time. We also propose two further algorithms: one that removes outliers from the dataset, and one that selects the initial centroids automatically, thereby stabilizing the result. An implementation of the proposed algorithm on the Hadoop platform shows that it is much faster than three other algorithms from the literature.

Keywords: k-means; MapReduce; Hadoop; clustering; Big Data

I. INTRODUCTION

In today's era, the proliferation of data arriving quickly and from many sources creates challenges for its collection, processing, management and analysis. Big data emerged from the need to tackle these challenges in order to seize the opportunities the data offer. Data mining is a set of techniques for discovering interesting knowledge in huge datasets, but traditional data mining approaches cannot be applied directly to big data because of their sheer complexity.

Clustering is one of the most popular tools in data mining, used to analyze the massive volumes of data generated by modern applications. It consists in partitioning a large set of data into clusters in such a way that the patterns within a cluster are maximally similar and patterns in different clusters are maximally dissimilar. It becomes very challenging as data volumes keep growing. Several algorithms have been designed to perform clustering, each based on a different principle; they divide into hierarchical, partitioning, density-based, grid-based and model-based algorithms [1].

K-means is the best-known partitioning algorithm used in cluster analysis. Its simplicity and its performance have attracted considerable interest, especially in recent years. However, it has shortcomings when dealing with very large datasets, owing to its high computational complexity, its sensitivity to outliers, and the dependence of its results on randomly selected initial centroids. Many solutions have been proposed to improve the performance of k-means, but none provides a complete answer. Some of the proposed algorithms are fast but sacrifice the quality of the clusters.
Others generate clusters of good quality but are very expensive in terms of computational complexity.

Xiaoli Cui et al. proposed an improved k-means algorithm [2]. They used a sampling technique to reduce the I/O cost and the network cost of PK-means. Experimental results showed that the optimized algorithm is efficient and performs better than k-means. However, because the algorithm is applied only to representative points instead of the whole dataset, its accuracy is not especially high.

Duong Van Hieu and Phayung Meesad proposed in their paper [3] a new approach that reduces the execution time of k-means by cutting off a number of the last iterations. Their experiments showed that this method can remove up to 30% of the iterations, which is equivalent to 30% of the execution time, while maintaining up to 98% accuracy. However, choosing the initial centroids randomly makes the clustering unstable, and the result may also be affected by noise points, leading to inaccuracies.

Li Ma et al. in their paper [4] proposed a solution for improving the clustering quality of the traditional k-means. They solved the instability of clustering by selecting the value of k as well as the initial centroids systematically, and they addressed the sensitivity to outliers by reducing the number of noise points. This algorithm produces good-quality clusters but takes more computation time.

Thus, the efficiency and effectiveness of k-means remain to be improved further.

On this basis, our paper presents an enhanced MapReduce design of k-means for clustering very large datasets. First, we propose a solution that reduces the execution time of k-means so that it can handle very large datasets. Second, we propose an algorithm that selects the initial centroids systematically in order to stabilize the result, and another algorithm that removes outliers from the dataset, since they degrade the quality of the clustering.

This paper is structured as follows. First, related work is presented. Second, the proposed approach is explained. Third, the experimental analysis is shown. Finally, the conclusion and future scope are covered.

II. BACKGROUND AND RELATED WORKS

A. MapReduce Overview

MapReduce [5] is a programming model developed by Google in 2004 that supports distributed parallel computing across hundreds or thousands of nodes. Its goal is to hide the messy details of parallelization, fault tolerance and data distribution. With the MapReduce paradigm, the programmer only has to define two functions, Map and Reduce; MapReduce is responsible for all the remaining work. It divides the input data into small blocks, and a map function is called for each block. The input and output data for both the map and reduce stages take the form of key-value pairs (k, v). After the map tasks finish, the MapReduce framework sorts their output by key; all data with the same key is then fed to the same reduce task, which merges it to produce a single result. More details on MapReduce can be found in [6]. The major advantages of MapReduce are:

Ease of use: even programmers who are not familiar with parallel programming can use it.

Scalability: because map and reduce operations run in parallel, processing power can be increased by merely adding servers to the cluster.

Failover properties: MapReduce can handle machine failures and intra-machine communication.

B. Hadoop Overview

Hadoop [7][8] is the most popular open-source framework implementing the MapReduce programming model. It is designed for writing and running distributed data-intensive applications, and it can handle huge volumes of data across clusters of commodity hardware. Hadoop consists mainly of two components:

The Hadoop Distributed File System (HDFS), an extended version of the Google File System (GFS), designed to store a massive amount of data reliably on a cluster of machines and to stream it at high bandwidth to user applications.

The MapReduce engine, composed of a JobTracker on a master node and a set of TaskTrackers on slave nodes. The JobTracker determines the execution plan, assigns tasks to the slave nodes and controls them. The TaskTrackers run the map and reduce tasks to completion.

Hadoop owes its popularity to several advantages:

Outstanding scalability: it can scale out from a single machine to thousands of machines without changes.

High failure tolerance: Hadoop tolerates failures at different levels, including disks, nodes, switches and the network.

Cost effectiveness: Hadoop brings massively parallel computing to commodity hardware.

Flexibility: Hadoop is schema-less, i.e. it can handle all types of data.

A classic word-count job written against Hadoop's Java MapReduce API is sketched below.
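The following is not code from the paper; it is the standard Hadoop word-count example, included to make the key-value flow above concrete. The map emits a <word, 1> pair per token, the framework sorts and groups the pairs by key, and the reduce sums each group; the combiner reuses the reducer for local aggregation on each mapper.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word count: map emits <word, 1>, the framework groups by key,
// reduce sums the counts for each word.
public class WordCount {
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                ctx.write(word, ONE);            // emit <word, 1>
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum)); // emit <word, total>
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // local aggregation per mapper
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```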
C. K-means Algorithm

The k-means algorithm [9] partitions the data into k clusters, each represented by its centroid, which is the mean of the patterns belonging to it. The algorithm proceeds as follows (a minimal serial sketch is given after the lists below):

Input: k, the number of clusters; the dataset.
Output: a set of k clusters.
1) Choose k arbitrary patterns as centroids and assign each of the remaining patterns to the closest centroid.
2) For each cluster, recalculate the centroid, then reassign each pattern in the dataset to the closest centroid.

The k-means algorithm thus begins with an initial set of k centers selected at random and iteratively updates it so as to decrease the error function. It stops when a convergence condition holds; common conditions are reaching a locally optimal partition (the partitioning error is no longer reduced by relocating the centers) or reaching a pre-defined number of iterations.

Advantages: k-means has the following benefits:
i) simple implementation with high efficiency and scalability;
ii) linear time complexity.

Disadvantages:
i) it is sensitive to outliers, which affect the accuracy of the clustering;
ii) it is non-deterministic, producing different clusters on each execution, since the result depends heavily on the selection of the initial centroids;
iii) it becomes computationally expensive as the number of data points increases.
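A minimal serial sketch of these steps, assuming numeric points and Euclidean distance; the class and method names are illustrative, not from the paper. Note that the random seeding in the first loop is exactly the source of instability discussed above.

```java
import java.util.Arrays;
import java.util.Random;

// Minimal serial k-means: random seeding, then alternating assignment
// and centroid-update steps until no point changes cluster.
public class KMeans {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    static double[][] cluster(double[][] data, int k, int maxIter) {
        Random rnd = new Random();
        double[][] centroids = new double[k][];
        for (int j = 0; j < k; j++)
            centroids[j] = data[rnd.nextInt(data.length)].clone(); // random seeds
        int[] assign = new int[data.length];
        Arrays.fill(assign, -1);
        for (int it = 0; it < maxIter; it++) {
            boolean changed = false;
            // Assignment step: each point goes to its closest centroid.
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (dist(data[i], centroids[j]) < dist(data[i], centroids[best])) best = j;
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            if (!changed) break;                  // convergence: no point moved
            // Update step: each centroid becomes the mean of its points.
            int dim = data[0].length;
            double[][] sum = new double[k][dim];
            int[] count = new int[k];
            for (int i = 0; i < data.length; i++) {
                count[assign[i]]++;
                for (int d = 0; d < dim; d++) sum[assign[i]][d] += data[i][d];
            }
            for (int j = 0; j < k; j++)
                if (count[j] > 0)
                    for (int d = 0; d < dim; d++) centroids[j][d] = sum[j][d] / count[j];
        }
        return centroids;
    }
}
```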

D. Parallel k-means based on MapReduce

PK-means (parallel k-means) is a MapReduce-based parallel k-means proposed in [10][11]. It consists of three functions: Map, Combine and Reduce. The Map function computes the distances from each data point in the mapper to the k centers and assigns the point to the closest one. The Combine function performs the local centroid calculations. The Reduce function calculates the new global centroids. The PK-means algorithm is as follows.

Input: k, the number of clusters; the dataset.
Output: a set of k clusters.

Map:
Input: a list of <key1, value1> pairs and the k global centroids, where key1 is the position of a data point and value1 is its content.
Output: a list of <key2, value2> pairs, where key2 is the index of a cluster and value2 is the points belonging to that cluster.
1) Calculate the distances between the data point and the k centroids.
2) Assign it to the closest centroid.
3) Repeat 1) and 2) until all the data points in the mapper are processed.

Combine:
Input: a list of <key2, value2> pairs.
Output: a list of <key3, value3> pairs, where key3 is the index of a cluster and value3 is the sum of the points belonging to that cluster together with their number.
1) Calculate the sum of the data points belonging to the same cluster.
2) Calculate their number.
3) Repeat 1) and 2) until the k clusters are processed.

Reduce:
Input: a list of <key3, value3> pairs.
Output: a list of <key4, value4> pairs, where key4 is the index of a cluster and value4 is its new global centroid.
1) Calculate the new centroid of the cluster: the mean of the data points belonging to it.
2) Repeat 1) until the centroids of the k clusters are calculated.

This MapReduce job is repeated until convergence, i.e. until the centroids no longer change.

Advantages: PK-means has the following strengths. It is easy to implement; it is very efficient and takes less time to build clusters; and it performs well with respect to speed-up, scale-up and size-up.

Disadvantages: PK-means has two drawbacks. It is sensitive to outliers and suffers from unstable results, like k-means. In addition, its I/O cost and network cost are very high, since the whole dataset is shuffled across the cluster of nodes at every iteration. A hedged Java sketch of the Map and Reduce steps is given below.
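To make the PK-means job concrete, here is a sketch of its Map and Reduce steps against Hadoop's Java MapReduce API. The serialization is an assumption, not given in [10][11]: points arrive as comma-separated lines, and the current centroids are shipped to mappers through the job configuration under a hypothetical "centroids" key. The combiner is omitted for brevity; in PK-means it would emit per-cluster (sum, count) pairs so that only aggregates cross the network.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// PK-means map step: assign each point to its nearest centroid.
// Assumed format: one point per line, comma-separated doubles; the k
// centroids are passed in the configuration, semicolon-separated.
public class PKMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private double[][] centroids;

    @Override
    protected void setup(Context ctx) {
        String[] rows = ctx.getConfiguration().get("centroids").split(";");
        centroids = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) centroids[i] = parse(rows[i]);
    }

    static double[] parse(String line) {
        String[] parts = line.split(",");
        double[] p = new double[parts.length];
        for (int i = 0; i < parts.length; i++) p[i] = Double.parseDouble(parts[i]);
        return p;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s; // squared distance suffices for comparisons
    }

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        double[] p = parse(value.toString());
        int best = 0;
        for (int j = 1; j < centroids.length; j++)
            if (dist(p, centroids[j]) < dist(p, centroids[best])) best = j;
        ctx.write(new IntWritable(best), value); // emit <cluster index, point>
    }
}

// PK-means reduce step: the new centroid of a cluster is the mean of its
// points (here consuming raw points, since the combiner is omitted).
class PKMeansReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        double[] sum = null;
        long n = 0;
        for (Text v : values) {
            double[] p = PKMeansMapper.parse(v.toString());
            if (sum == null) sum = new double[p.length];
            for (int d = 0; d < p.length; d++) sum[d] += p[d];
            n++;
        }
        StringBuilder out = new StringBuilder();
        for (int d = 0; d < sum.length; d++)
            out.append(d == 0 ? "" : ",").append(sum[d] / n);
        ctx.write(key, new Text(out.toString())); // emit <cluster index, new centroid>
    }
}
```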
III. IM K-MEANS FOR CLUSTERING VERY LARGE DATASETS

Our purpose in this paper is to solve the problems mentioned above. To do so, we propose an algorithm that removes the noise points from the dataset and another algorithm that selects the initial centroids systematically. Moreover, we give a solution that reduces the number of data points to be clustered at each iteration, thus improving the clustering speed and complexity.

A. Design of the proposed algorithm

The proposed algorithm consists of three stages.

Stage 1: In this stage the outliers are removed from the dataset. We define a radius eps and a number of neighbors nb to measure the density of the area to which an object belongs. For each data point in the dataset we count its neighbors, i.e. the data points whose distance to the considered point is less than eps. If a data point has fewer than nb neighbors, it is considered an outlier and is removed from the dataset. (A sketch of this filter is given after the three stage descriptions.)

Stage 2: In this stage, the initial centroids are selected systematically. To perform this task, we define a MapReduce job consisting of three functions: Map, Combine and Reduce. The map function computes the distances between the data points stored in the mapper and the centroids already selected. The combine function determines the local data point farthest from them. The reduce function determines the data point in the whole dataset most distant from the centroids already selected. Before this job is executed, the master node determines the first two centroids, which are the two farthest objects in the input dataset. The job is repeated k-2 times to select the k initial centroids.

Stage 3: In this stage, the dataset is partitioned into k clusters by another MapReduce job. As we aim to reduce the clustering time and complexity, we use a structure that shrinks the number of data points to be clustered at each iteration: a table holding the misplaced data points, with clustering applied only to the points present in this table. The clustering job is composed of three functions: Map, Combine and Reduce. The map function calculates the distances between the data points in the misplaced table and the k global centroids, then assigns each data point to the closest centroid. A data point whose distance to its closest centroid is less than d (a predetermined value) is considered to be placed in the right cluster and is removed from the table of misplaced points. The combine function is used to calculate the local centroids. Finally, the reduce function computes the new global centroids.
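As noted above, here is a minimal single-machine sketch of the Stage 1 density filter, following the prose description (fewer than nb neighbors within radius eps marks an outlier). Class and method names are illustrative assumptions; the paper does not specify how this scan is parallelized.

```java
import java.util.ArrayList;
import java.util.List;

// Stage 1 sketch: keep a point only if at least nb other points lie
// within radius eps of it; otherwise treat it as an outlier and drop it.
public class OutlierFilter {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    static List<double[]> filter(List<double[]> data, double eps, int nb) {
        List<double[]> kept = new ArrayList<>();
        for (double[] x : data) {
            int neighbors = 0;
            for (double[] y : data) {
                if (y == x) continue;                              // skip the point itself
                if (dist(x, y) < eps && ++neighbors >= nb) break;  // enough neighbors: stop early
            }
            if (neighbors >= nb) kept.add(x);                      // dense enough: not an outlier
        }
        return kept;
    }
}
```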

Fig. 1 shows the three stages of the proposed algorithm.

Fig. 1. Execution of the proposed algorithm.

B. Proposed algorithm

To describe the improved algorithm, the following notation is useful:
D: the original dataset to be clustered.
C: the set that will contain the selected initial centroids.
k: the number of clusters, predetermined by the user.

Stage 1: Remove noise points from the dataset D.

Algorithm 1
Input: dataset D, eps, nb.
Output: dataset D without outliers.
1) nbneighbors = 0.
2) Calculate the distance between a data point x and another point y. If the distance between x and y is less than eps, then nbneighbors = nbneighbors + 1.
3) Repeat 2) until nbneighbors = nb or all the distances between x and the points in D have been calculated.
4) If nbneighbors < nb, remove x from the dataset D.
5) Repeat 1) to 4) until all the data points in the dataset have been tested.

Stage 2: Select the initial centroids.

Algorithm 2
Map:
Input: a list of <key1, value1> pairs and C, where key1 is the position of a data point in D and value1 is its content.
Output: a list of <key2, value2> pairs, where key2 is the farthest object from an object in C and value2 is the corresponding distance.
1) Calculate the distance between an object of C and the objects of D.
2) Find the object farthest from that object.
3) Repeat 1) and 2) until all the objects in C have been processed.

Combine:
Input: a list of <key2, value2> pairs.
Output: a <key3, value3> pair, where key3 is the local object farthest from all the objects in C and value3 is the corresponding distance.
1) Find the local object farthest from all the objects in C.

Reduce:
Input: <key3, value3> pairs.
Output: a <key4, value4> pair, where value4 is the set of initial centroids.
1) Find the object of D farthest from the objects in C.
2) Add that object to C.

Stage 3: Cluster the dataset.

Algorithm 3
Map (see the Java sketch after these listings):
Input: a list of <key1, value1> pairs, the k global centroids and d, where key1 is the position of a data point, value1 is its content and d is a predetermined value.
Output: a list of <key5, value5> pairs, where key5 is the index of a cluster and value5 is the points belonging to that cluster.
1) Initially, all the data points stored in the mapper are placed in a table containing the misplaced points.
2) Calculate the distances between an element of the table and the k global centroids.
3) If the distance between the element and the closest centroid is less than d, remove this element from the table of misplaced objects and assign it to the corresponding cluster; otherwise, assign it to the cluster with the closest centroid.
4) Repeat 2) and 3) until all the data points in the table have been processed.

Combine:
Input: a list of <key5, value5> pairs.
Output: a list of <key6, value6> pairs, where key6 is the index of a cluster and value6 is the sum of the points belonging to that cluster together with their number.
1) Calculate the sum of the data points belonging to the same cluster.
2) Calculate their number.
3) Repeat 1) and 2) until the k clusters are processed.

Reduce:
Input: a list of <key6, value6> pairs.
Output: a list of <key7, value7> pairs, where key7 is the index of a cluster and value7 is its new global centroid.
1) Calculate the new centroid of the cluster: the mean of the data points belonging to it.
2) Repeat 1) until the centroids of the k clusters are calculated.
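A minimal single-machine sketch of the Algorithm 3 map step over the misplaced-points table. The list-based table and all names are illustrative assumptions; the point it demonstrates is that a point settling within distance d of its centroid leaves the table, so later iterations touch fewer points.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// One map pass of Stage 3: re-cluster only the points still marked as
// misplaced. A point closer than d to its nearest centroid is treated as
// correctly placed and leaves the table, shrinking later iterations.
public class Stage3Map {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    static List<Map.Entry<Integer, double[]>> mapPass(
            List<double[]> misplaced, double[][] centroids, double d) {
        List<Map.Entry<Integer, double[]>> emitted = new ArrayList<>();
        Iterator<double[]> it = misplaced.iterator();
        while (it.hasNext()) {
            double[] p = it.next();
            int best = 0;
            for (int j = 1; j < centroids.length; j++)
                if (dist(p, centroids[j]) < dist(p, centroids[best])) best = j;
            emitted.add(Map.entry(best, p));   // emit <cluster index, point>
            if (dist(p, centroids[best]) < d)
                it.remove();                   // settled: skip in later iterations
        }
        return emitted;
    }
}
```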

C. Global IM k-means algorithm

Input: a dataset D, the number of clusters k, eps, nb, d.
Output: k clusters.
1) At the master worker: remove the noise points; select the first two centroids, the two farthest objects in D; form a list of <key1, value1> pairs from D, called List1; divide List1 into W blocks and distribute them to W workers.
2) At all workers: run the map function to produce <key2, value2> pairs; run the combine function to produce <key3, value3> pairs.
3) At the master worker: run the reduce function to produce the <key4, value4> pairs, which are the initial global centroids.
4) At all workers: run the map function to produce <key5, value5> pairs (the index of a cluster and the data points belonging to it); run the combine function to produce <key6, value6> pairs (the sum of the points belonging to the same cluster and their number).
5) At the master node: run the reduce function to produce the <key7, value7> pairs, which are the global centroids.
6) At the master node: calculate the difference between the new global centroids and the old global centroids. If there is no change, stop; otherwise, repeat from step 4. (A driver sketch of this loop is given below.)

The flowchart of the proposed algorithm is shown in Fig. 2.

Fig. 2. Flowchart of IM k-means.
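The loop over steps 4 to 6 amounts to chaining MapReduce jobs from the master until the centroids stop moving. Below is a hedged driver sketch; runClusteringJob and readNewCentroids are assumed placeholders for the job submission and output reading that the paper does not detail.

```java
// Driver sketch for steps 4-6 of the global IM k-means loop: rerun the
// clustering job until the global centroids no longer change. The two
// abstract helpers stand in for Hadoop job plumbing not given in the paper.
public abstract class ImKMeansDriver {
    abstract void runClusteringJob(double[][] centroids) throws Exception;
    abstract double[][] readNewCentroids() throws Exception;

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    void iterate(double[][] initial) throws Exception {
        double[][] old = initial;
        while (true) {
            runClusteringJob(old);                 // one map/combine/reduce round
            double[][] fresh = readNewCentroids(); // reducer output
            double shift = 0;
            for (int j = 0; j < fresh.length; j++)
                shift = Math.max(shift, dist(old[j], fresh[j]));
            if (shift == 0) break;                 // centroids unchanged: stop (step 6)
            old = fresh;
        }
    }
}
```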

IV. IMPLEMENTATION AND EXPERIMENTAL RESULTS

A. Implementation of the proposed algorithm

Our algorithm was implemented in Java, using the Hadoop framework to execute the MapReduce tasks, on two machines, each with an Intel i3 processor and 4 GB of RAM. The dataset used in our experimental studies is a real dataset collected from the Tunisian stock exchange daily trading in fiscal years 2012, 2013 and 2014. It contains 1,020,000 records of six attributes each and is 1.1 GB in size. The characteristics of the dataset are shown in Table I.

TABLE I. EXPERIMENTAL DATASET
Dataset     Number of records    Number of attributes    Size
Dataset1    1,020,000            6                       1.1 GB

B. Experimental results

First, we ran our algorithm to observe the number of iterations obtained with different numbers of data points. Table II shows the experimental results, and Fig. 3 depicts them.

Fig. 3. Number of iterations of IM k-means.

The results show that as the number of data points increases, so does the number of iterations. Second, we compared the proposed algorithm with three other algorithms, namely traditional k-means, PK-means and Fast k-means, in terms of execution time. The experiments were carried out six times with different numbers of records. Fig. 4 illustrates the results obtained.

Fig. 4. Running time comparison.

According to Fig. 4, the execution time of IM k-means is clearly much lower than that of traditional k-means, PK-means and Fast k-means. It is therefore more efficient than they are and better suited to handling large-scale datasets.

V. CONCLUSION AND FUTURE SCOPE

In this paper, an improved design of k-means for clustering very large datasets efficiently is proposed. The experimental results show that our algorithm takes less time than traditional k-means, PK-means and Fast k-means. It has two limitations, however: the value of k is required as input, and it can only be applied to datasets whose attributes have numerical values. In the future, we can enhance our approach so that the number of clusters is determined automatically, and we can extend its application to datasets with categorical attributes.

References

[1] C. Fraley and A. E. Raftery. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis. Technical Report No. 329, Department of Statistics, University of Washington, 1998.
[2] X. Cui et al. Optimized big data K-means clustering using MapReduce. Springer Science+Business Media New York, 2014.
[3] D. Van Hieu and P. Meesad. Fast K-Means Clustering for Very Large Datasets Based on MapReduce Combined with a New Cutting Method (FMR.K-Means). Springer International Publishing Switzerland, 2015.
[4] L. Ma et al. An improved k-means algorithm based on MapReduce and Grid. International Journal of Grid Distribution Computing, 2015.
[5] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation, 2004.
[6] T. White. Hadoop: The Definitive Guide. 1st ed. O'Reilly Media, Inc., 2009.
[7] Apache Hadoop. http://hadoop.apache.org/
[8] J. Venner. Pro Hadoop. Apress, June 22, 2009.
[9] J. MacQueen. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[10] Z. Lv et al. Parallel K-Means Clustering of Remote Sensing Images Based on MapReduce. Springer-Verlag Berlin Heidelberg, 2010.
[11] W. Zhao et al. Parallel K-Means Clustering Based on MapReduce. International Conference on Cloud Computing Technology and Science (CloudCom), 2009.