
An Improved K-means Algorithm based on Cloud Platform for Data Mining

Bin Xia*1, Yan Lu2

1. School of Information and Management Science, Henan Agricultural University, Zhengzhou, Henan 450002, P.R. China
2. College of Information Engineering, Zhengzhou Engineering and Technology College, Zhengzhou, Henan 450044, China

Abstract - Data mining allows users to make effective use of data in a wide range of applications, guided by their specific scientific research and business decisions. But with the enormous quantity of information embedded within data, traditional data mining algorithms face growing challenges: processing time keeps increasing, or the algorithms cannot handle massive data at all. Migrating traditional algorithms to cloud platforms for parallel processing is an effective first step toward solving this problem. The K-Means algorithm is commonly run on the Hadoop platform so that parallel processing can reduce its running time. Based on an analysis of the algorithm's limitations, we propose an improved K-Means version based on density and sampling (BSDK-Means): before clustering, the data are sampled and neighborhood density is used to determine the initial center points. Determining the initial K and the centers through sampling and density removes the need for the user to specify K and avoids the defects of the center points chosen in the initial stage. The improved K-Means algorithm is implemented with MapReduce and uses Hadoop's parallel processing ability to enhance its scalability. Our experiments show that the algorithm scales well.

Keywords - Hadoop; data mining; K-Means; MapReduce

I. INTRODUCTION

Data mining enables user data to be put to effective use under the guidance of scientific research and business decisions. Because of the enormous amount of information involved, however, traditional data mining algorithms face great challenges: they either cannot process massive data at all or spend a great deal of time doing so [1]. How to perform data mining on massive data has become one of the hot spots of current research [2]. High-performance computers and parallel computing can greatly alleviate the problem [3], because parallel computing provides the computing power needed to process massive data, and as the data grow the computing power can be increased by enlarging the cluster. Research on parallel data mining algorithms is therefore of practical significance.

Cloud computing, with its dynamic resource allocation and parallel computing functions, provides a solution for traditional data mining that cannot process massive data on its own. Massive data can be stored on the cloud platform, users can access the data in different ways, and the computing power needed for mining can be obtained from the platform on demand. The cloud platform thus provides the conditions for data mining; the key to combining the cloud platform with traditional data mining is to parallelize the traditional algorithms so that the platform's ability to process massive data can be exploited.

II. IDEA OF K-MEANS ALGORITHM

The K-Means algorithm [4] is a clustering algorithm proposed by James MacQueen in 1967. The algorithm is simple and efficient and has been widely used in scientific research and industrial applications. Its basic idea: K-Means is a cluster analysis algorithm that divides n samples into K clusters such that objects within a cluster have high similarity while objects in different clusters have low similarity [9].
The user decides the cluster number k, and K points are randomly selected as the initial center points, each initial center point forming a cluster. The remaining sample points are then assigned to the nearest cluster according to the distance formula or another similarity measure, and the average of all objects in each cluster is computed as the new center point [6,7]. The iteration is repeated until the objective function converges. In each iteration, K-Means assigns every sample point to the nearest cluster center: if a point was previously allocated to the wrong cluster it is moved to the corresponding cluster center, and if the allocation is already right no adjustment is needed.

A. Algorithm Definition

Definition 1 (clustered data set):

X = {x_1, x_2, x_3, ..., x_n}

where X denotes the set of n data points and each data point x_i is an m-dimensional vector.

Definition 2 (similarity formula): the Euclidean distance is chosen here as the similarity measure:

d(x_i, x_j) = sqrt( (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_im - x_jm)^2 )

where x_i and x_j are m-dimensional data points.
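As a quick illustration of Definition 2, the sketch below computes the Euclidean distance between two 4-dimensional points. Python with NumPy is an assumption of this example only; the paper does not prescribe an implementation language for the formula.

```python
import numpy as np

def euclidean(xi, xj):
    """Definition 2: d(xi, xj) = sqrt of the sum of squared per-dimension differences."""
    xi, xj = np.asarray(xi, float), np.asarray(xj, float)
    return float(np.sqrt(np.sum((xi - xj) ** 2)))

# two 4-dimensional points, e.g. iris-like measurements
print(euclidean([5.1, 3.5, 1.4, 0.2], [6.6, 3.0, 5.6, 2.0]))  # ~4.84
```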

Definition 3 (cluster center):

z_i = (1/m_i) * sum_{x in C_i} x

where C_i represents cluster i and m_i is the number of data points belonging to that cluster.

Definition 4 (convergence condition):

E = sum_{i=1..k} sum_{p in C_i} |p - m_i|^2

where E is the sum of the squared errors of all objects, p is a point in space, and m_i is the mean of cluster C_i.

B. Algorithm Process

The K-Means algorithm uses a partition criterion, such as the distance formula, to divide the data into k clusters so that similarity within a cluster is high and similarity between clusters is low. The main steps are shown in Tab. 1.

TABLE I. THE MAIN PROCEDURE OF K-MEANS
Input: cluster number k, the initial center points and the data to be divided
Output: K cluster memberships
1) Randomly select K objects as the cluster centers;
2) Compute the distance between every other object and each cluster center, and assign the object to the nearest cluster center;
3) Recalculate each cluster center from the objects assigned to it;
4) Check whether any cluster center changed and whether the iteration count is below the threshold; if a center changed, jump to step 2;
5) If the iteration count is below the threshold, output the members of the K clusters; otherwise report that clustering failed;
6) End.

C. Algorithm Defects

The K-Means algorithm is very clear, simple and easy to implement. Its complexity is approximately O(tkn), where t is the number of iterations, n is the total number of data points to classify, and k is the number of clusters. Typically k << n and t << n, so the complexity is approximately O(n).

Disadvantages:
- K-Means depends on the chosen K;
- K-Means depends on the initial cluster centers;
- it is sensitive to outliers;
- its scalability is limited.

Although the complexity is approximately O(n), when facing large amounts of data the number of computations grows and the similarity calculation becomes very time-consuming. Parallel computing is therefore essential when the amount of data is large.

III. IMPROVED K-MEANS ALGORITHM BASED ON DENSITY AND SAMPLING

Following the preceding analysis of the defects of the K-Means algorithm, this paper proposes an improved K-Means algorithm based on density and sampling (BSDK-Means). The initial K and the initial centers are determined through sampling and density, which removes the defects that the user must specify K and that the initial center points are chosen arbitrarily in the initial stage.

A. Concepts

Definition 1 (neighborhood): for any point P in the space and a radius r, the region centered at P with radius r is called the neighborhood of P.

Definition 2 (density): for any point P in the space, the number of points falling in the neighborhood of P is called the density of P.

B. Parallel Improvement

The BSDK-Means algorithm mainly consists of four parts: sampling the massive data several times; using density to find the center points of each sample; confirming the global centers; and clustering the data with the K-Means algorithm. Multiple samples are drawn from the massive data so that the samples reflect the original data. For each sample, the distances between data points are computed to decide which neighborhood each point belongs to, the sample center points are determined from the neighborhood densities, and the global centers of the original data are then determined from the sample center points. Because the center points are determined by sampling and density, the defect of the original K-Means algorithm, its dependence on the initial center points, is removed. With the center points specified, the data are clustered using the K-Means algorithm. When the amount of data is large, computing the distance between every object and every cluster center is the most time-consuming operation, and this time grows with the data size.
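The density-based selection of sample centers can be illustrated with a small serial sketch. This is a minimal Python sketch of the idea described above, not the paper's implementation; the neighborhood radius `radius` and the density threshold `min_density` are illustrative parameters, which in the paper are derived from the data (see Section IV).

```python
import numpy as np

def candidate_centers(sample, radius, min_density):
    """Scan one sample and build candidate centers: a point within `radius`
    of an existing candidate seed is accumulated into it (raising its density);
    otherwise it seeds a new candidate. Candidates whose density reaches
    `min_density` are returned as the mean of their neighborhood."""
    seeds, sums, counts = [], [], []
    for x in sample:
        hit = False
        for i, s in enumerate(seeds):
            if np.linalg.norm(x - s) <= radius:   # x lies in this neighborhood
                sums[i] += x
                counts[i] += 1
                hit = True
                break
        if not hit:                               # x starts a new candidate
            seeds.append(x)
            sums.append(x.astype(float).copy())
            counts.append(1)
    return [sums[i] / counts[i] for i in range(len(seeds)) if counts[i] >= min_density]

rng = np.random.default_rng(0)
sample = np.vstack([rng.normal(0, 0.3, (60, 2)), rng.normal(3, 0.3, (60, 2))])
print(candidate_centers(sample, radius=1.0, min_density=20))   # roughly two centers
```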
The BSDK-Means algorithm is therefore transplanted to the Hadoop platform, and the most time-consuming calculations are handled with Hadoop's parallel processing ability. The detailed flow is shown in Fig. 1; as the diagram shows, both the determination of the center points and the clustering step exploit the characteristics of Hadoop to run in parallel.

Figure 1. BSDK-Means flow chart

1) Determining the data center points based on sampling and density

Sampling and confirming the center points could be executed sequentially in serial mode, but for large samples and many sampling rounds this process is time-consuming, and the center confirmation of one sample is independent of the others. We therefore use the Hadoop platform's capacity for parallel processing of massive data to speed up the determination of the sample center points.
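Because the candidate-center computation for one sample does not depend on any other sample, the samples can be processed concurrently; the paper distributes them across Hadoop nodes. Purely as a stand-in illustration of that independence (an assumption of this sketch, not the paper's mechanism), the fragment below runs the `candidate_centers` function sketched earlier over several samples in parallel Python processes.

```python
from concurrent.futures import ProcessPoolExecutor

def all_sample_centers(samples, radius, min_density, workers=4):
    """Each sample is independent, so its candidate centers can be computed
    in parallel; the per-sample results are merged globally afterwards."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(candidate_centers, s, radius, min_density)
                   for s in samples]
        return [f.result() for f in futures]

# hypothetical usage: draw five random samples from the full data set `data`
# samples = [data[rng.choice(len(data), 2000, replace=False)] for _ in range(5)]
# per_sample_centers = all_sample_centers(samples, radius=1.0, min_density=50)
```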

The Hadoop platform assigns the samples to different nodes for execution; each node calls a custom Map function to generate the candidate points of its sample. Finally, a Reduce operation is performed on the generated candidate points, and those that satisfy the condition (neighborhood density greater than the density threshold) become candidate center points. Following this idea, the classes SampleMap and SampleReduce are designed.

The SampleMap class is the concrete realization of the Map operation. The Map operation works on the default <key, value> input, where key is the offset of the current row from the starting line and value is the coordinate information of a node x. In the Map operation, the distance between point x and each candidate point is calculated; if all distances are greater than the radius, x becomes a new candidate point, otherwise the information of x is added to the candidate points whose distance to x is less than the radius. The candidate points are then output. The main steps are shown in Tab. 2.

TABLE II. CLASS SAMPLEMAP MAJOR STEPS
Input: the starting offset as key, the coordinate information of node x as value
Output: candidate center point identifier as key, candidate center point as value
1) Calculate the distance between node x and each candidate point;
2) If a distance is less than the radius, accumulate the coordinates of x into that candidate and increase the candidate point's density;
3) If all distances are greater than the radius, take x as a new candidate point;
4) For every generated candidate center point, build a string value containing the accumulated coordinates and the density of the points inside its neighborhood, and use a hash of the candidate center point as the key;
5) Output <key, value>, then end the algorithm.

The SampleReduce class is the concrete realization of the Reduce operation. The Reduce operation works on the default <key, V> input, where key is the identifier of a candidate point and V is the set of intermediate values with the same key. The Reduce function judges from the densities in the set whether a candidate is qualified (its density exceeds the set threshold) and outputs the qualified center points. The main steps are shown in Tab. 3.

TABLE III. CLASS SAMPLEREDUCE MAJOR STEPS
Input: candidate center point hash as key, intermediate values with the same key as V
Output: candidate center point identifier as key, candidate center point as value
1) Determine whether the density of the candidate center point is greater than the set density threshold;
2) If it is greater, calculate the new center of the neighborhood, use the new center point identifier as key and the new center point as value, and output <key, value>;
3) If it is less, discard the candidate center point;
4) The algorithm terminates.

The candidate centers are produced by the Reduce outputs of multiple samples, and candidate centers produced from different samples may belong to the same neighborhood, so merging of the candidate centers is necessary. The merging is likewise based on the density of the global center points with respect to the original data.
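The paper implements SampleMap and SampleReduce with Hadoop's Java API; the sketch below is only a minimal Python rendering of the key/value logic in Tables II and III, intended to show what flows between the two stages. The bookkeeping is simplified (the value is a tuple rather than a string), and `radius` and `min_density` are the same illustrative parameters as before.

```python
import hashlib
import numpy as np

def sample_map(points, radius):
    """Map stage (Tab. 2): group the points of one input split into candidate
    neighborhoods and emit (candidate-id, (coordinate_sum, density)) pairs."""
    seeds, sums, counts = [], [], []
    for x in points:
        for i, s in enumerate(seeds):
            if np.linalg.norm(x - s) <= radius:   # x falls in this candidate's neighborhood
                sums[i] += x
                counts[i] += 1
                break
        else:                                     # no nearby candidate: x seeds a new one
            seeds.append(x)
            sums.append(x.astype(float).copy())
            counts.append(1)
    for s, total, cnt in zip(seeds, sums, counts):
        key = hashlib.md5(s.tobytes()).hexdigest()   # candidate identifier (hash of the seed)
        yield key, (total, cnt)

def sample_reduce(key, values, min_density):
    """Reduce stage (Tab. 3): merge the partial sums for one candidate and
    keep it only if its density reaches the threshold."""
    total = sum(v[0] for v in values)
    density = sum(v[1] for v in values)
    if density >= min_density:
        yield key, total / density                    # the qualified center point
```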
2) Using the K-Means algorithm to produce the clustering

When the K-Means algorithm is used to divide a large amount of data, the main cost arises from the distance calculation between the data points and the center points and from the recalculation of the center points. Here the distance calculation is assigned to the execution nodes of the Hadoop platform: each node computes the distance between its data points and the center points and puts each point into the cluster with the minimum distance. The recalculation of the center points is completed by the Reduce operation, whose execution nodes recompute the cluster centers. Following this idea, the classes KMeansMap and KMeansReduce are designed.

The KMeansMap class is the concrete realization of the Map operation. In the Map stage, the distance between each data point and every center point is calculated, the shortest distance is found, and the data point is assigned to the nearest center point. The main steps are shown in Tab. 4.

TABLE IV. CLASS KMEANSMAP MAJOR STEPS
Input: the starting offset as key, the coordinate information of node x as value
Output: the group number as key, the coordinate information of node x as value
1) On the first execution, read the global centers from HDFS and store them in the global variable space;
2) Calculate the distance between x and every global center, find the minimum distance, and determine the center point to which x belongs;
3) Use the index of that center point as the group number key and the coordinate information of node x as the value;
4) Output <key, value>;
5) The algorithm terminates.
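As with the sampling phase, the paper's KMeansMap and KMeansReduce are Hadoop Java classes; what follows is only a minimal Python sketch of the same key/value logic. The map step assigns a point to its nearest global center and emits (group index, point); the reduce step, whose serial steps are listed in Tab. 5 below, averages the points of one group into a new center; and the driver repeats both stages until the total squared movement of the centers falls below a threshold, the modified convergence criterion introduced later as formula (6).

```python
import numpy as np

def kmeans_map(points, centers):
    """Map stage (Tab. 4): emit (index of nearest center, point coordinates)."""
    centers = np.asarray(centers, float)
    for x in points:
        dists = np.linalg.norm(centers - x, axis=1)   # distance to every global center
        yield int(np.argmin(dists)), x

def kmeans_reduce(group_index, group_points):
    """Reduce stage (Tab. 5): the new center is the mean of the group's points."""
    return group_index, np.mean(np.asarray(group_points, float), axis=0)

def center_shift(old_centers, new_centers):
    """Stopping test: total squared shift of the centers between two rounds."""
    old, new = np.asarray(old_centers), np.asarray(new_centers)
    return float(np.sum((old - new) ** 2))
```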

The KMeansReduce class collects the data points of each cluster and calculates each group's new center. Its main steps are shown in Tab. 5.

TABLE V. CLASS KMEANSREDUCE MAJOR STEPS
Input: group index as key, the nodes belonging to the group as value
Output: group index as key, the new center point as value
1) Accumulate each coordinate of the nodes belonging to the same group index, compute the average of every dimension, and take the averages as the coordinates of the new center point;
2) Use the group index as key and the new center point as value;
3) Output <key, value>;
4) The algorithm terminates.

New center points are produced by the Reduce stage. If the change between the new centers and the centers of the previous round is less than the threshold, the algorithm ends; if it exceeds the threshold, the new center points are used as the initial center points for the next round of clustering.

The convergence condition of the K-Means algorithm is usually the square error criterion, defined as formula (5):

E = sum_{i=1..K} sum_{p in C_i} |p - m_i|^2    (5)

where E is the sum of the squared errors of all objects, p is a point in space, and m_i is the mean of group C_i. For large data the computing time of the square error cannot be ignored, so this criterion is not suitable as the convergence test on large data. In view of this, the convergence condition is modified to the distance between the cluster centers of two successive rounds, defined as formula (6):

E = sum_{i=1..k} |p_i - p_i'|^2    (6)

where p_i is a center point and p_i' is the new center point of the corresponding cluster.

C. Complexity Analysis

The original K-Means runs on a single node, and its complexity is O(tkn). The proposed BSDK-Means algorithm runs on the Hadoop platform and uses Hadoop's parallel programming capability to assign the computing tasks to p nodes, so its complexity is O(t1 + t*k*n/p), where t1 is the time cost of determining the center points from the samples; when the amount of data is large, the time spent determining the center points by sampling is negligible. Here t is the iteration count of the improved algorithm, which the experimental tests show to be no larger than that of the original algorithm.

IV. EXPERIMENT

The experiment mainly compares the parallel K-Means algorithm and the BSDK-Means algorithm on multiple data sets, analysing the test results from three aspects: clustering results, convergence time and speedup. The experimental data consist of two parts: the clustering results are tested with Edgar Anderson's iris (Iris) data, and the convergence time and speedup are tested with artificial data sets D0 (200,000 records), D1 (500,000 records), D2 (800,000 records), D3 (1,000,000 records) and D4 (1,200,000 records). Because the K-Means algorithm depends on K and the initial center points, a scoring scheme is used: each experiment is repeated 10 times for every data set, the worst and the best records are discarded, and the remaining 8 records are averaged. The BSDK-Means algorithm, on the other hand, depends on the density threshold and the neighborhood radius; literature [8] concluded that k <= sqrt(n) and that the density threshold is related to sqrt(n), while the neighborhood radius depends on the specific data space. In the experiments the density threshold was set to sqrt(n) and the neighborhood radius to 2*sqrt(n); after repeated experiments, this density and neighborhood radius fit the experimental results very well.

A. Analysis of Clustering Results

The iris data set (Iris dataset) records the geographic variation of iris flowers studied by Anderson on the Gaspé Peninsula in Canada [5] and contains 150 samples. The 150 samples cover three kinds of iris: mountain iris (Iris setosa), Iris versicolor and Virginia iris (Iris virginica). Each sample has four attributes, the length and width of the sepals and of the petals, so the data can be represented as a 150 x 4 sample matrix.
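The 150 x 4 layout described above is easy to reproduce; the sketch below uses scikit-learn's bundled copy of the Iris data purely for illustration (the paper does not state how the data were obtained, and published copies of the data set differ slightly in a few values).

```python
from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target        # X: 150 x 4 sample matrix, y: 3 species labels
print(X.shape)                       # (150, 4)
# per-species means, close to the reference centers quoted below
for label in np.unique(y):
    print(iris.target_names[label], X[y == label].mean(axis=0).round(3))
```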
The iris data set was chosen to test the original K-Means algorithm and the BSDK-Means algorithm because its 150 samples are already definitively divided into three categories and the cluster centers are clear: the center locations are (6.588, 2.974, 5.552, 2.026), (5.006, 3.418, 1.464, 0.244) and (5.936, 2.770, 4.260, 1.326) respectively.

Tab. 6 gives the results of the original K-Means algorithm and the BSDK-Means algorithm on the iris data set. As the table shows, the original K-Means algorithm misclassifies 20 samples and clusters 130 correctly, while the BSDK-Means algorithm misclassifies 14 samples and clusters 136 correctly. The improved algorithm's classification error rate is 4 percentage points lower than that of the original algorithm, a better clustering result. The BSDK-Means algorithm selects the initial points according to sampling and density, which is more targeted than random initial points, so it is more accurate.

B. Running Time

The running time is used to judge the execution speed of the algorithms. In terms of the algorithms themselves, the K-Means time is mainly spent on grouping the data, while the BSDK-Means time consists of two parts: determining the centers and grouping the data.

On the original data, the BSDK-Means algorithm spends more of its time in the data-grouping part. To better illustrate the cost of the algorithms themselves, the Hadoop inter-node communication time is ignored here, as is the timing error of different nodes executing the same work, and the number of iterations is used to measure the execution speed of the algorithms.

TABLE VI. COMPARISON OF THE ORIGINAL K-MEANS AND BSDK-MEANS RESULTS
Clustering algorithm | Misclassified samples | Error rate | Clustering centers | Error
K-Means    | 20 | 13.3% | c1=(5.0038, 3.44, 1.4768, 0.2545); c2=(5.8799, 2.763, 4.3672, 1.3873); c3=(6.7820, 3.0384, 5.6568, 2.0435) | 0.2680
BSDK-Means | 14 | 9.3%  | c1=(5.0064, 3.4020, 1.4962, 0.25239); c2=(5.9362, 2.834, 4.390, 1.4082); c3=(6.6359, 3.047, 5.50, 2.032) | 0.184

From the analysis of the algorithms, under the same conditions the execution times of the parallel K-Means algorithm and the BSDK-Means algorithm depend only on the initial center points. In this experiment the number of Hadoop platform nodes was set to 4, and the D0, D1, D2, D3 and D4 data sets were tested; the results are shown in Fig. 2. As Fig. 2 shows, the iteration counts of both the BSDK-Means algorithm and the parallel K-Means algorithm generally increase with the amount of data, but the BSDK-Means algorithm needs fewer iterations than parallel K-Means. Because BSDK-Means confirms the initial points through sampling and density, which is more targeted than random selection, it converges faster.

Figure 2. Algorithm running times (iteration counts)

C. Speedup

The speedup ratio is the ratio of the execution time of a task on a single processor to its execution time on multiple processors; it is commonly used to measure the performance and effect of parallel programs and is defined as formula (7):

S_n = T_s / T_n    (7)

where T_s is the execution time of the task on a single processor and T_n is the execution time on n processors. The experiments measured the execution time on the D0, D1, D2, D3 and D4 data sets for different numbers of nodes. Fig. 3 shows the speedup of the parallel K-Means algorithm, and Fig. 4 the speedup of the BSDK-Means algorithm. As the graphs show, both the BSDK-Means algorithm and the K-Means algorithm achieve good speedup on the Hadoop platform, and the larger the data set, the greater the speedup. However, as the number of nodes increases, the growth of the speedup flattens, because inter-node communication consumption increases and the speedup gains become gentler.

Figure 3. Parallel K-Means algorithm speedup

Figure 4. BSDK-Means algorithm speedup
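Formula (7) is straightforward to evaluate from measured running times. The sketch below shows the computation with made-up timing numbers; these figures are illustrative only and are not the paper's measurements.

```python
def speedup(t_single, t_parallel):
    """Formula (7): S_n = T_s / T_n."""
    return t_single / t_parallel

# hypothetical times (seconds) for one data set on 1, 2, 4 and 8 nodes
times = {1: 400.0, 2: 230.0, 4: 130.0, 8: 90.0}
for n, t in times.items():
    print(n, "nodes:", round(speedup(times[1], t), 2))
# speedup grows with the node count but flattens as communication cost rises
```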

V. CONCLUSIONS

Starting from the K-Means algorithm and from the defect that its clustering result depends on K and on the initial center points, this paper has proposed the improved clustering algorithm BSDK-Means, based on sampling, density and the Hadoop platform [10]. The BSDK-Means algorithm keeps the advantages of the original K-Means algorithm while selecting the initial center points by density, so it no longer depends on a user-supplied K or on the initial center points. Finally, experiments on different data sets lead to the conclusion that the BSDK-Means algorithm has better convergence and speedup.

CONFLICT OF INTEREST

The authors confirm that the content of this article involves no conflict of interest.

ACKNOWLEDGEMENTS

This work was financially supported by the Henan Science and Technology Key Project Foundation (220220507).

REFERENCES

[1] Ma S, Wang TJ and Tang SW, "A fast clustering algorithm based on reference and density," Journal of Software, pp. 1089-1095, Jul. 2003.
[2] Jeffrey Dean, "MapReduce: Simplified Data Processing on Large Clusters," Google Inc.
[3] MJ Ruo and Y Ge, "Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance," IEEE Transactions on Knowledge and Data Engineering, pp. 71-89, Jan. 2005.
[4] J. MacQueen, "Some methods for classification and analysis of multivariate observations," 1967, pp. 281-297.
[5] Edgar Anderson, "The irises of the Gaspé Peninsula," Bulletin of the American Iris Society, pp. 2-5, 1935.
[6] ShL Yang and YS Li, "The K-value optimization problem of the K-Means algorithm," Systems Engineering Theory & Practice, pp. 97-102, Feb. 2006.
[7] Edgar Anderson, "The irises of the Gaspé Peninsula," Bulletin of the American Iris Society, pp. 2-5, 1935.
[8] XF Lei, KQ Xie and F Lin, "An efficient clustering algorithm based on local optimality of K-Means," Journal of Software, pp. 1683-1692, Jul. 2008.
[9] ZhP Zhang and AJ Wang, "Method for initializing the K-Means clustering algorithm based on breadth-first search," Computer Engineering and Applications, pp. 59-61, 2008.
[10] Apache Software Foundation. Ambari, http://incubator.apache.org/ambari/, 2013.