TELKOMNIKA, Vol. 10, No. 5, September 2012, pp. 1087~1092
e-ISSN: 2087-278X
Accredited by DGHE (DIKTI), Decree No: 51/Dikti/Kep/2010

Parallel Implementation of Classification Algorithms Based on a Cloud Computing Environment

Lijuan Zhou, Hui Wang, Wenbo Wang
Capital Normal University, Information Engineering College, Beijing, China, 100048
e-mail: wanghui861218@163.com

Abstract
As an important task of data mining, classification has received considerable attention in many applications, such as information retrieval, web searching, etc. The enlarging volume of information produced by the progress of technology, together with growing individual needs for data mining, makes classifying very large data sets a challenging task. To deal with this problem, many researchers have tried to design efficient parallel classification algorithms. This paper briefly introduces classification algorithms and cloud computing, analyses the shortcomings of present parallel classification algorithms, and then proposes a new model for parallel classification algorithms. It mainly introduces a parallel Naïve Bayes classification algorithm based on MapReduce, a simple yet powerful parallel programming technique. The experimental results demonstrate that the proposed algorithm improves on the performance of the original algorithm and can process large datasets efficiently on commodity hardware.

Keywords: Naïve Bayes, Classification, MapReduce, Hadoop

Copyright © 2012 Universitas Ahmad Dahlan. All rights reserved.

1. Introduction
The rapid growth of the Internet and World Wide Web has led to vast amounts of information available online, often referred to as Big Data. Storing, managing, accessing, and processing this vast amount of data represents a fundamental need and an immense challenge, in order to satisfy the need to search, analyse, mine, and visualize it as information. Efficient parallel classification algorithms and implementation techniques are the key to meeting the scalability and performance requirements entailed in such scientific data analyses.
So far, several researchers have proposed parallel classification algorithms. These algorithms share the following flaws [1]: a) they assume that all objects can reside in memory simultaneously; b) the parallel systems offer restricted programming models and use those restrictions to parallelize the computation automatically. Both assumptions are prohibitive for datasets composed of millions of objects. Therefore, dataset-oriented parallel classification algorithms should be developed, and these algorithms should run on tens, hundreds, or even thousands of servers. With the emergence of cloud computing, parallel techniques are able to address more challenging problems, such as heterogeneity and frequent failures. Cloud computing architectures that support data-parallel applications are a potential solution to the terabyte- and petabyte-scale data processing requirements of Big Data computing [2]. Several solutions have emerged, including the MapReduce architecture pioneered by Google and now available in an open-source implementation called Hadoop, used by Yahoo, Facebook, and others.

In this paper, we adapt classification algorithms to the MapReduce framework, as implemented by Hadoop, to make the classification method applicable to large-scale data. We conduct comprehensive experiments to evaluate the proposed algorithm on real datasets. The results demonstrate that the efficiency of the proposed algorithm is higher than that of the initial algorithm.

The rest of the paper is organized as follows. Section 2 introduces MapReduce. Section 3 presents the parallel Naïve Bayes algorithm based on the MapReduce framework. Section 4 shows experimental results and evaluations. Finally, conclusions and future work are presented in Section 5.

Received June 7, 2012; Revised September 2, 2012; Accepted September 11, 2012
2. MapReduce Overview
MapReduce is a software framework introduced by Google in 2004 to support distributed computing on large data sets across clusters of computers. The MapReduce programming model is designed to process large volumes of data in a parallel fashion [3]. The model divides the workload across the cluster by dividing the input into input splits. When a client submits a job to the framework, a single map task processes one input split; each split is further divided into records, and the map processes each record in turn. The client does not need to deal with InputSplits directly, because they are created by an InputFormat, which is responsible for creating the input splits and dividing them into records. The framework assigns one split to each map function. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible using the rack-aware file system. Each TaskTracker processes its records in turn.

The MapReduce framework guarantees that the input to every reducer is sorted by key. The process that performs this sort and transfers the map outputs to the reducers as inputs is known as the shuffle. The map function does not simply write its output to disk; for efficiency, it buffers its writes in memory and does some pre-sorting. Figure 1 shows this process.

Figure 1. The framework of MapReduce (map tasks read splits from HDFS, partition and sort their buffered output, spill and merge it on disk, and the merged runs are shuffled to the reduce tasks, which merge them and write the final output)

3. Parallel Naïve Bayes Algorithm Based on MapReduce
In this section we present the main design of parallel Naïve Bayes based on MapReduce. First, we give a brief overview of the Naïve Bayes algorithm and analyse its parallel and serial parts. Then we explain in detail how the necessary computations can be formalized as map and reduce operations.

3.1. Naïve Bayes Algorithm
Naïve Bayes is a statistical classification method.
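The shuffle guarantee described above (every reducer sees the values for each key grouped together, with keys in sorted order) can be illustrated with a minimal in-memory sketch. This is not Hadoop itself; the classic word-count mapper and reducer are used here purely for illustration:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(record):
    # emit (word, 1) for every word in the input record
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):
    # sum all counts that arrived for this key
    yield (key, sum(values))

def run_mapreduce(records, mapper, reducer):
    # map phase: apply the mapper to every record
    pairs = [kv for r in records for kv in mapper(r)]
    # shuffle: sort by key, so each key's values are contiguous and keys are ordered
    pairs.sort(key=itemgetter(0))
    # reduce phase: group values by key and apply the reducer
    out = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        out.extend(reducer(key, [v for _, v in group]))
    return out

print(run_mapreduce(["a b a", "b c"], map_fn, reduce_fn))
# [('a', 2), ('b', 2), ('c', 1)]
```

In a real Hadoop job the sort and grouping are performed by the framework between the map and reduce tasks; the in-memory sort here merely stands in for that step.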
It is a well-studied probabilistic algorithm that is often used in classification, applying knowledge of probability and statistics to the task. Studies comparing classification algorithms have found Naïve Bayes comparable in performance with decision tree and selected neural network classifiers. Naïve Bayes has also exhibited high accuracy and speed when applied to large databases. The Naïve Bayes classifier assumes that the presence of a particular feature of a class is unrelated
to the presence of any other feature, given the class variable. This assumption is called class conditional independence.

To demonstrate the concept of Naïve Bayes classification, consider the following statistical formulation. Let Y be the classification attribute and X = {x_1, x_2, ..., x_n} the vector of input attribute values; the classification problem then simplifies to estimating the conditional probability P(Y|X) from a set of training patterns. P(Y|X) is the posterior probability, and P(Y) is the prior probability. Suppose that there are m classes, Y_1, Y_2, ..., Y_m. Given a tuple X, the classifier predicts that X belongs to the class having the highest posterior probability, i.e. that X belongs to class Y_i if and only if

P(Y_i|X) > P(Y_j|X) for 1 <= j <= m, j != i    (1)

Bayes' rule states that this probability can be expressed as

P(Y_i|X) = P(X|Y_i) P(Y_i) / P(X)    (2)

As P(X) is constant for all classes, only P(X|Y_i) P(Y_i) needs to be maximized. The prior probabilities are estimated from the frequencies of the classes Y_i in the training set. To reduce the computation required to evaluate P(X|Y_i), the Naïve Bayes assumption of class conditional independence is made, so that the expression can be written as

P(X|Y_i) = ∏_{k=1..n} P(x_k|Y_i)    (3)

and the probabilities P(x_1|Y_i), P(x_2|Y_i), ..., P(x_n|Y_i) are easily estimated from the training tuples. The predicted class label is the class Y_i for which P(X|Y_i) P(Y_i) is maximal.

3.2. Naïve Bayes Based on MapReduce
Cloud computing can be defined as the provision of all computing services through the Internet. It is the most advanced version of the client-server architecture and takes the system to a very high level of resource sharing and scaling. Resource pools composed of a large number of computing resources are used to create highly virtualized resources dynamically for users. But for analysis tasks over massive data, cloud platforms still lack parallel implementations of massive data mining and analysis algorithms [4].
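Equations (1)-(3) can be sketched in a few lines of Python: the prior P(Y_i) and the conditionals P(x_k|Y_i) are estimated by counting, and the class maximizing P(X|Y_i) P(Y_i) is chosen. The weather-style attribute values below are invented for illustration, not taken from the paper's data:

```python
from collections import defaultdict

def train(samples):
    # samples: list of (attribute_tuple, label) pairs
    class_count = defaultdict(int)          # label -> count, for the prior P(Y_i)
    attr_count = defaultdict(int)           # (label, index, value) -> count, for P(x_k | Y_i)
    for attrs, label in samples:
        class_count[label] += 1
        for i, v in enumerate(attrs):
            attr_count[(label, i, v)] += 1
    return class_count, attr_count

def predict(attrs, class_count, attr_count):
    total = sum(class_count.values())
    best_label, best_score = None, -1.0
    for label, n in class_count.items():
        score = n / total                   # prior P(Y_i)
        for i, v in enumerate(attrs):
            score *= attr_count[(label, i, v)] / n   # conditional P(x_k | Y_i)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

samples = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
           (("rain", "mild"), "yes"), (("rain", "hot"), "yes")]
cc, ac = train(samples)
print(predict(("rain", "mild"), cc, ac))
# yes
```

Because P(X) in (2) is the same for every class, the sketch compares the unnormalized products P(X|Y_i) P(Y_i) directly, exactly as the text argues.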
Therefore, a new cloud computing model of massive data mining includes pre-processing for huge amounts of data, cloud computing for massively parallel data mining algorithms, new massive data mining methods, and so on [5]. The critical problem of massive data mining is the parallelization of the data mining algorithms. Cloud computing uses the new computing model known as MapReduce, which means that existing data mining algorithms and parallel strategies cannot be applied directly to the cloud computing platform; some transformation must be done. Based on this, and on the characteristics of massive data mining algorithms, the cloud computing model has been optimized and extended to make it more suitable for massive data mining [6]. This paper therefore adopts the Hadoop distributed system infrastructure, which provides the storage capacity of HDFS and the computing capability of MapReduce, to implement parallel classification algorithms. The implementation of parallel Naïve Bayes in the MapReduce model is divided into a training stage and a prediction stage.

3.2.1. Training Stage
The distributed computing of Hadoop is divided into two phases, called Map and Reduce. First, the InputFormat, which belongs to the Hadoop framework, loads the input data into small data blocks known as input splits; each
split is 5 MB, all splits are of equal length, and each split is divided into records. Each map task processes a single split; the map task passes the split to the getRecordReader() method on the InputFormat to obtain a RecordReader for that split. The RecordReader is an iterator over the records. The map task then uses the RecordReader to generate record key-value pairs, which it passes to the map function. Secondly, the map function counts the categories and attributes of the input data, including the values of the categories and attributes. The attributes and category of each input record are separated by commas, and the final field is the classification label. Finally, the reduce function aggregates the count of each attribute and category value, producing results in the form (category, Index1:count1, Index2:count2, Index3:count3, ..., Indexn:countn), and then outputs the training model. Its implementation is described as follows.

Algorithm Training: map(key, value)
Input: the training dataset
Output: <key, value> pair, where key is the category and value the frequency of an attribute value
1  FOR each sample DO BEGIN
2    Parse the category and the value of each attribute
3    Count the frequency of the attributes
4    FOR each attribute value DO BEGIN
5      Take the label as key, and "attribute index: frequency of the attribute value" as value
6      Output <key, value>
7    END
8  END

Algorithm Training: reduce(key, value)
Input: the key and value output by the map function
Output: <key, value> pair, where key is the label and value the aggregated frequency of attribute values
1  sum = 0
2  FOR each attribute value DO BEGIN
3    sum += value.next.get()
4  END
5  Take key as key, and sum as value
6  Output <key, value>

3.2.2. Prediction Stage
The prediction stage labels each data record using the output of the training model. The implementation of the algorithm is as follows: first, use the statistical counts of attribute values and category values to classify the unlabeled records.
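The training-stage pseudocode above can be sketched in Python, with an in-memory sort standing in for Hadoop's shuffle. The key layout used here, (label, "index:value") pairs plus a per-label "CLASS" count, is one plausible encoding of the counts the reducer must aggregate, not necessarily the authors' exact format:

```python
from itertools import groupby
from operator import itemgetter

def train_map(record):
    # record: comma-separated attribute values, with the final field the class label
    *attrs, label = record.strip().split(",")
    yield ((label, "CLASS"), 1)                  # one occurrence of the class itself
    for i, v in enumerate(attrs):
        yield ((label, "%d:%s" % (i, v)), 1)     # one occurrence of this (label, attribute) pair

def train_reduce(key, counts):
    # aggregate the frequency of each (label, attribute) pair
    yield (key, sum(counts))

def train(records):
    # map phase, then a sort standing in for the shuffle, then the reduce phase
    pairs = sorted((kv for r in records for kv in train_map(r)), key=itemgetter(0))
    model = {}
    for key, grp in groupby(pairs, key=itemgetter(0)):
        for k, total in train_reduce(key, (c for _, c in grp)):
            model[k] = total
    return model

model = train(["sunny,hot,no", "rain,mild,yes", "rain,hot,yes"])
print(model[("yes", "CLASS")])   # 2
print(model[("yes", "0:rain")])  # 2
```

In the actual job each yielded pair would be written to the map output and the summing would happen in distributed reduce tasks; the aggregated counts are exactly the statistics needed for the prior and conditional probabilities of Section 3.1.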
In addition, the distributed cache is used to improve the efficiency of the algorithm during its execution. Its implementation is described as follows.

Algorithm Testing: map(key, value)
Input: the test dataset and the Naïve Bayes model
Output: the labels of the samples
1  modeltype = new modeltype()
2  categories = modeltype.getCategories()
3  FOR each attribute value not NULL DO BEGIN
4    Obtain one category from categories
5  END FOR
6  FOR each attribute value DO BEGIN
7    FOR each category value DO BEGIN
8      pct = counter(attribute, category) / counter(category)
9      result = result * pct
10   END FOR
11 END FOR
12 Take the category of the max result as key, and the max result as value
13 Output <key, value>
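The prediction-stage mapper can likewise be sketched in Python. The model layout and the toy counts below are illustrative assumptions (in the actual job the counts would arrive via the distributed cache), not the paper's data structures:

```python
def predict_map(record, model, class_counts):
    # record: comma-separated attribute values of one unlabeled sample
    attrs = record.strip().split(",")
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n in class_counts.items():
        score = n / total                        # prior P(Y_i)
        for i, v in enumerate(attrs):
            # P(x_k | Y_i) estimated from the cached training counts
            score *= model.get((label, i, v), 0) / n
        if score > best_score:
            best_label, best_score = label, score
    # emit the winning category and its (unnormalized) score
    yield (best_label, best_score)

# toy counts standing in for a trained model
class_counts = {"yes": 2, "no": 1}
model = {("yes", 0, "rain"): 2, ("yes", 1, "mild"): 1, ("yes", 1, "hot"): 1,
         ("no", 0, "sunny"): 1, ("no", 1, "hot"): 1}
label, score = next(predict_map("rain,mild", model, class_counts))
print(label)
# yes
```

As in the pseudocode, the mapper multiplies the per-attribute ratios counter(attribute, category) / counter(category) across all attributes and keeps the category with the maximal result.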
4. Experimental Results
In this section, we perform preparatory experiments to test the efficiency and scalability of the parallel Naïve Bayes algorithm proposed in this paper. We built a small cluster of 3 commodity machines (1 master and 2 slaves) on Linux; each machine has two 3.10 GHz cores, 4 GB of memory, and a 500 GB disk. We use Hadoop version 0.20.2 and Java version 1.6.0_26, and UCI data sets to verify the results. The experimental data sets are shown in Table 1.

Table 1. The experimental data sets
No.  Data set    Number of samples  Dimension  Number of categories
1    Wine        178                13         3
2    Vertebral   310                6          2
3    Bank-data   600                11         2
4    Car         1728               6          4
5    Abalone     4177               8          28
6    Adult       32561              14         2
7    PokerHand   1000000            11         10

First, pre-processing of the above data sets must be done, with all attribute types normalized to nominal attributes. Then the Naïve Bayes classifier implemented with MapReduce is trained on the training data sets to generate the classification model, and the model is used to classify samples whose category labels have been removed. The experiment is run on the cluster composed of three machines, and the results, compared with the test results of the conventional method, are shown in Figure 2.

Figure 2. Execution time with different data sizes

The comparison experiment shows that the performance of the improved algorithm is higher than that of the conventional method on large data sets, verifying that the Bayesian algorithm running in a cloud environment is more efficient than the traditional Bayesian algorithm. However, because of differences in data size, attributes, and number of categories, the time the algorithm spends does not follow a linear relationship. When running Hadoop jobs, the cluster must be started first, which takes some time, so when the data set is small, the data processing time is relatively long. This also verifies that Hadoop is well suited to processing huge amounts of data.
5. Conclusions
As data classification has attracted a significant amount of research attention, many classification algorithms have been proposed in past decades. However, the growing volume of data in applications makes classifying very large data sets a challenging task. In this paper, we propose a fast parallel Naïve Bayes algorithm based on MapReduce, which has been widely embraced by both academia and industry. Preparatory experiments show that the parallel algorithm can not only process large datasets but also enhance the efficiency of the algorithm. In future work, we will implement other classification algorithms, conduct further experiments, and refine the parallel algorithms to improve the usage efficiency of computing resources.

Acknowledgements
This research was supported by the China National Key Technology R&D Program (2012BAH20B03), the National Nature Science Foundation (31101078), the Beijing Nature Science Foundation (4122016 and 4112013), the Beijing Educational Committee science and technology development plan project (KM201110028018), and "The Computer Application Technology" Beijing municipal key construction discipline.

References
[1] Weizhong Zhao, Huifang Ma, Qing He. Parallel K-Means Clustering Based on MapReduce. Lecture Notes in Computer Science. 2009; 5931: 674-679.
[2] A Pavlo, E Paulson. A Comparison of Approaches to Large-Scale Data Analysis. Proc. ACM SIGMOD. 2009: 165-178.
[3] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI. 2004: 137-150.
[4] Jaliya Ekanayake, Shrideep Pallickara. MapReduce for Data Intensive Scientific Analyses. IEEE eScience. 2008: 277-284.
[5] C Chu, S Kim, et al. Map-Reduce for Machine Learning on Multicore. In NIPS '07: Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems.
[6] Qing He, Fuzhen Zhuang, Jincheng Li, Zhongzhi Shi. Parallel Implementation of Classification Algorithms Based on MapReduce. Lecture Notes in Computer Science. 2010; 6401: 655-662.