Association Rule Mining with Parallel Frequent Pattern Growth Algorithm on Hadoop

Zhigang Wang 1,2, Guiqiong Luo 3,*, Yong Hu 1,2, ZhenZhen Wang 1
1 School of Software Engineering, Jinling Institute of Technology, Nanjing, Jiangsu, China
2 Jiangsu Key Laboratory of Creativity and Application of Digital Media Art, Jinling Institute of Technology, Nanjing, Jiangsu, China
3 Department of Information Engineering, Hunan Radio & TV University, Changsha 410007, China
* Corresponding author: Guiqiong Luo

Abstract — Although the association rule mining algorithm FP-Growth is more efficient than Apriori, it has two disadvantages: the FP-tree may be too large to be created in memory, and its processing is serial. This paper proposes a novel improved version, Parallel Frequent Pattern Growth (P-FP-Growth), which does not need to create a global FP-tree and parallelizes every step of association rule mining. We deploy it on a LAN computer cluster and on Hadoop, obtain the frequent pattern mining results, and compare the execution times. Our experiments show that P-FP-Growth: i) is much faster than FP-Growth when run with multiple computing nodes; ii) can be deployed on a cloud computing platform such as Hadoop; iii) runs faster on Hadoop than on a LAN computer cluster as the number of computing nodes increases.

Keywords — Data Mining, Association Rule, P-FP-Growth Algorithm, Cloud Computing, MapReduce Programming, Hadoop Platform

I. INTRODUCTION

Cloud computing is an Internet application mode that realizes on-demand access to a shared resource pool anytime and anywhere [1]. The resources in the shared pool can be computing facilities, storage devices, application programs, and so on. Typically, a computer cluster is adopted to form the data and computing center, which is provided to users in the form of services [2]. MapReduce [3] is a cloud computing model proposed by Google. Hadoop [4] is an Apache open source project, a distributed software architecture that can process large amounts of data.
Hadoop is an open source implementation of Google's cloud computing system, composed of HDFS, MapReduce and HBase. Hadoop implements the infrastructure of a cloud computing software platform, including the distributed file system and the MapReduce framework, and integrates a series of platforms such as databases, cloud computing management, data warehouses and so on. Data mining is a type of information analysis task with high requirements on storage capacity and computing capability. Data mining of massive application data can be implemented with parallel algorithms, relying on cloud computing, which offers high performance at high cost-effectiveness. Association rules are an important topic of data mining, first proposed by Agrawal in 1993 [5]. The Apriori algorithm is a famous association rule mining algorithm proposed by R. Agrawal and R. Srikant in 1994 [6]. Apriori must spend a lot of time dealing with huge candidate item sets and repeatedly scanning the database. Aiming at these shortcomings of Apriori, Han put forward the association rule mining algorithm FP-Growth [7]. FP-Growth is approximately one order of magnitude faster than Apriori, but it has two disadvantages: first, its frequent pattern tree may be too big to be created in memory; second, its processing approach is serial. The process of the classic FP-Growth algorithm is shown in Figure 1, and the whole process is serial. In addition, if the data set is very large, the FP-tree may be too big for memory to accommodate.

Figure 1. The procedure of the classic FP-Growth algorithm

DOI 10.5013/IJSSST.a.17.32.44 44.1 ISSN: 1473-804x online, 1473-8031 print
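To make the memory bottleneck concrete, the following minimal Python sketch (our illustration, not the paper's code; all names are hypothetical) builds a classic global FP-tree with the usual two database scans — count items globally, then insert each filtered, count-sorted transaction — which is why the entire tree must fit in one machine's memory:

```python
from collections import defaultdict

class FPNode:
    """One node of an FP-tree: an item, its count, and child links."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support_count):
    # First scan: count every item across all transactions.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    # Keep frequent items only; rank them by descending global count.
    frequent = {i: c for i, c in counts.items() if c >= min_support_count}
    rank = {i: r for r, i in
            enumerate(sorted(frequent, key=frequent.get, reverse=True))}
    # Second scan: filter each transaction to frequent items, sort by
    # descending count, and insert the path into the shared tree.
    root = FPNode()
    for t in transactions:
        path = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in path:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root, frequent

# Two transactions sharing the frequent prefix "a" share one tree branch.
tree, freq = build_fp_tree([["a", "b"], ["a", "c"]], 1)
```

Shared prefixes compress the data, but with massive, diverse transactions the node count still grows until the single global tree no longer fits in memory — the problem P-FP-Growth avoids.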
To realize association rule mining on Hadoop, we need to improve the traditional association rule mining algorithms. Literature [8] presents a parallel algorithm for mining frequent item sets based on Apriori. Literature [9] improves the CD (Count Distribution) algorithm, which is based on Apriori, and realizes parallel association rule mining using MapReduce on Hadoop. Literature [10] and [11] also realize parallel association rule mining using MapReduce on Hadoop with Apriori-class algorithms. Literature [12] proposes data partitioning (data division); literature [13-14] uses multiple threads with shared memory (multithreaded memory sharing). Literature [15] divides the global FP-tree into sub-trees for parallel processing; literature [16] regards the mining of each conditional pattern base as a sub-task and assigns these sub-tasks to computing nodes in a computer cluster. Literature [17] uses multiple local FP-trees instead of a global FP-tree, and puts forward a distributed parallel association rule mining algorithm named P-FP-Growth on the basis of the classic FP-Growth algorithm. FP-Growth-class algorithms are approximately one order of magnitude faster than Apriori-class algorithms. Among the FP-Growth-class parallel algorithms, some only parallelize transaction item counting, filtering or sorting, while others only parallelize conditional pattern base mining. Because it uses multiple local FP-trees instead of a global FP-tree, the P-FP-Growth algorithm not only avoids the global FP-tree being too big to be stored in memory, but also parallelizes FP-tree construction. P-FP-Growth breaks through the capacity bottleneck of the FP-Growth algorithm and parallelizes every step. If it is deployed on a cloud computing platform such as Hadoop, the association rule mining task for massive data can be solved, and the implementation can be provided as a service to users, so that they can perform association rule mining without needing a lot of computers or any special data mining software. II.
P-FP-GROWTH ALGORITHM

The parallel association rule mining algorithm P-FP-Growth uses multiple local FP-trees instead of a global FP-tree, so as to avoid the global FP-tree being too big to be stored in memory. In this algorithm the whole data mining process is parallelized, including transaction item counting, filtering, sorting, local FP-tree building, and conditional pattern base mining. The process of the P-FP-Growth algorithm is shown in Figure 2.

Figure 2. The process of the P-FP-Growth algorithm

Let I = {i1, i2, ..., in} be the collection of all items. A transaction T is an item set whose items are from I, T ⊆ I; every transaction has a unique ID. D is a database made up of more than one transaction, D = d1 ∪ d2 ∪ ... ∪ dm, where the di (i = 1, 2, ..., m) are distributed in different storage nodes Mi. The Cj (j = 1, 2, ..., p) are a group of computing nodes with powerful calculation ability. Physically, a storage node and a computing node can completely coincide, partly overlap, or be completely different. We need to make full use of the computing nodes Cj (j = 1, 2, ..., p) to find the frequent patterns as quickly as possible from the global transaction database D in order to obtain association rules.

A. The steps of the P-FP-Growth algorithm

(1) In each storage node Mi: get the count of every item in di.

(2) In the center computing node: gather and sum the counts of every item over all di, so as to get the count of every item in the global database D. Order the items by descending count. Delete items whose count is less than the given minimum support count. The result is the global 1-item frequent set L.

(3) In the center computing node: in order to distribute the frequent pattern mining task to many computing nodes by item, realizing parallel processing in the following steps, the center computing node assigns a computing node to every item in L and adds the assignment results to L. Finally we obtain L as in Table 1.

TABLE 1.
GLOBAL 1-ITEM FREQUENT SET AND COMPUTING NODE ASSIGNMENT RESULT

item  count  computing node
i2    10     C3
i5    8      C4
i9    5      C5
The center computing node distributes L to every storage node. There are many strategies for assigning a computing node to each item. The simplest way is to assign computing nodes to items by their count order. For example: the items are named n1, ..., nq according to their count in descending order; the number of computing nodes is p, and they in turn are called C1, ..., Cp; assign computing node Ci+1 to item nj when j mod p = i.

(4) Each storage node Mi processes every transaction in di according to the global 1-item frequent set L. The main points of the processing are the following: 1) filter out the items not in L; 2) order the remaining items by their count in L in descending order; 3) add the transaction to the local FP-tree. After this step, we get several local FP-trees and local header tables that are all bound by the global 1-item frequent set L.

(5) Each storage node Mi, according to the additional computing node assignment information in L, sends the items in its local header table together with their conditional pattern bases from its local FP-tree to the corresponding computing nodes.

(6) Each computing node Cj aggregates the conditional pattern bases grouped by item and performs frequent pattern mining. If the conditional pattern base of an item is very large, this algorithm can be invoked recursively to decompose the frequent pattern mining task of that item for further parallel processing.

(7) Each computing node Cj sends its frequent patterns to the center computing node, which summarizes them and obtains the global association rule set.
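The round-robin strategy above can be sketched in a few lines of Python (our illustration; the function name and the zero-based index j are assumptions, not from the paper):

```python
def assign_nodes(sorted_items, p):
    """Round-robin assignment: items n1..nq, already ordered by descending
    global count, go to computing node C(i+1) when j mod p = i (j zero-based).
    Returns a dict mapping item -> computing node number (1-based)."""
    return {item: (j % p) + 1 for j, item in enumerate(sorted_items)}

# Five items sorted by descending count, distributed over p = 3 nodes.
assignment = assign_nodes(["n1", "n2", "n3", "n4", "n5"], 3)
```

Because the items are pre-sorted by count, consecutive (and thus similarly heavy) items land on different nodes, which roughly balances the mining load across C1, ..., Cp.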
B. Deploying the P-FP-Growth algorithm in the MapReduce compute mode

The P-FP-Growth algorithm can be deployed in the MapReduce compute mode as follows:

// item counting
Map (i, di) { Count(di, i) }
Reduce (i, Count)
// item descending sort, deleting items whose count is less than the minimum support count
Map (i, di) { Sort_del(di, t) }
Reduce ( ) // copy the intermediate data to the output directly
// build a local FP-tree for every di
Map (i, di) { Build_T(di, FP-tree) }
Reduce ( ) // copy the intermediate data to the output directly
// distribute and gather conditional pattern bases by item
Map (FP-tree, i) { Send(i, Conditional_pattern_base) }
Reduce (i, Conditional_pattern_base)
// frequent pattern mining by item, then summarize the mining results
Map (i, Conditional_pattern_base) { Mining(i, Conditional_pattern_base) }
Reduce (Result, Frequent_Pattern)

III. EXPERIMENT

We used T10I4D100K.dat (http://fimi.ua.ac.be/data/T10I4D100K.dat) as the experimental data, which has 100,000 transactions. The minimum serial number of the items is 0 and the maximum is 999. The number of items that actually appear in the transactions is 871, which means 129 items do not appear at all. The total number of item occurrences over all transactions is 1,010,221, and the average transaction length is 10.1. Multiple computers in a laboratory were selected as the computer cluster. The minimum support was set to 1%, which means the minimum support count is 1000.

A. The frequent pattern mining result of data set T10I4D100K

The transaction data is shown in Figure 3. Every line is a transaction. T_id is the ID of a transaction. T_item is the original data from T10I4D100K.dat, line by line. T_item_pro is the intermediate result after sorting and deleting items whose count is less than the minimum support count.

Figure 3. The intermediate results after item sorting and deleting

Whether using FP-Growth or P-FP-Growth, we get the same frequent pattern mining results, as shown in Table 2.
TABLE 2. THE FREQUENT PATTERN MINING RESULT OF DATA SET T10I4D100K

Frequent pattern  Support count  Support degree
217 346           1336           1.3360%
368 829           1194           1.1940%
829 789           1194           1.1940%
368 682           1193           1.1930%
39 825            1187           1.1870%
39 704            1107           1.1070%
825 704           1102           1.1020%
390 227           1049           1.0490%
722 390           1042           1.0420%
39 825 704        1035           1.0350%

B. The comparison of FP-Growth and P-FP-Growth

The classic FP-Growth is a serial algorithm and needs a global FP-tree; we ran it on a single computer. P-FP-Growth is a parallel algorithm that parallelizes every step; we ran it on a local area network (LAN) computer cluster with ten computers. The execution time comparison of FP-Growth and P-FP-Growth is shown in Table 3.

TABLE 3. EXECUTION TIME COMPARISON OF FP-GROWTH AND P-FP-GROWTH (UNIT OF TIME: SECOND)

Step                                 FP-Growth  P-FP-Growth (10 computing nodes)
Item counting over all transactions  5          0.6
Item filtering and sorting           112        12
Build FP-tree                        289        30
Frequent pattern mining              729        78
Auxiliary work                       0          3
Total time                           1135       123.6

C. The comparison of P-FP-Growth deployed on a LAN computer cluster and on Hadoop

We installed Hadoop on Windows systems and configured it in fully distributed mode, programmed the P-FP-Growth algorithm in accordance with the MapReduce framework, and executed it on the Hadoop platform for frequent pattern mining. Comparing the execution time of P-FP-Growth on the LAN multi-node cluster with that of cloud computing on the Hadoop platform, the results are shown in Figure 4, where the y-coordinate is the algorithm execution time and the x-coordinate is the number of machine nodes participating in the calculation.

Figure 4. The comparison of execution time.

IV.
CONCLUSION

P-FP-Growth uses multiple local FP-trees instead of a global FP-tree, so as to avoid the global FP-tree being too big to be stored in memory, and parallelizes every step, including transaction item counting, filtering, sorting, local FP-tree building, and conditional pattern base mining. As Table 3 shows, P-FP-Growth is much faster than FP-Growth when run with multiple computing nodes. The P-FP-Growth algorithm breaks through the capacity bottleneck of the FP-Growth algorithm. If it is deployed on a cloud computing platform such as Hadoop, the association rule mining task for massive data can be solved, and the implementation can be provided as a service to users, so that they can perform association rule mining without needing a lot of computers or any special data mining software. The experiments show that the P-FP-Growth algorithm can be deployed on the Hadoop cloud computing platform, and Figure 4 tells us that deploying P-FP-Growth on Hadoop is faster than LAN multi-node parallel processing as the number of computing nodes increases.

ACKNOWLEDGMENTS

This work was supported by the Top-notch Academic Programs Project of Jiangsu Higher Education Institutions, and the Jiangsu Province 2016 college students' innovative entrepreneurial training program (201613573047x).

REFERENCES

[1] Mell P, Grance T. The NIST definition of cloud computing. National Institute of Standards and Technology, U.S. Department of Commerce, 2010.
[2] Frank E. Gillett. Future View: The New Technology Ecosystems of Cloud, Cloud Services and Cloud Computing. Forrester Report, August 2008, 6(1): 59-62.
[3] Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Proceedings of the 6th Symposium on Operating System Design and Implementation, San Francisco, California, USA, 2008: 103-115.
[4] Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases. Proceedings of ACM SIGMOD International Conference on Management of Data, 1993: 207-216.
[5] Agrawal R, Srikant R. Fast algorithms for mining association rules. Proceedings of the 1994 International Conference on Very Large Data Bases, 1994: 487-499.
[6] Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. Proceedings of ACM SIGMOD International Conference on Management of Data, 2000: 1-12.
[7] Ye Yan-bin, Chang C-C. A parallel Apriori algorithm for frequent item sets mining. Proceedings of the Fourth International Conference on Software Engineering Research, Management and Applications, 2006: 87-94.
[8] Yu Chul, Xiao Ying-yuan, Yin Bo. A parallel algorithm for mining frequent item sets on Hadoop. Journal of Tianjin University of Technology, 2011, 27(1): 25-28.
[9] Ma Jie. Research on association rule data mining algorithm under cloud computing environment. Journal of Chongqing Technology and Business University, 2012, 29(11): 36-39.
[10] Rong Xiang, Li Ling-uan. A method for frequent set mining based on MapReduce. Journal of Xi'an University of Posts and Telecommunications, 2011, 16(4): 37-39.
[11] Pramudiono I, Kitsuregawa M. Parallel FP-Growth on PC cluster. Proceedings of International Conference on Internet Computing, 2003: 467-473.
[12] Zaïane O R, Mohammad E H, Lu P. Fast parallel association rule mining without candidacy generation. Proceedings of 1st IEEE International Conference on Data Mining, 2001: 665-668.
[13] Liu L, Li E, Zhang Y. Optimization of frequent item-set mining on multiple-core processors. Proceedings of 33rd International Conference on Very Large Data Bases, 2007: 1275-1285.
[14] Tan K-L, Sun Z-H. An algorithm for mining FP-trees in parallel. Computer Engineering and Applications, 2006(13): 155-157.
[15] Chen M, Li H-F. FP-Growth parallel algorithm in cluster system.
Computer Engineering, 2009(20): 71-75.
[16] Wang Zhi-gang, Wang Chi-she. A parallel association-rule mining algorithm. Proceedings of 2012 Web Information Systems and Mining, 2012: 125-129.