Research India Publications. http://www.ripublication.com

Algorithms for Frequent Pattern Mining of Big Data

Syed Zubair Ahmad Shah 1, Mohammad Amjad 1, Ahmad Ali Habeeb 2, Mohd Huzaifa Faruqui 1 and Mudasir Shafi 3

1 Department of Computer Engineering, Jamia Millia Islamia, New Delhi, India.
2 Department of Information Technology & Cloud, Ericsson India Global Services, Gurgaon, Haryana, India.
3 Department of Electronics and Communication Engineering, Maharshi Dayanand University, Rohtak, Haryana, India.

1 Orcid ID: 0000-0003-1847-8216

Abstract

Frequent pattern mining is a field of data mining aimed at unsheathing frequent patterns in data in order to deduce knowledge that may help in decision making. Numerous algorithms for frequent pattern mining have been developed during the last two decades, most of which have been found to be non-scalable for Big Data. For Big Data, some scalable distributed algorithms have also been proposed. In this paper, we analyze such algorithms in detail. We start with age-old algorithms like Count Distribution, traverse other algorithms one by one, and finally reach algorithms based on the MapReduce programming model.

Keywords: Frequent Pattern Mining, Distributed Systems, Big Data Analytics

INTRODUCTION

Data mining is the task of scanning large datasets with the aim of generating new information or discovering knowledge. Broadly speaking, data mining includes fields like clustering, frequent pattern mining (FPM), classification and outlier detection [1, 2, 3]. FPM is a popular data mining approach [4] which is used to find attributes that occur together frequently. Some of its important applications are market basket analysis, share market analysis, drug discovery, etc. [5]. Several FPM algorithms have been proposed in the last few decades, such as Apriori [6], Eclat [7], FP-Growth [8], etc. [9, 10]. Identifying all frequent patterns is a complex task and is computationally costly due to the large number of candidate patterns, and so most algorithms become inefficient when it comes to the kind of data we deal with in this age: Big Data.
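To make the notion of a frequent pattern concrete, the following brute-force sketch counts itemset supports over a toy market-basket dataset (the baskets and the minimum support threshold are made up for illustration; this is not one of the surveyed algorithms, which avoid exactly this exhaustive enumeration):

```python
from itertools import combinations

# Hypothetical market-basket transactions
baskets = [{'bread', 'milk'}, {'bread', 'butter'},
           {'bread', 'milk', 'butter'}, {'milk'}]

MIN_SUP = 2  # an itemset is "frequent" if it occurs in >= 2 baskets

def support(itemset):
    """Number of baskets that contain every item of the itemset."""
    return sum(itemset <= basket for basket in baskets)

# Enumerate all 1- and 2-itemsets and keep the frequent ones
items = sorted(set().union(*baskets))
frequent = {frozenset(c): support(set(c))
            for k in (1, 2)
            for c in combinations(items, k)
            if support(set(c)) >= MIN_SUP}
```

Here {bread, milk} is frequent (support 2) while {milk, butter} is not (support 1). The candidate space grows exponentially with the number of items, which is why scalable algorithms such as those surveyed below prune it level by level or traverse it in a distributed fashion.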
Social networking sites like Facebook produce over 3TB of data every single day [11]. Amazon, Walmart and other giants register billions of transactions every year. Gene sequencing platforms can generate terabytes of data. To cope with the problem of huge volumes of data, researchers proposed several parallel algorithms for the FPM problem. This paper reviews some of the classical as well as recent parallel FPM algorithms. Various parallel FPM algorithms are discussed and analyzed in detail, starting with the classical algorithms and then moving on to the state-of-the-art ones.

ALGORITHMS FOR PERFORMING FREQUENT PATTERN MINING ON A DISTRIBUTED ENVIRONMENT

A number of algorithms have been proposed for efficient frequent pattern mining on a distributed system. These include the Count Distribution, Data Distribution and Candidate Distribution algorithms by Agrawal and Shafer [12], Parallel Eclat by Zaki et al. [13], the Parallel FP-Growth algorithm by Li et al. [14], the Single Pass Counting, Fixed Passes Combined-counting and Dynamic Passes Combined-counting algorithms by Lin et al. [15], and the Distributed Eclat algorithm by Moens et al. [16]. We discuss a few of these herein.

Count Distribution Algorithm

Notations:
k-itemset: Itemset with k items
Freq k: Set of frequent k-itemsets
Cand k: Candidate itemset of size k
Proc i: Processor i
DS i: Dataset local to Proc i
DSR i: Dataset local to Proc i after re-partition
Cand k i: Candidate set of Proc i during pass k

The Count Distribution (CD) algorithm tries to minimize interaction. It partitions the dataset horizontally into equal parts for the various processors. Each processor autonomously runs the Apriori algorithm and creates the hash-tree on the locally stored transactions. After each iteration, the global count of candidates is calculated by adding all counts
over all processors. This algorithm has the limitation that every processor has to keep the whole hash-tree in its memory; the Data Distribution algorithm has to be used if the whole hash-tree cannot be stored in memory. The CD algorithm permits superfluous calculations in parallel in order to prevent interaction. In the first pass, every processor produces its local candidate itemset Cand 1. For the remaining passes, the algorithm is given below:
1. Every processor Proc i produces the candidate set Cand k by making use of the frequent itemset Freq k-1 that was produced in the previous iteration.
2. Proc i performs a pass over its DS i to find local support counts for the candidates in Cand k.
3. Proc i synchronizes with all other processors and broadcasts its local Cand k counts to develop the global Cand k counts.
4. Every processor Proc i calculates Freq k from Cand k using the global support counts.
5. If Freq k is a null set then the algorithm terminates, or else it continues. Every processor Proc i makes this decision independently.

Figure 1 shows the speedup of the CD algorithm (on the datasets D1456K.T15.14 and D114K.T2.16) as the number of processors is increased.

Figure 1: Speedup of the Count Distribution algorithm with increase in the number of processors (response time in seconds vs. number of processors)

Data Distribution Algorithm

This algorithm uses the memory of the system more efficiently. The itemsets are partitioned and shared among the processors. Each processor must find the counts of its local subset of candidate itemsets and broadcast its local data to the remaining processors. The partitioning and distribution of itemsets may affect the performance of the algorithm, so a balanced load is desirable.
Pass 1: This step is the same as the step for k=1 in the CD algorithm.
Pass k>1:
1. Processor Proc i produces the candidate itemset Cand k from Freq k-1 but keeps only a subset Cand k i containing 1/N of the itemsets of Cand k. The union of the subsets generated by all processors is Cand k and their intersection is empty. 2.
Proc i uses both local data and data received from the remaining processors to develop support counts for the itemsets in its Cand k i. 3. After getting the complete support-count data, processor Proc i determines which itemsets of its local candidate set Cand k i are frequent. Again, the intersection of these local frequent sets is empty and their union is Freq k. 4. All processors synchronize and exchange their local frequent sets with the other processors. Every processor then has the complete Freq k and can decide independently whether to continue the algorithm or to terminate it.
Processors find the support counts for local itemsets in step 2 and broadcast or receive local data. Network congestion may become a problem in this setup, so steps must be taken to prevent it.

Parallel Eclat

In Parallel Eclat (Par-Eclat), the data is in vertical layout, which is obtained by transforming the horizontal data into vertical itemset (tid-)lists. The two phases of Parallel Eclat are given below:
Initialization Phase: There are 3 sub-steps in the initialization phase:
1. The support counts for 2-itemsets are read after the preprocessing step has been done, and all the frequent itemsets having count >= minimum support are introduced into Freq 2 (Freq k is the set of frequent k-itemsets).
2. One of two clustering schemes - equivalence class clustering or maximal hypergraph clique clustering - is applied to Freq 2 to produce the set of possible maximal frequent itemsets, which are then partitioned among all the processors to achieve load balancing.
3. The database is repartitioned so that every processor obtains the tid-lists of all the 1-itemsets in the cluster appointed to it.
Asynchronous Phase: After the initialization phase, the tid-lists are accessible in local storage on every processor, so every processor can use
its assigned maximal cluster to generate frequent itemsets without synchronizing with any other processor. A cluster is processed completely before the processor moves on to the subsequent cluster. The local database is examined only once to complete this step; therefore, there is less I/O overhead. Since a sublattice is induced by each cluster, either bottom-up traversal is used to produce all frequent itemsets or hybrid traversal is used to produce the maximal frequent itemsets, depending on the algorithm. Initially only the tid-lists for 1-itemsets are available locally, using which the tid-lists for the 2-itemset clusters can be generated. These clusters are generally not big, so keeping the emerging tid-lists in memory creates no problem. In the bottom-up approach, 3-itemsets are produced by intersection of the tid-lists of 2-itemsets. If the cardinality of the emerging tid-list >= minimum support, the new itemset is added to Freq 3. Then the resulting frequent 3-itemsets, Freq 3, are divided into sets of equivalence classes on the basis of common prefixes of size 2. Intersection of all pairs of 3-itemsets within an equivalence class is used to determine Freq 4. The process is repeated till all frequent itemsets are found. After finding Freq k, Freq k-1 is deleted. Memory space is thus needed only for the itemsets in Freq k-1 within one maximal cluster, which makes the algorithm main-memory efficient. For the top-down stage of the hybrid traversal, memory needs to keep track of only the maximal element seen so far, along with the itemsets not yet seen.

Par-Eclat Algorithm
1. Create Freq 2 using 2-itemset support counts
2. Produce clusters from Freq 2 using Equivalence Classes
3. Assign clusters to processors
4. Scan local dataset partition
5. Share appropriate tid-lists with remaining processors
6. Procure tid-lists from remaining processors
7. foreach assigned cluster C: find frequent itemsets using Bottom-Up traversal or Hybrid traversal

Figure 2 compares the execution time of the Par-Eclat algorithm with the CD algorithm [12] (for the T10.I4.D2084K dataset).
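The tid-list intersections at the heart of the bottom-up traversal can be sketched as follows. This is a minimal single-machine illustration with a made-up dataset, assuming tid-lists are represented as Python sets; Par-Eclat would additionally partition the equivalence classes among processors:

```python
def eclat(prefix, items, min_sup, out):
    """Bottom-up Eclat: recursively intersect tid-lists within an
    equivalence class (all itemsets sharing the current prefix)."""
    for i, (item, tids) in enumerate(items):
        support = len(tids)
        if support >= min_sup:
            out[prefix + (item,)] = support
            # Build the next equivalence class by intersecting this
            # tid-list with those of the remaining items
            suffix = [(other, tids & other_tids)
                      for other, other_tids in items[i + 1:]]
            eclat(prefix + (item,), suffix, min_sup, out)

# Toy horizontal transactions (hypothetical data)
transactions = [{'a', 'b', 'c'}, {'a', 'c'}, {'a', 'd'}, {'b', 'c'}]

# Transform to vertical layout: item -> tid-list (set of transaction ids)
vertical = {}
for tid, transaction in enumerate(transactions):
    for item in transaction:
        vertical.setdefault(item, set()).add(tid)

freq = {}
eclat((), sorted(vertical.items()), 2, freq)
# freq now maps each frequent itemset (as a tuple) to its support
```

The support of an itemset is simply the cardinality of the intersected tid-list, which is why a cluster can be mined from its 1-itemset tid-lists alone, with no further database scans.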
Par-Eclat clearly outperforms the CD algorithm.

Figure 2: Comparison of execution time of Par-Eclat and Count Distribution on the T10.I4.D2084K dataset (X = no. of hosts, Y = no. of processors, Z = H*P)

Single Pass Counting

Single Pass Counting (SPC) is the parallel implementation of the Apriori algorithm using MapReduce. It makes Apriori efficient for mining association rules from massive datasets. The main step in the Apriori algorithm is finding the frequency of candidate itemsets, and this counting process can be parallelized. In each step, the mapper emits <x, 1> for every candidate x that is present in the transaction. The reducer collects the key-value pairs emitted by the mappers and sums the values for each key, which gives the total frequency of the candidate in the database. The reducer outputs all the candidates which have enough support count. SPC, at the k-th pass of database scanning in a MapReduce phase, finds the frequent k-itemsets.

Map(key, value = itemset in transaction t i):
  Input: Dataset D
  foreach transaction tr i in D do
    foreach item j in tr i do
      emit <j, 1>

Reduce(key = item, value = count):
  foreach key k do
    foreach value val in k do
      k.count += val
    if k.count >= min. supp. count
      emit <k, k.count>

Phase-1 of SPC algorithm

In phase-2, the algorithm generates the frequent 2-itemsets using the frequent itemsets Freq 1, which are located in the Distributed Cache in Hadoop.

Map(key, value = itemset in transaction t i):
  Input: Dataset D and Freq k-1 (k >= 2)
  Fetch Freq k-1 from DistributedCache
  Build a hash tree for Cand k = gen-apriori(Freq k-1)
  foreach transaction tr i in D do
    Cand t = subset(Cand k, tr i)
    foreach item i in Cand t do
      emit <i, 1>

Reduce(key = item, value = count):
  foreach key k do
    foreach value val in k do
      k.count += val
    if k.count >= minimum_support_count
      emit <k, k.count>

Phase-2 of SPC algorithm

Algorithm SPC:
1. Run Phase-1 to find Freq 1
2. Run Phase-2 to find Freq 2
3. forall k > 2 and Freq k-1 ≠ φ: run the Map and Reduce functions of phase-2

Distributed Eclat

The Distributed Eclat (Dist-Eclat) algorithm is based on the Eclat algorithm and mines frequent itemsets from large datasets in parallel using MapReduce. This algorithm focuses on speed. Dist-Eclat is quite fast, but it requires huge main memory as it stores the complete dataset in memory in vertical format.

Figure 3: Eclat on the MapReduce framework

Dist-Eclat is a 3-step method (figure 3). Each step can be distributed between mappers to maximise efficiency.
1. Finding the Frequent Items: In this step, the vertical database is partitioned into equal-sized sub-databases called shards and distributed among several mappers. Mappers extract the frequent singletons from their shards and send them to the reducer. All frequent items are gathered in the reduce phase.
2. k-Frequent Itemsets Generation: The set of frequent k-itemsets, Freq k, is generated. First, each mapper is assigned the combined form of the local frequent item sets. Mappers find the frequent supersets of size k of the items by running the Eclat algorithm up to level k. Then a reducer assigns the frequent itemsets to a new set of mappers. Distribution to mappers is performed by employing a Round-Robin algorithm.
3. Subtree Mining: This step uses the Eclat algorithm to mine the prefix tree from the assigned subsets.
Subtrees can be mined independently by the mappers since sub-trees do not need information from other sub-trees.
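The map/reduce singleton-counting pattern used by SPC's phase-1 and by Dist-Eclat's first step can be simulated in plain Python as follows; the shards, function names and threshold are hypothetical stand-ins for what the MapReduce framework would provide and shuffle automatically:

```python
from collections import defaultdict

# Hypothetical transaction shards, as split among mappers
shards = [
    [{'a', 'b'}, {'a', 'c'}],       # shard handled by mapper 1
    [{'b', 'c'}, {'a', 'b', 'c'}],  # shard handled by mapper 2
]
MIN_SUP = 2

def mapper(shard):
    """Map step: emit <item, 1> for every item occurrence in the shard."""
    for transaction in shard:
        for item in transaction:
            yield item, 1

def reducer(pairs):
    """Reduce step: sum the counts per key and keep frequent singletons."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return {item: c for item, c in counts.items() if c >= MIN_SUP}

# The framework would shuffle mapper output to reducers; simulated here
emitted = [pair for shard in shards for pair in mapper(shard)]
freq1 = reducer(emitted)
```

Because each <item, 1> pair is summed independently of where it was emitted, the mappers never need to coordinate: one global reduce per pass suffices, which is exactly what makes this counting step embarrassingly parallel.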
Figure 4: Comparison of computation time (in seconds) of Dist-Eclat and PFP

Figure 4 compares the runtime of Dist-Eclat with Parallel FP-Growth (PFP) [14] over the del.icio.us dataset [17]. The minimum support threshold is denoted by the x-axis and the y-axis denotes the computation time in seconds. Dist-Eclat is the faster of the two algorithms.

CONCLUSION

In this paper, we traversed many distributed frequent pattern mining algorithms in order to examine them and to present a picture of how these algorithms grew over time. We also presented a comparison of their runtimes on different datasets. We presented both types of algorithms: those based on message passing and those based on MapReduce.

REFERENCES

[1] Han, J., Pei, J., and Kamber, M., 2011, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, California, USA.
[2] Shah, S. Z. A., and Amjad, M., 2016, Classical and Relearning based Clustering Algorithms for Huge Data Warehouses, Proc. 2nd International Conference on Computational Intelligence & Communication Technology, Ghaziabad, pp. 209-213.
[3] Aggarwal, C. C., 2015, Data Mining, Springer International Publishing, Switzerland.
[4] Aggarwal, C. C., and Han, J., 2014, Frequent Pattern Mining, Springer International Publishing, Switzerland.
[5] Berry, M. J. A., and Linoff, G. S., 2004, Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, John Wiley & Sons Inc., Indiana, USA.
[6] Agrawal, R., and Srikant, R., 1994, Fast Algorithms for Mining Association Rules, Proc. 20th International Conference on Very Large Data Bases, Santiago, Chile, pp. 487-499.
[7] Han, J., Pei, J., and Yin, Y., 2000, Mining Frequent Patterns Without Candidate Generation, Proc. ACM SIGMOD International Conference on Management of Data, Dallas, Texas, pp. 1-12.
[8] Zaki, M. J., 2000, Scalable Algorithms for Association Mining, IEEE Transactions on Knowledge and Data Engineering, 12(3), pp. 372-390.
[9] Sucahyo, Y. G., and Gopalan, R.
P., 2004, CT-PRO: A Bottom-Up Non-Recursive Frequent Itemset Mining Algorithm using Compressed FP-Tree Data Structures, Proc. ICDM Workshop on Frequent Itemset Mining Implementations, Brighton.
[10] Uno, T., Kiyomi, M., and Arimura, H., 2004, Efficient Mining Algorithms for Frequent/Closed/Maximal Itemsets, Proc. ICDM Workshop on Frequent Itemset Mining Implementations, Brighton.
[11] Lin, J., and Ryaboy, D., 2013, Scaling Big Data Mining Infrastructure: The Twitter Experience, ACM SIGKDD Explorations Newsletter, 14(2), pp. 6-19.
[12] Agrawal, R., and Shafer, J., 1996, Parallel Mining of Association Rules, IEEE Transactions on Knowledge and Data Engineering, 8(6), pp. 962-969.
[13] Zaki, M. J., Parthasarathy, S., Ogihara, M., and Li, W., 1997, Parallel Algorithms for Discovery of Association Rules, Data Mining and Knowledge Discovery, 1(4), pp. 343-373.
[14] Li, H., Wang, Y., Zhang, D., Zhang, M., and Chang, E. Y., 2008, PFP: Parallel FP-Growth for Query Recommendation, Proc. ACM Conference on Recommender Systems, Lausanne, pp. 107-114.
[15] Lin, M. Y., Lee, P. Y., and Hsueh, S. C., 2012, Apriori-based Frequent Itemset Mining Algorithms on MapReduce, Proc. 6th International Conference on Ubiquitous Information Management and Communication, Kuala Lumpur, Article no. 76.
[16] Moens, S., Aksehirli, E., and Goethals, B., 2013, Frequent Itemset Mining for Big Data, Proc. International Conference on Big Data, Santa Clara, California, pp. 111-118.
[17] Wetzker, R., Zimmermann, C., and Bauckhage, C., 2008, Analyzing Social Bookmarking Systems: A del.icio.us Cookbook, Proc. Mining Social Data Workshop, Patras, pp. 26-30.