Improved Balanced Parallel FP-Growth with MapReduce Qing YANG 1,a, Fei-Yang DU 2,b, Xi ZHU 1,c, Cheng-Gong JIANG *

Similar documents
An Improved Frequent Pattern-growth Algorithm Based on Decomposition of the Transaction Database

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments

Efficient Algorithm for Frequent Itemset Generation in Big Data

AN EFFECTIVE DETECTION OF SATELLITE IMAGES VIA K-MEANS CLUSTERING ON HADOOP SYSTEM. Mengzhao Yang, Haibin Mei and Dongmei Huang

This paper proposes: Mining Frequent Patterns without Candidate Generation

Data Mining Part 3. Associations Rules

Data Partitioning Method for Mining Frequent Itemset Using MapReduce

An Algorithm for Mining Frequent Itemsets from Library Big Data

Appropriate Item Partition for Improving the Mining Performance

Parallelization of K-Means Clustering Algorithm for Data Mining

A Fast and High Throughput SQL Query System for Big Data

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

Mitigating Data Skew Using Map Reduce Application

Tutorial on Association Rule Mining

, and Zili Zhang 1. School of Computer and Information Science, Southwest University, Chongqing, China 2

International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2015)

Survey on Incremental MapReduce for Data Mining

FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN. School of Computing, SASTRA University, Thanjavur , India

Distributed Bottom up Approach for Data Anonymization using MapReduce framework on Cloud

Research on Load Balancing in Task Allocation Process in Heterogeneous Hadoop Cluster

Association Rule Mining

CLUSTERING BIG DATA USING NORMALIZATION BASED k-means ALGORITHM

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM

A New Approach to Web Data Mining Based on Cloud Computing

Mining Frequent Patterns without Candidate Generation

A REVIEW ON DATA PARTITION BASED FREQUENT ITEMSET MINING

Data Mining for Knowledge Management. Association Rules

Research on Parallelized Stream Data Micro Clustering Algorithm Ke Ma 1, Lingjuan Li 1, Yimu Ji 1, Shengmei Luo 1, Tao Wen 2

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

Association Rule Mining

Efficient Entity Matching over Multiple Data Sources with MapReduce

Searching frequent itemsets by clustering data: towards a parallel approach using MapReduce

Research of Improved FP-Growth (IFP) Algorithm in Association Rules Mining

A priority based dynamic bandwidth scheduling in SDN networks 1

Dynamic Data Placement Strategy in MapReduce-styled Data Processing Platform Hua-Ci WANG 1,a,*, Cai CHEN 2,b,*, Yi LIANG 3,c

Chapter 4: Mining Frequent Patterns, Associations and Correlations

Research and Application of E-Commerce Recommendation System Based on Association Rules Algorithm

Improving the Efficiency of Web Usage Mining Using K-Apriori and FP-Growth Algorithm

Implementation of Parallel CASINO Algorithm Based on MapReduce. Li Zhang a, Yijie Shi b

Improvements and Implementation of Hierarchical Clustering based on Hadoop Jun Zhang1, a, Chunxiao Fan1, Yuexin Wu2,b, Ao Xiao1

DATA MINING II - 1DL460

The Establishment of Large Data Mining Platform Based on Cloud Computing. Wei CAI

IJESRT. Scientific Journal Impact Factor: (ISRA), Impact Factor: [35] [Rana, 3(12): December, 2014] ISSN:

1. Introduction to MapReduce

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management

Maintenance of the Prelarge Trees for Record Deletion

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

Pupil Localization Algorithm based on Hough Transform and Harris Corner Detection

PTclose: A novel algorithm for generation of closed frequent itemsets from dense and sparse datasets

An Optimization Algorithm of Selecting Initial Clustering Center in K means

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

Comparative Analysis of Range Aggregate Queries In Big Data Environment

Large-Scale Dataset Incremental Association Rules Mining Model and Optimization Algorithm

Comparison of FP tree and Apriori Algorithm

Improved Algorithm for Frequent Item sets Mining Based on Apriori and FP-Tree

Performance Based Study of Association Rule Algorithms On Voter DB

Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets

Research and Improvement of Apriori Algorithm Based on Hadoop

An Improved Technique for Frequent Itemset Mining

Decision analysis of the weather log by Hadoop

ALGORITHM FOR MINING TIME VARYING FREQUENT ITEMSETS

WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity

FP-Growth algorithm in Data Compression frequent patterns

Security Analysis of PSLP: Privacy-Preserving Single-Layer Perceptron Learning for e-healthcare

Parallel Approach for Implementing Data Mining Algorithms

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

SQL Query Optimization on Cross Nodes for Distributed System

The Analysis and Implementation of the K - Means Algorithm Based on Hadoop Platform

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Parallel Implementation of Apriori Algorithm Based on MapReduce

Adaption of Fast Modified Frequent Pattern Growth approach for frequent item sets mining in Telecommunication Industry

An Approach for Finding Frequent Item Set Done By Comparison Based Technique

External Sorting. Merge Sort, Replacement Selection

Nesnelerin İnternetinde Veri Analizi

The MapReduce Framework

Performing MapReduce on Data Centers with Hierarchical Structures

Energy efficient optimization method for green data center based on cloud computing

Parallel Mining of Maximal Frequent Itemsets in PC Clusters

FAST DATA RETRIEVAL USING MAP REDUCE: A CASE STUDY

Incremental FP-Growth

Chapter 7: Frequent Itemsets and Association Rules

A Hybrid Algorithm Using Apriori Growth and Fp-Split Tree For Web Usage Mining

Greedy Approach: Intro

Review of Algorithm for Mining Frequent Patterns from Uncertain Data

High Performance Computing on MapReduce Programming Framework

Association Rules. Berlin Chen References:

Fast Bit Sort. A New In Place Sorting Technique. Nando Favaro February 2009

Analysis of Algorithms

An Adaptive Threshold LBP Algorithm for Face Recognition

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase

Related Work The Concept of the Signaling. In the mobile communication system, in addition to transmit the necessary user information (usually voice

A computational model for MapReduce job flow

Huge Data Analysis and Processing Platform based on Hadoop Yuanbin LI1, a, Rong CHEN2

MapReduce: Recap. Juliana Freire & Cláudio Silva. Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec

ISSN: (Online) Volume 2, Issue 7, July 2014 International Journal of Advance Research in Computer Science and Management Studies

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud

CSCI6405 Project - Association rules mining

Transcription:

2016 Joint International Conference on Artificial Intelligence and Computer Engineering (AICE 2016) and International Conference on Network and Communication Security (NCS 2016) ISBN: 978-1-60595-362-5 Improved Balanced Parallel FP-Growth with MapReduce Qing YANG 1,a, Fei-Yang DU 2,b, Xi ZHU 1,c, Cheng-Gong JIANG * 1 Departmnet of Computer Science and Technology, Central China Normal University, Wuhan, Hubei, China 2 Collaborative & Innovative Center for Educational Technology, Central China Normal University, Wuhan, Hubei, China a yangqing@mail.ccnu.edu.cn, b feiyangdu@mails.ccnu.edu.cn, c ssisse@mails.ccnu.edu.cn *Corresponding author Keywords: Balanced Grouping, Parallel Computing, FP-Growth, MapReduce. Abstract. The existing parallel FP-Growth algorithm have solved the problems such as the partition of transaction dataset, which can guarantee that each transaction dataset is independent after the partition, but there are still many problems such as too many iterations in the process of FP-tree mining on single node and low efficiency. What s more, it did not consider the load balance when the master node divides dataset to the child nodes. By using Cutting strategy on PFP which is the original algorithm with MapReduce, we merged paths which are not frequent in FP-tree, and designed a new parallel FP-Growth algorithm. In addition, the load balancing strategy is used when the master nodes divide dataset to the child nodes. Through the combination of these two strategies, this paper designed a new parallel FP-Growth algorithm. Introduction With the arrival of the era of big data, the data set is often TB or even PB level. Computation and I/O volume is very huge when the association rules algorithm deals with massive data sets. Serial algorithms could not afford it while existing association rule mining algorithms are mostly memory consumed algorithms. When dealing with massive data, there is not enough memory or too long time to get a satisfactory result. So the parallelization of association rules mining algorithm came into being. At present, MapReduce [1] technology is flexible, efficient and virtual, which provides a new solution to solve the problems when processing massive data sets [2]. And the efficiency of traditional frequent item sets mining algorithms, such as FP-Growth [3], is extremely low. To this end, the parallel computing has been proposed to reduce the memory and CPU consumption. PFP algorithm used MapReduce or Spark framework [4] to implement the parallel processing of mining tasks. By taking the distributed advantages of the Hadoop framework, the algorithm has greatly improved parallel computing. However, in the process of data distribution, PFP algorithm did not consider the problem of balanced grouping, while balanced grouping is very important in dealing with massive data. Parallel Design of FP-Growth Algorithm FP-growth Algorithm The basic idea of FP-growth algorithm [5] is to use tree structure to compress the transaction, and keep the relationship between the attributes in the transaction. The algorithm requires scanning the transaction set twice: The first scan selects frequent items which are then sorted by frequency in descending order to form F-list. The second scan firstly takes the null as root node to construct FP-Tree. FP-Growth creates small FP-Tree from the conditional pattern base and executes FP-Growth on the FP-Tree. The process is recursively iterated until no conditional pattern base can be generated.

The FP-Growth Algorithm Based On Cutting Strategy We designed a frequent item sets mining algorithm based on cutting strategy, Cutting FP-tree algorithm. For the multiple paths item sets in FP-tree, if one path is not frequent and it meet the condition of the above ideas, we can cut the FP-tree. Compared with the original algorithm, the algorithm reduces the number of iterations of the FP-tree processing and improves the efficiency of the algorithm. After using cutting strategy to construct the FP-Tree, the following is a detailed procedure for Growth mining of the cutting FP-tree: First step:determine whether the FP-tree has a single path. Second step: If the FP-tree has a single path P, take all frequent items of the path P as each subset of the set (recorded as B). Then the generation mode is B X. Its support count is the minimum support count for all nodes. Third step: Some of the frequent items in FP-tree may exist in multiple paths. For example, frequent item i exists in the prefix path set both A and B, A belongs to B. The above mentioned cutting strategy can be used to merge the path of set B with the path of set A. Fourth step: Check whether the frequent item i turned into a single path after cutting. If not then continue to the fifth step, or back to the second step. Fifth step: For each frequent item a in the item header table, the generation model is, The count of the mode is equal to the count of the data item. Conditional frequent pattern tree Treeβ, is constructed by the conditional pattern base. If Treeβ, turn back to the first step until all the frequent pattern sets are mined. The Improved Balanced Parallel FP-growth Algorithm(IBPFP) Sharding While using Hadoop [6], we just need to copy the DB onto Hadoop distributed file system. Hadoop will do the sharding. The default size of each file block is 64M after segmentation. And the size of slice split is determined by the formula SplitSize=max{minSize,min{maxSize,blockSize}}. Parallel Counting This step uses one MapReduce job to count all the items. The input of this step is the dataset which stored in a distributed file system. The output of this step is the F-List, a list of frequent items sorted by frequency in descending order. Balanced Grouping Balanced Grouping fairly divides all the items in the F-List into Q groups, balancing load among all groups, in order to improve parallelization of the whole mining process. Balanced grouping needs two steps. Load Estimation of Frequent Items. All of the nodes that execute frequent item mining are added together to get the whole load of the parallel FP-Growth process. We can sum up the load of conditional pattern base of all frequent items on a node to obtain the whole load of the node. But it is not possible to get the exact load of the node. Because the load on the basis of the condition pattern base of each frequent item is different according to the different input data. For the above problem, we made the following estimation: The number of recursive iterations during FP-Growth execution on conditional pattern base of each item is estimated as load of FP-Growth on conditional pattern base of each item. The location of each item in F-List is estimated to be the length of the longest frequent path in the conditional pattern base. And the number of recursive iterations is exponentially proportional [7] with longest frequent path in the conditional pattern base. Set frequent item i in Flist position as P. And set the load it brings as L, According to the above ideas, we get the Eq. 1: L logp. (1)

Load Balancing Strategy. From the Eq. 1, we can know the pair value of each frequent item position in the Flist is equal to the value of its load. The whole question is equivalent to that, split a descending array into Q groups while equalizing the sum of the data in each group as much as possible. So the frequent items in Flist can be grouped by the following method: First, the Q data items in Flist were randomly placed into Q groups, Then repeat the following steps until all the data items in the Flist are assigned. (1) Find group G which sum is smallest in Q. (2) Increase load of group G by load of the new item. The implementation of the balanced grouping is to save the sum of Q groups load in a minimum heap. Because the load sum of the corresponding group of the heap top is minimum. We only need to add the load of each new data item to the heap top and update it. Parallel Cutting FP-tree Algorithm This step requires one MapReduce process. First, in the map phase, transactions in original DB are converted into the new group-dependent transactions. Then, in the reduce phase, FP-Tree construction and FP-Growth are recursively done on the group-dependent transactions by each reducer instance. Experiments We implemented PFP (No parallel improved FP-Growth algorithm), BPF (only balancing strategy is added on the basis of PFP) and IBPFP. Through the analysis and comparison of the data processing time of the three algorithms, we demonstrated the effectiveness of the algorithm IBPFP designed in this paper. The data set used in the experiment is made up of user attention and comment information data of Sina Twitter. In total, there are 185310 transaction records and 68400 data items. The average length of transaction is 300. The longest length of transaction is 2378. First of all, we compared the processing efficiency of the three algorithms in the same minimum support threshold of 10 and different transaction volume. The comparison is shown in Fig.1 processing time/s 3000 2500 2000 1500 1000 500 0 2 4 6 8 10 12 14 16 18 transaction volumn/thousands IBPFP BPFP PFP Figure 1. The processing efficiency in different transaction volume. Secondly, we compared the processing efficiency of the three algorithms in the same transaction volume of 10000 and different support threshold. The comparison is shown in Fig.2.

2000 processing time/s 1500 1000 500 0 5 10 15 20 25 30 35 40 support threshold IBPFP BPFP PFP Figure 2. The processing efficiency in different support threshold. Fig.1 shows that when the number of transactions is relatively small, the processing time of the three algorithms is roughly equal. However, with the increase of transaction volume, IBPFP algorithm began to show the advantages in processing efficiency. The efficiency of IBPFP algorithm has been always higher than BPFP, which proves the effectiveness of the cutting strategy. The efficiency of BPFP algorithm has been always higher than PFP, which proves the effectiveness of the load balancing strategy. However the efficiency difference between IBPFP and BPFP is not very sharp, which proves the improvement of load balancing strategy maybe more obvious than the cutting strategy. Fig.2 shows that in condition of the same transaction amount, the minimum support threshold is higher, the processing time of the algorithm is shorter. The above three algorithms are all satisfied with this rule. Compared with BPFP, IBPFP has the cutting process of FP-Tree, which makes some not frequent items paths merged and reduces iteration numbers of part of branches. Compared with PFP, IBPFP has the load balancing process, which makes each node balance loaded. Therefore the IBPFP is better than the BPFP and PFP in different support threshold. Though the above two experiments, we proved the better efficiency of the IBPFP. Conclusions and Future Work In this paper, we studied the MapReduce distributed computing framework and FP-Growth parallel algorithm. We improved the original algorithm by the FP-tree cutting strategy and load balance strategy. And though the experiments, we proved the better efficiency of the new algorithm named IBPFP. Now the data on the Internet is real-time dynamic growing. How to carry out parallel frequent item mining under this condition will be a new research hotspot. Acknowledgement This research was financially supported by Specific funding for education science research by self-determined research funds of CCNU from the colleges basic research and operation of MOE. References [1] Lee K.H., Lee Y.J., Choi H., et al. Parallel data processing with MapReduce: a survey [J]. AcM sigmod Record, 2012, 40(4): 11-20. [2] Dean J., Ghemawat S. MapReduce: A Flexible Data Processing Tool [J]. Communications of the Acm, 2010, 53(1): 72-77.

[3] Chen X.S., Zhang S., Tong H., et al. FP-Growth algorithm based on the boolean matrix and mapreduce [J]. J South China Univ Technol, 2014, 1: 155-156. [4] Deng L., Lou Y. Improvement and Research of FP-Growth Algorithm Based on Distributed Spark [C]//2015 International Conference on Cloud Computing and Big Data (CCBD). IEEE, 2015: 105-108. [5] Wei F., Xiang L. Improved Frequent Pattern Mining Algorithm Based on FP-Tree [C]//Proceedings of The fourth International Conference on Information Science and Cloud Computing (ISCC2015). 18-19 December 2015. Guangzhou, China. Online at http://pos. sissa. it/cgi-bin/reader/conf. cgi? confid= 264, id. 37. 2015. [6] Information on http://hadoop.apache.org. [7] Zhou L., Zhong Z., Chang J., et al. Balanced parallel FP-Growth with MapReduce [C] Information Computing and Telecommunications (YC-ICT), 2010 IEEE Youth Conference on. IEEE, 2010: 243-246.