A comparative study of Frequent pattern mining Algorithms: Apriori and FP Growth on Apache Hadoop

A comparative study of Frequent pattern mining Algorithms: Apriori and FP Growth on Apache Hadoop Ahilandeeswari.G. 1 1 Research Scholar, Department of Computer Science, NGM College, Pollachi, India, Dr. R. Manicka Chezian 2 2 Associate Professor, Department of Computer Science, NGM College, Pollachi, India, Abstract In Data Mining Research, Frequent pattern (itemset) mining plays an important role in association rule mining. The Apriori and FP-growth algorithms are the most famous algorithms which can be used for Frequent Pattern mining. The analysis of literature survey would give the information about what has been done previously in the same area, what is the current trend and what are the other related areas. This paper explains the concepts of Frequent Pattern Mining and two important approaches candidate generation approach and without candidate generation. The paper describes methods for frequent item set mining and various improvements in the classical algorithms Apriori and FP growth for frequent item set generation. Apache Hadoop is a major innovation in the IT market place last decade. From humble beginnings Apache Hadoop has become a world-wide adoption in data centers. It brings parallel processing in hands of average programmer. This paper presents a literature review of frequent pattern mining algorithms on Hadoop. Keywords : Data mining, Association rule, Frequent itemset mining,apriori and FP growth algorithm. 1. INTRODUCTION Frequent pattern mining [1] plays a major field in research since it is a part of data mining. Many research papers, articles are published in the field of Frequent Pattern Mining (FPM). This chapter details about frequent pattern mining algorithm, types and extensions of frequent pattern mining, association rule mining algorithm, rule generation, suitable measures for rule generation. Frequent pattern mining is fundamental in data mining. The goal is to compute on huge data efficiently. Finding frequent patterns plays a fundamental role in association rule mining, classification, clustering, and other data mining tasks. Frequent pattern mining was first proposed by Agarwal et. al. [1] for market basket analysis in the form of association rule mining. Frequent Itemset mining came into existence 451 where it is needed to discover useful patterns in customer s transaction database. A customer s transaction database is a sequence of transactions (T=t1 tn), where each transaction is an itemset (ti I). An itemset with k elements is called a k-itemset. An itemset is frequent if its support is greater than a support threshold, denoted by min supp. The frequent itemset problem is to find all frequent itemset in a given transaction database. The first and most important solution for finding frequent itemsets, is the Apriori algorithm. The fundamental frequent pattern algorithms are classified into two ways as follows: 1.Candidate generation approach (E.g. Apriori algorithm, AprioriTID, Apriori Hybrid) 2.Without candidate generation approach (E.g. FP Growth algorithm)

Understanding the FP Tree Structure: The frequent-pattern tree (FP-tree) is a compact structure that stores quantitative information about frequent patterns in a database. One root labeled as null with a set of item-prefix subtrees as children, and a frequent-item-header table.[2]. 2. BASIC APPROACH S OF FREQUENT PATTERN MINING 2.1 Candidate Generation Approach Apriori: Apriori proposed by R. Rakesh[1] is the fundamental algorithm. It searches for frequent itemset browsing the lattice of itemsets in breadth. The database is scanned at each level of lattice. Additionally, Apriori uses a pruning technique based on the properties of the itemsets, which are: If an itemset is frequent, all its sub-sets are frequent and not need to be considered. Get frequent items Generate Candidate Itemset Get frequent itemset Generate Set-null yes Generate strong rules Fig 1: High level design of Apriori No AprioriTID: AprioriTID algorithm uses the generation function in order to determine the candidate item sets. The only difference between the two algorithms is that, in AprioriTID algorithm the database is not referred for counting support after the first pass itself. Apriori Hybrid: Apriori Hybrid uses Apriori in the initial passes and switches to AprioriTid when it expects that the candidate item sets at the end of the pass will be in memory[3]. 2.1.1 Implementation of Apriori Algorithm Step 1 Step 2 Step 3 Find all frequent itemsets: Get frequent items: Items whose occurrence in database is greater than or equal to the min.support threshold. Get frequent itemsets: Generate candidates from frequent items. Prune the results to find the frequent itemsets. Generate strong association rules from frequent itemsets: Rules which satisfy the min.support and min.confidence threshold. Fig 2: Apriori Algorithm Start 452

2.2 Without Candidate Generation Approach FP-growth: The principle of FP-growth method [5] is to found that few lately frequent pattern mining methods being effectual and scalable for mining long and short frequent patterns. FP-tree is proposed as a compact data structure that represents the data set in tree form. 2.2.1 Implementation of FP Growth Algorithm Calculate the support count of Each items Sort items in decreasing Counts Read transaction t Fig 3: Activity design of FP growth First a transaction t is read from database. The algorithm checks whether the prefix of t maps to a path in the FP tree. If this is the case support count of the corresponding node in the tree are incremented. If there is no overlapped path, new nodes are created with the support count 1. Step 1 The dataset is scanned to determine the support of each items. The infrequent items are discarded in FP tree. All Frequent items are ordered Step 2 based on their support The algorithm does the second pass over the data and construct the FP tree Fig 4: Algorithm for FP Growth 2.3. Comparative analysis The two algorithms discussed above are Has next Increment the widely studied algorithms for frequent pattern frequency mining. The apriori algorithm works by Non count for each generating candidate itemsets while the FPgrowth algorithm works without generating Overlapped overlapped Prefix found items the candidate sets. Overlapped The apriori algorithm has the following Prefix found bottlenecks: 1.) Difficult to handle huge number of Create a new node candidate itemsets. The candidate generation labeled with the items can be very costly with the increasing size of database. 2.) It is tedious to repeatedly scan the huge Set the frequency Create new databases. count to 1 node for The FP-growth algorithm is quite a different none algorithm from its predecessors. It works by overlapped generating a prefix-tree data structure known items as FP-tree from two scans of the database. This algorithm doesn t need to scan the database multiple times. The main drawbacks return of the apriori algorithm are removed with the introduction of the FPgrowth algorithm. The 453

table represents the comparison of two algorithms studied here based on different parameters.[4] Table 1: Differentiation between Apriori and FP Growth Parameter Apriori FP Growth Technique Memory utilization No of scans Time 454 Use Apriori property and join and prune property Large memory space for candidate itemsets Multiple scans of database. Execution time is large because of candidate itemsets generation 2.4. Apache Hadoop Constructs FP-tree and conditional pattern base satisfying minimum support. Lesser memory due to compact structure Scans the database twice only Execution time is smaller. The Apache Hadoop develops opensource software for reliable, scalable, distributed computing [6]. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is a, Java-based programming framework that supports the processing of large data sets in a distributed computing environment and is part of the Apache project sponsored by the Apache Software Foundation. Hadoop was originally conceived on the basis of Google's Map Reduce, in which an application is broken down into numerous small parts [9]. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available. service on top of a cluster of computers, each of which may be prone to failures[15]. Hadoop framework is popular for HDFS and Map Reduce. The Hadoop Ecosystem also contains different projects which are discussed below [7] [8]: The Hadoop includes these modules: Hadoop Common: The common utilities that support the other Hadoop modules. It contains libraries and utilities needed by other Hadoop modules. Hadoop Distributed File System (HDFS ): A distributed file system that provides high-throughput access to application data. HDFS is the Hadoop file system and comprises two major components: namespaces and blocks storage service. The namespace service manages operations on files and directories, such as creating and modifying files and directories. The block storage service implements data node cluster management, block operations and replication. Hadoop YARN: A framework for job scheduling and cluster resource management. YARN is a resource manager that was created by separating the processing engine and resource management capabilities of Map Reduce as it was implemented in Hadoop 1. YARN is often called the operating system of

Hadoop because it is responsible for managing and monitoring workloads, maintaining a multi-tenant environment, implementing security controls, and managing high availability features of Hadoop. Hadoop Map Reduce: A YARNbased system for parallel processing of large data sets. The Hadoop Map Reduce framework consists of one Master node termed as Job Tracker and many Worker nodes called as Task Trackers. 3. REVIEW ON SEVERAL IMPROVEMENTS OF APRIORI AND FP GROWTH ALGORITHM ON APACHE HADOOP The table clearly summarizes the essential information of all the algorithms discussed in the paper. The main purpose of the table is to highlight the application of all the above stated algorithms. In this section we are going to discuss content of the paper, proposed system, research gape and observed parameters of different papers. Table 2: Represents a Comparative study of Algorithms Author Technique Benefit Khurana, Apriori Better than both K., and Hybrid: Apriori and Sharma Used where Apriori and AprioriTID used. AprioriTID.[3] Apriori: Borgelt, C. Best for closed item sets. 1. Fast 2. Less candidate sets, 3. Generates Khurana, K., and Sharma Hunyadi, D. Borgelt, C. Apriori TID: Used for smaller problems FP Growth: Used in cases of large problems as it doesn t require generation of candidate sets. candidate sets from only those items that were found large. [10] 1. Doesn t use whole database to count candidate sets. 2. Better than SETM. 3. Better than Apriori for small databases. 4. Time saving.[3]. 1.Only 2 passes of dataset,, Compresses data set. 2. No candidate set generation required so better than éclat, Apriori.[11][10] In table 3, Different algorithm s proposed system and their limitations are discussed which give us a research gape in that papers. Table 3: Comparative Analysis of Apriori and FP Growth Algorithm on Apache Hadoop 455

Author Othman Yahya, Osman Hegazy, Ehab Ezat. Pallavi Roy. Content and Proposed System Authors implemented an efficient MapReduce Apriori algorithm (MRApriori) based on Hadoop- MapReduce model which needs only two phases (MapReduce Jobs) to find all frequent kitemsets, In this thesis association rule mining was implemented on hadoop. An association rule mining helps in finding relation between the items or item sets in the given [13] data. Research Gape In this paper author implement Apriori algorithm on a single machine or can say a stand-alone mode so there are some chance to implement on multiple node.[12] Limitation of this algorithm is that it can generated too many association rules and also small dataset may not give good performance. There are some chance to make algorithm faster this by Sandy Moens, Emin Aksehirli, and Bart Goethals. In this paper, there are two new methods introduce for FIM: Dist- Eclat focuses on speed while BigFIM is optimized to run on really large datasets [14]. using hashing. In this paper one issues is that, Dist-Eclate or original Eclat algorithms use vertical datasets, so datasets has to be converted from horizontal to vertical data. In table 4, different parameters are observed in time of performance evolutions which summarize in this table. Table 4: Observed Parameters Pape r Executio n time Efficienc y 12 yes No No 13 yes Yes No 14 yes Yes No Load Balancin g 456

4. CONCLUSION In recent years the size of database has increased rapidly. Therefore require a system to handle such huge amount of data. In this paper, algorithmic aspects of association rule mining are dealt with. From a broad variety of efficient algorithms the most important ones are compared. The algorithms are systemized and their performance is analyzed based on runtime and theoretical considerations. Despite the identified fundamental differences concerning employed strategies, runtime shown by algorithms is almost similar. The comparison table shows that the Apriori algorithm outperforms other algorithms in cases of closed item sets whereas FP growth displayed better performance in all the cases. The overall goal of the frequent item set mining process helps to form the association rules for further use. This paper gives a brief survey on frequent pattern mining algorithm Apriori and FP Growth algorithm on Apache Hadoop. 4. REFERENCES [4] Sumit Aggarwal and Vinay Singal, "A Survey on Frequent pattern mining Algorithms.", International Journal of Engineering Research & Technology (IJERT), ISSN: 2278-0181, Vol. 3 Issue 4, pp 2606-2608, April 2014. [5] J. Wang, J. Han, and J. Pei, Searching for the Best Strategies for Mining Frequent Closed Itemsets., International conference on Knowledge Discovery and Data Mining (KDD'03), Proc. 2003, ACM SIGKDD Aug. 2003. [6] Ms. Dhamdhere Jyoti L., Prof. Deshpande Kiran B. "An Effective Algorithm for Frequent Itemset Mining on Hadoop.", International Journal of Science, Engineering and Technology Research (IJSETR), Volume 3, Issue 8, August 2014. [7] Ferenc Kovacs and Janos Illes Frequent Itemset Mining on Hadoop.,IEEE 9th International conference on Computational Cybrnetics, Volume 2 Issue 4, june 2013. [1] R.Agarwal and R. Srikant, Fast Algorithms for Mining Association Rules., International Conference on very large Databases, proc.20th, pp 487-499, june 1994. [2] Prashasti Kanikar, Twinkle Puri, Binita Shah, Ishaan Bazaz, Binita Parekh," A Comparison of FP tree and Apriori algorithm. ",International Journal of Engineering Research and Development,Volume 10, Issue 6, pp 78-82, June 2014. [3] Khurana K and Sharma.S, A comparative analysis of association rule mining algorithms., International Journal of Scientific and Research Publications, Volume 3, Issue 5, pp 38-45, May 2013. 457 [8] Ms. Dhamdhere Jyoti L., Prof. Deshpande Kiran B. "A Novel Methodology of Frequent Itemset Mining on Hadoop.", International Journal of Science, Engineering and Technology Research (IJSETR), Volume 3, Issue 8, August 2014. [9] Tao Gu, Chuang Zuo, Qun Liao, "Improving Map Reduce Performance by Data Prefetching in Heterogeneous or Shared Environments.", International Journal of Grid and Distributed Computing,Vol.6, No.5, pp.71-82, june2013. [10] Borgelt, C. Efficient Implementations of Apriori and Eclat. Workshop of frequent item

set mining implementations (FIMI 2003, Melbourne, FL, USA). [11] Hunyadi, D. Performance comparison of Apriori and FP-Growth Algorithms in Generating Association Rules. Proceedings of the European Computing Conference ISBN: 978-960-474-297-4. [12] Othman Yahya, Osman Hegazy, Ehab Ezat. An Efficient Implementation of Apriori Algorithm Based On Hadoop-Mapreduce Model. IJRIC, pp. 59:67, june 2012 [13] Pallavi Roy. Mining Association Rules in Cloud, M.S thesis, Dept. Computer. Eng, North Dakota State University of Agriculture and Applied Science, Fargo, North Dakota, August 2012. [14] Sandy Moens, Emin Aksehirli, and Bart Goethals. Frequent itemset mining for big data.ieee International Conference on Big Data, pp. 111 118, May 2013 [15] Sivaraman.E, Manickachezian.R, High Performance and Fault Tolerant Distributed File System for Big Data Storage and Processing Using Hadoop, IEEE Xplore, ISBN: 978-1-4799-3967-1 458