Frequent Itemset Mining Algorithms for Big Data using MapReduce Technique - A Review


Mahesh A. Shinde #1, K. P. Adhiya *2
#1 PG Student, *2 Associate Professor
Department of Computer Engineering, SSBT College of Engineering and Technology, Bambhori, NMU Jalgaon, Maharashtra, India

Abstract
A very large quantity of data, called Big Data, is continuously generated from a variety of sources such as IT industries, internet applications, hospital history records, microphones, sensor networks and social media feeds. Traditional tools cannot handle big data because of its variety. Numerous data mining techniques have been developed to derive association rules and frequently occurring itemsets, but with the rapid arrival of the big data era, traditional data mining algorithms have been unable to meet the analysis requirements of large datasets. The size, complexity and variability of big data are the major challenges in recognizing association rules and frequent itemsets. MREclat handles the problem of insufficient memory and computational capability. ClustBigFIM and MRPrePost provide scalability and speed for mining large datasets. The MapReduce framework is widely used for parallel processing of Big Data; it provides high scalability and robustness, which help in handling large datasets. In this paper, we present an in-depth review of different frequent itemset mining (FIM) techniques.

Keywords: Big data, Data mining, Frequent Itemset Mining, Association Rule mining, MapReduce.

I. INTRODUCTION
Due to the growth of IT industries, services, technologies and data, a huge amount of complex data is generated from various sources and in various forms.
Such complex and massive data is difficult to handle and process. It contains billions of records covering millions of users and products, including online sales data, audio, images and videos from social media, news feeds, and product prices and specifications. The need for big data analytics arises at world-famous companies such as Google, Yahoo, Weibo, Facebook, Microsoft and Twitter, which must analyze huge amounts of data, much of it unstructured. Google, for example, holds a massive amount of data. Big data analytics is needed to handle and process this data: it analyzes huge amounts of information and reveals association rules, hidden patterns, trends and other meaningful information.

In 1998, John Mashey introduced the term Big Data [1]. Big data is a collection of large data consisting of different types of data. In the same year, Weiss and Indurkhya [2] published a data mining book that mentions the term. Data is called big data because very large quantities of it are generated every day. At the BigMine'12 workshop held at KDD, Usama Fayyad [3] presented some striking figures about internet usage: Google handles more than one billion queries every day, Twitter and Facebook see more than 250 million tweets and 800 million updates/comments every day respectively, and YouTube receives 4 billion user visits every day.

Doug Laney [4], VP and Distinguished Analyst for Gartner Research, was the first person to present the three V's in the management of Big Data. These 3 V's are as follows:
Volume: The size of data is larger than ever before and is increasing continuously. Traditional tools are not sufficient for such heavy data.
Variety: There are numerous varieties of data, such as images, video and audio in different formats, simple text, graphs, tables, location or log files, sensor data and other multimedia.
Velocity: Data grows continuously as a stream, and the user's primary interest is to extract only the meaningful data from it in near real time.

Two additional V's are:
Variability: The structure of the available information changes, as does the way users want to interpret it.
Value: The business value that gives an organization a compelling advantage through the ability to make decisions based on answering questions that were previously considered beyond reach.

Big data mainly consists of two types of data: structured and unstructured. Structured data includes digits and words that are not difficult to analyze and categorize. Structured data is produced by a number of sources such as mobile devices, aerial (remote sensing) devices, software logs, cameras, microphones, electronic devices, radio-frequency identification readers, wireless sensor networks and global positioning system devices. Structured data also includes things like bank account balances and bank account transaction information. Unstructured data, the other type, contains more composite data, such as user reviews from the Flipkart website, tweets from Twitter, images, videos, comments from Facebook and other multimedia. It is really difficult to categorize and analyze such composite data.

ISSN: Page 473

Frequent itemset mining is an important part of data analysis and data mining. The main goal of FIM is to mine information and reveal patterns from massive datasets on the basis of frequent occurrence: an event, or a set of events, is interesting if it occurs frequently in the data, according to a user-given minimum frequency threshold. Many techniques have been invented to mine frequent itemsets from databases. These techniques work well in practice on typical datasets, but they are not applicable to real Big Data. Applying frequent itemset mining to massive databases is not an easy task, and there are a number of difficulties. First, databases with a large number of records do not fit into main memory. In such cases, one solution is to use level-wise breadth-first search algorithms such as Apriori, in which frequency counts are obtained by reading the dataset again for each size of candidate itemset. Unfortunately, the memory required to hold the complete set of candidate itemsets blows up fast and renders Apriori-based schemes very inefficient on single machines. Second, current approaches tend to keep the output and runtime under control by increasing the minimum frequency threshold, which automatically reduces the number of candidate and frequent itemsets. Google [5] proposed the MapReduce framework, which is used for parallel processing of large datasets and works on key-value pairs.
Frequent itemset mining needs to calculate support and confidence, which can be done in parallel using the MapReduce programming model. Faster processing can be achieved by calculating item frequencies with map functions, which execute in parallel on a Hadoop cluster, and reduce functions, which combine the local frequent items into global frequent items.

The organization of this paper is as follows. Section II gives the background, a literature survey and a comparative analysis of FIM techniques. Section III explains the techniques and tools necessary for big data mining and the MapReduce framework. Section IV presents the conclusions.

II. BACKGROUND
The size, complexity and variability of Big Data are big challenges for recognizing association rules and mining frequent itemsets. The market basket model, which is based on relationships among elements, is the best-known example of association rules [6]. Association rule mining and frequent itemset mining are well-known data mining techniques; they discover how frequently items are purchased together. FIM requires scans of the whole database, which becomes a challenge as datasets scale, because large datasets do not fit into memory. Several approaches exist for association rule mining [7], [8], [9].

Frequent itemsets play an essential role in finding correlations, clusters, episodes and many other data mining tasks. The value discovered from frequent itemsets can be used to make decisions in marketing. Agrawal [6] first proposed the problem of mining itemsets from customer transaction databases in 1993, and FIM has since become an essential part of data mining. Most current algorithms fall into two groups: Apriori-like algorithms and FP-Growth (frequent pattern growth) algorithms. Apriori rejects candidate sets by repeatedly scanning the database. The main advantage of the FP-Growth algorithm is the FP-Tree. Neither algorithm is well adapted to large data.
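The support and confidence measures behind the market basket model can be made concrete with a small calculation. The sketch below is only an illustration: the transactions and the function names `support` and `confidence` are hypothetical and do not come from the paper. It uses the usual definitions, where the support of an itemset is the fraction of transactions that contain it, and the confidence of a rule A -> C is support(A and C together) divided by support(A).

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy market-basket data, invented for illustration only.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"milk", "eggs"}),
]


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Confidence of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    base = support(a, transactions)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return support(a | frozenset(consequent), transactions) / base


if __name__ == "__main__":
    print(support({"bread", "milk"}, TRANSACTIONS))
    print(confidence({"bread"}, {"milk"}, TRANSACTIONS))
```

An itemset is called frequent when its support reaches the user-given minimum threshold; every algorithm reviewed in this section is a different strategy for finding all such itemsets without testing every possible combination.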
For these two algorithms, one workaround is to consider only a large threshold value, which reduces the number of candidates, but this makes the mined association rules inaccurate because less of the data is used.

The mining of frequent itemsets is a basic and essential problem in many data mining applications. Algorithms for mining frequent itemsets can be classified into two types: algorithms based on a horizontal dataset layout, such as Apriori and FP-Growth, and algorithms based on a vertical database layout, such as Eclat. Eclat has an advantage over the horizontal-layout algorithms: it saves much time because it does not need to scan the whole database repeatedly.

Apriori is the most classical algorithm in the history of data mining. Its main idea is to generate candidate (k+1)-itemsets from the frequent k-itemsets, traverse the database to count the support of each candidate, and discard the candidates that fall below the support threshold. The pruning strategy is that if an itemset is not frequent, then none of its supersets is frequent either. The algorithm is very simple, but its main drawback is that it traverses the database too many times and produces a large number of candidate sets, so time and memory overhead become a bottleneck.

Compared with Apriori, FP-Growth is an improved algorithm. Its main advantage is that it needs to scan the database only twice and constructs a compressed data structure, the FP-Tree, which reduces the search space, generates no candidate sets and improves memory utilization. FP-Growth adopts a depth-first strategy. However, it constructs a large number of conditional pattern trees during recursion; when faced with huge amounts of data, it is difficult to hold all of the pattern trees in memory, and the time complexity of the tree traversal is high.
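The level-wise Apriori procedure described above can be sketched on a single machine as follows. This is a minimal illustration rather than the implementation used in any of the cited papers: the function name `apriori`, the absolute `min_count` threshold and the simple pairwise join are assumptions made for brevity. It shows the two properties the text relies on, namely one database pass per itemset length and pruning of any candidate that has an infrequent subset.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List


def apriori(transactions: Iterable[Iterable[str]],
            min_count: int) -> Dict[FrozenSet[str], int]:
    """Return every itemset whose occurrence count is >= min_count."""
    db: List[FrozenSet[str]] = [frozenset(t) for t in transactions]

    # Level 1: count single items with one pass over the database.
    counts: Dict[FrozenSet[str], int] = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 1
    while frequent:
        # Join step: merge frequent k-itemsets into (k+1)-candidates.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k + 1:
                continue
            # Prune step: every k-subset of a candidate must be frequent.
            if all(frozenset(sub) in frequent
                   for sub in combinations(union, k)):
                candidates.add(union)

        # Count step: one more full pass over the database per level.
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result


if __name__ == "__main__":
    demo = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    for itemset, count in sorted(apriori(demo, 3).items(),
                                 key=lambda p: (len(p[0]), sorted(p[0]))):
        print(sorted(itemset), count)
```

The repeated count step is the bottleneck the text refers to: each level rereads the whole database, and the candidate set held in memory grows quickly as the threshold is lowered.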
PFP is a parallel algorithm based on Hadoop (the MapReduce framework). PFP groups the itemsets and distributes each group, as a conditional database, to a node; each node independently builds an FP-Tree and mines frequent itemsets from its partition of the database. PFP minimizes the traffic between nodes and increases the degree of aggregation at each node. However, the algorithm is not efficient if the database is discrete, and the grouping strategy of PFP has problems with memory and speed. To balance the groups of PFP, Zhou et al. [10] proposed an algorithm for faster execution using single items, which is also not an efficient approach. Xia et al. [11] proposed the Improved PFP algorithm for mining frequent itemsets from datasets made up of massive numbers of small files, using a small-files processing strategy.

A number of hybrid methods have been invented for mining frequent itemsets. MRPrePost is a hybrid method that combines the DistEclat and PrePost algorithms. MREclat is also a hybrid method. ClustBigFIM is a modified BigFIM algorithm that uses parallel k-means and Eclat to find potential extensions and Apriori to produce k-FIs.

A. Literature Survey
There are three classic frequent itemset mining algorithms that run on a single node. The Apriori [6] algorithm is built around a loop: iteration k produces the frequent itemsets of length k, and iteration k+1 computes its candidate itemsets from the output of iteration k using the property that every subset of a frequent itemset must also be frequent. The FP-Growth [12] algorithm creates an FP-Tree with two scans of the whole dataset and then mines the frequent itemsets from that tree. The Eclat [4] algorithm transposes the whole dataset into a new table in which every row contains the sorted list of transaction IDs of the respective item. Frequent itemsets are then extracted by intersecting the transaction lists of the items.

Yahya et al. [15] presented two ideas for converting the Apriori algorithm into MapReduce tasks. In the first, all possible itemsets are extracted in the Map phase, and the itemsets that do not satisfy the minimum support threshold are removed in the Reduce phase. In the second, Apriori is converted directly: every iteration of the Apriori loop becomes a MapReduce task. These approaches are used by [13], [14]. In these approaches, large amounts of data are shuffled between the Map and Reduce tasks [15]. To solve this problem, they presented the MRApriori algorithm, a MapReduce-based improved Apriori algorithm that uses a two-phase structure.

Zhang et al. [16] presented an improved Eclat algorithm to increase the efficiency of FIM on large datasets. This parallel algorithm, based on the MapReduce framework, is called MREclat. MREclat addresses the problem of insufficient storage and computational capability when mining frequent itemsets from large, complex datasets, and it has very high scalability and better speedup than other algorithms. MREclat consists of three steps. In the initial step, all frequent 2-itemsets and their tid-lists are obtained from the transaction database. The second, the balanced group step, partitions the frequent 1-itemsets into groups. The third, the parallel mining step, redistributes the data obtained in the first step to different computing nodes according to the group their prefix belongs to; each node runs an improved Eclat to mine frequent itemsets. Finally, MREclat collects the output from each computing node and formats the final result.
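The Eclat search that each MREclat node runs is built on the vertical layout and tid-list intersection described earlier in this survey. The following single-machine sketch illustrates only that core idea; it is plain Eclat, not the improved variant of Zhang et al., and the names `vertical_layout`, `eclat` and `min_count` are hypothetical. The support of an extended itemset is simply the size of the intersection of two tid-lists, so the database is never rescanned.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def vertical_layout(transactions: Iterable[Iterable[str]]) -> Dict[str, Set[int]]:
    """Transpose the database: map each item to the set of transaction IDs."""
    tidlists: Dict[str, Set[int]] = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists[item].add(tid)
    return dict(tidlists)


def eclat(transactions: Iterable[Iterable[str]],
          min_count: int) -> Dict[Tuple[str, ...], int]:
    """Depth-first mining of frequent itemsets by tid-list intersection."""
    tidlists = vertical_layout(transactions)
    start: List[Tuple[str, Set[int]]] = sorted(
        ((item, tids) for item, tids in tidlists.items()
         if len(tids) >= min_count),
        key=lambda pair: pair[0],
    )
    result: Dict[Tuple[str, ...], int] = {}

    def extend(prefix: Tuple[str, ...],
               candidates: List[Tuple[str, Set[int]]]) -> None:
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            result[itemset] = len(tids)
            # Intersect with the tid-lists of the remaining items only,
            # so each itemset is generated exactly once.
            next_candidates = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_count:
                    next_candidates.append((other, shared))
            extend(itemset, next_candidates)

    extend((), start)
    return result


if __name__ == "__main__":
    demo = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    for itemset, count in eclat(demo, 3).items():
        print(itemset, count)
```

Because every recursive call works only on the tid-lists that share a prefix, the search tree splits naturally by prefix, which is what makes it possible to hand different prefix groups to different computing nodes.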
Moens et al. [17] proposed two methods for frequent itemset mining for Big Data on MapReduce. The first, Dist-Eclat, is a distributed version of pure Eclat that optimizes speed by distributing the search space evenly among mappers. The second, BigFIM, uses both an Apriori-based method and Eclat, with projected databases that fit in memory, to extract frequent itemsets. The advantage of Dist-Eclat is speed and that of BigFIM is scalability; conversely, Dist-Eclat does not provide scalability and BigFIM is slower.

Riondato et al. [18] presented the Parallel Randomized Algorithm (PARMA), which finds a set of frequent itemsets in less time using a sampling method. PARMA mines frequent patterns and association rules from precise data. As a result, the mined frequent itemsets are close to the original results. It finds the sampling list using the k-means clustering algorithm; the sample list is simply the set of clusters. The main advantages of PARMA are that it reduces data replication and executes faster.

Liao et al. [19] presented the MRPrePost algorithm, based on the MapReduce framework. MRPrePost is an improved version of PrePost whose performance is improved by including a prefix pattern. On this basis, MRPrePost is well suited to mining association rules from large data. MRPrePost outperforms PrePost and PFP, and its stability and scalability are also better. Its mining result is close to the original result.

BigFIM [17] overcomes the problems of Dist-Eclat, namely that mining the sub-trees requires the entire database to fit into main memory and that the entire dataset must be communicated to most of the mappers. BigFIM is a hybrid approach that uses the Apriori algorithm to generate k-FIs and then applies the Eclat algorithm to find the frequent itemsets.
The limitations of BigFIM are that, when Apriori is used to generate k-FIs, the candidate itemsets do not fit into memory at greater depths, and that it is slow.

To address these limitations, Gole et al. [20] proposed ClustBigFIM. ClustBigFIM provides a hybrid approach to frequent itemset mining on large datasets using a combination of parallel k-means, the Apriori algorithm and the Eclat algorithm. It overcomes the limitations of BigFIM by increasing scalability and performance. Its output is close to the original results but is produced faster. ClustBigFIM works in four steps applied to large datasets: finding clusters, finding k-FIs, generating a single global TID list, and mining the subtrees.

B. Literature Review
Table I gives a comparative analysis of different frequent itemset mining techniques that work on the MapReduce framework.

Table I. Comparative Analysis of Different FIM Techniques Based on MapReduce Framework

| Author's Name | Technique | Benefits | Limitations |
|---|---|---|---|
| Zhou et al. [10] | Balanced FP-Growth | Faster execution using singletons with balanced distribution | Partitioning the search space using single items is not the best way |
| Riondato et al. [18] | PARMA | Reduces data replication; faster execution; scales linearly | Mined frequent itemsets are close to the original results |
| Moens et al. [17] | Dist-Eclat | Speed | Scalability |
| Moens et al. [17] | BigFIM | Scalability | Speed |
| Liao et al. [19] | MRPrePost | Better stability and scalability; performance better than PFP and PrePost | Mined frequent itemsets are close to the original result |
| Gole et al. [20] | ClustBigFIM | Provides scalability and speed to mine frequent patterns, association rules, sequential patterns and correlations from massive datasets | Parallel k-means gives results close to the original rather than the truly frequent patterns |

III. TECHNIQUES AND TOOLS
Big Data is a collection of unstructured and structured data. The term is closely related to the open source software revolution. World-famous companies such as Facebook, Yahoo!, Twitter and Microsoft benefit from and contribute to open source projects.
Big Data infrastructure is built mainly on Hadoop and other related software:

Apache Hadoop [21]: The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Hadoop allows writing applications that rapidly process large amounts of data in parallel on large clusters of compute nodes. A MapReduce task partitions the input dataset into a set of independent subsets. These subsets are processed individually by map tasks, and the result of the map phase is then passed to the reduce phase to obtain the final result of the task.

There is much open source software for Big Data mining. The most popular packages are the following:

Apache Mahout [22]: Apache Mahout is a scalable open source machine learning and data mining library based mainly on Hadoop. It includes implementations of data mining algorithms for frequent pattern mining, clustering, classification and more.

R [23]: R is an open source programming language and software environment designed for statistical computing and visualization. Ross Ihaka and Robert Gentleman designed R at the University of Auckland, New Zealand, in 1993. R is used for statistical analysis of very large data sets.

MOA [24]: MOA is open source software for stream data mining. Its main purpose is to perform data mining in real time. The MOA library includes machine learning algorithms for classification, regression, clustering, frequent itemset mining, frequent graph mining, outlier detection and concept drift detection.

MapReduce Framework: MapReduce, introduced by Google in 2004 [25], made a great contribution to the advent of distributed association rule mining. Various algorithms have been proposed, developed or modified to run on the MapReduce framework.
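The flow described above, in which the input is partitioned, each partition is handled by a map task and the map output is combined in a reduce phase, can be imitated in a few lines of ordinary Python. The sketch below is an in-memory simulation for illustration only and does not use the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase` and `frequent_items`, and the `min_count` threshold, are hypothetical. It counts single items, which is the first step of most MapReduce-based FIM algorithms.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(split: Iterable[Iterable[str]]) -> Iterator[Tuple[str, int]]:
    """Mapper: emit (item, 1) for every distinct item of every transaction."""
    for transaction in split:
        for item in set(transaction):
            yield item, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: Iterable[int],
                 min_count: int) -> Iterator[Tuple[str, int]]:
    """Reducer: sum the local counts and keep only globally frequent items."""
    total = sum(values)
    if total >= min_count:
        yield key, total


def frequent_items(splits: Iterable[Iterable[Iterable[str]]],
                   min_count: int) -> Dict[str, int]:
    """Run the simulated job over all input splits."""
    intermediate: List[Tuple[str, int]] = []
    for split in splits:  # on a cluster, each split goes to its own mapper
        intermediate.extend(map_phase(split))
    output: Dict[str, int] = {}
    for key, values in shuffle(intermediate).items():
        for item, total in reduce_phase(key, values, min_count):
            output[item] = total
    return output


if __name__ == "__main__":
    splits = [
        [["a", "b"], ["a", "c"]],   # split handled by a first mapper
        [["a", "b", "c"], ["b"]],   # split handled by a second mapper
    ]
    print(sorted(frequent_items(splits, 3).items()))
```

On a real cluster the mappers run in parallel on different nodes and the shuffle moves data over the network, which is why algorithms that emit many intermediate pairs pay a high communication cost.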
The MapReduce framework pools the storage and computational capacity of many distributed commodity machines. MapReduce can easily perform computation on huge datasets, and it is well suited to executing complex parallel algorithms that make very limited use of communication. The framework has two phases, the Map phase and the Reduce phase, and large parallel computations are specified by the user through map and reduce functions. The map function takes a chunk of data from HDFS in (key, value) pair format and generates a set of intermediate (key', value') pairs. It is formalized as

$\mathit{map} :: (\mathit{key}, \mathit{value}) \rightarrow (\mathit{key}', \mathit{value}')$

The MapReduce framework collects all intermediate values that are bound to the same intermediate key and passes them to the reduce function, which merges them together. The intermediate values are supplied to the reduce function through an iterator, which makes it possible to handle lists of values that are too large to fit in memory. It is formalized as

$\mathit{reduce} :: (\mathit{key}', \mathit{list}(\mathit{value}')) \rightarrow (\mathit{key}'', \mathit{value}'')$

The output can consist of one or more files, which are written to HDFS. Tasks such as inverted index, term vector per host, distributed sort, distributed grep and count of URL access frequency can be carried out with the MapReduce framework.

IV. CONCLUSIONS
Frequent itemset mining is an important research topic because it is widely applied in the real world to find frequent itemsets and to mine human behavior patterns and trends. This paper presented a comparative study of a number of FIM techniques. The FIM process is both memory and compute intensive. Various FIM techniques have been proposed and developed over the last couple of years to overcome the problem of insufficient memory and computational capability when mining frequent itemsets from massive datasets. Hybrid approaches have also improved the performance, stability and scalability of the algorithms. Efficiency and scalability are crucial when designing a FIM algorithm for large datasets. However, current distributed FIM algorithms often suffer from generating huge intermediate data or from scanning the whole transaction database to identify the frequent itemsets. In future, the search space should be reduced, and truly frequent patterns, rather than results that are only close to them, should be mined in less time.

REFERENCES
[1] F. Diebold. On the Origin(s) and Development of the Term Big Data. Pier working paper archive, Penn Institute for Economic Research, Department of Economics, University of Pennsylvania,
[2] S. M. Weiss and N. Indurkhya, Predictive data mining: a practical guide, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
[3] U. Fayyad. Big Data Analytics: Applications and Opportunities in On-line Predictive Modeling, [Online]. Available:
[4] D. Laney. 3-D Data Management: Controlling Data Volume, Velocity and Variety. META Group Research Note, February 6,
[5] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. OSDI. USENIX Association,
[6] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami, Mining association rules between sets of items in large databases, SIGMOD Rec. 22, 2 (June 1993), DOI= /
[7] Jochen Hipp, Ulrich Güntzer, and Gholamreza Nakhaeizadeh. Algorithms for association rule mining: a general survey and comparison. SIGKDD Explor. Newsl. 2, 1 (June 2000),
[8] Woo Sik Seol, Hwi Woon Jeong, Byungjun Lee, and Hee Yong Youn, Reduction of Association Rules for Big Data Sets in Socially-Aware Computing, Computational Science and Engineering (CSE), 2013 IEEE 16th International Conference on, vol., no., pp.949,956, 3-5 Dec
[9] Jiawei Han, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[10] L. Zhou, Z. Zhong, J. Chang, J. Li, J. Huang, and S. Feng. Balanced parallel FP Growth with MapReduce. In Proc. YC-ICT, pages ,
[11] Dawen Xia, Yanhui Zhou, Zhuobo Rong, and Zili Zhang, IPFP: an improved parallel FP-Growth Algorithm for Frequent Itemset Mining, isiproceedings.org,
[12] J. Han, J. Pei, and Y. Yin, Mining Frequent Patterns Without Candidate Generation, in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD 00. New York, NY, USA: ACM, 2000, pp
[13] M.-Y. Lin, P.-Y. Lee, and S.-C. Hsueh, Apriori-based Frequent Itemset Mining Algorithms on MapReduce, in Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, ser. ICUIMC 12. New York, NY, USA: ACM, 2012, pp. 76:1 76:8.
[14] N. Li, L. Zeng, Q. He, and Z. Shi, Parallel Implementation of Apriori Algorithm Based on MapReduce, in Proceedings of the th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, ser. SNPD 12. Washington, DC, USA: IEEE Computer Society, 2012, pp
[15] O. Yahya, O. Hegazy, and E. Ezat, An efficient implementation of Apriori algorithm based on Hadoop MapReduce model, International Journal of Reviews in Computing, vol. 12, pp ,
[16] Zhigang Zhang, Genlin Ji, and Mengmeng Tang, MREclat: an Algorithm for Parallel Mining Frequent Itemsets, 2013 International Conference on Advanced Cloud and Big Data, DOI /CBD
[17] S. Moens, E. Aksehirli, and B. Goethals, Frequent Itemset Mining for Big Data, Big Data, 2013 IEEE International Conference on, vol., no., pp.111,118, 6-9 Oct DOI: /BigData
[18] M. Riondato, J. A. DeBrabant, R. Fonseca, and E. Upfal. PARMA: a parallel randomized algorithm for association rules mining in MapReduce. In Proc. CIKM, pages ACM,
[19] Jinggui Liao, Yuelong Zhao, and Saiqin Long, MRPrePost: A Parallel algorithm adapted for mining big data, IEEE Workshop on Electronics, Computer and Applications,
[20] Sheela Gole and Bharat Tidke, Frequent Itemset Mining for Big Data in social media using ClustBigFIM algorithm, International Conference on Pervasive Computing (ICPC), 2015.
[21] Apache Hadoop, [Online]. Available :
[22] Apache Mahout, [Online]. Available :
[23] R Core Team. R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN
[24] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive Online Analysis, [Online]. Available: Journal of Machine Learning Research (JMLR),
[25] J. Dean and S. Ghemawat, MapReduce: simplified data processing on large clusters, in Communications of the ACM, p ,
