A REVIEW ON DATA PARTITION BASED FREQUENT ITEMSET MINING

Size: px

Start display at page:

Download "A REVIEW ON DATA PARTITION BASED FREQUENT ITEMSET MINING"

Molly Joan Green
5 years ago
Views:

Volume 119 No. 7 2018, 2273-2279 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.

1 Volume 119 No , ISSN: (printed version); ISSN: (on-line version) url: ijpam.eu A REVIEW ON DATA PARTITION BASED FREQUENT ITEMSET MINING Swathika.K, Helen W.R School of Computing, SASTRA DEEMED TO BE UNIVERSITY, India Abstract Frequent Itemset mining is an interesting domain of Data Mining that helps to discover knowledge or pattern from a large set of raw data. The discovered knowledge helps in decision making support and other business intelligence models.with the growing distributed environment that demands parallel processing to improve the efficiency various techniques has been proposed to find the frequent itemsets by partitioning them across multiple nodes in a parallel manner.this paper surveys the various techniques to partition and mine frequent itemsets over a distributed environment. Keywords:Frequent Itemset Mining,Data partition,distributed data 1.INTRODUCTION The advent of modern technology has made the data mining application excessively large and complex. Frequent Itemset Mining(FIM) is a part of Association Rule Mining Techniques.It is used to generate a group of items that occurs together most frequently in the transactions.generally, sequential FIM algorithms decrease the performance hence the parallel method of Frequent Itemset Mining is preferred over them. The Frequent patterns are the substructures or itemsets that occurs frequently in a large database. It can be useful in identifying a pattern to make certain decisions such as Market Basket analysis.it helps in identifying the correlation or relationship between the itemsets that occur frequently in the pattern.it is useful in both scientific domains such as Genetic research,neural Networks etc as well as in commercial domains such as Market-Basket analysis,plagrism checking,related searches,keyword Matching etc. The distributed environment has an objective of achieving efficiency. Data partitioning [10] over multiple nodes enables parallelism in execution which can increase the speed of processing.the data should be partitioned such that the loads are equally distributed over the multiple nodes.traditional approaches involved generating frequent itemsets, those itemsets are sorted in their descending order of frequency and they are divided into equal groups and placed over nodes.the idea of mining frequent itemsets, partitioning them together and distributing them equally reduces the load, however, the correlation between the data item is inaccurate.this results in poor data locality so that the shuffling cost and network overhead increases. 2. APPLIED METHODOLOGIES The Frequent Itemsets can be computed in two ways either by generating candidate itemsets or without candidate itemsets.mining Frequent Itemsets by generating Candidate set uses Apriori Algorithm. It uses Support and Confidence to generate frequent itemsets from candidate sets. Another method is to find itemsets that are frequent without generating any candidate set 2273

2 rather they use Frequent Pattern Growth tree.fptree is constructed by finding a set of frequent itemsets initially and generate a growth three which follows a Depthwise search Frequent Itemset Mining(FIM) Frequent Itemset Mining is a technique of Mining that comes under Association Rule Mining to find the most frequently occurring combination of Itemsets.The frequency of the Itemsets is decided with two parameters Support and Confidence.Support value gives the frequency of an item in the given set.confidence is a measure of the truthiness of a relationship between two items. Support can be given by Support=(Number of times a transaction having the item set/total number of transactions) Confidence can be given by Confidence=Number of times any items occur together in the data set. FIM can be generated in two ways, either by generating Candidate set or without generating a Candidate set. 2.2.Apriori Algorithm Though there are few algorithms such as ECLAT and Dist-ECLAT has been in practice to generate frequent itemset mining they are not suitable for parallel execution and handling large databases. Apriori algorithm uses downward-closure property since it is impossible to search all the possible combinations of frequent items.it states that Every subset of items are frequent if their superset is frequent.the algorithm uses two steps to create rules 1.Minimum support is applied over all frequent Itemsets. 2.Minimum confidence constraint is applied to find the rules. The Apriori algorithm first reads the transaction database once and generates Frequent-1 itemsets.then it generates candidate set of length K+1 for a Transaction database of K length.a candidate set is generated by applying minimum Support and Confidence rules from the length set.the process is repeated until there can be no further set of candidate generation is possible and also no candidate set is generated for infrequent itemsets. The excessive set of candidate generation for a large database in Apriori increases the computation time to great extent.further optimization on Apriori algorithm like partitioning,sampling etc has also been proposed to reduce the candidate set generation however the number of pass to scan the database and other factors like memory utilization are in trade-off. 2.3.FP-Growth Algorithm To overcome the drawbacks of Apriori algorithms FP-Growth can be used.it generates a Depth-first search based Frequent-pattern tree.the algorithm works by following these steps: 1.It reads the entire database once and generates a Frequent1-Itemsets. 2.The Frequent1-Itemsets are arranged in their decreasing order of frequency. 3.From the sorted list, a frequent-pattern tree is constructed. 4.From the FP-Tree a frequent pattern which is the longest is taken as F-list. 5.Conditional pattern base can be generated from the F-list. 6.Construct FP-tree for each conditional pattern set. And generate the final set of frequent itemsets. This algorithm eliminates the need for generation of candidate set and improves the completeness and Performance of mining the Frequent Itemsets when compared to the Apriori algorithm which spends half of the time in total for candidate generation which is not feasible for huge datasets.execution of this FP-Growth in parallel can improve the performance efficiency by huge rate. 2.4.Load Balancing FP-tree The generation of FP-tree incurs less database scan compared to the Apriori algorithm but the drawback of the pattern generation is that the computational cost increase as the size of the database increases. The load balancing FP-Tree approach divides 2274

3 the itemset based on the evalution of its depth and width. Initially the database is distributed equally among the slave nodes.the support value for each item is calculated and stored in a Local Header Table of each node.the master node finally combines the Local Header Table to form a Global Header Table.The construction of FP-tree from the Local Header Table proceeds by sorting the frequent items based on its order in Global Header Table.The number of items to be mined may vary for each FP-tree and they are divided equally among the nodes based on the evaluation of their load degree.the load degree is determined by the depth and width of the item in its FP-tree.The Master node collects all the mined data from the slave nodes. 2.5.Partitioning Strategies Traditional data partitioning strategy focused primarily on Load Balancing. Thus it partitioned data equally.the frequent Itemsets are mined and they were sorted in their descending order of frequency.this list has been split equally and sent to their respective nodes.in this approach there is no correlation between the data is analyzed hence, the data gets stored in poor data locality.when such data are transferred over the network while processing due to excessive shuffling costs and the time took the efficiency of the parallel processing deteriorated. 3.LITERATURE SURVEY Bahmani, B., Goel, a, & Shinde, R. explains that,the Locality Sensitive Hashing is chosen over MinHash method due to the fact that in MinHash method the hashing value is compared for each and every element in a pair-wise manner. Whereas Locality Sensitive Hashing scans the whole database once and finds the similar transactions. It maps the transactions that are similar to the same bucket [1]. Zhang, C., Li, F., & Jestes, J. suggests that the Table 1. Comparative analysis parallel and distributed computing environment that use number of commodity machines, is a dominating trend. MapReduce was introduced Method Merits Demerits Dist- Eclat 1.Load Balancing 2.DistributedEclat, 1.Cannot Handle Huge Method the limited data sets. candidate set at each node. Apriori Method 1.Frequent Patterns mined effectively. 2.Computation Speed Higher than Eclat. MinHash 1.The similarity between two pairs of Itemsets. 2.Evaluates each and every pair. Locality Sensitive Hashing Parallel FP- Growth tree Voronoi Diagram 1.Scans database once. 2.Maps similar transactions to same buckets. 1.Efficient mining technique for a distributed environment. 2.The depth-first model enables extraction of frequent pattern with less overhead. 1.Similar items are partitioned together. 2.Reduces redundant transactions. 1.Breadth-First search. 2.Candidate Generation consumes time. 1.Multiple scans of the same matrix. 2.Computational overhead due to the pairwise comparison. 1.Effective only on huge datasets. 1.Effective on large datasets 1.Poor selection of pivots leads to inaccuracy. 2275

4 with the goal of providing a simple yet powerful parallel and distributed computing paradigm. The MapReduce architecture also provides good scalability and fault tolerance mechanisms. In past years there has been increasing support for MapReduce from both industry and academia, making it one of the most actively utilized frameworks for parallel and distributed processing of large data today[2]. Chuk Lam explains that the computational necessities of current trend demands large set of data to be processed in a efficient way on a parallel distributed environment. Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide for data storage, data processing, data access, data governance, security, and operations[3]. Curino, C., Jones, E., Zhang, Y., & Madden, proposes a model where the data items in the transactions are partitioned using Range based space keys and Hashing techniques along with graph generation. The frequent itemsets are obtained from the graph and can be partitioned by their hash value. This improves the work-aware data to partition efficiently and the distribution overhead is considerably reduced[4]. Hong, S., Huaxuan, Z., Shiping, C., & Chunyan, H. suggests the use of Improved FP-growth algorithm which generates FP-tree for data or frequent items that satisfies the specified minimum support.doing so reduces the space of the FP-growth tree which enables ease and efficiency in traversal of the FP-tree that also reduces the complexity[5]. Nasreen, S., Azam, M. A., Shehzad, K., Naeem, U., & Ghazanfar, M. A. addresses two main problems with frequent pattern mining techniques. First problem is that the database is scanned many times, second is complex candidate generation process with too many candidate itemset generated. These two problems are efficiency bottleneck in frequent pattern mining. The comparison of various algorithms and best of which can be selected based on context of the application[6]. Lu, W., Shen, Y., Chen, S., & Ooi, B. C. introduces a parallel FP-growth algorithm,it provides a efficient query recommendations or related searches. The usage of parallel FP-growth enables partitioning data based on their frequent item value from FIM algorithms. Partitioning data in a way that it produces independent processing of data computation and therefore the communication between them is also reduced[7]. Li, H., Wang, Y., Zhang, D., Zhang, M., & Chang, E. uses a k nearest neighbour approach.the (knn)k nearest neighbour approach is made use of to find the closeness among the data using MapReduce.K nearest neighbour approach uses voronoi diagram with LSH to partition similar data into the same cell. Voronoi diagram uses pivot points to which the closest data points are going to be mapped. The pivot point selection must be efficient because it determines the overall efficiency[8]. Yang, Q., Du, F., Zhu, X., & Jiang, C. uses the Dist-Eclat method is used to find Frequent Itemsets by distributing the searching space evenly to the mappers.initially it divides the data in the vertical database into equal shards and it is sent to the mappers. The mapper produces frequent singleton and the reducer collects and combines them. The second stage mappers runs Eclat,generates the frequent k-supersets of itemsets and the reducer collects the result and distributes it to all the mappers. The final phase is mining of Prefix tree. It supports large datasets but unfortunately cannot handle huge datasets[9]. Xi zhu, Qing yang, Fei-Yang Du focus on load balancing may lead to lack of correlation however employing Balanced FP-growth algorithm can benefit both in reduction of communication overhead and dividing the group dependent data in balanced manner will help in achieve load balancing as well[11]. Gao, Z., Liu, D., Yang, Y., Zheng, J., & Hao, Y. 2276

5 suggests that the distributed environment can be either a Homogeneous environment or a Heterogeneous environment.the computation time of each node in either environment depends on the amount of data to be processed on each individual nodes.in a Hadoop environment the tasks to each and every nodes are assigned based on their availability.the availability of free slots are determined by the heartbeat mechanism of Hadoop[12]. Irina Bernst,Patrick Bouliin,Jorg Frochte,Christoff Kaufmann proposes an approach of using a support system which can train the model to choose the optimal node where the data should be sent has been proposed. The support system extracts features like number of nodes, computational speed and data transfer rate from each node,it is used in partitioning the data and to train and optimize the model.another view of data partitioning based on Features and proportional partitioning[13]. 4.CONCLUSION The various Frequent Itemset Mining Algorithms like along with the data partitioning techniques that are used traditionally are analyzed with their Merits and demerits.the techniques like Parallel FP-tree growth over Apriori and Locality Sensitive Hashing based partition Strategy improves the locality of partitioned data reducing network and communication overhead.hence, the overall performance of the system can be significantly increased when compared to the traditional strategies. REFERENCES [1] Bahmani, B., Goel, a, & Shinde, R. (2012). Efficient distributed locality sensitive hashing. Proceedings of the 21st ACM International Conference, [2] Zhang, C., Li, F., & Jestes, J. (2012). Efficient parallel knn joins for large data in MapReduce. Proceedings of the 15th International Conference on Extending Database Technology - EDBT 12, 38 [3] Chuk Lam, Hadoop in Action,USA Manning publications Cp.,2010. [4] Curino, C., Jones, E., Zhang, Y., & Madden, S. (2010). Schism: a workload-driven approach to database replication and partitioning. Proceedings of the VLDB Endowment, 3(1 2), [5] Hong, S., Huaxuan, Z., Shiping, C., & Chunyan, H. (2013). The Study of Improved FP-Growth Algorithm in MapReduce. Proceedings of the 1st International Workshop on Cloud Computing and Information Security, (Ccis), [6] Nasreen, S., Azam, M. A., Shehzad, K., Naeem, U., & Ghazanfar, M. A. (2014). Frequent pattern mining algorithms for finding associated frequent patterns for data streams: A survey. Procedia Computer Science, 37, [7] Lu, W., Shen, Y., Chen, S., & Ooi, B. C. (2012). Efficient processing of k nearest neighbor joins using MapReduce. Proceedings of the VLDB Endowment, 5(10), [8] Li, H., Wang, Y., Zhang, D., Zhang, M., & Chang, E. (2008). Mining Frequent Patterns without candidate set generation:a frequent pattern tree approach, Data Mining and Knowledge Discovery, 8, 53 87, 2004 [9] Yang, Q., Du, F., Zhu, X., & Jiang, C. (2016). Improved Balanced Parallel FP-Growth with MapReduce, (Aice). [10] J.Savithri, 2H.Inbarani, Comparative Analysis of K-Means, PSO-K-Means, And Hybrid PSO genetic K-Means For Gene Expression Data, International Journal of Innovations in Scientific and Engineering Research (IJISER), Vol.1, no.1, pp.43-50, [11] Xi zhu, Qing yang, Fei-Yang Du, Improved Balanced Parallel FP-Growth with MapReduce, International Conference on Network and Communication Security (NCS 2016). [12] Gao, Z., Liu, D., Yang, Y., Zheng, J., & Hao, Y. (2014). A load balance algorithm based on nodes performance in Hadoop cluster. APNOMS th Asia-Pacific 2277

6 Network Operations and Management Symposium, 1 4. [13] Irina Bernst,Patrick Bouliin,Jorg Frochte,Christoff Kaufmann. An Approach For Load Balancing For Simulation In Heterogeneous Distributed Systems Using Simulation Data Mining. IADIS International Conference on Applied Computing 2014, At Porto, Portugal, Volume: ISBN ; pages [14] Tamura, M., & Kitsuregawa, M. (1999). Dynamic load balancing for parallel association rule mining on heterogenous pc cluster systems. Proceedings of the 25th International Conference on Very Large Data Bases, [15] Ogunde, A. O., Folorunso, O., & Sodiya, A. S. (2015). A partition enhanced mining algorithm for distributed association rule mining systems. Egyptian Informatics Journal, 16(3), [16] Yu, K., Zhou, J., & Hsiao, W. C. (2007). Load Balancing Approach Parallel Algorithm for Frequent Pattern Mining, International Conference on Parallel Computing Technologies,PaCT 2007: Parallel Computing Technologies pp [17] Li, H., Wang, Y., Zhang, D., Zhang, M., & Chang, E. (2008). PFP : Parallel FP-Growth for Query Recommendation. Proceedings of the 2008 ACM conference on Recommender systems,pages [18] SathishKumar, T., Kavitha, V., & Ravichandran, D. T. (2010). Efficient Tree Based Distributed Data Mining Algorithms for mining Frequent Patterns. s International Journal of Computer Applications, 10(1), [19] Hu, J., & Yang-Li, X. (2008). "A fast parallel association rules mining algorithm based on FP-Forest. Advances in Neural Networks-ISNN 2008,

7 2279

8 2280

Data Partitioning Method for Mining Frequent Itemset Using MapReduce

Data Partitioning Method for Mining Frequent Itemset Using MapReduce 1st International Conference on Applied Soft Computing Techniques 22 & 23.04.2017 In association with International Journal of Scientific Research in Science and Technology Data Partitioning Method for