FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN. School of Computing, SASTRA University, Thanjavur, India

Similar documents
Research Article Apriori Association Rule Algorithms using VMware Environment

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

Efficient Algorithm for Frequent Itemset Generation in Big Data

CLUSTERING BIG DATA USING NORMALIZATION BASED k-means ALGORITHM

Improved Frequent Pattern Mining Algorithm with Indexing

Available online at ScienceDirect. Procedia Computer Science 79 (2016)

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments

Appropriate Item Partition for Improving the Mining Performance

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Mining Distributed Frequent Itemset with Hadoop

Performance Analysis of Apriori Algorithm with Progressive Approach for Mining Data

Association Rule Mining. Introduction 46. Study core 46

Association Rules Mining using BOINC based Enterprise Desktop Grid

Implementation of Aggregation of Map and Reduce Function for Performance Improvisation

A Study on Mining of Frequent Subsequences and Sequential Pattern Search- Searching Sequence Pattern by Subset Partition

An Improved Apriori Algorithm for Association Rules

FIDOOP: PARALLEL MINING OF FREQUENT ITEM SETS USING MAPREDUCE

Improving Efficiency of Parallel Mining of Frequent Itemsets using Fidoop-hd

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern (KTI, TU Graz)

Comparing the Performance of Frequent Itemsets Mining Algorithms

Distributed Face Recognition Using Hadoop

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

, and Zili Zhang. School of Computer and Information Science, Southwest University, Chongqing, China

Improved Balanced Parallel FP-Growth with MapReduce. Qing Yang, Fei-Yang Du, Xi Zhu, Cheng-Gong Jiang

Enhanced Hadoop with Search and MapReduce Concurrency Optimization

Keywords Hadoop, Map Reduce, K-Means, Data Analysis, Storage, Clusters.

Comparison of FP tree and Apriori Algorithm

Data Partitioning Method for Mining Frequent Itemset Using MapReduce

Research of Improved FP-Growth (IFP) Algorithm in Association Rules Mining

Utility Mining Algorithm for High Utility Item sets from Transactional Databases

Searching frequent itemsets by clustering data: towards a parallel approach using MapReduce

Web Page Classification using FP Growth Algorithm. Akansha Garg, Computer Science Department, Swami Vivekanand Subharti University, Meerut, India

Data Analysis Using MapReduce in Hadoop Environment

A Graph-Based Approach for Mining Closed Large Itemsets

Originally developed at the University of California - Berkeley's AMPLab

Mining of Web Server Logs using Extended Apriori Algorithm

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

Survey Paper on Traditional Hadoop and Pipelined Map Reduce

A Technical Analysis of Market Basket by using Association Rule Mining and Apriori Algorithm

An Algorithm for Mining Frequent Itemsets from Library Big Data

HADOOP FRAMEWORK FOR BIG DATA

The Transpose Technique to Reduce Number of Transactions of Apriori Algorithm

MySQL Data Mining: Extending MySQL to support data mining primitives (demo)

Databases 2 (VU)

Efficient Mining of Generalized Negative Association Rules

Fast and Effective System for Name Entity Recognition on Big Data

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

EFFICIENT ALGORITHM FOR MINING FREQUENT ITEMSETS USING CLUSTERING TECHNIQUES

Available online at ScienceDirect. Procedia Computer Science 89 (2016)

Document Clustering with Map Reduce using Hadoop Framework

Mitigating Data Skew Using Map Reduce Application

Churn Prediction Using MapReduce and HBase

Mining Top-K Association Rules. Philippe Fournier-Viger, Cheng-Wei Wu, Vincent Shin-Mu Tseng. University of Moncton, Canada

Parallel Approach for Implementing Data Mining Algorithms

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallelizing Frequent Itemset Mining with FP-Trees

Optimization using Ant Colony Algorithm

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

Performance Evaluation of Sequential and Parallel Mining of Association Rules using Apriori Algorithms

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets

A comparative study of Frequent pattern mining Algorithms: Apriori and FP Growth on Apache Hadoop

Frequent Pattern Mining in Data Streams. Raymond Martin

A Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique

Data Platforms and Pattern Mining

A Survey on Apriori algorithm using MapReduce Technique

An Improved Performance Evaluation on Large-Scale Data using MapReduce Technique

Research and Improvement of Apriori Algorithm Based on Hadoop

FP-Growth algorithm in Data Compression frequent patterns

An improved MapReduce Design of Kmeans for clustering very large datasets

An Algorithm of Association Rule Based on Cloud Computing

Application-Aware SDN Routing for Big-Data Processing

Parallel Implementation of Apriori Algorithm Based on MapReduce

The Analysis and Implementation of the K - Means Algorithm Based on Hadoop Platform

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

Improved MapReduce k-means Clustering Algorithm with Combiner

I ++ Mapreduce: Incremental Mapreduce for Mining the Big Data

Salah Alghyaline, Jun-Wei Hsieh, and Jim Z. C. Lai

Performance Based Study of Association Rule Algorithms On Voter DB

Research and Application of E-Commerce Recommendation System Based on Association Rules Algorithm

IMPLEMENTATION AND COMPARATIVE STUDY OF IMPROVED APRIORI ALGORITHM FOR ASSOCIATION PATTERN MINING

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

Batch Inherence of Map Reduce Framework

Introduction to MapReduce (cont.)

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

A SURVEY ON SIMPLIFIED PARALLEL DATA PROCESSING ON LARGE WEIGHTED ITEMSET USING MAPREDUCE

Clustering Lecture 8: MapReduce

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining

Mining High Utility Itemsets in Big Data

ISSN: (Online) Volume 2, Issue 7, July 2014 International Journal of Advance Research in Computer Science and Management Studies

Association Rule Mining from XML Data

A mining method for tracking changes in temporal association rules from an encoded database

Database Applications (15-415)

MapReduce: Recap. Juliana Freire & Cláudio Silva. Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec

Global Journal of Engineering Science and Research Management

Improving the MapReduce Big Data Processing Framework


Volume 115 No. 7 2017, 105-110
ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu

FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN

Balaji N 1, Umamakeswari A 2*, Ezhilarasie R 3
1 M.S. Computer Science, University of Missouri-Kansas City, USA
2 School of Computing, SASTRA University, Thanjavur-613401, India; 2* aum@cse.sastra.edu
3 School of Computing, SASTRA University, Thanjavur-613401, India

Abstract: Frequent pattern mining is the technique used to mine item sets from transactions. In big data, however, the large volume of data raises two new challenges: space complexity and time complexity. To address them, a frequent item set mining method extracts frequent item sets from iteratively drawn samples rather than processing the whole data set. The small sets of extracted frequent patterns can then be parallelized and applied to mining sequences, item sets, and structures. For parallelization, many existing algorithms such as Apriori and FP-Growth have been improved and applied, but they become unsuitable as the data size keeps increasing. A Map-Reduce framework is therefore implemented to demonstrate its scalability for big data. Here, customer reviews together with ratings and page hits are used to prepare the data sets.

Keywords: Frequent Mining, MapReduce, Hadoop, Apriori, FP-Growth, Maven plugin

1. Introduction

Traditionally, frequent pattern mining is used to mine the item sets that occur frequently in transactions and thereby extract the most used sequences. For every kind of pattern, tremendous improvements have been made using FP-Growth and Apriori. In big data analytics, the volume of data increases exponentially, and this large volume brings in space complexity and time complexity, both of which are challenging for frequent pattern mining. Regarding space complexity, the input data, the intermediate results, and the obtained patterns are so large that they cannot fit into memory, which prevents many algorithms from executing. Regarding time complexity, repetitive search over complex data structures has proved inappropriate for mining patterns in big data. One solution is to raise the frequency threshold, which reduces the number of patterns.

To process big data, techniques have been proposed that parallelize the data sets by improving the Apriori and FP-Growth algorithms, but these lack fault tolerance, automatic parallelization, load balancing, and data distribution, especially on large clusters. Because the parallelization approach alone cannot deal with very large data, the Iterative Sampling based Frequent Item set Mining (ISbFIM) approach is proposed, which divides the data into manageable subsets and extracts the frequent patterns from these subsets with a higher threshold. Apache Hadoop is a well-suited platform for processing big data: Hadoop Map-Reduce is a software framework that processes large amounts of data on a large set of clusters in a parallel manner. The Hadoop framework helps the user to write and test distributed file systems, and it does not depend on special hardware to provide fault tolerance and high availability. Map-Reduce is mainly concerned with parallel processing of data sets. Data sets are important for training and testing many information processing applications; a data set is a collection of data used to write and test new systems under development.
The quality and nature of the data sets depend on both the type of application and the choice of domain. Hadoop takes care of monitoring and scheduling tasks, and failed tasks are re-executed. It follows a master-slave architecture, with a single master and multiple slaves per cluster node: each master works in parallel by allocating portions of the file system to the slaves and gathering their output. The framework consists of two tasks. Map task: it takes the input data, breaks each element into tuples, namely key/value pairs, and hands them to the shuffling and grouping process. Reduce task: it takes the key/value pairs produced by the map task and combines them into a smaller set of tuples, reducing the output to what the user needs. Parallelization is thus very effective in the Map-Reduce framework.
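As a minimal sketch of these two tasks (assuming the Hadoop Java API; the class names and the one-transaction-per-line input format are illustrative, not taken from the paper), a mapper can emit an (item, 1) pair for every item in a transaction, and a reducer can combine the grouped pairs into one count per item:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map task: break each input record into (item, 1) key/value pairs,
    // which the framework then shuffles and groups by item.
    public class ItemCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text item = new Text();

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            // Assumption: one transaction per line, items separated by whitespace.
            for (String token : record.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                item.set(token);
                context.write(item, ONE);
            }
        }
    }

    // Reduce task: combine the grouped pairs into a single tuple per item.
    class ItemCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text item, Iterable<IntWritable> counts,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(item, new IntWritable(sum));
        }
    }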

2. Related Works

Agrawal and Srikant observed that many algorithms exist for frequent mining, such as Apriori and FP-Growth. To make these algorithms suitable for large item sets, i.e., for large database transactions, they proposed the AprioriHybrid algorithm, which outperforms the basic Apriori algorithm, although the time taken for executing the queries still grows as the problem size increases. Their experimental results show that AprioriHybrid scales well with an increasing number of transactions [1]. Agrawal and Srikant later proposed Generalized Sequential Patterns (GSP), an algorithm that discovers generalized sequential patterns. Evaluation on synthetic data indicates that GSP is more effective than the Apriori-based approach presented in [3]; its runtime is directly proportional to the number of sequences, and it has good scale-up properties with respect to data size [2].

Some traditional frequent item set mining algorithms cannot handle masses of small data sets, suffering high memory cost, low computing performance, and high I/O overhead. An Improved Parallel FP-Growth (IPFP) algorithm has been proposed to address this [3]. In particular, a small-files processing strategy reduces the low read/write speed and low processing efficiency of Hadoop on small files, overcoming a drawback of FP-Growth, and the use of Map-Reduce improves the overall performance of frequent item set mining because it provides the parallelization. The experimental results show that the IPFP algorithm is applicable and meets the needs of frequent item set mining for large data sets consisting of small files [3].

Large databases cannot fit in main memory, which results in space complexity. Moreover, the memory needed to handle the entire set of frequent item sets grows quickly, showing that Apriori-based algorithms are inefficient on single machines. Present approaches try to keep the runtime and output under control by raising the minimum threshold, thereby decreasing the number of candidate and frequent item sets [7]. Mining maximal frequent item sets is NP-hard; whether a polynomial algorithm for mining maximal frequent item sets is feasible remains an open problem, and complexity techniques are still needed to settle it. As the number of candidates increases, acceptable time complexity is difficult to achieve [4].

To mine frequent item sets from large data sets, two item set mining algorithms for Map-Reduce have been implemented. Dist-Eclat targets speed with a simple load-balancing scheme based on k-frequent item sets (k-FIs); BigFIM concentrates on mining very large data sets with a hybrid approach, in which the k-FIs are first mined and the found frequent item sets are distributed to the mappers, which then continue mining with Eclat [5]. Applying frequent item set mining [6] to large databases is problematic, and it has been a main focus of research over the past twenty years.
In the domain of big data, the enormous volume of data brings challenges to frequent pattern mining. To save storage space and to avoid generating conditional pattern bases, FiDoop uses a frequent-items ultrametric tree instead of conventional FP-Growth trees. FiDoop on a Hadoop cluster is, however, sensitive to data distribution, because item sets of different lengths have different construction and decomposition costs.

Frequent Item set Mining (FIM) is an important part of data mining and data analysis: the frequent mining technique collects the frequently occurring information from the events in a data set. Parallel mining algorithms for frequent item sets lack a mechanism that provides load balancing, automatic parallelization, fault tolerance, and data distribution on very large clusters. The Iterative Sampling based Frequent Item set Mining (ISbFIM) method is a proposed framework that aims to extract frequent item sets. Taking a web search log as an example of big data, the extracted set of patterns is so huge that it cannot fit into memory; to shrink the pattern volume, one can either decrease the size of the data or increase the threshold. Smaller samples are taken from the entire data set, and the patterns within each sample are obtained with a raised threshold. When the volume of the input is reduced and the support threshold is suitably increased, the space and time complexity of frequent pattern mining on every subset becomes manageable [7]. The algorithm thus deals with smaller parts of the data set, and these smaller parts can be processed in parallel to obtain the output. An efficient method for this parallelization is Map-Reduce, which should offer both efficiency and ease of use for data mining methods; the Map-Reduce framework in Hadoop follows a master-slave architecture that enables effective and fast parallelization.
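To make the sampling idea concrete, here is a minimal self-contained sketch of iterative sampling for frequent-item mining. It is restricted to 1-itemsets for brevity, and the sample size, iteration count, and raised support threshold are illustrative assumptions, not values or code from the ISbFIM work:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;
    import java.util.Set;

    // Illustrative sketch of iterative-sampling frequent-item mining,
    // restricted to 1-itemsets. All parameters are assumptions.
    public class IterativeSamplingFIM {

        public static Set<String> mine(List<List<String>> transactions,
                                       int sampleSize, int iterations,
                                       double raisedSupport, long seed) {
            Random rnd = new Random(seed);
            Set<String> frequent = new HashSet<>();
            for (int i = 0; i < iterations; i++) {
                // Draw a manageable sample instead of scanning the whole data set.
                List<List<String>> sample = new ArrayList<>();
                for (int j = 0; j < sampleSize; j++) {
                    sample.add(transactions.get(rnd.nextInt(transactions.size())));
                }
                // Count, within the sample, how many transactions contain each item.
                Map<String, Integer> counts = new HashMap<>();
                for (List<String> t : sample) {
                    for (String item : new HashSet<>(t)) {
                        counts.merge(item, 1, Integer::sum);
                    }
                }
                // Keep items meeting the raised threshold; the union over all
                // iterations approximates the globally frequent items.
                int minCount = (int) Math.ceil(raisedSupport * sample.size());
                counts.forEach((item, c) -> { if (c >= minCount) frequent.add(item); });
            }
            return frequent;
        }
    }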

Figure 1. Map Reduce Framework (whole data set → sample-sets generator → mappers → reducers → ranking by global IG → extracted patterns)

3. Methodology

Mining the item sets that occur frequently in transactions uses the frequent pattern mining technique. The domain of big data presents challenges such as space complexity and time complexity in frequent pattern mining, owing to the ever-increasing data volume. Space complexity: the input data, the intermediary results, and the received patterns are so large that they may not fit into memory, which can prevent some algorithms from executing. Time complexity: recursive search over complex data structures may prove inadequate for pattern mining in big data. The Map-Reduce framework shown in Fig. 1 is introduced to overcome both. A Map-Reduce framework is implemented to divide the data sets: data are taken from different views, such as customer ratings, reviews, and page hits obtained for a particular site, to support easy retrieval of effective data. Based on the combination of these different views, an effective search result is obtained and suggested to the customer. The main feature of this Map-Reduce framework is scalability; data sets of a few MB are taken for the implementation, and since scalability is supported, the approach can be extended to whatever data volume is present. Moreover, the Hadoop framework allows the user to quickly write and test against the data set. The workflow model is shown in Fig. 2.

Figure 2. Work Flow Model

4. Experimental Setup

The first step after installing Hadoop is to kick-start a Maven project and add the Eclipse artifacts from an Eclipse installation to the local repository. This automatically analyzes the Eclipse directory, copies the plug-in JARs to the local Maven repository, and generates appropriate POMs; as the official central repository builder for Eclipse plug-ins, it has the necessary default values. The Maven Eclipse Plugin is used to generate the Eclipse IDE files. Hadoop Map-Reduce is a software framework that processes large amounts of data on a large set of clusters in a parallel manner. A MapReduce job usually splits the input data set (files) into independent chunks, which are processed by the map tasks in parallel; the output of the mapper is a set of key-value pairs. The framework sorts the outputs of the maps, which are then input to the reduce tasks, and it also takes care of monitoring, scheduling, and re-executing unsuccessful tasks.
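A sketch of the job driver such a Maven-built project would typically contain is shown below; it reuses the illustrative ItemCountMapper and ItemCountReducer from Section 1, and the command-line input/output paths are an assumption:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver: configures the job. Hadoop itself handles splitting the
    // input into chunks, scheduling, monitoring, and re-executing
    // failed tasks, as described above.
    public class FrequentItemJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "frequent-item-count");
            job.setJarByClass(FrequentItemJob.class);
            job.setMapperClass(ItemCountMapper.class);
            job.setReducerClass(ItemCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged with mvn package, the resulting JAR would then typically be submitted with hadoop jar, passing the input and output directories as arguments.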
The data sets are collected, and the data set for each category is given to the mappers. The mappers divide each record, and each record is then shuffled to extract only the needed columns. After the shuffling phase, grouping and sorting of the data sets is done. In the grouping and sorting phase, the primary key of one data set is compared with the foreign key of the other data set; if a match is present, the two records are grouped together and sorted in order. The output of the mapper, in the form of key-value pairs, is given to the reducer.
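This grouping step amounts to a reduce-side join on the product ID. In the hedged sketch below, each mapper is assumed to tag its records with the data set they came from ("R:" for the ratings data set, "P:" for the price data set); the tagging convention and class name are illustrative, not the paper's:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Reduce-side join on product ID. Each value is assumed to be tagged
    // by its mapper with "R:" (ratings data set) or "P:" (price data set).
    public class ProductJoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text productId, Iterable<Text> tagged,
                              Context context)
                throws IOException, InterruptedException {
            String rating = null, price = null;
            for (Text value : tagged) {
                String v = value.toString();
                if (v.startsWith("R:")) rating = v.substring(2);
                else if (v.startsWith("P:")) price = v.substring(2);
            }
            // Emit a single grouped tuple only when the product ID appears
            // in both data sets (primary key matches foreign key).
            if (rating != null && price != null) {
                context.write(productId, new Text(rating + "\t" + price));
            }
        }
    }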

In the final phase, only the columns that users frequently access are stored, giving the customer easy access.

5. Results and Discussion

A search page for finding product details has to be created, in which a product can be searched effectively based on its ratings and price. A Java applet is designed to display the contents needed for the search page. The input from the user is collected, and the search proceeds according to the product name. First, the product is looked up in the data sets; if the product name does not exist, a search is made for entries containing the name as a substring. If this also fails, a product with the user-mentioned name does not exist in the data set, and a message is displayed that the product the user is searching for was not found. For the product name given in the search box, the product is searched in the included data set and its details are returned. A threshold is set for every product's price and ratings; based on the threshold, the admin instructs the user whether or not to purchase the product. The search time to retrieve the data from the data set is calculated and reported in the console, in milliseconds, each time the user searches.

The data sets hold the ratings and the price details of products such as books, mobile phones, and accessories; two separate data sets are used, one for ratings and one for price details. Based on the input given, the data sets are divided into groups by the mapper, which turns the item sets into key-value pairs. The output of the mapper is passed to the reducer, which compares the primary key of one data set with the other data set and, when the keys are equal, groups the records into a single tuple. Once the given product name is found in the data set, the corresponding product ID is taken; this ID acts as the primary key of one data set and retrieves the details from the other data set, where it acts as a foreign key. The price and the ratings of the product are retrieved through this ID, and the values are displayed in the result-page module. The search time is calculated as soon as the search button is clicked in the first module and then displayed.
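The lookup logic described above (an exact name match first, then a substring fallback, with the elapsed time reported in milliseconds) might look like the following sketch; the Product record and the in-memory catalog map are illustrative stand-ins for the data retrieved through the Map-Reduce job:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative search: exact product-name match first, then a
    // substring fallback; elapsed time is reported in milliseconds.
    public class ProductSearch {
        record Product(String name, double price, double rating) {}

        static Product find(Map<String, Product> catalog, String query) {
            long start = System.currentTimeMillis();
            Product hit = catalog.get(query);            // exact match
            if (hit == null) {                           // substring fallback
                for (Map.Entry<String, Product> e : catalog.entrySet()) {
                    if (e.getKey().contains(query)) { hit = e.getValue(); break; }
                }
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println("Search time: " + elapsed + " ms");
            if (hit == null) System.out.println("Product '" + query + "' not found.");
            return hit;
        }

        public static void main(String[] args) {
            Map<String, Product> catalog = new LinkedHashMap<>();
            catalog.put("galaxy phone", new Product("galaxy phone", 299.0, 4.2));
            catalog.put("data mining book", new Product("data mining book", 45.0, 4.6));
            find(catalog, "mining");   // matches via the substring fallback
        }
    }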
6. Conclusion and Future Work

Data sets from different categories, covering different views, were given to the mapper to mine the frequent item sets and to search the required information effectively from a bulk volume of data. The implementation uses the Map-Reduce concept, which mines the data sets for the required information in less search time than procedures such as FP-Growth and Apriori, which lag in space usage and execution time. To overcome these difficulties, a search algorithm using the Map-Reduce concept is implemented in which two categories of data sets on different products such as mobile phones, accessories, and books are taken, covering the product price and the product ratings along with all the basic information about the product. For each search item, the data is passed through the map and reduce phases, the required information about the product is extracted from the large volume of data using the primary key, and the result is displayed in the result page. The time taken to search for the product in the data set is displayed and compares favorably with the FP-Growth and Apriori algorithms. Because Map-Reduce runs on Hadoop, it supports scalability, which solves the problem of space complexity. In future work, generalized products such as accessories and their details can be taken, and joins between different tables in the generalized categories can be implemented via the Maven plug-in.

References

[1] Agrawal R, Srikant R, Fast algorithms for mining association rules in large databases, Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), Santiago, Chile, pp. 487-499, 1994.
[2] Agrawal R, Srikant R, Mining sequential patterns, Proceedings of the International Conference on Data Engineering (ICDE '95), Taipei, Taiwan, pp. 3-14, 1995.
[3] Agrawal R, Shafer JC, Parallel mining of association rules, IEEE Transactions on Knowledge and Data Engineering, 8, pp. 962-969, 1996.
[4] Yang G, Computational aspects of mining maximal frequent patterns, Theoretical Computer Science, 362(1-3), pp. 63-85, 2006.
[5] Anastasiu DC, Iverson J, Smith S, Karypis G, Big data frequent pattern mining, in Frequent Pattern Mining, Springer International, pp. 225-259, 2014.
[6] Cheng H, Yan X, Han J, Hsu CW, Discriminative frequent pattern analysis for effective classification, Proceedings of the International Conference on Data Engineering (ICDE), pp. 716-725, 2007.
[7] Hill S, Srichandan B, Sunderraman R, An iterative MapReduce approach to frequent subgraph mining in biological datasets, Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine (ACM BCB '12), New York, pp. 661-666, 2012.
[8] Shunmugapriya P, Maragatham R, FiDoop-DP: data partitioning in frequent itemset mining on Hadoop clusters, International Innovative Research Journal of Engineering and Technology, 2017.
