FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN. School of Computing, SASTRA University, Thanjavur, India

Similar documents
Research Article Apriori Association Rule Algorithms using VMware Environment

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

Efficient Algorithm for Frequent Itemset Generation in Big Data

CLUSTERING BIG DATA USING NORMALIZATION BASED k-means ALGORITHM

Improved Frequent Pattern Mining Algorithm with Indexing

Available online at ScienceDirect. Procedia Computer Science 79 (2016)

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments

Appropriate Item Partition for Improving the Mining Performance

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Mining Distributed Frequent Itemset with Hadoop

Performance Analysis of Apriori Algorithm with Progressive Approach for Mining Data

Association Rule Mining. Introduction 46. Study core 46

Association Rules Mining using BOINC based Enterprise Desktop Grid

Implementation of Aggregation of Map and Reduce Function for Performance Improvisation

A Study on Mining of Frequent Subsequences and Sequential Pattern Search- Searching Sequence Pattern by Subset Partition

An Improved Apriori Algorithm for Association Rules

FIDOOP: PARALLEL MINING OF FREQUENT ITEM SETS USING MAPREDUCE

Improving Efficiency of Parallel Mining of Frequent Itemsets using Fidoop-hd

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern (KTI, TU Graz)

Comparing the Performance of Frequent Itemsets Mining Algorithms

Distributed Face Recognition Using Hadoop

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

, and Zili Zhang. School of Computer and Information Science, Southwest University, Chongqing, China

Improved Balanced Parallel FP-Growth with MapReduce. Qing Yang, Fei-Yang Du, Xi Zhu, Cheng-Gong Jiang

Enhanced Hadoop with Search and MapReduce Concurrency Optimization

Keywords Hadoop, Map Reduce, K-Means, Data Analysis, Storage, Clusters.

Comparison of FP tree and Apriori Algorithm

Data Partitioning Method for Mining Frequent Itemset Using MapReduce

Research of Improved FP-Growth (IFP) Algorithm in Association Rules Mining

Utility Mining Algorithm for High Utility Item sets from Transactional Databases

Searching frequent itemsets by clustering data: towards a parallel approach using MapReduce

Web Page Classification using FP Growth Algorithm. Akansha Garg, Computer Science Department, Swami Vivekanand Subharti University, Meerut, India

Data Analysis Using MapReduce in Hadoop Environment

A Graph-Based Approach for Mining Closed Large Itemsets

Originally developed at the University of California - Berkeley's AMPLab

Mining of Web Server Logs using Extended Apriori Algorithm

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

Survey Paper on Traditional Hadoop and Pipelined Map Reduce

A Technical Analysis of Market Basket by using Association Rule Mining and Apriori Algorithm

An Algorithm for Mining Frequent Itemsets from Library Big Data

HADOOP FRAMEWORK FOR BIG DATA

The Transpose Technique to Reduce Number of Transactions of Apriori Algorithm

MySQL Data Mining: Extending MySQL to support data mining primitives (demo)

Databases 2 (VU)

Efficient Mining of Generalized Negative Association Rules

Fast and Effective System for Name Entity Recognition on Big Data

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

EFFICIENT ALGORITHM FOR MINING FREQUENT ITEMSETS USING CLUSTERING TECHNIQUES

Available online at ScienceDirect. Procedia Computer Science 89 (2016)

Document Clustering with Map Reduce using Hadoop Framework

Mitigating Data Skew Using Map Reduce Application

Churn Prediction Using MapReduce and HBase

Mining Top-K Association Rules. Philippe Fournier-Viger, Cheng-Wei Wu, Vincent Shin-Mu Tseng. University of Moncton, Canada

Parallel Approach for Implementing Data Mining Algorithms

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallelizing Frequent Itemset Mining with FP-Trees

Optimization using Ant Colony Algorithm

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

Performance Evaluation of Sequential and Parallel Mining of Association Rules using Apriori Algorithms

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets

A comparative study of Frequent pattern mining Algorithms: Apriori and FP Growth on Apache Hadoop

Frequent Pattern Mining in Data Streams. Raymond Martin

A Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique

Data Platforms and Pattern Mining

A Survey on Apriori algorithm using MapReduce Technique

An Improved Performance Evaluation on Large-Scale Data using MapReduce Technique

Research and Improvement of Apriori Algorithm Based on Hadoop

FP-Growth algorithm in Data Compression frequent patterns

An improved MapReduce Design of Kmeans for clustering very large datasets

An Algorithm of Association Rule Based on Cloud Computing

Application-Aware SDN Routing for Big-Data Processing

Parallel Implementation of Apriori Algorithm Based on MapReduce

The Analysis and Implementation of the K - Means Algorithm Based on Hadoop Platform

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

Improved MapReduce k-means Clustering Algorithm with Combiner

I ++ Mapreduce: Incremental Mapreduce for Mining the Big Data

Salah Alghyaline, Jun-Wei Hsieh, and Jim Z. C. Lai

Performance Based Study of Association Rule Algorithms On Voter DB

Research and Application of E-Commerce Recommendation System Based on Association Rules Algorithm

IMPLEMENTATION AND COMPARATIVE STUDY OF IMPROVED APRIORI ALGORITHM FOR ASSOCIATION PATTERN MINING

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

Batch Inherence of Map Reduce Framework

Introduction to MapReduce (cont.)

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

A SURVEY ON SIMPLIFIED PARALLEL DATA PROCESSING ON LARGE WEIGHTED ITEMSET USING MAPREDUCE

Clustering Lecture 8: MapReduce

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining

Mining High Utility Itemsets in Big Data

ISSN: (Online) Volume 2, Issue 7, July 2014 International Journal of Advance Research in Computer Science and Management Studies

Association Rule Mining from XML Data

A mining method for tracking changes in temporal association rules from an encoded database

Database Applications (15-415)

MapReduce: Recap. Juliana Freire & Cláudio Silva. Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec

Global Journal of Engineering Science and Research Management

Improving the MapReduce Big Data Processing Framework


Volume 115 No. 7 2017, 105-110
ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu

FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN

Balaji N 1, Umamakeswari A 2*, Ezhilarasie R 3
1 M.S. Computer Science, University of Missouri-Kansas City, USA
2 School of Computing, SASTRA University, Thanjavur-613401, India; 2* aum@cse.sastra.edu
3 School of Computing, SASTRA University, Thanjavur-613401, India

Abstract: Frequent pattern mining is the technique used to mine item sets from transactions. In big data, however, the large volume of data raises two new challenges: space complexity and time complexity. To address them, a frequent item set mining method extracts frequent item sets from iteratively drawn samples rather than processing the whole data set. The small sets of extracted frequent patterns can then be parallelized and applied to mining sequences, item sets, and structures. For parallelization, many existing algorithms such as Apriori and FP-Growth have been improved and applied, but they become unsuitable as the data size keeps increasing. A Map-Reduce framework is therefore implemented to demonstrate its scalability for big data. Here, customer reviews together with ratings and page hits are used to prepare the data sets.

Keywords: Frequent Mining, MapReduce, Hadoop, Apriori, FP-Growth, Maven plugin

1. Introduction

Traditionally, frequent pattern mining is used to mine the item sets that occur frequently in transactions and thereby extract the most used sequences. For every kind of pattern, tremendous improvements have been made using FP-Growth and Apriori. In big data analytics, the volume of data increases exponentially, and this large volume brings in space complexity and time complexity, both of which are challenging for frequent pattern mining. Regarding space complexity, the input data, the intermediate results, and the obtained patterns are so large that they cannot fit into memory, which prevents many algorithms from executing. Regarding time complexity, repetitive search over complex data structures has proved inappropriate for mining patterns in big data. One solution is to raise the frequency threshold, which reduces the number of patterns.

To process big data, techniques have been proposed that parallelize the data sets by improving the Apriori and FP-Growth algorithms, but these lack fault tolerance, automatic parallelization, load balancing, and data distribution, especially on large clusters. Because the parallelization approach alone cannot deal with very large data, the Iterative Sampling based Frequent Item set Mining (ISbFIM) approach is proposed, which divides the data into manageable subsets and extracts the frequent patterns from these subsets with a higher threshold. Apache Hadoop is a well-suited platform for processing big data: Hadoop Map-Reduce is a software framework that processes large amounts of data on a large set of clusters in a parallel manner. The Hadoop framework helps the user to write and test distributed file systems, and it does not depend on special hardware to provide fault tolerance and high availability. Map-Reduce is mainly concerned with parallel processing of data sets. Data sets are important for training and testing many information processing applications; a data set is a collection of data used to write and test new systems under development.
The quality and nature of the data sets depend on both the type of application and the choice of domain. Hadoop takes care of monitoring and scheduling tasks, and failed tasks are re-executed. It follows a master-slave architecture, with a single master and multiple slaves per cluster node: each master works in parallel by allocating portions of the file system to the slaves and gathering their output. The framework consists of two tasks. Map task: it takes the input data, breaks each element into tuples, namely key/value pairs, and hands them to the shuffling and grouping process. Reduce task: it takes the key/value pairs produced by the map task and combines them into a smaller set of tuples, reducing the output to what the user needs. Parallelization is thus very effective in the Map-Reduce framework.
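As a minimal sketch of these two tasks (assuming the Hadoop Java API; the class names and the one-transaction-per-line input format are illustrative, not taken from the paper), a mapper can emit an (item, 1) pair for every item in a transaction, and a reducer can combine the grouped pairs into one count per item:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map task: break each input record into (item, 1) key/value pairs,
    // which the framework then shuffles and groups by item.
    public class ItemCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text item = new Text();

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            // Assumption: one transaction per line, items separated by whitespace.
            for (String token : record.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                item.set(token);
                context.write(item, ONE);
            }
        }
    }

    // Reduce task: combine the grouped pairs into a single tuple per item.
    class ItemCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text item, Iterable<IntWritable> counts,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(item, new IntWritable(sum));
        }
    }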

2. Related Works

Agrawal and Srikant observed that many algorithms exist for frequent mining, such as Apriori and FP-Growth. To make these algorithms suitable for large item sets, i.e., for large database transactions, they proposed the AprioriHybrid algorithm, which outperforms the basic Apriori algorithm, although the time taken for executing the queries still grows as the problem size increases. Their experimental results show that AprioriHybrid scales well with an increasing number of transactions [1]. Agrawal and Srikant later proposed Generalized Sequential Patterns (GSP), an algorithm that discovers generalized sequential patterns. Evaluation on synthetic data indicates that GSP is more effective than the Apriori-based approach presented in [3]; its runtime is directly proportional to the number of sequences, and it has good scale-up properties with respect to data size [2].

Some traditional frequent item set mining algorithms cannot handle masses of small data sets, suffering high memory cost, low computing performance, and high I/O overhead. An Improved Parallel FP-Growth (IPFP) algorithm has been proposed to address this [3]. In particular, a small-files processing strategy reduces the low read/write speed and low processing efficiency of Hadoop on small files, overcoming a drawback of FP-Growth, and the use of Map-Reduce improves the overall performance of frequent item set mining because it provides the parallelization. The experimental results show that the IPFP algorithm is applicable and meets the needs of frequent item set mining for large data sets consisting of small files [3].

Large databases cannot fit in main memory, which results in space complexity. Moreover, the memory needed to handle the entire set of frequent item sets grows quickly, showing that Apriori-based algorithms are inefficient on single machines. Present approaches try to keep the runtime and output under control by raising the minimum threshold, thereby decreasing the number of candidate and frequent item sets [7]. Mining maximal frequent item sets is NP-hard; whether a polynomial algorithm for mining maximal frequent item sets is feasible remains an open problem, and complexity techniques are still needed to settle it. As the number of candidates increases, acceptable time complexity is difficult to achieve [4].

To mine frequent item sets from large data sets, two item set mining algorithms for Map-Reduce have been implemented. Dist-Eclat targets speed with a simple load-balancing scheme based on k-frequent item sets (k-FIs); BigFIM concentrates on mining very large data sets with a hybrid approach, in which the k-FIs are first mined and the found frequent item sets are distributed to the mappers, which then continue mining with Eclat [5]. Applying frequent item set mining [6] to large databases is problematic, and it has been a main focus of research over the past twenty years.
In the domain of big data, the enormous volume of data brings challenges to frequent pattern mining. To save storage space and to avoid generating conditional pattern bases, FiDoop uses a frequent-items ultrametric tree instead of conventional FP-Growth trees. FiDoop on a Hadoop cluster is, however, sensitive to data distribution, because item sets of different lengths have different construction and decomposition costs.

Frequent Item set Mining (FIM) is an important part of data mining and data analysis: the frequent mining technique collects the frequently occurring information from the events in a data set. Parallel mining algorithms for frequent item sets lack a mechanism that provides load balancing, automatic parallelization, fault tolerance, and data distribution on very large clusters. The Iterative Sampling based Frequent Item set Mining (ISbFIM) method is a proposed framework that aims to extract frequent item sets. Taking a web search log as an example of big data, the extracted set of patterns is so huge that it cannot fit into memory; to shrink the pattern volume, one can either decrease the size of the data or increase the threshold. Smaller samples are taken from the entire data set, and the patterns within each sample are obtained with a raised threshold. When the volume of the input is reduced and the support threshold is suitably increased, the space and time complexity of frequent pattern mining on every subset becomes manageable [7]. The algorithm thus deals with smaller parts of the data set, and these smaller parts can be processed in parallel to obtain the output. An efficient method for this parallelization is Map-Reduce, which should offer both efficiency and ease of use for data mining methods; the Map-Reduce framework in Hadoop follows a master-slave architecture that enables effective and fast parallelization.
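To make the sampling idea concrete, here is a minimal self-contained sketch of iterative sampling for frequent-item mining. It is restricted to 1-itemsets for brevity, and the sample size, iteration count, and raised support threshold are illustrative assumptions, not values or code from the ISbFIM work:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;
    import java.util.Set;

    // Illustrative sketch of iterative-sampling frequent-item mining,
    // restricted to 1-itemsets. All parameters are assumptions.
    public class IterativeSamplingFIM {

        public static Set<String> mine(List<List<String>> transactions,
                                       int sampleSize, int iterations,
                                       double raisedSupport, long seed) {
            Random rnd = new Random(seed);
            Set<String> frequent = new HashSet<>();
            for (int i = 0; i < iterations; i++) {
                // Draw a manageable sample instead of scanning the whole data set.
                List<List<String>> sample = new ArrayList<>();
                for (int j = 0; j < sampleSize; j++) {
                    sample.add(transactions.get(rnd.nextInt(transactions.size())));
                }
                // Count, within the sample, how many transactions contain each item.
                Map<String, Integer> counts = new HashMap<>();
                for (List<String> t : sample) {
                    for (String item : new HashSet<>(t)) {
                        counts.merge(item, 1, Integer::sum);
                    }
                }
                // Keep items meeting the raised threshold; the union over all
                // iterations approximates the globally frequent items.
                int minCount = (int) Math.ceil(raisedSupport * sample.size());
                counts.forEach((item, c) -> { if (c >= minCount) frequent.add(item); });
            }
            return frequent;
        }
    }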

Figure 1. Map Reduce Framework (whole data set → sample-sets generator → mappers → reducers → ranking by global IG → extracted patterns)

3. Methodology

Mining the item sets that occur frequently in transactions uses the frequent pattern mining technique. The domain of big data presents challenges such as space complexity and time complexity in frequent pattern mining, owing to the ever-increasing data volume. Space complexity: the input data, the intermediary results, and the received patterns are so large that they may not fit into memory, which can prevent some algorithms from executing. Time complexity: recursive search over complex data structures may prove inadequate for pattern mining in big data. The Map-Reduce framework shown in Fig. 1 is introduced to overcome both. A Map-Reduce framework is implemented to divide the data sets: data are taken from different views, such as customer ratings, reviews, and page hits obtained for a particular site, to support easy retrieval of effective data. Based on the combination of these different views, an effective search result is obtained and suggested to the customer. The main feature of this Map-Reduce framework is scalability; data sets of a few MB are taken for the implementation, and since scalability is supported, the approach can be extended to whatever data volume is present. Moreover, the Hadoop framework allows the user to quickly write and test against the data set. The workflow model is shown in Fig. 2.

Figure 2. Work Flow Model

4. Experimental Setup

The first step after installing Hadoop is to kick-start a Maven project and add the Eclipse artifacts from an Eclipse installation to the local repository. This automatically analyzes the Eclipse directory, copies the plug-in JARs to the local Maven repository, and generates appropriate POMs; as the official central repository builder for Eclipse plug-ins, it has the necessary default values. The Maven Eclipse Plugin is used to generate the Eclipse IDE files. Hadoop Map-Reduce is a software framework that processes large amounts of data on a large set of clusters in a parallel manner. A MapReduce job usually splits the input data set (files) into independent chunks, which are processed by the map tasks in parallel; the output of the mapper is a set of key-value pairs. The framework sorts the outputs of the maps, which are then input to the reduce tasks, and it also takes care of monitoring, scheduling, and re-executing unsuccessful tasks.
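A sketch of the job driver such a Maven-built project would typically contain is shown below; it reuses the illustrative ItemCountMapper and ItemCountReducer from Section 1, and the command-line input/output paths are an assumption:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver: configures the job. Hadoop itself handles splitting the
    // input into chunks, scheduling, monitoring, and re-executing
    // failed tasks, as described above.
    public class FrequentItemJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "frequent-item-count");
            job.setJarByClass(FrequentItemJob.class);
            job.setMapperClass(ItemCountMapper.class);
            job.setReducerClass(ItemCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged with mvn package, the resulting JAR would then typically be submitted with hadoop jar, passing the input and output directories as arguments.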
The data sets are collected, and the data set for each category is given to the mappers. The mappers divide each record, and each record is then shuffled to extract only the needed columns. After the shuffling phase, grouping and sorting of the data sets is done. In the grouping and sorting phase, the primary key of one data set is compared with the foreign key of the other data set; if a match is present, the two records are grouped together and sorted in order. The output of the mapper, in the form of key-value pairs, is given to the reducer.
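This grouping step amounts to a reduce-side join on the product ID. In the hedged sketch below, each mapper is assumed to tag its records with the data set they came from ("R:" for the ratings data set, "P:" for the price data set); the tagging convention and class name are illustrative, not the paper's:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Reduce-side join on product ID. Each value is assumed to be tagged
    // by its mapper with "R:" (ratings data set) or "P:" (price data set).
    public class ProductJoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text productId, Iterable<Text> tagged,
                              Context context)
                throws IOException, InterruptedException {
            String rating = null, price = null;
            for (Text value : tagged) {
                String v = value.toString();
                if (v.startsWith("R:")) rating = v.substring(2);
                else if (v.startsWith("P:")) price = v.substring(2);
            }
            // Emit a single grouped tuple only when the product ID appears
            // in both data sets (primary key matches foreign key).
            if (rating != null && price != null) {
                context.write(productId, new Text(rating + "\t" + price));
            }
        }
    }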

In the final phase, only the columns that users frequently access are stored, giving the customer easy access.

5. Results and Discussion

A search page for finding product details has to be created, in which a product can be searched effectively based on its ratings and price. A Java applet is designed to display the contents needed for the search page. The input from the user is collected, and the search proceeds according to the product name. First, the product is looked up in the data sets; if the product name does not exist, a search is made for entries containing the name as a substring. If this also fails, a product with the user-mentioned name does not exist in the data set, and a message is displayed that the product the user is searching for was not found. For the product name given in the search box, the product is searched in the included data set and its details are returned. A threshold is set for every product's price and ratings; based on the threshold, the admin instructs the user whether or not to purchase the product. The search time to retrieve the data from the data set is calculated and reported in the console, in milliseconds, each time the user searches.

The data sets hold the ratings and the price details of products such as books, mobile phones, and accessories; two separate data sets are used, one for ratings and one for price details. Based on the input given, the data sets are divided into groups by the mapper, which turns the item sets into key-value pairs. The output of the mapper is passed to the reducer, which compares the primary key of one data set with the other data set and, when the keys are equal, groups the records into a single tuple. Once the given product name is found in the data set, the corresponding product ID is taken; this ID acts as the primary key of one data set and retrieves the details from the other data set, where it acts as a foreign key. The price and the ratings of the product are retrieved through this ID, and the values are displayed in the result-page module. The search time is calculated as soon as the search button is clicked in the first module and then displayed.
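The lookup logic described above (an exact name match first, then a substring fallback, with the elapsed time reported in milliseconds) might look like the following sketch; the Product record and the in-memory catalog map are illustrative stand-ins for the data retrieved through the Map-Reduce job:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative search: exact product-name match first, then a
    // substring fallback; elapsed time is reported in milliseconds.
    public class ProductSearch {
        record Product(String name, double price, double rating) {}

        static Product find(Map<String, Product> catalog, String query) {
            long start = System.currentTimeMillis();
            Product hit = catalog.get(query);            // exact match
            if (hit == null) {                           // substring fallback
                for (Map.Entry<String, Product> e : catalog.entrySet()) {
                    if (e.getKey().contains(query)) { hit = e.getValue(); break; }
                }
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println("Search time: " + elapsed + " ms");
            if (hit == null) System.out.println("Product '" + query + "' not found.");
            return hit;
        }

        public static void main(String[] args) {
            Map<String, Product> catalog = new LinkedHashMap<>();
            catalog.put("galaxy phone", new Product("galaxy phone", 299.0, 4.2));
            catalog.put("data mining book", new Product("data mining book", 45.0, 4.6));
            find(catalog, "mining");   // matches via the substring fallback
        }
    }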
6. Conclusion and Future Work

Data sets from different categories, covering different views, were given to the mapper to mine the frequent item sets and to search the required information effectively from a bulk volume of data. The implementation uses the Map-Reduce concept, which mines the data sets for the required information in less search time than procedures such as FP-Growth and Apriori, which lag in space usage and execution time. To overcome these difficulties, a search algorithm using the Map-Reduce concept is implemented in which two categories of data sets on different products such as mobile phones, accessories, and books are taken, covering the product price and the product ratings along with all the basic information about the product. For each search item, the data is passed through the map and reduce phases, the required information about the product is extracted from the large volume of data using the primary key, and the result is displayed in the result page. The time taken to search for the product in the data set is displayed and compares favorably with the FP-Growth and Apriori algorithms. Because Map-Reduce runs on Hadoop, it supports scalability, which solves the problem of space complexity. In future work, generalized products such as accessories and their details can be taken, and joins between different tables in the generalized categories can be implemented via the Maven plug-in.

References

[1] Agrawal R, Srikant R, Fast algorithms for mining association rules in large databases, Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), Santiago, Chile, pp. 487-499, 1994.
[2] Agrawal R, Srikant R, Mining sequential patterns, Proceedings of the International Conference on Data Engineering (ICDE '95), Taipei, Taiwan, pp. 3-14, 1995.
[3] Agrawal R, Shafer JC, Parallel mining of association rules, IEEE Transactions on Knowledge and Data Engineering, 8, pp. 962-969, 1996.
[4] Yang G, Computational aspects of mining maximal frequent patterns, Theoretical Computer Science, 362(1-3), pp. 63-85, 2006.
[5] Anastasiu DC, Iverson J, Smith S, Karypis G, Big data frequent pattern mining, in Frequent Pattern Mining, Springer International, pp. 225-259, 2014.
[6] Cheng H, Yan X, Han J, Hsu CW, Discriminative frequent pattern analysis for effective classification, Proceedings of the International Conference on Data Engineering (ICDE), pp. 716-725, 2007.
[7] Hill S, Srichandan B, Sunderraman R, An iterative MapReduce approach to frequent subgraph mining in biological datasets, Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine (ACM BCB '12), New York, pp. 661-666, 2012.
[8] Shunmugapriya P, Maragatham R, FiDoop-DP: data partitioning in frequent itemset mining on Hadoop clusters, International Innovative Research Journal of Engineering and Technology, 2017.
