Association Rule Mining in Big Data using MapReduce Approach in Hadoop


GRD Journals | Global Research and Development Journal for Engineering | International Conference on Innovations in Engineering and Technology (ICIET) | July 2016
e-ISSN:

J. Jenifer Nancy (1), M. Jansi Rani (2), Dr. D. Devaraj (3)
(1) P.G. Scholar, (2) Assistant Professor, (3) Senior Professor and H.O.D.
(1, 2) Department of Computer Science & Engineering, (3) Department of Electrical and Electronics Engineering
(1, 2, 3) Kalasalingam University, Krishnankovil, India

Abstract

Association rule mining is an important task in data mining. With big data, the large volume of data makes it impossible to generate rules at a fast pace. By using parallel execution in Hadoop with the MapReduce framework, the rules can be generated much faster and more efficiently. The existing method transforms the input dataset into a binomial representation before processing it with MapReduce. However, binomial conversion is not user-friendly, because it becomes complex for continuous values. In this paper, an improved and scalable algorithm is proposed for association rule mining that converts the input dataset into key-value pairs instead of a binomial representation. All stages of the proposed association rule mining algorithm are parallelized using MapReduce. The proposed algorithm works on high-cardinality features, so no dimension detection is needed.

Keywords: Hadoop; MapReduce; Association rule mining; Data mining; Big data

I. INTRODUCTION

A. Big Data and Characteristics

Data is collected and stored every minute, every hour and every day in an organization or institute, and is available in large quantities. What matters, however, is not the amount of data but what organizations do with it to identify useful information. This is done by analyzing the data to find insights or critical information that help the organization make decisions for its growth.
The term big data describes a large volume of data that is available in both structured and unstructured formats. Although big data is a new term, the process of collecting data, storing it in large amounts and analyzing it to gather new information was practised long before the term came into use. The characteristics of big data can be explained using the 3 Vs: (1) Volume, (2) Velocity and (3) Variety. The applications of big data include areas such as health care, telecom and finance. In this paper, the process of association rule generation in big data is discussed, and an association rule mining technique is proposed to generate rules from the KDD CUP 99 dataset.

B. Data Mining in Big Data

Big data mining deals with the large amounts of data stored in data warehouses and databases. It is used to extract or identify interesting patterns and information from this data. Many data mining techniques can be applied to big data: classification, clustering, association rules, prediction, estimation, documentation and description. These techniques have been researched extensively for a long time, and many algorithms have been applied within each of them, including in the big data setting. One well-known technique is association rule mining. It is an effective data mining technique for discovering hidden patterns and information in large databases: the relationships between the attributes of the data are identified using an association rule mining algorithm. Some basic types of association rule mining algorithms are the Apriori algorithm, distributed algorithms and parallel algorithms.

C. Association Rule Mining

Association Rule Mining (ARM) [1] is a popular data mining approach that analyses a dataset to discover interesting patterns or relationships between the items in it. The concept of strong association rules was first used by Agrawal et al. [2] to identify association rules among the items sold, using a large-scale transaction database collected from a supermarket point-of-sale system. The relationship between the items is identified from the purchase pattern. The ARM technique generates a set of association rules that hold between the items of the dataset, based on the number of occurrences of each combination of items in the dataset.

An association rule defines a relationship between items, or sets of items, in the given dataset. Consider three items A, B and C. The rule {A, B} → C states that if a person buys the two items A and B together, he or she will most likely buy item C as well. That is, the relations between the items are generated by identifying patterns within the dataset. The Association Rule Mining (ARM) technique [3] consists of two stages:

1) Identify the itemsets that occur frequently in the dataset: The frequent itemsets are those whose support value (sup(itemset)) is equal to or greater than the pre-defined minimum support value (min_sup). The support of an itemset is the number of transactions that contain all of its items. In the above example, the support of {A, B} is the number of transactions that contain both A and B.

2) Generate association rules from the frequent itemsets: In this stage the interesting rules are generated by calculating the confidence factor (conf) for the frequent itemsets generated in the previous stage. The confidence of the example rule {A, B} → C is sup({A, B, C}) / sup({A, B}), that is, the fraction of transactions containing A and B that also contain C.

D. MapReduce Approach for ARM

Association rule generation is widely used, but it faces many issues, the major one being the large size and high dimensionality of the datasets [4]. A single-processor system with ordinary CPU speed and resources cannot handle such large data, which makes the algorithm inefficient to use. More recently, the growth of network technology, and of cloud platforms in particular, has opened up new approaches to association rule generation based on parallel environments such as Hadoop [5]. MapReduce has been a popular model for processing large amounts of data ever since Google introduced it. Hadoop pairs MapReduce with the Hadoop Distributed File System (HDFS), which is modelled on the Google File System (GFS), and cloud providers such as Amazon Web Services (AWS) offer Hadoop and MapReduce as a service.
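The support and confidence definitions from the Association Rule Mining subsection can be illustrated with a minimal Python sketch. The five toy transactions and the function names `support` and `confidence` are assumptions made for illustration; they are not taken from the paper's implementation.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy transactions; the items A, B, C follow the example in the text.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B"}),
    frozenset({"A", "C"}),
    frozenset({"B", "C"}),
    frozenset({"A", "B", "C"}),
]


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> int:
    """Count the transactions that contain every item of the itemset."""
    items = frozenset(itemset)
    return sum(1 for transaction in transactions if items <= transaction)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """conf(X -> Y) = sup(X union Y) / sup(X); returns 0.0 if X never occurs."""
    antecedent_set = frozenset(antecedent)
    denominator = support(antecedent_set, transactions)
    if denominator == 0:
        return 0.0
    both = antecedent_set | frozenset(consequent)
    return support(both, transactions) / denominator


sup_ab = support({"A", "B"}, TRANSACTIONS)             # A and B occur together 3 times
conf_ab_c = confidence({"A", "B"}, {"C"}, TRANSACTIONS)  # 2 of those 3 also contain C
print(sup_ab, round(conf_ab_c, 3))
```

With a minimum support of 3, {A, B} would count as frequent here, while the rule {A, B} → C would only be kept if the confidence threshold were at most 2/3.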
A MapReduce job usually splits the input data into chunks, each of which is processed by a map task in parallel. The Mapper turns its chunk into key-value pairs, and the outputs are sorted. The Reducer then aggregates the outputs of the maps to obtain the final output. The MapReduce framework contains a single JobTracker as the master and one TaskTracker as the slave on each cluster node. All input and output in MapReduce are <key, value> pairs. Hadoop is a Java-based distributed programming environment, sponsored by Apache, that can be used to process and handle large amounts of data. It is built on the MapReduce concept for large-scale processing across a large number of nodes and clusters.

For Association Rule Mining in MapReduce, the Mapper emits the combinations of items as keys, with the value used to keep track of the number of occurrences, that is, the support count. The Reducer then aggregates the Mapper outputs for each key and calculates the final support and confidence for all the candidate itemsets. In this way, association rules that satisfy the support and confidence requirements can be generated.

The remainder of this paper is organized as follows: Section 2 surveys association rule mining algorithms that use Hadoop and MapReduce; Section 3 describes the proposed method and its working; Section 4 shows the experimental results of the proposed method; and Section 5 gives the overall conclusion of the paper.

II. LITERATURE SURVEY

MapReduce can be used to redesign existing sequential algorithms as parallel algorithms that handle large amounts of data in a shorter time, and so it is applied to association rule mining [6]. Some of the existing methods are discussed below.

A. State-of-art in Association Rule Mining

Yang et al. proposed a MapReduce-based programming model for generating association rules in the Hadoop framework to handle large volumes of data. The Apriori algorithm [7] is used as the underlying association rule generation technique. The standard Apriori algorithm is time consuming, especially when dealing with many candidate sets. To overcome this issue, they implemented an improved Apriori algorithm [8] that is parallelized using the Hadoop framework and handles large data by using the nodes of the Hadoop platform. The use of Hadoop for association rule generation provided a new research focus in the following years.

Lin et al. [11] proposed a similar method that uses the same Apriori approach for frequent itemset generation on the Hadoop platform with MapReduce. Mining is accelerated by parallelizing frequent itemset generation. Because parallelization is otherwise difficult to handle effectively, MapReduce is used for this purpose. Their parallel MapReduce algorithm performs better than previously existing algorithms in terms of both speed and rule generation accuracy [9].

Riondato et al. proposed a randomized algorithm for association rule mining that is implemented using a parallel approach [10] in the MapReduce framework. The proposed approach generates the association rules based on the dataset content. The PARMA (Parallel Association Rule Mining Algorithm) approach uses a randomized MapReduce algorithm to identify the appropriate frequent itemsets and association rules with near-linear speedup. A large number of random samples drawn from the original dataset are mined.

Jongwook Woo et al. proposed a Market Basket Analysis algorithm combined with MapReduce for association rule generation. This is one of the most widely used algorithms for association rules [12]. The algorithm first sorts the given dataset in ascending order, then converts each instance of the dataset into a (key, value) pair and feeds the pairs into MapReduce. The execution is done on the Amazon EC2 MapReduce platform. The experimental results show that the MapReduce parallel code increases performance, but that there is still a bottleneck at a certain point when more nodes are used.

B. Need for Proposed Method

The binomial algorithm is not suitable for many datasets, and a novel method is needed that can be applied to datasets in any format [13]. Binomial transformation is also complex, time consuming and unnecessary. It is difficult to handle and process large volumes of data on a single server, so a parallel environment is needed. In this paper an improved, scalable and distributed key-value pair algorithm is proposed for selecting frequent itemsets from the dataset and for generating association rules. The proposed algorithm is a bottom-up approach: the candidate itemsets are generated first, and the support values are then calculated by counting occurrences in the dataset transactions. The minimum support threshold is then applied to reduce the candidate itemsets to frequent itemsets. A very large dataset is used here, and after the frequent itemsets are selected the association rules are generated. The implementation uses the MapReduce platform, and the complete process is parallelized.

III. PROPOSED METHOD

The paper proposes and implements association rule mining on a very large dataset in the Hadoop platform using MapReduce [14]. The proposed algorithm converts the input dataset into <key, value> pairs instead of a binomial representation. This removes one level of transformation, since binomial features no longer need to be converted back to data features at the end. The input dataset should be preprocessed before the rule generation phase in MapReduce [15]. The phases of the proposed algorithm are discussed below.

1) Phase 1: Generate frequent 1-itemsets

The input dataset is first stored in the HDFS of the Hadoop environment to make data access easy and fast for MapReduce operations [16]. The input data is then split into chunks and provided to the Mapper, whose output is represented as <key, value> pairs. The outputs of all the maps are combined in the combiner and then sent to the reducer. There the support values are calculated by combining the values that correspond to each key. The support values are then compared with the minimum support, and the items whose support meets the minimum are output as the frequent 1-itemsets.

2) Phase 2: Generate candidate 2-itemsets and n-itemsets

Next, the candidate 2-itemsets are generated by the mapper from the frequent 1-itemsets. The count of each candidate 2-itemset is verified against the input data that is provided to the mapper. The counts are combined in the combiner and provided to the reducer, which aggregates them into the support values of the 2-itemsets. The same process is repeated for candidate n-itemsets until an iteration produces no further frequent itemsets.
3) Phase 3: Association rule generation

Finally, after all the frequent n-itemsets have been generated, the association rules are generated based on confidence values. The output of the previous phases contains each selected itemset and its support count, and is written to an output file. These support values are then used to calculate the confidence of each rule formed from the frequent itemsets, and the rules with 100% confidence are produced as the output rules.

The overall association rule generation discussed above is implemented in the Hadoop framework by creating a single-node Hadoop environment [17]. The time in Hadoop is synchronized with the system time, and the time values are calculated in milliseconds using the time function in Hadoop. The data flow for two iterations of MapReduce in Hadoop is shown in Fig. 1.
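The three phases can be sketched as a small in-process simulation in Python. This is not the paper's Hadoop code: the functions `mapper`, `combiner`, `reducer`, `mine` and `generate_rules`, the two toy chunks, and the simplified candidate generation (all k-combinations of items that survived the previous level) are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy input, already split into two chunks of transactions.
# Each item is an "attribute=value" string standing in for a key.
CHUNKS = [
    [frozenset({"protocol_type=tcp", "service=http", "flag=SF"}),
     frozenset({"protocol_type=tcp", "service=http"})],
    [frozenset({"protocol_type=tcp", "flag=SF"}),
     frozenset({"protocol_type=udp", "service=domain_u", "flag=SF"})],
]


def mapper(chunk, candidates):
    """Map: emit (itemset, 1) for every candidate contained in a transaction."""
    for transaction in chunk:
        for candidate in candidates:
            if candidate <= transaction:
                yield candidate, 1


def combiner(pairs):
    """Combine: sum the counts locally for one chunk."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return local


def reducer(partials, min_sup):
    """Reduce: merge per-chunk counts and keep itemsets meeting min_sup."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return {key: count for key, count in total.items() if count >= min_sup}


def mine(chunks, min_sup):
    """Phases 1 and 2: level-wise frequent itemsets, one simulated job per level."""
    items = {item for chunk in chunks for t in chunk for item in t}
    candidates = [frozenset([item]) for item in sorted(items)]
    frequent, k = {}, 1
    while candidates:
        level = reducer([combiner(mapper(c, candidates)) for c in chunks], min_sup)
        if not level:
            break  # no frequent itemset at this level, so stop iterating
        frequent.update(level)
        k += 1
        survivors = sorted({item for itemset in level for item in itemset})
        candidates = [frozenset(c) for c in combinations(survivors, k)]
    return frequent


def generate_rules(frequent, min_conf=1.0):
    """Phase 3: keep rules X -> Y with sup(X union Y) / sup(X) >= min_conf."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                antecedent = frozenset(antecedent)
                conf = sup / frequent[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, itemset - antecedent, sup, conf))
    return rules


frequent = mine(CHUNKS, min_sup=2)
rules = generate_rules(frequent)  # default min_conf=1.0 mirrors the 100% threshold
for antecedent, consequent, sup, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), sup, conf)
```

On this toy input the only rule with 100% confidence is service=http → protocol_type=tcp, with a support count of 2. In a real Hadoop job each level would be a separate MapReduce pass over the data in HDFS; here the chunks are plain Python lists.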

Fig. 1: Data flow showing two iterations of proposed method

First the dataset is read as input by the MapReduce code from the HDFS storage, and each item is processed as a separate key to calculate the frequent 1-itemsets, as in Fig. 1. Then the frequent 2-itemsets are generated from pairs of items in the frequent 1-itemsets. This process is repeated for as many iterations as the size of the itemsets needed. Fig. 1 shows the calculation up to 3-itemsets using MapReduce. The key used in the Mapper represents an n-itemset, where n is the number of items used to form the key. The flow of the proposed MapReduce framework is shown below in Fig. 2.

Fig. 2: Proposed MapReduce framework

During the MapReduce operation the input dataset or file is split into many sections in the Mapper phase, and each Mapper emits its records under unique keys. In ARM the key represents an item available within the dataset, and the value is the number of occurrences of that item in the dataset. Initially the count is set to 1 in the Mapper, and for each further occurrence the count is incremented. Finally, in the Reducer the total number of occurrences is found by merging, and the support and confidence are calculated. The output file consists of the list of rules generated based on the support and confidence.

IV. EXPERIMENTATION AND RESULTS

A. Dataset Description

The proposed approach for association rule mining is applied to the KDD CUP 99 data, and the simulation details are presented here. The KDD CUP 99 input dataset consists of records from four categories of attacks: Denial of Service, user-to-root, probing and remote-to-local. The dataset contains both labeled and unlabeled records; each labeled record consists of 41 attributes and one target attribute. The attributes fall into three groups: basic, content-based and time-based values. Not all the values are binary. The training set consists of almost 5 million instances of the input dataset.
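Because not all attribute values are binary, each record can be kept as (attribute, value) pairs rather than being expanded into binomial columns. The following Python sketch shows one way this conversion might look. It assumes each record is a comma-separated line whose last field is the class label; the function name `record_to_pairs`, the shortened four-attribute feature list and the sample record are hypothetical and are not taken from the paper's code or data.

```python
from typing import List, Sequence, Tuple

# Assumption: only the first four KDD CUP 99 feature names are used to keep
# the example short; a real record has 41 features plus one class label.
FEATURES: List[str] = ["duration", "protocol_type", "service", "flag"]


def record_to_pairs(line: str,
                    features: Sequence[str] = FEATURES
                    ) -> Tuple[List[Tuple[str, str]], str]:
    """Split one comma-separated record into (attribute, value) pairs and a label."""
    fields = [field.strip() for field in line.strip().split(",")]
    if len(fields) != len(features) + 1:
        raise ValueError(
            f"expected {len(features) + 1} fields, got {len(fields)}")
    *values, label = fields
    # Values stay as strings, so continuous and categorical attributes are
    # handled the same way and no binomial (0/1) encoding is required.
    return list(zip(features, values)), label


# Hypothetical shortened record, not an actual row of the dataset.
SAMPLE = "0,tcp,http,SF,normal"
pairs, label = record_to_pairs(SAMPLE)
print(pairs, label)
```

Each pair can then serve directly as a key in the Mapper, for example by joining it into a string such as "service=http".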
The training set and test set are described below:

Training Set: Contains 494,021 connections or records with a total of 22 attack types.

Test Set: Contains 311,029 connections or records with 17 new attack types not available in the training data.

No.  Value                No.  Value
1    duration             22   is_guest_login
2    protocol_type        23   count
3    service              24   srv_count
4    flag                 25   serror_rate
5    src_bytes            26   srv_serror_rate
6    dst_bytes            27   rerror_rate
7    land                 28   srv_rerror_rate
8    wrong_fragment       29   same_srv_rate
9    urgent               30   diff_srv_rate
10   hot                  31   srv_diff_host_rate
11   num_failed_logins    32   dst_host_count
12   logged_in            33   dst_host_srv_count
13   num_compromised      34   dst_host_same_srv_rate
14   root_shell           35   dst_host_diff_srv_rate
15   su_attempted         36   dst_host_same_src_port_rate
16   num_root             37   dst_host_srv_diff_host_rate
17   num_file_creation    38   dst_host_serror_rate
18   num_shells           39   dst_host_srv_serror_rate
19   num_access_files     40   dst_host_rerror_rate
20   num_outbound_cmds    41   dst_host_srv_rerror_rate
21   is_host_login

Table 1: Features of the input dataset

The 41 features of the KDD CUP 99 dataset are shown in Table 1, and Fig. 3 shows sample values of the dataset. The values of features 1 to 41 are separated by commas in the dataset shown in Fig. 3. That is, each instance or row of the dataset consists of 42 attributes, 41 feature attributes and one class attribute, all separated by commas as in the figure. The row values are split so that each attribute can be read separately.

Fig. 3: KDD CUP 99 dataset sample values

B. Results and Discussion

During execution, the input dataset is split into many tasks by Map and Reduce in the Hadoop environment. The input data is sent to the mapper, which splits the instances of the data into <key, value> pairs. The pairs are sorted and shuffled before they are sent to the reducer. The final result is obtained by reducing the <key, value> pairs: the support and confidence are calculated and the rules are selected on that basis. From these rules it is possible to identify whether the user of a specific instance or attack is a guest login or a host login. The values of support and confidence obtained during the 4 levels of MapReduce operations are shown in Fig. 4.

Fig. 4: Support and Confidence

The execution of the MapReduce phase [18] in Hadoop and the final results of the reducer phase are shown in Fig. 5 and Fig. 6 respectively. Fig. 5 shows the execution of the Reducer phase while the output file is being generated, together with the final statistics of the MapReduce job. The generated output file is shown in Fig. 6.

Fig. 5: Mapper and Reducer execution

Fig. 6: Final output

The final output in Fig. 6 lists all the frequent itemsets that are generated, along with their support and confidence values. The output format is <itemset, support, confidence>, and it is generated for all possible combinations of itemsets for the given input attributes. In this case the 2-itemsets are generated.

V. CONCLUSION AND FUTURE WORK

Association rule mining can be done effectively in distributed systems that support parallel execution, as in the Hadoop environment, because it can then be scaled up to large volumes of data with less execution time and cost and with good accuracy. The algorithm proposed in this paper also takes the type of input data into account and can be applied to any data format. Dividing the input data into many splits and processing them on many nodes makes the execution easier. Management issues such as data transfer between the nodes, storage of data, failure of a node and other issues within the cluster are all handled by Hadoop automatically, which makes the platform scalable and robust. The proposed association rule mining algorithm inherits these features and is therefore efficient. In addition, the key-value pair approach makes the processing much easier than the existing binomial approach. However, the proposed algorithm is still not the best in performance when it comes to really large datasets. In the future, fuzzy-based association rule mining can be done in Hadoop to handle data larger than that used in this paper. Further, the input data can be classified based on the calculated support and confidence values by using a suitable classification algorithm. This work can also be extended to perform feature selection first, using information gain or mutual information [19], before implementing ARM.
REFERENCES

[1] M. Z. Ashrafi, D. Taniar, K. Smith, ODAM: An Optimized Distributed Association Rule Mining Algorithm, Distributed Systems Online, IEEE, Volume 5, Issue 3.
[2] R. Agrawal, R. Srikant, Fast Algorithms for Mining Association Rules, In Proceedings of International Conference on Very Large Data Bases, Santiago, Chile, September 1994.
[3] Jong Soo Park, Ming-Syan Chen, Philip S. Yu, An Effective Hash-based Algorithm for Mining Association Rules, In Proceedings of the ACM SIGMOD International Conference on Management of Data, Michael Carey and Donovan Schneider, ACM.
[4] S. A. Ozel, H. A. Guvenir, An Algorithm for Mining Association Rules using Perfect Hashing and Database Pruning, 10th Turkish Symposium on Artificial Intelligence and Neural Networks, Gazimagusa, Springer.
[5] Karam Gouda, Mohammed Javeed Zaki, Efficiently Mining Maximal Frequent Itemsets, In Proceedings of the IEEE International Conference on Data Mining, November 29-December 02.
[6] J. Han, J. Pei, Y. Yin, Mining Frequent Patterns without Candidate Generation, ACM SIGMOD International Conference, Dallas, 2000.
[7] D. W. Cheung, Jiawei Han, V. T. Ng, A. W. Fu, Yongjian Fu, A Fast Distributed Algorithm for Mining Association Rules, In Proceedings of International Conference on Parallel and Distributed Information Systems, IEEE CS Press.
[8] E. Ansari, G. Dastghaibifard, M. Keshtkaran, H. Kaabi, Distributed Frequent Itemset Mining using Trie Data Structure, International Journal of Computer Science, Volume 35, Issue 3.
[9] J. S. Park, M. S. Chen, P. S. Yu, Efficient Parallel Data Mining for Association Rules, In Proceedings of the Fourth International Conference on Information and Knowledge Management, pp. 31-33.
[10] J. Woo, Y. Xu, Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing, In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications.
[11] Ming-Yen Lin, Pei-Yu Lee, Sue-Chen Hsueh, Apriori-based Frequent Itemset Mining Algorithms on MapReduce, In Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, ACM.
[12] Peddi Kishor, Sammulal Porika, Literature Survey on Association Rule Discovery in Data Mining, International Journal of Computer Science and Management Research, Volume 2, Issue 1, January.
[13] C. S. Zhang, Z. Y. Li, D. S. Zheng, An Improved Algorithm for Apriori, In Proceedings of the 1st International Workshop on Education Technology and Computer Science, Volume 1.
[14] C. Jin, C. Vecchiola, R. Buyya, MRPGA: An Extension of MapReduce for Parallelizing Genetic Algorithms, Fourth IEEE International Conference on eScience.
[15] T. Elsayed, J. Lin, Douglas W. Oard, Pairwise Document Similarity in Large Collections with MapReduce, In Proceedings of 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
[16] J. H. C. Yeung, C. C. Tsang, K. H. Tsoi, B. Kwan, C. Cheung, A. P. C. Chan, P. H. W. Leong, Map-reduce as a Programming Model for Custom Computing Machines, In Proceedings of the 16th IEEE Symposium on Field-Programmable Custom Computing Machines.
[17] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, I. Stoica, Improving MapReduce Performance in Heterogeneous Environments, EECS Department, University of California, Berkeley, Technical Report Number UCB/EECS, August 19.
[18] Mohammadhossein Barkhordari, Mahdi Niamanesh, ScadiBino: An Effective MapReduce-based Association Rule Mining Method, ACM 16th International Conference on Electronic Commerce, August.
[19] P. Ganesh Kumar, D. Devaraj, Intrusion Detection using Artificial Neural Network with Reduced Input Features, International Journal on Soft Computing, ICTACT, Issue 1, July.
