AN EFFICIENT APPROACH FOR FREQUENT ITEMSET MINING IN BIG DATA


AN EFFICIENT APPROACH FOR FREQUENT ITEMSET MINING IN BIG DATA

THESIS

Submitted in partial fulfillment of the requirements for the award of the degree of

DOCTOR OF PHILOSOPHY

IN THE DEPARTMENT OF INFORMATION TECHNOLOGY

By

P. SUBHASHINI (Regn. No. SP12ITD035)

DEPARTMENT OF INFORMATION TECHNOLOGY
St. PETER'S INSTITUTE OF HIGHER EDUCATION AND RESEARCH
St. PETER'S UNIVERSITY
CHENNAI

SEPTEMBER 2017

CERTIFICATE

I hereby certify that the thesis entitled "AN EFFICIENT APPROACH FOR FREQUENT ITEMSET MINING IN BIG DATA", revised and resubmitted to St. Peter's University for the award of the Degree of Doctor of Philosophy, is the record of research work done by the candidate P. Subhashini under my guidance, and that the thesis has not previously formed the basis for the award of any degree, diploma, associateship, fellowship or other similar title.

Dr. G. GUNASEKARAN
SUPERVISOR

Place :
Date :

DECLARATION

Certified that the thesis entitled "AN EFFICIENT APPROACH FOR FREQUENT ITEMSET MINING IN BIG DATA" is the bonafide record of independent work done by me under the supervision of Dr. G. Gunasekaran. Certified further that the work reported herein does not form part of any other thesis or dissertation on the basis of which a degree or award was conferred earlier.

Dr. G. GUNASEKARAN                         P. SUBHASHINI
SUPERVISOR

Place :
Date :

ACKNOWLEDGEMENTS

The completion of this research work would not have been possible without the encouragement, help and support of many individuals. It is my privilege to thank the people who have supported and guided me throughout this research work. I am grateful to Dr. Francis C. Peter, Vice Chancellor, St. Peter's University, Chennai, for giving me an opportunity to carry out my research in the university. I also thank Dr. S. Gunasekaran, Dean, R&D, St. Peter's University, for his constant help and support. I am highly indebted to my supervisor Dr. G. Gunasekaran, Professor and Principal, Meenakshi College of Engineering, Chennai, for giving me an opportunity to work under his guidance. I would like to express my sincere gratitude to him for his valuable guidance, insightful suggestions and constructive inputs. I am grateful to Dr. C. Jayakumar, Professor, Department of Computer Science and Engineering, Sri Venkateswara College of Engineering, Chennai, and Dr. K. Selvamani, Professor, Department of Computer Science and Engineering, College of Engineering, Chennai, for guiding me as Doctoral Committee Members for my research. I also thank Dr. S. Pushpa, Professor and Head, CSE, and the staff members of the Department of Computer Science and Engineering and Information Technology, St. Peter's University, for their help during this research. Finally, I would like to thank all those who were directly or indirectly helpful in carrying out this research.

ABSTRACT

Mining Frequent Itemsets is one of the most important concepts of Data Mining. Over two decades, much research work has been done on Frequent Itemset Mining, but it becomes a very difficult task when applied to Big Data. Many efficient pattern mining algorithms have been discovered in the last two decades, yet most do not hold good for big datasets. Recent improvements in the field of parallel programming already provide good tools to tackle this problem. Hadoop is one such tool, which is used to process big data in parallel using MapReduce. Recent studies reveal that the MapReduce programming model ensures significant performance gains in the context of data mining algorithms. In terms of memory as well as execution speed, the tree-based pattern growth algorithm is considered more efficient than the other Frequent Itemset Mining (FIM) methods. Another important consideration in Frequent Itemset Mining is that it often generates a very large number of itemsets, which reduces not only the efficiency but also the effectiveness of mining. Constraint-based FIM has been proved to be effective in reducing the search space of the FIM task and thus improves the efficiency. In addition, almost all Frequent Pattern Mining algorithms generate Frequent 1-itemsets in order to find the support count (occurrences) of each item in the entire set of transactions. This is itself a tedious task in generating Frequent Patterns when considering the hugeness of the modern databases available. No explicit strategy has been outlined in these algorithms to perform the aforesaid task.

To overcome the above drawbacks, an efficient algorithm called Modified FP Growth has been proposed to mine Frequent Itemsets from big datasets. In this algorithm, the MapReduce concept is used to find Frequent Itemsets from the big dataset. In each Data Node, Frequent 1-itemsets are generated using a new tree structure called the support count tree. This tree can easily be embedded into any of the existing algorithms aimed at FIM. With the help of this tree, Frequent 1-itemsets are found quickly and efficiently, which in turn speeds up the generation of Frequent Itemsets of the entire database. In addition, to further increase the efficiency of the MapReduce task, a cache has been included in the Map phase to maintain the support count tree for calculating the Frequent 1-itemsets of each mapper. This reduces the total time of calculating Frequent 1-itemsets, since it bypasses the sort and combine tasks of each Mapper in the original MapReduce tasks. This in turn reduces the total execution time of generating Frequent Itemsets of the entire database.

Keywords: Data Mining, Frequent Itemset, Constraints, Hadoop, MapReduce, support count, Frequent 1-itemsets, patterns, cache

CONTENTS

Certificate
Declaration
Acknowledgements
Abstract
List of Tables
List of Figures
List of Abbreviations

CHAPTER 1  INTRODUCTION
1.1 System Overview
1.2 Introduction to Data Mining
1.2.1 Foundations of Data Mining
1.2.2 Scope of Data Mining
1.3 Association Rule Mining
1.3.1 FPM
1.4 Big Data
1.5 Need for the Study
1.6 Problem Statement
1.7 Objectives of the Study
1.8 Methodology of the Study
1.9 Organization of the Thesis

CHAPTER 2  SURVEY OF LITERATURE
2.1 Survey on the existing algorithms for FPM in Big Data
2.2 Survey on FPM using Constraints

CHAPTER 3  MODIFIED FP GROWTH FOR FIM
3.1 Basics of Frequent Itemsets
3.2 FPM Algorithms
3.2.1 Apriori Algorithm
3.2.2 Eclat Algorithm

3.2.3 FP Growth Algorithm
3.2.4 Comparison
3.3 Modified FP Growth Algorithm
3.4 Experimental Results

CHAPTER 4  CONSTRAINED FIM IN BIG DATA
4.1 Need for Constrained Pattern Mining
4.2 Types of Constraints
4.2.1 Classification of constraints based on semantics
4.2.2 Classification of constraints based on properties
4.3 Constrained FIM
4.4 FIM in Big Data
4.4.1 Introduction to Hadoop and MapReduce
4.5 FIM in Big Data using MapReduce
4.6 Constrained FIM in Big Data

CHAPTER 5  MODIFIED MAPREDUCE FOR FIM IN BIG DATA
5.1 Introduction to Cache
5.1.1 Cache Entries
5.2 Proposed System
5.2.1 Modified MapReduce
5.3 Experimental Results

CHAPTER 6  CONCLUSIONS AND SCOPE FOR FURTHER STUDY
6.1 Conclusions
6.2 Scope for Further Study

REFERENCES

PUBLICATIONS

LIST OF TABLES

3.1 Comparison of Apriori, Eclat and FP Growth
Execution time of FP Growth and FP Growth with support count tree
Auxiliary information of each item
Transactional Database
Transactional Database
Numbering each item
Support Count Table
FIM-Mapper outputs
FIM-Reducer outputs
Comparison of MapReduce and Modified MapReduce with respect to the execution time

LIST OF FIGURES

1.1 Types of FPM
1.2 Big Data Characterization
Frequent Itemset generation using Apriori
Comparison of Apriori, ECLAT and FP Growth algorithms execution time
Support Count Tree formation
Performance of FP Growth and FP Growth with support count tree
Components of Hadoop Ecosystem
HDFS cluster setup
Mapper and Reducer task
Generation of Frequent 1-itemset using MapReduce
4.5 Flow Chart for FIM using Anti-Monotone Constraint
Flow diagram for multiple constraints FIM
Flow Chart for FIM using Modified MapReduce
Proposed Architecture of MapReduce for generating Frequent 1-itemsets
Flow diagram of MapReduce task
Flow diagram of Modified MapReduce task
Support count tree
Cache size required for storing different number of items

5.7 Performance Comparison of MapReduce and Modified MapReduce
5.8 Performance Comparison of MapReduce and Modified MapReduce for merged files

LIST OF ABBREVIATIONS

AFPIM - Adjusting FP-Tree for Incremental Mining
AWS - Amazon Web Services
CATS - Compressed and Arranged Transaction Sequences
CFIM - Constrained Frequent Itemset Mining
CFP - Compressed Frequent Pattern
COFI - Co-Occurrence Frequent Item
CPT - Compact Pattern Tree
DFEM - Dynamic FP-Tree and Eclat Method
EC2 - Elastic Compute Cloud
FIM - Frequent Itemset Mining
FIU - Frequent Items Ultrametric
FLR - Fast, Load balancing and Resource
FPM - Frequent Pattern Mining
HDFS - Hadoop Distributed File System
HUI - High Utility Itemsets
H-Mine - Hyperstructure Mining
I/O - Input/Output
JVM - Java Virtual Machine
TLB - Translation Lookaside Buffer
MCP - Maximum Cost Performance
MREclat - MapReduce Eclat
PHUI - Parallel mining High Utility Itemsets

PIFP - Parallelized Incremental FP
PPT - Prefix Path Tree
RARM - Rapid Association Rule Mining
RDD - Resilient Distributed Datasets
S3 - Simple Storage Service
SCT - Support Count Table
SSR - Search Space Reduced
TID - Transaction Identifier

CHAPTER 1

INTRODUCTION

1.1 System Overview

Frequent Pattern Mining (FPM) is one of the most well-known techniques to extract frequent patterns from data. It plays an important role in association rule mining, finding correlations and trends, etc. Finding frequent patterns becomes a very difficult task when the techniques are applied to Big Data. Data storage has increased exponentially in the world over the past few years. Data coming from different sources such as web logs, machine logs, human-generated data, etc. are being stored by companies. This phenomenon is known as "Big Data" and nowadays it is trending everywhere. With the incredibly fast growth of data comes the need to analyze the huge amount of data. Due to the lack of adequate tools and programs, data remains unused and underutilized, and much important knowledge which is useful to mankind remains hidden. Recent improvements in the field of parallel programming have provided good tools to tackle this problem. However, these tools come with their own technical challenges, e.g. balanced data distribution and intercommunication costs. Yang Qiang (2015) has explained how Hadoop, using the MapReduce programming paradigm, can be used to mine frequent patterns from Big Data. The data in the Hadoop Distributed File System is scattered and needs lots of time to retrieve. So MapReduce, a programming model, can be used for processing and generating large data sets with a parallel and distributed algorithm on a cluster.
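To make the MapReduce idea concrete, the following is a minimal, self-contained sketch of how frequent 1-itemset counting maps onto the model, simulated in plain Python on toy data (no Hadoop cluster is assumed; the shard contents are hypothetical):

```python
from collections import defaultdict

# Toy transaction shards, standing in for HDFS blocks (hypothetical data).
shards = [
    [["bread", "jam"], ["bread", "milk"]],
    [["milk", "jam"], ["bread", "jam", "milk"]],
]

def mapper(transactions):
    # Emit an (item, 1) pair for every item occurrence in this shard.
    for t in transactions:
        for item in t:
            yield item, 1

def reducer(item, counts):
    # Sum the partial counts received for one item.
    return item, sum(counts)

# Shuffle phase: group all mapper outputs by key.
grouped = defaultdict(list)
for shard in shards:
    for item, one in mapper(shard):
        grouped[item].append(one)

support = dict(reducer(k, v) for k, v in grouped.items())
print(support)  # {'bread': 3, 'jam': 3, 'milk': 3}
```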

Some of the real-life applications to which different Data Mining techniques can be applied are (i) forming groups of people based on their interests or grouping similar constraints based on their properties (clustering), (ii) categorizing new insurers based on records of similar old claimants (classification), and (iii) detecting unusual credit transactions (anomaly detection). Besides clustering, classification and anomaly detection, frequent pattern mining and association rule mining are also important, because the latter two analyze valuable data (e.g., shopper market basket data) and help shop owners/managers by finding interesting or frequent itemsets that reveal customer purchase behavior. An algorithm for mining customer transaction database itemsets has been proposed by Han, Jiawei, Micheline Kamber, and Pei Jian (2006). Most of the algorithms for FPM can be grouped into two categories: Apriori-like algorithms and the FP growth algorithm. Apriori generates frequent patterns by repeatedly scanning the database to prune candidate sets, whereas the FP growth algorithm generates frequent patterns by first constructing an FP-Tree, in which transactions are stored in a tree structure in a compressed format. Then, using the FP growth algorithm, frequent itemsets are extracted from the database.
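The contrast between the two categories can be illustrated with a rough, level-wise Apriori-like sketch on toy data (the support threshold is arbitrary, and the real Apriori additionally prunes candidates whose subsets are infrequent, which is omitted here for brevity):

```python
from itertools import combinations

transactions = [{"bread", "jam"}, {"bread", "milk"},
                {"bread", "jam", "milk"}, {"jam", "milk"}]
minsup = 2  # absolute support threshold (illustrative)

def support(itemset):
    # One full pass over the database per call: Apriori's main cost.
    return sum(1 for t in transactions if itemset <= t)

items = {i for t in transactions for i in t}
frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
result, k = set(frequent), 2
while frequent:
    # Join step: combine frequent (k-1)-itemsets into k-candidates.
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # Prune step: keep only candidates that survive a database scan.
    frequent = {c for c in candidates if support(c) >= minsup}
    result |= frequent
    k += 1
print(sorted(tuple(sorted(s)) for s in result))
```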

1.2 Introduction to Data Mining

Data Mining is a powerful new technology to extract hidden predictive information from large databases. It helps companies to focus on the most important information in their data warehouses. Data Mining tools predict future trends and behaviors, with which business people can make proactive, knowledge-driven decisions. The automated, prospective analysis offered by Data Mining moves beyond the analysis of past events provided by retrospective tools similar to those of decision support systems. Data Mining tools can answer business questions that have traditionally been too time-consuming to resolve.

1.2.1 Foundations of Data Mining

Foundations of Data Mining means a systematic study of the various notions that form its inherent hierarchical structure, from the basic concepts like data, objects, attributes/features, knowledge, etc., to the theories and algorithms for deriving knowledge in the Data Mining process, and to the evaluation and interpretation of the results. Data Mining techniques are the result of a long process of research and product development. This evolution of Data Mining began when business people started to store business information on computers, continued with improvements in data access, and more recently has produced advanced technologies that allow people to navigate through their data in real time. Data Mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:

Huge data collection
Powerful multiprocessor computers
Algorithms for Mining Data

1.2.2 The Scope of Data Mining

Data Mining has gone through several research and developmental phases over many years. Statistics, Artificial Intelligence and Machine Learning are the three key areas with which Data Mining has attained its maximum growth. Data Mining is built on Statistics, which is the foundation of most of its technologies, e.g. Standard Deviation, Standard Variance, Regression Analysis, Standard Distribution, Cluster Analysis, etc. Artificial Intelligence is also the base for Data

Mining, which tries to simulate the human thought process or human intelligence in statistical problems. Another core area for Data Mining is Machine Learning, which is the combination of Statistics and Artificial Intelligence. Data Mining is the collection of historical and recent developments in Statistics, Machine Learning and Artificial Intelligence. These techniques are used to study and find hidden patterns or knowledge available in data. Data Mining is also being applied to areas such as information security and intrusion detection. The name Data Mining has been derived from the similarities between searching for valuable business information in a large database, for example, finding linked products in gigabytes of store scanner data, and mining a mountain to find valuable ore. Both processes require either intelligently probing to find exactly where the value resides or sifting through an immense amount of material. Data Mining technology can generate new business opportunities for a small or a big database. It provides the following capabilities:

Automated prediction of trends and behaviors. Data Mining is used to automate the process of finding predictive information from large databases. Traditionally, finding an answer to a particular question required extensive hands-on analysis, but now the answer can be found quickly and directly from the data. A typical example of a predictive problem is targeted marketing. Data Mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.

Automated discovery of previously unknown patterns. Data Mining tools sweep through databases and identify previously hidden

patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify the products which are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors. Some successful application areas include:

A car manufacturing company can analyze its recent sales force activity and the results to improve the targeting of upper-class people and determine which marketing activities will have the greatest impact in the next few months. The data needed includes competitor market activity as well as information about the residences of high-society people. The results can be distributed to the sales force via a wide-area network that enables the sales branches to review the recommendations from the perspective of the key attributes in the decision process. The ongoing, dynamic analysis of the data warehouse allows best practices from throughout the organization to be applied in specific sales situations.

Pattern discovery is another important application of Data Mining, in which patterns that occur frequently in a database are found. The most well-studied types of patterns are sets of items that occur frequently together in transaction databases such as the market basket logs of retail stores.

A credit card company can use its customer transaction data to identify which customers are likely to be interested in a new credit card product. To learn the attributes of these customers, a small test mailing can be sent to them to identify their affinity towards the product. Recent projects have indicated more than a 20-fold decrease in costs for targeted mailing campaigns over conventional approaches.

A diversified transportation company with a large direct sales force can apply Data Mining to identify the best prospects for its services. Data Mining can be used to analyze the customers' experience. A unique

segmentation can be built by identifying the attributes of high-value prospects. Applying this segmentation to a general business database can yield a prioritized list of prospects by region.

Data Mining can be used in educational systems to bridge the knowledge gap between the students of different universities. The hidden patterns and associations that are extracted from the mining process can improve the decision-making processes in higher educational systems. This improvement can bring advantages like increasing the student promotion rate, improving the efficiency of the educational system, reducing the cost of system processes, etc.

By applying Data Mining, a large consumer packaged goods company can improve its sales process to retailers. With data collected from consumer panels, shipments, and competitor activity, it can understand the reasons for brand and store switching. Through this analysis, the manufacturer can select promotional strategies that best reach its target customer segments.

Of all these applications, pattern discovery is very important for association rule mining. For example, association rules can be found for market basket or transaction data analysis, classification models can be mined for prediction, clusters can be identified for customer relation management, and outliers can be found for fraud detection.

1.3 Association Rule Mining

One of the fundamental methods from the prospering field of Data Mining is the generation of association rules that describe relationships between items in data sets. Association rule mining is primarily focused on finding frequent co-occurring associations among a collection of items. It is sometimes

referred to as Market Basket Analysis, since that was the original application area of association mining. The goal is to find associations of items that occur together more often than one would expect from a random sampling of all possibilities. Generally speaking, an Association Rule is an implication of the form:

X ⇒ Y

where X and Y are disjoint sets of items. The meaning of such a rule is quite intuitive: let DB be a transaction database, where each transaction T ∈ DB is a set of items. An association rule X ⇒ Y expresses that whenever a transaction T contains X, then T also contains Y with probability conf. The probability conf is called the rule confidence and is supplemented by further quality measures like rule support and interest. The support is an indication of how frequently the itemset appears in the database. It is sometimes expressed as a percentage of the total number of records in the database. The confidence is an indication of how often the rule has been found to be true. An example of Association Rule Mining is identifying the items that occur frequently together in a large transactional database, even if the customers who bought the items are unknown. Association Rule Mining searches for interesting relationships among those items and displays them in rule form. An association rule "bread ⇒ jam (sup = 2%, conf = 80%)" states that 2% of all the transactions under analysis show that bread and jam are purchased together, and 80% of the customers who bought bread also bought jam. Such rules can be useful for decisions concerning product pricing, promotions, and many other things. Association rules are also widely used in various areas such as telecommunication networks, market and risk management, inventory control, etc.
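Support and confidence as defined above can be computed in a few lines; a small sketch using the bread/jam rule (toy transactions, so the numbers differ from the 2%/80% figures in the text):

```python
transactions = [{"bread", "jam"}, {"bread", "jam", "milk"},
                {"bread", "milk"}, {"milk"}, {"bread", "jam"}]

def support(itemset):
    # Fraction of all transactions containing the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    # conf(X => Y) = sup(X u Y) / sup(X)
    return support(x | y) / support(x)

x, y = {"bread"}, {"jam"}
print(f"sup = {support(x | y):.0%}, conf = {confidence(x, y):.0%}")
# sup = 60%, conf = 75%
```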

Phases of Association Rule Mining: It consists of two phases:

Finding all frequent patterns. By definition, each of these patterns will occur at least as frequently as a pre-defined minimum support threshold, the minimum support for an itemset to be identified as frequent.

Generating association rules from frequent patterns. Association rules can be formed only by satisfying the pre-defined minimum support threshold and minimum confidence threshold.

The second phase is straightforward and less expensive. Therefore the first phase, FPM, is the crucial step of the two and determines the overall performance of mining association rules. In addition, frequent patterns play an essential role in many Data Mining tasks that try to find interesting patterns in databases, such as association rules.

1.3.1 FPM

FPM means finding patterns (itemsets, sequences, structures, etc.) that occur frequently in a data set. FPM helps us to identify the relationships or correlations between items in the dataset. For example, a set of items such as paint and brush, which appear frequently together in a transaction data set, is a Frequent Itemset. This information helps the shopkeeper to arrange these frequent items together, which will induce a paint buyer to buy a brush. Another example is frequent pattern discovery from web log data, which helps to identify the navigational behavior of users. Consider a scenario such as buying first a PC, then a Data Card, and then a Pen Drive; if this pattern

occurs frequently in a shopping history database, then that pattern is a frequent sequential pattern. The types of FPM are shown in Figure 1.1.

Figure 1.1 Types of FPM

Sequential Pattern Mining: It is concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence. The mining process finds frequent subsequences in a set of sequential data, where a sequence records an ordering of events.

FIM: Extracting sets of products that are frequently bought together. It aims at finding regularities in the shopping behavior of the customers of supermarkets, mail-order companies, on-line shops, etc.

Structured Pattern Mining: The mining process searches for frequent substructures in a structured data set. A structure is defined as a general concept that covers many structural forms, such as graphs, lattices, trees, sequences, sets, single items, or combinations of such structures.

The identification of sets of items, products, symptoms and characteristics which often occur together in a given database can be seen as one of the most basic tasks in Data Mining. So FIM is the most important of all the pattern mining types, and mining Frequent Itemsets from Big Data in particular is a highly researched area.

Kinds of FIM

Constrained frequent itemset: An itemset X is a constrained frequent itemset in set S if X satisfies a set of user-defined constraints. A naïve solution is to find all frequent sets and then test them for constraint satisfaction. However, the mining process can be done more efficiently by pushing the constraints as deeply as possible inside the frequent pattern computation.

Closed frequent itemset: An itemset X is a closed frequent itemset in set S if X is both closed and frequent in S.

Maximal frequent itemset: An itemset is maximal frequent if none of its immediate supersets are frequent. That is, an itemset X is a maximal frequent itemset in set S if X is frequent and there exists no super-itemset Z such that X ⊂ Z and Z is frequent in S.

Top-k frequent itemset: An itemset X is said to be a top-k frequent itemset in set S if X is among the k most frequent itemsets for a user-specified value k.

Near-match frequent itemset: An itemset X is a near-match frequent itemset if its support count equals that of its near or almost-matching itemsets.
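Given the complete set of frequent itemsets and their supports from some FIM run, the closed and maximal varieties defined above can be derived by checking supersets. A minimal sketch with hypothetical supports:

```python
# Frequent itemsets with supports (hypothetical output of a FIM run).
freq = {frozenset("a"): 4, frozenset("b"): 4, frozenset("c"): 2,
        frozenset("ab"): 3, frozenset("ac"): 2, frozenset("bc"): 2,
        frozenset("abc"): 2}

def is_closed(x):
    # Closed: no proper superset has the same support.
    return all(not (x < y) or freq[y] < freq[x] for y in freq)

def is_maximal(x):
    # Maximal: no proper superset is frequent at all.
    return all(not (x < y) for y in freq)

closed = [set(x) for x in freq if is_closed(x)]
maximal = [set(x) for x in freq if is_maximal(x)]
print(closed)   # {a}, {b}, {a,b}, {a,b,c}: the rest are absorbed by
                # an equal-support superset
print(maximal)  # [{'a', 'b', 'c'}]
```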

1.4 Big Data

Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze. The data can be structured, semi-structured or unstructured data which can be mined for information. Here the definition uses the size or volume of the data as the only criterion. Another interesting definition of Big Data is that it is a technology designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis. There are three main characteristics of Big Data: the data itself, the analytics of the data, and the presentation of the results of the analytics. This definition is based on the 3Vs model coined by Doug Laney in 2001. He did not use the term Big Data but predicted that data management would get more and more important and difficult. He then identified the 3Vs, shown in Figure 1.2 (data volume, data velocity and data variety), as the biggest challenges of data management. Data volume is the size of the data, data velocity is the speed at which data arrives, and data variety refers to data extracted from different sources, which can be unstructured or semi-structured.

Figure 1.2 Big Data characterization

Data Volume

The size of data is growing at a rapid speed nowadays. A text file is a few kilobytes, a sound file is a few megabytes, while a full-length movie is a few gigabytes. These data come from many sources. In the olden days, all data was generated within the company itself by the employees, but data is now generated by employees, partners and customers. In addition, machines also generate data for a company which has many branches. For example, hundreds of millions of smart phones send a variety of information to the network infrastructure. Handling such a huge volume of data is obviously the most widely recognized challenge. Processing a huge volume of data is not an issue if the data is loaded in bulk and there is enough processing time, but handling even a small amount of data gets problematic if the data is unstructured, arriving at high velocity, and also needs to be processed within seconds. From this it is very clear that it is not

the volume alone which decides whether the data is big; other data characteristics like velocity and variety must also be considered.

Data Velocity

The speed at which data arrives is known as velocity. Data velocity is twofold. First, the rate at which new data flows in and existing data gets updated, called the acquisition rate challenge. Second, the time acceptable to analyze and process the data while the data is on the move, called the timeliness challenge. The first problem is the data acquisition rate. The challenge involved here is how to receive, filter, manage and store the continuously arriving data. Traditional relational database systems are not suitable for this task as they incur much overhead in the form of locking, logging, buffer pool management and formulating threaded operations. One way to handle this problem is to analyze the flow of data for anomalies and discard unnecessary data by filtering, storing only the important data. This filtering of the data stream without missing any important data not only requires an intelligent tool but also consumes time and resources. Also, it is not always possible to filter data. Another necessary task is to automatically extract and store metadata together with the streaming data. This is useful to track how data is stored and measured. The second problem concerns the reaction to incoming data streams. In many situations real-time analysis becomes necessary, as otherwise the information becomes useless. As mentioned earlier, it is not only sufficient to analyze the data and extract information in real time but also necessary to react on it and apply the result. The speed of handling the whole case becomes the decisive issue. For the mining of data streams, not only the speed but also the time interval of the actual processing of the data is important. This is very important with respect to data streams with the characteristic feature of evolving over time.

Data Variety

Variety refers to the diversity of data sources. Data comes from different sources in various types, such as web data, social media, machine-generated logs, human-generated data, biometrics, transactional data, etc. This not only implies an increased number of data sources but also structural differences among those data sources. Furthermore, the structure or schema of different data sources is not necessarily compatible, and the semantics of the data can be inconsistent. Therefore, managing and integrating multi-structured data from a wide variety of sources poses many challenges. First comes the storage and management of this data in database-like systems. Relational database management systems may not be suitable for all types and formats of data. The next challenge is related to the semi- and fully unstructured character of the data. In the context of integrating different data sources, different data, be it structured, unstructured or semi-structured, needs to be transformed to some structure or schema that can be used to relate the different data sources.

1.5 Need for the Study

FPM has proved to be one of the promising fields for carrying out research work because of its wide use in all Data Mining tasks such as clustering, classification, prediction and association analysis. Mining frequent itemsets enables humans to take better decisions in a wide range of applications, including market basket analysis, traffic signal analysis and, in Bioinformatics, identifying frequently co-occurring protein domains in a set of proteins. Researchers have proposed many algorithms for FIM, but execution time and storage space make the key difference among these algorithms. Pruning unimportant patterns has become another important research area in FIM.

Now it is an era of Big Data. There are some applications where frequent patterns have to be extracted from big datasets. One such example is web log mining, which helps us to identify the web pages frequently visited by users. By using this information one can improve the advertising process. To handle big datasets, parallel mining becomes necessary, for which the MapReduce concept can be used.

1.6 Problem Statement

The set of all frequent itemsets is typically very large, yet some applications do not require all of them; the subset that is really needed by these applications usually contains only a small number of itemsets. Thus more time is spent in considering all the unwanted frequent itemsets, and memory is also wasted in storing these unimportant frequent itemsets. So constraints can be introduced to remove the unimportant itemsets. The existing algorithms hold good only when the dataset is small, so there is a need to propose an efficient algorithm to find frequent itemsets from big datasets using constraints. In almost all FPM algorithms, Frequent 1-itemsets are generated to find the support count (occurrences) of each item in the entire database. This is itself a tedious task in generating frequent itemsets when considering the hugeness of the modern databases available, and no explicit strategy has been outlined in these algorithms to perform it. So an efficient data structure can be proposed to find the support count of each item.

1.7 Objectives of the Study

The major objectives of this study are as follows:

To extract Frequent Itemsets from big datasets
To reduce memory wastage using constraints

To speed up the execution time using the support count tree and cache
To extract cumulative Frequent Itemsets from multiple files

1.8 Methodology of the Study

In this research work, the paradigm of constraint-based itemset mining in Big Data is introduced. In order to extract Frequent Itemsets from Big Data, the MapReduce function in Hadoop is used. Constraints focus the mining only on the interesting or required data, thus reducing the number of patterns extracted to those of potential interest. To extract Frequent 1-itemsets, an efficient support count tree algorithm has been embedded in the FP growth algorithm to mine Frequent Itemsets from the big dataset. To further increase the efficiency of MapReduce, a Modified MapReduce algorithm has been proposed. In this algorithm, a cache has been included in the Map phase to maintain the support count tree for calculating the Frequent 1-itemsets of each mapper. This reduces the total time of calculating Frequent 1-itemsets, since it bypasses the shuffle, sort and combine tasks of each Mapper in the original MapReduce tasks. This in turn reduces the execution time of generating Frequent Itemsets of the entire database; a small illustration of this caching idea follows below.
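The proposed support count tree and its cache are developed in Chapters 3 and 5; purely as an illustration of the general principle, the sketch below shows mapper-side aggregation (often called in-mapper combining), where counts accumulated in a local cache shrink what must pass through the shuffle and sort stages. The shard data is hypothetical, and a Counter merely stands in for the support count tree:

```python
from collections import Counter

shard = [["bread", "jam"], ["bread", "milk"], ["bread", "jam"]]

def plain_mapper(transactions):
    # One (item, 1) pair per occurrence: 6 pairs head into shuffle/sort.
    for t in transactions:
        for item in t:
            yield item, 1

def caching_mapper(transactions):
    # A mapper-side cache aggregates locally first; only one pair per
    # distinct item (3 here) leaves the mapper.
    cache = Counter()  # stand-in for the thesis's support count tree
    for t in transactions:
        cache.update(t)
    yield from cache.items()

print(len(list(plain_mapper(shard))))    # 6
print(len(list(caching_mapper(shard))))  # 3
```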

1.9 Organization of the Thesis

A concise outline of the various chapters of the thesis is as follows:

Chapter 1 deals with the introduction to Data Mining, association rule mining and Big Data. An overview of the system and the need, objectives and methodology of the study are given. The organization of the thesis is also presented.

In Chapter 2, an in-depth analysis is made of the most influential algorithms which have given significant contributions to several efficiency issues of FPM problems in Big Data. In addition, a literature survey on FPM using constraints is also presented.

Chapter 3 introduces the concept of FIM, its notations and methodologies. FIM methods based on the layout of the data (horizontal as well as vertical) are described. An algorithm of each basic category of FIM methods, such as Apriori, FP growth and Eclat, is explained, followed by a description of the proposed improved FP growth algorithm and the experimental results of the proposed work.

Chapter 4 describes the need for constrained pattern mining and also describes the different types of constraints for FIM. Constrained FIM in Big Data is also explained.

Chapter 5 deals with a description of the uses of a cache and how it can be used in the Modified MapReduce algorithm for FIM in Big Data. Experimental results of the proposed work are shown.

In Chapter 6, conclusions of the research work are given and the scope for future study is indicated.

CHAPTER 2

SURVEY OF LITERATURE

A detailed survey of the literature relevant to FPM in Big Data is presented, along with a literature survey on constraint pattern mining.

2.1 Survey on the existing algorithms for FPM in Big Data

Meera Narvekar and Shafaque Fatma Syed (2015) have designed a new technique which mines out all the frequent itemsets without the generation of conditional FP-Trees. This algorithm increases efficiency by scanning the database only once. There are three steps involved in the proposed technique. In the first step, a D-tree is generated by scanning the database once. After scanning the database and obtaining the D-tree, they use the D-tree for further processing. To count the number of occurrences of each item in the database, the D-tree is scanned and the count is recorded as the frequency of each item. Since this algorithm uses the D-tree to find the occurrences of each item, reading a transaction from the memory-resident tree structure is faster than scanning it from the disk. In the second step, a new improved FP-Tree is constructed, as well as a node table which contains nodes and their frequencies, using the D-tree and support count as input. With the new improved FP-Tree, frequent patterns are generated.

PP-Mine, proposed by Xu. Y et al. (2002), is a novel coded Prefix Path Tree (PPT) approach which finds all the frequent itemsets through the constructed PPT. The PPT is constructed as follows. First, PP-Mine scans the database to find all the frequent items. For each transaction, the infrequent items are removed. The

remaining frequent items are sorted in descending frequency order and are inserted into the PPT. The PPT is similar to the FP-Tree and is an item-link-free tree structure. Moreover, each node in the PPT is arranged by frequency order and is assigned a calculated code. Although the PPT does not need item-links to be built in the tree initially, all the sibling nodes need to be sorted in a total order. The PP-Mine algorithm mines patterns in a subtree following a depth-first traversal order, and all patterns in a subtree are mined vertically. PP-Mine needs to recursively construct a large number of sub-header-tables when push-right and push-down operations occur. For any node in a PPT, PP-Mine checks if the itemset from the root to the node is frequent. Push-down is a depth-first traversal which builds a sub-header-table and puts the children of the node into the sub-header-table. The push-right strategy is to push the children of the node to their corresponding siblings which lie at the right side of the child nodes. For each push-right operation, PP-Mine needs to search all the children for a set of nodes from a sub-header-table. Therefore, it faces a large search space if the set of frequent 1-itemsets is large.

Mamunur Rashid Md et al. (2013) have addressed the problem of sensor association rule mining, which often produces a huge number of rules. Most of the already existing algorithms for this problem are either redundant or fail to reflect the true correlation relationship among data objects. So they have proposed mining a new type of sensor behavioral pattern called associated-correlated sensor patterns. This new behavioral pattern captures not only association-like co-occurrences but also the substantial temporal correlations implied by such co-occurrences in the sensor data. They have used a prefix-tree-based structure called the Associated-Correlated Sensor Pattern-tree, which facilitates the Frequent Pattern (FP) growth-based mining technique to generate all associated-correlated patterns from wireless sensor network data with only one scan over the sensor database. Through extensive performance studies, they have shown that their approach

is time- and memory-efficient in finding associated-correlated patterns compared with the existing most efficient algorithms.

Kumari Priyanka Sinha and Rajeshwar Puran (2014) have proposed a simple and easy-to-implement algorithm to find frequent 1-itemsets from the transactional database. In this algorithm they have used the concept of red-black trees. This data structure is readily available in the form of map in C++ and TreeMap in Java, and thus requires no extra effort in implementation. This algorithm can be easily embedded into any of the existing algorithms aimed at FPM. The steps to generate frequent 1-itemsets with a red-black tree are given below:

Step 1: A red-black tree with no elements is instantiated.
Step 2: The whole database is scanned. For each element of the database, loop through Step 3.
Step 3: Insert the element into the instantiated red-black tree. If the element is not in the tree, it is inserted as a key with the corresponding value initialized to 1. If the element already exists in the tree, its corresponding value is incremented by 1.
Step 4: Each key and its corresponding value are extracted from the red-black tree. This gives the count value of the frequent 1-itemsets.

Looping through the whole database takes O(n) time, where n is the total number of elements in the database. Also, for each element of the database, it takes O(log n) time to insert the element into the red-black tree. Thus the overall time complexity of the proposed algorithm comes out to be O(n log n).
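Python has no built-in red-black tree, so as a stand-in for the C++ map / Java TreeMap the authors mention, the four steps above can be sketched with an ordinary dictionary (average O(1) insertion instead of O(log n), but the counting logic is the same):

```python
def frequent_1_itemset_counts(database):
    counts = {}  # stands in for the red-black tree / TreeMap (Step 1)
    for transaction in database:          # Step 2: scan the whole database
        for item in transaction:
            if item not in counts:
                counts[item] = 1          # Step 3: insert new key with value 1
            else:
                counts[item] += 1         # Step 3: increment existing key
    return counts                         # Step 4: extract keys and counts

db = [["a", "b"], ["a", "c"], ["a", "b", "c"]]
print(frequent_1_itemset_counts(db))  # {'a': 3, 'b': 2, 'c': 2}
```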

In CATSIM Tree, Ketan Modi et al. (2012) have made modifications to the Compressed and Arranged Transaction Sequences (CATS) Tree to support incremental mining. The CATS Tree extends the idea of the FP-Tree to improve storage compression and allow FPM without the generation of candidate itemsets. It allows mining with only a single pass over the database. The algorithm which they have proposed should satisfy the following properties: 1) during incremental updates the frequencies change, and the ordering of items should not be affected by these changes; 2) the frequency of each node in the CATSIM tree must be at least as high as the sum of the frequencies of its children.

Web log files are generated as a result of the interaction between the client and the service provider of the web. A web log file contains much hidden valuable information. The navigation behavior of the user can be predicted if this information is mined. However, the task of discovering frequent sequence patterns from the web log is challenging. Sequential pattern mining plays a significant role in providing a promising approach to the access behavior of the user. Bina Kotiyal et al. (2012) have focused on adopting an intelligent technique to identify which web pages a person has the willingness to access. They have created an efficient and effective personalized web service for accessing web pages. They have used two intelligent algorithms for predicting user behavior, namely Apriori and Eclat, and have also given a performance comparison of the two algorithms in terms of time and space complexity for the filtered data.

In BitTableFI (Dong Jie and Han Min 2006), a special data structure called BitTable is used horizontally and vertically to compress the database for quick candidate itemset generation and support counting, respectively. The structure can also be used in many Apriori-like algorithms to improve their performance. The algorithm has significant differences from Apriori and all other algorithms extended from Apriori. It first generates frequent 1-itemsets, and all non-frequent 1-itemsets are neglected to reduce the size of the BitTable. Then, using quick candidate itemset generation and a quick candidate itemset support count algorithm, the frequent itemsets are generated. They have proved that their algorithm outperforms algorithms based on Apriori in terms of execution

time, because the bitwise AND/OR operation is greatly faster than the traditional item-comparing method used in many Apriori-like algorithms; a small sketch of this idea follows below.

Researchers have developed a lot of algorithms and techniques for determining association rules, in which the generation of the candidate set is the main problem. Among the existing techniques, the Frequent Pattern (FP) growth method is the most efficient and scalable approach. Rezbaul Islam A.B.M et al. (2011) have proposed a new and improved FP-Tree with a table and a new algorithm for mining association rules. This algorithm mines all possible frequent itemsets without generating the conditional FP-Tree. It also provides the frequency of the frequent items, which is used to estimate the desired association rules. In this algorithm the generated FP-Tree consists of two main elements: the tree and a table. The tree represents the correlation among the items more specifically, and the table, called the sparse table, is used to store the spare items. The table consists of two columns: one is the item name and the other is the frequency, i.e., how many times the item occurs in the table. The main reason to introduce the spare table is that in a traditional FP-Tree a lot of branches are created and the same item appears in more than one node, whereas in this improved FP-Tree every distinct item has only one node. So it is simpler and more efficient for further processing.

To reduce the complexity of generating maximal frequent itemsets, Lin. D and Kedem Z.M (2002) presented a new approach combining both top-down and bottom-up approaches. In the bottom-up approach, they start from 1-itemsets and then move one level up in each iteration, proceeding up to n-itemsets like the Apriori algorithm, whereas in the top-down approach they start from n-itemsets, then move many levels down in each iteration, proceeding down to 1-itemsets. Both the bottom-up and top-down approaches individually identify the maximal frequent itemsets by examining their candidates.
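The speed-up that BitTableFI attributes to bitwise operations can be seen in a small sketch of vertical bit-vector support counting (an illustration of the principle, not the paper's exact data layout):

```python
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}]

# Vertical layout: one bit per transaction in each item's vector.
bits = {}
for tid, t in enumerate(transactions):
    for item in t:
        bits[item] = bits.get(item, 0) | (1 << tid)

def support(itemset):
    # Intersecting bit vectors with AND replaces item-by-item comparison;
    # the population count of the result is the support.
    vector = (1 << len(transactions)) - 1  # all-ones mask
    for item in itemset:
        vector &= bits[item]
    return bin(vector).count("1")

print(support({"a", "b"}))  # 2 (transactions 0 and 2)
```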

Pei Jian et al. (2007) have proposed H-Mine(Mem), a memory-based hyperstructure mining of frequent patterns. It uses the concept of a projected database to mine frequent patterns. H-Mine(Mem) is efficient when the frequent-item projections of a transaction database plus a set of header tables can fit into main memory, but when the projected database becomes big it becomes inefficient. To overcome this, they have used a database partitioning technique which can scale up to very large databases: no matter how large the database is, it can be mined with at most three scans of the database. In the first scan, the algorithm finds the globally frequent items. In the second scan, the algorithm mines the partitioned database using H-Mine(Mem), and in the third scan it verifies the globally frequent patterns. Since every partition is mined efficiently using H-Mine(Mem), the mining of the whole database is highly scalable.

In MRPrePost, Jinggui Liao, Yuelong Zhao and Saiqin Long (2014) have proposed a parallel algorithm based on the Hadoop platform. The MRPrePost algorithm can be adapted to mine frequent patterns from large databases. In this algorithm, they first divide the database into blocks using the default file block policy of Hadoop. Each block is allocated to worker nodes. First, in the MapReduce stage, the number of occurrences of each item in each shard is counted; this gives the frequent 1-itemsets. In the next step, frequent items are sorted based on the sequence of the F-list and the output is produced. From this output an FP-Tree is formed using MapReduce again, and finally frequent itemsets are generated from the FP-Tree.

Yen S.J et al. (2012) investigated how to improve the efficiency of mining frequent itemsets. Since database scans can be significantly reduced by constructing an FP-Tree, and it is fast to search a small set of candidates, they have proposed the Search Space Reduced (SSR) algorithm for generating frequent patterns, which combines the advantages of the FP-Tree and candidate generation. The SSR algorithm first constructs an FP-Tree to store all the

information in the transaction database. After building a compact sub-tree for each frequent item from the constructed FP-Tree, SSR generates a small set of candidates in batch from the sub-tree, such that the search time and storage space can be reduced. They have compared their algorithm with the Co-Occurrence Frequent Item (COFI) algorithm. The sub-tree built by the SSR algorithm is smaller than the sub-tree built by COFI, and the search space of the algorithm is also much smaller than that of COFI. Therefore, the algorithm is more efficient than COFI in terms of execution time and memory storage.

Cheung William et al. (2003) suggested the concept of CATS, which works on the principle of interactive mining: build once, mine many. The problem with the CATS Tree is that swapping, merging and deleting nodes takes too much time. Storage is also a constraint for this type of tree structure: the researchers assumed an unlimited amount of memory, but in practical applications this is not possible.

Due to unexpected database growth, a mechanism which supports incremental mining is very much essential, as otherwise the complete mining procedure needs to be started from scratch. With this idea in mind, Khan Q.I et al. (2005) proposed a new tree structure called the Canonical Tree. In comparison to the CATS Tree, here all the items are ordered according to some specific ordering, for example lexicographical or alphabetical. Available data can be in any order, and arranging the data in some specific sequence is itself a typical task; this is the additional overhead of the mechanism. The tree size is also dependent on the items appearing in the transactions.

Juan Li and Ming De-ting (2010) proposed the QFP algorithm, also known as Rapid Association Rule Mining. By scanning the database only once, the QFP algorithm can convert a transaction database into a QFP-Tree after data preprocessing, and then the association rule mining is performed on the tree. The QFP algorithm is more efficient than the FP growth algorithm because it retains

the complete information for mining frequent patterns and doesn't destroy the long patterns of any transaction. The input of the QFP algorithm is the same as that of the FP growth algorithm or the Apriori algorithm; therefore, the QFP algorithm may apply to any situation which is suitable for the FP growth algorithm or the Apriori algorithm.

Shamila Nasreen et al. (2014) analyzed a range of widely used algorithms for finding frequent patterns, with the purpose of discovering how these algorithms can be used to obtain frequent patterns over large transactional databases. They have given a comparative study of algorithms like the ECLAT algorithm, the Apriori algorithm, the FP growth algorithm, Rapid Association Rule Mining (RARM) and Associated Sensor Pattern Mining of Data Stream algorithms. They have studied each algorithm's strengths and weaknesses for finding patterns among large itemsets in database systems. They have found that all these algorithms can be used to mine only static databases and cannot be used to mine dynamic databases.

The sampling algorithm proposed by Toivonen. H (1996) scans the database by picking a random sample from the database, then finds all relatively frequent patterns in that sample, and then verifies the results with the rest of the database. In case the sampling method does not produce all frequent patterns, the missing patterns can be found during a second pass by generating all remaining potentially frequent patterns with varied supports. By decreasing the support threshold, the probability of such a failure can be reduced. However, for a reasonably small probability of failure, if the threshold is drastically decreased, it may cause a combinatorial explosion of the number of candidate patterns.
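A compact simulation of Toivonen's sampling idea with toy data (the thresholds are arbitrary, and only 1-itemsets are checked to keep the sketch short):

```python
import random

random.seed(0)
db = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}] * 20
minsup, lowered = 0.4, 0.3   # lowered threshold guards against misses

def sup(itemset, data):
    return sum(1 for t in data if itemset <= t) / len(data)

sample = random.sample(db, 20)           # pass over a small random sample
items = {i for t in db for i in t}
candidates = [frozenset([i]) for i in items
              if sup(frozenset([i]), sample) >= lowered]
frequent = [set(c) for c in candidates   # verify against the full database
            if sup(c, db) >= minsup]
print(frequent)  # expect all of {'a'}, {'b'}, {'c'} with this toy data
```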

Obulesu. O et al. (2014) have given a new framework to find spatiotemporal patterns in Big Data. Existing algorithms are good at computing the necessary patterns but are more problematic when applied to Big Data. The Big Data challenge is becoming one of the most exciting opportunities for the coming years. Thus, in their paper they have focused on a broad overview of pattern mining algorithms and their significance in spatiotemporal databases, their current status, trade-offs, and a forecast of the future of big data pattern mining.

In Fast, Load balancing and Resource efficient (FLR) mining, Lin K.W and Chung Sheng-Hao (2015) have proposed a fast, load-balancing and resource-efficient mining algorithm for discovering frequent patterns in distributed computing environments. They have discussed how to efficiently determine the appropriate number of computing nodes, considering execution efficiency and load balancing in a distributed environment. FLR-Mining determines the appropriate number of computing nodes automatically and achieves better load balancing when compared with existing methods.

Schlegel Benjamin et al. (2011) proposed two novel data structures, namely the CFP-Tree (Compressed Frequent Pattern Tree) and the CFP-array, which reduce memory consumption by about an order of magnitude. This allows significantly larger data sets to be processed in main memory. These data structures are based on structural modifications of the prefix tree that increase compressibility, node ordering and indexing. The key to memory efficiency is the two data structures, the CFP-Tree and the CFP-array. The CFP-Tree is optimized for the build phase and is based on structural changes to the FP-Tree, a highly tuned physical representation by means of a ternary tree, and various lightweight compression techniques. The CFP-Tree provides a high compression ratio. After the initial build phase, the CFP-Tree is transformed into

a different data structure called the CFP-array. The cost of this transformation constitutes only a negligible fraction of the overall FP growth runtime. Since different access paths are required in the mining phase, the CFP-array uses an array-based physical representation of the FP-Tree and employs intelligent node ordering, indexing, and compression. Experimental results have shown that memory consumption is reduced, leading to multiple order-of-magnitude performance improvements when compared to plain FP growth.

Han et al. (2000) have used the divide-and-conquer approach to decompose the search space based on length-1 suffixes. Additionally, they have reduced database scans during the search by leveraging a compressed representation of the transaction database, via a data structure called an FP-Tree. The FP-Tree is a specialization of a prefix tree in which an item is stored at each node along with the support count value of the path from the root to that node. Each database transaction is mapped onto a path in the tree. The FP-Tree also keeps pointers between nodes containing the same item, which helps to identify all itemsets ending in a given item. From the FP-Tree, conditional patterns are generated, and from these the frequent itemsets are found. Experimentally, they have shown that this algorithm is better than the Apriori algorithm.
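A compact sketch of the construction just described (a simplified FP-Tree: frequency-ordered insertion, shared prefixes, per-item node links; the FP growth mining step itself is omitted):

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(db, minsup):
    freq = Counter(i for t in db for i in t)
    root, links = Node(None, None), defaultdict(list)  # links: per-item node pointers
    for t in db:
        # Keep frequent items, ordered by descending global frequency.
        items = sorted((i for i in t if freq[i] >= minsup),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = Node(item, node)
                links[item].append(node.children[item])
            node = node.children[item]
            node.count += 1  # shared prefixes are stored once, counts accumulate
    return root, links

db = [["a", "b"], ["a", "c"], ["a", "b", "c"]]
root, links = build_fp_tree(db, minsup=2)
print({i: sum(n.count for n in nodes) for i, nodes in links.items()})
# {'a': 3, 'b': 2, 'c': 2}
```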

Koh J.L. and Shieh S.F. (2004) have proposed the AFPIM (Adjusting FP-Tree for Incremental Mining) algorithm for incremental mining. Similar to the FP-Tree, it only keeps frequent items. In this algorithm, a threshold called PreMinsup is considered, whose value is set less than the minimum support. The insertion, deletion or modification of transactions may affect the frequency and order of the items, since the items are ordered based on their number of occurrences. More specifically, items in the tree are adjusted when the order of the items changes. The AFPIM algorithm swaps such items by applying the bubble sort algorithm, which involves huge calculation.

Most of the previous studies on mining frequent patterns were based on an Apriori approach, which requires a large number of database scans and operations for counting pattern supports in the database. The size of each set of transactions may be so massive that it makes it difficult to perform traditional Data Mining tasks. Anurag Choubey et al. (2012) proposed a graph structure that captures only those itemsets that are needed, condensing a sufficiently immense dataset into a sub-matrix representing important weights without giving any chance to outliers. They have devised a strategy that covers the significant facts of the data by drilling the large data down into a succinct form of an adjacency matrix at different stages of the mining process. The graph structure is so designed that it can be easily maintained, and the trade-off in compressing the large data values is reduced. They have shown that the graph-based approach is faster than the partition algorithm.

In their research, Syed Khairuzzaman Tanbeer, Chowdhury Farhan Ahmed and Jeong Byeong-Soo (2009) have proposed a parallel and distributed algorithm using a parallel pattern tree. It requires only one database scan to construct the parallel pattern tree and in this way reduces I/O cost. It considers a distributed memory architecture where each node contains all resources locally. The algorithm is divided into three phases. Phase I: it first accepts the database contents in horizontal partitions in any canonical order, constructs the tree in one scan, and then restructures it in global-frequency-descending sequence of items. Phase II (local mining): each local parallel pattern tree is individually mined in parallel for discovering global frequent patterns. Phase III: a final sequential step which collects frequent patterns from all local parallel pattern trees and generates the global frequent patterns. The drawback of this algorithm is that the original database is partitioned but not pruned of infrequent 1-itemsets, and the original database alone is used throughout the algorithm, so it may use more memory for huge databases.

Dharmesh Bhalodiya et al. (2013) have used the dynamic programming approach to facilitate fast candidate itemset generation and searching. In Data Mining, association rule mining and FPM are both key features of market-basket analysis. One of the basic market-basket analysis algorithms is the Apriori algorithm, which generates all candidate frequent itemsets. They have proposed a new, improved method to generate candidate 1-itemsets and candidate 2-itemsets. This approach requires only one database scan for both the candidate 1-itemsets and the candidate 2-itemsets. Since dynamic programming is a technique for designing efficient algorithms, they were able to store the previous solutions; whenever a subproblem reappears, its result can be accessed directly from the pre-calculated values without creating more overhead. Xun Yaling et al. (2015) designed a parallel frequent itemset mining algorithm called FiDoop using the MapReduce programming model. To achieve compressed storage and avoid building conditional pattern bases, FiDoop incorporates the Frequent Items Ultrametric (FIU) tree rather than conventional FP-Trees. In FiDoop, three MapReduce jobs are implemented to complete the mining task. The first MapReduce job discovers all frequent items, i.e. frequent 1-itemsets; in this phase, the input of the Map tasks is the database and the output of the Reduce tasks is the frequent 1-itemsets. The second MapReduce job scans the database to generate k-itemsets by removing infrequent items from each transaction. The last MapReduce job constructs the k-FIU-tree and mines all frequent k-itemsets. This step is responsible for 1) decomposing itemsets, 2) constructing k-FIU-trees, and 3) mining frequent itemsets; the reducers perform combination operations by constructing small ultrametric trees and mining these trees separately.

To improve FiDoop's performance, they have developed a workload balance metric to measure load balance across the cluster's computing nodes. They have also proposed FiDoop-HD, an extension of FiDoop, to speed up mining for high-dimensional data analysis. Through extensive experiments using real-world celestial spectral data they have demonstrated that their proposed solution is efficient and scalable. Ho et al. (2011) concentrated their efforts on improving performance by changing the data flow in the transition between Mappers and Reducers. Originally, Hadoop employs an all-to-all communication model between Mappers and Reducers, which may saturate the network bandwidth during the shuffling phase. This problem is known as the Reducers Placement Problem (RPP). They modeled the traffic in a multiple-rack environment, formulated the problem using optimization techniques, and proposed two algorithms and an analytical method as a solution; they developed a greedy algorithm to find the optimal solution of the problem. Sabita Barik et al. (2010) have proposed a hybridized Fuzzy FP growth approach for efficient frequent-pattern-based clustering, to find the genes that form frequent patterns showing similar phenotypes leading to specific symptoms of a specific disease. In the past, most approaches for finding frequent patterns were based on the Apriori algorithm, which generates and tests candidate itemsets (gene sets) level by level. Apriori leads to iterative database (dataset) scans and high computational cost, and it also suffers from mapping the support and confidence framework to a crisp boundary. To find the frequent patterns from gene expression data they have used the FP growth algorithm, an enhancement over Apriori. The FP growth algorithm constructs the conditional frequent pattern tree and performs the mining on this tree. The FP-Tree is an extended prefix-tree structure storing crucial, quantitative information about

frequent sets. The FP growth method transforms the problem of finding long frequent patterns into recursively searching for shorter ones and then concatenating the suffix. They have validated their model against existing Apriori models by considering various parameters. They have shown that their model outperforms the Apriori model in run time for finding a given number of patterns, and that scalability improves significantly as both the attributes and the objects increase. This approach not only outperforms Apriori with respect to run time, but also builds a tight tree structure to keep the membership values of the fuzzy regions, overcoming the sharp boundary problem, and it takes care of scalability issues as the number of genes and conditions increases. Sagiroglu, S. and Sinanc, D. (2013) have given a review of Big Data. They have described big data content, its scope, methods, samples, advantages and challenges. The critical issues concerning Big Data are privacy and security. Their big data samples review domains such as the atmosphere, biological science and research, and the life sciences. They conclude that any organization in any industry having big data can benefit from its careful analysis, and that Knowledge Discovery makes it possible to extract information from complicated, big data sets. Their overall evaluation is that data is growing and becoming more complex, and that the challenge lies not only in collecting and managing the data, but also in extracting useful information from it. Tanbeer, S.K. et al. (2008) have proposed a Compact Pattern tree (CPT). It is a compact prefix-tree structure which is constructed with one database scan. To increase efficiency, they have introduced a tree restructuring process. The CPT is constructed in two phases: 1. Insertion phase: inserts transactions into the CPT according to item appearance order and updates the frequency

count of the respective items in a list called the I-list. 2. Restructuring phase: the I-list is rearranged into frequency-descending order of items and the tree nodes are restructured according to the new I-list. The Branch Restructuring Method is used in this phase: it restructures the tree by sorting the unsorted paths one after another and the I-list in frequency-descending order. CPT achieves a remarkable performance gain in overall runtime compared with other existing algorithms. In EMRSA-I and EMRSA-II, Mashayekhy Lena and Grosu Daniel (2014) proposed a framework for improving the energy efficiency of MapReduce applications while satisfying the service level agreement. They first modeled the problem of energy-aware scheduling of a single MapReduce job as an Integer Program. They then proposed two heuristic algorithms, called Energy-Aware MapReduce Scheduling Algorithms (EMRSA-I and EMRSA-II), that find assignments of map and reduce tasks to machine slots so as to minimize the energy consumed when executing the application. Iona Sudheendran et al. (2015), in the Dynamic FP-Tree and Eclat Method (DFEM), have proposed a method that applies the threshold dynamically at runtime to efficiently fit the characteristics of the database during the mining process. DFEM combines the FP growth and Eclat strategies for mining. An FP-Tree is used to store the database in memory in a compact manner; during the mining process this tree is used recursively to find the frequent patterns. The switching between FP growth and Eclat happens based on the defined threshold. The algorithm consists of three major parts: 1) Construction of the FP-Tree: the database is scanned to find all the frequent items and the header table is created; the database is scanned once more to obtain the frequent items so that the FP-Tree can be constructed.

2) Mining the FP-Tree: the FP growth algorithm is used to find all the frequent patterns from the conditional trees constructed recursively. Before the construction of each conditional tree its size is verified: if the size is small, a Bit Vector is generated; otherwise, an FP-Tree is created. 3) Mining the Bit Vector: all the Transaction Identifier (TID) bit vectors are collected from the database, and frequent patterns are searched for by logically ANDing these bit vectors recursively. New patterns are created by concatenating the suffix pattern from the previous steps. Zaharia, M. et al. (2011) have explained the importance of Spark. Spark is a distributed computing framework developed at the Berkeley AMP Lab. It offers a number of features to make big data processing faster. The fundamental feature of Spark is its in-memory parallel execution model, which is very useful for applications with iterative computations. The second key feature is that, differing from the fixed two-stage data flow model of MapReduce, Spark provides a very flexible Directed Acyclic Graph based data flow. These two features can significantly speed up the computation of iterative algorithms such as the Apriori algorithm and other machine learning algorithms. The programming model of Spark is built upon a new distributed memory abstraction called Resilient Distributed Datasets (RDDs), proposed for in-memory computations on large clusters. Spark caches the contents of the RDDs in the memory of the worker nodes, making data reuse substantially faster, and it tracks enough information to reconstruct RDDs when a node fails. In MapReduce Eclat (MREclat), Zhang Zhigang et al. (2013) have proposed a parallel algorithm based on the Map/Reduce framework. In this vertical-layout algorithm the frequent patterns are mined using the Eclat algorithm. The algorithms for mining frequent patterns in horizontal-layout databases differ from those, like Eclat, for mining vertical databases.

The MREclat algorithm consists of three steps. In the initial step, all frequent 2-itemsets and their TID-lists are obtained from the transaction database. The second step is the balanced group step, where the frequent 1-itemsets are partitioned into groups. The third step is the parallel mining step, where the data obtained in the first step is redistributed to different computing nodes; each node runs an improved Eclat to mine frequent itemsets. Finally, MREclat collects the output from each computing node and formats the final result. MREclat uses the improved Eclat to process data with the same prefix. It has been shown that MREclat has high scalability and a good speedup ratio. Wei, X., Ma, Y. et al. (2014) have presented a Parallelized Incremental FP growth (PIFP growth) mining strategy. The proposed algorithm successfully solves the incremental issues brought about by a dynamic threshold value and a changing database at the same time, avoiding repeated computation. This parallel mining strategy, based on the MapReduce framework, is implemented on Apache Hadoop. The experimental results have proved the effectiveness and advantages of PIFP growth. Existing Data Mining methods and algorithms have limitations when dealing with big data. For example, the Apriori algorithm needs to scan the data from external storage repeatedly to obtain the frequent itemsets, which increases input/output traffic and lowers the performance of the algorithm. Moreover, existing incremental algorithms cannot be applied in situations where both the threshold value and the database change. The PIFP algorithm, which uses the MapReduce framework, solves this problem. Experimental results have shown that this improved algorithm is effective in reducing the time spent on duplicated work. Chen Hui, Young Lin Tsau, Zhan Zhibing and Zhong Jie (2013) have proposed a parallel algorithm for mining frequent patterns in large transactional data. It uses an extended MapReduce framework. A number of subfiles are

obtained by splitting the mass data file. A bitmap computation is performed on each subfile to obtain its frequent patterns. By integrating the results of all subfiles, the frequent patterns of the overall mass data file are obtained. A statistical analysis method is used to prune insignificant patterns when processing each subfile. They have proved that the method is scalable and efficient in mining frequent patterns in big data. Sanket Thakare et al. (2016) have improved the PrePost algorithm. The PrePost algorithm is one of the well-known FPM algorithms; it is based on the N-list data structure for mining frequent itemsets. However, the performance of the PrePost algorithm degrades when it comes to processing a large amount of data. To handle big data they have used Hadoop. The Improved PrePost algorithm combines the features of Hadoop in order to process large data efficiently. The efficiency of the PrePost algorithm is enhanced by implementing a compact PrePost Cloud tree with the general tree method and by finding frequent itemsets without generating candidate itemsets. They have proposed an architecture for the Improved PrePost algorithm implemented on the cloud using Amazon Web Services (AWS). The results show that as the dataset size increases, the Improved PrePost algorithm gives 60% better performance. The Improved PrePost algorithm on the cloud consists of the following steps. Initially, the input dataset and the Improved PrePost jar files are located on the local system. An input bucket is created and these files are uploaded to an AWS Simple Storage Service (S3) bucket using the simple AWS API with the provided access and secret keys. Next, a Hadoop cluster is configured with different types of Elastic Compute Cloud (EC2) instances; the instance type is based on the computation requirement. In the Hadoop cluster, one instance is designated as the Master node and the others are designated as Slaves. After configuring Hadoop, the dataset and jar files are fetched from S3 storage to the Master EC2 instance. The AWS secret key and access key are

added to the HDFS-site file of the Hadoop configuration for the communication between the EC2 instances and S3. The Master EC2 instance node divides the dataset into fixed-size blocks and maps them to the different Slave EC2 instance nodes in the Hadoop cluster. The Map function on the Slave EC2 instance nodes maps the input dataset to sets of key-value pairs called intermediate results and applies the Improved PrePost algorithm to produce the output on the allocated dataset. The Master EC2 instance stores the final result in the HDFS file system, in an output bucket on S3 storage. The final result can be downloaded from the S3 bucket to the local system using the simple AWS API with the provided access and secret keys. This architecture makes the system capable of handling increased computation and storage requirements. Ebin Deni Raj and Dhinesh Babu L.D (2014) have discussed various scheduling policies in MapReduce. They have proposed a new scheduling technique called the Two Phase Scheduling Policy for resource allocation in MapReduce. They have used a simulated workload of textual data consisting of numbers, alphabets and special characters. They used six Matlab workers in the parallel computing toolbox of Matlab, and a generic cluster was created to mimic the MapReduce framework. The workload for this algorithm ranges from one GB to five GB of data, consisting of text files only. Hongyan Liu et al. (2006) proposed a new search strategy integrating top-down mining with a novel row enumeration tree, which makes full use of the pruning power of the minimum support threshold to cut down the search space dramatically. With this search strategy they have designed an algorithm called TD-Close to find the complete set of frequent closed patterns from very high dimensional data. In addition, a new method, called closeness-checking, was developed to check efficiently and effectively whether a pattern is closed. Unlike other existing closeness-checking methods, it does not need to

scan the mining data set or the result set, and it is easy to integrate with the top-down search process. Using the above two techniques they have designed and implemented a method to discover the complete set of frequent closed patterns. They have shown experimentally that this algorithm is more efficient and uses less memory than bottom-up-search-style algorithms such as Carpenter and FPclose. Yang X.Y. et al. (2010) have used the MapReduce model to handle huge datasets on a Hadoop distributed computing environment for FPM. MapReduce is first used to find the frequent 1-itemsets by scanning the database once. Then the MapReduce model is used again, where each mapper generates a subset of the candidate itemsets. These candidate itemsets are passed to the reducers to prune candidates based on the minimum support threshold and to generate a subset of the frequent itemsets. Multiple MapReduce iterations are necessary to produce the frequent itemsets. At the final stage the output of all reducers is combined to generate the final frequent itemsets. Although they implemented a parallel Apriori algorithm, it still suffered from the drawbacks of multiple database scans and huge candidate set generation. Chen Qi, Liu Cheng and Xiao Zhen (2013) pointed out some pitfalls in previous works on multiple speculative execution strategies. Some of the pitfalls are: i) using the average progress rate to identify slow tasks, while in reality the progress rate can be unstable and misleading; ii) inability to handle appropriately the situation where there is data skew among the tasks; iii) not considering whether backup tasks can finish earlier when choosing backup worker nodes. To overcome this they developed a new strategy, Maximum Cost Performance (MCP), which improves the effectiveness of speculative execution significantly. To accurately and promptly identify stragglers, they provided the following methods in MCP: i) use both the progress rate and the process bandwidth within a phase to select slow tasks; ii) use an Exponentially Weighted Moving Average to

predict the process speed and calculate a task's remaining time; iii) determine which task to back up based on the load of the cluster using a cost-benefit model. To choose proper worker nodes for backup tasks, they considered both data locality and data skew. MCP was evaluated in a cluster of 101 virtual machines running a variety of applications on 30 physical servers. Experimental results show that MCP can run jobs up to 39% faster and improve cluster throughput by up to 44% compared to Hadoop. Shi Xuanhua, Chen Ming and He Ligang (2014) have developed Mammoth, a new MapReduce system which aims to improve MapReduce performance through global memory management. In Mammoth, they designed a novel rule-based heuristic to prioritize memory allocation and revocation among execution units (mapper, shuffler, reducer, etc.), to maximize the holistic benefit of the Map/Reduce job when scheduling each memory unit. They developed a multi-threaded execution engine, based on Hadoop but running in a single Java Virtual Machine (JVM) on each node. In the execution engine, they implemented the memory scheduling algorithm to realize global memory management, on top of which they further developed techniques such as sequential disk accessing, multi-cache and shuffling from memory, and solved the problem of full garbage collection in the JVM. Extensive experiments were conducted to compare Mammoth against the native Hadoop platform. The results show that the Mammoth system can reduce the job execution time by more than 40% in typical cases, without requiring any modifications to the Hadoop programs. When a system is short of memory, Mammoth can improve the performance by up to 5.19 times, as observed for I/O-intensive applications such as PageRank. They have also compared Mammoth with Spark. Although Spark can achieve better performance than Mammoth for interactive and iterative applications when memory is sufficient, experimental results show that for batch processing applications Mammoth can adapt better to

various memory environments, outperforming Spark when memory is insufficient and obtaining similar performance to Spark when memory is sufficient. Fumarola, F. et al. (2014) proposed a novel parallel, distributed Data Mining algorithm to find frequent patterns in Big Data. The key principle underlying the design of this algorithm is that one can make reasonable decisions in the absence of perfect answers. The algorithm exploits the Chernoff bound to mine approximate frequent itemsets with statistical error guarantees on their actual supports, given the classical minimum support threshold and a user-specified error bound. These itemsets are generated in parallel and independently from subsets of the input dataset by exploiting the MapReduce parallel computation framework. The sets of frequent itemsets from the subsets are aggregated and filtered using a novel technique to provide a single collection as output. The algorithm scales well on gigabytes of data and tens of machines, as experimentally proven on real datasets. The experiments also showed that the proposed algorithm returns a good, statistically bounded approximation of the exact results. Devender Banga and Sunitha Cheepurisetti (2014) proposed a framework for prefetching in the proxy server. Prefetching is important for reducing web latency and improving user satisfaction. The proposed framework uses the FP growth algorithm to determine a user's frequent sets of patterns. User data is collected from the web log history. Then, depending upon the threshold and the patterns generated by FP growth, a list of patterns to be prefetched is generated and passed to the predictor module. The predictor module prefetches the web objects by creating a session with the main server. Using the FP growth algorithm of association rule mining, frequent patterns can be determined without candidate generation. Thus they have shown that the proposed framework improves the efficiency of the existing network.

Ying, C.L. et al. (2015) proposed a new framework for mining high utility itemsets in big data. Existing algorithms cannot be applied in big data environments, where data are often distributed and too large to be dealt with by a single machine. They proposed a novel algorithm named Parallel mining High Utility Itemsets (PHUI-Growth) for pattern-growth-based parallel mining of high utility itemsets (HUIs) on the Hadoop platform, which inherits several merits of Hadoop such as easy deployment, low communication overhead, fault recovery and high scalability. In this algorithm, the MapReduce architecture is used to partition the whole mining task into smaller independent subtasks, and the Hadoop distributed file system is used to manage distributed data, allowing parallel discovery of HUIs from data distributed across multiple commodity computers in a reliable, fault-tolerant manner. PHUI-Growth adopts a novel strategy called discarding local unpromising items in the MapReduce framework to effectively prune the unnecessary intermediate itemsets produced during the mining process, which reduces the search space. They have shown that the performance of PHUI-Growth is higher than that of non-parallel HUI mining algorithms.

2.2 Survey on FPM using Constraints

Constraint-based mining, as the name suggests, involves the extraction of only the subset of patterns which satisfies user-specified constraints. Guns, T., Nijssen, S. and De Raedt, L. (2010) have explained k-pattern mining under constraints. On the basis of the interaction between the constraint and the mining process, constraints can be classified into the following four categories: anti-monotonic, monotonic, succinct and convertible. These classifications are neatly explained by Han Jiawei, Pei Jian and Yin Yiwen (2000). A constraint C is monotone if and only if whenever an itemset S satisfies C, so does any superset of S. Grahne, G. et al. (2000) and Pei, J. and Han, J. (2000) have explained the

monotone constraint with an example. Anti-monotonic and succinct constraints were defined and introduced by Ng, R.T., Lakshmanan, L.V., Pang, A. and Han, J. (1998). Succinct constraints can be pushed into the initial data selection process at the start of mining. Jia, L., Pei, R. and Pei, D. (2003) have pushed hard constraints into the frequent closed itemset mining process. The output of the algorithm is the same as that of a post-processing approach, i.e. first the closed itemsets are discovered and then they are tested against a given set of constraints. Pasquier, N. et al. (1999), in their A-Close algorithm, have addressed the problem of finding frequent itemsets using the closed itemset lattice framework. This limits the search space to the closed itemset lattice rather than the subset lattice, which in turn yields a reduced set of association rules. For this they have proposed the A-Close algorithm, which uses the closure property to compute closed frequent itemsets and pushes the constraints into the computation. A constraint C is anti-monotone if and only if whenever an itemset S violates C, so does any superset of S. Anti-monotonic constraints are used to restrain pattern growth during mining; the anti-monotone property is the same as the Apriori property used in the Apriori algorithm for mining frequent patterns. Convertible constraints, essentially, are constraints that can be converted into anti-monotonic or monotonic constraints. Pei et al. have introduced the convertible monotone and convertible anti-monotone properties to reduce the search space. Provided that there is a fixed order R of the items, a constraint C is defined as convertible anti-monotone if, whenever an itemset S satisfies C, so does any prefix of S. Similarly, the constraint C is defined as convertible monotone if, whenever an itemset S violates C, so does any prefix of S. The constraints mentioned above help in reducing the search space. However, there are some constraints which do not fall into any of the above-mentioned

categories. These constraints are known as tough constraints, and they do not reduce the size of the search space. The interestingness measure known as periodicity, used to extract periodic-frequent patterns, is an example of such a constraint; it is clearly explained by Tanbeer, S.K., Ahmed, C.F., Jeong, B.S. and Lee, Y.K. (2009). RARM, presented by Das Amithaba et al. (2001), is another method that uses a tree structure to represent the original database and to avoid the candidate generation process. Constraints are applied during the mining process to generate only those association rules that are interesting to the users, which guarantees an improvement in the efficiency of the existing mining algorithm. Tien Dung Do et al. (2003) presented a category-based algorithm, as well as an associated algorithm for constrained rule mining based on Apriori. This approach reduces the computational complexity of the mining process by bypassing most of the subsets of the final itemsets. Boulicaut, J. and Jeudy, B. (2001) proposed an approach where, instead of mining closed itemsets, free itemsets are mined, i.e. the minimal elements of each equivalence class of frequency (closed itemsets are the maximal elements of such classes). The output of the algorithm consists of all the free itemsets satisfying a given set of monotone and anti-monotone constraints. Pei Jian and Han Jiawei (2002) have developed efficient pattern growth methods for FPM. Pattern growth methods are not only efficient, but also effective in mining frequent patterns with various constraints: many tough constraints which cannot be handled by previous methods can be pushed deep into the pattern-growth mining process. In this paper, they have overviewed the principles of pattern growth methods for constrained FPM and sequential pattern mining, and have explored the power of pattern growth methods for mining with tough constraints.

Agrawal, R. and Srikant, R. (1994) have adopted an Apriori-like approach, which is based on the anti-monotone Apriori property: if any length-k pattern is not frequent in the database, none of its length-(k+1) super-patterns can be frequent. The essential idea is to iteratively generate the set of candidate patterns of length k+1 from the frequent patterns of length k (for k ≥ 1), and to check their corresponding occurrence frequencies in the database. Therefore, an intuitive methodology for pushing constraints into Apriori-like approaches is to use anti-monotonic constraints to prune candidates. However, many commonly used constraints are not anti-monotonic, such as avg(X) ≥ v, which requires that the average value (price) of the items in pattern X be greater than or equal to v. Thus, Apriori-like methods face challenges when mining with such constraints.

CHAPTER 3
MODIFIED FP GROWTH FOR FIM

3.1 Basics of Frequent Itemsets

The formal definition of the frequent pattern and association rule mining problems is stated as follows. Let I = {i1, i2, i3, ..., in} be a set of items, such as products (computer, CD, printer, papers, and so on). Let DB be a set of database transactions, where each transaction T is a set of items such that T ⊆ I. Each transaction carries a unique identifier called a TID. An association rule has the form X ⇒ Y, where X ∩ Y = ∅; X is called the antecedent and Y the consequent of the rule, where X and Y are each a set of items, i.e. an itemset or a pattern. The number of rows (transactions) of the given database containing the itemset X is denoted freq(X). The support of an itemset X is defined as the fraction of all rows containing the itemset, i.e. support(X) = freq(X) / |D|. The support of an association rule is the support of the union of X and Y, i.e.

support(X ⇒ Y) = freq(X ∪ Y) / |D|

The confidence of an association rule is defined as the percentage of rows in D containing itemset X that also contain itemset Y, i.e.

confidence(X ⇒ Y) = P(Y|X) = support(X ∪ Y) / support(X)

An itemset (or a pattern) is frequent if its support is equal to or more than a user-specified minimum support threshold. Association rule mining can be refined further by using constraints such as minimum support and minimum confidence.
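To make these definitions concrete, the following minimal Python sketch computes support and confidence over a small, hypothetical five-transaction database (the transactions and the rule computer ⇒ printer are illustrative assumptions, not data from this thesis):

# Hypothetical toy database used only to illustrate the definitions above.
transactions = [
    {"computer", "printer", "papers"},
    {"computer", "CD"},
    {"computer", "printer"},
    {"CD", "papers"},
    {"computer", "printer", "CD"},
]

def freq(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    # Fraction of all transactions containing the itemset.
    return freq(itemset) / len(transactions)

def confidence(x, y):
    # support(X U Y) / support(X) for the rule X => Y.
    return support(x | y) / support(x)

x, y = {"computer"}, {"printer"}
print(support(x | y))    # 3/5 = 0.6: support of the rule computer => printer
print(confidence(x, y))  # 3/4 = 0.75: confidence of the rule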

However, a large number of these rules will be pruned after applying the support and confidence thresholds, and the earlier computations would therefore be wasted. To avoid this problem and to improve the performance of the rule discovery algorithm, mining association rules may be decomposed into two phases:
1. Discovering the large itemsets, i.e. the sets of items whose support is above a predetermined minimum threshold; these are known as frequent itemsets.
2. Using these large itemsets, generating the association rules for the database that have confidence above a predetermined minimum threshold.
The overall performance of mining association rules is determined primarily by the first step; the second step is easy. After the large itemsets are identified, the corresponding association rules can be derived in a straightforward manner. The main consideration of this thesis is the first step, i.e. extracting the frequent itemsets.

3.2 FPM Algorithms

There are various algorithms for FIM. Some of the efficient algorithms are:
- Apriori Algorithm
- Eclat Algorithm
- FP growth Algorithm

Apriori Algorithm

Apriori is the very first algorithm for mining frequent itemsets from a transactional database. It was given by Agrawal, R. and Srikant, R. in 1994. It works on a horizontal-layout database. It is based on Boolean association rules and uses a generate-and-test approach with BFS (breadth-first search). Using the frequent k-itemsets, Apriori finds the larger frequent (k+1)-itemsets. Apriori

property: all non-empty subsets of a frequent itemset are also frequent. Equivalently, the Apriori algorithm states that if a set does not satisfy the minimum support threshold, then all its supersets will also fail to meet the minimum support threshold. In the first pass of the algorithm, the candidate 1-itemsets, denoted C1, are constructed by counting the occurrences of each item in the database. The set L1 is then generated by pruning the items whose support values are lower than the minimum support value; this is also known as the Apriori pruning principle. After the algorithm finds all the frequent 1-itemsets, it joins the frequent 1-itemsets to construct the candidate 2-itemsets and prunes those candidate 2-itemsets whose support counts are below the minimum support count, generating the frequent 2-itemsets. This process is repeated until no further candidate itemsets can be created. Figure 3.1 gives an example of the generation of candidate itemsets and frequent itemsets when the minimum support count is 2. This algorithm:
- Is easy to implement
- Can be easily parallelized
- Uses the large itemset property

Figure 3.1 Frequent itemset generation using Apriori

Limitations

There are major computational challenges faced by the Apriori algorithm. As the dimensionality of the database increases, it needs to scan the database multiple times iteratively and generates a huge number of candidate itemsets. It checks a large set of candidates by pattern matching. As more search space is needed and I/O cost increases, the computation becomes quite expensive. It is also a tedious workload to go over each transaction to determine the support counts of the candidate itemsets. When the dataset becomes large, this algorithm leads to a huge loss of time and greater occupancy of memory space.
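As a sketch of the level-wise generate-and-test process described above, the following self-contained Python fragment implements the basic Apriori loop; the toy transactions and the minimum support count of 2 are assumptions made only for illustration:

from itertools import combinations

transactions = [{"H", "I", "J"}, {"K", "J"}, {"H", "I", "K"},
                {"H", "I", "K"}, {"K", "M"}]   # hypothetical database
min_sup = 2

def apriori(transactions, min_sup):
    # C1: every distinct item is a candidate 1-itemset.
    items = {i for t in transactions for i in t}
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # Scan the database once per level to count candidate supports.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(Lk)
        # Join step: merge frequent k-itemsets into (k+1)-candidates and
        # apply Apriori pruning (every k-subset must itself be frequent).
        k += 1
        joined = {a | b for a, b in combinations(Lk, 2) if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))]
    return frequent

print(apriori(transactions, min_sup))

Note how the database is re-scanned at every level: this is precisely the multiple-scan drawback discussed above.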

Eclat Algorithm

The Eclat algorithm is basically a depth-first search algorithm using set intersection. It uses a vertical database layout, i.e. instead of explicitly listing all transactions, each item is stored together with its cover (also called its TID list), and the intersection-based approach is used to compute the support of an itemset. When the database is stored in the vertical layout, the support of a set can be counted much more easily by simply intersecting the covers of two of its subsets that together give the set itself. In this algorithm, each frequent item is added to the output set. After that, for every such frequent item i, the i-projected database Di is created. This is done by first finding every item j that frequently occurs together with i. The support of the set {i, j} is computed by intersecting the covers of both items. If {i, j} is frequent, then j is inserted into Di together with its cover. Reordering is performed in every recursion step of the algorithm. The algorithm is then called recursively to find all frequent itemsets in the new database Di. The algorithm has good scalability due to its compact representation.

Drawbacks

When the database is very large and the itemsets in it are also very large, it is still feasible to handle the Transaction-id lists, so it produces good results in that setting. For small databases, however, its performance does not scale well.

FP growth Algorithm

FP growth uses a combination of the vertical and horizontal database layouts to store the database in main memory. It stores the actual transactions from the database in a tree structure, and every item has a linked list going through all

transactions that contain that item. This new data structure is called the FP-Tree. Steps involved in FPM: First, the support count of each item in the database is calculated. Next, the items in each transaction are ordered in support-descending order, for the same reasons as before. Next, the FP-Tree is formed. The root node of the tree is created and labeled null. For each transaction in the database, the items are processed and a branch is created for the transaction. Every node in the FP-Tree additionally stores a counter which keeps track of the number of transactions that share that node. When adding the branch for a transaction, the count of each node along the common prefix is incremented by 1, and nodes for the items in the transaction following the prefix are created and linked accordingly. Additionally, an item header table is built so that each item points to its occurrences in the tree via a chain of node-links; each item in this header table also stores its support. The reason for storing transactions in the FP-Tree in support-descending order is that, in this way, it is hoped that the FP-Tree representation of the database is kept as small as possible, since the more frequently occurring items are arranged closer to the root of the FP-Tree and thus are more likely to be shared. From the FP-Tree the conditional pattern bases are generated, conditional FP-Trees are formed from the conditional pattern bases, and finally the frequent itemsets are generated.

Advantages
- Mines frequent patterns without candidate generation
- Highly compressed structure, because it compresses a large database into a compact FP-Tree structure
- Avoids costly database scans
- Provides an efficient, FP-Tree-based FPM method following a divide-and-conquer methodology
- Avoids candidate generation: sub-database tests only
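The tree-building step just described can be sketched in a few lines of Python (a minimal illustration of FP-Tree insertion only, omitting the header table, the node-links and the mining phase):

class FPNode:
    def __init__(self, item, parent):
        self.item = item        # item name (None for the root)
        self.count = 1          # number of transactions sharing this node
        self.parent = parent
        self.children = {}      # child item name -> FPNode

def insert_transaction(root, items):
    # 'items' must already be sorted in support-descending order; shared
    # prefixes only increment counts, which is what keeps the tree compact.
    node = root
    for item in items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, node)
            node.children[item] = child
        else:
            child.count += 1
        node = child

# Hypothetical usage with items pre-sorted by descending support:
root = FPNode(None, None)
insert_transaction(root, ["K", "H", "I"])
insert_transaction(root, ["K", "H"])   # shares the K-H prefix with the first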

Comparison

A comparison of the above three algorithms is shown in Table 3.1.

Table 3.1 Comparison of Apriori, Eclat and FP growth

Technique:
- Apriori: Breadth-first search, with the Apriori property used for pruning
- Eclat: Depth-first search, with intersection of transaction-ids to generate candidate itemsets
- FP growth: Divide and conquer

Approach:
- Apriori: Horizontal-layout-based algorithm
- Eclat: Vertical-layout-based algorithm
- FP growth: Projected-layout-based algorithm

Database:
- Apriori: Suitable for sparse as well as dense datasets
- Eclat: Suitable for medium and dense datasets, but not for small datasets
- FP growth: Suitable for large and medium datasets

Database scan:
- Apriori: The database is scanned each time a candidate itemset is generated
- Eclat: The database is scanned a few times
- FP growth: The database is scanned only two times

Data storage structure:
- Apriori: Horizontal array
- Eclat: Vertical array
- FP growth: FP-Tree (horizontal tree)

Memory utilization:
- Apriori: Requires large memory space due to the large number of candidates produced
- Eclat: Requires less memory than Apriori if the items are few
- FP growth: Requires less space due to its compact structure and the absence of candidate generation

Advantage:
- Apriori: Easy to implement
- Eclat: Since support count information is obtained from the previous itemsets, there is no need to scan the database each time a candidate itemset is generated
- FP growth: The database is scanned only two times, and no candidates are generated

Drawbacks:
- Apriori: Too many candidate itemsets are generated, so more memory is required
- Eclat: Requires virtual memory to perform the transformation
- FP growth: Expensive

Thus, from Table 3.1 it is found that Apriori uses the join-and-prune method, Eclat works on vertical datasets, and FP growth constructs the conditional frequent pattern tree which satisfies the minimum support. The major drawback of the Apriori algorithm is that it produces too many candidate itemsets, so more memory is required, and scanning a large database repeatedly is very expensive; the real reason for Apriori's failure is that it processes efficiently only databases of small size. Eclat is more efficient than the Apriori algorithm in terms of running time, but it requires virtual memory. FP growth is better than both Apriori and Eclat in terms of execution time, as shown in Figure 3.2, and it is more scalable. The comparison of the three algorithms is made with a database consisting of 5,000 transactions.

Figure 3.2 Comparison of the execution times of the Apriori, Eclat and FP growth algorithms (x-axis: Support (%); y-axis: Time (ms))

3.3 Modified FP growth Algorithm

The first step in FP growth is generating frequent itemsets. A formalized and efficient algorithm named the support count tree has been proposed to find the frequent 1-itemsets, and this algorithm has been embedded in the FP growth algorithm. So far, no well-defined algorithm has been proposed to calculate the frequent 1-itemsets. In the proposed work, an efficient tree structure is introduced which finds the frequent 1-itemsets quickly and efficiently, which in turn speeds up the generation of frequent itemsets for the entire database.

Assume all items in the database are numbered; if not, number each item in the database. Suppose the transactional database under consideration consists of 45 items; then the numbering runs from 1 to 45. The next step is to form the support count tree. Steps for generating the support count tree:

Step 1: Find the mid value of the entire item set and make it the root node.
Step 2: Items on the left side of the mid value form the left subtree of the root node, and items on the right side of the mid value form the right subtree of the root node.
Step 3: For the left sublist, find the mid value and repeat steps 1 and 2 until all its items are included in the tree.
Step 4: For the right sublist, find the mid value and repeat steps 1 and 2 until all its items are included in the tree.
Step 5: Now scan each transaction of the transactional database.
Step 6: As each transaction is scanned, search for each of its items in the support count tree. When an item is found, increment the respective count variable of that item.
Step 7: Repeat step 6 until all the transactions have been scanned.

Thus, from the count variable of each item, the support counts are known. An example illustrating how to form an initial support count tree with 15 items using the above algorithm is given below.

Step 1: Consider item numbers 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 for the 15 items. Of these 15 items, 8 is the mid value, so it is made the root node.
Step 2: The items to the left of item 8 form the left sublist, and the items to the right of item 8 form the right sublist.
Step 3: From the left sublist find the mid value; it is item 4, which is linked as the left subtree of node 8. From the right sublist find the mid value; it is item 12, which is linked as the right subtree of node 8.
The above steps are repeated recursively until all the nodes are connected to the tree, as shown in Figure 3.3.

Figure 3.3 Support count tree formation: (a) initial step, (b) second step, (c) third step, (d) final step

For an even number of items, the mid value is calculated by adding the positions of the first and the last elements and then dividing the result by 2, as shown in Algorithm 3.1. For example, consider a dataset with 200 items. Then

the middle value will be (1+200)/2 = 100.5, which is rounded off and taken as 101. The left subtree will then have 100 values and the right subtree will have 99 values. Thus, this algorithm invariably holds good for both even and odd numbers of items.

Variables used:
- min: points to the first item in the list
- max: points to the last item in the list
- n: total number of items in the database
- k: any item in a transaction
- val: node value of the support count tree
- y: pointer pointing to the structure node; each node consists of four fields: value (name of the node), count, left link and right link
- right(): points to the right subtree
- left(): points to the left subtree

For each item in the transaction, the following operation is performed.

Pseudocode for support count tree formation:

SCTree(min, max)
1. t = (min + max) / 2
   val->y = t
2. For the left sublist, recursively call SCTree(min, t-1)
3. For the right sublist, recursively call SCTree(t+1, max)

Algorithm 3.1: Support Count Tree formation

Steps involved in finding the support count from the support count tree:

Step 1: If the item to be searched equals the current root node, increment its count variable and search for the next item in the transaction; otherwise go to step 2.
Step 2: Check whether the value of the item to be searched is less than the root node. If yes, go to step 3, else go to step 4.
Step 3: Take the root node of the left subtree and repeat steps 1 and 2 with it.
Step 4: Take the root node of the right subtree and repeat steps 1 and 2 with it.
Repeat the above steps until all items have been searched.

The pseudocode for finding the support count from the support count tree is given in Algorithm 3.2. Initially, the root node address is assigned to y.

SupportCount(y, k)
1. If val->y == k
   then increment the count value of y
2. else if val->y < k
   then y = right(y); recursively call SupportCount(y, k)
3. else y = left(y); recursively call SupportCount(y, k)

The count value of each node gives the frequent 1-itemsets.

Algorithm 3.2: Calculation of Support Count
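A compact Python rendering of Algorithms 3.1 and 3.2 is given below as an illustrative sketch, under the assumption that the items are numbered 1..n; the node structure mirrors the four fields described above:

class SCNode:
    def __init__(self, value):
        self.value = value    # item number (name of the node)
        self.count = 0        # support count accumulated so far
        self.left = None      # left link
        self.right = None     # right link

def build_sctree(lo, hi):
    # Algorithm 3.1: the mid value becomes the root; the left and right
    # sublists recursively form the left and right subtrees.
    if lo > hi:
        return None
    mid = (lo + hi + 1) // 2      # rounded off, as in the 200-item example
    node = SCNode(mid)
    node.left = build_sctree(lo, mid - 1)
    node.right = build_sctree(mid + 1, hi)
    return node

def add_support(node, item):
    # Algorithm 3.2: binary search from the root, incrementing the count of
    # the matching node; each comparison halves the remaining tree.
    if node is None:
        return
    if node.value == item:
        node.count += 1
    elif item > node.value:
        add_support(node.right, item)
    else:
        add_support(node.left, item)

# Hypothetical usage with 15 items and two sample transactions:
root = build_sctree(1, 15)            # the root value is 8, as in Figure 3.3
for transaction in [[1, 8, 12], [8, 15]]:
    for item in transaction:
        add_support(root, item)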

Steps for FIM using Modified FP growth:
1. After finding the frequent 1-itemsets, remove the items which do not satisfy the minimum support. Sort the frequent items in each transaction in descending order of support.
2. Create an FP-Tree with T as the root and label it null. It also consists of a set of item-prefix subtrees as the children of the root. Each node in a subtree consists of three fields: item-name, count and node-link. A frequent-item header table is maintained for efficient access to the FP-Tree; it consists of three fields, namely item-name, support count and head of node-link.

Algorithm for construction of the FP-Tree:
Input: Transaction DB, minimum support threshold.
Output: FP-Tree.
1. Let F be the set of items in the transactions. Initially, the support value of each item is calculated using the support count tree algorithm. Sort the frequent items of each transaction in support-descending order.
2. Create the root T of the FP-Tree, and label it "null".
3. Let the sorted item list of a transaction be [p|P], where p is the first item and P is the remaining list. For each transaction, call inserttree([p|P], T).
4. Function inserttree([p|P], T): if T has a child N with N.itemname = p.itemname, then N.count++; else create a node N for p with N.count = 1, link it to T, and add a node-link to the nodes with the same itemname. If P is non-empty, call inserttree(P, N).
5. Starting at the bottom of the frequent-item header table of the FP-Tree, traverse the FP-Tree by following the link of each frequent item.
6. Accumulate all of the transformed prefix paths of that item to form its conditional pattern base.

7. For each pattern base:
   - Accumulate the count for each item in the base.
   - Construct the conditional FP-Tree for the frequent items of the pattern base.
8. Repeat steps 6 and 7 for each frequent item.
9. Repeat the above process on each newly created conditional FP-Tree until the resulting FP-Tree is empty or contains only one path.

Algorithm for FP growth:
Input: FP-Tree, minimum support threshold (the DB itself is not needed).
Output: The complete set of frequent itemsets.
Method: Call FP growth(FP-Tree, null).
Procedure FP growth(Tree, α) {
1. if Tree contains a single path P then
2.   for each combination (denoted β) of the nodes in path P do
3.     generate pattern β ∪ α with support = the minimum support of the nodes in β
4. else for each ai in the header table of Tree do {
5.   generate pattern β = ai ∪ α with support = ai.support;
6.   construct β's conditional pattern base and β's conditional FP-Tree Treeβ;
7.   if Treeβ ≠ null then
8.     call FP growth(Treeβ, β); } }
The above algorithm is used to mine frequent itemsets from the FP-Tree.

3.4 Experimental Results

The support count tree has been used in the FP growth algorithm for the generation of frequent itemsets. The dataset was taken from T20I7D500K; the number of transactions in this dataset is 500,000. The following experiment is carried out by varying the number of transactions taken from the

above dataset. The size of the 15K dataset is 861KB, of the 25K dataset 1.39MB, of the 50K dataset 2.80MB and of the 75K dataset 4.20MB. The experimental results show that the run time of the FP growth algorithm is higher than that of the FP growth algorithm with the support count tree, as shown in Figure 3.4 and Table 3.2.

Figure 3.4 Performance of FP growth and FP growth with support count tree (x-axis: Number of Transactions; y-axis: Run Time (secs))

Table 3.2 Execution time of FP growth and FP growth with support count tree (columns: Number of transactions (K); Execution time (sec) of FP growth; Execution time (sec) of FP growth with support count tree)

In the support count tree the items are kept in sorted order. When an item's count value has to be incremented, it is enough to search for the item by traversing from the root towards a leaf, comparing against the items stored in the tree. If the value being searched for is smaller than the current node, the search continues in the left subtree; otherwise it continues in the right subtree. Each comparison prunes half of the remaining tree, so the time taken to increment the count value of an item is proportional to the logarithm of the number of items stored in the tree. Thus, the time complexity for the support count tree to search for an element and increment its count value is O(log n), where n is the number of items, whereas it takes O(n) time in FP growth. Hence, for calculating the support count of each item, the support count tree is better than plain FP growth.

CHAPTER 4
CONSTRAINED FIM IN BIG DATA

4.1 Need for Constrained Pattern Mining

FPM usually produces too many solution patterns. This situation is harmful for two reasons:
1. Performance: mining is usually inefficient, or often simply infeasible.
2. Identification of fragments of interesting knowledge: the interesting knowledge is blurred within a huge quantity of small, mostly useless patterns.
So users need an effective way to control the large number of discovered patterns, and to be able to choose which patterns to consider at each time. The most accepted and common approach to minimize these drawbacks is to capture and represent the semantics of the domain through constraints, and to use them not only to reduce the number of results but also to focus the algorithms on areas where it is more likely to gain information and return more interesting results. Constraints can therefore be used, giving the following benefits:
1. They can be pushed into the frequent pattern computation, exploiting them to prune the search space and thus reducing time and resource requirements.
2. They provide guidance to the user over the mining process and a way of focusing on interesting knowledge.

4.2 Types of Constraints

Classification of constraints based on semantics

Item Constraint:
An item constraint specifies which particular individual items or groups

of items should or should not be present in the patterns. For example, a soap company may be interested in patterns containing only soap products when it mines the transactions of a grocery store.

Length Constraint:
A length constraint specifies a requirement on the length of the patterns, i.e. the number of items in the patterns. For example, when mining classification rules for documents, a user may be interested only in frequent patterns with at least five keywords.

Model-based Constraint:
A model-based constraint looks for patterns which are sub- or super-patterns of some given patterns (models). For example, a car dealer may be interested in knowing which other accessory items a purchaser would buy when he buys a car.

Aggregate Constraint:
An aggregate constraint is a constraint on an aggregate of the items in a pattern, where the aggregate function can be SUM, AVG, MAX, MIN, etc. For example, a marketing analyst may like to find patterns where the average price is over $150.

User Constraint:
User constraints are those in which the user can employ a rich set of SQL-style constraints to guide the mining process to find only those frequent patterns containing market basket items that satisfy the user constraints. Examples of these constraints include the following:

C1 ≡ min(S.Price) ≥ $10 and C2 ≡ S.Type = snack. Here, constraint C1 says that the minimum price of all items in a pattern/set S is at least $10; constraint C2 says that all items in a pattern S are snacks. It is important to note that, besides these market basket items, the set of constraints can also be imposed on individuals, events, or objects in other domains. The following are some examples: C3 ≡ max(S.Temperature) ≤ 38°C. This constraint says that the maximum (body) temperature of all individuals in a pattern/set S must be at most 38°C. C4 ≡ min(S.Price) ≤ $1000 and C5 ≡ avg(S.Price) ≤ $1000. Constraints C4 and C5, respectively, say that the minimum and the average price of all items in S is at most $1000.

Classification of constraints based on properties

Monotonicity:
When an itemset S satisfies the constraint, so does any of its supersets.
- sum(S.Price) ≥ v is monotone
- min(S.Price) ≤ v is monotone

Anti-monotonicity:
When an itemset S satisfies the constraint, so does any of its subsets. Frequency is an anti-monotone constraint.

Succinctness:
Given A1, the set of items satisfying a succinct constraint C, any set S satisfying C is based on A1, i.e. S contains a subset belonging to A1. The idea is that whether an itemset S satisfies constraint C can be determined based on the singleton items in S.
- min(S.Price) ≤ v is succinct

- sum(S.Price) ≥ v is not succinct

Convertible anti-monotone:
If an itemset S violates a constraint C, so does every itemset having S as a prefix with respect to R, where R is an order on the items. Example: avg(S) ≥ v with respect to item-value-descending order.

Convertible monotone:
If an itemset S satisfies a constraint C, so does every itemset having S as a prefix with respect to R, where R is an order on the items. Example: avg(S) ≤ v with respect to item-value-descending order.

4.3 Constrained FIM

In constrained FIM, the constraints are first applied to the database, whereby the search space can be reduced. For example, consider Table 4.1.

Table 4.1 Auxiliary information of each item: for each of the items H, I, J, K, L and M, the table gives its Qty and its Price.

Let S denote an itemset, and consider the following constraints:
Q1: min(S.Qty) ≥ 400 (anti-monotone, succinct)
Q2: max(S.Price) ≥ 40 (monotone and succinct)
Q3: min(S.Qty) ≥ 400 Ʌ max(S.Price) ≥ 40 (anti-monotone, succinct, monotone)
Q3 is to find all frequent itemsets satisfying Q1(X) Ʌ Q2(X).
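The constraints Q1, Q2 and Q3 can be written directly as predicates over the auxiliary table. The sketch below uses hypothetical Qty and Price values chosen only to be consistent with the discussion that follows (M's Qty is 200, L is cheap, and exactly I, J, K and M have Price ≥ 40); they are not the values of the original Table 4.1:

# Hypothetical auxiliary table (values assumed for illustration).
table = {
    "H": {"qty": 500, "price": 20},
    "I": {"qty": 400, "price": 50},
    "J": {"qty": 450, "price": 40},
    "K": {"qty": 600, "price": 45},
    "L": {"qty": 300, "price": 15},
    "M": {"qty": 200, "price": 60},
}

def q1(itemset):
    # Anti-monotone, succinct: min(S.Qty) >= 400.
    return min(table[i]["qty"] for i in itemset) >= 400

def q2(itemset):
    # Monotone, succinct: max(S.Price) >= 40.
    return max(table[i]["price"] for i in itemset) >= 40

def q3(itemset):
    # Conjunction Q1(X) AND Q2(X).
    return q1(itemset) and q2(itemset)

print(q1({"H", "I"}))         # True: both quantities are >= 400
print(q2({"H", "I"}))         # True: I's price (50) reaches the threshold
print(q3({"H", "K", "M"}))    # False: M's Qty of 200 violates Q1

Because q1 is anti-monotone, any itemset containing M can be pruned immediately; because q2 is monotone, an itemset that already satisfies it never needs to be re-checked as it grows.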

Let us consider the above constraints to mine frequent itemsets from the database shown in Table 4.2.

Table 4.2 Transactional database 1
T1: {H, I, J}
T2: {K, J}
T3: {H, I, K, L}
T4: {H, I, K, M}
T5: {K, M}

Q1: min(S.Qty) ≥ 400. To find the frequent itemsets which satisfy this constraint, the algorithm first finds the frequent 1-itemsets. Let the minimum support threshold be 2. In the above transactional database, only L has a support count of 1, so L is removed from the database. In each transaction, the items are arranged in descending order of their Qty values. The next step is to identify the invalid items: M is invalid since its Qty value is 200, so M is also excluded. Now the FP-Tree is built using only the valid items, i.e. H, I, J, K. A projected database is formed with respect to H, I, J, K, and the frequent itemsets are generated by applying FP growth.

Q2: max(S.Price) ≥ 40. For this type of constraint, the algorithm first identifies the items whose price is greater than or equal to 40. In the above database, I, J, K and M have prices greater than or equal to 40, while the prices of H and L are less than 40. Of these, L is infrequent and is removed. I, J, K and M form the primary group and H forms the secondary group. The FP-Tree is then formed; here a dashed line has to be used to form a boundary between the primary and secondary items in the initial FP-Tree. From the

initial FP-Tree, a projected database is formed for each valid singleton item; here the boundary is not required. This means that, once a primary item x has been found for a valid itemset υ, any other item of υ can be chosen from the primary or the secondary group. From the projected FP-Trees the frequent itemsets are generated by applying FP growth.

Q3: min(S.Qty) ≥ 400 Ʌ max(S.Price) ≥ 40. When using multiple constraints, the algorithm first picks the most selective constraint, ignoring the other constraint during the mining process. The mined itemsets are then checked against the previously ignored constraint when generating the conditional databases, and the frequent itemsets are generated. In the above example Q1 Ʌ Q2, the primary group can be the items whose Qty values are ≥ 400 and whose prices are ≥ 40, while the secondary group contains the items whose Qty values are ≥ 400.

4.4 FIM in Big Data

FP growth for FIM does not hold good for handling large volumes of data. Although the FP-Tree is considered the most compact data structure for holding the data patterns in memory, efforts to make it parallel and distributed so as to handle large databases have incurred a lot of communication overhead during mining. Recent improvements in parallel programming have provided good tools to handle this problem. So a parallel and distributed FIM algorithm using the Hadoop MapReduce framework has been proposed, which shows the best performance results for large databases.
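To give a flavour of the first parallel step (counting frequent 1-itemsets with MapReduce), the following Hadoop-Streaming-style mapper and reducer are shown as a sketch; the comma-separated input format, the file names and the minimum support count of 2 are illustrative assumptions, not the exact implementation proposed in this thesis:

#!/usr/bin/env python
# mapper.py (hypothetical): emit <item, 1> for every item of a transaction.
import sys

for line in sys.stdin:
    for item in line.strip().split(","):    # assumes comma-separated items
        if item:
            print(item + "\t1")

#!/usr/bin/env python
# reducer.py (hypothetical): sum the counts per item, keep the frequent ones.
import sys
from collections import defaultdict

MIN_SUP = 2                                 # assumed minimum support count
counts = defaultdict(int)
for line in sys.stdin:
    item, n = line.rsplit("\t", 1)
    counts[item] += int(n)

for item, n in counts.items():
    if n >= MIN_SUP:
        print(item + "\t" + str(n))

With Hadoop Streaming these two scripts would be supplied as the mapper and reducer of a job over the transaction file, each map task processing one block of the partitioned database in parallel.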

Introduction to Hadoop and MapReduce

Hadoop: Hadoop is an open-source software framework for processing very large data sets in a distributed environment. This software package provides tools for data discovery, answering questions and solving analytic problems in Big Data. It runs on a collection of servers called a cluster; the individual servers within a cluster are called nodes. The components of the Hadoop ecosystem are shown in Figure 4.1.

Figure 4.1 Components of Hadoop Ecosystem

The Hadoop ecosystem comprises the following key components:

(i) HBase: an open-source, distributed, non-relational database system implemented in Java. It runs on top of HDFS and can serve well-structured input and output for MapReduce.

(ii) MapReduce: introduced by Google in order to process and store large data sets on commodity hardware. MapReduce is a model for processing large-scale data records in clusters.

(iii) Oozie: a web application that runs in a Java servlet container. Oozie uses a database to store the information of a workflow, which is a collection of actions, and manages Hadoop jobs in an orderly way.

(iv) Sqoop: a command-line interface application that provides a platform for transferring data from relational databases to Hadoop, or vice versa.

(v) Avro: a system that provides data serialization and a data exchange service, used primarily with Apache Hadoop. These services can be used together or independently, according to the data records.

(vi) Chukwa: a framework for data collection and analysis, used to process and analyze massive amounts of logs. It is built on top of HDFS and the MapReduce framework.

(vii) Pig: a high-level platform for creating MapReduce programs used with Hadoop. It is a high-level data processing system in which the data records are analyzed using a high-level language.

(viii) Zookeeper: a centralized service that provides distributed synchronization and group services, along with the maintenance of configuration information and records.

(ix) Hive: an application developed for a data warehouse that provides an SQL interface as well as a relational model. The Hive infrastructure is built on top of Hadoop and helps in providing summarization and analysis for the respective queries. The main building blocks of Hive are:
- the Metastore, which is used to store the catalogue and metadata;
- the Query Compiler, which compiles HiveQL into MapReduce tasks;
- the Execution Engine, which executes the tasks produced by the compiler in the proper dependency order;
- the HiveServer and a JDBC/ODBC server.

(x) HDFS: Hadoop provides a distributed file system (HDFS) and a framework for the analysis and transformation of very large datasets using the MapReduce paradigm.

Thus the main objective of Hadoop is not to speed up the processing of data but to make it possible to process really huge amounts of data by splitting the data into smaller subsets. Hadoop is flexible with respect to data formats and can run on low-cost commodity hardware. It protects the data from being lost because of hardware failures by automatically keeping multiple copies.

HDFS: Hadoop employs a master/slave architecture for both distributed storage and computation. A multi-node cluster running Hadoop means running a set of daemons, or resident programs, on the different servers in the network, as shown in Figure 4.2.

Figure 4.2 HDFS Cluster Setup

The daemons include:

NameNode: the master of HDFS, which directs the slave DataNode daemons. It keeps track of the overall health of the distributed file system. The work of the NameNode is memory and I/O intensive; to reduce the workload on the machine, it therefore does not store any user data or perform any computations for a MapReduce program. A negative aspect of the importance of the NameNode is that it is a single point of failure.

DataNode: responsible for reading and writing HDFS blocks to actual files on the local file system.

It constantly informs the NameNode about the blocks it is currently storing, provides information about local changes, and receives instructions to create, edit or delete blocks on the local disk.

Secondary NameNode: gives the impression of being a substitute for the NameNode, but it is not, as it cannot take over the NameNode's role. It is just a checkpoint node, which periodically reads the file system state and metadata from the RAM of the NameNode and writes them to the hard disk or the file system.

JobTracker: works as a binding between the application and Hadoop. The JobTracker manages MapReduce job execution by determining which files to process, assigning nodes to perform the different tasks, and monitoring all tasks as they run. First, the JobTracker receives the request from the client. It then communicates with the NameNode to determine the location of the data and identifies the best TaskTracker to execute the task. The JobTracker monitors the TaskTrackers periodically and submits the status to the client. If a task fails, the JobTracker automatically relaunches the task on a different node, up to a predefined limit of retries. There is only one JobTracker per Hadoop cluster, and it runs on a server acting as the master node of the cluster. When the JobTracker fails, the running MapReduce job is stopped.

TaskTracker: responsible for managing the execution of the individual tasks assigned by the JobTracker. There is one TaskTracker per slave node. It constantly communicates with the JobTracker in order to obtain task requests and to provide the tasks to the nodes. When a TaskTracker fails, the JobTracker assigns the tasks executed by that TaskTracker to another node.

MapReduce Mechanisms:

In a MapReduce cluster, after a job is submitted, a master divides the input files into multiple map tasks and then schedules both the map tasks and the reduce tasks to worker nodes, as shown in Figure 4.3. A worker node runs tasks in its task slots and keeps updating each task's progress to the master through periodic heartbeats. A map task extracts key-value pairs from the input, passes them to a user-defined map function and combine function, and finally generates the intermediate map output. After that, a reduce task copies its input pieces from each map task, merges these pieces into a single ordered (key, value list) pair stream by a merge sort, passes the stream to a user-defined reduce function, and finally generates the result of the job. In general, a map task is divided into map and combine phases, while a reduce task is divided into copy, sort and reduce phases.

One of Hadoop's basic principles is that moving computation is cheaper than moving data. Therefore, when scheduling a map task, the scheduler takes the TaskTracker's network location into account and picks a task whose input data is as close as possible to the TaskTracker; the scheduling policy preferentially selects tasks with data locality. In the optimal case, a map task is data-local, that is, it runs on the same node on which its input data resides. The next best case is when the data are on another node within the same rack, called rack locality. Some map tasks retrieve their data from a different rack (rack-off locality).

In Hadoop 0.20, reduce tasks can start when only some of the map tasks have completed, which allows reduce tasks to copy map outputs earlier, as they become available, and hence mitigates network congestion. However, no reduce task can step into the sort phase until all map tasks are complete, because each reduce task must finish copying the outputs from all the map tasks in order to prepare the input for the sort phase.

Figure 4.3 Mapper and Reducer Task

Advantages of MapReduce:

1. A large number of distributed computing problems can be split up and solved with the two basic operations map and reduce.

2. A large number of these computing problems or algorithms can be rewritten or re-architected to be embarrassingly parallel.

Therefore, running such a computing problem on N cores can theoretically (ideally) give an N-fold speedup.

3. By developing a system which creates an abstraction for solving a generic map-reduce parallel computation problem, it became possible to expose high-level APIs to developers, who no longer had to worry about internal details such as data shuffling, partitioning and network topology. This was the basis of MapReduce. Thus MapReduce reduces the complexity of parallel computing and enables easier, pain-free, reliable execution on commodity hardware while ensuring high levels of fault tolerance.

Applications of MapReduce:

Searching: given an input of line numbers and lines, a MapReduce job identifies the lines matching a pattern.

Sorting: given an input of keys and values, the records are sorted by key.

Inverted indexing: given an input of file names and text, the output is, for each word, the list of files containing that word.

MapReduce is also used for text tokenization, the creation of other kinds of data structures (e.g., graphs), and machine learning.

4.5 FIM in Big Data using MapReduce

In generating frequent itemsets, MapReduce is used twice. MapReduce is used the first time to find the frequent 1-itemsets, as shown in Figure 4.4, and a second time to generate the frequent itemsets using an FPM algorithm.

The steps involved in generating the frequent 1-itemsets using MapReduce are as follows (a minimal Hadoop sketch of these steps is given after the list):

Step 1: The input transactional database is split equally, based on the number of data nodes available.

Step 2: On each data node, the input database is split equally, based on the number of mappers available.

Step 3: For each item in the database, the mapper generates (key, value) pairs. The output is a set of key-value pairs (F, 1), where F is an item from the input split.

Step 4: In the shuffling phase, identical items are brought together with their values.

Step 5: In the combine phase, the count value of each item is calculated. Each combiner finds the count values of the items destined to it.

Step 6: In the reduce phase, the frequent 1-itemsets of all the items in each data node are combined on the master node, and finally the frequent 1-itemsets for the entire database are generated.
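This first MapReduce pass is essentially the standard word-count pattern applied to items. The following is a minimal Hadoop sketch of Steps 3 to 6, assuming each input line is one transaction with comma-separated items (the class names and the separator are illustrative, not part of the thesis implementation):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase (Step 3): emit (item, 1) for every item of every transaction.
class ItemCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text item = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String it : value.toString().split(",")) {
            if (!it.trim().isEmpty()) {
                item.set(it.trim());
                context.write(item, ONE);   // one (key, value) pair per item
            }
        }
    }
}

// Combine/Reduce phase (Steps 4 to 6): sum the counts of each item and
// keep only the items whose support reaches the minimum support threshold.
class ItemCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private static final int MIN_SUPPORT = 2;   // threshold of the running example
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        if (sum >= MIN_SUPPORT) context.write(key, new IntWritable(sum));
    }
}

If the same reducer class is also registered as the combiner of the job, the per-node partial counts of Steps 4 and 5 are computed locally before the final reduce.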

Figure 4.4 Generation of Frequent 1-itemsets using MapReduce (input → splitting → mapping → shuffling → combine → reduce)

Next, MapReduce is used again to generate the frequent itemsets. The conditional patterns of each item are created as output by the mapper function, locally on each data node. Each reducer then combines the conditional patterns based on the element key. After combining all the patterns for each key, the respective reducers prune the infrequent itemsets based on the minimum support count. After pruning, each reducer generates the frequent sets for its respective keys locally on each processor. This step reduces the cost of interprocess communication, as the frequent sets for each key are generated locally. The results of this reducer step are aggregated as the final outcome of the algorithm, producing the frequent itemsets for the complete database.

4.6 CONSTRAINED FIM IN BIG DATA

To handle constrained FPM in Big Data, the Constrained Frequent Itemset Mining (CFIM) algorithm has been proposed. For constraints like min(S.Qty) ≥ 400, MapReduce is first used to generate the frequent 1-itemsets, as explained in Section 4.5. MapReduce is then used again to perform FPM using the constraints. In each mapper, the items whose support is less than the minimum threshold are removed. Next, the items in the database are arranged in descending order of the field under consideration, and the items which do not satisfy the constraint are removed from the database. An FP-Tree is then formed only for the valid items, and the function to generate the conditional patterns is called. The conditional patterns of each item are created as the output of the mapper function, locally on each data node. Each reducer then combines the conditional patterns based on each element key, and the frequent itemsets are generated. The results of the reducers are sent to the name node. The results of each data node are aggregated as the final outcome to generate the frequent itemsets for the complete database, as shown in Figure 4.5.

Figure 4.5 Flow Chart for FIM using Anti-Monotone Constraint (Mapper: split for each mapper → remove the infrequent items → apply the anti-monotone constraint → FP-Tree generation for the valid items → generation of the conditional pattern of each item; Reducer: combine the conditional patterns based on each key → frequent itemsets are generated and sent to the master node; finally, the results of each data node are aggregated and the frequent itemsets are generated for the entire database)
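As a rough illustration of the mapper-side pruning shown in Figure 4.5, the following Java sketch (hypothetical names; the item-to-Qty table is assumed to be available to every mapper, e.g. via the job configuration or the distributed cache) drops the items that violate the anti-monotone constraint before the local FP-Tree is built:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of mapper-side pruning for min(S.Qty) >= 400: any single item with
// Qty < 400 can never occur in a satisfying itemset, so it is removed from
// each transaction before the FP-Tree is constructed.
class ConstraintPruner {
    static List<String> prune(List<String> transaction,
                              Map<String, Integer> qtyOf,      // item -> Qty
                              Set<String> frequentItems) {     // frequent 1-itemsets
        List<String> valid = new ArrayList<>();
        for (String item : transaction) {
            if (frequentItems.contains(item) && qtyOf.getOrDefault(item, 0) >= 400) {
                valid.add(item);
            }
        }
        // Items are then arranged in descending order of Qty, as in Section 4.3.
        valid.sort((a, b) -> qtyOf.get(b) - qtyOf.get(a));
        return valid;
    }
}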

For constraints like Q2: max(S.Price) ≥ 40, the algorithm generates the frequent itemsets by following the steps given below:

1. MapReduce is first used to generate the frequent 1-itemsets, as explained in Section 4.5. The infrequent items are removed from the database.

2. The second phase of MapReduce then starts to generate the frequent patterns. In the map phase, the items which satisfy the constraint and those which do not are first identified.

3. The items satisfying the constraint form the primary group, and the items which do not satisfy the constraint form the secondary group.

4. A projected FP-Tree is formed for the valid singleton items.

5. The conditional pattern of each item is formed and sent to the reducer.

6. Each reducer then combines the conditional patterns based on each element key, and the frequent itemsets are generated. The results of the reducers are sent to the name node.

7. The results of each data node are aggregated as the final outcome to generate the frequent itemsets for the complete database.

Multiple constraints: MapReduce is first used to generate the frequent 1-itemsets, as explained in Section 4.5, and the infrequent items are removed from the database. In the second phase of the map task, for multiple constraints, the algorithm first picks the most selective constraint and filters the items of the database accordingly. Then, with the valid items, the projected database is formed by checking the second constraint in parallel. From the projected database, the conditional pattern of each item is formed and sent to the reducer. Each reducer then combines the conditional patterns based on each element key, and the frequent itemsets are generated. The results of the reducers are sent to the name node (master node). The results of each data node

are aggregated as the final outcome to generate the frequent itemsets for the complete database. These steps are shown in Figure 4.6.

Figure 4.6 Flow diagram for multiple constraints FIM (split → map with constraint 1 → map with constraint 2 → reduce, with the conditional patterns combined and the frequent itemsets collected at the name node)

Thus, many algorithms such as FP-growth+, ExAMiner, FP-Bonsai, ExAnte, FIC(A) and FIC(M) have been proposed to mine frequent patterns using constraints, but only from small amounts of data, whereas CFIM can be used to mine constrained frequent itemsets from Big Data.

CHAPTER 5

MODIFIED MAPREDUCE FOR FIM IN BIG DATA

5.1 Introduction to Cache

Today's computers depend on large and fast storage systems. Large storage capacity is needed for many database applications, scientific computations with large data sets, video and music, and so forth. For some applications, speed becomes much more important, and for such applications cache memory is used. Cache memories are small, fast static RAM memories that improve program performance by keeping a copy of the most frequently used data from main memory. The parameters of a cache are its capacity, block (cache line) size and associativity, where the capacity is the size of the cache, the block size is the basic unit of transfer between the cache and main memory, and the associativity determines how many slots in the cache are potential destinations for a given address reference.

When the cache size of a system is larger, the hit ratio is also higher. If the data being searched for is present in the cache, this is known as a cache hit. A cache hit is good because the required data is fetched faster than it could be from main memory. A cache miss occurs if the cache does not contain the requested data; this is bad because the CPU has to wait until the data is fetched from main memory.

There are many areas in the computer world where Pareto's law applies, and cache size is definitely one of them. If a system using 32 MB of memory has a 256 KB cache, increasing the cache by 100% to 512 KB will probably result in an increase in the hit ratio of less than 10%; doubling it again will likely result in an increase of less than 5%.

In the real world, this difference is not noticeable to most people. However, if the system memory is increased greatly, then the cache size should also be increased to prevent a degradation in performance.

Modern architectures typically have two levels of cache (L1 and L2) between the CPU and main memory. While the L1 cache can perform at CPU speed, L2 cache and main memory accesses normally introduce latencies on the order of 10 and 100 cycles, respectively. Most modern desktop and server CPUs have at least three independent caches: an instruction cache to speed up executable instruction fetches, a data cache to speed up data fetches and stores, and a Translation Lookaside Buffer (TLB) to speed up virtual-to-physical address translation for both executable instructions and data. The TLB is part of the memory management unit and is not directly related to the CPU caches.

Cache entries

Data is transferred between memory and cache in blocks of fixed size, called cache lines. A cache entry is created when a cache line is copied from memory into the cache; the entry includes the copied data as well as the requested memory location (now called a tag). When the processor wants to fetch data, it first checks for a corresponding entry in the cache: the cache checks for the contents of the requested memory location in any cache lines that might contain that address. In the case of a cache hit, the processor immediately reads or writes the data from or to the cache. For a cache miss, the cache allocates a new entry and copies the data in from main memory, after which the request is fulfilled from the contents of the cache. Thus cache memories can reduce memory latencies only when the requested data is found in the cache, and the hit ratio increases when the cache is larger.

5.2 Proposed System

To further increase the efficiency of generating frequent itemsets, a cache is introduced so that the support count can be calculated in the cache itself. For this, a Modified MapReduce algorithm has been proposed. Its flow chart is given in Figure 5.1 and proceeds as follows. The input is taken from multiple files; each file can be a big transactional data set. The dataset is split based on the number of data nodes available. Map() is then applied to form a support count tree, which is stored in the cache; the output is the frequent 1-itemsets. Reduce() is then applied to find the frequent 1-itemsets of all the mappers in a data node, and the output of each reducer is aggregated on the name node.

Parallel computing is then performed again: the items whose support is less than the minimum support count are removed from the transactions, Map() is applied again to find the conditional pattern base in each data node, and then, using Reduce() in each data node, the frequent itemsets are generated. Finally, the results of the reducers are aggregated to generate the frequent itemsets of the entire database.

Figure 5.1 Flow Chart for FIM using Modified MapReduce

The initial step of frequent itemset generation is to generate the frequent 1-itemsets for the given database. For this, the support count tree algorithm has been proposed, which is explained in detail in Section 3.3. Section 4.5 showed how MapReduce is used to find the frequent 1-itemsets and to generate frequent itemsets using constraints. To increase the efficiency of the MapReduce task, a cache has been included in the map phase to maintain the support count tree for calculating the frequent 1-itemsets of each mapper, as shown in Figure 5.2.

As the data in the cache can be fetched quickly, this reduces the total time for calculating the frequent 1-itemsets, since it bypasses the shuffle, sort and combine tasks of each mapper in the original MapReduce flow.

Figure 5.2 Proposed Architecture of MapReduce for generating Frequent 1-itemsets (master: JobTracker and NameNode; slaves: TaskTrackers and DataNodes, each running map tasks with a cache followed by reduce tasks)

Modified MapReduce

In each map function, the support count tree code has been embedded for finding the support count of each item, and the tree is stored in the cache. As the items are read from the transaction database, it becomes easy to fetch the data of the respective items, since it is stored in the cache. Thus, at the end of the map phase, the support count of each item has been calculated while bypassing the sort and combine phases of the original MapReduce tasks, as shown in Figure 5.3 and Figure 5.4.

The output of each mapper is then given to the reducer, which finds the cumulative frequent 1-itemsets of all the mappers belonging to the same DataNode. The output is then stored in HDFS, where the outputs of all the reducers are aggregated, giving the frequent 1-itemsets of the entire database.

Figure 5.3 Flow Diagram of the MapReduce Task (map → sort → combine → reduce)

Figure 5.4 Flow Diagram of the Modified MapReduce Task (map with cache → reduce)

Thus, using the cache and the support count tree, the support count of each item is calculated quickly without undergoing the sorting and combining steps. Hadoop combiners require all map outputs to be serialized, sorted and possibly written to disk; to overcome this, a cache has been introduced to store the frequent 1-itemset values.

Table 5.1 Transaction Database 2

Transaction ID   Transactions
1                Z, Y, C, D, G, I, X, P
2                Y, B, C, Z, L, X, O
3                B, Z, H, J, O
4                B, C, Q, K, S, P
5                Y, Z, C, E, L, P, X, N

If the transaction database is given with item numbers, then the support count tree can be formed immediately. If not, each item has to be numbered first and then the support count tree is formed. After finding the support counts, each item number has to be mapped back to its item name. An example numbering is shown below:

Table 5.2 Numbering each item

Item Name   Item Number
B           1
C           2
D           3
E           4
G           5
H           6
I           7
J           8
K           9
L           10
N           11
O           12
P           13
Q           14
S           15
X           16
Y           17
Z           18

The next step is to form the support count tree. The support count tree for Table 5.2 is shown in Figure 5.5.

Figure 5.5 Support Count Tree

Each node in the support count tree has a count value associated with it, which gives the support of the corresponding frequent 1-itemset. Items whose support count is less than the minimum support threshold are removed, and a Support Count Table (SCT) is formed, which is shown in Table 5.3.

Table 5.3 Support Count Table

Item Identifier   Support Count
C                 4
Z                 4
Y                 3
B                 3
X                 3
P                 3
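The following is a minimal Java sketch of this structure (the class and method names are illustrative): a binary search tree keyed on the item number, in which each node carries the four attributes used by the proposed tree, namely the name (item number), the count value and the left and right links.

// Minimal sketch of the support count tree: a binary search tree keyed on
// the item number, where each node keeps that item's running support count.
class SupportCountTree {
    static class Node {
        int item;          // item number (name)
        int count;         // support count value
        Node left, right;  // left link and right link
        Node(int item) { this.item = item; this.count = 1; }
    }

    private Node root;

    // Record one occurrence of an item: create the node on first sight,
    // otherwise just increment its count.
    void add(int item) {
        if (root == null) { root = new Node(item); return; }
        Node n = root;
        while (true) {
            if (item == n.item) { n.count++; return; }
            if (item < n.item) {
                if (n.left == null) { n.left = new Node(item); return; }
                n = n.left;
            } else {
                if (n.right == null) { n.right = new Node(item); return; }
                n = n.right;
            }
        }
    }

    // In-order walk printing the items whose support reaches the threshold,
    // i.e. the frequent 1-itemsets.
    private void printFrequent(Node n, int minSupport) {
        if (n == null) return;
        printFrequent(n.left, minSupport);
        if (n.count >= minSupport) System.out.println(n.item + " : " + n.count);
        printFrequent(n.right, minSupport);
    }

    void printFrequent(int minSupport) { printFrequent(root, minSupport); }
}

For the 999 distinct items of the dataset used in Section 5.3, such a tree is small enough to reside comfortably in the cache.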

After the construction of the SCT, the MapReduce function is used for the second time to generate the frequent itemsets. The support count of each item and the minimum threshold value (ξ = 3) are now passed to each mapper. The transactions of each mapper are pruned by removing the infrequent items and sorting each transaction in descending order. The output of each mapper is a set of (key element : conditional pattern) pairs.

Function FIM_Mapper()
input: SC Table, transaction Ti from DB
output: conditional frequent patterns
Steps:
1. For each transaction in DB do
2.   Select and sort the frequent items in the order of the SC Table.
3.   Let the sorted frequent item list in the transaction be patternelement [p|P], where p is the first element and P is the remaining list.
4.   Create the root of an FP-Tree t, and label it Null.
5.   if t has a child n such that n.item-name = p.item-name then
6.     Increment n's count by 1
7.   else
8.     Create a new node n with count 1, let its parent link be linked to t, and let its node-link be linked to the nodes with the same item-name via the node-link structure.
9.   end if
10.  if P is non-empty then repeat steps 5 to 10 until it becomes empty
11.  end if
12.  call generateconditionalpattern(patternelement)
13. End for

Function generateconditionalpatterns()
input: patternelements
output: conditional prefix-tree for each element
Steps:
For each element from patternelements
  construct the conditional prefix-tree
  HadoopContext.write(element, conditionalprefix-tree);
End for

Table 5.4 FIM-Mapper outputs

Mapper input             | Sorted and pruned transaction | Mapper output: conditional frequent patterns (key element : conditional pattern)
Z, Y, C, D, G, I, X, P   | C, Z, Y, X, P                 | P: C Z Y X;  X: C Z Y;  Y: C Z;  Z: C;  C: -
Y, B, C, Z, L, X, O      | C, Z, Y, B, X                 | X: C Z Y B;  B: C Z Y;  Y: C Z;  Z: C;  C: -
B, Z, H, J, O            | Z, B                          | B: Z;  Z: -
B, C, Q, K, S, P         | C, B, P                       | P: C B;  B: C;  C: -
Y, Z, C, E, L, P, X, N   | C, Z, Y, X, P                 | P: C Z Y X;  X: C Z Y;  Y: C Z;  Z: C;  C: -
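The conditional patterns in Table 5.4 can be read off mechanically: in a transaction sorted in SCT order, the conditional pattern of each item is simply the list of items preceding it. A minimal Java sketch of this step (illustrative names; the thesis emits such pairs through HadoopContext.write):

import java.util.ArrayList;
import java.util.List;

// For a transaction already sorted in SCT order, the conditional pattern of
// the item at position i is the prefix of items before it; e.g. for
// [C, Z, Y, X, P] the pair for P is (P : C Z Y X) and the pair for X is (X : C Z Y).
class ConditionalPatterns {
    static List<String[]> emit(List<String> sortedTransaction) {
        List<String[]> pairs = new ArrayList<>();   // (key element, pattern) pairs
        for (int i = 0; i < sortedTransaction.size(); i++) {
            String prefix = String.join(" ", sortedTransaction.subList(0, i));
            pairs.add(new String[] { sortedTransaction.get(i), prefix });
        }
        return pairs;
    }
}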

Each reducer then combines the conditional patterns based on the element key. After combining all the patterns for each key, the respective reducers prune the infrequent itemsets based on the minimum support threshold ξ = 3. After pruning, each reducer generates the frequent sets for its respective keys locally on each processor, as shown in Table 5.5. This step reduces the cost of interprocess communication, as the frequent sets for each key are generated locally. The results of the reducers are sent to the master node.

Function FIM_Reducer()
input: conditional frequent patterns grouped by element name
output: frequent itemsets
Steps:
1. For each conditionalpattern from the group of conditionalpatterns
     condfphead = new ConditionalFP();
2.   if (this is the first conditionalpattern)
       assign conditionalpattern to condfphead;
     else
       merge the already available pattern elements with the matching incoming element patterns.
3.     Increment the frequency counts based on the similar elements and update the condfphead node
4.     call condfphead.intersect(conditionalpattern);
5.   end if
6.   HadoopContext.write(elementName, condfphead);
7. End for

Table 5.5 FIM-Reducer Output

Reducer combines conditional patterns based on key element | Pruned infrequent items based on min support threshold | Final frequent itemsets
Y: {C, Z} / {C, Z} / {C, Z}               | Y: {C, Z}    | {Y}, {Y, Z}, {Y, Z, C}, {Y, C}
X: {C, Z, Y} / {C, Z, Y, B} / {C, Z, Y}   | X: {Y, Z, C} | {X}, {X, Y}, {X, Y, Z}, {X, Y, Z, C}, {X, Z}, {X, Y, C}, {X, Z, C}, {X, C}
P: {C, Z, Y, X} / {C, B} / {C, Z, Y, X}   | P: {C}       | {P}, {P, C}
B: {Z} / {C} / {C, Z, Y}                  | B: { }       | {B}
C: { }                                    | C: { }       | {C}
Z: {C} / {C} / {C}                        | Z: {C}       | {Z}, {Z, C}

The aggregator function on the master node takes the output of all the reducers and generates the final frequent itemsets of the entire database.

5.3 Experimental Results

The dataset considered is T10I4D100K [36]. It contains 100,000 transactions (3.93 MB) with 999 different items. Each unique item in the dataset is represented as a node in the support count tree, which has four attributes, namely the name, the count value, the left link and the right link. The cache sizes required for storing various numbers of items are given in Figure 5.6.

Figure 5.6 Cache size required for storing different numbers of items (cache size in KB versus number of unique items)

Figure 5.7 and Table 5.6 clearly show that the execution time for generating frequent itemsets using the Modified MapReduce is lower than that of the original MapReduce method. The graph also shows that the execution time decreases considerably as the number of cores increases, because the database is split evenly among the cores: with more cores, each core has a smaller number of transactions in which to find frequent itemsets, which in turn reduces the total execution time.

Table 5.6 Comparison of MapReduce and Modified MapReduce with respect to the execution time

S.No   Number of cores   Execution Time (sec) for MapReduce   Execution Time (sec) for Modified MapReduce

Figure 5.7 Performance Comparison of MapReduce and Modified MapReduce (execution time in seconds versus number of cores)

The database can be a single file, or it can come from multiple files. If cumulative frequent itemsets are to be generated from multiple files, this can be done by appending the second file to the first; the rest of the procedure is the same as explained earlier in this section. Two files, namely T10I4D100K and T10I4D1000K, have been considered. These two files are merged into a single file, and the cumulative frequent itemsets of the combined transactions are generated, as shown in Figure 5.8.
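When the merged mining is run on Hadoop, physically appending the files is not strictly necessary: assuming the job uses the standard FileInputFormat, both datasets can simply be registered as input paths of the same job, as in the following sketch (the paths are illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch: registering several transaction files as input to one job, so the
// cumulative frequent itemsets of both datasets are mined in a single run.
class MultiFileInput {
    static void configure(Job job) throws java.io.IOException {
        FileInputFormat.addInputPath(job, new Path("/input/T10I4D100K"));
        FileInputFormat.addInputPath(job, new Path("/input/T10I4D1000K"));
    }
}

Each registered file is then split and distributed across the data nodes exactly as in the single-file case.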


More information

IJESRT. Scientific Journal Impact Factor: (ISRA), Impact Factor: [35] [Rana, 3(12): December, 2014] ISSN:

IJESRT. Scientific Journal Impact Factor: (ISRA), Impact Factor: [35] [Rana, 3(12): December, 2014] ISSN: IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY A Brief Survey on Frequent Patterns Mining of Uncertain Data Purvi Y. Rana*, Prof. Pragna Makwana, Prof. Kishori Shekokar *Student,

More information

A Graph-Based Approach for Mining Closed Large Itemsets

A Graph-Based Approach for Mining Closed Large Itemsets A Graph-Based Approach for Mining Closed Large Itemsets Lee-Wen Huang Dept. of Computer Science and Engineering National Sun Yat-Sen University huanglw@gmail.com Ye-In Chang Dept. of Computer Science and

More information

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments Send Orders for Reprints to reprints@benthamscience.ae 368 The Open Automation and Control Systems Journal, 2014, 6, 368-373 Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing

More information

Data Structure for Association Rule Mining: T-Trees and P-Trees

Data Structure for Association Rule Mining: T-Trees and P-Trees IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 6, JUNE 2004 1 Data Structure for Association Rule Mining: T-Trees and P-Trees Frans Coenen, Paul Leng, and Shakil Ahmed Abstract Two new

More information

I. INTRODUCTION. Keywords : Spatial Data Mining, Association Mining, FP-Growth Algorithm, Frequent Data Sets

I. INTRODUCTION. Keywords : Spatial Data Mining, Association Mining, FP-Growth Algorithm, Frequent Data Sets 2017 IJSRSET Volume 3 Issue 5 Print ISSN: 2395-1990 Online ISSN : 2394-4099 Themed Section: Engineering and Technology Emancipation of FP Growth Algorithm using Association Rules on Spatial Data Sets Sudheer

More information

Gurpreet Kaur 1, Naveen Aggarwal 2 1,2

Gurpreet Kaur 1, Naveen Aggarwal 2 1,2 Association Rule Mining in XML databases: Performance Evaluation and Analysis Gurpreet Kaur 1, Naveen Aggarwal 2 1,2 Department of Computer Science & Engineering, UIET Panjab University Chandigarh. E-mail

More information

Materialized Data Mining Views *

Materialized Data Mining Views * Materialized Data Mining Views * Tadeusz Morzy, Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland tel. +48 61

More information

Data Mining and Warehousing

Data Mining and Warehousing Data Mining and Warehousing Sangeetha K V I st MCA Adhiyamaan College of Engineering, Hosur-635109. E-mail:veerasangee1989@gmail.com Rajeshwari P I st MCA Adhiyamaan College of Engineering, Hosur-635109.

More information

Applying Packets Meta data for Web Usage Mining

Applying Packets Meta data for Web Usage Mining Applying Packets Meta data for Web Usage Mining Prof Dr Alaa H AL-Hamami Amman Arab University for Graduate Studies, Zip Code: 11953, POB 2234, Amman, Jordan, 2009 Alaa_hamami@yahoocom Dr Mohammad A AL-Hamami

More information

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Introduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Introduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 21 Table of contents 1 Introduction 2 Data mining

More information

Association Rule Mining. Entscheidungsunterstützungssysteme

Association Rule Mining. Entscheidungsunterstützungssysteme Association Rule Mining Entscheidungsunterstützungssysteme Frequent Pattern Analysis Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

More information

Study on the Application Analysis and Future Development of Data Mining Technology

Study on the Application Analysis and Future Development of Data Mining Technology Study on the Application Analysis and Future Development of Data Mining Technology Ge ZHU 1, Feng LIN 2,* 1 Department of Information Science and Technology, Heilongjiang University, Harbin 150080, China

More information