Frequent Itemsets Melange
Sebastien Siva
Data Mining

Motivation and objectives

Finding all frequent itemsets in a dataset using the traditional Apriori approach is too computationally expensive for datasets that contain many large frequent itemsets. The two main reasons are the huge number of candidates that must be generated and the large number of database scans. This project attempts to address this limitation with two different approaches.

The first approach, dubbed Hash FP (frequent patterns), grows frequent itemsets directly out of the transactions where frequent item subsets were found. This requires storing references to all transactions where a frequent itemset was found, and hashing frequent itemsets into a table so that itemsets found in separate transactions are quickly identified and counted. This solution avoids Apriori-style candidate generation and reduces database scanning by visiting only relevant transactions. It has some similarities to FP-Growth, but does not build a tree structure.

The second approach is a simple implementation of the Pincer algorithm. It is based on the Apriori algorithm, using Apriori to generate and test candidate itemsets while simultaneously searching for maximum frequent itemsets. Each cycle uses the infrequent itemsets it finds to prune maximum itemset candidates. In turn, the maximum frequent itemsets found so far are used to avoid database scanning and counting for candidates that are subsets of a known maximum frequent itemset. Though this algorithm was originally designed to find only maximum frequent itemsets, it is trivially modified to output all frequent itemsets. Unfortunately, it cannot then report an exact support count for each frequent itemset; it can only guarantee that each one meets the minimum support threshold.

Related work

There has been a great deal of recent research in frequent itemset mining. One of the fastest techniques is FP-Growth, which Hash FP is loosely modeled after. The FP-Growth technique builds a compact tree representing the entire dataset that can then be quickly scanned to build the frequent itemsets. There are many variations of this algorithm, and at least one solution adapts dynamically to the data to choose which variation should be used when. The AFOPT algorithm is one such solution and gives very quick run times on the example datasets used in this project. The experimental results are shown in Figure 1.

    AFOPT Results                 All frequent itemsets   Maximum frequent itemsets
    T10I4D100K (Threshold 500)    2.853 sec               2.930 sec
    Chess (Threshold 2750)        0.171 sec               0.625 sec

Figure 1. The AFOPT algorithm on the two test datasets.

The AFOPT implementation is written in C++, is highly optimized, and uses the latest FP-Growth techniques. As a result, it is much faster than the sample (Java) implementations done in this project. On the Apriori side of the research comes the Pincer algorithm. It turns Apriori's main weakness, candidate generation, into a pruning device for maximum candidates. As a result, it is probably the best possible use of the Apriori candidate generation technique.

Problem Statement

The goal of this project was originally to quickly find all frequent itemsets in a dataset with many large frequent itemsets. The original idea was to use a longest common subsequence algorithm to find some large frequent itemsets, which could then be used to significantly reduce the candidate itemsets searched for in the dataset. Unfortunately, the longest common subsequence algorithm was found not to scale to the sample datasets used here.

The second attempt at this goal was to combine the Hash FP algorithm with the Pincer search: Hash FP would use the infrequent itemsets it found to prune the maximum candidates, and the maximum frequent itemsets found would in turn stop Hash FP from searching subsets already known to be frequent. Two problems arose with this idea. First, the Hash FP algorithm does not consider nearly as many candidates as the Apriori approach, because it grows candidates from transactions, whereas Apriori considers all combinations of smaller frequent itemsets as candidates. Since fewer infrequent candidates are found, the pruning of maximum candidate itemsets is less effective. Second, Hash FP does not generate candidates before doing the database search; as a result, the top-down pruning is ineffective.

In order to achieve the original goal and evaluate the potential of the Hash FP method, the project's scope was extended to include an implementation of the Pincer algorithm. Thus, in the end, the project has two implementations: Pincer and Hash FP. The Hash FP algorithm is potentially novel, but does not directly address the original goal.

Solutions

Hash FP

The Hash FP algorithm works by storing, for each itemset, a list of the transactions where it was found. This list is traversed in the next iteration and searched for frequent items that are greater than the largest item in the current itemset. For each frequent item found, a new itemset object consisting of the previous itemset plus the new item is created, or updated if it was already created from a previous transaction. The current transaction is then added to the itemset object to record where this itemset was found. These itemsets are indexed in an array by the value of the new item, so when future transactions produce the same itemset, the existing object can be updated instantly. Pseudo code for the algorithm is given below in Figure 2.

As Figure 2 shows, the algorithm runs through the entire dataset twice: the first time counting all frequent items, the second time hashing all pairs of frequent items. From then on, it only visits transactions where frequent (k-1)-itemsets were found. The last item of each frequent itemset indexes transactionListArray; this array is reused for each search, so space utilization is minimized.

Hash FP Algorithm

Define structure: k-itemset { int[] set, List transactionList }

1. Scan the entire database, counting the frequency of each item and storing it, indexed by item value, in the array itemCount.
2. Print the frequent items.
3. Scan the entire database.
   a. For each transaction:
      i. For each pair (2-itemset) of frequent items in the transaction:
         1. Look up the 2-itemset in the hash table 2-itemsetTable.
         2. Add a reference to this transaction to the transactionList stored in 2-itemsetTable.
   b. For each 2-itemset whose transactionList has size >= min_sup:
      i. Add the 2-itemset to frequentItemList.
      ii. Print the 2-itemset.
4. While frequentItemList is not empty:
   a. For each k-itemset in frequentItemList:
      i. For each transaction in this k-itemset's transactionList:
         1. For each frequent item greater than the last item in the k-itemset:
            a. Add a reference to this transaction to the transactionList stored in transactionListArray[itemValue].
      ii. For each transactionList in transactionListArray with size >= min_sup:
         1. Create a new itemset.
         2. itemset.set = k-itemset Union {transactionListArray index}
         3. itemset.transactionList = transactionList
         4. Add the itemset to newFrequentItemList.
         5. Print the itemset.
      iii. Clear transactionListArray.
   b. frequentItemList = newFrequentItemList
   c. newFrequentItemList = empty

Figure 2. Hash FP pseudo code
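To make Figure 2 concrete, below is a minimal Java sketch of the k-itemset structure and the growth step (step 4). The ItemSet fields follow the pseudo code; the class names, the frequent[] array, and the array sizing are illustrative assumptions, not the project's actual code.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the Hash FP growth step (Figure 2, step 4).
    class ItemSet {
        int[] set;                                        // items, sorted ascending
        List<int[]> transactionList = new ArrayList<>();  // transactions containing this itemset
        ItemSet(int[] set) { this.set = set; }
    }

    class HashFP {
        int minSup;
        int maxItem;          // largest item value in the dataset (assumed known)
        boolean[] frequent;   // frequent[i] is true if item i is frequent

        HashFP(int minSup, int maxItem, boolean[] frequent) {
            this.minSup = minSup; this.maxItem = maxItem; this.frequent = frequent;
        }

        // Grow every frequent (k+1)-itemset that extends the given frequent k-itemset.
        List<ItemSet> grow(ItemSet kItemSet) {
            // Reused buffer indexed by item value: the report's transactionListArray.
            List<int[]>[] transactionListArray = new List[maxItem + 1];
            int last = kItemSet.set[kItemSet.set.length - 1];

            // Visit only the transactions that contained the k-itemset.
            for (int[] transaction : kItemSet.transactionList) {
                for (int item : transaction) {
                    if (item > last && frequent[item]) {
                        if (transactionListArray[item] == null)
                            transactionListArray[item] = new ArrayList<>();
                        transactionListArray[item].add(transaction);
                    }
                }
            }

            // Every buffered list with at least min_sup transactions yields a new itemset.
            List<ItemSet> grown = new ArrayList<>();
            for (int item = last + 1; item <= maxItem; item++) {
                List<int[]> list = transactionListArray[item];
                if (list != null && list.size() >= minSup) {
                    int[] set = new int[kItemSet.set.length + 1];
                    System.arraycopy(kItemSet.set, 0, set, 0, kItemSet.set.length);
                    set[set.length - 1] = item;               // k-itemset Union {item}
                    ItemSet next = new ItemSet(set);
                    next.transactionList = list;
                    grown.add(next);
                }
            }
            return grown;
        }
    }

Note how the reused transactionListArray does the work that FP-Growth's tree does: it groups, by extension item, the transactions that already contain the current itemset.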

Pincer

The Pincer algorithm uses a standard Apriori bottom-up search to find frequent itemsets. Each infrequent itemset found is used to prune maximum frequent itemset candidates. These candidates all descend from the original candidate, which is simply the set of all frequent items. Below is an example of the maximum candidate generation and checking, assuming {1,3,5,7,10,12,14} are the frequent items.

- The original maximum candidate is {1,3,5,7,10,12,14}. The database is scanned to see whether it is frequent.
- Assume it is not frequent, and furthermore that the Apriori bottom-up process reveals {3,10} is not frequent.
- The original maximum candidate is split into two candidates, each dropping one item of the infrequent set:
  - {1,3,5,7,12,14} (10 removed)
  - {1,5,7,10,12,14} (3 removed)
- These two candidates are then checked for frequency in the dataset. Any that are frequent are removed from the candidate list and recorded as maximum frequent itemsets; the process repeats on the rest.

This continues until no maximum frequent candidates remain. As the bottom-up Apriori search proceeds, candidates that are subsets of the maximum frequent itemsets are skipped, so the Apriori search also benefits from the top-down approach.
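The splitting step above is compact enough to show directly. Below is a Java sketch of one plausible implementation, assuming itemsets are sorted int arrays as elsewhere in the project; the class and method names are illustrative, not the project's MFCS code.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Sketch of Pincer's maximum-candidate splitting: every maximum candidate
    // that contains the infrequent itemset is replaced by copies that each drop
    // one of the infrequent items. Itemsets are sorted int arrays.
    class Mfcs {

        // True if 'candidate' is a superset of 'infrequent' (both sorted).
        static boolean contains(int[] candidate, int[] infrequent) {
            int i = 0;
            for (int item : candidate) {
                if (i < infrequent.length && item == infrequent[i]) i++;
            }
            return i == infrequent.length;
        }

        // Copy of 'candidate' with one item removed.
        static int[] without(int[] candidate, int item) {
            int[] result = new int[candidate.length - 1];
            int j = 0;
            for (int c : candidate) {
                if (c != item) result[j++] = c;
            }
            return result;
        }

        // Split each maximum candidate that is a superset of 'infrequent'.
        static List<int[]> split(List<int[]> maximumCandidates, int[] infrequent) {
            List<int[]> next = new ArrayList<>();
            for (int[] candidate : maximumCandidates) {
                if (!contains(candidate, infrequent)) {
                    next.add(candidate);              // unaffected, keep as is
                } else {
                    for (int item : infrequent) {     // one new candidate per dropped item
                        next.add(without(candidate, item));
                    }
                }
            }
            return next;
        }

        public static void main(String[] args) {
            List<int[]> mc = new ArrayList<>();
            mc.add(new int[]{1, 3, 5, 7, 10, 12, 14});
            for (int[] c : split(mc, new int[]{3, 10}))
                System.out.println(Arrays.toString(c));
            // Prints [1, 5, 7, 10, 12, 14] then [1, 3, 5, 7, 12, 14].
        }
    }

The full algorithm additionally drops new candidates that are duplicates or subsets of another surviving candidate; that bookkeeping is omitted here.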

Pseudo code for the algorithm is given in Figure 3.

Pincer Algorithm

Define: List MaximumCandidates, List MaximumFrequents

1. Scan the entire database, counting the frequency of each item and storing it, indexed by item value, in the array itemCount.
2. Print the frequent items.
3. Initialize MaximumCandidates to the single candidate consisting of all frequent items.
4. Generate 2-itemset candidates from all pairs of frequent items and store them in the list candidates.
5. While candidates and MaximumCandidates are both not empty:
   a. For each MaximumCandidate:
      i. Count the frequency of the MaximumCandidate in the dataset.
      ii. If the MaximumCandidate is frequent (count >= min_sup), remove it from MaximumCandidates and add it to MaximumFrequents.
   b. For each candidate:
      i. If the candidate is a subset of a MaximumFrequent, add it to frequents and skip to the next candidate.
      ii. Count the frequency of the candidate in the dataset.
      iii. If it is frequent (count >= min_sup), print it and add it to the list frequents.
      iv. Else, use it to split MaximumCandidates.
   c. Clear candidates.
   d. Generate new candidates Apriori-style from all frequents.

Figure 3. Pincer algorithm pseudo code

There are some other complexities skipped over in this short pseudo code, such as Apriori candidate generation, but in general the algorithm is very straightforward. It should be noted that this algorithm is slightly modified from the original Pincer implementation. In the original, if a candidate is skipped because it is a subset of a frequent maximum, it is not used for future candidate generation; to fill in the possible un-generated candidates, a complicated scheme that involves re-scanning the maximum frequents is used. This was found to be overly complicated for the minimal benefit provided. Furthermore, in the implementations included with this project, the 2-itemsets are always built directly in a hash table rather than generated Apriori- or FP-Growth-style. This was found to be faster, but too costly in terms of memory for itemsets larger than size 2.
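One plausible way to build that 2-itemset hash table in Java, assuming item identifiers fit in 32 bits so each pair packs into a single long key (this is an illustrative counting variant; Hash FP keeps transaction lists rather than bare counts, but the hashing idea is the same):

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch of counting 2-itemsets directly in a hash table.
    // A pair (a, b) with a < b is packed into one long key.
    class TwoItemCounter {
        private final Map<Long, Integer> counts = new HashMap<>();

        private static long key(int a, int b) {
            return ((long) a << 32) | (b & 0xFFFFFFFFL);
        }

        // Count every pair of frequent items in one transaction (sorted ascending).
        void addTransaction(int[] transaction, boolean[] frequent) {
            for (int i = 0; i < transaction.length; i++) {
                if (!frequent[transaction[i]]) continue;
                for (int j = i + 1; j < transaction.length; j++) {
                    if (!frequent[transaction[j]]) continue;
                    counts.merge(key(transaction[i], transaction[j]), 1, Integer::sum);
                }
            }
        }

        int support(int a, int b) {
            return counts.getOrDefault(key(a, b), 0);
        }
    }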

Implementation Details

Both algorithms presented above are implemented in Java. They are designed to work with the datasets available from the FIMI repository. Below is a description of the classes used in each implementation.

Hash FP

- DataSet: Parses the input file into a list of transactions (int arrays). Also counts all frequent items and provides an interface to check whether an arbitrary item is frequent.
- ItemSet: Holds a potentially frequent itemset (int array) and a list of transactions where the itemset was found.
- Searcher: Implements the 2-itemset hash and the general search algorithm described in the previous section.
- BufUtil: Implements a system that allows the transactionListArray described in the previous section to be quickly erased and reused.

Pincer

- DataSet: Parses the input file into a list of transactions (int arrays). Also counts all frequent items and provides an interface to check whether an arbitrary item is frequent. Furthermore, it provides a simple interface for searching all the transactions for a given collection of itemsets, with callbacks for custom processing of the frequent and infrequent itemsets as they are found.

- DataSetListener: The interface required for the callback system used by the DataSet class.
- ItemSet: Holds a potentially frequent itemset (int array) and a frequency counter.
- MFCS: The Maximum Frequent Candidate Searcher; holds all the algorithms required for pruning maximum candidates with infrequent sets and for searching the dataset for maximum candidates.
- Apriori: A standard Apriori system that uses candidate generation, pruning based on the CandidateFilter interface, and candidate searching/skipping according to the algorithm described in the previous section. It also uses infrequent candidates to prune the MFCS system.
- CandidateFilter: Interface for pruning infrequent candidates before the dataset search.
- TwoItemHash: CandidateFilter used to efficiently prune the 2-itemset candidates.
- InFrequentSubset: CandidateFilter used to prune candidates that have infrequent subsets (standard Apriori pruning).
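Since several of these classes hinge on the CandidateFilter abstraction, here is a plausible Java shape for it, together with the standard Apriori subset check that InFrequentSubset performs. The method names and the linear lookup are illustrative guesses, not the project's exact code.

    import java.util.Arrays;
    import java.util.List;

    // Plausible shape of the CandidateFilter abstraction: each filter inspects a
    // candidate before the (expensive) dataset scan and can veto it.
    interface CandidateFilter {
        boolean prune(int[] candidate);   // true => drop candidate before scanning
    }

    // Standard Apriori pruning: a k-candidate survives only if every one of its
    // (k-1)-subsets was frequent in the previous pass.
    class InFrequentSubset implements CandidateFilter {
        private final List<int[]> previousFrequents;   // frequent (k-1)-itemsets

        InFrequentSubset(List<int[]> previousFrequents) {
            this.previousFrequents = previousFrequents;
        }

        public boolean prune(int[] candidate) {
            // Each (k-1)-subset is the candidate with one position removed.
            for (int skip = 0; skip < candidate.length; skip++) {
                int[] subset = new int[candidate.length - 1];
                for (int i = 0, j = 0; i < candidate.length; i++) {
                    if (i != skip) subset[j++] = candidate[i];
                }
                if (!isKnownFrequent(subset)) return true;   // infrequent subset found
            }
            return false;
        }

        // Linear scan for clarity; a hash set of itemsets would be faster.
        private boolean isKnownFrequent(int[] itemset) {
            for (int[] f : previousFrequents) {
                if (Arrays.equals(f, itemset)) return true;
            }
            return false;
        }
    }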

Test Results

The implementations were tested on two datasets from the FIMI repository: T10I4D100K and Chess. The T10I4D100K dataset contains few large frequent itemsets, while the Chess dataset contains many. It was therefore expected that the Hash FP algorithm would perform better on T10I4D100K and worse on Chess, and that the Pincer algorithm would do the opposite. The test results are below.

    DataSet (Threshold)    Hash FP     Pincer
    T10I4D100K (500)       20.308 s    51.233 s
    Chess (2750)           10.7 s      3.9 s

These results show the expected outcome. The Pincer algorithm, which is designed for datasets with many large frequent itemsets, performs very well on the Chess dataset but poorly on T10I4D100K. The performance of Hash FP is fairly promising for a new approach, but it is clearly not tailored to datasets with many large frequent itemsets.

Conclusion

There are many algorithms available for frequent itemset finding. Generic algorithms that grow larger itemsets from smaller ones suffer on datasets with many large frequent itemsets; the Pincer algorithm is a good strategy for such datasets. The Hash FP algorithm is a simple approach to standard bottom-up frequent itemset mining. It is time-efficient, but can be expensive in memory in certain situations, and like other bottom-up algorithms it suffers on datasets with many large frequent itemsets. Neither of these two algorithms compares in performance with the newer adaptive algorithms such as AFOPT, which dynamically select data structures and search strategies as they discover characteristics of the data. Finally, these implementations are far from fully optimized, as they are written in Java.

Future Work

The Hash FP algorithm shows enough promise to merit future work. To start with, it should be renamed, as it no longer relies much on hashing. Furthermore, an efficient implementation in C should be written to truly test its performance potential.

Finally, it should be noted that, unlike FP-Growth, the Hash FP algorithm does not need to hold the entire dataset in memory, while it retains the advantage of revisiting only transactions that contained smaller frequent subsets. This may make the algorithm well suited to frequent itemset mining on very large datasets that cannot fit in memory and have an expensive transaction retrieval cost. These issues justify separate research into the Hash FP method.

References

Lin D., Kedem Z. 1997. Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set.
Han J., Pei J., Yin Y. 2000. Mining Frequent Patterns without Candidate Generation.
Liu G., Lu H., Yu J., Wang W., Xiao X. 2003. AFOPT: An Efficient Implementation of Pattern Growth Approach.