Frequent Itemsets Melange
Sebastien Siva
Data Mining
Motivation and objectives

Finding all frequent itemsets in a dataset with the traditional Apriori approach is too computationally expensive for datasets that contain many large frequent itemsets. The two main reasons are the huge candidate generation requirements and the large number of database scans. This project attempts to address this limitation with two different approaches.

The first approach, dubbed Hash FP (frequent patterns), grows frequent itemsets directly out of the transactions where frequent item subsets were found. This requires storing references to all transactions where a frequent itemset was found, and hashing frequent itemsets into a table so that itemsets found in separate transactions are quickly identified and counted. This solution avoids Apriori-style candidate generation and reduces database scanning by visiting only relevant transactions. It has some similarities to FP-Growth, but does not build a tree structure.

The second approach is a simple implementation of the Pincer algorithm. It is based on Apriori, using Apriori to generate and test candidate itemsets while simultaneously searching for maximum frequent itemsets. Each cycle uses the infrequent itemsets found to prune maximum itemset candidates. Furthermore, the maximum frequent itemsets are used to avoid database scanning and counting for candidates that are subsets of a known maximum. Though this algorithm was originally designed to find only maximum frequent itemsets, it is trivially modified to output all frequent itemsets. Unfortunately, it cannot report an exact support count for every frequent itemset; it can only guarantee that each one meets the minimum support threshold.
Related work

There has been a lot of recent research in frequent itemset finding. One of the fastest techniques is FP-Growth, which Hash FP is somewhat modeled after. The FP-Growth technique builds a compact tree representing the entire dataset that can then be quickly scanned to build the frequent itemsets. There are many variations of this algorithm, and at least one solution adapts dynamically to the data to decide which variation should be used when. The AFOPT algorithm is one such solution and gives very quick run times on the example datasets used in this project. The experimental results are shown in Figure 1.

  Dataset (Threshold)   All Frequent Itemsets   Maximum Frequent Itemsets
  T10I4D100K (500)      2.853 s                 2.930 s
  Chess (2750)          0.171 s                 0.625 s

Figure 1. The AFOPT algorithm on the two test datasets.

This implementation is in C++, is highly optimized, and uses the latest FP-Growth techniques. As a result, it is much faster than the sample (Java) implementations done in this project.

On the Apriori side of the research comes the Pincer algorithm. It turns Apriori's main weakness, candidate generation, into a pruning device for maximum candidates. As a result, it is probably the best possible use of the Apriori candidate generation technique.

Problem Statement

The goal of this project was originally to quickly find all frequent itemsets in a dataset with many large frequent itemsets. The original idea behind the project was to
use a longest common subsequence algorithm to find some large frequent itemsets. These large itemsets could then be used to significantly reduce the candidate itemsets searched for in the dataset. Unfortunately, the longest common subsequence algorithm was found to be not scalable enough for the sample datasets used here.

The second attempt at this goal was to combine the Hash FP algorithm with the Pincer search. The Hash FP algorithm would use the infrequent itemsets it found to prune the maximum candidates, and the maximum frequent itemsets found would in turn be used to stop the Hash FP algorithm from searching subsets that were already known to be frequent. Two problems arose with this idea. First, the Hash FP algorithm does not consider nearly as many candidates as the Apriori approach, because it grows candidates from transactions, whereas Apriori considers all combinations of smaller frequent itemsets as candidates. Since fewer infrequent candidates are found, the pruning of maximum candidate itemsets is less effective. Second, Hash FP does not generate candidates before doing the database search; as a result, the top-down pruning is ineffective.

In order to achieve the original goal and evaluate the potential of the Hash FP method, the project's scope was expanded to include an implementation of the Pincer algorithm. Thus, in the end, the project has two implementations: Pincer and Hash FP. The Hash FP algorithm is potentially novel, but does not directly address the original goal.
Solutions

Hash FP

The Hash FP algorithm works by storing, for each itemset, a list of the transactions where it was found. This list is traversed in the next iteration and searched for frequent items that are greater than the largest item in the current itemset. For each frequent item found, a new itemset object consisting of the previous itemset and the new item is created, or updated if it was already created from a previous transaction. The current transaction is then added to the itemset object to record where this itemset was found. These itemset objects are indexed in an array by the item value, so when future transactions find the same itemset, they can instantly update the object. Pseudo code for the algorithm is given below in Figure 2.

As the pseudo code shows, the algorithm runs through the entire dataset twice: the first time counting all frequent items, the second time hashing all pairs of frequent items. From then on, it only visits transactions where frequent (k-1)-itemsets were found. The last item in the frequent itemset indexes the transactionlistarray. This array is reused for each search, so space utilization is minimized.
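As a rough illustration, the two phases just described (the initial pair-hashing scan, and the growth step that revisits only an itemset's own transaction list) might be sketched in Java as follows. All class and method names here are invented for the sketch and are not the project's actual code; transactions are assumed to be sorted int arrays.

```java
import java.util.*;

// Illustrative sketch of the two Hash FP phases; not the project's code.
public class HashFpSketch {

    // Phase 1: count item frequencies, then hash every pair of frequent
    // items to the list of transaction indices where the pair occurs.
    public static Map<List<Integer>, List<Integer>> frequentPairs(
            int[][] transactions, int minSup) {
        Map<Integer, Integer> itemCount = new HashMap<>();
        for (int[] t : transactions)
            for (int item : t)
                itemCount.merge(item, 1, Integer::sum);

        Map<List<Integer>, List<Integer>> pairTable = new HashMap<>();
        for (int tid = 0; tid < transactions.length; tid++) {
            int[] t = transactions[tid];                     // items assumed sorted
            for (int i = 0; i < t.length; i++) {
                if (itemCount.get(t[i]) < minSup) continue;  // skip infrequent items
                for (int j = i + 1; j < t.length; j++) {
                    if (itemCount.get(t[j]) < minSup) continue;
                    pairTable.computeIfAbsent(List.of(t[i], t[j]),
                                              k -> new ArrayList<>())
                             .add(tid);                      // remember where found
                }
            }
        }
        pairTable.values().removeIf(tids -> tids.size() < minSup);
        return pairTable;
    }

    // Phase 2: grow a frequent k-itemset by revisiting only its own
    // transaction list. A map stands in for the transactionlistarray.
    public static Map<Integer, List<Integer>> extend(
            int[][] transactions, List<Integer> tids,
            int largestItem, Set<Integer> frequentItems, int minSup) {
        Map<Integer, List<Integer>> byItem = new HashMap<>();
        for (int tid : tids)
            for (int item : transactions[tid])
                // only grow with frequent items larger than the itemset's last item
                if (item > largestItem && frequentItems.contains(item))
                    byItem.computeIfAbsent(item, k -> new ArrayList<>()).add(tid);
        byItem.values().removeIf(list -> list.size() < minSup);
        return byItem;
    }
}
```

For example, with transactions {1,2,3}, {1,2}, {2,3} and min_sup 2, frequentPairs keeps {1,2} and {2,3} together with their transaction lists, while {1,3} is discarded.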
Hash FP Algorithm

Define structure: k-itemset { int[] set, List transactionlist }

1. Scan the entire database, counting the frequency of each item and storing it, indexed by item value, in the array called itemcount.
2. Print frequent items.
3. Scan the entire database.
   a. For each transaction:
      i. For each pair (2-itemset) of frequent 1-items:
         1. Search for the 2-itemset in the hash table called 2-itemsetTable.
         2. Add a transaction reference to the transactionlist stored in the 2-itemsetTable.
   b. For each 2-itemset whose transactionlist has size >= min_sup:
      i. Add the 2-itemset to frequentitemlist.
      ii. Print the 2-itemset.
4. While frequentitemlist is not empty:
   a. For each k-itemset in the frequentitemlist:
      i. For each transaction in this k-itemset:
         1. For each frequent 1-item >= the last item in the k-itemset:
            a. Add a transaction reference to the transactionlist stored in transactionlistarray[itemvalue].
      ii. For each transactionlist in transactionlistarray with size >= min_sup:
         1. Create a new itemset.
         2. itemset.set = k-itemset union the item value indexing this transactionlist
         3. itemset.transactionlist = transactionlist
         4. Add the itemset to newfrequentitemlist.
         5. Print the itemset.
      iii. Clear out transactionlistarray.
   b. frequentitemlist = newfrequentitemlist
   c. newfrequentitemlist = null

Figure 2. Hash FP pseudo code

Pincer

The Pincer algorithm uses a standard Apriori bottom-up search to find frequent itemsets. Each infrequent itemset found is used to prune maximum frequent itemset candidates. These candidates all descend from the original candidate, which is simply the
set of all frequent items. Below is an example of the maximum candidate generation and checking, assuming 1, 3, 5, 7, 10, 12, and 14 are the frequent items.

- The original maximum candidate is {1,3,5,7,10,12,14}.
- The database is scanned to see if it is frequent.
- Assume it is not frequent, and furthermore that the Apriori bottom-up process reveals {3,10} is not frequent.
- The original maximum candidate is split into two candidates:
  o {1,3,5,7,12,14}
  o {1,5,7,10,12,14}
- These two candidates are then checked for frequency in the dataset. If frequent, they are removed from the candidate list and recorded as maximum frequents; otherwise the splitting process repeats.

This process is repeated until there are no more maximum frequent candidates. As the bottom-up Apriori search proceeds, candidates that are subsets of the maximum frequent itemsets are skipped. Thus, the Apriori search benefits from the top-down approach. Pseudo code for the algorithm is given in Figure 3.
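The splitting step above can be sketched in Java as follows. This is an illustrative sketch, not the project's actual MFCS code: every maximum candidate containing the infrequent itemset is replaced by copies that each drop one of the infrequent set's items, and candidates subsumed by a larger candidate are discarded.

```java
import java.util.*;

// Sketch of Pincer's maximum-candidate split (names are illustrative).
public class PincerSplit {
    public static List<Set<Integer>> split(List<Set<Integer>> maxCandidates,
                                           Set<Integer> infrequent) {
        List<Set<Integer>> result = new ArrayList<>();
        for (Set<Integer> cand : maxCandidates) {
            if (!cand.containsAll(infrequent)) {
                result.add(cand);                  // cannot contain the infrequent set
                continue;
            }
            for (Integer item : infrequent) {      // one new candidate per dropped item
                Set<Integer> smaller = new TreeSet<>(cand);
                smaller.remove(item);
                result.add(smaller);
            }
        }
        // Drop duplicates, then drop candidates that are proper
        // subsets of another remaining candidate.
        List<Set<Integer>> unique = new ArrayList<>(new LinkedHashSet<>(result));
        List<Set<Integer>> pruned = new ArrayList<>();
        for (Set<Integer> c : unique) {
            boolean subsumed = false;
            for (Set<Integer> o : unique)
                if (o.size() > c.size() && o.containsAll(c)) { subsumed = true; break; }
            if (!subsumed) pruned.add(c);
        }
        return pruned;
    }
}
```

With the example above, splitting {1,3,5,7,10,12,14} on the infrequent pair {3,10} yields exactly {1,3,5,7,12,14} and {1,5,7,10,12,14}.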
Pincer Algorithm

Define: List MaximumCandidates, MaximumFrequents

1. Scan the entire database, counting the frequency of each item and storing it, indexed by item value, in the array called itemcount.
2. Print frequent items.
3. Initialize MaximumCandidates to the single candidate set consisting of all frequent items.
4. Generate 2-itemset candidates from all pairs of frequent items and store them in the list called candidates.
5. While candidates is not empty and MaximumCandidates is not empty:
   a. For all MaximumCandidates:
      i. Count the frequency of the MaximumCandidate in the dataset.
      ii. If the MaximumCandidate is frequent (count >= min_sup), remove it from MaximumCandidates and add it to MaximumFrequents.
   b. For all candidates:
      i. If the candidate is a subset of a MaximumFrequent, add it to frequents and skip to the next candidate.
      ii. Count the frequency of the candidate.
      iii. If frequent (count >= min_sup), print it and add it to the list called frequents.
      iv. Else use it to split MaximumCandidates.
   c. Clear candidates.
   d. Generate new candidates, Apriori style, from all frequents.

Figure 3. Pincer algorithm pseudo code

There are some other complexities skipped over in this short pseudo code, such as Apriori candidate generation, but in general the algorithm is very straightforward. It should be noted that this algorithm is slightly modified from the original Pincer implementation. In the original implementation, if a candidate is skipped because it is a subset of a maximum frequent, it is not used for future candidate generation. To fill in the possible un-generated candidates, a complicated scheme that involves rescanning the maximum frequents is used. This was found to be overly complicated for the minimal benefit provided. Furthermore, in the implementations included with this project, the size-2 itemsets are always built directly in a hash table as opposed to Apriori
or FP-Growth generation. This was found to be faster, but too costly in terms of memory for itemsets larger than size 2.

Implementation Details

Both algorithms presented above are implemented in Java. They are designed to work with the datasets available from the FIMI Repository. Below is a description of the classes used in each implementation.

Hash FP

DataSet - Parses the input file into a list of transactions (int arrays). Also counts all frequent items and provides an interface to check whether an arbitrary item is frequent.

ItemSet - Holds a potentially frequent itemset (int array) and a list of transactions where the itemset was found.

Searcher - Implements the methods for the 2-itemset hash and the general search algorithm described in the previous section.

BufUtil - Implements a system which allows the transactionlistarray described in the previous section to be quickly erased and reused.

Pincer

DataSet - Parses the input file into a list of transactions (int arrays). Also counts all frequent items and provides an interface to check whether an arbitrary item is frequent. Furthermore, it provides a simple interface for searching all the transactions for a given collection of itemsets, with callbacks for unique processing of the frequent and infrequent itemsets as they are found.
DataSetListener - The interface required for the callback system used by the DataSet class.

ItemSet - Holds a potentially frequent itemset (int array) and a frequency counter.

MFCS - Maximum Frequent Candidate Searcher; has all the algorithms required for pruning maximum candidates with infrequent sets and for searching for maximum candidates in the dataset.

Apriori - A standard Apriori system that uses candidate generation, pruning based on the CandidateFilter interface, and candidate searching/skipping according to the algorithm described in the previous section. Furthermore, it uses infrequent candidates to prune the MFCS system.

CandidateFilter - Interface for pruning infrequent candidates before the dataset search.

TwoItemHash - CandidateFilter used to efficiently prune the 2-itemset candidates.

InFrequentSubset - CandidateFilter used to prune candidates that have infrequent subsets (standard Apriori).

Test Results

The implementations were tested on two datasets from the FIMI repository: T10I4D100K and Chess. The T10I4D100K dataset contains few large frequent itemsets, while the Chess dataset contains many. Thus it was expected that the Hash FP algorithm would perform better on T10I4D100K and worse on Chess, while the Pincer algorithm would do just the opposite. Below are the test results.
  DataSet (Threshold)   Hash FP    Pincer
  T10I4D100K (500)      20.308 s   51.233 s
  Chess (2750)          10.7 s     3.9 s

These results show the expected outcome. The Pincer algorithm, which is designed for datasets with many large frequent itemsets, performs very well on the Chess dataset but poorly on T10I4D100K. The performance of Hash FP is fairly promising for a new approach, but it is clearly not tailored to datasets with many large frequent itemsets.

Conclusion

There are many algorithms available for frequent itemset finding. Generic algorithms that grow larger itemsets from smaller ones suffer when encountering datasets with many large frequent itemsets. The Pincer algorithm is a good strategy for these datasets. The Hash FP algorithm shows a simple approach to standard bottom-up frequent itemset mining. It is simple and time-efficient, but can be expensive in memory in certain situations. Like other bottom-up algorithms, it suffers on datasets with many large frequent itemsets. Neither of these two algorithms compares in performance with some of the newer adaptive algorithms, such as AFOPT, which dynamically select data structures and search strategies as they discover characteristics of the data. Finally, these implementations are far from fully optimized, as they are written in Java.

Future Work

The Hash FP algorithm shows enough promise for some future work. To start with, it should be renamed, as it no longer requires much hashing. Furthermore, an
efficient implementation in C should be written to truly test its performance potential. Finally, it should be noted that, unlike FP-Growth, the Hash FP algorithm does not need to hold the entire dataset in memory. At the same time, it retains the advantage of only revisiting transactions that contained smaller frequent subsets. This may make the algorithm very suitable for frequent itemset mining on very large datasets that cannot fit in memory and have an expensive transaction retrieval cost. These issues justify separate research into the Hash FP method.
References

Lin, D., Kedem, Z. 1997. Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set.
Han, J., Pei, J., Yin, Y. 2000. Mining Frequent Patterns without Candidate Generation.
Liu, G., Lu, H., Yu, J., Wang, W., Xiao, X. 2003. AFOPT: An Efficient Implementation of Pattern Growth Approach.