FIT: A Fast Algorithm for Discovering Frequent Itemsets in Large Databases Sanguthevar Rajasekaran
FIT: A Fast Algorithm for Discovering Frequent Itemsets in Large Databases

Jun Luo
Dept. of Computer Science, Ohio Northern University, Ada, OH 45810
Email: j-luo@onu.edu

Sanguthevar Rajasekaran
Dept. of Computer Science & Engineering, University of Connecticut
191 Auditorium Road, U-155, Storrs, CT 06269-3155
Email: rajasek@engr.uconn.edu

Abstract

Association rule mining is an important data mining problem that has been studied extensively. In this paper, a simple but Fast algorithm for Intersecting attribute lists using a hash Table (FIT) is presented. FIT is designed to compute all the frequent itemsets in large databases efficiently. It deploys the same basic idea as Eclat but achieves much better computational performance, for two reasons: 1) FIT performs fewer comparisons in each intersection operation between two attribute lists, and 2) FIT significantly reduces the total number of intersection operations. The experimental results demonstrate that the performance of FIT is much better than that of the Eclat and Apriori algorithms.

Keywords: association rule, frequent itemset, FIT.

1. Introduction

Association rule mining originated from the necessity of analyzing large amounts of supermarket basket data [2][3][5][6][9][10][11][12]. It is a well-studied problem in data mining. The problem of mining association rules can be formally stated as follows: Let I = {i1, i2, ..., in} be a set of attributes, called items. An itemset is a subset of I. D represents a database that consists of a set of transactions. Each transaction in D contains two parts: a unique transaction identification number (tid) and an itemset. The size of an itemset is defined as the number of items in it; an itemset of size d is denoted a d-itemset. A transaction t is said to contain an item i if i appears in t. A transaction t is said to contain an itemset X if all the items in X are contained in t. The support of X is s if there are s transactions containing X in D.
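To make these definitions concrete, here is a small illustrative sketch (Python, with an invented five-transaction database; the item names and the helper `support` are ours, not from the paper) that counts the support of an itemset:

```python
# Invented example database: each transaction is (tid, itemset).
D = [
    (1, {"bread", "milk"}),
    (2, {"bread", "butter"}),
    (3, {"bread", "milk", "butter"}),
    (4, {"milk"}),
    (5, {"bread", "milk"}),
]

def support(itemset, database):
    # Number of transactions whose itemset contains every item of `itemset`.
    return sum(1 for _, items in database if itemset <= items)

print(support({"bread", "milk"}, D))  # 3: transactions 1, 3, and 5 contain both
```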
An itemset X is said to be a frequent itemset if its support is greater than or equal to the user-specified minimum support (minsup). An association rule can be expressed as X => Y, in which X and Y are itemsets and X ∩ Y = Ø. The rule X => Y is said to have support s if s transactions in D contain the itemset X ∪ Y, and confidence c if c percent of the transactions that contain X also contain Y. The symbol minconf is used to represent the user-specified minimum confidence. Given D, minsup, and minconf, the problem of mining association rules is to generate all the association rules whose supports and confidences are greater than or equal to minsup and minconf, respectively.

For convenience of discussion, some conventions are adopted in this paper: If X and Y are frequent itemsets, and the union of X and Y is also a frequent itemset, then X and Y are said to have a strong association relation; otherwise, X and Y are said to have a weak association relation. If a symbol A represents a set or a list, then the notation |A| stands for the number of elements in A. The other notations used in the rest of this paper are shown in Table 1.

Table 1 Notations

Notation    Remarks
Lk          The collection of frequent k-itemsets with their attribute lists.
ln          An attribute list in Lk, where 1 <= n <= |Lk|.
ln[i]       The i-th attribute of ln, where 1 <= n <= |Lk| and 1 <= i <= |ln|.
CF(i)       All the attribute lists that follow li in Lk, where 1 <= i < |Lk|.
SG(i,j)     The union of the attribute lists from li to lj in Lk, where 1 <= i < j <= |Lk|.
F(i,j)      All the attribute lists that follow lj in Lk and have strong association relations with SG(i,j).

Generally speaking, the task of mining association rules consists of two steps: 1) calculate all the frequent itemsets; 2) calculate the association rules from the frequent itemsets discovered in step 1). Between the two steps, calculations of frequent

itemsets play an essential role in association rule mining. In this paper, the algorithm FIT is presented. FIT is a simple but fast algorithm for computing all the frequent itemsets in large databases. The basic idea of FIT is similar to that of Eclat [7], but FIT has much better computational performance than Eclat. The remainder of this paper is organized as follows: Section 2 describes a simple method. Section 3 puts forward the FIT algorithm. Section 4 discusses experimental results. Finally, Section 5 presents conclusions.

2. A Simple Method

If a transaction t contains an itemset X, then t is treated as an attribute of X. The attribute is represented by the tid value of t. All the attributes of X form an attribute list. Attribute lists for all the items in I are generated by scanning D once, so the original database is transformed into the attribute list format. All the attribute lists whose support values are no less than minsup constitute L1. With the attribute list format, the calculation of frequent itemsets becomes straightforward: the support value of X is determined by the number of attributes in its attribute list. Computing the support value of the itemset generated from the union of itemsets X and Y consists of two steps: Step 1) intersect the attribute lists of X and Y; Step 2) count the number of attributes in the intersection result.

The intersection of any two attribute lists, l1 and l2, can be calculated using a hash table. The length of the hash table depends on the largest attribute value in D. The initial value of each hash table entry is set to -1. The calculation begins by scanning l1 first. During the scan, attribute values are used as indices to access hash table entries, and the values of the entries being accessed are set to 1. Then l2 is scanned. During this scan, attribute values are again used as indices to access hash table entries. If the entry being accessed contains 1, the corresponding attribute is kept in the intersection result.
Otherwise, the attribute is discarded. The total number of comparisons for computing l1 ∩ l2 is min(|l1|, |l2|). For n attribute lists (l1, l2, ..., ln), the intersections between an attribute list lp (1 <= p < n) and each of the remaining attribute lists lq (p < q <= n) are computed as follows: scan lp once and initialize the hash table as discussed above; then successively scan each lq and calculate the intersections. If all the attribute lists are arranged in an order such that |l1| >= |l2| >= ... >= |ln|, the total number of comparisons for calculating l_p ∩ l_p+1, l_p ∩ l_p+2, ..., and l_p ∩ l_n is equal to |l_p+1| + |l_p+2| + ... + |l_n|.

Starting from L1, all the frequent itemsets of any size can be calculated in two ways: breadth-first calculation or depth-first calculation. The idea of the breadth-first calculation is that all the frequent k-itemsets, k > 1, are calculated before any of the frequent (k+1)-itemsets is calculated. The idea of the depth-first calculation is that, given an Lk-1, k > 1, if the intersection results between an attribute list lp (1 <= p < n) and the attribute lists that follow lp in Lk-1 generate a non-empty Lk, then Lk is processed before any of the intersections between an attribute list lq (p < q <= n) and the attribute lists that follow lq in Lk-1 is computed. Experiments showed that the depth-first strategy had better performance than the breadth-first strategy; it is believed that the depth-first strategy results in better cache hit rates.

Given a database D and a minsup value, a formal description of the simple method is shown in Figure 1 below:

Step 1) Transform D into the attribute list format and calculate L1. Sort the items and corresponding attribute lists in L1 into non-increasing order according to the number of attributes in the lists. Mark all the itemsets in L1 as unvisited.
Step 2) Establish a hash table (hb) with |D| entries. Set each entry in hb to -1. Set k to 1.
Step 3) If all the itemsets in Lk have been visited and k equals 1, the calculation terminates. If all the itemsets in Lk have been visited and k does not equal 1, decrease k by 1.
Step 4) Scan the attribute list of the first unvisited itemset X in Lk. For each attribute vx, set hb[vx] to 1. Mark X as visited.
Step 5) Scan the attribute list of each of the other itemsets Y that follow X in Lk. For each attribute vy, if hb[vy] equals -1, discard vy; if hb[vy] equals 1, put vy into the resulting attribute list. If the number of attributes in the resulting attribute list is no less than minsup, put the itemset X ∪ Y and the resulting attribute list into Lk+1, and mark the itemset X ∪ Y as unvisited in Lk+1.
Step 6) Reset the entries in hb to -1. If Lk+1 is not empty, increase k by 1 and go to Step 4). Otherwise, go to Step 3).

Figure 1 A Simple Method

3. Algorithm FIT

Given n attribute lists, the simple method in Figure 1 needs to perform n(n-1)/2 intersection calculations. If two itemsets, X and Y, have a weak association relation, the attribute list calculation for X ∪ Y is unnecessary.
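Before moving on, the simple method of Figure 1 can be rendered as a short sketch. This is an illustrative Python version (the paper's programs were written in C++); the tiny database, the item names, and the minsup value here are invented for the example:

```python
# Illustrative sketch of the simple method (Figure 1): hash-table
# intersections of tid-lists, explored depth-first. Invented example data.
MINSUP = 2  # example minimum support

def intersect(hb, tidlist):
    # Keep the tids whose hash-table entry was set to 1 by the current X.
    return [tid for tid in tidlist if hb[tid] == 1]

def depth_first(itemsets, hb, result):
    # itemsets: list of (itemset, tidlist), sorted by non-increasing list size.
    for i, (x, lx) in enumerate(itemsets):
        for tid in lx:                      # Step 4: load X's list into hb
            hb[tid] = 1
        nxt = []
        for y, ly in itemsets[i + 1:]:      # Step 5: intersect with later lists
            lxy = intersect(hb, ly)
            if len(lxy) >= MINSUP:
                nxt.append((x | y, lxy))
                result.append((x | y, len(lxy)))
        for tid in lx:                      # Step 6: reset hb before moving on
            hb[tid] = -1
        if nxt:                             # depth-first: extend X's results first
            depth_first(nxt, hb, result)

# Attribute-list (tid-list) format of an invented database with tids 1..5.
L1 = [(frozenset("a"), [1, 2, 3, 5]),
      (frozenset("b"), [1, 2, 3]),
      (frozenset("c"), [2, 3])]
hb = {tid: -1 for tid in range(1, 6)}
frequent = []
depth_first(L1, hb, frequent)
# frequent now holds ({a,b}, 3), ({a,c}, 2), ({a,b,c}, 2), ({b,c}, 2)
```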

The overall computation performance can be improved if such unnecessary intersection calculations are avoided. The idea for cutting down on the unnecessary intersection operations is based on Lemma 1:

Lemma 1: Let l be the union of n attribute lists (l1, l2, ..., ln). If l has a weak association relation with another attribute list l_n+1, then any attribute list li, 1 <= i <= n, also has a weak association relation with l_n+1.

Proof: Assume l has a weak association relation with l_n+1 and, without loss of generality, that l1 has a strong association relation with l_n+1. Let a = |l1 ∩ l_n+1| and b = |l ∩ l_n+1|. Since the attributes of l1 form a subset of those of l, b is greater than or equal to a, and thus b is no less than minsup. Therefore, l has a strong association relation with l_n+1, which contradicts the assumption. So l1 cannot have a strong association relation with l_n+1.

Given an Lk, its attribute lists are logically divided into |Lk|/d subgroups. Each subgroup except the last one has d attribute lists, where 1 < d < |Lk|; the last subgroup has |Lk| - (|Lk|/d - 1)d attribute lists. For convenience of discussion, in the rest of this paper |Lk| is assumed to be an integral multiple of d. The subgroups are denoted SG(1,d), SG(d+1,2d), ..., SG(|Lk|-d+1,|Lk|); that is, the (i+1)-st subgroup is SG(id+1,(i+1)d) for 0 <= i < |Lk|/d. Starting with the first subgroup SG(1,d) and proceeding to the last, for each subgroup SG(id+1,(i+1)d) do the following: 1) calculate the set F(id+1,(i+1)d); 2) apply to the attribute lists in SG(id+1,(i+1)d) the simple method introduced in Figure 1, with one small change: for each attribute list lg in SG(id+1,(i+1)d), the simple method calculates the intersections between lg and the attribute lists that either are in F(id+1,(i+1)d) or follow lg within SG(id+1,(i+1)d). The set F(id+1,(i+1)d) is calculated as follows: at the beginning, F(id+1,(i+1)d) is set to CF((i+1)d); then the union of all the attribute lists in SG(id+1,(i+1)d) is calculated, and the result is u.
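The filtering that Lemma 1 justifies, computing the union u of a subgroup's lists and dropping any following list whose intersection with u is already weak, can be illustrated with a minimal Python sketch (the helper name is ours; the tid-lists are borrowed from the example of Figure 3 later in the paper, items 1 through 8 with minsup = 3):

```python
# Illustration of Lemma 1's pruning, using the tid-lists of items 1..8 from
# the example in Figure 3 (minsup = 3).
MINSUP = 3

def prune_followers(subgroup, followers, minsup):
    # Lemma 1: if the union u of the subgroup's lists has a weak association
    # relation with a list lq, then every member of the subgroup does too,
    # so lq can be dropped without intersecting it against each member.
    u = set().union(*subgroup)
    return [lq for lq in followers if len(u & set(lq)) >= minsup]

subgroup = [[1, 3, 5], [1, 5, 8], [1, 3, 4, 5, 8]]        # items 1, 2, 3
followers = [[1, 6, 7], [1, 2, 4, 8], [2, 4, 6, 7],       # items 4, 5, 6
             [2, 4, 5, 7], [6, 7, 8]]                     # items 7, 8
F = prune_followers(subgroup, followers, MINSUP)
# Only item 5's list survives: u = {1, 3, 4, 5, 8} and |u & {1, 2, 4, 8}| = 3.
```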
The intersections between u and each attribute list lq in F(id+1,(i+1)d) are calculated one at a time. If u and lq have a weak association relation, lq is removed from F(id+1,(i+1)d).

The algorithm FIT is simply a recursive version of the above discussion. After the first logical division of Lk, if the size of each subgroup is still large, then, once F(id+1,(i+1)d) has been calculated, each subgroup is treated as a new set of frequent k-itemsets, and the method introduced above is applied to it. This procedure repeats until the size of each subgroup is small enough. Note that when a subgroup SG(id+1,(i+1)d) is divided into smaller subgroups of size d', then for each smaller subgroup SG(jd'+1,(j+1)d'), the initial set F(jd'+1,(j+1)d') is the union of CF((j+1)d') and F(id+1,(i+1)d).

Pseudocode descriptions of the algorithm FIT are shown in Figure 2. Given an Lk, the recursion level of the subgroups generated by logically dividing Lk is 1; the recursion level of subgroups is q+1 if they are generated by logically dividing a subgroup whose recursion level is q. In Figure 2, the maximum recursion level and the sizes of the subgroups at the different recursion levels for L1 are recorded in a variable max_depth and an array depth[1..max_depth], respectively.

fit()
  Scan D and calculate L1, the frequent 1-itemsets with their attribute lists;
  Sort the itemsets and attribute lists in L1 into non-increasing order of list size;
  Create hb with |D| entries;
  Determine the value of max_depth and the values of depth[1..max_depth];
  k := 1; p := 1;
  for(i := 1; i < |L1|-depth[1]; i := i+depth[1])
    Reset all the entries in hb to -1;
    for(j := i; j < i+depth[1]; j := j+1)
      initialize_hb(L1.lj);
    F := Ø;
    for(j := i+depth[1]; j <= |L1|; j := j+1)
      l := intersection(L1.lj);
      if(l != NULL) F := F ∪ {j};
    if(p < max_depth)
      calculate_subgroup(p+1, i, i+depth[1], F);
    else
      Lk+1 := Ø;
      for(j := i; j < i+depth[1]; j := j+1)
        Reset all the entries in hb to -1;
        initialize_hb(L1.lj);
        for(x := j+1; x < i+depth[1]; x := x+1)
          l := intersection(L1.lx);
          if(l != NULL) Lk+1 := Lk+1 ∪ {l};
        for(x := 1; x <= |F|; x := x+1)
          l := intersection(L1.l_F[x]);
          if(l != NULL) Lk+1 := Lk+1 ∪ {l};
        if(|Lk+1| > 0) depth_first(Lk+1, k+1);
  for(i := |L1|-depth[1]; i <= |L1|; i := i+1)
    Lk+1 := Ø;
    Reset all the entries in hb to -1;
    initialize_hb(L1.li);
    for(x := i+1; x <= |L1|; x := x+1)
      l := intersection(L1.lx);
      if(l != NULL) Lk+1 := Lk+1 ∪ {l};
    if(|Lk+1| > 0) depth_first(Lk+1, k+1);
//end of fit()

initialize_hb(l)
  for(h := 1; h <= |l|; h := h+1)
    v := l[h];
    if(hb[v] != 1) hb[v] := 1;
//end of initialize_hb()

intersection(lx)
  l := Ø;
  for(h := 1; h <= |lx|; h := h+1)
    v := lx[h];
    if(hb[v] != -1) l := l ∪ {v};
  if(|l| >= minsup) return l;
  else return NULL;
//end of intersection()

calculate_subgroup(p, be, en, F)
  for(i := be; i < en; i := i+depth[p])
    Reset all the entries in hb to -1;
    for(j := i; j < i+depth[p]; j := j+1)
      initialize_hb(L1.lj);
    C := Ø;
    for(j := i+depth[p]; j < en; j := j+1)
      l := intersection(L1.lj);
      if(l != NULL) C := C ∪ {j};
    for(j := 1; j <= |F|; j := j+1)
      v := F[j];
      l := intersection(L1.lv);
      if(l != NULL) C := C ∪ {v};
    if(p < max_depth)
      calculate_subgroup(p+1, i, i+depth[p], C);
    else
      Lk+1 := Ø;
      for(j := i; j < i+depth[p]; j := j+1)
        Reset all the entries in hb to -1;
        initialize_hb(L1.lj);
        for(x := 1; x <= |C|; x := x+1)
          v := C[x];
          l := intersection(L1.lv);
          if(l != NULL) Lk+1 := Lk+1 ∪ {l};
        if(|Lk+1| > 0) depth_first(Lk+1, k+1);
//end of calculate_subgroup()

depth_first(Lk, k)
  for(i := 1; i < |Lk|; i := i+1)
    Lk+1 := Ø;
    Reset all the entries in hb to -1;
    initialize_hb(Lk.li);
    for(j := i+1; j <= |Lk|; j := j+1)

      l := intersection(Lk.lj);
      if(l != NULL) Lk+1 := Lk+1 ∪ {l};
    if(|Lk+1| > 0) depth_first(Lk+1, k+1);
//end of depth_first()

Figure 2 The Algorithm FIT

A simplified example is shown in Figure 3, with minsup set to 3. In Figure 3, (a) shows a database consisting of 8 transactions, and (b) displays the corresponding attribute list format. The length of the hash table is set to 8, and the subgroup size d is set to 3. The procedure for calculating intersections on (b) is as follows: the attribute lists of the itemsets {1}, {2}, and {3} are scanned successively; the snapshot of the hash table after the scan is shown in (c). Then the intersections between the union of the first three attribute lists and the attribute lists of the itemsets {4} through {8} are calculated separately; the results are shown in (d). Because only the attribute list of the itemset {5} has a strong association relation with the union of the first three attribute lists, only the support values of the itemsets {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, and {3,5} are further calculated. The final results are shown in (e). There are no frequent 3-itemsets, so the calculation stops.

(a) Database:
Tid   Items in transactions
1     1, 2, 3, 4, 5
2     5, 6, 7
3     1, 3, 5
4     3, 5, 6, 7
5     1, 2, 3, 7, 9
6     4, 6, 8, 9
7     4, 6, 7, 8
8     2, 3, 5, 8

(b) Attribute list format:
Itemset   Attribute List    Support
1         1, 3, 5           3
2         1, 5, 8           3
3         1, 3, 4, 5, 8     5
4         1, 6, 7           3
5         1, 2, 4, 8        4
6         2, 4, 6, 7        4
7         2, 4, 5, 7        4
8         6, 7, 8           3

(c) Hash table after scanning the lists of items 1, 2, and 3:
Entry   1   2   3   4   5   6   7   8
Value   1  -1   1   1   1  -1  -1   1

(d) Intersections of the union of the first three lists with the remaining lists (only the strong one shown):
Itemset       Attribute List   Support
SG(1,3), 5    1, 4, 8          3

(e) Frequent 2-itemsets:
Itemset   Attribute List   Support
1, 3      1, 3, 5          3
2, 3      1, 5, 8          3
3, 5      1, 4, 8          3

Figure 3 A Simplified Example

In Figure 3, in order to calculate the frequent 2-itemsets that contain at least one of the three items 1, 2, or 3, a total of 5+3+2+1 = 11 intersection operations is performed. If the simple method in Figure 1 were used, a total of 7+6+5 = 18 intersection operations would be needed.

4. Experimental Results

We implemented the algorithms Apriori and Eclat as faithfully as we could. Besides FIT, the simple method in Figure 1 was also implemented as a separate algorithm, because we wanted to see how effectively the simple method reduces the total number of comparisons performed by Eclat. All the programs were written in C++. For the same reason mentioned in [2], we did not implement the FP-Growth algorithm of [4]. Instead of trying to implement as many other current algorithms as possible, we spent most of the time implementing Apriori efficiently. Many papers compare their algorithms with Apriori; by showing the comparisons between FIT and Apriori, it is hoped that readers can indirectly compare the performance of FIT with that of other algorithms.

When Eclat, the simple method, and FIT were implemented, the following technique was used to decide whether an intersection operation could be stopped early, before all the attributes in both attribute lists have been examined: suppose that at some point during the intersection of two attribute lists l1 and l2 there are still a attributes remaining in l1 and b attributes remaining in l2, where a < |l1| and b < |l2|. If the number of attributes already put into the resulting attribute list is c, and the sum of c and min(a, b) is less than minsup, the current intersection operation can be stopped immediately. Also, in our implementations, Eclat was extended to calculate the intersections between the attribute lists of frequent single items.

All the experiments were performed on a Sun Ultra 80 workstation with four 450-MHz UltraSPARC II processors, each with a 4-MB L2 cache. The total main memory was 4 GB, and the operating system was Solaris 8. Synthetic datasets were created using the data generator of [8]. The synthetic datasets used in the first three experiments were D1 = T26I4N1kD100k, D2 = T10I4N1kD100k, and D3 = T10P4N1kD100k. The dataset name T26I4N1kD100k means an average transaction size of 26, an average size of the maximal potentially frequent itemsets of 4, 1,000 distinct items, and 100,000 generated transactions. The number of patterns in all three synthetic datasets was set to 1,000.

The first set of experimental results, shown in Figure 4 and Figure 5, was obtained on D1. Figure 4 shows the run time comparisons, and Figure 5 shows the corresponding speedups of Eclat, the simple method, and FIT over Apriori. The run times of FIT were measured with the set L1 divided into 2 levels of subgroups; the sizes of the subgroups were 15 and 3. For any other set Lk, k > 1, the level of subgroups was restricted to 1, and the size was set to 3.

[Figure 4: run times of Apriori, Eclat, the simple method, and FIT on D1, minsup from .5% to 2%]
[Figure 5: speedups of Eclat, the simple method, and FIT over Apriori on D1]

The second set of experiments was performed on D2, and the run time results are shown in Figure 6. The third set of experiments was performed on D3, and the run time results are shown in Figure 7.
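The early-termination rule used in these implementations (stop a candidate intersection once the attributes already matched, c, plus the best still possible, min(a, b), cannot reach minsup) can be sketched as follows. For brevity, this illustration (ours, not the paper's code) applies the cutoff inside a sorted-merge intersection rather than the hash-table scan, but the test itself is the same:

```python
# Sketch of the early-termination check for tid-list intersection
# (illustrative; uses sorted-merge instead of the paper's hash table).
MINSUP = 3  # example minimum support

def intersect_with_cutoff(l1, l2, minsup):
    # Intersect two sorted tid-lists, stopping as soon as the tids already
    # matched plus min(remaining in l1, remaining in l2) fall short of minsup.
    result, i, j = [], 0, 0
    while i < len(l1) and j < len(l2):
        a, b = len(l1) - i, len(l2) - j       # attributes still remaining
        if len(result) + min(a, b) < minsup:  # cannot reach minsup any more
            return None                       # stop early: weak relation
        if l1[i] == l2[j]:
            result.append(l1[i]); i += 1; j += 1
        elif l1[i] < l2[j]:
            i += 1
        else:
            j += 1
    return result if len(result) >= minsup else None

# Lists of items {3} and {5} from Figure 3: the intersection {1, 4, 8}
# reaches minsup = 3, so no early exit is triggered here.
print(intersect_with_cutoff([1, 3, 4, 5, 8], [1, 2, 4, 8], MINSUP))
```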
Figure 8 shows the corresponding speedups of FIT 12 1 Figure 5 Apriori Eclat Simple Method FIT onds) run time (sec 1 9 8 7 6 5 4 3 2 1.5% 1% 1.5% 2% Figure 4 ) run time (seconds 8 6 4 2.5%.1%.15%.2% Figure 6 over Apriori. Similarly, Figure 9 illustrates the speedups of FIT over Eclat. In both experiments, the 6

speedup set L 1 was divided into 3 levels of subgroups. The sizes were 12, 15, and 3. For any other set, k>1, the subgroup level was restricted to 1, and the size Apriori Eclat Simple Method FIT 8 7 6 5 4 3 2 1 11 1 9 8 7 6 5 4 3 2 1 was set to 3..5%.1%.15%.2% D2 Figure 7 D3.5%.1%.15%.2% Figure 8 The results of the above three experiments and our other experiments show that FIT is consistently faster than the other three algorithms. As is decreased, the run times of FIT are increased at a slower pace than Apriori and Eclat. When equals 15 (.15%) or 2 (.2%) in Figure 6 and Figure 7, there are few frequent itemsets existing in the datasets D2 and D3. As a result, the speedups of FIT over the other algorithms are not as significant as those in other situations. The experimental results also show that the simple method is always faster than Eclat. In Figure 4, the speedup of the simple method over Eclat is as high as 3.51 when is set to 1 (1%). However, both the simple method and Eclat might be slower than Apriori in experiments. Examples are illustrated in Figure 6 when is set to 15 (.15%) or 2 (.2%). To see how effectively the simple method reduced the total number of comparisons performed by Eclat, speedup 5 45 4 35 3 25 2 15 1 5 total comparison times in millions 8 7 6 5 4 3 2 1 Eclat D2 Basic Method 74.5 D3.5%.1%.15%.2% Figure 9 15 17.11 Figure 1 7

21,97 5 Figure 11 two sample results are shown in Figure 1 and Figure 11. Both results came from the experiments on D3. Figure 1 shows the total comparison times when is set to 15 (.15%). Figure 11 illustrates the total number of comparisons when is set to 5 (.5%). In both Figures, the vertical axes represent the total number of comparisons in millions. In Figure 1, the total number of comparisons performed by the simple method is about 23 percent of that performed by Eclat. In Figure 11, the total number of comparisons performed by the simple method is about 31 percent ) total intersection operations (in millions total comparison times in millions 3 25 2 15 1 5 Eclat 25, 2, 15, 1, 5, 26.7 of that performed by Eclat. Basic Method 6736 Simple Method 3.6.46.49.92.25.5%.1%.15% Figure 12 FIT Figure 12 shows the comparisons of the total number of intersection operations performed by FIT and the simple method. Note that, for a set of frequent itemsets, the number of intersection operations performed by the simple method and Eclat should be the same. The results in Figure 12 came from the experiments on D3. That FIT significantly cut down on the intersection operations performed by the simple method or Eclat explains the results in Figure 9 where FIT is much faster than the simple method and Eclat. 28 24 2 16 12 8 7 6 5 4 3 2 1 8 4 Eclat FIT.25%.5% 1% 1.5% 2% Apriori Figure 13 T2I6D1K Eclat FIT Apriori.25%.5% 1% 1.5% 2% Figure 14 T1I2D1K 8

1 Eclat FIT Apriori 28 24 Eclat FIT Apriori 8 2 6 4 16 12 8 2 4.25%.5% 1% 1.5% 2%.25%.5% 1% 1.5% 2% Figure 15 T1I4D1K Figure 17 T2I4D1K Several other experiments in [8] were also performed. The results are shown in Figures from Figure 13 to Figure 17. The number of distinct items demonstrated that FIT is consistently faster than Apriori and Eclat. 28 24 2 16 12 8 4 Eclat FIT Apriori.25%.5% 1% 1.5% 2% Figure 16 T2I2D1K was set to 1,. The number of patterns was set to 2,. As experiments in [2] gave the performance comparisons between Eclat Apriori, not all the Figures show the run time results of Apriori. Readers can refer to [2] for the performance comparisons between Eclat and Apriori. The results further 5. Conclusions In this paper, a simple but fast algorithm, FIT, was presented. FIT efficiently addressed the problem of computing all the frequent itemsets in large databases. The simple method and FIT were designed and implemented before we noticed Eclat. Although Eclat, the simple method, and FIT all adopted the so-called tid-list idea, the simple method and FIT had much better computation performances that had been proved experimentally. Theoretical analyses of the simple method and FIT could be found in [13], which also proved the efficiency of the simple method and FIT. The simple method calculated the frequent itemsets by the aide of a hash table. The hash table was the key data structure that made it possible for the design of FIT. FIT used the idea of the divide-and-conquer strategy. In all experiments, FIT was consistently the fastest among all the algorithms that were tested. Reference [1] H. M. Mahmoud, Sorting A distribution theory, John Wiley & Sons, Inc. 2. [2] J. Hipp, U. Guntezr, and G. Nakhaeizadeh, "Algorithms for Association Rule Mining A General Survey and Comparison", Proc. of the ACM SIGKDD, July 2. 9

[3] J. Han and Y. Fu, "Discovery of Multiple-Level Association Rules from Large Databases", IEEE Transactions on Knowledge and Data Engineering, 11(5), 1999.
[4] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation", Proc. of the ACM SIGMOD Intl. Conference on Management of Data, 2000.
[5] J. S. Park, M.-S. Chen, and P. S. Yu, "An Effective Hash-Based Algorithm for Mining Association Rules", Proc. of the 1995 ACM SIGMOD International Conference on Management of Data.
[6] K. Wang, Y. He, and J. Han, "Mining Frequent Itemsets Using Support Constraints", Proc. of the 2000 Intl. Conference on Very Large Data Bases (VLDB), Cairo, Egypt, Sept. 2000.
[7] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, "New Algorithms for Fast Discovery of Association Rules", Proc. of the 3rd Intl. Conference on Knowledge Discovery and Data Mining (KDD'97), Newport Beach, California, August 1997.
[8] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules", Proc. of the 20th Intl. Conference on Very Large Databases, Santiago, Chile, Sept. 1994.
[9] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases", Proc. of the ACM SIGMOD 1993 Intl. Conference on Management of Data, Washington D.C., May 1993, 207-216.
[10] R. Srikant and R. Agrawal, "Mining Quantitative Association Rules in Large Relational Tables", Proc. of the ACM SIGMOD 1996 Conference on Management of Data, Montreal, Canada, June 1996.
[11] R. Srikant and R. Agrawal, "Mining Generalized Association Rules", Proc. of the 21st Intl. Conference on Very Large Databases, Zurich, Switzerland, Sept. 1995.
[12] R. Srikant, Q. Vu, and R. Agrawal, "Mining Association Rules with Item Constraints", Proc. of the 3rd Intl. Conference on Knowledge Discovery in Databases and Data Mining, Newport Beach, California, August 1997.
[13] J. Luo and S. Rajasekaran, "A Framework for Finding Frequent Itemsets in Large Databases" (submitted for publication).