Vol.15 No.6 J. Comput. Sci. & Technol. Nov

A Fast Algorithm for Mining Association Rules

HUANG Liusheng, CHEN Huaping, WANG Xun and CHEN Guoliang

National High Performance Computing Center, Department of Computer Science
University of Science & Technology of China, Hefei, P.R. China
{lshuang,hpchen,xwang,glchen}@ustc.edu.cn

Received March 7, 2000; revised May 22.

Abstract

In this paper, the problem of discovering association rules between items in a large database of sales transactions is discussed, and a novel algorithm, BitMatrix, is proposed. The proposed algorithm is fundamentally different from the known algorithms Apriori and AprioriTid. Empirical evaluation shows that the algorithm outperforms the known ones for large databases. Scale-up experiments show that the algorithm scales linearly with the number of transactions.

Keywords: database, data mining, large itemset, association rule, minimum support, minimum confidence

1 Introduction

Data mining is motivated by the decision support problem faced by most large retail organizations. Progress in bar-code technology has made it possible for retail organizations to collect and store massive sales data, referred to as the basket data. A record in such data typically consists of the transaction time and the items bought in the transaction [1, 2].

The problem of mining association rules over basket data was introduced in [2]. An example of such a rule might be that 80% of customers who purchase tires and auto accessories also get automobile services done. Finding all such rules is valuable for cross-marketing and attached mailing applications. Other applications include catalog design, add-on sales, store layout, and customer segmentation based on buying patterns. The databases involved in these applications are very large. It is imperative, therefore, to develop fast algorithms for this task.

The following is a formal statement of the problem [2].
Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of literals, called items. Let $D$ be a set of transactions, where each transaction $T$ is a set of items such that $T \subseteq I$. Associated with each transaction is a unique identifier, called its TID. We say that a transaction $T$ contains $X$, a set of some items in $I$, if $X \subseteq T$. An association rule is an implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$, and $X \cap Y = \emptyset$. The rule $X \Rightarrow Y$ holds in the transaction set $D$ with confidence $c$ if $c$% of the transactions in $D$ that contain $X$ also contain $Y$. The rule $X \Rightarrow Y$ has support $s$ in the transaction set $D$ if $s$% of the transactions in $D$ contain $X \cup Y$.

Given a set of transactions $D$, the problem of mining association rules is to generate all association rules that have support and confidence greater than or equal to the user-specified minimum support (called minsup) and minimum confidence (called minconf), respectively. This problem can be decomposed into two subproblems [3, 4].

1) Find all sets of items (itemsets) whose transaction support is at least the minimum support. The support for an itemset is defined as the fraction of total transactions that contain this itemset. Itemsets that have minimum support (minsup) are called large itemsets, and all the others small itemsets.

2) Use the large itemsets to generate the desired rules. Here is a straightforward algorithm. For every large itemset $l$, find all non-empty subsets of $l$. For every such subset $a$, output a rule of the form $a \Rightarrow (l - a)$ if $\mathrm{support}(l) / \mathrm{support}(a) \ge \mathit{minconf}$. We need to consider all subsets of $l$ to generate rules with multiple consequents.

It is easy to see that the first subproblem is the key to discovering all association rules. In this paper, therefore, we do not discuss the second subproblem further; readers may refer to [4] for a rule generation algorithm.

Agrawal and Srikant presented two well-known algorithms, Apriori and AprioriTid, for finding all large itemsets [3, 4]. They also showed that these two algorithms are much better than earlier algorithms such as AIS [2] and SETM [5]. In this paper, we present a new algorithm, BitMatrix, that is fundamentally different from previous algorithms. Our experiments show that the proposed algorithm outperforms Apriori and AprioriTid for all problem sizes tested and that it has excellent scale-up properties, which makes mining association rules over very large databases faster and more practical.

The rest of this paper is organized as follows. In Section 2, we give the new algorithm, BitMatrix, for finding all large itemsets. In Section 3, we compare the performance of BitMatrix with that of the Apriori and AprioriTid algorithms, and we demonstrate the scale-up properties of our algorithm. Finally, Section 4 concludes the paper and points out some related open problems.

This work was supported in part by the National '863' High-Tech Programme of China (No ZD06-2).

2 Discovering Large Itemsets

Most of the existing algorithms for discovering all large itemsets make multiple passes over the data. In the first pass, they count the support of individual items and determine which of them are large.
In each subsequent pass, candidate itemsets, also called potentially large itemsets, are generated from previously generated large itemsets or from the database. The support counts for these candidate itemsets are then found during the pass over the data. At the end of the pass, the large itemsets are determined by their counts. This process continues until no new large itemsets are found [2, 4, 5].

In both the Apriori and AprioriTid algorithms, all the candidate itemsets of the same length must be stored in memory, which wastes space. To generate large itemsets, Apriori passes over the database as many times as the length of the longest large itemset. That is, each time new candidate itemsets are generated, the database is scanned and the support of each candidate is counted, which wastes time for large databases. Although AprioriTid does not use the database for counting support after the first pass, its performance is worse than Apriori in almost all cases because each entry that is used for counting support may be larger than the corresponding transaction in the database in the initial pass.

In our BitMatrix algorithm, which is fundamentally different from Apriori and AprioriTid, we need not store all the candidate itemsets in memory, and we pass over the database only once. We start with the previously generated large itemsets, and once a candidate itemset is generated, the information stored in the bitmatrix is used to determine whether it is a large itemset. After all candidates are generated, the large itemsets in that pass have been discovered and the pass is over. The new algorithm therefore has the useful property that the database is not used at all after the initialization of the bitmatrix. Instead, the encoding of the database is used to judge whether a candidate is a large itemset. In the later passes, the size of this encoding can become much smaller than the database, thus saving much reading effort.
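The contrast drawn above, between rescanning the database for every pass and consulting a one-time encoding of it, can be illustrated with a minimal Python sketch. It assumes that each item keeps an integer bit mask whose bit j is set when transaction j contains the item. The toy transactions and the function names (`support_by_scan`, `encode`, `support_by_index`) are hypothetical and are not taken from the paper.

```python
from typing import Dict, FrozenSet, List

# Hypothetical toy data: four transactions over items 1..5.
TRANSACTIONS: List[FrozenSet[int]] = [
    frozenset({1, 2, 4}),
    frozenset({1, 2, 3}),
    frozenset({1, 2, 3, 5}),
    frozenset({2, 3, 5}),
]


def support_by_scan(transactions: List[FrozenSet[int]], itemset: FrozenSet[int]) -> int:
    """Scan-based counting: one full pass over the transactions per call."""
    return sum(1 for t in transactions if itemset <= t)


def encode(transactions: List[FrozenSet[int]]) -> Dict[int, int]:
    """Single pass over the data: map each item to a bit mask of its transactions."""
    masks: Dict[int, int] = {}
    for j, t in enumerate(transactions):
        for item in t:
            masks[item] = masks.get(item, 0) | (1 << j)
    return masks


def support_by_index(masks: Dict[int, int], itemset: FrozenSet[int], n_transactions: int) -> int:
    """Index-based counting: AND the item masks and count the 1 bits."""
    index = (1 << n_transactions) - 1  # start with every transaction selected
    for item in itemset:
        index &= masks.get(item, 0)
    return bin(index).count("1")
```

Both functions return the same support count for any itemset; only the second avoids touching the transactions again once `encode` has run.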

2.1 Algorithm BitMatrix

In the Apriori and AprioriTid algorithms, it is assumed that the items in each transaction are kept sorted in lexicographic order [4]. This is not needed in BitMatrix. By careful programming, we can keep the items within each large itemset, and the large itemsets of the same size, sorted in lexicographic order even if the items in the transactions are not sorted.

We call the number of items in an itemset its size, and call an itemset of size $k$ a $k$-itemset. The set of all large $k$-itemsets is denoted $L_k$. Each $k$-itemset $c$ in $L_k$ consists of items $c[1], c[2], \ldots, c[k]$, where $c[1] < c[2] < \cdots < c[k]$. Associated with each itemset are two fields: a count field that stores the support for this itemset, and an index field (henceforth referred to as the support index) that indicates the transactions that contain the itemset.

The BitMatrix algorithm is described as:

(1) Initialize the bitmatrix;
(2) $L_1 = \{\text{large 1-itemsets}\}$;
(3) for ($k = 2$; $L_{k-1} \neq \emptyset$; $k$++) do
(4)     $L_k = \mathrm{GenLargeItemsets}(L_{k-1})$;
(5) Answer $= \bigcup_k L_k$.

In Step (1) of this algorithm, we initialize the bitmatrix as follows. First we build a matrix with one row per item and one column per transaction. Note that the matrix is a bit-matrix: every position of the matrix occupies only one bit in memory. Then we go through the database. If the $j$-th transaction contains items $i_1, i_2, \ldots, i_k$, the bits $a_{i_1 j}, a_{i_2 j}, \ldots, a_{i_k j}$ are set to 1 and the other bits in the $j$-th column are set to 0, where $a_{ij}$ denotes the bit in the $i$-th row and $j$-th column. As an example, Fig.1 shows a database and the corresponding bitmatrix after initialization.

Fig.1. An example: a database (columns TID, Items) and the corresponding bitmatrix (rows: items; columns: transactions).

In Step (2), we simply count the number of 1s in each row to get the support count of every item, and the large 1-itemsets are determined.
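Steps (1) and (2) can be sketched in Python as follows, storing each row of the bitmatrix as one integer. This is only an illustration under stated assumptions: items are assumed to be numbered 1 to `n_items`, the function names are hypothetical, and the four-transaction database used in the usage note is invented for the sketch rather than copied from Fig.1.

```python
from typing import Dict, List, Set, Tuple


def init_bitmatrix(transactions: List[Set[int]], n_items: int) -> List[int]:
    """Step (1): build one bit row per item in a single pass over the database.

    Row i-1 belongs to item i; bit j of that row is 1 iff the j-th
    transaction (0-based) contains item i.
    """
    rows = [0] * n_items
    for j, items in enumerate(transactions):
        for item in items:
            rows[item - 1] |= 1 << j
    return rows


def row_to_string(row: int, n_transactions: int) -> str:
    """Render a row with the first transaction leftmost."""
    return "".join("1" if (row >> j) & 1 else "0" for j in range(n_transactions))


def large_1_itemsets(rows: List[int], minsup_count: int) -> Dict[Tuple[int, ...], int]:
    """Step (2): count the 1s in each row and keep the items that reach minsup.

    Returns a map from each large 1-itemset to its support index (its row).
    """
    return {
        (i + 1,): row
        for i, row in enumerate(rows)
        if bin(row).count("1") >= minsup_count
    }
```

For the hypothetical database `[{1, 2, 4}, {1, 2, 3}, {1, 2, 3, 5}, {2, 3, 5}]` with five items, `row_to_string` gives the rows 1110, 1111, 0111, 1000 and 0011, and with a minimum support of 2 transactions the items 1, 2, 3 and 5 are large. Packing a row into a single integer is a design choice of this sketch; it lets the later AND and bit-count operations work on whole rows at once.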
In Step (4), the previously generated large $(k-1)$-itemsets are used to generate the large $k$-itemsets. This step repeats until no new large itemsets are generated. The GenLargeItemsets function is used here, which takes $L_{k-1}$ as its argument and returns $L_k$. The function works as follows.

(1) for all $p, q \in L_{k-1}$ do
(2)   if $(p[1] = q[1]) \wedge \cdots \wedge (p[k-2] = q[k-2]) \wedge (p[k-1] < q[k-1])$ then {
(3)     $c = p \cup q$; // c consists of p[1], p[2], ..., p[k-2], p[k-1], q[k-1]
(4)     for all $(k-1)$-subsets $s$ of $c$ do
(5)       if ($s \notin L_{k-1}$) then {delete $c$; $c = \emptyset$; break;}
(6)     if ($c \neq \emptyset$) then {
(7)       c.index = p.index & q.index; // support index
(8)       compute c.count from c.index; // support count
(9)       if (c.count $\ge$ minsup) then $L_k = L_k \cup \{c\}$;
(10)    } // endif
(11)  } // endif

Steps (1) to (5) generate $C_k$, the set of candidate $k$-itemsets (potentially large itemsets; see also [4]). In Step (2), the condition $p[k-1] < q[k-1]$ ensures that no duplicates are generated. However, this algorithm differs from Apriori in that it need not store all the candidates in memory. Once a candidate itemset is generated, Steps (7) to (9) determine whether it is a large one.

To decide whether a candidate itemset is large, we associate each large itemset with a support index, a bit vector in which each bit indicates whether the corresponding transaction in the database contains the itemset. The support index of a 1-itemset is a row of the bitmatrix. For example, in Fig.1, the support index of item 3 (i.e., the 1-itemset {3}) is the 3rd row of the bitmatrix. It indicates that the 2nd, 3rd and 4th transactions contain this itemset and the first transaction does not. Since $c$ is the union of $p$ and $q$, we generate $c$'s support index in Step (7) simply by applying the bitwise AND operator ("&") to the corresponding bits of $p$'s and $q$'s indexes.

2.2 The Correctness

We need to show that the GenLargeItemsets function generates exactly $L_k$. Agrawal et al. have proved that $C_k \supseteq L_k$ [4], and $C_k$ is the collection of all itemsets $c$ that survive Step (5). Therefore, we only need to demonstrate that no large itemsets are omitted in Steps (7) to (9). In fact, we only need to prove that the equation in Step (7) is correct.

Note that the $j$-th bit in the support index associated with an itemset is 1 if and only if the $j$-th transaction contains the itemset. If both the $j$-th bit of $p$'s index and that of $q$'s index are 1, then the $j$-th transaction contains both $p$ and $q$. Therefore, it also contains $p \cup q$ (i.e., $c$). By similar reasoning, if either the $j$-th bit of $p$'s index or that of $q$'s index is 0, the transaction does not contain $p$ or does not contain $q$. So it does not contain $c$, and the $j$-th bit of $c$'s index is 0. Thus, we can conclude that the equation in Step (7) is correct.
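The GenLargeItemsets function and the outer loop of the algorithm can be sketched in Python as below. This is a minimal illustration rather than the authors' implementation: itemsets are assumed to be sorted tuples, each support index is assumed to be an integer whose bit j stands for transaction j, and `gen_large_itemsets` and `bitmatrix_mine` are hypothetical names. The step numbers in the comments refer to the pseudocode in Section 2.1.

```python
from itertools import combinations
from typing import Dict, Tuple

Itemset = Tuple[int, ...]


def gen_large_itemsets(prev: Dict[Itemset, int], minsup_count: int) -> Dict[Itemset, int]:
    """Build L_k from L_(k-1).

    `prev` maps each large (k-1)-itemset to its support index.
    Candidates are tested as soon as they are formed, so no separate
    candidate set is kept.
    """
    result: Dict[Itemset, int] = {}
    keys = sorted(prev)  # lexicographic order
    for a, p in enumerate(keys):
        for q in keys[a + 1:]:
            # Step (2): p and q must share their first k-2 items.
            # In sorted order, itemsets sharing p's prefix directly follow p.
            if p[:-1] != q[:-1]:
                break
            c = p + (q[-1],)  # Step (3): p[k-1] < q[k-1] holds because q follows p
            # Steps (4)-(5): drop c if any (k-1)-subset is not large.
            if any(s not in prev for s in combinations(c, len(c) - 1)):
                continue
            index = prev[p] & prev[q]        # Step (7): support index
            count = bin(index).count("1")    # Step (8): support count
            if count >= minsup_count:        # Step (9)
                result[c] = index
    return result


def bitmatrix_mine(large_1: Dict[Itemset, int], minsup_count: int) -> Dict[Itemset, int]:
    """Outer loop: extend large itemsets level by level until a level is empty."""
    answer = dict(large_1)
    level = large_1
    while level:
        level = gen_large_itemsets(level, minsup_count)
        answer.update(level)
    return answer
```

As a usage example, take the hypothetical large 1-itemsets `{(1,): 0b0111, (2,): 0b1111, (3,): 0b1110, (5,): 0b1100}` (lowest bit is the first transaction) with a minimum support of 2 transactions. The first call yields the five 2-itemsets {1 2}, {1 3}, {2 3}, {2 5}, {3 5} and rejects {1 5}; the second yields {1 2 3} and {2 3 5}; the third returns an empty set, which ends the loop.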
2.3 An Example

We still use the database in Fig.1 and assume that minsup is 2 transactions. From the bitmatrix in Fig.1, we can see that {1}, {2}, {3}, {5} are large 1-itemsets. Using the GenLargeItemsets function, we get 6 candidate 2-itemsets: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}, and the support indexes associated with them: 1110, 0110, 0010, 0111, 0011, [...]. All except {1 5} are large itemsets. Using the GenLargeItemsets function again, we get 2 candidate 3-itemsets, {1 2 3} and {2 3 5}, and the associated support indexes: 0110, [...]. They are both large 3-itemsets because the number of 1s (i.e., the support count) in their support indexes is not less than minsup. There are no large itemsets of greater size, because $L_4$ turns out to be empty when we generate it from $L_3$.

2.4 Buffer Management

The bitmatrix and the support indexes of the large itemsets both take much space in memory. However, we need not keep all of them in memory. When $L_k$ is generated, only the indexes of the large $(k-1)$-itemsets are needed. We can release the space used for the indexes of large itemsets shorter than $k$ after we have generated all the large $k$-itemsets. Furthermore, when we use $p$ and $q$ to generate $c$ in the GenLargeItemsets function, the indexes of the large $(k-1)$-itemsets that precede $p$ are no longer used, and we can release them, too.

3 Performance

To evaluate the relative performance of the algorithms for discovering large itemsets, we carried out several experiments on a GODEYE workstation with a CPU clock rate of 133MHz and 64MB of main memory, running AIX 4.1.

3.1 Generation of Synthetic Data

To evaluate the performance of the algorithms over a large range of data characteristics, we generated synthetic transaction data using the method proposed in [4]. These transactions simulate the transactions in a retailing environment. Our synthetic data generation program takes the parameters shown in Table 1.

We generated datasets by setting $N = 1000$, $|L| = 2000$ and $|D| = 100{,}000$. We chose 3 values for $|T|$: 5, 10, and 20, and 3 values for $|I|$: 2, 4, and 6. Table 2 summarizes the dataset parameter settings. For the same $|T|$ and $|D|$ values, the sizes of the datasets in megabytes were roughly equal for different values of $|I|$.

Table 1. Parameters
  $|D|$   Number of transactions
  $|T|$   Average size of transactions
  $|I|$   Average size of potentially large itemsets
  $|L|$   Number of potentially large itemsets
  $N$     Number of items

Table 2. Parameter Settings
  Database      $|T|$   $|I|$   $|D|$   Size (MB)
  T5I2D100k     5       2       100k    3.2
  T10I2D100k    10      2       100k    5.2
  T10I4D100k    10      4       100k    5.2
  T20I2D100k    20      2       100k    9.2
  T20I4D100k    20      4       100k    9.2
  T20I6D100k    20      6       100k    [...]

3.2 Relative Performance

Fig.2 shows the execution times for the six synthetic datasets given in Table 2 for decreasing values of minimum support. As the minimum support decreases, the execution times of all the algorithms increase because of the increase in the total number of candidates and large itemsets. The figure shows that BitMatrix outperforms Apriori and AprioriTid for all problem sizes. Table 3 gives the execution times of the algorithms for a minimum support of 0.75%.

Fig.2. Execution times for synthetic data.

Table 3. Execution Times for minsup = 0.75% (s). Columns: Database, AprioriTid, Apriori, BitMatrix. Rows: T5I2D100k, T10I2D100k, T10I4D100k, T20I2D100k, T20I4D100k, T20I6D100k.

3.3 Scale-up Experiment

Fig.3 shows how the BitMatrix algorithm scales up as the number of transactions increases from 10,000 to 100,000. The combinations of average transaction size and average itemset size are T10I4 and T20I6, and all other parameters are the same as those in Table 2. The minimum support level is set to 0.75%. In this figure, the execution time is normalized with respect to the time for the 10,000-transaction datasets. As shown, the execution time scales approximately linearly.

Fig.3. Scale-up property of BitMatrix.

4 Conclusions and Future Work

In this paper, a new algorithm, BitMatrix, is presented for discovering all significant association rules between items in a large database of transactions, and it is compared with the previously known Apriori and AprioriTid algorithms. The BitMatrix algorithm has the useful feature that it need not pass over the original dataset in every pass. Furthermore, it need not keep a large set of candidate itemsets in memory. Instead, once a candidate is generated, it is immediately tested to see whether it is large, and it is discarded if not. Therefore, our algorithm saves much time and space compared with Apriori and AprioriTid over very large databases.

This paper has also presented experimental results showing that the proposed algorithm outperforms Apriori and AprioriTid for all problem sizes tested. The scale-up experiment shows that BitMatrix scales approximately linearly with the number of transactions.

In the future, we plan to extend this work along the following directions:

• A more compact data structure may be found to take the place of the bitmatrix. This structure may reduce the memory requirement and improve the performance.
• The problem of a very large $C_2$ will be considered. A specific algorithm has to be worked out for generating $C_2$. Our idea is that we need not find all the large 2-itemsets, but most of them. The approach will be worthwhile if the performance improves greatly.
• Multiple taxonomies (is-a hierarchies) over items are often available. An example of such a hierarchy is that a dishwasher is a kitchen utensil as well as an electric appliance. It will be valuable to find association rules where such hierarchies are used.
• The quantities of the items bought in a transaction have not been considered, but they are useful for some applications. Finding such rules needs further research.

References

[1] Agrawal R, Srikant R. Mining sequential patterns. IBM Research Report.
[2] Agrawal R, Imielinski T, Swami A. Mining association rules between sets in large databases. In Proc. the ACM SIGMOD Conf. Management of Data, May 1993.
[3] Agrawal R, Srikant R. Fast algorithm for mining association rules. IBM Research Report.
[4] Agrawal R, Mannila H, Toivonen H et al. Fast Discovery of Association Rules. In Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
[5] Houtsma M, Swami A. Set-oriented mining of association rules. IBM Research Report, Oct.
