Maintenance of Generalized Association Rules with Multiple Minimum Supports


Maintenance of Generalized Association Rules with Multiple Minimum Supports

Ming-Cheng Tseng and Wen-Yang Lin (corresponding author)
Institute of Information Engineering, I-Shou University, Kaohsiung 84, Taiwan
Dept. of Information Management, I-Shou University, Kaohsiung 84, Taiwan

Abstract

Mining generalized association rules among items in the presence of a taxonomy has been recognized as an important model in data mining. Earlier work on generalized association rules confined the minimum supports to be uniformly specified for all items, or for items within the same taxonomy level. This constraint can prevent an expert from discovering association rules that are more interesting but much less supported. In our previous work, we addressed this problem and proposed two algorithms, MMS_Cumulate and MMS_Stratify. In this paper, we examine the problem of maintaining the discovered multi-supported, generalized association rules when new transactions are added into the original database. We propose two algorithms, _Cumulate and _Stratify, which can incrementally update the discovered generalized association rules under non-uniform support specification and are capable of

effectively reducing the number of candidate sets and the amount of database re-scanning. Empirical evaluation showed that _Cumulate and _Stratify are 2-6 times faster than running MMS_Cumulate or MMS_Stratify on the updated database afresh.

Keywords: Generalized association rules, maintenance, multiple minimum supports, sorted closure, taxonomy.

1. Introduction

Mining association rules from a large database of business data, such as transaction records, has been a popular topic within the area of data mining [,,, 3]. This problem was originally motivated by applications known as market basket analysis, which aim to find relationships among items purchased by customers, that is, the kinds of products that tend to be purchased together. For example, an association rule, Desktop → Ink-jet (Support=3%, Confidence=60%), says that 3% (support) of customers purchase both a Desktop PC and an Ink-jet printer together, and 60% (confidence) of customers who purchase a Desktop PC also purchase an Ink-jet printer. Such information is useful in many aspects of market management, such as store layout planning, target marketing, and understanding the behavior of customers. In many applications, there are taxonomies (hierarchies), explicit or implicit, over the items. In some applications, it may be more useful to find associations at different levels of the taxonomy than only at the primitive concept level [7, 4]. For example, consider Figure 1; the taxonomy of items from which the previous association rule is derived can be represented as

Figure 1. An example of taxonomy T (Printer, PC, and Scanner at the top level; Non-impact and Dot-matrix under Printer; Desktop and Notebook under PC; Laser and Ink-jet under Non-impact).

It may well happen that the association rule Desktop → Ink-jet (Support=3%, Confidence=60%) does not hold when the minimum support is set to 4%, while the association rule PC → Printer remains valid. Besides, note that in reality the frequencies of items are not uniform. Some items occur very frequently in the transactions while others rarely appear. In this case, a uniform minimum support assumption would hinder the discovery of deviations or exceptions that are more interesting but much less supported than the general trends. To meet such situations, we have investigated the problem of mining generalized association rules across different levels of a taxonomy with non-uniform minimum supports [6]. We proposed two efficient algorithms, MMS_Cumulate and MMS_Stratify, which not only can discover associations that span different hierarchy levels but also have a high potential to produce rare but informative item rules. The proposed approaches, however, are not effective in the situation where the large source database is updated frequently. In that case, adopting the mining approach amounts to re-applying the whole mining process to the updated database to correctly reflect the most recent associations among items. This is not cost-effective and is unacceptable in general. To be more realistic and cost-effective, it is better to run the association mining algorithms once to generate the initial association rules, and then, upon updating the source database, apply an incremental maintenance method to re-build the discovered rules. The major challenge here is to deploy an efficient maintenance algorithm to facilitate the whole mining process. This problem is nontrivial because updates may invalidate some of the discovered association rules, turn previously weak rules into strong ones, and introduce new, undiscovered rules. In this paper, we address the issues in developing efficient maintenance methods and propose two algorithms, _Cumulate and _Stratify. Our algorithms can incrementally update the generalized association rules under non-uniform support specification and are capable of effectively reducing the number of candidate sets and the amount of database re-scanning. The performance study showed that _Cumulate and _Stratify are 2-6 times faster than running MMS_Cumulate or MMS_Stratify on the updated database afresh.

The remainder of the paper is organized as follows. The problem of maintaining generalized association rules with multiple minimum supports is formalized in Section 2. In Section 3, we explain the proposed algorithms for updating frequent itemsets with multiple minimum supports. The evaluation of the proposed algorithms on IBM synthetic data is described in Section 4. A review of related work is given in Section 5. Finally, our conclusions are stated in Section 6.

2. Problem statement

2.1 Mining multi-supported, generalized association rules

Let I = {i1, i2, …, im} be a set of items and DB = {t1, t2, …, tn} be a set of transactions, where each transaction ti = ⟨tid, A⟩ has a unique identifier tid and a set of items A (A ⊆ I). We assume that the taxonomy of items, T, is available and is denoted as a directed acyclic graph on I ∪ J, where J = {j1, j2, …, jp} represents the set of generalized items derived from I. An edge in T denotes an is-a relationship. That is, if there is an edge from j to i, we call j a parent (generalization) of i and i a child of j. For example, in Figure 1, I = {Laser printer, Ink-jet printer, Dot matrix printer, Desktop PC, Notebook, Scanner} and J = {Non-impact printer, Printer, Personal computer}.
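The taxonomy T can be modeled as a simple parent map from which each item's generalizations are computed by walking the is-a edges upward. A minimal sketch, using the item names of Figure 1; the `PARENTS`, `ancestors`, and `extend_transaction` names are our own illustrations, not identifiers from the paper:

```python
# Hypothetical sketch: the taxonomy as a child -> parents map (is-a edges).
PARENTS = {
    "Laser": ["Non-impact"],
    "Ink-jet": ["Non-impact"],
    "Non-impact": ["Printer"],
    "Dot-matrix": ["Printer"],
    "Desktop": ["PC"],
    "Notebook": ["PC"],
}

def ancestors(item):
    """Return the set of all generalizations of `item` in the taxonomy."""
    result = set()
    stack = list(PARENTS.get(item, []))
    while stack:
        p = stack.pop()
        if p not in result:
            result.add(p)
            stack.extend(PARENTS.get(p, []))
    return result

def extend_transaction(items):
    """Add every ancestor of every purchased item, removing duplicates."""
    extended = set(items)
    for a in items:
        extended |= ancestors(a)
    return extended
```

With this map, for instance, a transaction {Desktop, Laser} is extended to include PC, Non-impact, and Printer, which is how occurrences of generalized itemsets are counted later.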

Definition 1. Given a transaction t = ⟨tid, A⟩, we say an itemset B is in t if every item in B is in A or is an ancestor of some item in A. An itemset B has support s, denoted as s = sup(B), in the transaction set DB if s% of the transactions in DB contain B.

Definition 2. Given a set of transactions DB and a taxonomy T, a generalized association rule is an implication of the form A → B, where A, B ⊂ I ∪ J, A ∩ B = ∅, and no item in B is an ancestor of any item in A. The support of this rule, sup(A → B), is equal to the support of A ∪ B. The confidence of the rule, conf(A → B), is the ratio of sup(A ∪ B) to sup(A).

The condition in Definition 2 that no item in B is an ancestor of any item in A is essential; otherwise, a rule of the form a → ancestor(a) always has 100% confidence and is trivial.

Definition 3. Let ms(a) denote the minimum support of an item a in I ∪ J. An itemset A = {a1, a2, …, ak}, where ai ∈ I ∪ J, is frequent if the support of A is no less than the lowest minimum support among the items in A, i.e., sup(A) ≥ min_{ai ∈ A} ms(ai).

Definition 4. A multi-supported, generalized association rule A → B is strong if sup(A ∪ B) ≥ min_{ai ∈ A∪B} ms(ai) and conf(A → B) ≥ minconf.

Definition 5. Given a set of transactions DB, a taxonomy T, the user-specified minimum supports for all items in T, {ms(a1), ms(a2), …, ms(an)}, and minconf, the problem of mining multi-supported, generalized association rules is to find all association rules that are strong.

Example 1. Suppose that the shopping transaction database DB in Table 1 consists of the items I = {Laser printer, Ink-jet printer, Dot matrix printer, Desktop PC, Notebook, Scanner} and the taxonomy T shown in Figure 1. Let the minimum confidence (minconf) be 60% and the minimum support (ms) associated with each item in the taxonomy be as follows:

ms(Printer) = 80%    ms(Non-impact) = 65%   ms(Dot matrix) = 70%
ms(Laser) = 25%      ms(Ink-jet) = 60%      ms(Scanner) = 15%
ms(PC) = 35%         ms(Desktop) = 25%      ms(Notebook) = 25%

The following generalized association rule, PC, Laser → Dot matrix (sup = 16.7%, conf = 50%), fails because its support is less than min{ms(PC), ms(Laser), ms(Dot matrix)} = 25%. But another rule, PC → Laser (sup = 33.3%, conf = 66.7%), holds because both its support and its confidence are larger than min{ms(PC), ms(Laser)} = 25% and minconf, respectively. Table 2 lists the frequent itemsets and the resulting strong rules for this example.

Table 1. A transaction database (DB).

TID  Items Purchased
1    Notebook, Laser printer
2    Scanner, Dot matrix printer
3    Dot matrix printer, Ink-jet printer
4    Notebook, Dot matrix printer, Laser printer
5    Scanner
6    Desktop computer

Table 2. Frequent itemsets and association rules generated for Example 1.

Itemsets                min ms (%)  Support (%)
{Scanner}               15          33.3
{PC}                    35          50.0
{Notebook}              25          33.3
{Laser}                 25          33.3
{Scanner, Printer}      15          16.7
{Scanner, Dot matrix}   15          16.7
{Laser, PC}             25          33.3
{Notebook, Printer}     25          33.3
{Notebook, Non-impact}  25          33.3
{Notebook, Laser}       25          33.3

Rules

PC → Laser (sup = 33.3%, conf = 66.7%)
Laser → PC (sup = 33.3%, conf = 100%)
Notebook → Printer (sup = 33.3%, conf = 100%)
Notebook → Non-impact (sup = 33.3%, conf = 100%)
Notebook → Laser (sup = 33.3%, conf = 100%)

As shown in [], the task of mining association rules is usually decomposed into two steps:

1. Itemset generation: find all frequent itemsets, i.e., those whose support exceeds a threshold minimum support.
2. Rule construction: from the set of frequent itemsets, construct all association rules that have a confidence exceeding a threshold minimum confidence.

Since the solution to the second subproblem is straightforward, the problem can be reduced to finding the set of frequent itemsets that satisfy the specified minimum supports.

2.2 Maintaining multi-supported, generalized association rules

In real business applications, the database grows over time. This implies that if the updated database is processed afresh, some previously discovered associations might become invalid and some undiscovered associations should be generated. That is, the discovered association rules must be updated to reflect the new circumstances. Analogous to mining associations, this problem can be reduced to updating the frequent itemsets.

Definition 6. Let DB denote the original database, db the incremental database, UD the updated database containing both db and DB, i.e., UD = DB + db, T the taxonomy of items, and L the set of frequent itemsets in DB. The problem of updating the frequent itemsets with taxonomy and multiple supports is to find L′ = {A | sup_UD(A) ≥ min_{ai ∈ A} ms(ai)},

given the knowledge of DB, T, db, L, and sup_DB(A) for each A ∈ L.

Example 2. Consider Example 1 again, and suppose that the incremental transaction database db is shown in Table 3. Table 4 lists the frequent itemsets and the resulting strong rules in the updated database. Comparing Table 4 to Table 2, we observe that two old frequent 2-itemsets in Table 2, {Scanner, Printer} and {Scanner, Dot matrix}, are discarded, while three new frequent itemsets, {Desktop}, {PC, Printer}, and {PC, Non-impact}, are added to Table 4, and two new rules, PC → Printer and PC → Non-impact, are found.

Table 3. An incremental database (db).

TID  Items Purchased
7    Desktop computer, Laser printer
8    Laser printer

3. Methods for updating frequent itemsets for multi-supported, generalized association rules

3.1 Preliminary

As stated in the pioneering work [4], the primary challenge in devising an effective association rule maintenance algorithm is how to reuse the original frequent itemsets and avoid re-scanning the original database.

Table 4. Frequent itemsets and association rules generated for Example 2.

Itemsets                min ms (%)  Support (%)
{Scanner}               15          25.0
{Desktop}               25          25.0
{Notebook}              25          25.0
{PC}                    35          50.0
{Laser}                 25          50.0
{PC, Printer}           35          37.5
{PC, Non-impact}        35          37.5

{Laser, PC}             25          37.5
{Notebook, Printer}     25          25.0
{Notebook, Non-impact}  25          25.0
{Notebook, Laser}       25          25.0

Rules

PC → Printer (sup = 37.5%, conf = 75%)
PC → Non-impact (sup = 37.5%, conf = 75%)
PC → Laser (sup = 37.5%, conf = 75%)
Laser → PC (sup = 37.5%, conf = 75%)
Notebook → Printer (sup = 25%, conf = 100%)
Notebook → Non-impact (sup = 25%, conf = 100%)
Notebook → Laser (sup = 25%, conf = 100%)

Let |DB| denote the number of transaction records in the original database DB, |db| the number of transaction records in the incremental database db, and |UD| the number of transaction records in the whole updated database UD containing both db and DB. For a sorted k-itemset A = ⟨a1, a2, …, ak⟩ with ms(a1) ≤ ms(a2) ≤ … ≤ ms(ak), we define its support counts in db, DB, and UD as A.count_db, A.count_DB, and A.count_UD, respectively. Note that |UD| = |DB| + |db| and A.count_UD = A.count_db + A.count_DB. After scanning the incremental database db, we have A.count_db and can proceed according to the following conditions.

(1) If A is a frequent itemset in both db and DB, i.e., count_db(A) ≥ ms(a1)·|db| and count_DB(A) ≥ ms(a1)·|DB|, then A is a frequent itemset in UD. There is no need for further computation because count_UD(A) ≥ ms(a1)·|UD|.

(2) If A is not a frequent itemset in db but is frequent in DB, then a simple calculation can determine whether A is frequent or not in UD.

(3) If A is a frequent itemset in db but not frequent in DB, then A is an undetermined itemset in UD. Since count_DB(A) is not available, we must re-scan DB to compute count_DB(A) to decide whether A is frequent or not in UD.

(4) If A is neither a frequent itemset in db nor in DB, then A is not frequent in UD. There is no need for further computation.

These four cases are depicted in Figure 2. Note that only case 3 entails re-scanning the original database.

Figure 2. Four cases in incremental frequent itemset maintenance [8] (each of the original database and the incremental database splits itemsets into infrequent and frequent by the minimum support).

Let k-itemset denote an itemset with k items. The basic process of updating generalized association rules with multiple minimum supports is similar to previous work [4, 6] on updating association rules with uniform support, and follows the level-wise approach widely used in most efficient algorithms to generate all frequent k-itemsets. First, count all 1-itemsets in db, then determine from these itemsets whether to re-scan the original database, and finally create the frequent 1-itemsets L1. From the frequent (k−1)-itemsets, generate the candidate k-itemsets Ck and repeat the above procedure until no frequent k-itemsets Lk are created. This paradigm, however, has to be modified to incorporate taxonomy information and multiple minimum supports. First, note that in the presence of a taxonomy an itemset can be composed of items, primitive or generalized, in the taxonomy. To calculate the occurrence of each itemset, the currently scanned transaction t is extended to include the generalized items of all its component items.
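The four-case analysis above can be written as a small decision function. A hypothetical sketch; the names (`update_status`, `ms_min`, `n_db`, `n_DB`) are ours, and cases 1 and 2 are handled by one direct computation since count_DB(A) is known for itemsets frequent in DB:

```python
# Sketch of the four-case analysis, assuming count_db(A) from scanning db
# and count_DB(A) for itemsets that were frequent in DB are available.
def update_status(count_db, n_db, ms_min, count_DB=None, n_DB=0):
    """Classify itemset A after scanning the increment db.

    count_DB is None when A was not frequent in DB (its DB count is
    unknown). Returns 'frequent', 'infrequent', or 'rescan' (case 3:
    DB must be re-scanned to obtain A's count there).
    """
    freq_in_db = count_db >= ms_min * n_db
    if count_DB is not None:                 # cases 1 and 2: DB count known
        total = count_DB + count_db
        return "frequent" if total >= ms_min * (n_DB + n_db) else "infrequent"
    if freq_in_db:                           # case 3: undetermined in UD
        return "rescan"
    return "infrequent"                      # case 4: infrequent everywhere
```

Only the "rescan" outcome forces a pass over the original database, which is what both maintenance algorithms try to minimize.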

Secondly, note that the well-known apriori pruning technique, based on the concept of downward closure, does not work under multiple support specification. For example, consider four items a, b, c, and d with minimum supports ms(a) = 15%, ms(b) = 20%, ms(c) = 4%, and ms(d) = 6%. Clearly, a 2-itemset {a, b} with 10% support is discarded because 10% < min{ms(a), ms(b)}. According to downward closure, the 3-itemsets {a, b, c} and {a, b, d} will then be pruned even though their supports may be larger than ms(c) and ms(d), respectively. To solve this problem, we adopted the sorted closure property [9] in our previous work on mining generalized association rules with multiple minimum supports. Hereafter, to distinguish it from a traditional itemset, a sorted k-itemset is denoted ⟨a1, a2, …, ak⟩.

Lemma 1. If a sorted k-itemset ⟨a1, a2, …, ak⟩, for k ≥ 2 and ms(a1) ≤ ms(a2) ≤ … ≤ ms(ak), is frequent, then all of its sorted subsets with k−1 items are frequent, except the subset ⟨a2, a3, …, ak⟩.

Again, let Lk and Ck represent the set of frequent k-itemsets and candidate k-itemsets, respectively. We assume that any itemset in Lk or Ck is sorted in increasing order of the minimum item supports. The result in Lemma 1 reveals an obstacle to using apriori-gen for generating frequent itemsets.

Lemma 2. For k = 2, the procedure apriori-gen(L1) fails to generate all candidate 2-itemsets in C2.

For example, consider a sorted candidate 2-itemset ⟨a, b⟩. To generate this itemset from L1, both items a and b would have to be included in L1; that is, each would have to occur more frequently than its corresponding minimum support, ms(a) and ms(b). Clearly, the case ms(a) ≤ sup(b) < ms(b) fails to generate ⟨a, b⟩ in C2 even though sup(⟨a, b⟩) ≥ ms(a) is possible.
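Lemma 1 says that only the (k−1)-subsets retaining a1 (the item with the smallest minimum support) are guaranteed frequent. A minimal sketch of which subsets a sorted-closure pruning step may therefore test; the helper name is our own:

```python
# Sketch of the sorted-closure property (Lemma 1): for a sorted itemset
# <a1,...,ak>, every (k-1)-subset except <a2,...,ak> must be frequent,
# i.e. exactly the subsets that keep a1.
def checkable_subsets(sorted_itemset):
    """Yield the (k-1)-subsets that Lemma 1 requires to be frequent."""
    k = len(sorted_itemset)
    for drop in range(1, k):  # never drop a1 (index 0)
        yield tuple(x for i, x in enumerate(sorted_itemset) if i != drop)
```

The excluded subset ⟨a2, …, ak⟩ may be infrequent with respect to its own (larger) minimum support ms(a2) even when the full itemset meets ms(a1), which is exactly why plain apriori pruning is unsound here.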
For the above reason, all items within an itemset are sorted in increasing order of their minimum supports, and a sorted itemset, called the frontier set, F1 = ⟨a_j1, a_j2, …, a_jl⟩, is used to generate the set of candidate 2-itemsets, where a_j1 is the first item (in sorted order) in I ∪ J with sup(a_j1) ≥ ms(a_j1), ms(a_j1) ≤ ms(a_j2) ≤ … ≤ ms(a_jl), and sup(a_ji) ≥ ms(a_j1) for 2 ≤ i ≤ l.
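The frontier-set construction above can be sketched directly: the head is the first item frequent with respect to its own minimum support, and every later item only needs to reach the head's minimum support. The function and dictionary names are illustrative, not from the paper:

```python
# Sketch of frontier-set construction over items sorted by ascending ms.
def frontier_set(items_sorted_by_ms, sup, ms):
    """Return F1: the head plus all later items with sup >= ms(head)."""
    F = []
    head_ms = None
    for a in items_sorted_by_ms:
        if head_ms is None:
            if sup[a] >= ms[a]:      # head: first item frequent w.r.t. own ms
                head_ms = ms[a]
                F.append(a)
        elif sup[a] >= head_ms:      # later items only need ms of the head
            F.append(a)
    return F
```

Items below the head's minimum support are safely discarded because, by downward closure, any 2-itemset containing them cannot reach that support either.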

Example 3. Continuing with Example 1, we change ms(Scanner) from 15% to 20%. The resulting F1 is shown in Table 5. From Table 5, we observe that ms(Scanner) is the smallest among all items, and Scanner can join with any item whose support is greater than or equal to ms(Scanner) = 20% to become a C2 candidate without losing any 2-itemsets. The 2-itemsets ⟨Scanner, Desktop⟩ and ⟨Scanner, Ink-jet⟩ cannot become candidates because sup(Desktop) and sup(Ink-jet) are less than ms(Scanner), and the supports of ⟨Scanner, Desktop⟩ and ⟨Scanner, Ink-jet⟩ cannot be greater than sup(Desktop) and sup(Ink-jet), respectively, according to the downward closure property. Therefore, we keep in F1 only the items whose support is greater than or equal to ms(Scanner), and discard Desktop and Ink-jet. For Ck, k ≥ 3, the candidate generation is unchanged. Please refer to [6] for details on the mining of multi-supported, generalized association rules.

3.2 Algorithm _Cumulate

The basic process of the _Cumulate algorithm proceeds as follows. First, count all 1-itemsets in db, including generalized items. Second, combine the itemset counts in db and DB, and create the frequent 1-itemsets L1 according to the four cases. Next, create the frontier set F1 and use it to generate the candidate 2-itemsets C2. Then, generate the frequent 2-itemsets L2 following the same procedure as for L1. Finally, for k ≥ 3, repeat the above procedure until no frequent k-itemsets Lk are created, except that the candidate k-itemsets Ck are generated from Lk−1. In each iteration of Ck generation, an additional pruning technique, described below, is performed.

Table 5. Generating frontier set F1.

Item        Sorted ms (%)  Support (%)  F1
Scanner     20             33.3         Scanner
Laser       25             33.3         Laser
Desktop     25             16.7
Notebook    25             33.3         Notebook
PC          35             50.0         PC
Ink-jet     60             16.7
Non-impact  65             50.0         Non-impact
Dot-matrix  70             50.0         Dot-matrix
Printer     80             66.7         Printer

Lemma 3. For any sorted k-itemset A = ⟨a1, a2, …, ak⟩, k ≥ 2, A can be pruned if A is not a frequent itemset in DB and there exists a (k−1)-subset of A, say ⟨a_i1, a_i2, …, a_ik−1⟩, such that sup_db(⟨a_i1, a_i2, …, a_ik−1⟩) < ms(a1) in db, and a_i1 = a1 or ms(a_i1) = ms(a1).

Proof. Note that if A is frequent in neither DB nor db, then A is not frequent in UD. Since we know that A is not frequent in DB, all we have to do is make sure that A is not frequent in db. Under the stated condition, sup_db(A) ≤ sup_db(⟨a_i1, …, a_ik−1⟩) < ms(a1) by downward closure, so A is not frequent in db, and hence not frequent in UD.

Example 4. Assume that ms(Scanner) = 3%, ms(Notebook) = 3%, and ms(Ink-jet) = 5%, and that ⟨Scanner, Notebook, Ink-jet⟩ is not frequent in DB. Consider the three 2-subsets of ⟨Scanner, Notebook, Ink-jet⟩, and assume that ⟨Scanner, Notebook⟩, ⟨Scanner, Ink-jet⟩, or ⟨Notebook, Ink-jet⟩ is not frequent in db. Since ⟨Scanner, Notebook, Ink-jet⟩ is infrequent in DB, count_DB(⟨Scanner, Notebook, Ink-jet⟩) < 3%·|DB|. Now, if ⟨Scanner, Notebook⟩ or ⟨Scanner, Ink-jet⟩ is not frequent in db, then sup_db(⟨Scanner, Notebook, Ink-jet⟩) ≤ sup_db(⟨Scanner, Notebook⟩) < ms(Scanner), or sup_db(⟨Scanner, Notebook, Ink-jet⟩) ≤ sup_db(⟨Scanner, Ink-jet⟩) < ms(Scanner), according to the downward closure property, and so count_db(⟨Scanner, Notebook, Ink-jet⟩) < 3%·|db|. Hence, count_UD(⟨Scanner, Notebook, Ink-jet⟩) = count_db(⟨Scanner, Notebook, Ink-jet⟩) + count_DB(⟨Scanner, Notebook, Ink-jet⟩) < 3%·(|db| + |DB|), and it follows that sup_UD(⟨Scanner, Notebook, Ink-jet⟩) < 3% = ms(Scanner). Therefore, ⟨Scanner, Notebook, Ink-jet⟩ is not a frequent 3-itemset in UD. On the other hand, if ⟨Notebook, Ink-jet⟩ is not frequent in db, then sup_db(⟨Scanner, Notebook, Ink-jet⟩) ≤ sup_db(⟨Notebook, Ink-jet⟩) < ms(Notebook) = 3% = ms(Scanner). Hence, ⟨Scanner, Notebook, Ink-jet⟩ is not frequent in db, and we can again conclude that it is not a frequent 3-itemset in UD.

The procedure for generating F1 is described in Figure 3.

Example 5. Let us continue with Example 2. Tables 6 and 7 show the intermediate results of generating F1. First, scan db to find the counts of all 1-itemsets in db, then scan DB to find the counts of those 1-itemsets not in L1. The result is shown in Table 6. Next, combine the counts obtained in the first two steps and finally generate F1; the result is shown in Table 7.

1.  for each transaction t ∈ db do  /* Scan db and calculate count_db(a) */
2.    for each item a ∈ t do
3.      Add all ancestors of a in IA into t;
4.    Remove any duplicates from t;
5.    for each item a ∈ t do
6.      count_db(a)++;
7.  end for
8.  for each item a ∈ L1 do  /* Cases 1 & 2 */
9.    count_UD(a) = count_DB(a) + count_db(a);
10.   if sup_UD(a) ≥ ms(a) then
11.     L1′ = L1′ ∪ {a};
12. end for
13. for each transaction t ∈ DB do  /* Scan DB and calculate count_DB(a) */
14.                                 /* for a ∉ L1 */
15.   for each item a ∈ t do
16.     Add all ancestors of a in IA into t;
17.   Remove any duplicates from t;
18.   for each item a ∈ t such that a ∉ L1 do
19.     count_DB(a)++;
20. end for
21. for each item a ∉ L1 do  /* Cases 3 & 4 */
22.   count_UD(a) = count_DB(a) + count_db(a);
23.   if sup_UD(a) ≥ ms(a) then
24.     L1′ = L1′ ∪ {a};
25. end for
26. for each item a in SMS in sorted order do
27.   if a ∈ L1′ then
28.     Insert a into F1;
29.     break;
30.   end if
31. for each item b in SMS that is after a in sorted order do
32.   if sup_UD(b) ≥ ms(a) then
33.     Insert b into F1;

Figure 3. Procedure F1-gen(SMS, DB, db, L1, IA).

Table 6. The result of scanning db and DB.

L1                         Scan db                    Scan DB (not in L1)
1-itemset  Count  Sup (%)  1-itemset   Count  Sup (%)  1-itemset   Count  Sup (%)
Scanner    2      33.3     Desktop     1      50.0     Desktop     1      16.7
Laser      2      33.3     PC          1      50.0     Dot-matrix  3      50.0
Notebook   2      33.3     Laser       2      100.0    Non-impact  3      50.0
PC         3      50.0     Non-impact  2      100.0    Ink-jet     1      16.7
                           Printer     2      100.0    Printer     4      66.7

Table 7. Generating frontier set F1.

C1          Count  Sorted ms (%)  Sup (%)  F1
Scanner     2      15             25.0     Scanner
Laser       4      25             50.0     Laser
Desktop     2      25             25.0     Desktop
Notebook    2      25             25.0     Notebook
PC          4      35             50.0     PC
Ink-jet     1      60             12.5
Non-impact  5      65             62.5     Non-impact
Dot-matrix  3      70             37.5     Dot-matrix
Printer     6      80             75.0     Printer

Figure 4 shows an overview of the _Cumulate algorithm. As an illustration, consider the example shown in Figure 5. For convenience, the itemsets and frequent itemsets in DB, db, and UD are first shown in Tables 8 to 13. A simple picture of running the _Cumulate algorithm on the example in Figure 5 is shown in Figure 6. First, scan db and sort all the items in db. Second, load L1 and check whether the items in db are in L1; if so, we can calculate their supports straightforwardly. Otherwise, we must re-scan the original database to determine whether those candidate 1-itemsets are frequent or not. At the same time, we generate the frontier set F1 and use it to generate the candidate 2-itemsets C2. After that, the process is almost the same as that for generating L1, except that, if candidate k-itemsets are not frequent in DB, we can use count_db(Ck) to prune Ck, i.e., to determine whether the candidate k-itemsets in Ck are frequent or not in db.

1.  Create IMS;  /* The table of user-defined minimum supports */
2.  Create IA;  /* The table of each item and its ancestors from taxonomy T */
3.  SMS = sort(IMS);  /* Ascending sort according to ms(a) stored in IMS */
4.  F1 = F1-gen(SMS, DB, db, L1, IA);
5.  L1′ = {a ∈ F1 | sup_UD(a) ≥ ms(a)};
6.  for (k = 2; Lk−1′ ≠ ∅; k++) do
7.    if k = 2 then Ck = C2-gen(F1);
8.    else Ck = apriori-gen(Lk−1′);
9.    Delete any candidate in Ck that consists of an item and its ancestor;
10.   Delete any candidate in Ck that satisfies Lemma 3;
11.   Delete any ancestor in IA that is not present in any of the candidates in
12.     Ck;
13.   Delete any item in F1 that is not present in any of the candidates in Ck;
14.   Cal-count(db, F1, Ck);  /* Scan db and calculate counts of Ck */
15.   for each candidate A ∈ Lk do  /* Cases 1 & 2 */
16.     count_UD(A) = count_DB(A) + count_db(A);
17.     if count_UD(A) ≥ ms(A[1])·|UD| then  /* A[1] denotes the first item, with the smallest minimum support, in sorted A */
18.       Lk′ = Lk′ ∪ {A};
19.   end for
20.   for each candidate A ∈ Ck − Lk do  /* Cases 3 & 4 */
21.     if count_db(A) < ms(A[1])·|db| then
22.       Delete the candidate from Ck − Lk;
23.   Cal-count(DB, F1, Ck − Lk);  /* Scan DB and calculate counts of Ck − Lk */
24.   for each candidate A ∈ Ck − Lk do
25.     count_UD(A) = count_DB(A) + count_db(A);
26.     if count_UD(A) ≥ ms(A[1])·|UD| then
27.       Lk′ = Lk′ ∪ {A};
28.   end for
29. Result = ∪k Lk′;

Figure 4. Algorithm _Cumulate.

Figure 5. An example of updating generalized association rules.

Taxonomy T: A, F, and I are top-level items; B and D are children of A; C and E are children of B; G and H are children of F.

Original Database (DB):
TID  Items Purchased
1    H, C
2    I, D
3    D, E
4    H, D, C
5    I
6    G

Incremental Database (db):
TID  Items Purchased
7    G, C
8    C

Minimum Item Support Table (MIS):
Item  minsup (%)    Item  minsup (%)    Item  minsup (%)
A     80            D     70            G     25
B     65            E     60            H     25
C     25            F     35            I     15

Item Ancestor Table (IA):
Item  Ancestors
B     A
C     A, B
D     A
E     A, B
G     F
H     F

Hierarchy Table (HI): each item's level number, group, and sublevel in T (contents omitted).

Figure 6. An example of running _Cumulate (flow of scanning db and sorting its items, loading L1, re-scanning DB for candidates not in L1, generating F1 and L1, generating C2, pruning C2 with count_db, counting the survivors in db and DB to produce L2, and generating C3).
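The level-wise loop that Figure 4 specifies can be condensed into a short sketch. This is an assumption-laden simplification, not the paper's algorithm: it ignores the taxonomy extension, the frontier set, Lemma 3 pruning, and the db/DB case analysis, and simply mines one transaction list with per-item minimum supports to show the loop skeleton; all names are ours:

```python
# Simplified level-wise mining with multiple minimum supports (skeleton
# only; not the paper's _Cumulate, which also splits counting over db/DB).
def count_in(transactions, candidates):
    """Count how many transactions contain each candidate itemset."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        ts = set(t)
        for c in candidates:
            if set(c) <= ts:
                counts[c] += 1
    return counts

def level_wise(transactions, items, ms):
    n = len(transactions)
    # sort items by ascending ms so each tuple's first item has the min ms
    rank = {a: i for i, a in enumerate(sorted(items, key=lambda a: ms[a]))}
    level = [(a,) for a in sorted(items, key=lambda a: rank[a])]
    frequent = {}
    while level:
        counts = count_in(transactions, level)
        Lk = [c for c in level if counts[c] >= ms[c[0]] * n]
        frequent.update({c: counts[c] for c in Lk})
        # join frequent k-itemsets sharing a (k-1)-prefix, keeping ms order
        level = [a + (b[-1],) for a in Lk for b in Lk
                 if a[:-1] == b[:-1] and rank[a[-1]] < rank[b[-1]]]
    return frequent
```

Note that, per Lemma 2, a faithful implementation must build C2 from the frontier set rather than from L1 as this skeleton does; the sketch only illustrates the counting-and-join rhythm of each pass.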

Table 8. Summary of candidates, counts, and supports in DB for the example in Figure 5.

Table 9. Summary of candidates, counts, and supports in db for the example in Figure 5.

Table 10. Frontier set F1 for the example in Figure 5: F1 = ⟨I, C, G, H, F, B, D, A⟩ (E is excluded because sup_UD(E) = 12.5% < ms(I) = 15%).

Table 11. Summary of candidates, counts, and supports in UD for the example in Figure 5.

Table 12. Frequent itemsets of DB in Figure 5: L1 = {I, C, H, F}; L2 = {⟨I, D⟩, ⟨I, A⟩, ⟨C, H⟩, ⟨C, F⟩, ⟨H, B⟩, ⟨H, A⟩}; L3 = ∅.

Table 13. Frequent itemsets of UD in Figure 5: L1 = {I, C, G, H, F}; L2 = {⟨C, H⟩, ⟨C, F⟩, ⟨H, B⟩, ⟨H, A⟩, ⟨F, B⟩, ⟨F, A⟩}; L3 = ∅.

3.3 Algorithm _Stratify

The _Stratify algorithm is based on the concept of stratification introduced in [4], which refers to a level-wise counting strategy from the top level of the taxonomy down to the lowest level, in the hope that candidates containing items at higher levels will fail to reach minimum support, so that candidates containing items at lower levels need not be counted. However, this counting strategy may fail in the case of non-uniform minimum supports.

Example 6. Let {Printer, PC}, {Printer, Desktop}, and {Printer, Notebook} be the candidate itemsets to be counted, with the taxonomy and minimum supports defined as in Example 1. Using the level-wise strategy, we first count {Printer, PC} and assume that it is not frequent, i.e., sup({Printer, PC}) < 0.35. Since the minimum supports of {Printer, Desktop}, 0.25, and of {Printer, Notebook}, also 0.25, are less than that of {Printer, PC}, we cannot be sure that the occurrences of {Printer, Desktop} and {Printer, Notebook}, though no greater than that of {Printer, PC}, are also less than their minimum supports. In this case, we still have to count {Printer, Desktop} and {Printer, Notebook} even though {Printer, PC} does not have minimum support.

The following observation inspires the deployment of the _Stratify algorithm.

Lemma 4. Consider two k-itemsets ⟨a1, a2, …, ak⟩ and ⟨a1, a2, …, âk⟩, where âk is an ancestor of ak. If ⟨a1, a2, …, âk⟩ is not frequent, then neither is ⟨a1, a2, …, ak⟩.

Lemma 4 implies that if a sorted candidate itemset at a higher level of the taxonomy is not frequent, then neither is any of its descendants that differ from it only in the last item. We thus first divide Ck, according to the ancestor-descendant relationship claimed in Lemma 4, into two disjoint subsets, called the top candidate set TCk and the residual candidate set RCk, as defined below.

Definition 7. Consider the set, S, of candidates in Ck induced by the schema ⟨a1, a2, …, ak−1, *⟩, where '*' denotes don't care. A candidate k-itemset A = ⟨a1, a2, …, ak−1, ak⟩ is a top candidate if no candidate in S has a last item that is an ancestor of A[k]. That is, TCk = {A | A ∈ Ck, ¬∃A′ (A′ ∈ Ck, A′[i] = A[i] for 1 ≤ i ≤ k−1, and A′[k] is an ancestor of A[k])}, and RCk = Ck − TCk.

Example 7. Assume that the candidate 2-itemsets C2 in Example 1 consist of ⟨Scanner, PC⟩, ⟨Scanner, Desktop⟩, ⟨Scanner, Notebook⟩, ⟨Notebook, Laser⟩, ⟨Notebook, Non-impact⟩, and ⟨Notebook, Dot matrix⟩, and that the supports of items at higher levels are larger than those at lower levels. Then TC2 = {⟨Scanner, PC⟩, ⟨Notebook, Non-impact⟩, ⟨Notebook, Dot matrix⟩} and RC2 = {⟨Scanner, Desktop⟩, ⟨Scanner, Notebook⟩, ⟨Notebook, Laser⟩}. If the support of the top candidate ⟨Scanner, PC⟩ does not pass its minimum support ms(Scanner), we do not need to count the remaining descendant candidates ⟨Scanner, Desktop⟩ and ⟨Scanner, Notebook⟩. On the contrary, if ⟨Scanner, PC⟩ is a frequent itemset, we should perform another pass over the database to count ⟨Scanner, Desktop⟩ and ⟨Scanner, Notebook⟩ and determine whether they are frequent or not.

Our approach is that, for k ≥ 2, rather than counting all candidates in Ck as the _Cumulate algorithm does, we count the supports of the candidates in TCk, deleting the candidates that do not have minimum support as well as their descendants in RCk. If RCk is not empty, we then perform an extra scanning over the transaction database to

count the remaining candidates in RCk. Again, the infrequent candidates are eliminated. The resulting TCk and RCk, called TLk and RLk, respectively, form the set of frequent k-itemsets Lk. An overview of the _Stratify algorithm is described in Figure 7. The procedures for generating TCk and RCk are given in Figure 8 and Figure 9, respectively. Figure 10 shows a picture of running the _Stratify algorithm on the example in Figure 5. The procedure for generating the frontier set F1 and L1 is the same as in the _Cumulate algorithm. After that, we first sort Ck to find the top candidate set TCk, as the _Cumulate algorithm does, and generate TLk. We then use TLk to prune Ck and generate RCk. If RCk is not empty, we follow the same procedure as for generating TLk to generate RLk, and finally obtain Lk.
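The TCk/RCk split of Definition 7 can be sketched by grouping candidates on their shared (k−1)-prefix and comparing last items under the taxonomy. A minimal sketch; `split_tc_rc` and the `is_ancestor` predicate are our own names:

```python
# Sketch of the TC/RC split (Definition 7): within a group of candidates
# sharing the same first k-1 items, a candidate is "top" if no sibling's
# last item is an ancestor of its own last item.
from collections import defaultdict

def split_tc_rc(candidates, is_ancestor):
    groups = defaultdict(list)
    for c in candidates:
        groups[c[:-1]].append(c)       # group by the (k-1)-prefix schema
    tc, rc = [], []
    for group in groups.values():
        lasts = [c[-1] for c in group]
        for c in group:
            if any(is_ancestor(x, c[-1]) for x in lasts if x != c[-1]):
                rc.append(c)           # a sibling generalizes c's last item
            else:
                tc.append(c)
    return tc, rc
```

Counting TCk first then lets the algorithm drop every RCk descendant of a top candidate that misses its minimum support, and scan for the rest only when needed.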

25 . Create IMS; /* The table of user-defined minimum support */. Create HI; /* The table of each item with Hierarchy Level, Sublevel, and Group */ 3. Create IA; /* The table of each item and its ancestors from taxonomy T */ 4. SMS sort(ims); /* Ascending sort according to ms(a) stored in IMS */ 5. F F -gen(sms,, db, L, IA); 6. L {a F sup(a) ms(a)}; 7. for ( ; L ; ++) do 8. if then C C -gen(f) 9. else C C -gen( L );. Delete any candidate in C that consists of an item and its ancestor;. Delete any candidate in C that satisfies Lemma 3;. TC TC -gen(c, HI); /* Using C, HI to find top C */ 3. Delete any ancestor in IA that is not present in any of the candidates in TC ; 4. Delete any item in F that is not present in any of the candidates in C ; 5. Cal-count(db, F, TC ); /* Scan db and calculate counts of TC */ 6. for each candidate A L do /* Cases & */ 7. count (A) count (A) + count db (A); 8. if count (A) ms(a[]) then /* A[] denotes the first item with the smallest minimum support in sorted A */ 9. TL = TL {A};. end for. for each candidate A TC. if count db (A) ms(a[]) db then 3. Delete any candidate in TC L do /* Cases 3 & 4 */ 5 L ; 4. Cal-count(, F, TC L ); /* Scan and calculate counts of TC L */ 5. for each candidate A TC L do 6. count (A) count (A) + count db (A); 7. if count (A) ms(a[]) then 8. TL = TL {A}; 9. end for 3. RC RC -gen(c, TC, TL ); /* Use C, TL to find residual C */ 3. If RC then 3. Delete any ancestor in IA that is not present in any of the candidates in RC ; 33. Delete any item in F that is not present in any of the candidates in RC ; 34. Cal-count(db, F, RC ); /* Scan db and calculate counts of RC */ 35. for each candidate A L do /* Cases & */ 36. count (A) count (A) + count db (A); 37. if count (A) ms(a[]) then 38. RL = RL {A}; 39. end for

40.         for each candidate A ∈ RC_k − L_k do  /* Cases 3 & 4 */
41.             if count_db(A) < ms(A[1]) × |db| then
42.                 Delete candidate A from RC_k − L_k;
43.         Cal-count(DB, F, RC_k − L_k);  /* scan DB and calculate the
44.             counts of RC_k − L_k */
45.         for each candidate A ∈ RC_k − L_k do
46.             count_UD(A) ← count_DB(A) + count_db(A);
47.             if count_UD(A) ≥ ms(A[1]) × |UD| then
48.                 RL_k ← RL_k ∪ {A};
49.         end for
50.     end if
51.     L_k ← TL_k ∪ RL_k;
52. end for
53. Result ← ∪_k L_k;

Figure 7. Algorithm _Stratify.

for each itemset A ∈ C_k do
    Sort C_k according to the HI entry of A[k], preserving the ordering of A[1]A[2]…A[k−1];
for each itemset A ∈ C_k, in the same order, do
    S ← {A′ | A′ ∈ C_k and A′[i] = A[i], 1 ≤ i ≤ k−1};
    if A is not marked and no candidate A′ in S has A′[k] an ancestor of A[k] then
        insert A into TC_k;
        mark A and all of its descendants in S;
    end if
end for

Figure 8. Procedure TC_k-gen(C_k, HI).

for each itemset A ∈ TC_k do
    if A ∈ TL_k then
        S ← {A′ | A′ ∈ C_k and A′[i] = A[i], 1 ≤ i ≤ k−1};
        insert all descendants of A in S into RC_k;
    end if
end for

Figure 9. Procedure RC_k-gen(C_k, TC_k, TL_k).
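As a concrete illustration of Figures 8 and 9, the following Python sketch implements the two procedures for a toy taxonomy. The names (tc_gen, rc_gen, ancestors, depth) are ours, not the paper's; candidate itemsets are tuples whose items are already sorted by increasing minimum support, and depth(item) stands in for the HI table (smaller depth means a more general item).

```python
# A toy rendering of Figures 8 and 9. `ancestors` maps an item to the set
# of its generalizations in the taxonomy.

def same_prefix(A, B):
    """True if A and B agree on all but their last item."""
    return A[:-1] == B[:-1]

def tc_gen(C, depth):
    """Figure 8: per (k-1)-prefix group, keep the candidates whose last
    item is not a specialization of another candidate's last item."""
    C = sorted(C, key=lambda A: (A[:-1], depth(A[-1])))
    marked, TC = set(), []
    for A in C:
        if A in marked:
            continue
        group = [B for B in C if B != A and same_prefix(A, B)]
        if not any(B[-1] in ancestors[A[-1]] for B in group):
            TC.append(A)                      # A is a top candidate
            marked.update(B for B in group    # its descendants are deferred
                          if A[-1] in ancestors[B[-1]])
    return TC

def rc_gen(C, TC, TL):
    """Figure 9: residual candidates are the deferred descendants of the
    top candidates that turned out to be frequent."""
    RC = []
    for A in TC:
        if A in TL:
            RC.extend(B for B in C if B != A and same_prefix(A, B)
                      and A[-1] in ancestors[B[-1]])
    return RC

# toy taxonomy: item P generalizes p1 and p2
ancestors = {"x": set(), "P": set(), "p1": {"P"}, "p2": {"P"}}
C2 = [("x", "p1"), ("x", "P"), ("x", "p2")]
tops = tc_gen(C2, depth=lambda i: 0 if i == "P" else 1)
resid = rc_gen(C2, tops, TL={("x", "P")})
```

Here the only top candidate for the prefix ("x",) is ("x", "P"); since it is assumed frequent (it is in TL), its two specializations become the residual candidates to be counted in the second pass.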

[Figure 10, a running example of the _Stratify algorithm on the data of Figure 5, appears here; its diagram did not survive text extraction. It traces the generation and sorting of C_2, the selection of the top candidates TC_2, the scans of db and DB that produce TL_2, the residual candidates RC_2 and their counting to produce RL_2, and finally L_2 = TL_2 ∪ RL_2 and the generation of C_3.]

Figure 10. An example of _Stratify.
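The counting steps of Figure 7 (lines 16-29 and 35-48) follow the usual FUP-style case analysis: an itemset already frequent in DB (cases 1 and 2) only needs its increment count added, while an itemset not previously frequent (cases 3 and 4) is worth counting against DB only if it is frequent within the increment db. A simplified Python sketch, ignoring the taxonomy and using hypothetical names:

```python
# Simplified sketch of the case analysis in Figure 7. DB is the original
# database, db the increment; old_counts holds the DB-counts of the
# itemsets found frequent in the previous mining run.

def count_in(transactions, itemset):
    """Number of transactions containing every item of itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s.issubset(t))

def update_frequent(candidates, old_counts, DB, db, ms_of):
    """Return {itemset: count in UD} for itemsets frequent in UD = DB + db.

    ms_of(A) is the minimum support of A's first (smallest-ms) item,
    mirroring the role of ms(A[1]) in Figure 7."""
    n_UD, n_db = len(DB) + len(db), len(db)
    frequent = {}
    for A in candidates:
        c_db = count_in(db, A)
        if A in old_counts:                  # cases 1 & 2: DB-count is known
            total = old_counts[A] + c_db
        elif c_db >= ms_of(A) * n_db:        # cases 3 & 4: only a candidate
            total = count_in(DB, A) + c_db   # frequent in db justifies a DB scan
        else:
            continue                         # cannot be frequent in UD
        if total >= ms_of(A) * n_UD:
            frequent[A] = total
    return frequent

DB = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
db = [{"A", "B"}, {"A", "B", "C"}]
old = {("A", "B"): 2}                        # DB-count from the previous run
res = update_frequent([("A", "B"), ("B", "C")], old, DB, db,
                      lambda A: 0.5)         # uniform 50% ms for the demo
```

In this demo, ("A", "B") falls under cases 1 and 2 (its DB-count is reused), while ("B", "C") falls under cases 3 and 4 and triggers one scan of DB because it is frequent in db.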

4. Experiments

In this section, we evaluate the performance of the two algorithms, _Cumulate and _Stratify, using synthetic datasets generated by the IBM data generator [2]. We performed four experiments, varying a different parameter in each. All parameters except the one being varied were set to the default values shown in Table 4. All experiments were performed on an Intel Pentium II PC with 64 MB RAM, running Windows.

Table 4. Default parameter settings for synthetic data generation.

    Parameter                                   Default value
    |DB|  Number of original transactions
    |db|  Number of incremental transactions
    |t|   Average size of transactions          5
    N     Number of items
    R     Number of groups                      3
    L     Number of levels                      3
    F     Fanout                                5

Minimum Support: We use the following formula, adapted from [9], to generate a minimum support for each item a:

    ms(a) = α × sup(a),  if α × sup(a) > θ,
            θ,           otherwise,

where α (0 < α ≤ 1) controls how closely ms(a) follows the item's actual support, θ is a support floor, and sup(a) denotes the support of item a in the original database. The result is depicted in Figure 11. As shown in the figure, _Cumulate and _Stratify perform

significantly better than MMS_Stratify and MMS_Cumulate; the improvement ranges from 3 to 6 times and increases as α decreases. In addition, MMS_Stratify performs better than MMS_Cumulate for some settings of α, and _Cumulate performs slightly better than _Stratify.

[Figure 11, plotting execution time against α for the four algorithms, appears here.]

Figure 11. Performance comparison of MMS_Cumulate, MMS_Stratify, _Cumulate, and _Stratify for different values of α.

Number of Incremental Transactions: We then compared the efficiency of the four algorithms under various sizes of the incremental database. The number of incremental transactions was varied up to 85,000, with the minimum supports specified using α = 0.6. As shown in Figure 12, _Cumulate and _Stratify outperform MMS_Stratify and MMS_Cumulate, and _Cumulate performs better than _Stratify. The performance gap between _Stratify and MMS_Cumulate decreases as the incremental size increases. The reason is that at high α values, i.e., high minimum supports, the percentage of top candidates with generalized items becomes very small, so _Stratify cannot prune enough descendant itemsets to compensate for the cost of its extra database scan.
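The per-item support assignment used throughout these experiments (the formula above) amounts to taking max(α × sup(a), θ) for each item. A small sketch with illustrative numbers; the α and θ values here are arbitrary, not the paper's defaults:

```python
# Sketch of the support-assignment scheme adapted from [9]: each item's
# minimum support follows its actual support in the original database,
# scaled by alpha, but never drops below the floor theta.

def assign_ms(supports, alpha, theta):
    """supports: {item: support of the item in the original database}."""
    return {a: max(alpha * s, theta) for a, s in supports.items()}

# a frequent and a rare item: the rare item gets the floor value
ms = assign_ms({"Desktop": 0.30, "Scanner": 0.02}, alpha=0.6, theta=0.05)
```

Smaller α lowers every item's threshold, which enlarges the set of frequent itemsets and explains why the speedup of the incremental algorithms grows as α decreases.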

[Figure 12, plotting execution time against the number of incremental transactions (× 1,000), appears here.]

Figure 12. Performance comparison of MMS_Cumulate, MMS_Stratify, _Cumulate, and _Stratify for various sizes of incremental transactions.

Fanout: We varied the fanout from 3 to 9. Note that, with the number of items fixed, this corresponds to decreasing the number of levels in the taxonomy. The results are shown in Figure 13. All algorithms run faster as the fanout increases, because the number of generalized items, and hence the number of generalized itemsets, decreases along with the number of levels. Also note that the performance gap between _Cumulate and _Stratify on the one hand and MMS_Cumulate and MMS_Stratify on the other narrows at high fanout, since there are fewer candidate itemsets.

Number of Groups: We varied the number of groups upward from 5. As shown in Figure 14, the effect of increasing the number of groups is similar to that of increasing the fanout. The reason is that the number of items within a specific group decreases as the number of groups increases, so the probability of forming a generalized item decreases.

[Figure 13, plotting execution time against fanout, appears here.]

Figure 13. Performance comparison of MMS_Cumulate, MMS_Stratify, _Cumulate, and _Stratify for varying fanout.

[Figure 14, plotting execution time against the number of groups, appears here.]

Figure 14. Performance comparison of MMS_Cumulate, MMS_Stratify, _Cumulate, and _Stratify for varying number of groups.

5. Related Work

The problem of incrementally updating association rules was first addressed by Cheung et al. [4]. They formulated the problem of updating the discovered association rules when new transaction records are added to the database over time, and proposed an algorithm called FUP (Fast UPdate). By making use of the previously discovered frequent itemsets, FUP can dramatically reduce the effort required to compute the frequent itemsets in the updated database. They further examined the maintenance of multi-level association rules [5], and extended the model to cover deletion and modification of transactions [6]. None of these approaches, however, recognizes the varied support requirements inherent in items at different hierarchy levels.

Since then, a number of techniques have been proposed to improve the efficiency of incremental mining [8, 10, 12, 15]. In [12, 15], the authors proposed incremental updating techniques based mainly on the concept of negative borders. Empirical results showed that this approach can significantly reduce the I/O cost of maintaining association rules. In [10], Ng and Lam proposed an alternative incremental updating algorithm that incorporates a dynamic counting technique. Similar to its batch counterpart DIC [3], this algorithm can significantly reduce the number of database scans. In [8], Hong et al. developed an incremental mining algorithm based on the novel concept of pre-large itemsets. Given two user-specified support thresholds, an upper and a lower one, their approach uses the pre-large itemsets (those itemsets whose support exceeds the lower threshold) to avoid rescanning the original database until the accumulated number of inserted transactions exceeds a safety bound derived from the pre-large concept. Their work, however, does not exploit association rules with generalized items, nor does it consider multiple minimum supports.
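For reference, the safety bound of the pre-large approach in [8] limits how many transactions may be inserted before the original database must be rescanned. The formula below is the one usually given in the pre-large literature (it does not appear in this paper's text), with Su and Sl the upper and lower support thresholds:

```python
# Sketch of the pre-large safety bound: with an original database of
# n_original transactions, at most f insertions can occur before any
# itemset below the lower threshold Sl could possibly reach the upper
# threshold Su in the updated database, so no rescan is needed until then.
from math import floor

def safety_bound(n_original, Su, Sl):
    """Max number of insertions before the original database is rescanned."""
    return floor((Su - Sl) * n_original / (1 - Su))

f = safety_bound(100_000, Su=0.04, Sl=0.02)
```

With these illustrative thresholds the bound is 2,083 transactions; widening the gap between Su and Sl (or growing the original database) defers the rescan further.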
To sum up, all previous works on the incremental maintenance of association rules address only some of the aspects discussed in this paper; to our knowledge, no work has considered simultaneously both issues of taxonomy and non-uniform support specification. A summary of these works is given in Table 5.

Table 5. A summary of related work on incremental maintenance of association rules.

    Contributors              Support specification   Exploiting taxonomy   Type of update
    Cheung et al. [4]         uniform                 no                    insertion
    Cheung et al. [5]         uniform                 yes                   insertion
    Cheung et al. [6]         uniform                 no                    insertion, deletion, and modification
    Hong et al. [8]           uniform                 no                    insertion
    Ng and Lam [10]           uniform                 no                    insertion
    Sarda and Srinivas [12]   uniform                 no                    insertion
    Thomas et al. [15]        uniform                 no                    insertion

6. Conclusions

In this paper, we have investigated the problem of maintaining association rules in the presence of a taxonomy and multiple minimum supports. We presented two novel algorithms, _Cumulate and _Stratify, for maintaining multi-supported, generalized frequent itemsets. The proposed algorithms incrementally update the discovered generalized association rules under a non-uniform support specification, and effectively reduce the number of candidate itemsets and the amount of database re-scanning. Empirical evaluation showed that the two algorithms are not only up to 6 times faster than running MMS_Cumulate or MMS_Stratify on the updated database afresh, but also exhibit good linear scale-up. In the future, we will extend the maintenance of generalized association rules to a more general model that covers deletion, modification, and fuzzy taxonomic structures. We will also investigate the problem of on-line discovery and maintenance of multidimensional association rules from data warehouse data.

References

[1] R. Agrawal, T. Imielinski, and A. Swami, Mining association rules between sets of items in large databases, in: Proc. of the 1993 ACM-SIGMOD International Conference on Management of Data, 1993.

[2] R. Agrawal and R. Srikant, Fast algorithms for mining association rules, in: Proc. of the 20th International Conference on Very Large Data Bases, 1994.

[3] S. Brin, R. Motwani, J.D. Ullman, and S. Tsur, Dynamic itemset counting and implication rules for market-basket data, in: Proc. of the 1997 ACM-SIGMOD International Conference on Management of Data, 1997.

[4] D.W. Cheung, J. Han, V.T. Ng, and C.Y. Wong, Maintenance of discovered association rules in large databases: an incremental update technique, in: Proc. of the 12th International Conference on Data Engineering, 1996.

[5] D.W. Cheung, V.T. Ng, and B.W. Tam, Maintenance of discovered knowledge: a case in multi-level association rules, in: Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996.

[6] D.W. Cheung, S.D. Lee, and B. Kao, A general incremental technique for maintaining discovered association rules, in: Proc. of the 5th International Conference on Database Systems for Advanced Applications (DASFAA'97), 1997.

[7] J. Han and Y. Fu, Discovery of multiple-level association rules from large databases, in: Proc. of the 21st International Conference on Very Large Data Bases, 1995.

[8] T.P. Hong, C.Y. Wang, and Y.H. Tao, Incremental data mining based on two support thresholds, in: Proc. of the 4th International Conference on Knowledge-Based Intelligent Engineering Systems and Allied Technologies, 2000.

[9] B. Liu, W. Hsu, and Y. Ma, Mining association rules with multiple minimum supports, in: Proc. of the 5th International Conference on Knowledge Discovery and Data Mining, 1999.

[10] K.K. Ng and W. Lam, Updating of association rules dynamically, in: Proc. of the 1999 International Symposium on Database Applications in Non-Traditional Environments (DANTE'99), 1999.

[11] J.S. Park, M.S. Chen, and P.S. Yu, An effective hash-based algorithm for mining association rules, in: Proc. of the 1995 ACM-SIGMOD International Conference on Management of Data, 1995.

[12] N.L. Sarda and N.V. Srinivas, An adaptive algorithm for incremental mining of association rules, in: Proc. of the 9th International Workshop on Database and Expert Systems Applications (DEXA'98), 1998.

[13] A. Savasere, E. Omiecinski, and S. Navathe, An efficient algorithm for mining association rules in large databases, in: Proc. of the 21st International Conference on Very Large Data Bases, 1995.

[14] R. Srikant and R. Agrawal, Mining generalized association rules, in: Proc. of the 21st International Conference on Very Large Data Bases, 1995.

[15] S. Thomas, S. Bodagala, K. Alsabti, and S. Ranka, An efficient algorithm for the incremental updation of association rules in large databases, in: Proc. of the 3rd International Conference on Knowledge Discovery and Data Mining, 1997.

[16] M.C. Tseng and W.Y. Lin, Mining generalized association rules with multiple minimum supports, in: Proc. of the 3rd International Conference on Data Warehousing and Knowledge Discovery, 2001.


More information

On Multiple Query Optimization in Data Mining

On Multiple Query Optimization in Data Mining On Multiple Query Optimization in Data Mining Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland {marek,mzakrz}@cs.put.poznan.pl

More information

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL 68 CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL 5.1 INTRODUCTION During recent years, one of the vibrant research topics is Association rule discovery. This

More information

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Apriori Algorithm For a given set of transactions, the main aim of Association Rule Mining is to find rules that will predict the occurrence of an item based on the occurrences of the other items in the

More information

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application Data Structures Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali 2009-2010 Association Rules: Basic Concepts and Application 1. Association rules: Given a set of transactions, find

More information

A Comparative Study of Association Rules Mining Algorithms

A Comparative Study of Association Rules Mining Algorithms A Comparative Study of Association Rules Mining Algorithms Cornelia Győrödi *, Robert Győrödi *, prof. dr. ing. Stefan Holban ** * Department of Computer Science, University of Oradea, Str. Armatei Romane

More information

Algorithm for Efficient Multilevel Association Rule Mining

Algorithm for Efficient Multilevel Association Rule Mining Algorithm for Efficient Multilevel Association Rule Mining Pratima Gautam Department of computer Applications MANIT, Bhopal Abstract over the years, a variety of algorithms for finding frequent item sets

More information

Improving Efficiency of Apriori Algorithm using Cache Database

Improving Efficiency of Apriori Algorithm using Cache Database Improving Efficiency of Apriori Algorithm using Cache Database Priyanka Asthana VIth Sem, BUIT, Bhopal Computer Science Deptt. Divakar Singh Computer Science Deptt. BUIT, Bhopal ABSTRACT One of the most

More information

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM G.Amlu #1 S.Chandralekha #2 and PraveenKumar *1 # B.Tech, Information Technology, Anand Institute of Higher Technology, Chennai, India

More information

Frequent Itemsets Melange

Frequent Itemsets Melange Frequent Itemsets Melange Sebastien Siva Data Mining Motivation and objectives Finding all frequent itemsets in a dataset using the traditional Apriori approach is too computationally expensive for datasets

More information

Monotone Constraints in Frequent Tree Mining

Monotone Constraints in Frequent Tree Mining Monotone Constraints in Frequent Tree Mining Jeroen De Knijf Ad Feelders Abstract Recent studies show that using constraints that can be pushed into the mining process, substantially improves the performance

More information

Association mining rules

Association mining rules Association mining rules Given a data set, find the items in data that are associated with each other. Association is measured as frequency of occurrence in the same context. Purchasing one product when

More information

An Algorithm for Mining Large Sequences in Databases

An Algorithm for Mining Large Sequences in Databases 149 An Algorithm for Mining Large Sequences in Databases Bharat Bhasker, Indian Institute of Management, Lucknow, India, bhasker@iiml.ac.in ABSTRACT Frequent sequence mining is a fundamental and essential

More information

Mining Frequent Patterns without Candidate Generation

Mining Frequent Patterns without Candidate Generation Mining Frequent Patterns without Candidate Generation Outline of the Presentation Outline Frequent Pattern Mining: Problem statement and an example Review of Apriori like Approaches FP Growth: Overview

More information

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the Chapter 6: What Is Frequent ent Pattern Analysis? Frequent pattern: a pattern (a set of items, subsequences, substructures, etc) that occurs frequently in a data set frequent itemsets and association rule

More information

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach ABSTRACT G.Ravi Kumar 1 Dr.G.A. Ramachandra 2 G.Sunitha 3 1. Research Scholar, Department of Computer Science &Technology,

More information

Web page recommendation using a stochastic process model

Web page recommendation using a stochastic process model Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,

More information

Raunak Rathi 1, Prof. A.V.Deorankar 2 1,2 Department of Computer Science and Engineering, Government College of Engineering Amravati

Raunak Rathi 1, Prof. A.V.Deorankar 2 1,2 Department of Computer Science and Engineering, Government College of Engineering Amravati Analytical Representation on Secure Mining in Horizontally Distributed Database Raunak Rathi 1, Prof. A.V.Deorankar 2 1,2 Department of Computer Science and Engineering, Government College of Engineering

More information

Mining Frequent Itemsets from Uncertain Databases using probabilistic support

Mining Frequent Itemsets from Uncertain Databases using probabilistic support Mining Frequent Itemsets from Uncertain Databases using probabilistic support Radhika Ramesh Naik 1, Prof. J.R.Mankar 2 1 K. K.Wagh Institute of Engg.Education and Research, Nasik Abstract: Mining of frequent

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

An Efficient Algorithm for finding high utility itemsets from online sell

An Efficient Algorithm for finding high utility itemsets from online sell An Efficient Algorithm for finding high utility itemsets from online sell Sarode Nutan S, Kothavle Suhas R 1 Department of Computer Engineering, ICOER, Maharashtra, India 2 Department of Computer Engineering,

More information

Pincer-Search: An Efficient Algorithm. for Discovering the Maximum Frequent Set

Pincer-Search: An Efficient Algorithm. for Discovering the Maximum Frequent Set Pincer-Search: An Efficient Algorithm for Discovering the Maximum Frequent Set Dao-I Lin Telcordia Technologies, Inc. Zvi M. Kedem New York University July 15, 1999 Abstract Discovering frequent itemsets

More information

Chapter 4: Association analysis:

Chapter 4: Association analysis: Chapter 4: Association analysis: 4.1 Introduction: Many business enterprises accumulate large quantities of data from their day-to-day operations, huge amounts of customer purchase data are collected daily

More information

FIT: A Fast Algorithm for Discovering Frequent Itemsets in Large Databases Sanguthevar Rajasekaran

FIT: A Fast Algorithm for Discovering Frequent Itemsets in Large Databases Sanguthevar Rajasekaran FIT: A Fast Algorithm for Discovering Frequent Itemsets in Large Databases Jun Luo Sanguthevar Rajasekaran Dept. of Computer Science Ohio Northern University Ada, OH 4581 Email: j-luo@onu.edu Dept. of

More information

A Technical Analysis of Market Basket by using Association Rule Mining and Apriori Algorithm

A Technical Analysis of Market Basket by using Association Rule Mining and Apriori Algorithm A Technical Analysis of Market Basket by using Association Rule Mining and Apriori Algorithm S.Pradeepkumar*, Mrs.C.Grace Padma** M.Phil Research Scholar, Department of Computer Science, RVS College of

More information

Association Rule Mining. Entscheidungsunterstützungssysteme

Association Rule Mining. Entscheidungsunterstützungssysteme Association Rule Mining Entscheidungsunterstützungssysteme Frequent Pattern Analysis Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

More information

SETM*-MaxK: An Efficient SET-Based Approach to Find the Largest Itemset

SETM*-MaxK: An Efficient SET-Based Approach to Find the Largest Itemset SETM*-MaxK: An Efficient SET-Based Approach to Find the Largest Itemset Ye-In Chang and Yu-Ming Hsieh Dept. of Computer Science and Engineering National Sun Yat-Sen University Kaohsiung, Taiwan, Republic

More information

RHUIET : Discovery of Rare High Utility Itemsets using Enumeration Tree

RHUIET : Discovery of Rare High Utility Itemsets using Enumeration Tree International Journal for Research in Engineering Application & Management (IJREAM) ISSN : 2454-915 Vol-4, Issue-3, June 218 RHUIET : Discovery of Rare High Utility Itemsets using Enumeration Tree Mrs.

More information

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining Miss. Rituja M. Zagade Computer Engineering Department,JSPM,NTC RSSOER,Savitribai Phule Pune University Pune,India

More information

A Software Testing Optimization Method Based on Negative Association Analysis Lin Wan 1, Qiuling Fan 1,Qinzhao Wang 2

A Software Testing Optimization Method Based on Negative Association Analysis Lin Wan 1, Qiuling Fan 1,Qinzhao Wang 2 International Conference on Automation, Mechanical Control and Computational Engineering (AMCCE 2015) A Software Testing Optimization Method Based on Negative Association Analysis Lin Wan 1, Qiuling Fan

More information

A Mining Algorithm to Generate the Candidate Pattern for Authorship Attribution for Filtering Spam Mail

A Mining Algorithm to Generate the Candidate Pattern for Authorship Attribution for Filtering Spam Mail A Mining Algorithm to Generate the Candidate Pattern for Authorship Attribution for Filtering Spam Mail Khongbantabam Susila Devi #1, Dr. R. Ravi *2 1 Research Scholar, Department of Information & Communication

More information

2 CONTENTS

2 CONTENTS Contents 5 Mining Frequent Patterns, Associations, and Correlations 3 5.1 Basic Concepts and a Road Map..................................... 3 5.1.1 Market Basket Analysis: A Motivating Example........................

More information

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set To Enhance Scalability of Item Transactions by Parallel and Partition using Dynamic Data Set Priyanka Soni, Research Scholar (CSE), MTRI, Bhopal, priyanka.soni379@gmail.com Dhirendra Kumar Jha, MTRI, Bhopal,

More information

Aggregation and maintenance for database mining

Aggregation and maintenance for database mining Intelligent Data Analysis 3 (1999) 475±490 www.elsevier.com/locate/ida Aggregation and maintenance for database mining Shichao Zhang School of Computing, National University of Singapore, Lower Kent Ridge,

More information

Mining of association rules is a research topic that has received much attention among the various data mining problems. Many interesting wors have be

Mining of association rules is a research topic that has received much attention among the various data mining problems. Many interesting wors have be Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules. S.D. Lee David W. Cheung Ben Kao Department of Computer Science, The University of Hong Kong, Hong Kong. fsdlee,dcheung,aog@cs.hu.h

More information

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA METANAT HOOSHSADAT, SAMANEH BAYAT, PARISA NAEIMI, MAHDIEH S. MIRIAN, OSMAR R. ZAÏANE Computing Science Department, University

More information

AN IMPROVED GRAPH BASED METHOD FOR EXTRACTING ASSOCIATION RULES

AN IMPROVED GRAPH BASED METHOD FOR EXTRACTING ASSOCIATION RULES AN IMPROVED GRAPH BASED METHOD FOR EXTRACTING ASSOCIATION RULES ABSTRACT Wael AlZoubi Ajloun University College, Balqa Applied University PO Box: Al-Salt 19117, Jordan This paper proposes an improved approach

More information

ALGORITHM FOR MINING TIME VARYING FREQUENT ITEMSETS

ALGORITHM FOR MINING TIME VARYING FREQUENT ITEMSETS ALGORITHM FOR MINING TIME VARYING FREQUENT ITEMSETS D.SUJATHA 1, PROF.B.L.DEEKSHATULU 2 1 HOD, Department of IT, Aurora s Technological and Research Institute, Hyderabad 2 Visiting Professor, Department

More information

Data Structure for Association Rule Mining: T-Trees and P-Trees

Data Structure for Association Rule Mining: T-Trees and P-Trees IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 6, JUNE 2004 1 Data Structure for Association Rule Mining: T-Trees and P-Trees Frans Coenen, Paul Leng, and Shakil Ahmed Abstract Two new

More information

A Modern Search Technique for Frequent Itemset using FP Tree

A Modern Search Technique for Frequent Itemset using FP Tree A Modern Search Technique for Frequent Itemset using FP Tree Megha Garg Research Scholar, Department of Computer Science & Engineering J.C.D.I.T.M, Sirsa, Haryana, India Krishan Kumar Department of Computer

More information

Ecient Parallel Data Mining for Association Rules. Jong Soo Park, Ming-Syan Chen and Philip S. Yu. IBM Thomas J. Watson Research Center

Ecient Parallel Data Mining for Association Rules. Jong Soo Park, Ming-Syan Chen and Philip S. Yu. IBM Thomas J. Watson Research Center Ecient Parallel Data Mining for Association Rules Jong Soo Park, Ming-Syan Chen and Philip S. Yu IBM Thomas J. Watson Research Center Yorktown Heights, New York 10598 jpark@cs.sungshin.ac.kr, fmschen,

More information

A Further Study in the Data Partitioning Approach for Frequent Itemsets Mining

A Further Study in the Data Partitioning Approach for Frequent Itemsets Mining A Further Study in the Data Partitioning Approach for Frequent Itemsets Mining Son N. Nguyen, Maria E. Orlowska School of Information Technology and Electrical Engineering The University of Queensland,

More information

EVALUATING GENERALIZED ASSOCIATION RULES THROUGH OBJECTIVE MEASURES

EVALUATING GENERALIZED ASSOCIATION RULES THROUGH OBJECTIVE MEASURES EVALUATING GENERALIZED ASSOCIATION RULES THROUGH OBJECTIVE MEASURES Veronica Oliveira de Carvalho Professor of Centro Universitário de Araraquara Araraquara, São Paulo, Brazil Student of São Paulo University

More information