Mining Association rules with Dynamic and Collective Support Thresholds

Size: px

Start display at page:

Download "Mining Association rules with Dynamic and Collective Support Thresholds"

Coleen Warner
5 years ago
Views:

1 Mining Association rules with Dynamic and Collective Suort Thresholds C S Kanimozhi Selvi and A Tamilarasi Abstract Mining association rules is an imortant task in data mining. It discovers the hidden, interesting relationshis (associations) between items in the database based on the user-secified suort and confidence thresholds. In order to find relevant associations one has to secify an aroriate suort threshold. The suort threshold lays an imortant role in deciding the number of aroriate rules found. The rare associations will not aear if a high threshold is set. Some uninteresting associations may aear if a low threshold is set. This aer rooses an aroach to obtain the aroriate suort thresholds at each level of the level-wise mining aroach. It sets the suort threshold by analyzing the of items and their associations in the database at each level. Exerimental results show that this aroach roduces the interesting rules without secifying the user secified suort threshold. Index Terms Association rules, Collective Suort, Dynamic Suort, Frequent itemset I. INTRODUCTION Data mining is widely used in various alication areas such as banking, marketing and retail industry and so on. Association rule mining is one technique used in data mining to discover hidden associations that occur between various data items [1]. An association rule [1] is an exression of the form X Y, where X, Y are itemsets. It reveals the relationshi between the itemsets X and Y. The ortion of transactions containing X also containing Y, i.e., P(Y X) = P( X Y ) / P( X ) is called the confidence (conf) of the rule. The suort (su) of the rule is the ortion of the transactions that contain all items both in X and Y, i.e., su(x Y) = P (XUY). To generate an interesting association rule, the suort and confidence of the rule should satisfy a user-secified minimum suort called MINSUP and minimum confidence called MINCON, resectively. Mining association rules consists of two stes [1]: 1) Finding all the frequent itemsets having adequate suort. 2) Producing association rules from these frequent itemsets. The outcome des uon the results of ste1. Manuscrit received May 4, C S Kanimozhiselvi, Assistant Professor, Deartment of Comuter Science and Engineering, Kongu Engineering College, Perundurai Erode, Tamilnadu, India Phone: ; A Tamilarasi, Professor, Deartment of Comuter Science and Engineering, Kongu Engineering College, Perundurai Erode, Tamilnadu, India. According to ste 1, the itemsets whose suort value is greater than the MINSUP are labeled as frequent itemsets. The result of this ste lays an imortant role in roducing the association rules. Consequently, the concentration has been given only on the first ste by many researchers. However, without secific knowledge, users will have difficulties in setting the MINSUP to obtain their required results. If MINSUP secified by the user is not aroriate, the user may find many meaningless rules or may miss some interesting rules. Hence, the user has to try many ossibilities for secifying the MINSUP in order to find the aroriate one. To overcome these roblems, a technique is required to generate the rules of high confidence without having user secified suort constraint. To meet out this task, the aer rooses an aroach to secify suort for frequent itemset generation without consulting the users. This aroach finds an initial MINSUP value by analyzing the itemsets and their. It also rooses a collective suort threshold on the subsequent levels based on the revious level suort and the items considered in the current level. The rest of the aer is organized as follows: Section II revisits the roblem of association rule mining and exlores the need for dynamic and collective suort thresholds for the generation of association rules. Section III rovides a detailed insight into the modified association rule framework and exlains the suort secification and mining rocess for this model. Section IV reorts the exerimental results on the IBM synthetic data uon rule set size. Finally, the conclusions are ointed out in Section V. II. RELATED WORK Many researchers considered association rule mining as an interesting research area and studied widely [1-11]. Most of these studies address the issue of finding the association rules that satisfy user-secified minimum suort and minimum confidence constraints. Aroaches like Ariori [1,12] and FP-Growth [13] emloy the uniform minimum suort at all levels. These aroaches assume that all items in the data are of the same kind and have similar frequencies in the database but this assumtion is not alicable for real-life alications. In many alications, some items aear very frequently in the database, while others hardly ever aear. One cannot claim that the frequent itemsets are alone interesting, but the rare items would also matter. To identify the frequent and rare items, an aroriate minimum suort has to be secified. Otherwise the user would face two roblems [4]

2 9) Generation of fewer rules uon secifying the high minimum suort. 10) Generation of too many rules sometimes uninteresting uon secifying low minimum suort. This may lead to the develoment of stronger rule runing techniques. The aer [4] argues that the single MINSUP for the whole database is inadequate, because it cannot cature the inherent nature and / or differences of the items in the database. Therefore it suggests a model with multile minimum suorts for items at each level. Even if the user secifies multile minimum suorts, the rules generated may not be more aroriate and interesting. A better solution lies in develoing suort constraints, which secify minimum suort required for the itemsets, so that only the necessary itemsets are generated. If more than one suort constraint is satisfied by an itemset, then the one with minimum suort value should be adoted. This has been studied in the aer [6]. This aroach also has some roblems: 1) This aroach deals with only the secific roblems but not in general. Moreover it assumes that the user has adequate domain knowledge. 2) Using the aroach, one can analyze the nature of the items but can t able to know of the items until all the items in the database are scanned and candidate items are generated. Correlation based framework [10] without the use of suort thresholds has also been studied in the ast. It uses contingency table to find ositively correlated items. Although the correlation framework discovers strongly correlated items, the suort threshold is still imortant. Without the suort threshold, the comutation cost will be too high and many ineffectual itemsets would be generated [10]. As such, users still face the roblem of aroriate suort secification. In [11], the authors avoided the use of suort measure to find the interesting associations. This aroach is not suitable for alications that adhere to the traditional asymmetric confidence measure as ointed out by [11]. Although substantial effort has been made to lighten these roblems, such as adding the lift or conviction measure, which derives the minimum suort dynamically from the item suort, it also leaves out some interesting rules comosed of more than two items because the secification is derived from frequent 2-itemsets [16]. Another effort is made by [14, 15] for finding the N-most frequent itemsets. It allows the user to control the result by secifying the value N, thus leading to user intervention. This aer aims at making the user free from secifying any constraints including suort constraints. It rooses an aroach to calculate the minimum suort threshold dynamically by scanning the records in the database so that all frequent, interesting and meaningful rules would be generated. This minimizes the generation of large number of rules and reduces the need for rule runing techniques. Exeriment results on synthetic data show that the roosed technique is effective. III. ASSOCIATION RULE MINING FRAMEWORK Consider a given transaction database T = {r 1, r 2,, rn}, where each record ri, 1 i n is a set of items from a set I of items, i.e., ri I. The basic association rule model as follows: Let I = {i 1, i 2, i m } be a set of items. Let R be a set of records in the database, where each record r is a set of items such that r I. An association rule is an exression of the form, X Y, where X I, Y I, and X Y = φ. The rule X Y holds in the set of records R with confidence c if c% of records in R that suorts X also suorts Y. The rule has suort s in R if s% of the records in R contains X U Y. Given a set of records R, the roblem of mining association rules is to discover all association rules that have suort and confidence greater than the user-secified minimum suort (MINSUP) and minimum confidence (MINCON). An association rule mining algorithm works in two stes [3, 4]: 1. Generate frequent itemsets that satisfy MINSUP. 2. Generate interesting association rules that satisfy MINCON using the frequent itemsets. A. Proosed Model In this model, we roose two minimum suort counts namely Dynamic Minimum Suort Count (DMS) and Collective Minimum Suort Count (CMS) for the itemset generation at each level. Initially, DMS is calculated while scanning the items in the database. CMS is calculated during the itemset generation. DMS reflects the of items in the database. CMS reflects the intrinsic nature of items in the database by carrying over the existing suort to the next level. This model is based on multile minimum suorts model. In each ass, a different minimum suort value is used (i.e) the DMS and CMS values are calculated in each ass. Initially, the DMS is used for itemset generation and in the subsequent asses CMS values are used to find the frequent itemsets. Let there are n items in the database and su be the suort of each item in the database and reresents the current ass. MAXS and MINS denote the Maximum Suort and Minimum Suort resectively. The total suort of items considered in each ass is TOTOCC. TOTOCC DMS = n i= 1 su TOTOCC MINS + MAXS = + 2 n 2 1 [ DMS + DMS ] 1 CMS = 1 4 The calculation for the DMS values has been the same at all level. Here DMS reresents the value at the current level whereas the DMS -1 reresents the value at the revious level. B. Frequent Itemset Generation The roosed algorithm exts the Ariori algorithm for finding large itemsets. We call the new algorithm, DCS (Dynamic Collective Suort) Ariori. The new algorithm is also based on level wise search. It generates all large itemsets by making multile asses over the data. In each ass, it

3 counts the suorts of itemsets and finds the MINS and MAXS values, TOTOCC and thus the DMS value. Initially, the itemsets that satisfy the DMS value are retrieved. The DMS value is calculated based on the candidates generated in the revious ass. Let L k denote the list of k-itemsets. Each itemset c is of the following form, {c[1], c[2],, c[k]}, which consists of items, c[1], c[2],, c[k]. The algorithm is given below: Algorithm DCSAriori Inut : Set of records R in Transaction database T Outut : Frequent itemsets L 1 = find frequent 1 itemsets(r) for (k = 2; L k-1 Ø, k++) do C k =candidate-gen (L k-1 ) for each Record r in Є R do C t = subset (C k, r); for each candidate c Є Ct do c.count++; CMS = su_calc(c k ) L k = {c Є C k c.count ( CMS )} return U k L k ; Procedure Candidate-gen(L k-1) for each itemset l 1 Є L k-1 for each itemset l 2 Є L k-1 erform join oeration l1 l 2 if has_infrequent_subset(c, L k-1) rune c; else add c to C k; if return C k; Procedure has_infrequent_subset(c, L k-1) for each (k-1) subset s of c if s is in L k-1 return false; else return true; if Procedure su_calc(c k ) MINS = c.count c.count is minimum for all C k and c.count > 1 MAXS = c.count c.count is maximum for all C k TOTOCC = sum(c.count) for all C k DMS = Average( (Average (MAXS, MINS )+Average(TOTOCC )) If (= 1) then CMS = DMS Else CMS = (DMS -1 + DMS ) / 4 End if return CMS Examle: Consider the following dataset. TABLE I: TRANSACTION DATABASE R1 : I 1,I 2,I 5 R2 : I 1,I 3 R3 : I 3,I 4 R4 : I 1, I 2, I 3 R5: I 1,I 2,I 3,I 5, I 6 R6: I 2,I 5,I 6 R7 : I 2,I 5,I 6,I 7 Initially the database is scanned and the suort counts of itemsets are found. If there are n items and the minimum suort and maximum suort are MINS and MAXS resectively. The sum of suort of all items is known as TOTOCC. In the first ass, the algorithm retrieves the itemsets whose suort count is greater than the DMS value as frequent itemsets. In the subsequent asses, the algorithm uses the CMS value to rune the uninteresting items. Using the formula 1 and 2, the roosed algorithm finds out the frequent itemsets in the first ass as: I 1, I 2, I 3, I 5. Similarly during the second ass the DMS and the CMS values are calculated and the itemsets {I 1, I 2 },{I 1, I 3 }, {I 1, I 5 },{{I 2, I 3 }, {I 2,I 6 },{I 3, I 6 } are generated. During the third ass the itemsets {I 2, I 3, I 5 }, {I 1, I 2, I 3 }, {I 1, I 2, I 5 }, {I 1, I 3, I 5 } are generated as frequent itemsets. C. Rule Generation The roosed algorithm adots the confidence based rule generation model of ariori [1] for rule generation. The confidence threshold can be used to find out the interesting rule set. The confidence of a rule is its suort divided by the suort of its antecedent. For examle, the following rule { I 2, I 3 } { I 5 } has confidence equivalent to suort for { I 2, I 3, I 5 } / suort for { I 2, I 3 }. Association rules are generated as follows. For each frequent itemset fl, all non emty subsets of fl are generated. For every non emty subset s of fl, the rule s (fl-s) is formed, if suort-count (fl) / suort-count of (s) Minimum Confidence. After the generation of frequent items, the algorithm checks if it satisfies the MINCON threshold. If their confidence is larger than MINCON then they will be generated as interesting association rules. Otherwise, the rules will be discarded. The algorithm for rule generation is given below: Algorithm: Rule_gen(L k, MINCON) for each frequent itemset fl L k do for each nonemty subset s of fl do if c.count(fl)/c.count(s) MINCON then outut the rule s (fl - s); IV. EXPERIMENTAL RESULTS In this section, the roosed DCSAriori algorithm obtains the frequent itemsets by finding various Collective Minimum Suort (CMS) at each level. Then a minimum confidence is set by the user to find the interesting association rules. We use different minimum confidence thresholds at each run. The Ariori algorithm with the standard uniform suort (US, set to 0.5 %,1%,2%,3% and 4)

is also evaluated. The evaluation is examined based on the number of rules found. All exeriments are erformed on an Intel Pentium-IV with 256 MB RAM, running Windows 98.

8% 28.738% 0.11% 0.10% As shown in Figs 1,2,3,4 and 5 DCSAriori generates not less rules or not more rules while varying the confidence levels.

4 is also evaluated. The evaluation is examined based on the number of rules found. All exeriments are erformed on an Intel Pentium-IV with 256 MB RAM, running Windows 98. We use two synthetic data sets generated from IBM synthetic data generator [1]: T10.I4.D100K and T40.I10.D100K. Characteristics of these two data sets are shown in Table II. Fig 2 Ariori Vs DCSAriori with Suort=1.0% for dataset T10I4D100K TABLE II PROPERTIES OF DATASETS T10I4D100K T40I10D100K Number of 100, ,000 Transactions Number of items Minimum item Maximum item Average item 0.001% 0.005% 7.8% % 0.11% 0.10% As shown in Figs 1,2,3,4 and 5 DCSAriori generates not less rules or not more rules while varying the confidence levels. It retains the medium level where as ariori roduces number of less confidence rules and at high confidence it roduces fewer rules. This means that the ratio of surious frequent itemsets increases when the MINCON becomes higher, and more uninteresting rules will be generated. The advantage of DCSAriori is that it sets the aroriate minsu at each level based on the and association of items. It exlores more hidden but effective frequent itemsets and avoids the generation of surious frequent itemsets and thus the low confidence rules. Fig 3 Ariori Vs DCSAriori with Suort=1.0% for dataset T40I10D100K Fig 4 Ariori Vs DCSAriori with Suort=3.0% for dataset T40I10D100K Fig 5 Ariori Vs DCSAriori with Suort=4.0% for dataset T40I10D100K Fig 1. Ariori Vs DCSAriori with Suort=0.5% for dataset T10I4D100K V. CONCLUSION In the roosed method the minimum suort is calculated dynamically. Instead of using the user secified minimum suort we use the calculated minimum suort for each itemset generation and for rule generation. Thus it generates more relevant and meaningful rules. In fact the running time of the roosed method is not taken into account, it still leaves the user free from secifying minimum suort and guarantees the generation of interesting association rules

5 REFERENCES [1] R. Agrawal, T. Imielinski, and A. Swami, Mining association rules between sets of items in large databases, Proceedings of 1993 ACM-SIGMOD International Conference on Management of Data, 1993, [2] S. Brin, R. Motwani, J.D. Ullman, and S. Tsur, Dynamic itemset counting and imlication rules for marketbasket data, Proceedings of 1997 ACM-SIGMOD International Conference on Management of Data, 1997, [3] J. Han and Y. Fu, Discovery of multile-level association rules from large databases, Proceedings of the21st International Conference on Very Large Data Base 1995, [4] B. Liu, W. Hsu, and Y. Ma, Mining association rules with multile minimum suorts, Proceedings of 1999 ACM-SIGKDD International Conference on Knowledge Discovery and Data Mining 1999, [5] M.C. Tseng and W.Y. Lin, Mining generalized association rules with multile minimum suorts, Proceedings of International Conference on Data Warehousing and Knowledge Discovery 2001, [6] K. Wang, Y. He, and J. Han, Mining frequent itemsets using suort constraints, Proceedings of the 26th International Conference on Very Large Data Bases 2000, [7] M. Seno and G. Karyis, LPMiner: An algorithm for finding frequent itemsets using length-decreasing suort constraint, Proceedings of the 1st IEEE International Conference on Data Mining [8] J. Li and X. Zhang, Efficient mining of high confidence association rules without suort thresholds, Proceedings of the 3rd Euroean Conference on Princiles and Practice of Knowledge Discovery in Databases [9] C.C. Aggarwal and P.S. Yu, A new framework for itemset generation, Proceedings of the 17th ACM Symosium on Princiles of Database Systems 1998, [10] S. Brin, R. Motwani, and C. Silverstein, Beyond market baskets: generalizing association rules to correlations, Proceedings of 1997 ACM-SIGMOD International Conference on Management of Data 1997, [11] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J.D. Ullman, and C. Yang, Finding interesting associations without suort runing, Proceedings of IEEE International Conference on Data Engineering 2000, [12] R. Agrawal and R. Srikant, Fast algorithms for mining association rules, Proceedings of the 20th International Conference on Very Large Data Bases 1994, [13] J. Han, J. Pei, and Y. Yin., Mining Frequent Patterns without Candidate Generation, Proc ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'00),Dallas, TX, May [14] Ngan, S-C., Lam, T.,Wong, R.C-W. and Fu, A.W-C. (2005), Mining N-most interesting itemsets without suort threshold by the COFI-tree, Vol. 1, No. 1, Int. J. Business Intelligence and Data Mining, [15] Y. Cheung and A. Fu. Mining frequent itemsets without suort threshold: with and without item constraints, IEEE Trans. on Knowledge and Data Engineering, 16(9): , [16] Wen-Yang Lin and Ming-Cheng Tseng,, Automated suort secification for efficient mining of interesting association rules, Journal of Information Science,Volume 32, Issue 3 (June 2006) : , 2006,ISSN:

Mining Frequent Itemsets Along with Rare Itemsets Based on Categorical Multiple Minimum Support

IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 6, Ver. IV (Nov.-Dec. 2016), PP 109-114 www.iosrjournals.org Mining Frequent Itemsets Along with Rare