CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM


Ann Booker
5 years ago


5.1. INTRODUCTION

Association rule mining is one of the most important and well-researched techniques of data mining [87]. It aims to extract interesting correlations, frequent patterns, associations or causal structures among sets of items in transaction databases or other data repositories. Association rules are used in various areas such as telecommunication networks, market and risk management, and inventory control. In this chapter, a new rule is adapted for association rule mining for Market Basket Analysis, called Adaptive Association Rule Mining.

ASSOCIATION RULE MINING AND ITS CHALLENGES

Association rule mining finds the association rules that satisfy predefined minimum support and confidence values in a given database. The problem is usually decomposed into two sub-problems. The first is to find those itemsets whose occurrences exceed a predefined threshold in the database; those itemsets are called frequent or large itemsets. The second is to generate association rules from those large itemsets under the constraint of minimal confidence.

Suppose one of the large itemsets is $L_k = \{I_1, I_2, \ldots, I_k\}$. Association rules with this itemset are generated in the following way: the first rule is $\{I_1, I_2, \ldots, I_{k-1}\} \Rightarrow \{I_k\}$, and checking its confidence determines whether the rule is interesting. The other rules are generated by deleting the last item in the antecedent and inserting it into the consequent; the confidences of the new rules are then checked to determine their interestingness. This process is iterated until the antecedent becomes empty. Since the second sub-problem is quite straightforward, most research focuses on the first sub-problem.

The first sub-problem can be further divided into two parts: the candidate large itemset generation process and the frequent itemset generation process. Itemsets whose support exceeds the support threshold are called large or frequent itemsets; itemsets that are expected or hoped to be large or frequent are called candidate itemsets.
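The rule-generation step described above can be sketched in Python as follows. This is a minimal illustration, not the chapter's implementation: the function names `support_count` and `rules_from_itemset`, the toy transactions and the 60% confidence threshold are all assumptions introduced for the example.

```python
from typing import List, Sequence, Set, Tuple

# (antecedent, consequent, confidence)
Rule = Tuple[Tuple[str, ...], Tuple[str, ...], float]


def support_count(transactions: Sequence[Set[str]], items: Sequence[str]) -> int:
    """Number of transactions that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for t in transactions if wanted <= t)


def rules_from_itemset(
    transactions: Sequence[Set[str]],
    itemset: Sequence[str],
    min_conf: float,
) -> List[Rule]:
    """Generate rules from one ordered large itemset (I1, ..., Ik).

    Starts with {I1..Ik-1} => {Ik}, then repeatedly moves the last
    antecedent item into the consequent until the antecedent is empty.
    """
    whole = support_count(transactions, itemset)
    if whole == 0:  # the itemset never occurs, so no rule can be formed
        return []
    antecedent = list(itemset[:-1])
    consequent = [itemset[-1]]
    rules: List[Rule] = []
    while antecedent:
        # confidence = count(antecedent and consequent) / count(antecedent)
        conf = whole / support_count(transactions, antecedent)
        if conf >= min_conf:
            rules.append((tuple(antecedent), tuple(consequent), conf))
        consequent.insert(0, antecedent.pop())
    return rules


# Hypothetical toy data, used only to exercise the sketch.
toy_transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk"},
]
toy_rules = rules_from_itemset(toy_transactions, ("bread", "butter", "milk"), 0.6)
```

With these toy transactions only {bread, butter} => {milk} passes the threshold (confidence 2/3); the next rule, {bread} => {butter, milk}, has confidence 2/4 and is rejected.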

In many cases, the algorithms generate an extremely large number of association rules, often thousands or even millions. Further, the association rules themselves are sometimes very large. It is nearly impossible for end users to comprehend or validate such a large number of complex association rules, which limits the usefulness of the data mining results. Several strategies have been proposed to reduce the number of association rules, such as generating only interesting rules, generating only non-redundant rules, or generating only those rules that satisfy certain other criteria such as coverage, leverage, lift or strength.

Association Rules

The basic task in mining for association rules is to determine the correlation between items belonging to a transactional database. Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of literals, called items. Let $D$ be a set of transactions, where each transaction $T \in D$ is a set of items such that $T \subseteq I$. Each transaction is identified by a label called a transaction identifier (TID). An association rule is an implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$.

In general, every association rule must satisfy two user-specified constraints: support and confidence. The support of a rule $X \Rightarrow Y$ is defined as the fraction of transactions that contain $X \cup Y$, while the confidence is defined as the ratio $\text{support}(X \cup Y) / \text{support}(X)$. Hence, the target is to find all association rules that satisfy the user-specified minimum support and confidence values. Such rules are called strong rules to distinguish them from the weak ones. It has been shown that the problem of discovering association rules can be reduced to two sub-steps:

1. Find all frequent itemsets for a predetermined support.
2. Generate the association rules from the frequent itemsets.

The association rule $X \Rightarrow Y$ holds with support $s$, where $s$ is the percentage of transactions in $D$ that contain $X \cup Y$, i.e. the union of sets $X$ and $Y$. This is taken to be the probability $P(X \cup Y)$.
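The two measures defined above can be checked on a small example. The following Python sketch is only an illustration under stated assumptions: the helper names `support`, `confidence` and `is_strong` and the four-transaction database `D` are hypothetical and do not come from the chapter.

```python
from typing import Iterable, Sequence, Set


def support(transactions: Sequence[Set[str]], items: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for t in transactions if wanted <= t) / len(transactions)


def confidence(transactions: Sequence[Set[str]],
               x: Iterable[str], y: Iterable[str]) -> float:
    """support(X u Y) / support(X); assumes X occurs in at least one transaction."""
    x = set(x)
    return support(transactions, x | set(y)) / support(transactions, x)


def is_strong(transactions: Sequence[Set[str]],
              x: Iterable[str], y: Iterable[str],
              min_sup: float, min_conf: float) -> bool:
    """A rule is strong when it meets both thresholds (min_sup assumed > 0)."""
    x, y = set(x), set(y)
    return (support(transactions, x | y) >= min_sup
            and confidence(transactions, x, y) >= min_conf)


# Hypothetical database of four transactions.
D = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
s = support(D, {"a", "b"})        # 2 of the 4 transactions contain both a and b
c = confidence(D, {"a"}, {"b"})   # (2/4) / (3/4)
```

Here the rule {a} => {b} has support 0.5 and confidence 2/3, so it counts as strong for a minimum confidence of 60% but not for 70%.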

The association rule $X \Rightarrow Y$ has confidence $c$ in the transaction set $D$, where $c$ is the percentage of transactions in $D$ containing $X$ that also contain $Y$. This is taken to be the conditional probability $P(Y \mid X)$. That is,

$$s(X \Rightarrow Y) = P(X \cup Y)$$

$$c(X \Rightarrow Y) = P(Y \mid X) = \frac{s(X \cup Y)}{s(X)}$$

Rules that satisfy both a minimum support threshold (min sup) and a minimum confidence threshold (min conf) are called strong. A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. An itemset whose support is at least a prescribed minimum support threshold, min sup, is referred to as a frequent itemset. The set of all the frequent k-itemsets in $D$ is commonly denoted by $L_k$.

There are two phases in the problem of mining association rules:

1. Find all the frequent itemsets, i.e. all itemsets that have support $s$ above a predetermined minimum threshold.
2. Generate strong association rules from the frequent itemsets; these association rules must have confidence $c$ above a predetermined minimum threshold.

After the large itemsets are identified, the corresponding association rules can be derived in a relatively straightforward manner. Thus, the overall performance of mining association rules is determined primarily by the first step, and efficient counting of large itemsets is the focus of most association rule mining algorithms.

Finding the Frequent Itemsets: An Example

Here, a very simplified version of such a transaction database $D$ is considered. There are a total of 9 transactions involving a total of 6 items. Each item in $D$ is labeled by a positive number. Transaction 001 is a point-of-sale purchase of items 1, 2 and 5.

Transaction 002 is a joint purchase of items 2 and 4, and so on. Note that the items within each transaction are sorted lexicographically.

Table 5.1: Sample Transaction

TID    Items
001    1, 2, 5
002    2, 4
003    2, 3, 6
004    1, 2, 4
005    1, 3
006    2, 3
007    1, 3
008    1, 2, 3, 5
009    1, 2, 3

The task here is to derive association rules with a minimum confidence threshold, min conf, of 70% between itemsets in $D$ that have a support count of at least 2. This means that the minimum support threshold, min sup, is given by 2/9 ≈ 22%. Since this database has very few transactions and each transaction contains only a small number of items, the calculation is easy to do manually. The first job is to find all the itemsets that have a support count sc ≥ 2, which is easily done by enumeration:

{1} with support count of 6;
{2} with support count of 7;
{3} with support count of 6;
{4} with support count of 2;
{5} with support count of 2;
{1, 2} with support count of 4;
{1, 3} with support count of 4;
{1, 5} with support count of 2;

{2, 3} with support count of 4;
{2, 4} with support count of 2;
{2, 5} with support count of 2;
{1, 2, 3} with support count of 2;
{1, 2, 5} with support count of 2.

Out of a possible $2^6 - 1 = 63$ potential sets, there are only 13 such frequent itemsets.

Challenges

In real-world applications, finding all frequent itemsets in a database is a nontrivial problem because [12]:

- The number of transactions in the database can be very large and may not fit in the memory of the computer. Recall that Walmart has 20 million transactions/day and a 10 terabyte database.
- The potential number of frequent itemsets is exponential in the number of different items, although the actual number of frequent itemsets can be much smaller.

Often a huge number of frequent itemsets are generated, especially if min sup is low. This is because if an itemset is frequent, each of its subsets is frequent as well. A long itemset will contain a combinatorial number of shorter, frequent sub-itemsets. For example, a frequent itemset of length 100, such as $\{a_1, a_2, \ldots, a_{100}\}$, contains $\binom{100}{1} = 100$ frequent 1-itemsets: $\{a_1\}, \{a_2\}, \ldots, \{a_{100}\}$; $\binom{100}{2} = 4950$ frequent 2-itemsets: $\{a_1, a_2\}, \{a_1, a_3\}, \ldots, \{a_{99}, a_{100}\}$; and so on. The total number of frequent itemsets that it contains is thus

$$\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}.$$

It is necessary to develop algorithms that are scalable (their complexity should increase linearly, not exponentially, with the number of transactions).

APRIORI RULE MINING AND ITS PROPERTIES

- Efficient mechanism for finding association rules by efficiently finding sets of items (itemsets) that meet a minimal support criterion

- Builds itemsets of size K that meet the support criterion by using the itemsets of size K-1
- Makes use of certain principles regarding itemsets to limit the work that needs to be done
- Each size of itemset is constructed in one pass through the database

Apriori Properties

For some support threshold S:

- If any itemset I meets the threshold S, then any non-empty subset of I also meets the threshold.
- If any itemset I does not meet the threshold S, any superset (including 0 or more other items) will also not meet the threshold.

These properties can be used to prune the search space and reduce the amount of work that needs to be done.

One major shortcoming of association rule mining is that the support-confidence framework often generates too many rules. Although the conventional Apriori algorithm can identify meaningful itemsets and construct association rules, it suffers from the disadvantage of generating numerous candidate itemsets that must be repeatedly compared against the entire database. However, there have been a number of modifications and extensions to improve the Apriori algorithm.

ADAPTIVE ASSOCIATION RULE MINING

This section illustrates how the algorithm to mine association rules may be used for market basket analysis. The most important differences between the implementations of these association types are the handling of the transaction data used to mine the association rules and the recommendation strategies.

Mapping Ratings to Transactions

The conversion from the item ratings available for recommendation tasks to the transactions required for association rule mining is determined by the kind and the levels of associations that are to be discovered. The numeric rating for an item is then

mapped into two categories, like and dislike, according to whether the rating for the item is greater than or less than some chosen threshold value. The like and dislike ratings are then converted into transactions:

i) To obtain like associations among users, each user corresponds to an item, and each item rated by the users corresponds to a transaction. If a user likes an item, the transaction corresponding to that item contains the entry for that user. If the user dislikes or did not rate the item, the corresponding transaction does not include the entry for that user. The mined rules will then be of the following form: 90% of the items liked by user A and user B are also liked by user C, and 30% of all articles are liked by all of them; or, in simpler notation, [user_a: like] AND [user_b: like] => [user_c: like] with confidence 90% and support 30%.

ii) To mine like associations among items, each item corresponds to an item, and each training user who rated the target item corresponds to a transaction. If a training user likes an item, the transaction corresponding to that user contains that item. If the user dislikes or did not rate the item, the corresponding transaction does not include it. From here, like associations among articles can be mined, for example: [item_1: like] AND [item_4: like] => [target-item: like] with confidence 100% and support 40%.

Recommendation Strategy

A) User Associations

For user associations, the mined rules are of the form [training-user_1: like] AND [training-user_2: like] => [target-user: like]. If training-user 1 likes a test item and training-user 2 also likes this item, the rule is said to fire for this item. Each rule is associated with a score, which is the product of the support and the confidence of the rule. Each item is then assigned a score, which is the sum of the scores of all the rules that fire for that item. If the score for an item i is greater than the threshold, item i is recommended to the target user.
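The scoring scheme for user associations can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `UserRule` class, the function names, the two example rules, the user labels and the threshold of 0.25 are hypothetical and were chosen only to show how rule scores add up.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Sequence, Set


@dataclass(frozen=True)
class UserRule:
    """A rule [u1: like] AND [u2: like] ... => [target-user: like]."""
    body: FrozenSet[str]   # training users that must all like the item
    support: float
    confidence: float

    @property
    def score(self) -> float:
        # Rule score = support x confidence, as described in the text.
        return self.support * self.confidence


def item_score(rules: Sequence[UserRule], likers: Set[str]) -> float:
    """Sum the scores of all rules that fire, i.e. whose body users all like the item."""
    return sum(r.score for r in rules if r.body <= likers)


def recommend(rules: Sequence[UserRule],
              likes_by_item: Dict[str, Set[str]],
              threshold: float) -> List[str]:
    """Return the items whose accumulated score exceeds the threshold."""
    return [item for item, likers in likes_by_item.items()
            if item_score(rules, likers) > threshold]


# Hypothetical rules and test items.
example_rules = [
    UserRule(frozenset({"A", "B"}), support=0.3, confidence=0.9),  # score 0.27
    UserRule(frozenset({"A"}), support=0.4, confidence=0.5),       # score 0.20
]
example_likes = {"x": {"A", "B"}, "y": {"A"}, "z": {"C"}}
recommended = recommend(example_rules, example_likes, threshold=0.25)
```

Both rules fire for item x (score 0.47), only the second fires for y (score 0.20) and none fires for z, so with a threshold of 0.25 only x is recommended.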

B) Item Associations

For item associations, the rules are of the form [item_1: like] AND [item_2: like] => [target-item: like]. For a test item of the target user, if the user likes item 1 and item 2 (which can be known from the training items of the user), the rule is said to fire for this item. The recommendation strategy for item associations differs from that for user associations. Here, only items whose rules have support above a cutoff are taken into consideration. The support cutoff is adjusted during system tuning, and the mining process is then restricted to rules whose support is above the cutoff. This mining process has the following advantages over algorithms such as Apriori [8] or CBA:

i. By mining item associations for one item at a time, only the ratings related to the target item are used for mining, which is only a small subset of the whole rating data. The support of a rule is calculated over this small subset, which makes it possible to obtain rules for items that have received only a limited number of ratings, for example a new product.

ii. A considerable amount of runtime is saved by mining rules only over the subset of the rating data that is associated with the target item rather than over the whole data. Systems that mine rules with unrestricted heads, for instance IBM's Intelligent Miner, can easily take several days to mine item associations for all articles at once.

C) Combined Associations Mode

The following strategy is used to combine user and item associations: if a user's minimum support is greater than a threshold, user associations are used for recommendation; otherwise, item associations are used.

A NEW ADAPTIVE ASSOCIATION RULE MINING ALGORITHM (AARMA)

This approach relies on information about relationships between different users' preferences in order to suggest items of potential interest to the target user. This algorithm

adjusts the minimum support of the rules during mining in order to obtain an appropriate number of significant rules for the target predicate. The new AARMA consists of two parts: AARMA-1 and AARMA-2.

AARMA-1

To mine only a specified number of the most promising rules for each target item, AARMA-1 controls the minimum support count and finds the rules with the highest supports. The minimum support count is the smallest number of transactions that must satisfy a rule for that rule to be frequent; specifically, it is the product of the minimum support and the total number of transactions. The overall process of the AARMA-1 algorithm is shown in Fig 5.1.

Input: transactions, targetitem, minconfidence, minrulenum, maxrulenum
Output: minedrulenum

1) set initial minsupportcount based on targetitem's like ratio;
2) R = AARMA-2();
3) while (R.rulenum = maxrulenum) do
4)     minsupportcount++;
5)     R1 = AARMA-2();
6)     if R1.rulenum > minrulenum then R = R1;
7)     else return R;
8) end
9) while (R.rulenum < minrulenum) do
10)     minsupportcount--;
11)     R = AARMA-2();
12) end
13) return R;

Fig 5.1: The AARMA-1 Algorithm
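The control loop of Fig 5.1 can be sketched in Python as shown below. This is an illustrative sketch under stated assumptions, not the author's code: the miner is passed in as a callable `mine(min_support_count)` standing in for AARMA-2, the guard `count > 1` is an added safeguard against lowering the count indefinitely, and `toy_miner` with its list of support counts is a hypothetical stand-in used only to exercise the loop.

```python
from typing import Callable, List


def aarma1(mine: Callable[[int], List[int]],
           initial_count: int,
           min_rule_num: int,
           max_rule_num: int) -> List[int]:
    """Adjust the minimum support count until the rule count is acceptable."""
    count = initial_count
    rules = mine(count)
    # The miner stopped at the cap: the support count is too low, so raise it.
    while len(rules) == max_rule_num:
        count += 1
        candidate = mine(count)
        if len(candidate) > min_rule_num:
            rules = candidate
        else:
            return rules
    # Too few rules: lower the support count (guard added for this sketch).
    while len(rules) < min_rule_num and count > 1:
        count -= 1
        rules = mine(count)
    return rules


# Hypothetical stand-in for AARMA-2: each "rule" is represented by its
# support count, and at most MAX_RULES rules are returned.
RULE_SUPPORT_COUNTS = [9, 8, 7, 6, 5, 4, 3, 2]
MAX_RULES = 4


def toy_miner(min_support_count: int) -> List[int]:
    kept = [s for s in RULE_SUPPORT_COUNTS if s >= min_support_count]
    return kept[:MAX_RULES]


raised = aarma1(toy_miner, initial_count=3, min_rule_num=2, max_rule_num=MAX_RULES)
lowered = aarma1(toy_miner, initial_count=10, min_rule_num=2, max_rule_num=MAX_RULES)
```

Starting from a count of 3, the toy miner keeps hitting the cap of four rules until the count reaches 7, where three rules remain; starting from 10, no rule qualifies and the count is lowered to 8, where two rules are found.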

The working of the above algorithm is described below:

1. AARMA-1 initializes the minimum support count based on the frequency of the target predicate and calls AARMA-2 to mine rules. When AARMA-2 returns, AARMA-1 first checks whether the number of rules returned is equal to maxrulenum (as described below, AARMA-2 terminates the mining process when the number of rules generated reaches maxrulenum). If it is, the minimum support count is too low and would yield more than maxrulenum rules, so AARMA-1 keeps raising the minimum support count and calling AARMA-2 until the number of rules is less than maxrulenum.

2. Lastly, AARMA-1 checks whether the number of rules is less than minrulenum; if it is, it keeps decreasing the minimum support count until the number of rules is greater than or equal to minrulenum.

Within a given support, rules with smaller bodies are mined first. Therefore, if with a minimum support count of, say, 16 there is no rule available, but with a minimum support count of 15 there are at least maxrulenum rules, AARMA-1 will return the shortest maxrulenum rules with a support count of at least 15.

AARMA-2 is a variant of CBA-RG, and therefore of the Apriori algorithm as well. It is a variant of CBA-RG in the sense that, rather than mining rules for all target classes, it mines rules for only one target item. It also differs from CBA-RG in that it mines only a number of rules within a specified range: when it attempts to produce a new rule after having already acquired maxrulenum rules, it terminates and returns the rules mined so far. The AARMA-2 algorithm is presented in Fig 5.2.

Input: transactions, targetitem, minconfidence, maxrulenum, minsupportcount
Output: mined association rules

1) F_1 = {frequent 1-condsets};
2) R = genRules(F_1);
3) if R.rulenum = maxrulenum then return R;
4) for (k = 2; F_{k-1} ≠ ∅; k++) do begin
5)     C_k = candidateGen(F_{k-1});
6)     for each transaction t ∈ T do begin
7)         C_t = all candidate condsets of C_k contained in t;
8)         for each candidate c ∈ C_t do begin
9)             c.condsupcount++;
10)             if t contains targetitem then c.rulesupcount++;
11)         end
12)     end
13)     F_k = {c ∈ C_k | c.rulesupcount ≥ minsupportcount};
14)     R = R ∪ genRules(F_k);
15)     if R.rulenum = maxrulenum then return R;
16) end
17) return R;

Fig 5.2: AARMA-2 Algorithm

Here, k-condset denotes a set of items (an itemset) of size k that may form a rule k-condset => target-item. The support count of the k-condset, called condsupcount, is the number of transactions that include the k-condset. The support count of the corresponding rule (also called the rulesupcount of this k-condset) is the number of transactions that include both the condset and the target item.

As stated above, AARMA-2 is very similar to CBA-RG. Association rules are produced by making multiple passes over the transaction data. The first pass calculates the rulesupcounts and the condsupcounts of all the individual items and discovers the

frequent 1-condsets. For pass k > 1, it produces the candidate frequent k-condsets from the frequent (k-1)-condsets. It then scans all the transactions to count the rulesupcounts and the condsupcounts of all the candidate k-condsets. Finally, it goes over all the candidate k-condsets, selecting those whose rulesupcount is at least the minimum support count as frequent k-condsets and simultaneously generating the rules k-condset => target-item whose confidence is above the minimum confidence.

DISCUSSION

Existing association rule mining algorithms suffer from excessive execution time and from the generation of too many association rules. It is also difficult to choose a proper minimum confidence and support for each item before the mining process because of users' interests. If the minimum confidence and support for mining are set too high, not enough rules for accurate combination will be obtained. The new Adaptive Association Rule Mining Algorithm guides users towards the selection of the best combination of items in a supermarket. By incorporating the similarities between the rules and the active user and the confidence of the weighted rules, it is possible to choose only the most appropriate combination of items for display to the user. When the minimum confidence and support for mining are low, it decreases the execution time and increases the accuracy of the results.

SUMMARY

In this chapter, a novel and effective approach for Market Basket Analysis based on the Adaptive Association Rule Mining Algorithm has been introduced. Various techniques are available for selecting the best combination of itemsets in the supermarket for better sales. Most association rule mining algorithms suffer from excessive execution time and from generating too many association rules. Moreover, it is difficult to choose a proper minimum confidence and support for each item before the mining process because of users' interests. If the minimum confidence and support for mining are set too high, not enough rules for accurate combination will be obtained.

This chapter introduces a new Adaptive Association Rule Mining Algorithm to guide users towards the selection of the best combination of items in a supermarket. By incorporating the similarities between the rules and the active user and the confidence of the weighted rules, it is possible to choose only the most appropriate combination of items for display to the user. This algorithm adjusts the minimum support of the rules during mining in order to obtain an appropriate number of significant rules for the target item. Hence, it makes shopping easier for customers, which in turn increases the sales rate of the supermarket. This approach mainly analyzes the process of discovering association rules in large repositories of this kind. Thus, this approach is significant for effective Market Basket Analysis: it helps customers purchase their items more comfortably, which in turn increases the sales rate of the markets.
