Data Structures Notes for Lecture 14
Techniques of Data Mining
By Samaher Hussein Ali
2009-2010

Association Rules: Basic Concepts and Application

1. Association rules: Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

2. Definition: Frequent Itemset
Itemset
- A collection of one or more items, e.g. {Milk, Bread, Diaper}.
- k-itemset: an itemset that contains k items.
Support count (σ)
- Frequency of occurrence of an itemset, e.g. σ({Milk, Bread, Diaper}) = 2.
Support (s)
- Fraction of transactions that contain an itemset, e.g. s({Milk, Bread, Diaper}) = 2/5.
Frequent Itemset
- An itemset whose support is greater than or equal to a minsup threshold.

3. Definition: Association Rule
Association rules are one of the most promising aspects of data mining as a knowledge discovery tool, and have been widely explored to date. They make it possible to capture all possible rules that explain the presence of some attributes according to the presence of other attributes. An association rule is a statement that, for a specified fraction of transactions, a particular value of attribute set X determines the value of attribute set Y as another particular value under a certain confidence. Thus, association rules aim at discovering the patterns of co-occurrence of attributes in a database. Association rules may be useful in many applications, such as supermarket transaction analysis, store layout and promotions on items, telecommunications alarm correlation, university course enrollment analysis, customer behavior analysis in retailing, catalog design, word occurrence in text documents, users' visits to WWW pages, military mobilization, stock transactions, etc.

4. Mining Association Rules: Two-step approach
A. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup.
B. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.

As a result:
Association Rule
- An implication expression of the form X ⇒ Y, where X and Y are itemsets. Example: {Milk, Diaper} ⇒ {Beer}.
Rule Evaluation Metrics
- Support (s): the fraction of transactions that contain both X and Y.
- Confidence (c): measures how often items in Y appear in transactions that contain X.
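To make these metrics concrete, the definitions above can be computed directly. The following Python sketch is an addition to these notes (the lecture gives no code); the five transactions are hypothetical market-basket data chosen to reproduce the Milk/Bread/Diaper counts quoted earlier, and are not the database of Table 1.

```python
# Hypothetical five-transaction database behind the sigma = 2, s = 2/5 example.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Cola"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Cola"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions containing X."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(X, Y, transactions):
    """c(X => Y) = sigma(X union Y) / sigma(X)."""
    return support_count(X | Y, transactions) / support_count(X, transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))    # 2/3, about 0.667
```

Note that confidence is just the support count of the whole rule divided by the support count of its left-hand side, which is why rule generation (section 4.B) needs no further database scans once the frequent itemsets and their counts are known.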
5. The Apriori Algorithm
Apriori is an efficient association rule mining algorithm. It employs breadth-first search and uses a hash-tree structure to count candidate itemsets efficiently. The algorithm generates candidate itemsets (patterns) of length k from itemsets of length k-1. Then, the candidates that have an infrequent sub-pattern are pruned according to the minimum support. The whole transaction database is scanned to determine the frequent itemsets among the remaining candidates. To determine frequent itemsets quickly, the algorithm stores the candidate itemsets in a hash tree.
Note: A hash tree has itemsets at the leaves and hash tables at internal nodes. The following figures illustrate the hash tree structure and the Apriori principle.
Figure 1: Illustrating Hash Tree Structure
Figure 2: Illustrating Apriori Principle
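This level-wise search can be sketched in a few dozen lines of Python. The sketch below is an addition to these notes: it counts candidates with a plain dictionary rather than the hash tree (which is an efficiency device, not required for correctness), and its join step simply unions pairs of frequent (k-1)-itemsets.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori sketch.

    transactions: iterable of sets of items.
    Returns {frozenset(itemset): support_count} for every frequent itemset.
    """
    # First database scan: count candidate 1-itemsets (C1) to obtain L1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        prev = list(frequent)
        candidates = {prev[i] | prev[j]
                      for i in range(len(prev))
                      for j in range(i + 1, len(prev))
                      if len(prev[i] | prev[j]) == k}
        # Prune step (Apriori property): a candidate survives only if
        # every one of its (k-1)-subsets is frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Database scan: count the surviving candidates, then keep Lk.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_sup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```

On the nine transactions of Table 1 below, with a minimum support count of 2, this sketch recovers the same frequent itemsets the worked example derives by hand, including L3 = {{I1, I2, I3}, {I1, I2, I5}}.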
5.1. Example
Let us look at an example based on the transaction database, D, of Table 1. There are nine transactions in this database, that is, |D| = 9. We use Figure 3 to illustrate the Apriori algorithm for finding frequent itemsets in D.
Table 1: Transactional Data for a Computer Branch
A. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.
B. Suppose that the minimum support count required is 2, that is, min_sup = 2. The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum support.
C. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2. C2 consists of the 2-itemsets that can be formed from pairs of items in L1. Note that no candidates are removed from C2 during the prune step because each subset of the candidates is also frequent.
D. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated, as shown in the middle table of the second row in Figure 3.
E. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
F. The generation of the set of candidate 3-itemsets, C3, is detailed in Figure 4. From the join step, we first get C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. We therefore remove them from C3, thereby saving the effort of unnecessarily obtaining their counts during the subsequent scan of D to determine L3.
Note that when given a candidate k-itemset, we only need to check whether its (k-1)-subsets are frequent, since the Apriori algorithm uses a level-wise search strategy. The resulting pruned version of C3 is shown in the first table of the bottom row of Figure 3.
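The join and prune of step F can be reproduced in a few lines. This Python sketch is an addition to these notes; for brevity it forms candidates by unrestricted pairwise unions instead of the lexicographic join, so it briefly generates one extra candidate, {I1, I2, I4}, which the prune step removes anyway because {I1, I4} is not in L2.

```python
from itertools import combinations

# L2 from the worked example (minimum support count 2).
L2 = {frozenset(p) for p in [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
                             ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]}

# Join step: form 3-itemsets by unioning pairs of frequent 2-itemsets.
C3 = {a | b for a in L2 for b in L2 if len(a | b) == 3}

# Prune step: a candidate survives only if every 2-subset is frequent.
C3_pruned = {c for c in C3
             if all(frozenset(s) in L2 for s in combinations(c, 2))}

print(sorted(sorted(c) for c in C3_pruned))
# [['I1', 'I2', 'I3'], ['I1', 'I2', 'I5']]
```

Only {I1, I2, I3} and {I1, I2, I5} survive the prune, matching the hand derivation of step F; the other candidates each contain an infrequent 2-subset such as {I3, I5} or {I4, I5}.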
G. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support (Figure 3).
H. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned because its subset {I2, I3, I5} is not frequent. Thus, C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets.
Figure 3: Generation of candidate itemsets and frequent itemsets, where the minimum support count is 2
Figure 4: Generation and pruning of candidate 3-itemsets, C3, from L2 using the Apriori property

Generating association rules:
Suppose the data contain the frequent itemset l = {I1, I2, I5}. What are the association rules that can be generated from l? The nonempty proper subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}. The resulting association rules are as shown below, each listed with its confidence:
I1 ∧ I2 ⇒ I5, confidence = 2/4 = 50%
I1 ∧ I5 ⇒ I2, confidence = 2/2 = 100%
I2 ∧ I5 ⇒ I1, confidence = 2/2 = 100%
I1 ⇒ I2 ∧ I5, confidence = 2/6 = 33%
I2 ⇒ I1 ∧ I5, confidence = 2/7 = 29%
I5 ⇒ I1 ∧ I2, confidence = 2/2 = 100%
If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules above are output, because these are the only ones generated that are strong. Note that, unlike conventional classification rules, association rules can contain more than one conjunct in the right-hand side of the rule.
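The rule-generation step above can be sketched as follows. This Python fragment is an addition to these notes; the support counts in `sigma` are read off Figure 3, and no extra database scan is needed because confidence is a ratio of already-known counts.

```python
from itertools import combinations

# Support counts of l = {I1, I2, I5} and all of its nonempty subsets (Figure 3).
sigma = {
    frozenset(["I1", "I2", "I5"]): 2,
    frozenset(["I1", "I2"]): 4,
    frozenset(["I1", "I5"]): 2,
    frozenset(["I2", "I5"]): 2,
    frozenset(["I1"]): 6,
    frozenset(["I2"]): 7,
    frozenset(["I5"]): 2,
}

def rules_from_itemset(l, sigma, min_conf):
    """Emit every rule X => (l - X), for nonempty proper subsets X of l,
    whose confidence sigma(l) / sigma(X) meets min_conf."""
    strong = []
    for r in range(1, len(l)):
        for X in combinations(sorted(l), r):
            X = frozenset(X)
            conf = sigma[l] / sigma[X]
            if conf >= min_conf:
                strong.append((set(X), set(l - X), conf))
    return strong

l = frozenset(["I1", "I2", "I5"])
for X, Y, conf in rules_from_itemset(l, sigma, 0.70):
    print(f"{sorted(X)} => {sorted(Y)}  (confidence {conf:.0%})")
```

With min_conf = 70% this prints exactly the three strong rules identified above: I5 ⇒ I1 ∧ I2, I1 ∧ I5 ⇒ I2, and I2 ∧ I5 ⇒ I1, each at 100% confidence.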