Association Rules — Apriori Algorithm

Market basket analysis
- Market basket analysis might tell a retailer that customers often purchase shampoo and conditioner together
- Putting both items on promotion at the same time would not create a significant increase in revenue
- A promotion involving just one of the items, however, would likely drive sales
Association Rules
- Discover co-occurrence relationships between items
- Besides market basket data, association analysis is also applicable to other application domains:
  - bioinformatics
  - medical diagnosis
  - Web mining
  - scientific data analysis
- A widely used example of cross-selling on the web through market basket analysis is Amazon.com's "customers who bought book A also bought book B"
Sales Transaction Table
- We would like to perform a basket analysis of the set of products in a single transaction
- Discovering, for example, that a customer who buys shoes is likely to buy socks: Shoes → Socks
Transactional Database
- The set of all sales transactions is called the population
- We represent the transactions with one record per transaction
- Each transaction is represented by a data tuple:

  TX1: Shoes, Socks, Tie
  TX2: Shoes, Socks, Tie, Belt, Shirt
  TX3: Shoes, Tie
  TX4: Shoes, Socks, Belt

- Example rule: Socks → Tie
  - Socks is the rule antecedent
  - Tie is the rule consequent
Support and Confidence
- Any given association rule has a support level and a confidence level
- Support is the percentage of the population that satisfies the rule
- Confidence is the percentage, among the transactions in which the antecedent is satisfied, in which the consequent is also satisfied
- For the rule Socks → Tie over the transactional database above:

  TX1: Shoes, Socks, Tie
  TX2: Shoes, Socks, Tie, Belt, Shirt
  TX3: Shoes, Tie
  TX4: Shoes, Socks, Belt

  - Support is 50% (2/4)
  - Confidence is 66.67% (2/3)
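The two measures above can be computed directly from the transaction list. A minimal sketch (the transactions and item names are taken from the slides; the function names are illustrative):

```python
# Support and confidence of the rule Socks -> Tie over the four
# transactions from the slide.
transactions = [
    {"Shoes", "Socks", "Tie"},
    {"Shoes", "Socks", "Tie", "Belt", "Shirt"},
    {"Shoes", "Tie"},
    {"Shoes", "Socks", "Belt"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A u B) / support(A)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"Socks", "Tie"}, transactions))       # 0.5  (2/4)
print(confidence({"Socks"}, {"Tie"}, transactions))  # 0.666... (2/3)
```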
Apriori Algorithm
- Mining for associations among items in a large database of sales transactions is an important database mining function
- For example, the information that a customer who purchases a keyboard also tends to buy a mouse at the same time is represented in the association rule below:
  - Keyboard → Mouse [support = 6%, confidence = 70%]

Association Rules
- Based on the types of values, association rules can be classified into two categories: Boolean association rules and quantitative association rules
  - Boolean association rule: Keyboard → Mouse [support = 6%, confidence = 70%]
  - Quantitative association rule: (Age = 26...30) → (Cars = 1, 2) [support = 3%, confidence = 36%]
Minimum Support Threshold
- The support of an association pattern is the percentage of task-relevant data transactions for which the pattern is true
- For a rule A → B:

  support(A → B) = P(A ∪ B)
                 = (# tuples containing both A and B) / (total # of tuples)

Minimum Confidence Threshold
- Confidence is defined as the measure of certainty or trustworthiness associated with each discovered pattern
- For a rule A → B:

  confidence(A → B) = P(B | A)   (the probability of B given that all we know is A)
                    = (# tuples containing both A and B) / (# tuples containing A)
Itemset
- A set of items is referred to as an itemset
- An itemset containing k items is called a k-itemset
- An itemset can be seen as a conjunction of items

Frequent Itemset
- Suppose min_sup is the minimum support threshold
- An itemset satisfies minimum support if its occurrence frequency is greater than or equal to min_sup
- An itemset that satisfies minimum support is called a frequent itemset
Strong Rules
- Rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong

Association Rule Mining
1. Find all frequent itemsets
2. Generate strong association rules from the frequent itemsets
- The Apriori algorithm mines frequent itemsets for Boolean association rules
Apriori Algorithm
- Level-wise search: frequent k-itemsets (itemsets with k items) are used to explore the (k+1)-itemsets of transactional databases for Boolean association rules
  - First, the set of frequent 1-itemsets is found (denoted L1)
  - L1 is used to find L2, the set of frequent 2-itemsets
  - L2 is used to find L3, and so on, until no frequent k-itemsets can be found
- Strong association rules are then generated from the frequent itemsets
- Key property: if an itemset is frequent, then all of its subsets must also be frequent
Example (min_sup = 2)

Database TDB:
  Tid  Items
  10   A, C, D
  20   B, C, E
  30   A, B, C, E
  40   B, E

1st scan → C1:
  {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
L1 (frequent 1-itemsets):
  {A}: 2, {B}: 3, {C}: 3, {E}: 3

2nd scan → C2:
  {A,B}: 1, {A,C}: 2, {A,E}: 1, {B,C}: 2, {B,E}: 3, {C,E}: 2
L2 (frequent 2-itemsets):
  {A,C}: 2, {B,C}: 2, {B,E}: 3, {C,E}: 2

3rd scan → C3:
  {B,C,E}
L3 (frequent 3-itemsets):
  {B,C,E}: 2

- The name of the algorithm is based on the fact that it uses prior knowledge of frequent itemsets
- It employs an iterative approach known as level-wise search, where frequent k-itemsets are used to explore the (k+1)-itemsets
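The level-wise search on TDB can be reproduced in a few lines. This is a compact sketch, not an optimized implementation (the pruning step shown here is the Apriori property from the next slide):

```python
from itertools import combinations

# Level-wise Apriori on the TDB example (min_sup = 2, absolute count).
transactions = [
    {"A", "C", "D"},       # Tid 10
    {"B", "C", "E"},       # Tid 20
    {"A", "B", "C", "E"},  # Tid 30
    {"B", "E"},            # Tid 40
]
min_sup = 2

def count(itemset):
    """Number of transactions containing `itemset`."""
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-itemsets
items = sorted({i for t in transactions for i in t})
L = [{frozenset([i]) for i in items if count(frozenset([i])) >= min_sup}]

k = 2
while L[-1]:
    prev = L[-1]
    # candidate generation: unions of two frequent (k-1)-itemsets of size k
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # prune candidates that have an infrequent (k-1)-subset
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    L.append({c for c in candidates if count(c) >= min_sup})
    k += 1

print(L[0])  # {A}, {B}, {C}, {E}
print(L[1])  # {A,C}, {B,C}, {B,E}, {C,E}
print(L[2])  # {B,C,E}
```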
Apriori Property
- The Apriori property is used to reduce the search space
- Apriori property: all nonempty subsets of a frequent itemset must also be frequent
- It is anti-monotone in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well

Apriori Property
- Finding each L_k requires one full scan of the database (L_k is the set of frequent k-itemsets), so reducing the search space matters
- If an itemset I does not satisfy the minimum support threshold min_sup, then I is not frequent: P(I) < min_sup
- If an item A is added to the itemset I, the resulting itemset cannot occur more frequently than I; therefore I ∪ A is not frequent either: P(I ∪ A) < min_sup
Scalable Methods for Mining Frequent Patterns
- The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  - i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods, three major approaches:
  - Apriori (Agrawal & Srikant @ VLDB '94)
  - Frequent-pattern growth (FPgrowth: Han, Pei & Yin @ SIGMOD '00)
  - Vertical data format approach (Charm: Zaki & Hsiao @ SDM '02)

Algorithm
1. Scan the (entire) transaction database to get the support S of each 1-itemset, compare S with min_sup, and get the set of frequent 1-itemsets, L1
2. Join L_{k-1} with L_{k-1} to generate the set of candidate k-itemsets, and use the Apriori property to prune the infrequent ones
3. Scan the transaction database to get the support S of each candidate k-itemset, compare S with min_sup, and get the set of frequent k-itemsets, L_k
4. If the candidate set is not empty, go to step 2
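Step 2 (join plus prune) is the heart of the algorithm. A sketch of just that step, applied to the L2 from the earlier TDB example (the function name `apriori_gen` follows the common textbook naming; it is not prescribed by the slides):

```python
from itertools import combinations

# Candidate generation: join L_{k-1} with itself to form k-itemsets, then
# prune any candidate that has an infrequent (k-1)-subset (Apriori property).
def apriori_gen(L_prev, k):
    joined = {a | b for a in L_prev for b in L_prev if len(a | b) == k}
    return {c for c in joined
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

# L2 from the TDB example. The join produces {A,B,C}, {A,C,E}, and {B,C,E};
# the first two are pruned because {A,B} and {A,E} are not frequent.
L2 = {frozenset(s) for s in [{"A", "C"}, {"B", "C"}, {"B", "E"}, {"C", "E"}]}
print(apriori_gen(L2, 3))  # only {B,C,E} survives
```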
5. For each frequent itemset l, generate all nonempty subsets of l
6. For every nonempty subset s of l, output the rule s → (l − s) if its confidence c > min_conf

For example, with I = {A1, A2, A5} the candidate rules are:
  A1 ∧ A2 → A5
  A1 ∧ A5 → A2
  A2 ∧ A5 → A1
  A1 → A2 ∧ A5
  A2 → A1 ∧ A5
  A5 → A1 ∧ A2

Example
- Five transactions from a supermarket:

  TID  List of Items
  1    Beer, Diaper, Baby Powder, Bread, Umbrella
  2    Diaper, Baby Powder
  3    Beer, Diaper, Milk
  4    Diaper, Beer, Detergent
  5    Beer, Milk, Coca-Cola

  (diaper = "fralda" in Portuguese)
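The subset enumeration in step 5 can be sketched as follows; the confidence filtering of step 6 would additionally need the support counts, so it is omitted here. `candidate_rules` is an illustrative name, not from the slides:

```python
from itertools import combinations

# Step 5 sketch: for a frequent itemset I, emit a candidate rule
# s -> (I - s) for every nonempty proper subset s of I.
def candidate_rules(itemset):
    I = frozenset(itemset)
    rules = []
    for r in range(1, len(I)):          # subset sizes 1 .. |I|-1
        for s in combinations(sorted(I), r):
            s = frozenset(s)
            rules.append((s, I - s))
    return rules

for antecedent, consequent in candidate_rules({"A1", "A2", "A5"}):
    print(sorted(antecedent), "->", sorted(consequent))
# six candidate rules, matching the enumeration above
```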
Step 1 (min_sup = 40%, i.e. 2/5)
- C1 → L1

C1:
  Item          Support
  Beer          4/5
  Diaper        4/5
  Baby Powder   2/5
  Bread         1/5
  Umbrella      1/5
  Milk          2/5
  Detergent     1/5
  Coca-Cola     1/5

L1:
  Item          Support
  Beer          4/5
  Diaper        4/5
  Baby Powder   2/5
  Milk          2/5

Steps 2 and 3
- C2 → L2

C2:
  Itemset               Support
  Beer, Diaper          3/5
  Beer, Baby Powder     1/5
  Beer, Milk            2/5
  Diaper, Baby Powder   2/5
  Diaper, Milk          1/5
  Baby Powder, Milk     0

L2:
  Itemset               Support
  Beer, Diaper          3/5
  Beer, Milk            2/5
  Diaper, Baby Powder   2/5
Step 4 (min_sup = 40%, i.e. 2/5)
- C3 → empty

C3:
  Itemset                     Support
  Beer, Diaper, Baby Powder   1/5
  Beer, Diaper, Milk          1/5
  Beer, Milk, Baby Powder     0
  Diaper, Baby Powder, Milk   0

Step 5 (min_sup = 40%, min_conf = 70%)

  Rule (A → B)           Support(A ∪ B)  Support(A)  Confidence
  Beer → Diaper          60%             80%         75%
  Beer → Milk            40%             80%         50%
  Diaper → Baby Powder   40%             80%         50%
  Diaper → Beer          60%             80%         75%
  Milk → Beer            40%             40%         100%
  Baby Powder → Diaper   40%             40%         100%
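The confidence column above can be verified directly from the five transactions. A sketch for illustration (the helper names are not from the slides):

```python
# Checking the step-5 confidence figures against the five supermarket
# transactions (min_conf = 70%).
transactions = [
    {"Beer", "Diaper", "Baby Powder", "Bread", "Umbrella"},
    {"Diaper", "Baby Powder"},
    {"Beer", "Diaper", "Milk"},
    {"Diaper", "Beer", "Detergent"},
    {"Beer", "Milk", "Coca-Cola"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(A, B):
    return support(A | B) / support(A)

print(confidence({"Beer"}, {"Diaper"}))         # 0.75
print(confidence({"Milk"}, {"Beer"}))           # 1.0
print(confidence({"Baby Powder"}, {"Diaper"}))  # 1.0
print(confidence({"Beer"}, {"Milk"}))           # 0.5  (below min_conf)
```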
Results
- Beer → Diaper: support 60%, confidence 75%
- Diaper → Beer: support 60%, confidence 75%
- Milk → Beer: support 40%, confidence 100%
- Baby Powder → Diaper: support 40%, confidence 100%

Interpretation
- Some results are believable, like Baby Powder → Diaper
- Some rules need additional analysis, like Milk → Beer
- Some rules are unbelievable, like Diaper → Beer
- This example could contain unrealistic results because of the small dataset
Compact Representations of Frequent Itemsets
- Maximal frequent itemsets
- Closed itemsets
- Closed frequent itemsets

Maximal frequent itemset
- A maximal frequent itemset is defined as a frequent itemset for which none of its immediate supersets is frequent
- Maximal frequent itemsets effectively provide a compact representation of frequent itemsets
- They form the smallest set of itemsets from which all frequent itemsets can be derived
- Maximal frequent itemsets do not, however, contain the support information of their subsets

Closed itemsets
- An itemset X is closed if none of its immediate supersets has exactly the same support count as X
- X is not closed if at least one of its immediate supersets has the same support count as X
- Closed itemsets provide a minimal representation of itemsets without losing their support information
- For example, {b,c} is a closed itemset if it does not have the same support count as any of its supersets
- An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to min_sup
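The three notions can be contrasted on the earlier TDB example (min_sup = 2). A brute-force sketch; checking all supersets is equivalent to checking immediate supersets, since support is anti-monotone:

```python
from itertools import combinations

# Closed vs. maximal frequent itemsets on the TDB example (min_sup = 2).
transactions = [{"A", "C", "D"}, {"B", "C", "E"},
                {"A", "B", "C", "E"}, {"B", "E"}]
min_sup = 2

items = sorted({i for t in transactions for i in t})

def count(s):
    return sum(s <= t for t in transactions)

all_itemsets = [frozenset(c) for r in range(1, len(items) + 1)
                for c in combinations(items, r)]
frequent = {s for s in all_itemsets if count(s) >= min_sup}

# closed: no strict superset has the same support count
closed = {s for s in frequent
          if not any(s < u and count(u) == count(s) for u in all_itemsets)}
# maximal frequent: no strict superset is frequent
maximal = {s for s in frequent if not any(s < u for u in frequent)}

print(sorted(map(sorted, maximal)))  # {A,C} and {B,C,E}
print(maximal <= closed)             # True: maximal itemsets are closed
```

Note, for instance, that {B} (support 3) is frequent but not closed, because its superset {B,E} also has support 3.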
- Closed frequent itemsets are useful for removing some of the redundant association rules
- An association rule X → Y is redundant if there exists another rule X′ → Y′, where X is a subset of X′ and Y is a subset of Y′, such that the support and confidence for both rules are identical
- The association rule {b} → {d,e} is therefore redundant because it has the same support and confidence as {b,c} → {d,e}; such redundant rules are not generated when closed frequent itemsets are used
- Maximal frequent itemsets are closed, because none of the immediate supersets of a maximal frequent itemset can have the same support count
Simpson's Paradox
- In some cases, hidden variables may cause the observed relationship between a pair of variables to disappear or reverse its direction, a phenomenon known as Simpson's paradox
- Consider the relationship between the sale of high-definition televisions (HDTV) and exercise machines:
  - {HDTV=Yes} → {Exercise machine=Yes} has a confidence of 99/180 = 55%
  - {HDTV=No} → {Exercise machine=Yes} has a confidence of 54/120 = 45%
- These rules suggest that customers who buy high-definition televisions are more likely to buy exercise machines
- However, a deeper analysis reveals that the sales of these items depend on whether the customer is a college student or a working adult
- For college students:
  - {HDTV=Yes} → {Exercise machine=Yes}: 1/10 = 10%
  - {HDTV=No} → {Exercise machine=Yes}: 4/34 = 11.8%
- For working adults:
  - {HDTV=Yes} → {Exercise machine=Yes}: 98/170 = 57.7%
  - {HDTV=No} → {Exercise machine=Yes}: 50/86 = 58.1%
- Within each group, customers who do not buy high-definition televisions are more likely to buy exercise machines, which contradicts the previous conclusion
The paradox explained
- Most customers who buy HDTVs are working adults
- Working adults are also the largest group of customers who buy exercise machines
- Because nearly 85% of the customers are working adults, the observed relationship between HDTVs and exercise machines is stronger in the combined data than it would have been had the data been stratified
- Suppose a/b < c/d and p/q < r/s, where:
  - a/b and p/q represent the confidence of the rule A → B in two different strata
  - c/d and r/s represent the confidence of the rule ¬A → B in the same two strata
- When the data is pooled together, the confidence values of the rules in the combined data are (a + p)/(b + q) and (c + r)/(d + s)
- Simpson's paradox occurs when

  (a + p)/(b + q) > (c + r)/(d + s)

  even though a/b < c/d and p/q < r/s, i.e. the pooled comparison reverses the direction observed in each stratum
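The HDTV numbers from the slides satisfy exactly these inequalities, which can be checked directly:

```python
# Simpson's paradox on the HDTV / exercise-machine figures from the slides:
# within each stratum, HDTV buyers are *less* likely to buy an exercise
# machine, yet the pooled data suggests the opposite.
strata = {
    "students": ((1, 10), (4, 34)),    # (a, b) for HDTV=Yes, (c, d) for HDTV=No
    "adults":   ((98, 170), (50, 86)),
}

for name, ((a, b), (c, d)) in strata.items():
    print(name, a / b, "<", c / d)
    assert a / b < c / d               # holds in every stratum

# pooled: (1 + 98)/(10 + 170) vs (4 + 50)/(34 + 86)
a, b = 1 + 98, 10 + 170
c, d = 4 + 50, 34 + 86
print("pooled", a / b, ">", c / d)     # 0.55 > 0.45: direction reversed
assert a / b > c / d
```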