Advanced Association Analysis
Minimum Support Threshold
Effect of Support Distribution
Many real data sets have a skewed support distribution.
[Figure: support distribution of a retail data set]
Effect of Support Distribution
How to set an appropriate minsup threshold?
If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products).
If minsup is set too low, mining is computationally expensive and the number of itemsets is very large.
Using a single minimum support threshold may not be effective.
Multiple Minimum Support
How to apply multiple minimum supports?
MS(i): minimum support for item i.
E.g.: MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
MS({Milk, Broccoli}) = min(MS(Milk), MS(Broccoli)) = 0.1%
Challenge: support is no longer anti-monotone.
Suppose Support(Milk, Coke) = 1.5% and Support(Milk, Coke, Broccoli) = 0.5%.
Then {Milk, Coke} is infrequent but {Milk, Coke, Broccoli} is frequent.
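The itemset-level MS and the broken anti-monotonicity above can be sketched in Python. This is a minimal illustration using the slide's numbers; the dict-based representation of MS values and supports is an assumption for the example, not part of any standard API.

```python
# Sketch: minimum support of an itemset under multiple minimum supports.

def itemset_minsup(itemset, ms):
    """MS of an itemset = the lowest MS among its member items."""
    return min(ms[i] for i in itemset)

# Per-item thresholds from the slide.
ms = {"Milk": 0.05, "Coke": 0.03, "Broccoli": 0.001, "Salmon": 0.005}

print(itemset_minsup({"Milk", "Broccoli"}, ms))  # 0.001

# Anti-monotonicity no longer holds: a subset can be infrequent
# while a superset is frequent, because the superset may inherit
# a much lower threshold from a rare item.
sup = {frozenset({"Milk", "Coke"}): 0.015,
       frozenset({"Milk", "Coke", "Broccoli"}): 0.005}

for itemset, s in sup.items():
    frequent = s >= itemset_minsup(itemset, ms)
    print(sorted(itemset), "frequent" if frequent else "infrequent")
```

Running this shows {Milk, Coke} infrequent (1.5% < 3%) while {Milk, Coke, Broccoli} is frequent (0.5% ≥ 0.1%), exactly the violation described above.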
Multiple Minimum Support
[Figure: itemset lattice over items A–E, showing which candidate itemsets survive under multiple minimum supports]
Item  MS(I)   Sup(I)
A     0.10%   0.25%
B     0.20%   0.26%
C     0.30%   0.29%
D     0.50%   0.05%
E     3%      4.20%
Multiple Minimum Support (Liu 1999)
Order the items according to their minimum support (in ascending order).
E.g.: MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
Ordering: Broccoli, Salmon, Coke, Milk
Need to modify Apriori such that:
L1: set of frequent items
F1: set of items whose support ≥ MS(1), where MS(1) = min_i(MS(i))
C2: candidate itemsets of size 2, generated from F1 instead of L1
Multiple Minimum Support (Liu 1999)
Modifications to Apriori:
In traditional Apriori, a candidate (k+1)-itemset is generated by merging two frequent itemsets of size k, and the candidate is pruned if it contains any infrequent subset of size k.
The pruning step has to be modified: prune only on subsets that contain the first item.
E.g.: candidate = {Broccoli, Coke, Milk} (ordered according to minimum support).
{Broccoli, Coke} and {Broccoli, Milk} are frequent, but {Coke, Milk} is infrequent.
The candidate is not pruned, because {Coke, Milk} does not contain the first item, i.e., Broccoli.
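The modified pruning step can be sketched as follows. This is a minimal sketch, not Liu's actual implementation: the tuple-of-items representation (items pre-sorted by ascending MS) and the `prune` helper are assumptions made for the example.

```python
from itertools import combinations

def prune(candidate, frequent_k):
    """Modified Apriori pruning (after Liu 1999): the candidate's items are
    ordered by ascending MS; a k-subset may only trigger pruning if it
    contains the candidate's first (lowest-MS) item."""
    k = len(candidate) - 1
    for subset in combinations(candidate, k):
        if candidate[0] not in subset:
            continue  # subsets that drop the first item are never checked
        if subset not in frequent_k:
            return True  # an eligible subset is infrequent -> prune
    return False

# Slide example: ordering by MS is Broccoli < Salmon < Coke < Milk.
# {Coke, Milk} is infrequent, so it is absent from the frequent 2-itemsets.
freq2 = {("Broccoli", "Coke"), ("Broccoli", "Milk")}
candidate = ("Broccoli", "Coke", "Milk")

print(prune(candidate, freq2))  # False
```

The candidate survives: the only infrequent 2-subset, ("Coke", "Milk"), does not contain Broccoli and is therefore skipped, matching the slide's conclusion.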
Mining Rare Association Rules
Rare Association Rule Mining: Motivation
Rare events are events that occur infrequently, perhaps in the frequency range 0.1% to 10%.
If they occur, the consequences can be quite dramatic or negative.
Applications:
Hardware fault detection: faults that are rare but costly.
Medical diagnosis: diseases that are typically rare but deadly.
Detecting Rare Itemsets: Apriori-Inverse
Discover all rules whose support is below a maximum support threshold and above a minimum absolute support value.
Example (UCI Repository: Zoo data set), maximum support = 0.20:
Itemset        Support  Used?
Venomous = 0   0.92     No
Tail = 1       0.74     No
...
Fins = 1       0.17     Yes
Venomous = 1   0.08     Yes
Coincidence vs Interesting
10,000 transactions; A appears 9,500 times, B appears 9,500 times, and AB appears 9,000 times.
A → B (confidence = 9,000 / 9,500 ≈ 0.95)
Would we consider this rule interesting?
What if AB appears 9,010 times? Under an assumption of independence, AB is expected to appear together about 9,025 times.
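The expected co-occurrence count quoted above is just the independence calculation, worth making explicit:

```python
# Expected co-occurrences of A and B under independence:
# N * P(A) * P(B) = n_A * n_B / N.
N, n_A, n_B = 10_000, 9_500, 9_500
n_AB = 9_000

expected_AB = n_A * n_B / N      # 0.95 * 0.95 * 10,000 = 9025.0
confidence = n_AB / n_A          # ~0.947

print(expected_AB)   # 9025.0
print(round(confidence, 3))
```

Since 9,010 is below the 9,025 expected by chance, such an AB count would be unremarkable despite the high confidence, which is the point of the slide.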
Probability of Collision
Under an assumption of independence, the probability that A and B occur together exactly c times is:
Pcc(c | N, a, b) = C(a, c) · C(N − a, b − c) / C(N, b)
where a and b are the occurrence counts of A and B, and C(n, k) is the binomial coefficient.
Given N = 1000, a = b = 500, and AB = 250, the probability of A and B occurring together exactly 250 times is about 0.05.
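The formula is the hypergeometric probability mass function, which can be checked directly with exact integer arithmetic (a minimal sketch; the function name `pcc` follows the slide's notation):

```python
from math import comb

def pcc(c, N, a, b):
    """Probability that A (a occurrences) and B (b occurrences) co-occur
    exactly c times in N transactions, assuming independence:
    hypergeometric pmf C(a, c) * C(N - a, b - c) / C(N, b)."""
    return comb(a, c) * comb(N - a, b - c) / comb(N, b)

p = pcc(250, 1000, 500, 500)
print(round(p, 3))  # ~0.05, as quoted on the slide
```

Note that even the single most likely count (c = 250, the mean) has only about a 5% probability, so observing it exactly carries little information on its own.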
Minimum Absolute Support
To find the number of collisions m for which Pcc is smaller than some value p (e.g., 0.0001):
minabssup(N, a, b, p) = min { m : Σ_{i=0}^{m} Pcc(i | N, a, b) ≥ 1.0 − p }
Given N = 1000, a = b = 500, and p = 0.0001, the minabssup value is 274.
Candidate itemsets that appear above the minabssup requirement are retained.
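One reading of the formula above can be sketched in Python. This is an assumption about how the cumulative condition is meant (accumulate Pcc until the tail above m is less likely than p); depending on the exact convention the result can differ by a few counts from the 274 quoted on the slide.

```python
from math import comb

def pcc(c, N, a, b):
    # Hypergeometric probability of exactly c co-occurrences under independence.
    return comb(a, c) * comb(N - a, b - c) / comb(N, b)

def minabssup(N, a, b, p):
    """Smallest m whose cumulative Pcc reaches 1 - p: co-occurrence counts
    above m are jointly less likely than p by chance alone."""
    total = 0.0
    for m in range(min(a, b) + 1):
        total += pcc(m, N, a, b)
        if total >= 1.0 - p:
            return m
    return min(a, b)

print(minabssup(1000, 500, 500, 0.0001))
```

Itemsets whose observed count exceeds this value co-occur more often than chance can plausibly explain, which is the retention criterion the slide describes.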
Rare Pattern
Given a user-specified minimum support threshold minsup ∈ [0, 1], X is called a rare itemset or rare pattern in D if sup(X, D) < minsup.
Roadmap for Rare Pattern Mining
Mining Negative Rules
Negative vs Rare Patterns
Rare patterns: very low support but interesting, e.g., buying Rolex watches.
Mining approach: set individual-based or special group-based support thresholds for valuable items.
Negative patterns: since it is unlikely that one buys a Ford Expedition (an SUV) and a Toyota Prius (a hybrid car) together, Ford Expedition and Toyota Prius are likely negatively correlated patterns.
Negatively correlated patterns that are infrequent tend to be more interesting than those that are frequent.
Negatively Correlated Patterns
Definition 1 (support-based): if itemsets X and Y are both frequent but rarely occur together, i.e., sup(X ∪ Y) < sup(X) × sup(Y), then X and Y are negatively correlated.
Problem: a store sold two needle packages A and B, each in 100 transactions, with only one transaction containing both A and B.
When there are in total 200 transactions: s(A ∪ B) = 0.005 and s(A) × s(B) = 0.25, so s(A ∪ B) < s(A) × s(B).
When there are 10^5 transactions: s(A ∪ B) = 1/10^5 and s(A) × s(B) = 1/10^3 × 1/10^3, so s(A ∪ B) > s(A) × s(B).
Where is the problem? Null transactions: the support-based definition is not null-invariant!
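The null-invariance failure is easy to demonstrate numerically: the same purchases flip Definition 1's verdict once enough null transactions are added. A minimal sketch (the helper name `support_check` is an assumption for the example):

```python
def support_check(n_total, n_A, n_B, n_AB):
    """Definition 1 test: True means 'negatively correlated' by
    sup(A ∪ B) < sup(A) * sup(B)."""
    s_AB = n_AB / n_total
    s_A = n_A / n_total
    s_B = n_B / n_total
    return s_AB < s_A * s_B

# Identical purchase behaviour, different numbers of null transactions:
print(support_check(200, 100, 100, 1))      # True  -> negatively correlated
print(support_check(100_000, 100, 100, 1))  # False -> verdict flips
```

Nothing about A and B changed between the two calls; only transactions containing neither item were added, which is exactly why a null-invariant measure is needed.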
Negatively Correlated Patterns
Definition 2 (negative itemset-based): X is a negative itemset if (1) X = Ā ∪ B, where B is a set of positive items, Ā is a set of negative items, and |Ā| ≥ 1; and (2) s(X) ≥ μ. Itemset X is negatively correlated if [condition missing on the slide].
This definition suffers from a similar null-invariance problem.
Definition 3 (Kulczynski measure-based): if itemsets X and Y are frequent but (P(X|Y) + P(Y|X)) / 2 < ε, where ε is a negative pattern threshold, then X and Y are negatively correlated.
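The Kulczynski measure in Definition 3 is null-invariant because it is built only from conditional probabilities, i.e., from the counts of X, Y, and X ∪ Y, never from the total number of transactions. A minimal sketch using the needle-package counts:

```python
def kulczynski(n_XY, n_X, n_Y):
    """Kulczynski measure: (P(X|Y) + P(Y|X)) / 2, computed from raw counts.
    The database size does not appear, so null transactions cannot affect it."""
    return 0.5 * (n_XY / n_X + n_XY / n_Y)

# Needle packages: each sold 100 times, co-occurring once.
# The value is identical whether the database has 200 or 100,000 transactions.
print(kulczynski(1, 100, 100))  # 0.01, far below any plausible threshold ε
```

With ε set anywhere above 0.01, A and B come out negatively correlated in both databases, fixing the inconsistency of the support-based definition.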
Mining Sequential Patterns
Sequential Patterns
Transaction databases and time-series databases vs. sequence databases
Frequent patterns vs. (frequent) sequential patterns
Applications of sequential pattern mining:
Customer shopping sequences: first buy a computer, then a CD-ROM, and then a digital camera, within 3 months
Medical treatments, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets
Telephone calling patterns, Weblog click streams
Program execution sequence data sets
DNA sequences and gene structures
What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences.
A sequence database:
SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
A sequence, e.g., <(ef)(ab)(df)cb>: an element may contain a set of items; items within an element are unordered and listed alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
Sequential pattern mining: find the complete set of patterns satisfying the minimum support (frequency) threshold.
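The subsequence test and support count from this slide can be sketched in Python. Representing each sequence as a list of item sets is an assumption made for the example; a greedy left-to-right scan is sufficient for the containment test.

```python
def is_subsequence(sub, seq):
    """sub is a subsequence of seq if each of sub's elements is a subset of
    some element of seq, in order. A greedy scan matches each element of sub
    against the earliest eligible element of seq."""
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:
            i += 1
    return i == len(sub)

# The sequence database from the slide (SIDs 10, 20, 30, 40).
db = [
    [{'a'}, {'a', 'b', 'c'}, {'a', 'c'}, {'d'}, {'c', 'f'}],
    [{'a', 'd'}, {'c'}, {'b', 'c'}, {'a', 'e'}],
    [{'e', 'f'}, {'a', 'b'}, {'d', 'f'}, {'c'}, {'b'}],
    [{'e'}, {'g'}, {'a', 'f'}, {'c'}, {'b'}, {'c'}],
]

pattern = [{'a', 'b'}, {'c'}]  # <(ab)c>
support = sum(is_subsequence(pattern, s) for s in db)
print(support)  # 2 -> with min_sup = 2, <(ab)c> is a sequential pattern
```

Sequences 10 and 30 each contain an element covering (ab) followed later by an element containing c, so the support of 2 matches the slide's example.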
The Apriori Property of Sequential Patterns
A basic property (Apriori): if a sequence S is not frequent, then none of the super-sequences of S is frequent.
E.g., if <hb> is infrequent, so are <hab> and <(ah)b>.
Seq. ID  Sequence
10       <(bd)cb(ac)>
20       <(bf)(ce)b(fg)>
30       <(ah)(bf)abf>
40       <(be)(ce)d>
50       <a(bd)bcb(ade)>
Given support threshold min_sup = 2.
Readings
Ian Witten, Data Mining, Section 6.3.
Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Chapter 6.
R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.
H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94.
J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00.
M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94.
S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. SIGMOD'97.
E. Omiecinski. Alternative interest measures for mining associations. TKDE'03.
Charu C. Aggarwal and Jiawei Han. Frequent Pattern Mining. Springer, 2014.