Association rule mining: Given a data set, find the items that are associated with each other. Association is measured as the frequency with which items occur together in the same context. Buying one product whenever another product is bought is an example of an association rule. Association rules detect common usage of items.
Market basket analysis: An example of frequent itemset mining. This process analyzes customer buying habits by finding associations between the different items customers purchase. These discoveries help retailers develop marketing strategies. In short, it provides insight into which combinations of products occur together in a customer's basket.
Association: Given a set of items I = {I1, I2, ..., Im} and a database of transactions D = {t1, t2, ..., tn}, where each ti = {Ii1, Ii2, ..., Iik} with each Iik ∈ I, an association rule is an implication X -> Y, where X, Y ⊂ I are itemsets and X ∩ Y = φ. So, in short, an association rule expresses an implication from itemset X to itemset Y.
Market basket analysis

Transaction | Items
1 | Milk, curd
2 | Bread, butter, cold drink, eggs
3 | Bread, butter, cold drink, jam
4 | Bread, milk, butter, cold drink
5 | Bread, milk, butter, jam
Terminologies
Itemset: A collection of one or more items. E.g. {butter, milk}.
Support count (σ): The frequency of occurrence of an itemset. E.g. σ{butter, bread, milk} = 2.
Support (s): The fraction of transactions that contain an itemset. E.g. s({butter, bread, milk}) = 2/5.
Frequent itemset: An itemset whose support is greater than a minimum threshold.
Rule evaluation metrics
Support: The fraction of transactions that contain both X and Y.
s = support_count(X ∪ Y) / N
So s for {milk, butter} -> {bread} is s = σ{milk, butter, bread} / N = 2/5 = 0.4.
One more example, for the association bread -> butter: s = σ{butter, bread} / N = 4/5 = 0.8.
Confidence: Measures how often items in Y occur in transactions containing X.
c = support_count(X ∪ Y) / support_count(X)
For {bread} -> {butter}: c (or α) = σ{butter, bread} / σ{bread} = 4/4 = 1.
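The two metrics above can be sketched directly in code. A minimal sketch on the five-transaction basket data from the table; the function names `support_count`, `support`, and `confidence` are my own choices, not from the slides:

```python
# Transactions from the market-basket example (item names lowercased for uniformity)
transactions = [
    {"milk", "curd"},
    {"bread", "butter", "cold drink", "eggs"},
    {"bread", "butter", "cold drink", "jam"},
    {"bread", "milk", "butter", "cold drink"},
    {"bread", "milk", "butter", "jam"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset` (sigma)."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset, transactions):
    """Fraction of transactions that contain `itemset`: sigma / N."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(x, y, transactions):
    """Confidence of the rule X -> Y: count(X u Y) / count(X)."""
    return support_count(set(x) | set(y), transactions) / support_count(x, transactions)

print(support({"milk", "butter", "bread"}, transactions))  # 2/5 = 0.4
print(support({"butter", "bread"}, transactions))          # 4/5 = 0.8
print(confidence({"bread"}, {"butter"}, transactions))     # 4/4 = 1.0
```

The `<=` subset test on sets does the "transaction contains the itemset" check in one expression.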
Rule evaluation metrics
Confidence measures the strength of the rule, whereas support measures how often it occurs in the database.
E.g. curd -> bread:
s = 0/5 = 0 (no transaction contains both itemsets)
c = σ{curd, bread} / σ{curd} = 0/1 = 0
Generally, a rule is interesting only when both its confidence and its support are large enough. So a marketing company won't spend time advertising bread alongside curd, given that when curd is bought, bread is never bought.
Apriori algorithm
Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules. It uses prior knowledge of the properties of frequent itemsets and takes an iterative approach to level-wise search.
Transactional data of a store branch

TID | List of items
T1 | I1, I2, I5
T2 | I2, I4
T3 | I2, I3
T4 | I1, I2, I4
T5 | I1, I3
T6 | I2, I3
T7 | I1, I3
T8 | I1, I2, I3, I5
T9 | I1, I2, I3

Consider this database of 9 transactions. Assume the minimum support count required is 2, so min support = 2/9 ≈ 22%. Let minimum confidence be 70%. We have to find the frequent itemsets using the Apriori algorithm; association rules will then be generated using the minimum support and minimum confidence. (A frequent itemset is an itemset whose support is greater than the minimum threshold.)
Generating the 1-itemset frequent pattern
Scan D for the count of each candidate:

C1 (candidate 1-itemsets)
Itemset | Support count
I1 | 6
I2 | 7
I3 | 6
I4 | 2
I5 | 2

Compare each candidate's support count with the minimum support count:

L1 (frequent 1-itemsets)
Itemset | Support count
I1 | 6
I2 | 7
I3 | 6
I4 | 2
I5 | 2

In the first iteration, each item is a member of the candidate set C1. L1 holds the candidate 1-itemsets that satisfy minimum support (as all of them satisfy it, all are included).
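This first pass is a single scan that counts individual items and filters by the threshold. A minimal sketch on the 9-transaction database; the names `D`, `C1`, `L1`, and `MIN_SUPPORT_COUNT` mirror the slides' notation but are otherwise my own:

```python
from collections import Counter

# The 9-transaction database from the slides
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUPPORT_COUNT = 2

# C1: one scan of D counts every individual item
C1 = Counter(item for t in D for item in t)

# L1: keep the 1-itemsets whose count meets the minimum support count
L1 = {frozenset([item]): count for item, count in C1.items()
      if count >= MIN_SUPPORT_COUNT}

print(sorted(C1.items()))
# [('I1', 6), ('I2', 7), ('I3', 6), ('I4', 2), ('I5', 2)]
```

`frozenset` keys are used so the same itemsets can index dictionaries in later passes.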
Generating the 2-itemset frequent pattern
Generate C2 candidates from L1:

C2 (candidate 2-itemsets)
Itemset
I1,I2
I1,I3
I1,I4
I1,I5
I2,I3
I2,I4
I2,I5
I3,I4
I3,I5
I4,I5

Scan D for the count of each candidate:

C2 (candidate 2-itemsets)
Itemset | S. count
I1,I2 | 4
I1,I3 | 4
I1,I4 | 1
I1,I5 | 2
I2,I3 | 4
I2,I4 | 2
I2,I5 | 2
I3,I4 | 0
I3,I5 | 1
I4,I5 | 0

Compare each candidate's support count with the minimum support count:

L2 (frequent 2-itemsets)
Itemset | S. count
I1,I2 | 4
I1,I3 | 4
I1,I5 | 2
I2,I3 | 4
I2,I4 | 2
I2,I5 | 2
Generating the 2-itemset frequent pattern
To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 join L1 to generate the candidate set C2. The transactions are then scanned and the support count of each candidate in C2 is accumulated. The set of frequent 2-itemsets L2 consists of those candidate 2-itemsets in C2 that have minimum support.
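The join-then-count step above can be sketched as follows. A minimal sketch, assuming the database and the five frequent items from the first pass (restated here so the snippet is self-contained); at the 2-itemset level the join of L1 with itself is simply every pair of frequent items:

```python
from itertools import combinations

# Transaction database and the frequent 1-items from the first pass
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
frequent_items = ["I1", "I2", "I3", "I4", "I5"]  # items of L1

# Join step: C2 = L1 join L1, i.e. every pair of frequent items
C2 = [frozenset(pair) for pair in combinations(frequent_items, 2)]

# One scan of D accumulates the support count of each candidate
counts = {c: sum(1 for t in D if c <= t) for c in C2}

# L2 keeps the candidates that meet the minimum support count of 2
L2 = {c: n for c, n in counts.items() if n >= 2}
print(len(C2), len(L2))  # 10 candidates, 6 survive
```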
Generating the 3-itemset frequent pattern
C3 (candidate 3-itemsets): {I1,I2,I3}, {I1,I2,I5}. How?
To find C3, we compute L2 join L2. Here the Apriori property comes into the picture.
Apriori property: All subsets of a frequent itemset must also be frequent.
So, initially C3 = L2 join L2 = {{I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}}.
Remember that the join Lk-1 join Lk-1 is performed, where two members of Lk-1 are joinable if their first k-2 items are in common.
Generating the 3-itemset frequent pattern
Based on the Apriori property, we can determine that the four latter candidates cannot possibly be frequent. For {I1,I2,I3}, the 2-item subsets are {I1,I2}, {I1,I3}, and {I2,I3}. Since all of them are members of L2, we keep it in C3. Let's take another example, {I2,I3,I5}: the 2-item subsets are {I2,I3}, {I2,I5}, and {I3,I5}. But {I3,I5} is not a member of L2 and hence is not frequent, violating the Apriori property, so we have to remove {I2,I3,I5} from C3. Finally, we have {I1,I2,I3} and {I1,I2,I5} in C3. This step is called pruning.
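The join and prune steps just described can be written as one candidate-generation routine. A minimal sketch; the function name `apriori_gen` follows the usual textbook naming for this subroutine, and everything else is my own construction:

```python
from itertools import combinations

def apriori_gen(Lk_1, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets:
    join step (first k-2 items in common) followed by Apriori pruning."""
    prev = sorted(sorted(s) for s in Lk_1)
    candidates = []
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            # Join step: two members are joinable if their first k-2 items agree
            if prev[i][:k - 2] == prev[j][:k - 2]:
                cand = frozenset(prev[i]) | frozenset(prev[j])
                if len(cand) == k:
                    # Prune step: every (k-1)-subset must itself be frequent
                    if all(frozenset(sub) in Lk_1
                           for sub in combinations(cand, k - 1)):
                        candidates.append(cand)
    return candidates

# L2 from the previous step
L2 = {frozenset(s) for s in [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
                             ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]}
print(sorted(sorted(c) for c in apriori_gen(L2, 3)))
# [['I1', 'I2', 'I3'], ['I1', 'I2', 'I5']]
```

Of the six joined candidates, four are pruned because one of their 2-item subsets ({I3,I5}, {I3,I4}, or {I4,I5}) is not in L2, matching the walkthrough above.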
Generating the 3-itemset frequent pattern

C3 (candidate 3-itemsets)
Itemset | S. count
I1,I2,I3 | 2
I1,I2,I5 | 2

Compare each candidate's support count with the minimum support count:

L3 (frequent 3-itemsets)
Itemset | S. count
I1,I2,I3 | 2
I1,I2,I5 | 2
Generating the 4-itemset frequent pattern
The algorithm now uses L3 join L3 to generate the candidate set of 4-itemsets, C4. The join yields {I1,I2,I3,I5}, but this itemset is pruned because its subset {I2,I3,I5} is not frequent. So C4 = ∅, and the algorithm terminates: we have found all the frequent itemsets. What next? These frequent itemsets will be used to generate strong association rules (strong association rules satisfy both minimum support and minimum confidence).
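Putting the level-wise passes together gives the complete mining loop. A minimal sketch of the whole Apriori frequent-itemset phase on the 9-transaction database; the join and prune logic is inlined, and the function name `apriori` is my own:

```python
from itertools import combinations

def apriori(D, min_count):
    """Level-wise Apriori: return every frequent itemset with its support count."""
    items = sorted({i for t in D for i in t})
    # L1: frequent 1-itemsets from a first scan of D
    Lk = {frozenset([i]): n for i in items
          if (n := sum(1 for t in D if i in t)) >= min_count}
    frequent = dict(Lk)
    k = 2
    while Lk:
        prev = sorted(sorted(s) for s in Lk)
        Ck = set()
        for a, b in combinations(prev, 2):
            if a[:k - 2] == b[:k - 2]:              # join step
                cand = frozenset(a) | frozenset(b)
                if len(cand) == k and all(           # prune step
                        frozenset(s) in Lk for s in combinations(cand, k - 1)):
                    Ck.add(cand)
        # Scan D once per level to count the surviving candidates
        Lk = {c: n for c in Ck
              if (n := sum(1 for t in D if c <= t)) >= min_count}
        frequent.update(Lk)
        k += 1
    return frequent

D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
freq = apriori(D, 2)
print(len(freq))  # 13: five 1-itemsets, six 2-itemsets, two 3-itemsets
```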
Generating association rules from frequent itemsets
Procedure: For each frequent itemset l, generate all nonempty proper subsets of l. For every nonempty subset x of l, output the rule x -> (l − x) if support_count(l) / support_count(x) >= min_conf, where min_conf is the minimum confidence threshold.
From the example, we have l = {I1,I2,I5}. Its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}. Applying the rule (for every nonempty subset x of l, output x -> (l − x) if support_count(l) / support_count(x) >= min_conf), we get:
{I1,I2} -> {I5}, conf = 2/4 = 50%
{I1,I5} -> {I2}, conf = 2/2 = 100%
{I2,I5} -> {I1}, conf = 2/2 = 100%
{I1} -> {I2,I5}, conf = 2/6 = 33%
{I2} -> {I1,I5}, conf = 2/7 = 29%
{I5} -> {I1,I2}, conf = 2/2 = 100%
With minimum confidence 70%, only the second, third, and sixth rules are strong. Now what? Strong association rules are used for deciding business policies.
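The rule-generation procedure can be sketched in a few lines. A minimal sketch; the function name `rules_from_itemset` is my own, and the support counts are taken from the worked example above:

```python
from itertools import combinations

def rules_from_itemset(l, support_counts, min_conf):
    """Emit x -> (l - x) for every nonempty proper subset x of frequent
    itemset l whose confidence meets min_conf."""
    rules = []
    for r in range(1, len(l)):
        for x in combinations(sorted(l), r):
            x = frozenset(x)
            conf = support_counts[l] / support_counts[x]
            if conf >= min_conf:
                rules.append((x, l - x, conf))
    return rules

# Support counts from the worked example
sc = {frozenset(s): n for s, n in [
    (("I1",), 6), (("I2",), 7), (("I5",), 2),
    (("I1", "I2"), 4), (("I1", "I5"), 2), (("I2", "I5"), 2),
    (("I1", "I2", "I5"), 2),
]}

strong = rules_from_itemset(frozenset({"I1", "I2", "I5"}), sc, 0.7)
for x, y, conf in strong:
    print(sorted(x), "->", sorted(y), f"{conf:.0%}")
```

With min_conf = 0.7, only the three 100% rules survive, matching the walkthrough above.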
References
Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Elsevier, 2011.