Tutorial on Association Rule Mining
Yang Yang
yang.yang@itee.uq.edu.au
DKE Group, 78-625
August 13, 2010
Outline
1 Quick Review
2 Apriori Algorithm
3 FP-Growth Algorithm
4 Mining Flickr and Tag Recommendation
Quick Review
What are Association Rules?
- Frequent patterns, behaviors, or correlations among items or objects.
What are Association Rules used for?
- Predicting what may happen in the future.
Where are Association Rules mined?
- Transactional databases, relational databases, etc.
Application Scenarios
Market Basket Analysis
- E.g., Bread → Milk
Course Management
- Data Mining → Machine Learning
- Machine Learning → Convex Optimization
Recommendation
- Online book stores, e.g., Amazon
- Tag recommendation, e.g., YouTube, Flickr...
Notation Highlights
Items: $I = \{i_1, i_2, \ldots, i_m\}$
Transactions: $D = \{t_1, t_2, \ldots, t_n\}$
Itemset: $X$, a set of items
Support: $\mathrm{Support}(X) = P(X) = \frac{\text{number of transactions containing } X}{\text{total number of transactions}}$
- How often $X$ appears in the transactions of $D$.
Frequent (Large) Itemset
- An itemset whose support meets or exceeds a given threshold.
Association Rule: $X \Rightarrow Y$
- $X$ implies $Y$.
Confidence: $P(Y \mid X) = \frac{P(X, Y)}{P(X)} = \frac{\mathrm{Support}(X \cup Y)}{\mathrm{Support}(X)}$
- How likely $Y$ is to occur when $X$ occurs.
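The support and confidence definitions above translate almost directly into code. The following is a minimal sketch, assuming each transaction is stored as a Python `frozenset`; the function names and the `baskets` data are hypothetical and chosen only for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(Y | X) as Support(X ∪ Y) / Support(X)."""
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    return support(xy, transactions) / support(x, transactions)


# Hypothetical market-basket data, for illustration only.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

print(support({"bread", "milk"}, baskets))       # 2 of 4 baskets -> 0.5
print(confidence({"bread"}, {"milk"}, baskets))  # 2 of the 3 bread baskets contain milk
```

This sketch assumes the antecedent occurs at least once; a rule whose antecedent has zero support would cause a division by zero and would need separate handling.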
Apriori Algorithm
Apriori Property: All nonempty subsets of a large (frequent) itemset must also be large (frequent).
An iterative approach in which k-itemsets are used to explore (k + 1)-itemsets.
Candidate Generation
- Join and Prune
Test
- Compare each candidate's support with the threshold
Pseudo Code
Input: D, min_sup
Output: L, frequent itemsets in D
L_1 = find_frequent_1-itemsets(D);
for (k = 2; L_{k-1} ≠ ∅; k++) do
    C_k = apriori_gen(L_{k-1});
    foreach transaction t ∈ D do
        C_t = subset(C_k, t);
        foreach candidate c ∈ C_t do
            c.count++;
        end
    end
    L_k = {c ∈ C_k | c.count ≥ min_sup};
end
return L = ∪_k L_k;
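The pseudocode above can be sketched in Python as follows. This is one possible rendering, not a reference implementation: `min_count` is an absolute support count rather than a fraction, the join step unions any two frequent (k-1)-itemsets instead of using the usual prefix-based join (the prune step leaves the same candidates), and the four-transaction `data` list is hypothetical.

```python
from itertools import combinations


def apriori_gen(prev_frequent, k):
    """Join (k-1)-itemsets into k-item candidates, then prune by the Apriori property."""
    candidates = set()
    for a in prev_frequent:
        for b in prev_frequent:
            union = a | b
            if len(union) == k:
                # Prune: every (k-1)-subset must itself be frequent.
                if all(frozenset(s) in prev_frequent for s in combinations(union, k - 1)):
                    candidates.add(union)
    return candidates


def apriori(transactions, min_count):
    """Return {itemset: support count} for all itemsets with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:  # first scan: count single items
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)
    k = 2
    while current:
        candidates = apriori_gen(set(current), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:  # one database scan per level k
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical data, for illustration only.
data = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
result = apriori(data, min_count=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With `min_count=2`, the candidate {a, b, c} is pruned before any counting because its subset {b, c} is not frequent, which is exactly the saving the Apriori property provides.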
A toy example
Use the Apriori algorithm to find the rules with minimum support 0.5 and minimum confidence 0.75 in the following database.
TID  Transaction
1    {a, b, c, d}
2    {a, c, d}
3    {a, b, c}
4    {b, c, d}
5    {a, b, c}
6    {a, b, c}
7    {c, d, e}
8    {a, c}
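A hand-worked answer to this exercise can be checked with a few lines of Python. Note that the sketch below deliberately uses brute-force enumeration over all item subsets rather than Apriori, which is only feasible because the database has five distinct items; it also assumes the thresholds are inclusive (support ≥ 0.5, confidence ≥ 0.75).

```python
from itertools import combinations

transactions = [
    {"a", "b", "c", "d"}, {"a", "c", "d"}, {"a", "b", "c"}, {"b", "c", "d"},
    {"a", "b", "c"}, {"a", "b", "c"}, {"c", "d", "e"}, {"a", "c"},
]
MIN_SUP, MIN_CONF = 0.5, 0.75
n = len(transactions)
items = sorted(set().union(*transactions))

# Brute force: count every non-empty subset of the item universe.
count = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        c = sum(1 for t in transactions if set(combo) <= t)
        if c / n >= MIN_SUP:
            count[combo] = c

# Rules X -> Y from each frequent itemset, with X and Y non-empty and disjoint.
rules = []
for itemset, c in count.items():
    for r in range(1, len(itemset)):
        for x in combinations(itemset, r):
            y = tuple(i for i in itemset if i not in x)
            conf = c / count[x]  # x is frequent because it is a subset of a frequent itemset
            if conf >= MIN_CONF:
                rules.append((x, y, conf))

for itemset, c in count.items():
    print("".join(itemset), c)
for x, y, conf in rules:
    print("".join(x), "->", "".join(y), round(conf, 2))
```

Under these assumptions the script reports nine frequent itemsets (a, b, c, d, ab, ac, bc, cd, abc) and eight rules: a→c, c→a, b→a, b→c, d→c, b→ac, ab→c, and bc→a.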
Issues with Apriori
Pros
- The basic idea is straightforward and easy to understand.
- Efficient on small-scale datasets.
Cons
- Candidate generation cannot be avoided.
- The database has to be scanned again and again to test the candidates.
Can we design a method that mines the complete set of frequent itemsets without candidate generation and repeated database scans?
Frequent-Pattern Growth
Divide-and-Conquer Strategy
Frequent-Pattern Tree
- A compressed database that keeps statistical information.
- Itemset association information is retained.
Recursive idea
- Frequent items and their corresponding conditional databases.
- Mine each sub-FP-Tree and concatenate the result with its frequent item.
FP-Tree Construction
Scan the database once and find the frequent items F.
Sort F in descending order of support count to obtain L.
Create the root of the FP-Tree and label it null.
Scan the database again. For each transaction, select its frequent items and sort them in L-order.
Follow an existing path while the transaction shares a prefix with it, and create a new branch where it does not.
Increment the count of each item in the transaction along its path in the tree.
An item header table records the occurrences of each item via a chain of node links.
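The construction steps above can be sketched as follows. This is an illustrative sketch, not a canonical implementation: the `FPNode` class and `build_fp_tree` function are hypothetical names, `min_count` is an absolute support count, and ties in support are broken alphabetically, which is an assumption the slides do not specify. The data is the toy database from this tutorial.

```python
from collections import Counter


class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}
        self.link = None  # next node carrying the same item (header chain)


def build_fp_tree(transactions, min_count):
    """Two scans: count items, then insert each L-ordered transaction."""
    # Scan 1: support counts of single items.
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_count}
    # L-order: descending support, ties broken alphabetically (an assumption).
    rank = {i: r for r, i in enumerate(sorted(freq, key=lambda i: (-freq[i], i)))}

    root = FPNode(None, None)  # the null root
    header = {}  # item -> first node in its chain
    tails = {}   # item -> last node in its chain, so appending is O(1)
    # Scan 2: insert transactions, sharing common prefixes.
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:  # no shared prefix from here on: start a new branch
                child = FPNode(item, node)
                node.children[item] = child
                if item in tails:
                    tails[item].link = child
                else:
                    header[item] = child
                tails[item] = child
            child.count += 1
            node = child
    return root, header


transactions = [
    {"a", "b", "c", "d"}, {"a", "c", "d"}, {"a", "b", "c"}, {"b", "c", "d"},
    {"a", "b", "c"}, {"a", "b", "c"}, {"c", "d", "e"}, {"a", "c"},
]
root, header = build_fp_tree(transactions, min_count=4)

# Walk the header chain of item d and collect the count stored at each node.
node, chain = header["d"], []
while node is not None:
    chain.append(node.count)
    node = node.link
print(root.children["c"].count, chain)  # 8 [1, 1, 1, 1]
```

Item d ends up on four separate branches, each with count 1, because d is last in L-order and its four transactions have four different prefixes.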
Mining FP-Tree
For each item (suffix pattern) in the header table, find the prefix paths (conditional pattern base) that lead to it.
All items along such a path take the same count as this item.
Based on the conditional pattern base, construct the item's conditional FP-Tree and run the mining algorithm recursively on that tree.
Concatenate the suffix pattern with the frequent patterns generated from the conditional FP-Tree.
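The mining procedure described above can be sketched as a short recursive generator. This is a compact illustration under stated assumptions: the header table is kept as a plain list of nodes per item instead of a linked chain, `min_count` is an absolute support count, and the names `Node`, `build_tree`, and `fp_growth` are hypothetical. The input is the tutorial's toy database.

```python
from collections import defaultdict


class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}


def build_tree(weighted_transactions, min_count):
    """Build an FP-tree from (items, count) pairs; return its header table and item counts."""
    freq = defaultdict(int)
    for items, count in weighted_transactions:
        for item in items:
            freq[item] += count
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root = Node(None, None)
    header = defaultdict(list)  # item -> all tree nodes carrying that item
    for items, count in weighted_transactions:
        ordered = sorted((i for i in items if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += count
    return header, freq


def fp_growth(weighted_transactions, min_count, suffix=frozenset()):
    """Yield (itemset, support count) for every frequent itemset."""
    header, freq = build_tree(weighted_transactions, min_count)
    for item, nodes in header.items():
        pattern = suffix | {item}  # concatenate the suffix with the new item
        yield pattern, freq[item]
        # Conditional pattern base: the prefix path of each node, weighted by that node's count.
        base = []
        for node in nodes:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        # Recurse on the conditional FP-tree built from that base.
        yield from fp_growth(base, min_count, pattern)


transactions = [
    {"a", "b", "c", "d"}, {"a", "c", "d"}, {"a", "b", "c"}, {"b", "c", "d"},
    {"a", "b", "c"}, {"a", "b", "c"}, {"c", "d", "e"}, {"a", "c"},
]
result = dict(fp_growth([(t, 1) for t in transactions], min_count=4))
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
```

No candidate itemsets are generated here: each recursive call only sees the prefix paths of one item, so the database is scanned twice to build the initial tree and never again.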
The toy example again...
Use the FP-Growth algorithm to find the rules with minimum support 0.5 and minimum confidence 0.75 in the following database.
TID  Transaction
1    {a, b, c, d}
2    {a, c, d}
3    {a, b, c}
4    {b, c, d}
5    {a, b, c}
6    {a, b, c}
7    {c, d, e}
8    {a, c}
The toy example again...
L-order
TID  Transaction
1    {c, a, b, d}
2    {c, a, d}
3    {c, a, b}
4    {c, b, d}
5    {c, a, b}
6    {c, a, b}
7    {c, d}
8    {c, a}
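The L-order table above can be reproduced with a short script, which may help when checking larger examples by hand. This is a small sketch that assumes a minimum support count of 4 (support 0.5 over 8 transactions) and breaks ties in support alphabetically; the variable names are illustrative.

```python
from collections import Counter

transactions = [
    {"a", "b", "c", "d"}, {"a", "c", "d"}, {"a", "b", "c"}, {"b", "c", "d"},
    {"a", "b", "c"}, {"a", "b", "c"}, {"c", "d", "e"}, {"a", "c"},
]
min_count = 4  # support 0.5 of 8 transactions

# First scan: support count of every item.
freq = Counter(item for t in transactions for item in t)

# Keep the frequent items and sort them by descending support count.
l_order = sorted((i for i in freq if freq[i] >= min_count), key=lambda i: (-freq[i], i))

# Rewrite each transaction: drop infrequent items, list the rest in L-order.
reordered = [[i for i in l_order if i in t] for t in transactions]

print(l_order)       # ['c', 'a', 'b', 'd']
print(reordered[6])  # ['c', 'd']  (e is infrequent and dropped)
```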
Mining Flickr and Tag Recommendation
Use the Flickr API (http://www.flickr.com/services/api/) to collect an image tag dataset.
Use Association Rule Mining to discover user tagging behaviours/patterns.
- Weka (http://www.cs.waikato.ac.nz/ml/weka/)
- Frequent Itemset Mining Implementations Repository (http://fimi.cs.helsinki.fi/)
Recommend tags.
Present your work.
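Once rules have been mined from the collected tag sets, the recommendation step could look something like the sketch below. Everything here is an assumption made for illustration: the `recommend_tags` function, the choice to score a tag by the highest confidence among the rules that fire, and the example rules, which are invented and not mined from real Flickr data. Data collection through the Flickr API is not shown.

```python
def recommend_tags(current_tags, rules, top_n=3):
    """Rank candidate tags by the best confidence among rules whose antecedent is satisfied."""
    current = set(current_tags)
    scores = {}
    for antecedent, consequent, conf in rules:
        if set(antecedent) <= current:  # the rule fires for this photo
            for tag in consequent:
                if tag not in current:  # never recommend a tag the photo already has
                    scores[tag] = max(scores.get(tag, 0.0), conf)
    ranked = sorted(scores, key=lambda tag: (-scores[tag], tag))
    return ranked[:top_n]


# Hypothetical rules in the form (antecedent, consequent, confidence).
rules = [
    ({"beach"}, {"sea"}, 0.9),
    ({"beach", "sunset"}, {"sky"}, 0.8),
    ({"sunset"}, {"sea"}, 0.6),
    ({"city"}, {"night"}, 0.7),
]
print(recommend_tags({"beach", "sunset"}, rules))  # ['sea', 'sky']
```

Other scoring schemes, such as summing confidences or weighting by support, are equally plausible; which one works best is something to evaluate on the collected dataset.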