Frequent Pattern Mining
- Frequent Patterns
- The Apriori Algorithm
- The FP-growth Algorithm
- Sequential Pattern Mining
- Summary
Netflix Prize
- Users rate movies from time to time (http://www.netflixprize.com/)
- Can we predict how much a user likes a movie?
[Table: users U1-U4 and their Good/Bad ratings of movies A-E; U4's rating of movie E is unknown]
- If the pattern (A = Good) AND (C = Bad) => (E = Good) holds for many users, we can recommend movie E to user U4!
Transaction Data Analysis
- Transactions: a customer's purchases of commodities, e.g., bread, milk, and cheese if they are bought together
- Frequent patterns: product combinations that are frequently purchased together by customers
- Generally, frequent patterns are patterns (sets of items, sequences, etc.) that occur frequently in a database [AIS93]
Frequent Itemsets
Transaction database TDB:
  TID  Items bought
  100  f, a, c, d, g, i, m, p
  200  a, b, c, f, l, m, o
  300  b, f, h, j, o
  400  b, c, k, s, p
  500  a, f, c, e, l, p, m, n
- Itemset: a set of items, e.g., acm = {a, c, m}
- Support of an itemset: the number of transactions containing it, e.g., sup(acm) = 3
- Given min_sup = 3, acm is a frequent pattern
- Frequent pattern mining: finding all frequent patterns in a given database with respect to a given support threshold
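The support computation above can be sketched in a few lines of Python (a minimal illustration; the dictionary layout and the helper name `support` are just for this example):

```python
# The example transaction database, keyed by TID
TDB = {
    100: {"f", "a", "c", "d", "g", "i", "m", "p"},
    200: {"a", "b", "c", "f", "l", "m", "o"},
    300: {"b", "f", "h", "j", "o"},
    400: {"b", "c", "k", "s", "p"},
    500: {"a", "f", "c", "e", "l", "p", "m", "n"},
}

def support(itemset, tdb):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in tdb.values() if itemset <= items)

print(support({"a", "c", "m"}, TDB))  # sup(acm) = 3, frequent for min_sup = 3
```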
A Naïve Attempt
- Generate all possible itemsets, then test their supports against the database
- How can we hold a large number of itemsets in main memory? If there are 100 items, there are 2^100 - 1 possible itemsets
- How can we test the supports of a huge number of itemsets against a large database, say one containing 100 million transactions? For a transaction of length 20, we would need to update the supports of 2^20 - 1 = 1,048,575 itemsets
Transactions in Real Applications
- A large department store often carries more than 100 thousand different kinds of items
- Amazon.com carries more than 17,000 books relevant to data mining
- Walmart has more than 20 million transactions per day
- AT&T produces more than 275 million calls per day
- Mining large transaction databases with many items is a real demand
How to Obtain an Efficient Method?
- Reduce the number of itemsets that need to be checked
- Check the supports of the selected itemsets efficiently
An Anti-Monotonic Property
- Any subset of a frequent itemset must also be frequent (an anti-monotonic property)
- A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
- If {beer, diaper, nuts} is frequent, {beer, diaper} must also be frequent
- Equivalently, any superset of an infrequent itemset must also be infrequent
- No superset of an infrequent itemset needs to be generated or tested: many item combinations can be pruned!
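The pruning argument can be checked directly. A tiny sketch on a made-up toy database (the transactions here are illustrative, not from the slides):

```python
transactions = [
    {"beer", "diaper", "nuts"},
    {"beer", "diaper"},
    {"beer", "milk"},
    {"milk", "bread"},
]
min_sup = 2

def sup(itemset):
    """Support of an itemset in the toy database."""
    return sum(1 for t in transactions if itemset <= t)

# Anti-monotonicity: a superset can never have higher support than any
# of its subsets, since every transaction containing the superset also
# contains the subset.
assert sup({"beer", "diaper", "nuts"}) <= sup({"beer", "diaper"})

# Once {nuts} is found infrequent, every superset of it, e.g.
# {beer, nuts}, can be pruned without ever counting its support.
assert sup({"nuts"}) < min_sup
```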
The Apriori Algorithm: Candidate Generation and Test (the Apriori Principle)
- Find frequent items
- Generate length-(k+1) candidate itemsets from length-k frequent itemsets
- Test the candidates against the database
The Apriori Algorithm: Example
Database D (min_sup = 2):
  TID  Items
  10   a, c, d
  20   b, c, e
  30   a, b, c, e
  40   b, e
Scan 1: 1-candidates a:2, b:3, c:3, d:1, e:3 -> frequent 1-itemsets a:2, b:3, c:3, e:3
Scan 2: 2-candidates ab, ac, ae, bc, be, ce; counts ab:1, ac:2, ae:1, bc:2, be:3, ce:2 -> frequent 2-itemsets ac:2, bc:2, be:3, ce:2
Scan 3: 3-candidate bce; count bce:2 -> frequent 3-itemset bce:2
The Apriori Algorithm [AgSr94]
Require: transaction database TDB, minimum support threshold min_sup
{C_k: the set of length-k candidate itemsets; L_k: the set of length-k frequent itemsets}
L_1 <- {frequent items}; k <- 1
while L_k ≠ ∅ do
  {candidate generation}
  C_{k+1} <- candidates generated from L_k
  sup(X) <- 0 for all X ∈ C_{k+1}
  for all transactions t ∈ TDB and itemsets X ∈ C_{k+1} do
    if X ⊆ t then
      sup(X)++
    end if
  end for
  L_{k+1} <- {X ∈ C_{k+1} | sup(X) >= min_sup}; k++
end while
return L_1 ∪ L_2 ∪ ... ∪ L_k
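The pseudocode above might be rendered in Python roughly as follows (a sketch, not an optimized implementation: it keeps all candidates in memory and uses a naive join):

```python
from itertools import combinations

def apriori(tdb, min_sup):
    """Level-wise candidate generation and test.
    tdb: list of transactions (sets of items); returns {itemset: support}."""
    # L1: find the frequent items with one scan
    counts = {}
    for t in tdb:
        for x in t:
            s = frozenset([x])
            counts[s] = counts.get(s, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_sup}
    result = dict(L)
    k = 1
    while L:
        # candidate generation: join length-k frequent itemsets, then
        # prune any candidate with an infrequent k-subset
        freq = list(L)
        candidates = set()
        for i in range(len(freq)):
            for j in range(i + 1, len(freq)):
                u = freq[i] | freq[j]
                if len(u) == k + 1 and all(
                        frozenset(s) in L for s in combinations(u, k)):
                    candidates.add(u)
        # test the candidates with one scan of the database
        counts = {c: 0 for c in candidates}
        for t in tdb:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        L = {s: c for s, c in counts.items() if c >= min_sup}
        result.update(L)
        k += 1
    return result

# the example database D with min_sup = 2
D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
patterns = apriori(D, 2)
print(patterns[frozenset("bce")])  # 2: the only frequent 3-itemset
```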
How to Find Frequent Items?
Find frequent items using a one-dimensional array of counters:
for all items x do
  c[x] <- 0
end for
for all transactions t do
  for all items x ∈ t do
    c[x]++
  end for
end for
return {x | c[x] >= min_sup}
How to Find Length-2 Frequent Itemsets?
Use a 2-dimensional triangular matrix: for items i, j (i < j), c[i, j] is the count of itemset ij.
for all items i and j such that i < j do
  c[i, j] <- 0
end for
for all transactions t do
  sort the items in t in lexicographic order
  for i = 0 to len(t) - 1 do
    if t[i] is a frequent item then
      for j = i + 1 to len(t) - 1 do
        if t[j] is a frequent item then
          c[t[i], t[j]]++
        end if
      end for
    end if
  end for
end for
return {ij | i < j and c[i, j] >= min_sup}
Implementation
A 2-dimensional triangular matrix can be implemented as a 1-dimensional array. With n items, for items i, j (i < j):
  c[i, j] = c[(i - 1)(2n - i)/2 + j - i]
Example (n = 5): c[3, 5] = c[(3 - 1)(2*5 - 3)/2 + 5 - 3] = c[9]
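The index arithmetic can be written out as a small helper (the function name `tri_index` is illustrative; indices are 1-based as on the slide):

```python
def tri_index(i, j, n):
    """Map the pair (i, j), 1 <= i < j <= n, to its 1-based position
    in the flattened upper-triangular array of n*(n-1)/2 cells."""
    assert 1 <= i < j <= n
    return (i - 1) * (2 * n - i) // 2 + (j - i)

n = 5
print(tri_index(3, 5, n))  # 9, matching the slide: c[3,5] -> c[9]

# Sanity check: all n*(n-1)/2 = 10 pairs map to distinct cells 1..10
cells = {tri_index(i, j, n) for i in range(1, n) for j in range(i + 1, n + 1)}
assert cells == set(range(1, 11))
```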
Candidate Generation Example
Suppose L_3 = {abc, abd, acd, ace, bcd}. How can we generate C_4?
- Self-joining L_3 with L_3: abcd from abc and abd; acde from acd and ace
- Pruning: acde is removed because ade ∉ L_3
- C_4 = {abcd}
Candidate Generation Algorithm
Require: the items in every itemset in L_k are listed in an order R
{self-join L_k}
INSERT INTO C_{k+1}
SELECT p.item_1, p.item_2, ..., p.item_k, q.item_k
FROM L_k p, L_k q
WHERE p.item_1 = q.item_1, ..., p.item_{k-1} = q.item_{k-1}, p.item_k <_R q.item_k
{pruning}
for all itemsets X ∈ C_{k+1} do
  for each k-subset X' of X do
    if X' ∉ L_k then
      C_{k+1} <- C_{k+1} - {X}
    end if
  end for
end for
return C_{k+1}
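The join and prune steps might look like this in Python, with itemsets kept as tuples sorted in the order R (a sketch; `gen_candidates` is an illustrative name):

```python
from itertools import combinations

def gen_candidates(L_k, k):
    """Apriori candidate generation: self-join L_k on the first k-1 items,
    then prune candidates that have an infrequent k-subset."""
    L_set = set(L_k)
    C = set()
    for p in L_k:
        for q in L_k:
            # join condition: same first k-1 items, p's last item before q's
            if p[:k - 1] == q[:k - 1] and p[k - 1] < q[k - 1]:
                cand = p + (q[k - 1],)
                # prune: every k-subset of the candidate must be frequent
                if all(s in L_set for s in combinations(cand, k)):
                    C.add(cand)
    return C

# the example from the previous slide: C_4 = {abcd}, acde is pruned
L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(sorted(gen_candidates(L3, 3)))  # [('a', 'b', 'c', 'd')]
```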
How to Count Supports?
Why is counting the supports of candidates a problem?
- The total number of candidates can be very huge
- One transaction may contain many candidates
Method:
- Candidate itemsets are stored in a hash-tree
- A leaf node of the hash-tree contains a list of itemsets and their counts
- An interior node contains a hash table
- Subset function: finds all the candidates contained in a transaction
Example
[Figure: a hash-tree over the 3-candidates, with hash buckets {1,4,7}, {2,5,8}, {3,6,9}; the subset function matches transaction 1 2 3 5 6 by recursively splitting it into prefix + remainder (1 + 2 3 5 6, 1 2 + 3 5 6, 1 3 + 5 6, ...) and following the corresponding hash branches down to the leaves]
Bottleneck of Frequent Pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many scans and generates many candidates
- To find the frequent itemset i_1 i_2 ... i_100, 100 scans are needed, and the total number of candidates is
  C(100, 1) + C(100, 2) + ... + C(100, 100) = 2^100 - 1 ≈ 1.27 × 10^30
- Bottleneck: candidate generation and test
Search Space of Frequent Pattern Mining
[Figure: the itemset lattice over items A, B, C, D, from {} at the bottom up to ABCD]
Set Enumeration Tree
- Use an order on items and enumerate itemsets in lexicographic order: a, ab, abc, abcd, abd, ac, acd, ad, b, bc, bcd, bd, c, cd, d
- This reduces the lattice to a tree!
[Figure: the set enumeration tree over items a, b, c, d]
Borders of Frequent Itemsets
- The frequent itemsets form a connected region: ∅ is trivially frequent, and for every itemset X on the border, every subset of X is frequent!
[Figure: the border of the frequent itemsets in the set enumeration tree]
Projected Databases
- X-projected database: the set of transactions containing X, TDB_X = {t ∈ TDB | X ⊆ t}
- To test whether itemset Xy is frequent, we can use the X-projected database and check whether item y is frequent in it!
Compressing a Transaction Database by an FP-tree
- The 1st scan finds the frequent items; only frequent items are recorded in the FP-tree
- F-list: f-c-a-b-m-p
- The 2nd scan constructs the tree: order the frequent items in each transaction w.r.t. the f-list, and explore sharing among transactions
  TID  Items bought             (ordered) frequent items
  100  f, a, c, d, g, i, m, p   f, c, a, m, p
  200  a, b, c, f, l, m, o      f, c, a, b, m
  300  b, f, h, j, o            f, b
  400  b, c, k, s, p            c, b, p
  500  a, f, c, e, l, p, m, n   f, c, a, m, p
[Figure: the resulting FP-tree with header table over f, c, a, b, m, p; main path f:4-c:3-a:3-m:2-p:2, with branches b:1 under f, b:1-m:1 under a, and c:1-b:1-p:1 from the root]
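A bare-bones two-scan FP-tree construction could look like this (a sketch: `Node` and `build_fp_tree` are illustrative names, node-links are kept as plain lists, and ties in the f-list are broken arbitrarily):

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, its count, its parent, and children."""
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fp_tree(tdb, min_sup):
    # scan 1: find the frequent items and fix the f-list
    freq = {x: c for x, c in Counter(x for t in tdb for x in t).items()
            if c >= min_sup}
    f_list = sorted(freq, key=lambda x: -freq[x])   # frequency-descending
    # scan 2: insert each transaction's frequent items, ordered by the
    # f-list, so that transactions share common prefixes
    root = Node(None, None)
    header = {x: [] for x in f_list}                # item -> node-links
    for t in tdb:
        node = root
        for x in sorted((x for x in t if x in freq), key=f_list.index):
            if x not in node.children:
                node.children[x] = Node(x, node)
                header[x].append(node.children[x])
            node = node.children[x]
            node.count += 1
    return root, header, f_list

# the example database, min_sup = 3
tdb = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
       set("bcksp"), set("afcelpmn")]
root, header, f_list = build_fp_tree(tdb, min_sup=3)
print(f_list)          # the six frequent items f, c, a, b, m, p (order may vary on ties)
print(len(header["p"]))  # 2: item p sits on two distinct tree paths
```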
Why an FP-tree?
Completeness:
- Never breaks a long pattern of any transaction
- Preserves complete information for frequent pattern mining: no need to scan the database again
Compactness:
- Reduces irrelevant information: infrequent items are removed
- Items are in frequency-descending order (the f-list): the more frequently an item occurs, the more likely it is to be shared
- Never larger than the original database (not counting node-links and the count fields)
Partitioning Frequent Patterns
Frequent patterns can be partitioned into subsets according to the f-list f-c-a-b-m-p:
- Patterns containing p
- Patterns having m but no p
- ...
- Patterns having c but none of a, b, m, or p
- Pattern f
This is a depth-first search of the set enumeration tree. The partitioning is complete and does not have any overlap.
Find Patterns Having Item p
- Only transactions containing p are needed
- Form the p-projected database: start at entry p of the header table, follow the side-links of item p, and accumulate all transformed prefix paths of p
- p-projected database TDB_p: fcam:2, cb:1
- Local frequent item: c:3
- Frequent patterns containing p: p:3, pc:3
Find Patterns Having Item m But No p
- Form the m-projected database TDB_m; item p is excluded (why?)
- TDB_m = {fca:2, fcab:1}
- Local frequent items: f, c, a
- Build the FP-tree for TDB_m: the m-projected FP-tree is the single path f:3-c:3-a:3
Recursive Mining
- Patterns having m but no p can be mined recursively
- Optimization: enumerate patterns from a single-branch FP-tree directly
  - Enumerate all combinations of the items on the branch
  - The support of each pattern is that of its last item
  - Example: m, fm, cm, am, fcm, fam, cam, fcam
(m-projected FP-tree: the single path f:3-c:3-a:3)
Patterns from a Single Prefix
When a (projected) FP-tree has a single prefix path, we can reduce the prefix into one virtual node, mine the two parts separately, and join the mining results.
[Figure: a tree with single prefix a1:n1 - a2:n2 - a3:n3 above branches b1:m1 and c1:k1 (with children c2:k2, c3:k3) is split into the prefix path and the multi-branch lower part]
The FP-growth Algorithm
Pattern-growth: recursively grow frequent patterns by pattern and database partitioning
for each frequent item x do
  construct the x-projected database, and then the x-projected FP-tree
  recursively mine the x-projected FP-tree, until the resulting FP-tree is empty or contains only one path (a single path generates all the combinations of its items, each of which is a frequent pattern)
end for
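The divide-and-conquer idea can be sketched without the tree itself, by recursing on plain x-projected databases (this drops the FP-tree compression but keeps the pattern-growth structure; `pattern_growth` is an illustrative name):

```python
from collections import Counter

def pattern_growth(tdb, min_sup, prefix=()):
    """Grow frequent patterns by recursing on projected databases.
    A fixed item order (here: lexicographic) makes the partitioning
    complete and overlap-free, as on the partitioning slide."""
    results = {}
    counts = Counter()
    for t in tdb:
        counts.update(set(t))
    for x in sorted(counts):
        if counts[x] < min_sup:
            continue
        pattern = prefix + (x,)
        results[pattern] = counts[x]
        # x-projected database, restricted to items after x in the order,
        # so that every pattern is enumerated exactly once
        projected = [[y for y in t if y > x] for t in tdb if x in set(t)]
        results.update(pattern_growth(projected, min_sup, pattern))
    return results

# the Apriori example database D with min_sup = 2
D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
patterns = pattern_growth(D, 2)
print(patterns[("b", "c", "e")])  # 2, same result as Apriori on D
```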
From Itemsets to Sequences
- Itemsets: combinations of items, with no temporal order
- Temporal order is important in many situations, such as time-series databases and sequence databases
- Frequent patterns -> (frequent) sequential patterns
- Application example: mining mobile user trajectories. If the pattern "park a car -> buy a parking ticket -> visit a coffee shop, all within 15 minutes" is frequent, we can recommend a coffee shop on the user's cell phone
- More applications: medical treatments, natural disasters, science and engineering processes, stocks and markets, telephone calling patterns, Web log clickthrough streams, DNA sequences and gene structures
What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences.
  SID  sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>
- A sequence, e.g., <(ef)(ab)(df)cb>: elements in parentheses, such as (ef), are itemsets whose items occur together
- Given a minimum support threshold min_sup = 2, <(ab)c> is a sequential pattern
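Subsequence containment and sequence support can be checked with a small sketch (sequences as lists of item sets; `is_subsequence` and `seq_support` are illustrative names):

```python
def is_subsequence(s, t):
    """s is a subsequence of t if s's elements map, in order, to
    distinct elements of t, each element matching as a subset."""
    i = 0
    for element in t:
        if i < len(s) and s[i] <= element:
            i += 1
    return i == len(s)

def seq_support(s, sdb):
    """Number of database sequences containing s as a subsequence."""
    return sum(1 for t in sdb if is_subsequence(s, t))

# the example sequence database
sdb = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],  # <a(abc)(ac)d(cf)>
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],              # <(ad)c(bc)(ae)>
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],       # <(ef)(ab)(df)cb>
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],          # <eg(af)cbc>
]
pattern = [{"a", "b"}, {"c"}]     # <(ab)c>
print(seq_support(pattern, sdb))  # 2, so <(ab)c> is frequent for min_sup = 2
```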
An Anti-Monotonic Property of Sequential Patterns
If a sequence s is infrequent, then none of the super-sequences of s is frequent.
Example: let min_sup = 2. <hb> is infrequent, so <hab> and <(ah)b> are infrequent.
  Seq-id  Sequence
  10      <(bd)cb(ac)>
  20      <(bf)(ce)b(fg)>
  30      <(ah)(bf)abf>
  40      <(be)(ce)d>
  50      <a(bd)bcb(ade)>
Sequential Pattern Mining Algorithm GSP
Level-wise candidate generation and test, on the database of the previous slide:
- 1st scan: 8 candidates (<a>, <b>, ..., <h>), 6 length-1 sequential patterns
- 2nd scan: 51 candidates, 19 length-2 sequential patterns; 10 candidates (e.g., <aa>, <ab>, <af>, <ba>, <bb>, <ff>, <(ab)>, <(ef)>) not in the DB at all
- 3rd scan: 46 candidates, 19 length-3 sequential patterns; 20 candidates (e.g., <abb>, <aab>, <aba>, <baa>, <bab>) not in the DB at all
- 4th scan: 8 candidates, 6 length-4 sequential patterns; candidates such as <abba> and <(bd)bc> cannot pass the support threshold
- 5th scan: 1 candidate, 1 length-5 sequential pattern: <(bd)cba>
Sequential Pattern Mining Algorithm PrefixSpan
Sequence database SDB:
  SID  sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>
- Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
- Partition by prefix: patterns having prefix <a>, prefix <b>, ..., prefix <f>
- <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
- Length-2 sequential patterns with prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
- Recurse: <aa>-projected database, ..., <af>-projected database
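A stripped-down PrefixSpan for sequences of single items can convey the projection idea (the full algorithm also handles itemset elements and the (_x) notation; this sketch, the toy database, and the name `prefixspan` are illustrative):

```python
from collections import Counter

def prefixspan(sdb, min_sup, prefix=()):
    """For each frequent item x, extend the prefix and recurse on the
    <prefix x>-projected database: the suffix of every sequence after
    the first occurrence of x (the earliest occurrence is always safe)."""
    results = {}
    counts = Counter()
    for s in sdb:
        counts.update(set(s))
    for x in sorted(counts):
        if counts[x] < min_sup:
            continue
        pattern = prefix + (x,)
        results[pattern] = counts[x]
        projected = [s[s.index(x) + 1:] for s in sdb if x in s]
        results.update(prefixspan(projected, min_sup, pattern))
    return results

# toy database: each string is a sequence of single-item elements
sdb = ["abcb", "abc", "cab", "bd"]
res = prefixspan(sdb, 2)
print(res[("a", "b", "c")])  # 2: <abc> occurs in "abcb" and "abc"
```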
Summary
- Frequent patterns: frequent combinations in large transaction databases
- Mining frequent patterns: an anti-monotonic property, the Apriori algorithm, the FP-growth algorithm
- Sequential patterns and their mining: GSP, PrefixSpan
To-Do List
Read the following paper to understand how PrefixSpan mines sequential patterns:
- J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach. IEEE Transactions on Knowledge and Data Engineering, 16(11):1424-1440, November 2004.
There is often redundancy among frequent patterns. Read the following paper to understand how FP-growth can be extended to mine frequent closed itemsets, a type of non-redundant frequent pattern:
- J. Pei, J. Han, and R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. In Proceedings of the 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Dallas, TX, May 2000.