COMP 465: Data Mining
Mining Frequent Patterns, Associations and Correlations
Slides adapted from: Jiawei Han, Micheline Kamber & Jian Pei, Data Mining: Concepts and Techniques, 3rd ed.
2/2/2015, Spring 2015

What Is Frequent Pattern Analysis?

- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
- Motivation: finding inherent regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
- Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

Basic Concepts: Frequent Patterns

  Tid | Items bought
  ----|---------------------------------
  10  | Beer, Nuts, Diaper
  20  | Beer, Coffee, Diaper
  30  | Beer, Diaper, Eggs
  40  | Nuts, Eggs, Milk
  50  | Nuts, Coffee, Diaper, Eggs, Milk

[Venn diagram: customers who buy beer, who buy diapers, and who buy both]

- itemset: a set of one or more items
- k-itemset: X = {x1, ..., xk}
- (absolute) support, or support count, of X: frequency or number of occurrences of itemset X
- (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
- An itemset X is frequent if X's support is no less than a minsup threshold

Basic Concepts: Association Rules

- Find all the rules X → Y with minimum support and confidence
  - support, s: probability that a transaction contains X ∪ Y
  - confidence, c: conditional probability that a transaction having X also contains Y
- Let minsup = 50%, minconf = 50%
- Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
- Association rules (many more!):
  - Beer → Diaper (support 60%, confidence 100%)
  - Diaper → Beer (support 60%, confidence 75%)
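The support and confidence figures above can be checked directly. A minimal sketch in Python over the same five transactions (the function names are illustrative, not from the slides):

```python
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    # relative support: fraction of transactions containing the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # conditional probability that a transaction with lhs also contains rhs
    return support(lhs | rhs) / support(lhs)

print(support({"Beer", "Diaper"}))                 # 0.6 -> 60% support
print(confidence({"Beer"}, {"Diaper"}))            # 1.0 -> 100% confidence
print(round(confidence({"Diaper"}, {"Beer"}), 2))  # 0.75 -> 75% confidence
```

Both rules have the same support ({Beer, Diaper} appears in 3 of 5 transactions) but different confidence, because Diaper occurs in one transaction without Beer.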
Closed Patterns and Max-Patterns

- A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, ..., a100} contains C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27×10^30 sub-patterns!
- Solution: mine closed patterns and max-patterns instead
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
- An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
- Closed patterns are a lossless compression of frequent patterns: they reduce the number of patterns and rules without losing support information

- Exercise: suppose a DB contains only two transactions, <a1, ..., a100> and <a1, ..., a50>, and let min_sup = 1
  - What is the set of closed itemsets? {a1, ..., a100}: 1 and {a1, ..., a50}: 2
  - What is the set of max-patterns? {a1, ..., a100}: 1
  - What is the set of all patterns? {a1}: 2, ..., {a1, a2}: 2, ..., {a1, a51}: 1, ..., {a1, a2, ..., a100}: 1 (a big number: 2^100 - 1)

The Downward Closure Property and Scalable Mining Methods

- The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  - i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods: three major approaches
  - Apriori (Agrawal & Srikant @ VLDB'94)
  - Frequent pattern growth (FPgrowth: Han, Pei & Yin @ SIGMOD'00)
  - Vertical data format approach (Charm: Zaki & Hsiao @ SDM'02)

Apriori: A Candidate Generation & Test Approach

- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated/tested! (Agrawal & Srikant @ VLDB'94; Mannila, et al. @ KDD'94)
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated
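The closed/max definitions can be made concrete with a brute-force sketch (exponential, so only workable on tiny databases; all names are illustrative). It is run on a scaled-down analogue of the exercise, with three items instead of one hundred:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    # Brute-force enumeration of all frequent itemsets with their supports.
    items = sorted(set().union(*transactions))
    freq = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            sup = sum(set(cand) <= t for t in transactions)
            if sup >= min_sup:
                freq[frozenset(cand)] = sup
    return freq

def closed_and_max(freq):
    closed, maximal = [], []
    for x, sup in freq.items():
        supersets = [y for y in freq if x < y]
        # closed: no frequent superset with the same support
        if not any(freq[y] == sup for y in supersets):
            closed.append(x)
        # max-pattern: no frequent superset at all
        if not supersets:
            maximal.append(x)
    return closed, maximal

# Scaled-down analogue of the exercise: two transactions, min_sup = 1.
db = [{"a1", "a2", "a3"}, {"a1", "a2"}]
closed, maximal = closed_and_max(frequent_itemsets(db, 1))
# Closed: {a1,a2} (support 2) and {a1,a2,a3} (support 1); max: {a1,a2,a3} only.
```

As in the exercise, every frequent itemset's support can be recovered from the two closed itemsets, while the single max-pattern records only which itemsets are frequent.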
The Apriori Algorithm: An Example

Database TDB (min_sup = 2):

  Tid | Items
  10  | A, C, D
  20  | B, C, E
  30  | A, B, C, E
  40  | B, E

1st scan, C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan, counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3: {B,C,E}
3rd scan, L3: {B,C,E}:2

The Apriori Algorithm (Pseudo-Code)

  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t
      Lk+1 = candidates in Ck+1 with min_support
  end
  return ∪k Lk;

Implementation of Apriori

- How to generate candidates?
  - Step 1: self-joining Lk
  - Step 2: pruning
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining L3 * L3: abcd from abc and abd; acde from acd and ace
  - Pruning: acde is removed because ade is not in L3
  - C4 = {abcd}

Further Improvement of the Apriori Method

- Major computational challenges:
  - Multiple scans of the transaction database
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Improving Apriori: general ideas
  - Reduce passes of transaction database scans
  - Shrink the number of candidates
  - Facilitate support counting of candidates
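The pseudo-code translates fairly directly into Python. A compact sketch (not an optimized implementation; names are illustrative) using self-join-and-prune candidate generation, run on the worked example's database:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets from one scan
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {x: c for x, c in counts.items() if c >= min_sup}
    result, k = dict(L), 1
    while L:
        # Step 1: self-join Lk; Step 2: prune any candidate
        # that has an infrequent k-subset (downward closure)
        prev = set(L)
        cands = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, k))}
        # One scan to count the surviving candidates
        counts = {c: sum(c <= t for t in transactions) for c in cands}
        L = {c: n for c, n in counts.items() if n >= min_sup}
        result.update(L)
        k += 1
    return result

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
freq = apriori(db, min_sup=2)
# Reproduces the worked example, ending with L3 = {B, C, E} at support 2.
```

On this database the pruning step discards {A,B,C} and {A,C,E} without counting them, because {A,B} and {A,E} are not in L2; only {B,C,E} survives to the third scan, exactly as in the example.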
Partition: Scan Database Only Twice

- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
  - Scan 1: partition the database and find local frequent patterns
  - Scan 2: consolidate global frequent patterns
- A. Savasere, E. Omiecinski, and S. Navathe, VLDB'95
- Let DB1 + DB2 + ... + DBk = DB, with relative threshold σ. If sup1(i) < σ|DB1|, sup2(i) < σ|DB2|, ..., and supk(i) < σ|DBk|, then sup(i) < σ|DB|: an itemset infrequent in every partition cannot be frequent globally.

DHP: Reduce the Number of Candidates

- A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
- Example: candidates a, b, c, d, e; hash table over 2-itemsets:

  count | itemsets
  35    | {ab, ad, ae}
  88    | {bd, be, de}
  ...   | ...
  102   | {yz, qs, wt}

- Frequent 1-itemsets: a, b, d, e
- ab is not a candidate 2-itemset if the sum of the counts of bucket {ab, ad, ae} is below the support threshold
- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95 (DHP: Direct Hashing and Pruning)

Sampling for Frequent Patterns

- Select a sample of the original database; mine frequent patterns within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample; only borders of the closure of frequent patterns are checked
  - Example: check abcd instead of ab, ac, ..., etc.
- Scan the database again to find missed frequent patterns
- H. Toivonen. Sampling large databases for association rules. VLDB'96

DIC: Reduce Number of Scans

[Figure: Apriori itemset lattice from A, B, C, D up to ABCD, with transactions counted in stages of 1-itemsets, 2-itemsets, 3-itemsets]

- Once both A and D are determined frequent, the counting of AD begins
- Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
- S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD'97
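The Partition method's two scans can be sketched as follows. The local miner here is a deliberately naive brute-force stand-in for any in-memory algorithm, and all names are illustrative:

```python
from itertools import combinations

def local_frequent(part, ratio):
    # Brute-force local miner; stands in for any in-memory algorithm.
    min_count = ratio * len(part)
    items = sorted(set().union(*part))
    freq = set()
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            if sum(set(cand) <= t for t in part) >= min_count:
                freq.add(frozenset(cand))
    return freq

def partition_mine(transactions, ratio, n_parts=2):
    # Scan 1: mine each partition with the same *relative* threshold;
    # a globally frequent itemset must be locally frequent somewhere.
    size = -(-len(transactions) // n_parts)  # ceiling division
    parts = [transactions[i:i + size]
             for i in range(0, len(transactions), size)]
    candidates = set().union(*(local_frequent(p, ratio) for p in parts))
    # Scan 2: one counting pass over the full DB to confirm candidates.
    return {c for c in candidates
            if sum(c <= t for t in transactions) >= ratio * len(transactions)}

db = [{"Beer", "Nuts", "Diaper"}, {"Beer", "Coffee", "Diaper"},
      {"Beer", "Diaper", "Eggs"}, {"Nuts", "Eggs", "Milk"},
      {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"}]
globally_frequent = partition_mine(db, ratio=0.6)
# {Beer, Diaper} survives (3 of 5 transactions); {Milk} is a local
# candidate in the second partition but fails the global check.
```

The second scan never misses anything: by the contrapositive stated above, an itemset frequent in no partition cannot reach the global threshold, so the union of local results is a complete candidate set.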
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation

- Bottlenecks of the Apriori approach
  - Breadth-first (i.e., level-wise) search
  - Candidate generation and test: often generates a huge number of candidates
- The FPGrowth approach (J. Han, J. Pei, and Y. Yin, SIGMOD'00)
  - Depth-first search
  - Avoids explicit candidate generation
- Major philosophy: grow long patterns from short ones using local frequent items only
  - "abc" is a frequent pattern; get all transactions having abc, i.e., project the DB on abc: DB|abc
  - "d" is a local frequent item in DB|abc, so abcd is a frequent pattern

Construct FP-tree from a Transaction Database

  TID | Items bought             | (ordered) frequent items
  100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
  200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
  300 | {b, f, h, j, o, w}       | {f, b}
  400 | {b, c, k, s, p}          | {c, b, p}
  500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

min_support = 3

1. Scan the DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order: the f-list
3. Scan the DB again, construct the FP-tree

Header table (item: frequency, each with the head of its node-links): f:4, c:4, a:3, b:3, m:3, p:3
F-list = f-c-a-b-m-p

Resulting FP-tree:

  root
  ├─ f:4
  │  ├─ c:3 ─ a:3
  │  │        ├─ m:2 ─ p:2
  │  │        └─ b:1 ─ m:1
  │  └─ b:1
  └─ c:1 ─ b:1 ─ p:1

Partition Patterns and Databases

- Frequent patterns can be partitioned into subsets according to the f-list (f-c-a-b-m-p):
  - Patterns containing p
  - Patterns having m but no p
  - ...
  - Patterns having c but no a, b, m, or p
  - Patterns containing only f
- This partitioning is complete and non-redundant

Find Patterns Having p From p's Conditional Database

- Start at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node-links of each frequent item p
- Accumulate all transformed prefix paths of item p to form p's conditional pattern base

  item | conditional pattern base
  c    | f:3
  a    | fc:3
  b    | fca:1, f:1, c:1
  m    | fca:2, fcab:1
  p    | fcam:2, cb:1
From Conditional Pattern-Bases to Conditional FP-trees

- For each pattern-base:
  - Accumulate the count for each item in the base
  - Construct the FP-tree for the frequent items of the pattern base
- m's conditional pattern base: fca:2, fcab:1
- m-conditional FP-tree (min_support = 3): the single path f:3 → c:3 → a:3 (b is dropped: its count 1 is below the threshold)
- All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

Recursion: Mining Each Conditional FP-tree

- Conditional pattern base of "am": (fc:3) → am-conditional FP-tree: f:3 → c:3
- Conditional pattern base of "cm": (f:3) → cm-conditional FP-tree: f:3
- Conditional pattern base of "cam": (f:3) → cam-conditional FP-tree: f:3

A Special Case: Single Prefix Path in FP-tree

- Suppose a (conditional) FP-tree T has a shared single prefix-path P: a1:n1 → a2:n2 → a3:n3, below which the tree branches into subtrees b1:m1 and c1:k1 (with c2:k2 and c3:k3 beneath)
- Mining can be decomposed into two parts:
  - Reduction of the single prefix path into one node r1 = a1:n1 → a2:n2 → a3:n3
  - Concatenation of the mining results of the two parts (r1 combined with the branching part)

Benefits of the FP-tree Structure

- Completeness
  - Preserves complete information for frequent pattern mining
  - Never breaks a long pattern of any transaction
- Compactness
  - Reduces irrelevant info: infrequent items are gone
  - Items are in frequency-descending order: the more frequently an item occurs, the more likely it is to be shared
  - Never larger than the original database (not counting node-links and the count field)
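The recursion on conditional pattern bases can be sketched without explicit tree nodes by representing each conditional database as (prefix-path, count) pairs. This is a simplified, tree-free formulation for clarity, not the original algorithm's data structure:

```python
from collections import Counter

def fp_growth(cond_db, min_sup, suffix=()):
    # cond_db: list of (path, count) pairs; paths share a consistent
    # item ordering (the f-list order), as in the FP-tree.
    counts = Counter()
    for path, n in cond_db:
        for item in path:
            counts[item] += n
    patterns = {}
    for item, total in counts.items():
        if total < min_sup:
            continue
        new_suffix = (item,) + suffix
        patterns[frozenset(new_suffix)] = total
        # item's conditional pattern base: its prefix paths, with its counts
        cond = [(path[:path.index(item)], n)
                for path, n in cond_db if item in path]
        patterns.update(fp_growth(cond, min_sup, new_suffix))
    return patterns

# Ordered transactions of the running example (f-list order f,c,a,b,m,p):
db = [(["f", "c", "a", "m", "p"], 1), (["f", "c", "a", "b", "m"], 1),
      (["f", "b"], 1), (["c", "b", "p"], 1), (["f", "c", "a", "m", "p"], 1)]
pats = fp_growth(db, min_sup=3)
# e.g. {f, c, a, m} is found with support 3, via m's pattern base fca:2, fcab:1.
```

Tracing item m reproduces the slide: its conditional base is fca:2 plus fcab:1, b falls below the threshold, and the surviving single path f-c-a yields m, fm, cm, am, fcm, fam, cam, and fcam, all with support 3.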
The Frequent Pattern Growth Mining Method

- Idea: frequent pattern growth
  - Recursively grow frequent patterns by pattern and database partition
- Method
  - For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Until the resulting FP-tree is empty, or it contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern

Scaling FP-growth by Database Projection

- What if the FP-tree cannot fit in memory? DB projection
  - First partition the database into a set of projected DBs
  - Then construct and mine an FP-tree for each projected DB
- Parallel projection vs. partition projection techniques
  - Parallel projection: project the DB in parallel for each frequent item; space-costly, but all the partitions can be processed in parallel
  - Partition projection: partition the DB based on the ordered frequent items, passing the unprocessed parts to the subsequent partitions

Partition-Based Projection

- Parallel projection needs a lot of disk space; partition projection saves it

[Figure: the transaction DB (fcamp, fcabm, fb, cbp, fcamp) split into p-proj, m-proj, b-proj, a-proj, c-proj, and f-proj DBs, with the m-proj DB further split into am-, cm-, ... proj DBs]

FP-Growth vs. Apriori: Scalability With the Support Threshold

[Figure: run time (sec) vs. support threshold (%) on data set T25I20D10K, comparing D1 FP-growth runtime and D1 Apriori runtime]
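The single-path base case above ("a single path generates all the combinations of its sub-paths") can be sketched directly; the support of each combination is the minimum count along the chosen nodes. Names are illustrative:

```python
from itertools import combinations

def single_path_patterns(path):
    # path: (item, count) pairs along one prefix path, root side first.
    patterns = {}
    for k in range(1, len(path) + 1):
        for combo in combinations(path, k):
            items = frozenset(item for item, _ in combo)
            # support of a sub-pattern = smallest count among its nodes
            patterns[items] = min(count for _, count in combo)
    return patterns

pats = single_path_patterns([("f", 4), ("c", 3), ("a", 3)])
# 2^3 - 1 = 7 patterns; e.g. {f} has support 4, {f, c, a} has support 3.
```

This is why the recursion can stop as soon as a conditional FP-tree collapses to one path: no further projection is needed, just the enumeration above.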
FP-Growth vs. Tree-Projection: Scalability with the Support Threshold

[Figure: runtime (sec) vs. support threshold (%) on data set T25I20D100K, comparing D2 FP-growth and D2 TreeProjection]

Advantages of the Pattern Growth Approach

- Divide-and-conquer
  - Decompose both the mining task and the DB according to the frequent patterns obtained so far
  - Leads to a focused search of smaller databases
- Other factors
  - No candidate generation, no candidate test
  - Compressed database: the FP-tree structure
  - No repeated scan of the entire database
  - Basic operations: counting local frequent items and building sub-FP-trees; no pattern search and matching
- A good open-source implementation and refinement of FPGrowth: FPGrowth+ (Grahne and J. Zhu, FIMI'03)

Interestingness Measure: Correlations (Lift)

- play basketball ⇒ eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75%, which is higher than 66.7%
- play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
- Measure of dependent/correlated events:

  lift(A, B) = P(A ∪ B) / (P(A) × P(B))

  where P(A ∪ B) is the probability that a transaction contains both A and B

             | Basketball | Not basketball | Sum (row)
  Cereal     | 2000       | 1750           | 3750
  Not cereal | 1000       | 250            | 1250
  Sum (col)  | 3000       | 2000           | 5000

  lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
  lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33

Are Lift and χ² Good Measures of Correlation?

- "buy walnuts ⇒ buy milk" [1%, 80%] is misleading if 85% of customers buy milk
- Support and confidence are not good indicators of correlation
- Over 20 interestingness measures have been proposed (see Tan, Kumar, Srivastava @ KDD'02). Which are good ones?
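The lift arithmetic from the contingency table can be reproduced from the raw counts (a one-line sketch; the function name is illustrative):

```python
def lift(n_ab, n_a, n_b, n):
    # lift(A, B) = P(A and B) / (P(A) * P(B)), computed from raw counts
    return (n_ab / n) / ((n_a / n) * (n_b / n))

# Basketball (B) vs. cereal (C), n = 5000 students:
print(round(lift(2000, 3000, 3750, 5000), 2))  # 0.89 -> B and C negatively correlated
print(round(lift(1000, 3000, 1250, 5000), 2))  # 1.33 -> B and not-C positively correlated
```

A lift below 1 means the two events co-occur less often than independence would predict, which is exactly why the high-confidence rule "basketball ⇒ cereal" is misleading here.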
Summary

- Basic concepts: association rules, the support-confidence framework, closed and max-patterns
- Scalable frequent pattern mining methods
  - Apriori (candidate generation & test)
  - Projection-based (FPgrowth)
- Which patterns are interesting? Pattern evaluation methods