Data Mining for Knowledge Management. Association Rules


Data Mining for Knowledge Management: Association Rules
Themis Palpanas, University of Trento
http://disi.unitn.eu/~themis
Thanks for slides to: Jiawei Han, George Kollios, Zhenyu Lu, Osmar R. Zaïane, Mohammad El-Hajj, Yu-ting Kung

Frequent Pattern Mining

Input: a transaction database DB and a minimum support threshold ξ.
Output: all frequent patterns (itemsets) with support no less than ξ.

DB:
TID  Items bought
100  {f, a, c, d, g, i, m, p}
200  {a, b, c, f, l, m, o}
300  {b, f, h, j, o}
400  {b, c, k, s, p}
500  {a, f, c, e, l, p, m, n}

With minimum support ξ = 3, the output is all frequent patterns, i.e., f, a, ..., fa, fac, fam, ...

Problem: how do we efficiently find all frequent patterns?

Apriori

The core of the Apriori algorithm:
- Use the frequent (k-1)-itemsets (L_{k-1}) to generate the candidate frequent k-itemsets C_k.
- Scan the database and count each pattern in C_k to obtain the frequent k-itemsets L_k.

E.g., on the DB above, the Apriori iterations are:
C1: f, a, c, d, g, i, m, p, l, o, h, j, k, s, b, e, n
L1: f, a, c, m, b, p
C2: fa, fc, fm, fp, ac, am, ..., bp
L2: fa, fc, fm, ...
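The generate-and-test loop described above can be sketched in Python. This is a minimal in-memory version for illustration (the function name and representation are my own, not from the slides), run on the slide's toy database with ξ = 3:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return every frequent itemset (frozenset -> support count)."""
    # L1: count single items and keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup}
    frequent = {s: counts[s] for s in Lk}
    k = 2
    while Lk:
        # candidate generation: join L(k-1) with itself
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Apriori pruning: drop candidates with an infrequent (k-1)-subset
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # scan the DB and count the surviving candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in Ck}
        Lk = {c for c, n in counts.items() if n >= min_sup}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent

db = [set('facdgimp'), set('abcflmo'), set('bfhjo'),
      set('bcksp'), set('afcelpmn')]
freq = apriori(db, 3)
```

Even on this tiny database the loop performs one full database scan per candidate length, which is exactly the cost FP-growth is designed to remove.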

Performance Bottlenecks of Apriori

The bottleneck of Apriori is candidate generation:
- Huge candidate sets: 10^4 frequent 1-itemsets generate on the order of 10^7 candidate 2-itemsets.
- To discover a frequent pattern of size 100, e.g., {a_1, a_2, ..., a_100}, one needs to generate 2^100 ≈ 10^30 candidates.
- Multiple scans of the database: one scan for each candidate length.

Ideas

- Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure: highly condensed, but complete for frequent pattern mining, and avoiding costly database scans.
- Develop an efficient FP-tree-based frequent pattern mining method (FP-growth): a divide-and-conquer methodology that decomposes mining tasks into smaller ones and avoids candidate generation (sub-database tests only).
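The counts above are easy to verify; note that C(10^4, 2) is in fact about 5 × 10^7, the order of magnitude the slide rounds to:

```python
from math import comb

candidate_pairs = comb(10**4, 2)   # 2-itemset candidates from 10^4 frequent items
subsets = 2**100                   # candidate subsets of a size-100 pattern
print(candidate_pairs)             # 49995000, roughly 5 * 10**7
print(len(str(subsets)))           # 31 digits, i.e. about 10**30
```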

Mining Frequent Patterns Without Candidate Generation

Grow long patterns from short ones using local frequent items:
- abc is a frequent pattern.
- Get all transactions containing abc: this is the projected database DB|abc.
- If d is a local frequent item in DB|abc, then abcd is a frequent pattern.

FP-tree Construction from a Transactional DB (min_support = 3)

TID  Items bought               (ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o}            {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

Item frequencies: f:4, c:4, a:3, b:3, m:3, p:3

Steps:
1. Scan the DB once to find the frequent 1-itemsets (single-item patterns).
2. Order the frequent items in each transaction in descending order of frequency.
3. Scan the DB again and construct the FP-tree.
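The three construction steps can be sketched as a two-pass builder. This is a minimal sketch with class and function names of my own choosing; note that ties between equally frequent items are broken lexicographically here, so c sorts before f, whereas the slides use the f-list order f, c, a, b, m, p:

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}          # item -> FPNode
        self.node_link = None       # next node carrying the same item

def build_fptree(transactions, min_sup):
    # Pass 1: count items and keep the frequent ones
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    # Pass 2: insert each transaction, items in frequency-descending order
    root = FPNode(None, None)
    header = {}                     # item -> head of its node-link chain
    for t in transactions:
        ordered = sorted((i for i in t if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item in node.children:
                node.children[item].count += 1
            else:
                child = FPNode(item, node)
                node.children[item] = child
                # thread the new node onto the item's node-link chain
                child.node_link = header.get(item)
                header[item] = child
            node = node.children[item]
    return root, header, freq

db = [set('facdgimp'), set('abcflmo'), set('bfhjo'),
      set('bcksp'), set('afcelpmn')]
root, header, freq = build_fptree(db, 3)
```

The node-links threaded through `header` are what Step 1 of FP-growth later follows to collect each item's prefix paths.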

FP-tree Construction

Insert the (ordered) frequent items of each transaction into the tree, sharing common prefixes and incrementing counts:
- after {f, c, a, m, p}: a single path f:1 - c:1 - a:1 - m:1 - p:1
- after {f, c, a, b, m}: the shared prefix becomes f:2 - c:2 - a:2, with a new branch b:1 - m:1
- after {f, b}: f:3, with a new branch b:1 under f
- after {c, b, p}: a new path c:1 - b:1 - p:1 under the root
- after {f, c, a, m, p}: the counts along the first path are incremented again

The final FP-tree (min_support = 3), together with the header table (item, frequency, head of node-links; f:4, c:4, a:3, b:3, m:3, p:3):

root
├─ f:4
│  ├─ c:3 ─ a:3 ─ m:2 ─ p:2
│  │            └─ b:1 ─ m:1
│  └─ b:1
└─ c:1 ─ b:1 ─ p:1

FP-Tree Definition

An FP-tree is a frequent-pattern tree, defined as follows. It consists of:
- one root labeled "null",
- a set of item-prefix subtrees as the children of the root, and
- a frequent-item header table.

Each node in the item-prefix subtrees has three fields:
- item-name: registers which item this node represents;
- count: the number of transactions represented by the portion of the path reaching this node;
- node-link: links to the next node in the FP-tree carrying the same item-name, or null if there is none.

Each entry in the frequent-item header table has two fields: item-name, and head of node-link, which points to the first node in the FP-tree carrying that item-name.

Benefits of the FP-tree Structure

Completeness:
- Preserves complete information for frequent pattern mining.
- Never breaks a long pattern of any transaction.

Compactness:
- Reduces irrelevant information: infrequent items are gone.
- Items are in frequency-descending order: the more frequently an item occurs, the more likely its nodes are shared.
- The tree is never larger than the original database (not counting node-links and the count fields).
- For the Connect-4 DB, the compression ratio can be over 100.

Partition Patterns and Databases

Frequent patterns can be partitioned into subsets according to the f-list. With f-list = f-c-a-b-m-p:
- patterns containing p
- patterns containing m but not p
- ...
- patterns containing c but none of a, b, m, p
- the pattern f

This partitioning is complete and non-redundant.

FP-Tree Design Choice

Why are the items in the FP-tree ordered in descending frequency?

Example 1: the same itemset inserted in two different item orders produces no sharing:

TID  (unordered) frequent items
100  {f, a, c, m, p}
500  {a, f, c, p, m}

Inserted as given, these form two separate 5-node branches under the root (f:1 - a:1 - c:1 - m:1 - p:1 and a:1 - f:1 - c:1 - p:1 - m:1) instead of one shared path.

FP-Tree Design Choice (cont.)

Example 2: items ordered by ascending frequency:

TID  (ascending) frequent items
100  {p, m, a, c, f}
200  {m, b, a, c, f}
300  {b, f}
400  {p, b, c}
500  {p, m, a, c, f}

The resulting tree is larger than the FP-tree, because in the FP-tree the more frequent items sit higher, which lets branches be shared more often.

Mining Frequent Patterns Using the FP-tree: FP-Growth

General idea (divide and conquer): recursively grow frequent patterns using the FP-tree, looking for shorter patterns recursively and then concatenating the suffix.

Method:
- For each frequent item, construct its conditional pattern base, and then its conditional FP-tree.
- Repeat the process on each newly created conditional FP-tree until the resulting FP-tree is empty or contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern).
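The size claim in the two ordering examples can be quantified by counting distinct prefixes, since each distinct prefix corresponds to one tree node. A small sketch (the helper name tree_size is mine):

```python
def tree_size(transactions, order):
    """Number of nodes in the prefix tree built with items inserted in `order`."""
    prefixes = set()
    for t in transactions:
        ordered = [i for i in order if i in t]
        for k in range(1, len(ordered) + 1):
            prefixes.add(tuple(ordered[:k]))   # each distinct prefix = one node
    return len(prefixes)

db = [set('fcamp'), set('fcabm'), set('fb'), set('cbp'), set('fcamp')]
desc = tree_size(db, 'fcabmp')   # frequency-descending f-list order
asc = tree_size(db, 'pmbacf')    # frequency-ascending order (Example 2)
print(desc, asc)                 # 11 14
```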

Principles of FP-Growth

Pattern growth property: let α be a frequent itemset in DB, CPB be α's conditional pattern base, and β be an itemset in CPB. Then α ∪ β is a frequent itemset in DB iff β is frequent in CPB.

Is fcabm a frequent pattern?
- fcab is a branch of m's conditional pattern base.
- b is NOT frequent in the transactions containing fcab.
- So bm is NOT a frequent itemset, and neither is fcabm.

3 Major Steps

Starting the processing from the end of the f-list L:
Step 1: construct the conditional pattern base for each item in the header table.
Step 2: construct the conditional FP-tree from each conditional pattern base.
Step 3: recursively mine the conditional FP-trees and grow the frequent patterns obtained so far. If a conditional FP-tree contains a single path, simply enumerate all the patterns.

Step 1: Construct the Conditional Pattern Base

- Start at the bottom of the frequent-item header table of the FP-tree.
- Traverse the FP-tree by following the node-links of each frequent item.
- Accumulate all the transformed prefix paths of that item to form its conditional pattern base.

Conditional pattern bases:
item  conditional pattern base
p     fcam:2, cb:1
m     fca:2, fcab:1
b     fca:1, f:1, c:1
a     fc:3
c     f:3
f     {}

Properties of Step 1

Node-link property: for any frequent item a_i, all possible frequent patterns containing a_i can be obtained by following a_i's node-links, starting from a_i's head in the FP-tree header table.

Prefix path property: to calculate the frequent patterns for a node a_i in a path P, only the prefix sub-path of a_i in P needs to be accumulated, and its frequency count should carry the same count as node a_i.
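Following the node-links is equivalent to collecting, for each item, the prefix that precedes it in every frequency-ordered transaction; that gives a compact way to sketch Step 1 (helper name mine):

```python
from collections import Counter

def conditional_pattern_bases(transactions, f_list):
    """Map each frequent item to its conditional pattern base:
    the prefix paths (as tuples) that precede it, with their counts."""
    cpb = {i: Counter() for i in f_list}
    for t in transactions:
        ordered = [i for i in f_list if i in t]
        for k, item in enumerate(ordered):
            if k:                                  # skip empty prefixes
                cpb[item][tuple(ordered[:k])] += 1
    return cpb

db = [set('fcamp'), set('fcabm'), set('fb'), set('cbp'), set('fcamp')]
cpb = conditional_pattern_bases(db, 'fcabmp')
print(cpb['m'])   # Counter({('f', 'c', 'a'): 2, ('f', 'c', 'a', 'b'): 1})
print(cpb['p'])   # Counter({('f', 'c', 'a', 'm'): 2, ('c', 'b'): 1})
```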

Step 2: Construct the Conditional FP-tree

For each conditional pattern base:
- Accumulate the count for each item in the base.
- Construct the conditional FP-tree over the frequent items of the base.

E.g. (min support = 3), m's conditional pattern base is {fca:2, fcab:1}; b drops out, yielding the m-conditional FP-tree, a single path f:3 - c:3 - a:3.

Conditional pattern bases and conditional FP-trees (in order of the f-list):

Item  Conditional pattern base    Conditional FP-tree
p     {(fcam:2), (cb:1)}          {(c:3)} | p
m     {(fca:2), (fcab:1)}         {(f:3, c:3, a:3)} | m
b     {(fca:1), (f:1), (c:1)}     Empty
a     {(fc:3)}                    {(f:3, c:3)} | a
c     {(f:3)}                     {(f:3)} | c
f     Empty                       Empty

Step 3: Recursively Mine the Conditional FP-trees

Starting from the conditional FP-tree of m, (f:3, c:3, a:3):
- add a: the conditional FP-tree of am is (fc:3)
  - add c: the conditional FP-tree of cam is (f:3); add f: fcam
  - add f: the conditional FP-tree of fam is empty
- add c: the conditional FP-tree of cm is (f:3); add f: fcm
- add f: the conditional FP-tree of fm is empty

Single FP-tree Path Generation

Suppose an FP-tree T has a single path P. The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P. For the m-conditional FP-tree (the single path f:3 - c:3 - a:3), all frequent patterns concerning m are the combinations of {f, c, a} concatenated with m: m, fm, cm, am, fcm, fam, cam, fcam.
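The recursion of Step 3 can be sketched without explicit tree nodes by carrying each conditional database as (itemset, count) pairs. This is an illustration of the pattern-growth idea, not the paper's pointer-based FP-tree implementation; the names are mine:

```python
from collections import defaultdict

def fpgrowth(cond_db, min_sup, suffix=()):
    """Pattern growth over a conditional DB of (items, count) pairs."""
    counts = defaultdict(int)
    for items, n in cond_db:
        for i in items:
            counts[i] += n
    freq = {i: n for i, n in counts.items() if n >= min_sup}
    # local f-list: frequency-descending (ties broken lexicographically)
    f_list = sorted(freq, key=lambda i: (-freq[i], i))
    result = {}
    # process items from the bottom of the f-list, as FP-growth does
    for pos in range(len(f_list) - 1, -1, -1):
        item = f_list[pos]
        pattern = (item,) + suffix
        result[frozenset(pattern)] = freq[item]
        # conditional pattern base of `item`: the transactions containing
        # it, restricted to the items above it in the local f-list
        allowed = set(f_list[:pos])
        proj = [(items & allowed, n) for items, n in cond_db if item in items]
        result.update(fpgrowth(proj, min_sup, pattern))
    return result

db = [(set('facdgimp'), 1), (set('abcflmo'), 1), (set('bfhjo'), 1),
      (set('bcksp'), 1), (set('afcelpmn'), 1)]
patterns = fpgrowth(db, 3)
```

On the slide database this finds the same 18 frequent itemsets as Apriori, including fcam:3, but each recursive call only touches the (much smaller) projected database of its suffix.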

A Special Case: Single Prefix Path in an FP-tree

Suppose a (conditional) FP-tree T has a shared single prefix path P, e.g. a chain a1:n1 - a2:n2 - a3:n3 from the root, below which the tree branches (b1:m1, c1:k1, c2:k2, c3:k3). Mining can be decomposed into two parts:
- reduction of the single prefix path into one node r1, and
- concatenation of the mining results of the two parts.

FP-growth Example: Mining E (min support = 2)

[FP-tree figure: root null with children A:7 and B:3; below them B:5, C:1, C:3, D:1 nodes, with E:1 nodes terminating several paths.]

Build the conditional pattern base for E: P = {(A:1,C:1,D:1), (A:1,D:1), (B:1,C:1)}

FP-growth: Mining E

Conditional pattern base for E: P = {(A:1,C:1,D:1), (A:1,D:1), (B:1,C:1)}
The count for E is 3, so {E} is a frequent itemset. Recursively apply FP-growth on P. The conditional tree for E (min support = 2) has root null with the paths A:2 - C:1 - D:1, A:2 - D:1, and B:1 - C:1.

FP-growth: Mining DE

Build the conditional pattern base for D within the conditional tree for E: P = {(A:1,C:1,D:1), (A:1,D:1)}. The count for D is 2, so {D,E} is a frequent itemset. The conditional tree for D within E reduces to the single node A:2.

FP-growth: Mining CDE

Build the conditional pattern base for C within D within E: P = {(A:1,C:1)}. The count for C is 1, so {C,D,E} is NOT a frequent itemset.

FP-growth: Mining ADE

Conditional tree for A within D within E: the single node A:2. The count for A is 2, so {A,D,E} is a frequent itemset.

Scaling FP-growth by DB Projection

What if the FP-tree cannot fit in memory? Use DB projection:
- First partition the database into a set of projected DBs.
- Then construct and mine an FP-tree for each projected DB.
- Parallel projection vs. partition projection techniques: parallel projection is space-costly.

Partition-based Projection

Parallel projection needs a lot of disk space; partition projection saves it.

Tran. DB: fcamp, fcabm, fb, cbp, fcamp
p-proj DB: fcam, cb, fcam
m-proj DB: fcab, fca, fca
b-proj DB: f, cb
a-proj DB: fc
c-proj DB: f
f-proj DB: (empty)
am-proj DB: fc, fc, fc
cm-proj DB: f, f, f

FP-Growth vs. Apriori: Scalability with the Support Threshold

[Figure: run time (sec.) vs. support threshold (%) on data set T25I20D10K, comparing D1 FP-growth runtime and D1 Apriori runtime over thresholds from 0 to 3%.]

FP-Growth vs. Tree-Projection: Scalability with the Support Threshold

[Figure: runtime (sec.) vs. support threshold (%) on data set T25I20D100K, comparing D2 FP-growth and D2 TreeProjection over thresholds from 0 to 2%.]

Why Is FP-Growth the Winner?

Divide and conquer:
- It decomposes both the mining task and the DB according to the frequent patterns obtained so far.
- This leads to a focused search of smaller databases.

Other factors:
- No candidate generation, no candidate test.
- Compressed database: the FP-tree structure.
- No repeated scans of the entire database.
- Basic operations are counting local frequent items and building sub-FP-trees; no pattern search and matching.

Implications of the Methodology

- Mining closed frequent itemsets and max-patterns: CLOSET (DMKD'00), CLOSET+ (KDD'03)
- Mining sequential patterns: FreeSpan (KDD'00), PrefixSpan (ICDE'01)
- Constraint-based mining of frequent patterns: convertible constraints (KDD'00, ICDE'01)
- Computing iceberg data cubes with complex measures: H-tree and H-cubing algorithm (SIGMOD'01)

Frequent Closed Itemsets

Definition: an itemset Y is a frequent closed itemset if it is frequent and there exists no proper superset Y' ⊃ Y such that sup(Y') = sup(Y).

Example (min_sup = 2, f_list = <f:4, c:4, a:3, b:3, m:3, p:3>): is fc a frequent closed pattern? No: its superset fcam has the same support.

Pruning Techniques (for Closed Itemsets)

Item merging. Definition: let X be a frequent itemset. If every transaction containing itemset X also contains itemset Y, but no proper superset of Y, then X ∪ Y forms a frequent closed itemset and there is no need to search any itemset containing X but not Y.
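A brute-force reference implementation makes the definition concrete, here on the running example's transactions with min_sup = 3. It is exponential in the number of items, so for illustration only; CLOSET/CLOSET+ exist precisely to avoid this enumeration:

```python
from itertools import combinations

def support(itemset, db):
    return sum(1 for t in db if itemset <= t)

def closed_frequent_itemsets(db, min_sup):
    """All frequent itemsets with no proper superset of equal support."""
    items = sorted(set().union(*db))
    freq = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = support(set(combo), db)
            if s >= min_sup:
                freq[frozenset(combo)] = s
    # Y is closed iff no frequent proper superset has the same support
    return {x: s for x, s in freq.items()
            if not any(x < y and s == sy for y, sy in freq.items())}

db = [set('fcamp'), set('fcabm'), set('fb'), set('cbp'), set('fcamp')]
closed = closed_frequent_itemsets(db, 3)
print(frozenset('fc') in closed)     # False: fcam has the same support, 3
print(frozenset('fcam') in closed)   # True
```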

Pruning Techniques (cont.)

Item merging, example: the projected conditional database for the prefix itemset fc:3 is {(a,m,p), (a,m,p), (a,b,m)}. Since am:3 is contained in every transaction of this projection, am is merged with fc, giving the frequent closed itemset fcam:3.

Sub-itemset pruning. Definition: let X be the frequent itemset currently under consideration. If X is a proper subset of an already found frequent closed itemset Y and sup(X) = sup(Y), then X and all of X's descendants in the search tree cannot be frequent closed itemsets.

Pruning Techniques (cont.)

Sub-itemset pruning, example: the closed itemset fcam:3 has already been found. Now, when mining the patterns with the prefix itemset ca:3:
- identify that ca:3 is a proper subset of fcam:3 with the same support;
- stop mining the closed patterns with prefix ca:3.

CLOSET+

In CLOSET+, a hybrid tree-projection method is developed, which builds conditional projected databases in two ways:
- dense datasets: bottom-up physical tree-projection;
- sparse datasets: top-down pseudo tree-projection.