COMP 465: Data Mining
Mining Frequent Patterns, Associations and Correlations


What Is Frequent Pattern Analysis?
Slides adapted from: Jiawei Han, Micheline Kamber & Jian Pei, Data Mining: Concepts and Techniques, 3rd ed.
COMP 465: Data Mining, Spring 2015

- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
- Motivation: finding inherent regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
- Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

Basic Concepts: Frequent Patterns

Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

- itemset: a set of one or more items
- k-itemset: X = {x1, ..., xk}
- (absolute) support, or support count, of X: the frequency, or number of occurrences, of itemset X
- (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
- An itemset X is frequent if X's support is no less than a minsup threshold

Basic Concepts: Association Rules

- Find all rules X -> Y with minimum support and confidence
  - support, s: the probability that a transaction contains X ∪ Y
  - confidence, c: the conditional probability that a transaction containing X also contains Y
- Let minsup = 50%, minconf = 50%
- Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
- Association rules (many more exist):
  - Beer -> Diaper (support 60%, confidence 100%)
  - Diaper -> Beer (support 60%, confidence 75%)

Closed Patterns and Max-Patterns

- A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, ..., a100} contains C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 * 10^30 sub-patterns!
- Solution: mine closed patterns and max-patterns instead
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier et al. @ ICDT'99)
- An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
- Closed patterns are a lossless compression of the frequent patterns
  - They reduce the number of patterns and rules

Exercise: suppose a DB contains only two transactions, <a1, ..., a100> and <a1, ..., a50>, and let min_sup = 1.
- What is the set of closed itemsets? {a1, ..., a100}: 1 and {a1, ..., a50}: 2
- What is the set of max-patterns? {a1, ..., a100}: 1
- What is the set of all patterns? {a1}: 2, ..., {a1, a2}: 2, ..., {a1, a51}: 1, ..., {a1, a2, ..., a100}: 1. In total, a big number: 2^100 - 1

The Downward Closure Property and Scalable Mining Methods

- The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}; i.e., every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods, three major approaches:
  - Apriori (Agrawal & Srikant @ VLDB'94)
  - Frequent pattern growth (FPGrowth: Han, Pei & Yin @ SIGMOD'00)
  - Vertical data format approach (Charm: Zaki & Hsiao @ SDM'02)

Apriori: A Candidate Generation & Test Approach

- Apriori pruning principle: if there is any itemset that is infrequent, its supersets should not be generated or tested! (Agrawal & Srikant @ VLDB'94; Mannila et al. @ KDD'94)
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated
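A brute-force sketch can make the closed vs. max-pattern distinction concrete. To keep enumeration tractable it uses a scaled-down version of the exercise (a1..a6 and a1..a3 in place of a1..a100 and a1..a50); the structure of the answer is the same:

```python
from itertools import combinations

# Scaled-down version of the slide's exercise: two transactions,
# the second a prefix of the first, with min_sup = 1.
db = [frozenset(f"a{i}" for i in range(1, 7)),
      frozenset(f"a{i}" for i in range(1, 4))]
min_sup = 1

items = sorted(set().union(*db))

def sup(X):
    return sum(X <= t for t in db)

# All frequent itemsets, by exhaustive enumeration (2^6 - 1 candidates)
freq = {frozenset(c): sup(frozenset(c))
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if sup(frozenset(c)) >= min_sup}

# Closed: no proper superset with the same support
closed = [X for X in freq
          if not any(X < Y and freq[Y] == freq[X] for Y in freq)]
# Maximal: no frequent proper superset at all
maximal = [X for X in freq if not any(X < Y for Y in freq)]

print(len(freq))                     # 63 frequent itemsets (2^6 - 1)
print(sorted(map(sorted, closed)))   # the two prefixes a1..a3 and a1..a6
print(sorted(map(sorted, maximal)))  # only a1..a6
```

Just as in the full-size exercise, only the two transaction itemsets themselves are closed, and only the longer one is maximal, while the set of all frequent patterns is exponentially large.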

The Apriori Algorithm: An Example

Database TDB (Sup_min = 2):
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan, C1: {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
L1: {A}: 2, {B}: 3, {C}: 3, {E}: 3
C2: {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan: {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2
L2: {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2
C3: {B, C, E}
3rd scan, L3: {B, C, E}: 2

The Apriori Algorithm (Pseudo-Code)

Ck: candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;

Implementation of Apriori

- How to generate candidates?
  - Step 1: self-join Lk
  - Step 2: prune
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining L3 * L3: abcd from abc and abd; acde from acd and ace
  - Pruning: acde is removed because ade is not in L3
  - C4 = {abcd}

Further Improvement of the Apriori Method

- Major computational challenges:
  - Multiple scans of the transaction database
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Improving Apriori, general ideas:
  - Reduce the number of transaction database scans
  - Shrink the number of candidates
  - Facilitate support counting of the candidates
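The pseudo-code above translates fairly directly into Python. This is a minimal sketch (candidates are generated by pairwise unions and then pruned by downward closure; no hash-tree or other optimization), run on the TDB example:

```python
from itertools import combinations

def apriori(db, min_sup):
    """Level-wise candidate generation & test, as in the pseudo-code."""
    db = [frozenset(t) for t in db]
    items = sorted(set().union(*db))
    # L1: frequent 1-itemsets
    L = {frozenset([i]) for i in items
         if sum(i in t for t in db) >= min_sup}
    result = {}
    k = 1
    while L:
        for X in L:
            result[X] = sum(X <= t for t in db)
        # Join: unions of pairs from Lk that have size k+1
        C = {a | b for a in L for b in L if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must be in Lk
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k))}
        # Test the surviving candidates against the DB
        L = {c for c in C if sum(c <= t for t in db) >= min_sup}
        k += 1
    return result

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
freq = apriori(db, min_sup=2)
print(freq[frozenset({"B", "C", "E"})])  # 2
```

On this database the sketch reproduces the slide's trace: four frequent 2-itemsets, and {B, C, E} as the only frequent 3-itemset, with support 2.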

Partition: Scan the Database Only Twice

- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
  - Scan 1: partition the database and find the local frequent patterns
  - Scan 2: consolidate the global frequent patterns
- If DB1 + DB2 + ... + DBk = DB and sup1(i) < σ|DB1|, sup2(i) < σ|DB2|, ..., supk(i) < σ|DBk|, then sup(i) < σ|DB|
- A. Savasere, E. Omiecinski, and S. Navathe. VLDB'95

DHP: Reduce the Number of Candidates

- DHP = Direct Hashing and Pruning
- A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
  - Candidates: a, b, c, d, e
  - Hash the 2-itemsets of each transaction into buckets, e.g., a bucket holding {ab, ad, ae} with count 35, one holding {bd, be, de} with count 88, ..., one holding {yz, qs, wt} with count 102
  - Frequent 1-itemsets: a, b, d, e
  - ab is not a candidate 2-itemset if the total count of the bucket holding {ab, ad, ae} is below the support threshold
- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95

Sampling for Frequent Patterns

- Select a sample of the original database; mine frequent patterns within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns are checked
  - Example: check abcd instead of ab, ac, ..., etc.
- Scan the database again to find missed frequent patterns
- H. Toivonen. Sampling large databases for association rules. VLDB'96

DIC: Reduce the Number of Scans

- (Figure: the Apriori itemset lattice over items A, B, C, D, from the single items up to ABCD)
- Once both A and D are determined frequent, the counting of AD begins
- Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
- S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD'97
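The two-scan Partition idea can be sketched as follows. This minimal version brute-forces the local frequent itemsets within each partition (viable only for tiny partitions; the toy database and the 50% threshold below are mine, not from the slide), then counts the pooled candidates once over the whole database:

```python
from itertools import combinations

def local_frequents(part, rel_min_sup):
    """Brute-force local frequent itemsets of one small partition."""
    items = sorted(set().union(*part))
    out = set()
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            fs = frozenset(c)
            if sum(fs <= t for t in part) >= rel_min_sup * len(part):
                out.add(fs)
    return out

db = [{"a", "b"}, {"a", "c"}, {"a", "b"}, {"b", "c"}]
rel_min_sup = 0.5
parts = [db[:2], db[2:]]

# Scan 1: any globally frequent itemset is locally frequent somewhere,
# so the union of local frequents is a complete candidate set.
candidates = set().union(*(local_frequents(p, rel_min_sup) for p in parts))
# Scan 2: count the candidates once over the full database.
global_freq = {c for c in candidates
               if sum(c <= t for t in db) >= rel_min_sup * len(db)}
print(sorted(map(sorted, global_freq)))
```

Note the direction of the guarantee: locally frequent itemsets can turn out globally infrequent (false candidates, e.g. {a, c} here), but no globally frequent itemset can be missed, which is why two scans suffice.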

Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation

- Bottlenecks of the Apriori approach:
  - Breadth-first (i.e., level-wise) search
  - Candidate generation and test: often generates a huge number of candidates
- The FPGrowth approach (J. Han, J. Pei, and Y. Yin, SIGMOD'00):
  - Depth-first search
  - Avoids explicit candidate generation
- Major philosophy: grow long patterns from short ones using only local frequent items
  - "abc" is a frequent pattern
  - Get all transactions containing "abc", i.e., project the DB on abc: DB|abc
  - "d" is a local frequent item in DB|abc, so "abcd" is a frequent pattern

Construct an FP-tree from a Transaction Database (min_support = 3)

TID | Items bought             | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o, w}       | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

1. Scan the DB once; find the frequent 1-itemsets (single-item patterns)
2. Sort the frequent items in frequency-descending order: the F-list
3. Scan the DB again; construct the FP-tree

Header table: f: 4, c: 4, a: 3, b: 3, m: 3, p: 3 (each entry heads a chain of node-links)
F-list = f-c-a-b-m-p

(FP-tree figure: the root has children f:4 and c:1; f:4 -> c:3 -> a:3, which branches into m:2 -> p:2 and b:1 -> m:1; f:4 also has a child b:1; the root's other branch is c:1 -> b:1 -> p:1)

Partition Patterns and Databases

- The frequent patterns can be partitioned into subsets according to the F-list f-c-a-b-m-p:
  - patterns containing p
  - patterns containing m but not p
  - ...
  - patterns containing c but none of a, b, m, p
  - patterns containing only f
- This partitioning is complete and non-redundant

Find Patterns Containing p From p's Conditional Database

- Start at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node-links of each frequent item p
- Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases:
item | conditional pattern base
c    | f:3
a    | fc:3
b    | fca:1, f:1, c:1
m    | fca:2, fcab:1
p    | fcam:2, cb:1
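The two-scan FP-tree construction above can be sketched as follows. One caveat: within equal counts the item order in the F-list is arbitrary; this sketch breaks ties alphabetically, so it orders c before f where the slide's F-list is f-c-a-b-m-p. The supports recorded in the tree are unaffected by the tie-break:

```python
from collections import Counter

class Node:
    """FP-tree node: children keyed by item, plus a parent link."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(db, min_sup):
    # Scan 1: count single items, keep the frequent ones.
    counts = Counter(i for t in db for i in t)
    # F-list: frequent items in descending frequency (ties: alphabetical).
    flist = [i for i, c in sorted(counts.items(),
                                  key=lambda x: (-x[1], x[0]))
             if c >= min_sup]
    rank = {i: r for r, i in enumerate(flist)}
    root = Node(None, None)
    header = {i: [] for i in flist}          # node-links per item
    # Scan 2: insert each transaction's ordered frequent items.
    for t in db:
        node = root
        for i in sorted((x for x in t if x in rank), key=rank.get):
            if i not in node.children:
                child = Node(i, node)
                node.children[i] = child
                header[i].append(child)
            node = node.children[i]
            node.count += 1
    return root, header, flist

db = [list("facdgimp"), list("abcflmo"), list("bfhjow"),
      list("bcksp"), list("afcelpmn")]
root, header, flist = build_fp_tree(db, min_sup=3)
print(sorted(flist))                      # ['a', 'b', 'c', 'f', 'm', 'p']
print(sum(n.count for n in header["p"]))  # 3  (support of p via node-links)
```

Summing the counts along an item's node-link chain recovers its support, which is exactly how the header table is used during mining.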

From Conditional Pattern Bases to Conditional FP-trees

- For each pattern base:
  - Accumulate the count for each item in the base
  - Construct the FP-tree for the frequent items of the pattern base
- m's conditional pattern base: fca:2, fcab:1
- m-conditional FP-tree: the single path f:3 -> c:3 -> a:3 (b is infrequent in the base and is dropped)
- All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

Recursion: Mining Each Conditional FP-tree

- Conditional pattern base of "am": (fc:3); am-conditional FP-tree: f:3 -> c:3
- Conditional pattern base of "cm": (f:3); cm-conditional FP-tree: f:3
- Conditional pattern base of "cam": (f:3); cam-conditional FP-tree: f:3

A Special Case: A Single Prefix Path in the FP-tree

- Suppose a (conditional) FP-tree T has a shared single prefix path P
- Mining can be decomposed into two parts:
  - Reduce the single prefix path into one node
  - Concatenate the mining results of the two parts
- (Figure: a tree whose prefix path a1:n1 -> a2:n2 -> a3:n3 leads into branches b1:m1 and c1:k1, c2:k2, c3:k3 is split into the prefix path itself and the multi-path part rooted at a3:n3)

Benefits of the FP-tree Structure

- Completeness:
  - Preserves complete information for frequent pattern mining
  - Never breaks a long pattern of any transaction
- Compactness:
  - Reduces irrelevant information: infrequent items are gone
  - Items are in frequency-descending order: the more frequently an item occurs, the more likely it is to be shared
  - The tree is never larger than the original database (not counting node-links and the count fields)
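A conditional pattern base can also be read straight off the ordered transaction lists, without walking a tree: for each transaction containing the item, take the prefix that precedes it. A sketch on the slide's ordered table (the helper name is mine):

```python
from collections import Counter

# Ordered frequent-item lists from the slide's table (F-list f-c-a-b-m-p)
ordered = [list("fcamp"), list("fcabm"), list("fb"),
           list("cbp"), list("fcamp")]

def conditional_pattern_base(item, ordered_db):
    """Prefix paths preceding `item`, with their accumulated counts
    (one per transaction here, since we use the raw ordered lists)."""
    base = Counter()
    for t in ordered_db:
        if item in t:
            prefix = tuple(t[:t.index(item)])
            if prefix:
                base[prefix] += 1
    return base

print(conditional_pattern_base("m", ordered))
# Counter({('f', 'c', 'a'): 2, ('f', 'c', 'a', 'b'): 1})
print(conditional_pattern_base("p", ordered))
# Counter({('f', 'c', 'a', 'm'): 2, ('c', 'b'): 1})
```

These match the slide's table exactly: m's base is fca:2, fcab:1 and p's base is fcam:2, cb:1. The FP-tree computes the same bases more cheaply by sharing prefixes.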

The Frequent Pattern Growth Mining Method

- Idea: frequent pattern growth
  - Recursively grow frequent patterns by pattern and database partitioning
- Method:
  - For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Stop when the resulting FP-tree is empty, or it contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern

Scaling FP-growth by Database Projection

- What if the FP-tree cannot fit in memory? Use DB projection:
  - First partition the database into a set of projected DBs
  - Then construct and mine the FP-tree for each projected DB
- Parallel projection vs. partition projection:
  - Parallel projection: project the DB in parallel for each frequent item; space-costly, but all the partitions can be processed in parallel
  - Partition projection: partition the DB based on the ordered frequent items; pass the unprocessed parts on to the subsequent partitions

Partition-Based Projection

- Parallel projection needs a lot of disk space; partition projection saves it
- (Figure: the transaction DB {fcamp, fcabm, fb, cbp, fcamp} is split into p-, m-, b-, a-, c-, and f-projected DBs; e.g., the m-projected DB is {fca, fcab, fca} and the am-projected DB is {fc, fc, fc})

FP-Growth vs. Apriori: Scalability with the Support Threshold

- (Figure: run time in seconds vs. support threshold from 3% down to 0% on data set T25I20D10K; Apriori's run time grows far faster than FP-growth's as the threshold decreases)
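The recursion can be sketched compactly by recursing on conditional pattern bases directly (represented as prefix-path counters) rather than materializing each conditional FP-tree; the mined patterns are the same. This is an illustrative sketch, not the book's algorithm verbatim:

```python
from collections import Counter

def fp_growth(cond_db, min_sup, suffix=()):
    """Pattern growth over a conditional DB mapping prefix paths to counts."""
    # Count items in the (conditional) database
    item_counts = Counter()
    for path, cnt in cond_db.items():
        for i in path:
            item_counts[i] += cnt
    result = {}
    for item, cnt in item_counts.items():
        if cnt < min_sup:
            continue
        # Grow the suffix by one locally frequent item
        result[tuple(sorted(suffix + (item,)))] = cnt
        # Build this item's conditional pattern base and recurse on it
        next_db = Counter()
        for path, c in cond_db.items():
            if item in path:
                prefix = path[:path.index(item)]
                if prefix:
                    next_db[prefix] += c
        result.update(fp_growth(next_db, min_sup, suffix + (item,)))
    return result

# Ordered frequent-item lists of the running example
db = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
patterns = fp_growth(Counter(tuple(t) for t in db), min_sup=3)
print(patterns[("c", "f", "m")])  # 3, i.e. {f, c, m} has support 3
```

On the running example this recovers all the m-related patterns of the earlier slide (m, fm, cm, am, fcm, fam, cam, fcam, each with support 3), plus patterns such as {c, p}:3, without ever generating candidates.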

FP-Growth vs. Tree-Projection: Scalability with the Support Threshold

- (Figure: run time in seconds vs. support threshold from 2% down to 0% on data set T25I20D100K; FP-growth scales better than TreeProjection)

Advantages of the Pattern Growth Approach

- Divide-and-conquer:
  - Decompose both the mining task and the DB according to the frequent patterns obtained so far
  - Leads to focused searches of smaller databases
- Other factors:
  - No candidate generation, no candidate test
  - Compressed database: the FP-tree structure
  - No repeated scans of the entire database
  - The basic operations are counting local frequent items and building sub-FP-trees; no pattern search and matching
- A good open-source implementation and refinement of FPGrowth: FPGrowth+ (Grahne and J. Zhu, FIMI'03)

Interestingness Measure: Correlations (Lift)

- "play basketball => eat cereal" [40%, 66.7%] is misleading
  - The overall percentage of students eating cereal is 75%, which is higher than 66.7%
- "play basketball => not eat cereal" [20%, 33.3%] is more accurate, although it has lower support and confidence
- Measure of dependent/correlated events, lift:

  lift(A, B) = P(A ∪ B) / (P(A) P(B))

             | Basketball | Not basketball | Sum (row)
  Cereal     | 2000       | 1750           | 3750
  Not cereal | 1000       | 250            | 1250
  Sum (col)  | 3000       | 2000           | 5000

  lift(B, C)  = (2000/5000) / ((3000/5000) * (3750/5000)) = 0.89
  lift(B, ¬C) = (1000/5000) / ((3000/5000) * (1250/5000)) = 1.33

Are Lift and χ² Good Measures of Correlation?

- "buy walnuts => buy milk" [1%, 80%] is misleading if 85% of customers buy milk
- Support and confidence are not good indicators of correlation
- Over 20 interestingness measures have been proposed (see Tan, Kumar, Srivastava @ KDD'02); which are the good ones?
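The two lift values can be reproduced directly from the contingency table (here, as in the slide, P(A ∪ B) denotes the probability that both A and B occur in a transaction):

```python
# Contingency table from the slide: 5000 students
n = 5000
basketball = 3000
cereal = 3750
not_cereal = 1250
basketball_and_cereal = 2000
basketball_and_not_cereal = 1000

def lift(n_ab, n_a, n_b, n):
    """lift(A, B) = P(A and B) / (P(A) * P(B)), from raw counts."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

print(round(lift(basketball_and_cereal, basketball, cereal, n), 2))
# 0.89: basketball and cereal are negatively correlated
print(round(lift(basketball_and_not_cereal, basketball, not_cereal, n), 2))
# 1.33: basketball and NOT cereal are positively correlated
```

A lift below 1 indicates negative correlation and above 1 positive correlation, which is why the rule "play basketball => eat cereal" is misleading despite its 66.7% confidence.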

Summary

- Basic concepts: association rules, the support-confidence framework, closed and max-patterns
- Scalable frequent pattern mining methods:
  - Apriori (candidate generation & test)
  - Projection-based (FPgrowth)
- Which patterns are interesting? Pattern evaluation methods