Association rules. Marco Saerens (UCL), with Christine Decaestecker (ULB)


Slide references. Many slides and figures have been adapted from the slides associated with the following books: Alpaydin (2004), Introduction to machine learning, The MIT Press. Duda, Hart & Stork (2000), Pattern classification, John Wiley & Sons. Han & Kamber (2006), Data mining: concepts and techniques, 2nd ed., Morgan Kaufmann. Tan, Steinbach & Kumar (2006), Introduction to data mining, Addison Wesley. As well as from Wikipedia, the free online encyclopedia. I warmly thank these authors for their contribution to machine learning and data mining teaching. 2

Frequent pattern analysis 3

What is frequent pattern analysis? Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set Motivation: Finding inherent regularities in data What products were often purchased together? What are the subsequent purchases after buying a PC? Can we automatically classify web documents? Applications Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis. 4

Why is frequent pattern analysis important? Discloses intrinsic and important properties of data sets Forms the foundation for many essential data mining tasks Association, correlation, and causality analysis Sequential, structural (e.g., sub-graph) patterns Pattern analysis in spatiotemporal, multimedia, time-series, and stream data Classification: associative classification Cluster analysis: frequent pattern-based clustering 5

Basic concepts. Itemset X = {x_1, ..., x_p}. Find all the rules X ⇒ Y with minimum support and confidence. Support, s: a priori probability that a transaction contains X ∪ Y. Confidence, c: conditional probability that a transaction having X also contains Y, i.e., P(Y | X). Example transactions (the slide also shows a Venn diagram of customers buying beer, buying diaper, and buying both):
Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F
Let sup_min = 50%, conf_min = 50%. Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}. Association rules: A ⇒ D (support 60%, confidence 100%); D ⇒ A (support 60%, confidence 75%). 6
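
To make these definitions concrete, here is a minimal Python sketch (not from the slides) that computes support and confidence on the five example transactions above; the helper names are mine.

transactions = [
    {"A", "B", "D"},           # 10
    {"A", "C", "D"},           # 20
    {"A", "D", "E"},           # 30
    {"B", "E", "F"},           # 40
    {"B", "C", "D", "E", "F"}, # 50
]

def support(itemset):
    # Fraction of transactions containing every item of `itemset`
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # Estimate of P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "D"}))        # 0.6  -> rule A => D has support 60%
print(confidence({"A"}, {"D"}))   # 1.0  -> A => D has confidence 100%
print(confidence({"D"}, {"A"}))   # 0.75 -> D => A has confidence 75%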

Closed patterns and max-patterns. A long pattern contains a combinatorial number of sub-patterns: e.g., {a_1, ..., a_100} contains 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns! Solution: mine closed patterns and max-patterns instead. An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X. An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X. 7

Closed patterns and max-patterns. Example: DB = {⟨a_1, ..., a_100⟩, ⟨a_1, ..., a_50⟩}, with min_sup_count = 1. What is the set of closed itemsets? ⟨a_1, ..., a_100⟩: 1 and ⟨a_1, ..., a_50⟩: 2. What is the set of max-patterns? ⟨a_1, ..., a_100⟩: 1. 8

Scalable methods for mining frequent patterns. The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent. If {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}. Scalable mining methods, three major approaches: Apriori (Agrawal & Srikant), frequent pattern growth (FP-growth; Han, Pei & Yin), and the vertical data format approach (Charm; Zaki & Hsiao). 9

Scalable methods for mining frequent patterns. Patterns form a lattice: the itemset lattice over {A, B, C, D, E}, from the empty set at the top, through all 1-, 2-, 3- and 4-itemsets, down to ABCDE. 10

Apriori: a candidate generation-and-test approach. Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated/tested! Method: initially, scan the DB once to get the frequent 1-itemsets; generate length (k + 1) candidate itemsets from the length k frequent itemsets; test the candidates against the DB; terminate when no frequent or candidate set can be generated. 11

Apriori: a candidate generation-and-test approach. This makes it possible to prune the lattice: once an itemset (e.g., AB) is found to be infrequent, all of its supersets (ABC, ABD, ..., ABCDE) are pruned and never generated or tested. 12

The Apriori algorithm: an example (sup_min = 2). Database:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E
1st scan, C_1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3. L_1: {A}:2, {B}:3, {C}:3, {E}:3.
C_2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}. 2nd scan: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2. L_2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2.
C_3: {B,C,E}. 3rd scan: {B,C,E}:2. L_3: {B,C,E}:2. 13

The Apriori algorithm. Pseudo-code (C_k: candidate itemsets of size k; L_k: frequent itemsets of size k):
L_1 = {frequent items};
for (k = 1; L_k != ∅; k++) do begin
    C_{k+1} = candidates generated from L_k;
    for each transaction t in database do
        increment the count of all candidates in C_{k+1} that are contained in t;
    L_{k+1} = candidates in C_{k+1} with count ≥ min_support;
end
return ∪_k L_k; 14
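
As an illustration, the following Python sketch mirrors the pseudo-code above on the four-transaction example of the previous slide. The function and parameter names (apriori, min_support_count) are mine, and candidate generation is done by a simple pairwise join followed by downward-closure pruning.

from itertools import combinations
from collections import defaultdict

def apriori(transactions, min_support_count):
    # Level-wise search following the pseudo-code above
    transactions = [frozenset(t) for t in transactions]

    # L1: frequent 1-itemsets
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    L = {c: n for c, n in counts.items() if n >= min_support_count}
    all_frequent = dict(L)

    k = 1
    while L:
        # C_{k+1}: join L_k with itself, then prune candidates having an
        # infrequent k-subset (downward closure / Apriori pruning)
        candidates = set()
        for a in L:
            for b in L:
                union = a | b
                if len(union) == k + 1 and all(
                        frozenset(s) in L for s in combinations(union, k)):
                    candidates.add(union)

        # One DB scan: count every candidate contained in each transaction
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        L = {c: n for c, n in counts.items() if n >= min_support_count}
        all_frequent.update(L)
        k += 1
    return all_frequent

# Example from the previous slide, with minimum support count 2
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, n in sorted(apriori(db, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), n)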

Important details of Apriori. How to generate candidates? Step 1: self-join L_k. Step 2: pruning. How to count the supports of candidates? Example of candidate generation: L_3 = {abc, abd, acd, ace, bcd}. Self-joining L_3 * L_3: abcd from abc and abd; acde from acd and ace. Pruning: acde is removed because ade is not in L_3. Hence C_4 = {abcd}. 15
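
A small sketch of just the candidate-generation step, reproducing the L_3 → C_4 example above. The helper name gen_candidates is mine, and the join is done by unioning pairs of k-itemsets rather than by the usual prefix-based self-join; after pruning it yields the same C_4.

from itertools import combinations

def gen_candidates(Lk, k):
    # Join step: union pairs of k-itemsets and keep the (k+1)-itemsets;
    # prune step: drop any candidate with an infrequent k-subset
    Lk = {frozenset(x) for x in Lk}
    joined = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
    return {c for c in joined
            if all(frozenset(s) in Lk for s in combinations(c, k))}

L3 = ["abc", "abd", "acd", "ace", "bcd"]
print(["".join(sorted(c)) for c in gen_candidates(L3, 3)])   # ['abcd']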

How to count supports of candidates? Why is counting supports of candidates a problem? The total number of candidates can be very large, and one transaction may contain many candidates. Method: candidate itemsets are stored in a hash tree; a leaf node of the hash tree contains a list of itemsets and counts; an interior node contains a hash table; the subset function finds all the candidates contained in a transaction. 16
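
The sketch below illustrates the counting step in a simplified way: it enumerates the k-subsets of each transaction (the role of the subset function) and looks them up, with a plain hash set standing in for the hash tree of the slides. The transactions and candidate itemsets are made-up toy values.

from itertools import combinations
from collections import defaultdict

def count_supports(transactions, candidates, k):
    # Enumerate the k-subsets of each transaction and look them up;
    # a plain set stands in for the hash tree
    counts = defaultdict(int)
    candidate_set = {frozenset(c) for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):
            fs = frozenset(subset)
            if fs in candidate_set:
                counts[fs] += 1
    return counts

db = [{1, 2, 3, 5, 6}, {1, 4, 5}, {2, 3, 4}]
cands = [{1, 2, 4}, {1, 2, 5}, {2, 3, 5}, {3, 5, 6}, {2, 3, 4}]
for c, n in count_supports(db, cands, 3).items():
    print(sorted(c), n)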

Example: counting supports of candidates. For transaction t = {1, 2, 3, 5, 6}, all 3-item subsets of t are enumerated level by level (fix the first item, then the second, then the third: 1 2 3, 1 2 5, 1 2 6, 1 3 5, 1 3 6, 1 5 6, 2 3 5, 2 3 6, 2 5 6, 3 5 6) and looked up in the hash tree, using the hash function that sends items 1, 4, 7 / 2, 5, 8 / 3, 6, 9 to the three branches. 17

Hash tree. The 15 candidate 3-itemsets (1 4 5, 1 3 6, 1 2 4, 1 2 5, 1 5 9, 4 5 7, 4 5 8, 2 3 4, 5 6 7, 3 4 5, 3 5 6, 3 5 7, 6 8 9, 3 6 7, 3 6 8) are stored in a candidate hash tree. At each level the current item is hashed with the same hash function: items 1, 4 or 7 go to the first branch, items 2, 5 or 8 to the second, and items 3, 6 or 9 to the third; the three slides show hashing on the first, second, and third item of a candidate, respectively. (Slides 18-20)

Subset operation using the hash tree. Transaction t = 1 2 3 5 6 is decomposed recursively along the hash function (1 + 2 3 5 6, 2 + 3 5 6, 3 + 5 6; then 1 2 + 3 5 6, 1 3 + 5 6, 1 5 + 6, and so on), and each partial subset is routed down the candidate hash tree. Only the leaves reached in this way have to be inspected, so the transaction is matched against 11 of the 15 candidates instead of all of them. (Slides 21-23)

Challenges of frequent pattern mining Challenges Multiple scans of transaction database Huge number of candidates Tedious workload of support counting for candidates Improving Apriori: general ideas Reduce passes of transaction database scans Shrink number of candidates Facilitate support counting of candidates 24

Partition: scan the database only twice. Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB (if it were infrequent in every partition, its overall support would also fall below the threshold). Scan 1: partition the database and find the local frequent patterns. Scan 2: consolidate the global frequent patterns. 25
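
A possible two-scan sketch of this partition idea, not from the slides: each partition is mined in memory (here with a brute-force local miner, purely for illustration; in practice one would run Apriori or FP-growth per partition), and the union of local candidates is recounted over the full database. The function names and the partition count are mine.

from itertools import combinations
from collections import defaultdict

def local_frequent(transactions, min_count, max_len=3):
    # Brute-force local mining, only meant for small in-memory partitions
    counts = defaultdict(int)
    for t in transactions:
        for k in range(1, min(max_len, len(t)) + 1):
            for s in combinations(sorted(t), k):
                counts[frozenset(s)] += 1
    return {c for c, n in counts.items() if n >= min_count}

def partition_mine(transactions, min_support, n_partitions=2):
    # Any globally frequent itemset is locally frequent in at least one
    # partition, so two scans suffice: mine each partition (scan 1),
    # then recount the union of local candidates over the full DB (scan 2)
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    size = -(-n // n_partitions)   # ceiling division
    candidates = set()
    for i in range(0, n, size):    # scan 1
        part = transactions[i:i + size]
        candidates |= local_frequent(part, max(1, int(min_support * len(part))))
    counts = {c: sum(c <= t for t in transactions) for c in candidates}   # scan 2
    return {c: k for c, k in counts.items() if k >= min_support * n}

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print({tuple(sorted(c)): k for c, k in partition_mine(db, 0.5).items()})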

Bottleneck of frequent-pattern mining. Multiple database scans are costly. Mining long patterns needs many passes of scanning and generates lots of candidates. Bottleneck: candidate generation and test. Can we avoid candidate generation? 26

Mining frequent patterns without candidate generation. Grow long patterns from short ones using local frequent items: abc is a frequent pattern; get all transactions having abc, the projected database DB|abc; if d is a local frequent item in DB|abc, then abcd is a frequent pattern. 27
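
The following sketch captures this pattern-growth idea without any tree structure: it recursively projects the database on one frequent item at a time and grows longer patterns inside each projection. It is a simplified stand-in for FP-growth, not the slides' algorithm; the names and toy database are mine.

from collections import defaultdict

def grow(db, min_count, suffix=frozenset()):
    # Count items in the (projected) database
    counts = defaultdict(int)
    for t in db:
        for item in t:
            counts[item] += 1
    frequent = {}
    for item in sorted(i for i, n in counts.items() if n >= min_count):
        pattern = suffix | {item}
        frequent[pattern] = counts[item]
        # DB|item: transactions containing `item`, keeping only "later"
        # items so that each pattern is generated exactly once
        projected = [{x for x in t if x > item} for t in db if item in t]
        frequent.update(grow([t for t in projected if t], min_count, pattern))
    return frequent

db = [{"a", "b", "c"}, {"a", "b", "c", "d"}, {"a", "c", "d"}, {"b", "d"}]
for pat, n in sorted(grow(db, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print("".join(sorted(pat)), n)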

Construct an FP-tree from a transaction database (min_support = 3).
TID   Items bought                 (ordered) frequent items
100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
300   {b, f, h, j, o, w}           {f, b}
400   {b, c, k, s, p}              {c, b, p}
500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}
1. Scan the DB once, find the frequent 1-itemsets (single-item patterns). 2. Sort frequent items in frequency-descending order, giving the f-list. 3. Scan the DB again and construct the FP-tree. Header table: f:4, c:4, a:3, b:3, m:3, p:3; F-list = f-c-a-b-m-p. The resulting tree has root {} with two branches: f:4 – c:3 – a:3 – m:2 – p:2, with a side branch b:1 – m:1 under a:3 and a child b:1 under f:4; and c:1 – b:1 – p:1. 28
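
A compact sketch of this two-scan FP-tree construction on the five transactions of the slide. Node-links and the header table are omitted for brevity, the class and function names are mine, and ties in the f-list are broken arbitrarily, so the computed order may differ slightly from the slide's f-c-a-b-m-p.

from collections import defaultdict

class FPNode:
    # One FP-tree node: item label, count, parent, children keyed by item
    def __init__(self, item=None, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Scan 1: item frequencies and the f-list (frequency-descending order)
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    f_list = sorted((i for i in freq if freq[i] >= min_support),
                    key=lambda i: -freq[i])
    rank = {item: r for r, item in enumerate(f_list)}

    # Scan 2: insert each transaction's ordered frequent items into the tree
    root = FPNode()
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.setdefault(item, FPNode(item, node))
            child.count += 1
            node = child
    return root, f_list

def show(node, depth=0):
    # Print the tree, one node per line, indented by depth
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

# The five transactions of the slide, min_support = 3
db = [list("facdgimp"), list("abcflmo"), list("bfhjow"), list("bcksp"), list("afcelpmn")]
root, f_list = build_fp_tree(db, 3)
print(f_list)
show(root)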

Benefits of the FP-tree structure. This is a prefix-tree (trie) structure. Completeness: it preserves complete information for frequent pattern mining and never breaks a long pattern of any transaction. Compactness: irrelevant information is reduced since infrequent items are gone; items are stored in frequency-descending order, so the more frequently an item occurs, the more likely its node is to be shared; the tree is never larger than the original database (not counting node-links and the count fields). 29

Partition patterns and databases. Frequent patterns can be partitioned into subsets according to the f-list (f-c-a-b-m-p): patterns containing p; patterns having m but no p; ...; patterns having c but none of a, b, m, p; and finally the pattern f. This partition is complete and non-redundant. 30

Find patterns having p from p's conditional database. Starting at the frequent-item header table of the FP-tree, traverse the FP-tree by following the node-links of each frequent item p, and accumulate all the transformed prefix paths of item p to form p's conditional pattern base. For the FP-tree above (header table f:4, c:4, a:3, b:3, m:3, p:3), the conditional pattern bases are:
item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
31

From conditional pattern bases to conditional FP-trees. For each pattern base: accumulate the count for each item in the base, then construct the FP-tree for the frequent items of the pattern base. Example: the m-conditional pattern base is fca:2, fcab:1; b is locally infrequent (count 1 < 3), so the m-conditional FP-tree is the single path {} – f:3 – c:3 – a:3. All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam. 32
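
A small sketch of this step for item m, not from the slides: starting from m's conditional pattern base, it keeps the locally frequent items and, because the resulting conditional FP-tree is a single path, enumerates all combinations of its items suffixed with m.

from itertools import combinations
from collections import Counter

# m's conditional pattern base from the slide, as {prefix path: count}
m_base = {("f", "c", "a"): 2, ("f", "c", "a", "b"): 1}
min_support = 3

# Accumulate counts along the paths and keep the locally frequent items
counts = Counter()
for path, n in m_base.items():
    for item in path:
        counts[item] += n
local_freq = [i for i, n in counts.items() if n >= min_support]   # ['f', 'c', 'a']

# The m-conditional FP-tree is the single path f:3 - c:3 - a:3, so every
# combination of its items, suffixed with m, is a frequent pattern
patterns = ["".join(c) + "m" for k in range(len(local_freq) + 1)
            for c in combinations(local_freq, k)]
print(patterns)   # ['m', 'fm', 'cm', 'am', 'fcm', 'fam', 'cam', 'fcam']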

Recursion: mining each conditional FP-tree. From the m-conditional FP-tree ({} – f:3 – c:3 – a:3): the conditional pattern base of am is (fc:3), giving the am-conditional FP-tree {} – f:3 – c:3; the conditional pattern base of cm is (f:3), giving the cm-conditional FP-tree {} – f:3; the conditional pattern base of cam is (f:3), giving the cam-conditional FP-tree {} – f:3. 33

A special case: a single prefix path in the FP-tree. Suppose a (conditional) FP-tree T has a shared single prefix path P, say a_1:n_1 – a_2:n_2 – a_3:n_3, below which the tree branches (e.g., into b_1:m_1 and C_1:k_1, C_2:k_2, C_3:k_3). Mining can then be decomposed into two parts: reduce the single prefix path into one node r_1 and mine the branching part, then concatenate the mining results of the two parts. 34

Mining frequent patterns with FP-trees. Idea: frequent pattern growth, i.e., recursively grow frequent patterns by pattern and database partition. Method: for each frequent item, construct its conditional pattern base and then its conditional FP-tree; repeat the process on each newly created conditional FP-tree, until the resulting FP-tree is empty or contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern). 35

Scaling FP-growth by DB projection FP-tree cannot fit in memory? DB projection First partition a database into a set of projected DBs Then construct and mine FP-tree for each projected DB Parallel projection vs. partition projection techniques Parallel projection is space costly 36

Partition-based projection. Parallel projection needs a lot of disk space; partition projection saves it. For the transaction DB {fcamp, fcabm, fb, cbp, fcamp}: the p-projected DB is {fcam, cb, fcam}; the m-projected DB is {fcab, fca, fca}; the b-, a-, c- and f-projected DBs follow in turn (e.g., b-proj DB: f, cb, ...). Recursing, the am-projected DB is {fc, fc, fc} and the cm-projected DB is {f, f, f}. 37

FP-growth vs. Apriori: scalability with the support threshold. Data set T25I20D10K; the figure plots run time (sec.) against the support threshold (%), from 0 to 3%, for D1 FP-growth and D1 Apriori. 38

FP-growth vs. tree-projection: scalability with the support threshold. Data set T25I20D100K; the figure plots runtime (sec.) against the support threshold (%), from 0 to 2%, for D2 FP-growth and D2 TreeProjection. 39

Why is FP-growth highly performant? Divide-and-conquer: it decomposes both the mining task and the DB according to the frequent patterns obtained so far, which leads to a focused search of smaller databases. Other factors: no candidate generation and no candidate test; a compressed database (the FP-tree structure); no repeated scan of the entire database; and the basic operations are counting local frequent items and building sub-FP-trees, with no pattern search and matching. 40

Visualization of association rules: plane graph 41

Visualization of association rules: rule graph 42

Visualization of association rules (SGI/MineSet 3.0) 43

Mining multiple-level association rules. Items often form hierarchies, so flexible support settings are needed: items at the lower levels are expected to have lower support. Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%. Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%. Example: Milk [support = 10%]; 2% Milk [support = 6%]; Skim Milk [support = 4%]. 44

Multi-level association: redundancy filtering. Some rules may be redundant due to ancestor relationships between items. Example: milk ⇒ wheat bread [support = 8%, confidence = 70%] and 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]. We say the first rule is an ancestor of the second rule. A rule is redundant if its support is close to the expected value computed from the rule's ancestor. 45

Mining multi-dimensional associations. Single-dimensional rules: buys(X, "milk") ⇒ buys(X, "bread"). Multi-dimensional rules involve two or more dimensions or predicates. Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke"). Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke"). Categorical attributes have a finite number of possible values, with no ordering among the values: data cube approach. 46

Interestingness measure: correlations. play basketball ⇒ eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75%, which is higher than 66.7%. play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence. Measure of dependent/correlated events: lift,
lift = P(A ∪ B) / (P(A) P(B)). 47

Interestingness measure: correlations. Example:
              Basketball   Not basketball   Sum (row)
Cereal        2000         1750             3750
Not cereal    1000         250              1250
Sum (col.)    3000         2000             5000
lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33. 48
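
These two lift values can be checked with a few lines of Python; the counts are taken from the contingency table above and the helper name is mine.

n = 5000
basketball, cereal, both = 3000, 3750, 2000   # counts from the table above

def lift(p_ab, p_a, p_b):
    return p_ab / (p_a * p_b)

print(round(lift(both / n, basketball / n, cereal / n), 2))                        # 0.89
print(round(lift((basketball - both) / n, basketball / n, (n - cereal) / n), 2))   # 1.33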

Are lift and χ² good measures of correlation? Buy walnuts ⇒ buy milk [1%, 80%] is misleading if 85% of customers buy milk: support and confidence are not good at representing correlations. Many other interestingness measures exist, e.g.,
all_conf(X) = sup(X) / max_item_sup(X)
coh(X) = sup(X) / |universe(X)|
With a milk/coffee contingency table (cells m,c; ~m,c; m,~c; ~m,~c), consider the data sets:
DB   m,c    ~m,c   m,~c    ~m,~c     lift   all-conf   coh    χ²
A1   1000   100    100     10,000    9.26   0.91       0.83   9055
A2   100    1000   1000    100,000   8.44   0.09       0.05   670
A3   1000   100    10000   100,000   9.18   0.09       0.09   8172
A4   1000   1000   1000    1000      1      0.5        0.33   0
49
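
The measures in this table can be reproduced, for instance on data set A1, with the sketch below; coh is computed as sup(X) over the number of transactions containing at least one item of X, matching the definition above, and the variable names are mine.

# Data set A1 from the table above: mc, ~mc, m~c, ~m~c
mc, not_m_c, m_not_c, not_m_not_c = 1000, 100, 100, 10_000
n = mc + not_m_c + m_not_c + not_m_not_c

sup_m, sup_c, sup_mc = mc + m_not_c, mc + not_m_c, mc

lift = (sup_mc / n) / ((sup_m / n) * (sup_c / n))
all_conf = sup_mc / max(sup_m, sup_c)              # all-confidence
coh = sup_mc / (sup_m + sup_c - sup_mc)            # transactions with m or c

# Pearson chi-square on the 2x2 contingency table
observed = [[mc, m_not_c], [not_m_c, not_m_not_c]]
rows = [sum(r) for r in observed]
cols = [sum(col) for col in zip(*observed)]
chi2 = sum((observed[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
           for i in range(2) for j in range(2))

print(round(lift, 2), round(all_conf, 2), round(coh, 2), round(chi2, 1))
# 9.26 0.91 0.83 and chi-square about 9056 (the slide rounds it to 9055)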

Which measures should be used? 50