Association Rules Apriori Algorithm

Similar documents

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

Chapter 4: Mining Frequent Patterns, Associations and Correlations

Data Mining: Concepts and Techniques. Chapter 5. SS Chung. April 5, 2013 Data Mining: Concepts and Techniques 1

CS570 Introduction to Data Mining

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

Frequent Pattern Mining

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

BCB 713 Module Spring 2011

Association mining rules

Association Rule Mining. Entscheidungsunterstützungssysteme

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 6

Lecture notes for April 6, 2005

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

Chapter 4: Association analysis:

Association Rules. A. Bellaachia Page: 1

Chapter 7: Frequent Itemsets and Association Rules

Mining Association Rules in Large Databases

Mining Association Rules in Large Databases

Frequent Pattern Mining. Slides by: Shree Jaswal

Association rules. Marco Saerens (UCL), with Christine Decaestecker (ULB)

Basic Concepts: Association Rules. What Is Frequent Pattern Analysis? COMP 465: Data Mining Mining Frequent Patterns, Associations and Correlations

Chapter 7: Frequent Itemsets and Association Rules

2 CONTENTS

Improved Frequent Pattern Mining Algorithm with Indexing

Chapter 6: Association Rules

Data Mining Part 3. Associations Rules

Association Pattern Mining. Lijun Zhang

CSE 5243 INTRO. TO DATA MINING

2. Discovery of Association Rules

Association Rule Discovery

Frequent Pattern Mining

Association Rules. Berlin Chen References:

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Association Rule Discovery

Chapter 4 Data Mining A Short Introduction

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Decision Support Systems

Data Mining Clustering

Frequent Item Sets & Association Rules

Nesnelerin İnternetinde Veri Analizi

CHAPTER 3 ASSOCIATION RULE MINING WITH LEVELWISE AUTOMATIC SUPPORT THRESHOLDS

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

High dim. data. Graph data. Infinite data. Machine learning. Apps. Locality sensitive hashing. Filtering data streams.

Fundamental Data Mining Algorithms

COMP Association Rules

Association Rule Mining. Introduction 46. Study core 46

Product presentations can be more intelligently planned

Classification by Association

A Technical Analysis of Market Basket by using Association Rule Mining and Apriori Algorithm

Market baskets Frequent itemsets FP growth. Data mining. Frequent itemset Association&decision rule mining. University of Szeged.

Tutorial on Association Rule Mining

Discovering interesting rules from financial data

We will be releasing HW1 today It is due in 2 weeks (1/25 at 23:59pm) The homework is long

Mining Frequent Patterns without Candidate Generation

Machine Learning: Symbolische Ansätze

Effectiveness of Freq Pat Mining

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets

ISSN Vol.03,Issue.09 May-2014, Pages:

Big Data Analytics CSCI 4030

International Journal of Advance Research in Computer Science and Management Studies

Supervised and Unsupervised Learning (II)

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

Interestingness Measurements

Performance Based Study of Association Rule Algorithms On Voter DB

Association Rule Learning

Association Analysis: Basic Concepts and Algorithms

Interestingness Measurements

Association rule mining

FP-Growth algorithm in Data Compression frequent patterns

An Algorithm for Frequent Pattern Mining Based On Apriori

Data Mining Concepts

Roadmap DB Sys. Design & Impl. Association rules - outline. Citations. Association rules - idea. Association rules - idea.

1. Interpret single-dimensional Boolean association rules from transactional databases

数据挖掘 Introduction to Data Mining

Tutorial on Assignment 3 in Data Mining 2009 Frequent Itemset and Association Rule Mining. Gyozo Gidofalvi Uppsala Database Laboratory

OPTIMISING ASSOCIATION RULE ALGORITHMS USING ITEMSET ORDERING

Association Rules and

Association Rules Outline

Association Rules: Past, Present & Future. Ramakrishnan Srikant.

Road Map. Objectives. Objectives. Frequent itemsets and rules. Items and transactions. Association Rules and Sequential Patterns

Association Rule Mining (ARM) Komate AMPHAWAN

An Introduction to WEKA Explorer. In part from: Yizhou Sun 2008

Production rule is an important element in the expert system. By interview with

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL

Advance Association Analysis

Mining Frequent Patterns with Counting Inference at Multiple Levels

Temporal Weighted Association Rule Mining for Classification

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 7

CSE 5243 INTRO. TO DATA MINING

Optimization using Ant Colony Algorithm

Data Mining Course Overview

Rule induction. Dr Beatriz de la Iglesia

Knowledge Discovery in Databases

Comparing the Performance of Frequent Itemsets Mining Algorithms

Data Mining: Mining Association Rules. Definitions. .. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..

Transcription:

Association Rules Apriori Algorithm

Market basket analysis
- Market basket analysis might tell a retailer that customers often purchase shampoo and conditioner together.
- Putting both items on promotion at the same time would not create a significant increase in revenue, while a promotion involving just one of the items would likely drive sales of the other.

Association Rules
- Association analysis discovers co-occurrence relationships.
- Besides market basket data, association analysis is also applicable to other application domains:
  - bioinformatics
  - medical diagnosis
  - Web mining
  - scientific data analysis
- A widely used example of market-basket-based cross-selling on the web is Amazon.com's "customers who bought book A also bought book B".

Sales Transaction Table
- We would like to perform a basket analysis of the set of products in a single transaction,
- discovering, for example, that a customer who buys shoes is likely to buy socks: Shoes → Socks

Transactional Database
- The set of all sales transactions is called the population.
- We represent the transactions as one record per transaction; each transaction is represented by a data tuple.

  TX1: Shoes, Socks, Tie
  TX2: Shoes, Socks, Tie, Belt, Shirt
  TX3: Shoes, Tie
  TX4: Shoes, Socks, Belt

- For the rule Socks → Tie:
  - Socks is the rule antecedent.
  - Tie is the rule consequent.

Support and Confidence
- Any given association rule has a support level and a confidence level.
- Support is the percentage of the population which satisfies the rule.
- Among the transactions in which the antecedent is satisfied, confidence is the percentage in which the consequent is also satisfied.
- For Socks → Tie over the transactional database above:
  - Support is 50% (2/4).
  - Confidence is 66.67% (2/3).
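To make the definitions concrete, here is a minimal Python sketch (not part of the original slides; function names are illustrative) that reproduces the 50% support and 66.67% confidence of Socks → Tie from the four transactions:

```python
# The four transactions from the Transactional Database slide above.
transactions = [
    {"Shoes", "Socks", "Tie"},
    {"Shoes", "Socks", "Tie", "Belt", "Shirt"},
    {"Shoes", "Tie"},
    {"Shoes", "Socks", "Belt"},
]

def support(itemset, txns):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in txns if itemset <= t) / len(txns)

def confidence(antecedent, consequent, txns):
    """support(antecedent U consequent) / support(antecedent)."""
    return support(antecedent | consequent, txns) / support(antecedent, txns)

print(support({"Socks", "Tie"}, transactions))       # 0.5   -> 50%
print(confidence({"Socks"}, {"Tie"}, transactions))  # 0.667 -> 66.67%
```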

Apriori Algorithm
- Mining for associations among items in a large database of sales transactions is an important data mining function.
- For example, the information that a customer who purchases a keyboard also tends to buy a mouse at the same time is represented in the association rule below:
  Keyboard → Mouse [support = 6%, confidence = 70%]

Association Rules
- Based on the types of values, association rules can be classified into two categories: Boolean association rules and quantitative association rules.
- Boolean association rule: Keyboard → Mouse [support = 6%, confidence = 70%]
- Quantitative association rule: (Age = 26...30) → (Cars = 1, 2) [support = 3%, confidence = 36%]

Minimum Support Threshold
- The support of an association pattern is the percentage of task-relevant data transactions for which the pattern is true.
- For a rule A → B:

  support(A → B) = P(A ∪ B) = (# tuples containing both A and B) / (total # of tuples)

Minimum Confidence Threshold
- Confidence is defined as the measure of certainty or trustworthiness associated with each discovered pattern.
- For a rule A → B, confidence is the probability of B given that all we know is A:

  confidence(A → B) = P(B | A) = (# tuples containing both A and B) / (# tuples containing A)

Itemset
- A set of items is referred to as an itemset.
- An itemset containing k items is called a k-itemset.
- An itemset can be seen as a conjunction of items.

Frequent Itemset
- Suppose min_sup is the minimum support threshold.
- An itemset satisfies minimum support if its occurrence frequency is greater than or equal to min_sup.
- If an itemset satisfies minimum support, then it is a frequent itemset.
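As a brief illustration (a sketch, not from the slides; names are illustrative), itemsets are conveniently modeled as frozensets, which are hashable and can therefore key a support-count dictionary:

```python
# A 3-itemset modeled as an immutable, hashable set.
itemset = frozenset({"Beer", "Diaper", "Milk"})
k = len(itemset)  # k = 3, so this is a 3-itemset

def is_frequent(itemset, txns, min_sup):
    """Frequent iff the occurrence frequency (count) is >= min_sup."""
    return sum(1 for t in txns if itemset <= t) >= min_sup
```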

Strong Rules
- Rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong.

Association Rule Mining
- Find all frequent itemsets.
- Generate strong association rules from the frequent itemsets.
- The Apriori algorithm mines frequent itemsets for Boolean association rules.

Apriori Algorithm
- A level-wise search: k-itemsets (itemsets with k items) are used to explore (k+1)-itemsets in transactional databases for Boolean association rules.
  - First, the set of frequent 1-itemsets is found (denoted L1).
  - L1 is used to find L2, the set of frequent 2-itemsets.
  - L2 is used to find L3, and so on, until no frequent k-itemsets can be found.
- Then strong association rules are generated from the frequent itemsets.
- Key property: if an itemset is frequent, then all of its subsets must also be frequent.

Example
- Database TDB, min_sup = 2:

  Tid  Items
  10   A, C, D
  20   B, C, E
  30   A, B, C, E
  40   B, E

- 1st scan: C1 = {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3  →  L1 = {A}: 2, {B}: 3, {C}: 3, {E}: 3
- C2 = {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
- 2nd scan: {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2  →  L2 = {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2
- C3 = {B, C, E}
- 3rd scan: {B, C, E}: 2  →  L3 = {B, C, E}: 2
- The name of the algorithm is based on the fact that it uses prior knowledge of frequent itemsets.
- It employs an iterative approach known as level-wise search, where k-itemsets are used to explore (k+1)-itemsets.
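The trace above can be reproduced with the following sketch (illustrative names; absolute support counts assumed, as in the slide); it prints L1, L2, and L3 exactly as in the example:

```python
from itertools import combinations

# Transaction database TDB from the example; min_sup = 2 (absolute count).
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup = 2

def scan(candidates):
    """One full database scan: count each candidate itemset."""
    return {c: sum(1 for t in tdb if c <= t) for c in candidates}

# 1st scan: candidate 1-itemsets C1 -> frequent 1-itemsets L1.
ck = {frozenset({item}) for t in tdb for item in t}
lk = {c: n for c, n in scan(ck).items() if n >= min_sup}
k = 1
while lk:
    print(f"L{k}:", sorted((tuple(sorted(c)), n) for c, n in lk.items()))
    # Join: unions of frequent k-itemsets that form (k+1)-itemsets.
    ck = {a | b for a, b in combinations(lk, 2) if len(a | b) == k + 1}
    # Prune (Apriori property): every k-subset must itself be frequent.
    ck = {c for c in ck if all(frozenset(s) in lk for s in combinations(c, k))}
    lk = {c: n for c, n in scan(ck).items() if n >= min_sup}  # next scan
    k += 1
```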

Apriori Property
- The Apriori property is used to reduce the search space.
- Apriori property: all nonempty subsets of a frequent itemset must also be frequent.
- It is anti-monotone in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well.
- Reducing the search space matters because finding each Lk (the set of frequent k-itemsets) requires one full scan of the database.
- If an itemset I does not satisfy the minimum support threshold min_sup, then I is not frequent: P(I) < min_sup.
- If an item A is added to the itemset I, the resulting itemset cannot occur more frequently than I; therefore I ∪ A is not frequent either: P(I ∪ A) < min_sup.

Scalable Methods for Mining Frequent Patterns
- The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent.
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper},
  - i.e., every transaction containing {beer, diaper, nuts} also contains {beer, diaper}.
- Scalable mining methods: three major approaches
  - Apriori (Agrawal & Srikant @ VLDB '94)
  - Frequent pattern growth (FP-growth: Han, Pei & Yin @ SIGMOD '00)
  - Vertical data format approach (CHARM: Zaki & Hsiao @ SDM '02)

Algorithm
1. Scan the (entire) transaction database to get the support S of each 1-itemset, compare S with min_sup, and obtain the set of frequent 1-itemsets, L1.
2. Use L(k-1) join L(k-1) to generate the set of candidate k-itemsets, and use the Apriori property to prune the infrequent candidates (see the sketch after this list).
3. Scan the transaction database to get the support S of each candidate k-itemset, compare S with min_sup, and obtain the set of frequent k-itemsets, Lk.
4. If the candidate set is not empty, go to step 2.
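Step 2, the L(k-1) join L(k-1) candidate generation with Apriori pruning, can be sketched as follows. The function name apriori_gen follows the textbook pseudocode; the sorted-prefix join formulation and the rest of the names are illustrative:

```python
from itertools import combinations

def apriori_gen(l_km1, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets l_km1.

    Join step:  merge two (k-1)-itemsets that share their first k-2 items.
    Prune step: drop candidates having any infrequent (k-1)-subset.
    """
    sorted_itemsets = sorted(tuple(sorted(s)) for s in l_km1)
    candidates = set()
    for a, b in combinations(sorted_itemsets, 2):
        if a[:k - 2] == b[:k - 2]:  # join condition on the common prefix
            candidates.add(frozenset(a) | frozenset(b))
    return {c for c in candidates
            if all(frozenset(s) in l_km1 for s in combinations(c, k - 1))}

# Example: L2 from the earlier TDB example yields C3 = {B, C, E}.
l2 = {frozenset(p) for p in [("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]}
print(apriori_gen(l2, 3))  # {frozenset({'B', 'C', 'E'})}
```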

5. For each frequent itemset l, generate all nonempty subsets of l.
6. For every nonempty subset s of l, output the rule s → (l − s) if its confidence C ≥ min_conf.

For example, with l = {A1, A2, A5}, the candidate rules are:

  {A1, A2} → A5    {A1, A5} → A2    {A2, A5} → A1
  A1 → {A2, A5}    A2 → {A1, A5}    A5 → {A1, A2}
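Steps 5 and 6 can be sketched as follows (illustrative, not from the slides; support_counts is assumed to map each frequent itemset, and by the Apriori property every subset of one, to its support count, with l passed as a frozenset):

```python
from itertools import combinations

def generate_rules(l, support_counts, min_conf):
    """Steps 5-6: emit s -> (l - s) for each nonempty proper subset s of l
    whose confidence support(l) / support(s) reaches min_conf."""
    rules = []
    for size in range(1, len(l)):
        for subset in combinations(sorted(l), size):
            s = frozenset(subset)
            conf = support_counts[l] / support_counts[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules
```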

Example
- Five transactions from a supermarket:

  TID  List of Items
  1    Beer, Diaper, Baby Powder, Bread, Umbrella
  2    Diaper, Baby Powder
  3    Beer, Diaper, Milk
  4    Diaper, Beer, Detergent
  5    Beer, Milk, Coca-Cola

Step 1
- min_sup = 40% (2/5), C1 → L1

  C1:                          L1:
  Item          Support        Item          Support
  Beer          4/5            Beer          4/5
  Diaper        4/5            Diaper        4/5
  Baby Powder   2/5            Baby Powder   2/5
  Bread         1/5            Milk          2/5
  Umbrella      1/5
  Milk          2/5
  Detergent     1/5
  Coca-Cola     1/5

Step 2 and Step 3
- C2 → L2

  C2:                                   L2:
  Item                  Support         Item                  Support
  Beer, Diaper          3/5             Beer, Diaper          3/5
  Beer, Baby Powder     1/5             Beer, Milk            2/5
  Beer, Milk            2/5             Diaper, Baby Powder   2/5
  Diaper, Baby Powder   2/5
  Diaper, Milk          1/5
  Baby Powder, Milk     0

Step 4
- min_sup = 40% (2/5); C3 → empty

  C3:
  Item                        Support
  Beer, Diaper, Baby Powder   1/5
  Beer, Diaper, Milk          1/5
  Beer, Milk, Baby Powder     0
  Diaper, Baby Powder, Milk   0

Step 5
- min_sup = 40%, min_conf = 70%

  Rule (A → B)           Support(A, B)   Support(A)   Confidence
  Beer → Diaper          60%             80%          75%
  Beer → Milk            40%             80%          50%
  Diaper → Baby Powder   40%             80%          50%
  Diaper → Beer          60%             80%          75%
  Milk → Beer            40%             40%          100%
  Baby Powder → Diaper   40%             40%          100%
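As a cross-check (an illustrative script, not part of the slides), the step-5 table and the result rules on the next slide can be recomputed directly from the five transactions:

```python
transactions = [
    {"Beer", "Diaper", "Baby Powder", "Bread", "Umbrella"},
    {"Diaper", "Baby Powder"},
    {"Beer", "Diaper", "Milk"},
    {"Diaper", "Beer", "Detergent"},
    {"Beer", "Milk", "Coca-Cola"},
]

def sup(itemset):
    """Relative support of an itemset over the five transactions."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

candidates = [({"Beer"}, {"Diaper"}), ({"Beer"}, {"Milk"}),
              ({"Diaper"}, {"Baby Powder"}), ({"Diaper"}, {"Beer"}),
              ({"Milk"}, {"Beer"}), ({"Baby Powder"}, {"Diaper"})]
for a, b in candidates:
    s, c = sup(a | b), sup(a | b) / sup(a)
    status = "strong" if s >= 0.4 and c >= 0.7 else "rejected"
    print(f"{a} -> {b}: support {s:.0%}, confidence {c:.2%} ({status})")
```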

Results
- Beer → Diaper: support 60%, confidence 75%
- Diaper → Beer: support 60%, confidence 75%
- Milk → Beer: support 40%, confidence 100%
- Baby Powder → Diaper: support 40%, confidence 100%

Interpretation
- Some results are believable, like Baby Powder → Diaper.
- Some rules need additional analysis, like Milk → Beer.
- Some rules are unbelievable, like Diaper → Beer.
- This example could contain unrealistic results because of the small amount of data.

- Maximal frequent itemsets
- Closed itemsets
- Closed frequent itemsets

Maximal Frequent Itemset
- A maximal frequent itemset is defined as a frequent itemset for which none of its immediate supersets are frequent.

- Maximal frequent itemsets effectively provide a compact representation of frequent itemsets:
- they form the smallest set of itemsets from which all frequent itemsets can be derived.
- Maximal frequent itemsets do not contain the support information of their subsets.

Closed Itemsets
- An itemset X is closed if none of its immediate supersets has exactly the same support count as X.
- X is not closed if at least one of its immediate supersets has the same support count as X.
- Closed itemsets provide a minimal representation of itemsets without losing their support information.

- For example, {b, c} is a closed itemset when it does not have the same support count as any of its supersets.
- An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to min_sup.

- Closed frequent itemsets are useful for removing some of the redundant association rules.
- An association rule X → Y is redundant if there exists another rule X′ → Y′, where X is a subset of X′ and Y is a subset of Y′, such that the support and confidence for both rules are identical.

- For example, the association rule {b} → {d, e} is redundant because it has the same support and confidence as {b, c} → {d, e}; such redundant rules are not generated.
- Maximal frequent itemsets are closed, because no maximal frequent itemset can have the same support count as any of its immediate supersets.
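A brute-force sketch of the closed/maximal distinction (hypothetical toy data, not from the slides; suitable only for tiny examples, since it enumerates every itemset):

```python
from itertools import combinations

# Hypothetical toy transactions; min_sup = 2 (absolute count).
transactions = [{"b", "c", "d"}, {"b", "c", "e"}, {"b", "d", "e"}]
min_sup = 2

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set().union(*transactions))
frequent = {frozenset(c): count(frozenset(c))
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if count(frozenset(c)) >= min_sup}

# Closed frequent: no proper superset has the same support count.
closed = {s for s in frequent
          if not any(s < t and frequent[t] == frequent[s] for t in frequent)}
# Maximal frequent: no proper superset is frequent at all.
maximal = {s for s in frequent if not any(s < t for t in frequent)}
print("closed: ", closed)
print("maximal:", maximal)  # every maximal itemset is also closed
```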

Simpson's Paradox
- In some cases, hidden variables may cause the observed relationship between a pair of variables to disappear or reverse its direction, a phenomenon known as Simpson's paradox.
- Consider the relationship between the sale of high-definition televisions (HDTV) and exercise machines:
  - {HDTV = Yes} → {Exercise machine = Yes} has a confidence of 99/180 = 55%.
  - {HDTV = No} → {Exercise machine = Yes} has a confidence of 54/120 = 45%.

- The combined data suggest that customers who buy high-definition televisions are more likely to buy exercise machines.
- However, a deeper analysis reveals that the sales of these items depend on whether the customer is a college student or a working adult.
  - For college students:
    {HDTV = Yes} → {Exercise machine = Yes}: 1/10 = 10%
    {HDTV = No} → {Exercise machine = Yes}: 4/34 = 11.8%
  - For working adults:
    {HDTV = Yes} → {Exercise machine = Yes}: 98/170 = 57.7%
    {HDTV = No} → {Exercise machine = Yes}: 50/86 = 58.1%
- Within each group, customers who do not buy high-definition televisions are more likely to buy exercise machines, which contradicts the previous conclusion.

The paradox explained
- Most customers who buy HDTVs are working adults, and working adults are also the largest group of customers who buy exercise machines.
- Because nearly 85% of the customers are working adults, the observed relationship between HDTV and exercise machine in the combined data turns out to be stronger than it would be if the data were stratified.
- Formally, suppose a/b < c/d and p/q < r/s, where
  - a/b and p/q represent the confidence of the rule A → B in two different strata, and
  - c/d and r/s represent the confidence of the rule ¬A → B in the two strata.

- When the data are pooled together, the confidence values of the two rules in the combined data are (a + p)/(b + q) and (c + r)/(d + s).
- Simpson's paradox occurs when (a + p)/(b + q) > (c + r)/(d + s), so that the direction of the relationship reverses in the pooled data.
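The HDTV numbers above satisfy exactly this condition; a quick check with exact fractions (an illustrative script, not from the slides):

```python
from fractions import Fraction

# (exercise-machine buyers, group size) per stratum, from the slides.
# Stratum order: college students, working adults.
hdtv_yes = [(1, 10), (98, 170)]   # confidence of HDTV=Yes -> EM=Yes
hdtv_no = [(4, 34), (50, 86)]     # confidence of HDTV=No  -> EM=Yes

# Within each stratum the HDTV=No group is MORE likely to buy: a/b < c/d.
for (a, b), (c, d) in zip(hdtv_yes, hdtv_no):
    assert Fraction(a, b) < Fraction(c, d)

# Pooling reverses the direction: (a+p)/(b+q) > (c+r)/(d+s).
a, b = map(sum, zip(*hdtv_yes))   # (99, 180) -> 55%
c, d = map(sum, zip(*hdtv_no))    # (54, 120) -> 45%
assert Fraction(a, b) > Fraction(c, d)
print(f"pooled: {a}/{b} = {a/b:.0%}  vs  {c}/{d} = {c/d:.0%}")
```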