Advanced Association Analysis

Minimum Support Threshold

Effect of Support Distribution
Many real data sets have a skewed support distribution.
[Figure: support distribution of a retail data set]

Effect of Support Distribution
How should the minsup threshold be set?
  If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products).
  If minsup is set too low, mining is computationally expensive and the number of itemsets is very large.
Using a single minimum support threshold may not be effective.

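Multiple Minimum Support
How can multiple minimum supports be applied?
MS(i): minimum support for item i
  e.g., MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
The minimum support of an itemset is the lowest minimum support among its items:
  MS({Milk, Broccoli}) = min(MS(Milk), MS(Broccoli)) = 0.1%
Challenge: with item-specific thresholds, being frequent is no longer anti-monotone. A superset of an infrequent itemset can be frequent.
  Suppose Support({Milk, Coke}) = 1.5% and Support({Milk, Coke, Broccoli}) = 0.5%.
  {Milk, Coke} is infrequent (1.5% < MS = 3%), but {Milk, Coke, Broccoli} is frequent (0.5% ≥ MS = 0.1%).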
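The loss of anti-monotonicity can be checked with a few lines of Python. This is a minimal sketch that uses the MS values and supports quoted on the slide; the function names `itemset_ms` and `is_frequent` are illustrative and not part of any library.

```python
# Item-specific minimum supports from the slide, written as fractions.
MS = {"Milk": 0.05, "Coke": 0.03, "Broccoli": 0.001, "Salmon": 0.005}

def itemset_ms(itemset):
    """Minimum support of an itemset: the lowest MS among its items."""
    return min(MS[item] for item in itemset)

def is_frequent(itemset, support):
    """An itemset is frequent if its support reaches its own MS."""
    return support >= itemset_ms(itemset)

pair = frozenset({"Milk", "Coke"})
triple = frozenset({"Milk", "Coke", "Broccoli"})

print(itemset_ms(pair))            # 0.03
print(itemset_ms(triple))          # 0.001
print(is_frequent(pair, 0.015))    # False: 1.5% < 3%
print(is_frequent(triple, 0.005))  # True: 0.5% >= 0.1%
```

The subset is infrequent while its superset is frequent, so standard Apriori pruning would wrongly discard the superset.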

Multiple Minimum Support

Item   MS(I)    Sup(I)
A      0.10%    0.25%
B      0.20%    0.26%
C      0.30%    0.29%
D      0.50%    0.05%
E      3%       4.20%

[Figure: itemset lattice over items A to E, showing the 1-itemsets (A, B, C, D, E), the 2-itemsets (AB, AC, AD, AE, BC, BD, BE, CD, CE, DE), and the 3-itemsets (ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE)]

Multiple Minimum Support (Liu 1999)
Order the items according to their minimum support (in ascending order).
  e.g., MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
  Ordering: Broccoli, Salmon, Coke, Milk
Apriori needs to be modified such that:
  L1: set of frequent items
  F1: set of items whose support is ≥ MS(1), where MS(1) = min_i(MS(i))
  C2: candidate itemsets of size 2, generated from F1 instead of L1
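The difference between L1 and F1 can be seen on the five-item table (A to E) from the earlier slide. The following is a minimal sketch of only the steps listed above; the variable names are illustrative, and the published algorithm may apply further checks during candidate generation that the slide does not spell out.

```python
from itertools import combinations

# MS(I) and Sup(I) from the five-item table, written as fractions.
ms = {"A": 0.0010, "B": 0.0020, "C": 0.0030, "D": 0.0050, "E": 0.03}
sup = {"A": 0.0025, "B": 0.0026, "C": 0.0029, "D": 0.0005, "E": 0.042}

# Order items by ascending minimum support.
order = sorted(ms, key=lambda item: ms[item])

# MS(1): the smallest minimum support over all items.
ms1 = ms[order[0]]

# L1: items that are frequent with respect to their own MS.
L1 = [item for item in order if sup[item] >= ms[item]]

# F1: items whose support reaches MS(1); a superset of L1.
F1 = [item for item in order if sup[item] >= ms1]

# C2: candidate 2-itemsets generated from F1 rather than L1.
C2 = list(combinations(F1, 2))

print(L1)  # ['A', 'B', 'E']
print(F1)  # ['A', 'B', 'C', 'E']
print(C2)
```

Item C is not frequent on its own, yet it stays in F1 because it could still form a frequent itemset together with A, whose lower MS then applies.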

Multiple Minimum Support (Liu 1999)
Modifications to Apriori:
In traditional Apriori:
  A candidate (k+1)-itemset is generated by merging two frequent itemsets of size k.
  The candidate is pruned if it contains any infrequent subset of size k.
The pruning step has to be modified:
  Prune only if the infrequent subset contains the first item.
  e.g., Candidate = {Broccoli, Coke, Milk} (ordered according to minimum support)
  {Broccoli, Coke} and {Broccoli, Milk} are frequent, but {Coke, Milk} is infrequent.
  The candidate is not pruned, because {Coke, Milk} does not contain the first item, Broccoli.
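The modified pruning rule can be sketched as follows. This implements only the rule as stated on the slide (prune when an infrequent k-subset contains the first item); `should_prune` and `classic_prune` are hypothetical helper names, and the candidate tuple is assumed to be already ordered by ascending minimum support.

```python
from itertools import combinations

def should_prune(candidate, frequent_k):
    """Modified pruning: only subsets containing the first item can prune.

    candidate  -- tuple of items ordered by ascending minimum support
    frequent_k -- set of frozensets, the frequent itemsets of size k
    """
    first = candidate[0]
    k = len(candidate) - 1
    for subset in combinations(candidate, k):
        if first in subset and frozenset(subset) not in frequent_k:
            return True
    return False

def classic_prune(candidate, frequent_k):
    """Traditional Apriori pruning: any infrequent k-subset prunes."""
    k = len(candidate) - 1
    return any(frozenset(s) not in frequent_k
               for s in combinations(candidate, k))

# Example from the slide: {Coke, Milk} is the only infrequent 2-subset.
frequent_2 = {frozenset({"Broccoli", "Coke"}),
              frozenset({"Broccoli", "Milk"})}
candidate = ("Broccoli", "Coke", "Milk")

print(should_prune(candidate, frequent_2))   # False: candidate is kept
print(classic_prune(candidate, frequent_2))  # True: would be discarded
```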

Mining Rare Association Rules

Rare Association Rule Mining: Motivation
Rare events are events that occur infrequently, perhaps in the frequency range of 0.1% to 10%.
When they do occur, the consequences can be quite dramatic or negative.
Applications:
  Hardware fault detection: faults that are rare but costly
  Medical diagnosis: diseases that are typically rare but deadly

Detecting Rare Itemsets: Apriori-Inverse
Goal: discover all rules whose support is below a maximum support threshold and above a minimum absolute support value.

Example: UCI Repository, Zoo data set. Maximum support: 0.20

Itemsets analyzed:
Itemset         Support   Used?
Venomous = 0    0.92      No
Tail = 1        0.74      No
...
Fins = 1        0.17      Yes
Venomous = 1    0.08      Yes
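The filtering step in the Zoo example can be sketched as below. The supports and the maximum support of 0.20 come from the slide; `MIN_SUP_FLOOR` is a hypothetical stand-in for the minimum absolute support, expressed here as a fraction for simplicity, and is not a value taken from the source.

```python
MAX_SUP = 0.20        # maximum support from the slide
MIN_SUP_FLOOR = 0.05  # hypothetical lower bound, for illustration only

# Item supports from the Zoo example.
item_support = {
    "Venomous = 0": 0.92,
    "Tail = 1": 0.74,
    "Fins = 1": 0.17,
    "Venomous = 1": 0.08,
}

def is_used(support, max_sup=MAX_SUP, floor=MIN_SUP_FLOOR):
    """Keep an item only if it is rare but not below the noise floor."""
    return floor <= support < max_sup

for item, support in item_support.items():
    print(f"{item:<14} {support:.2f}  {'Yes' if is_used(support) else 'No'}")

used = [item for item, s in item_support.items() if is_used(s)]
print(used)  # ['Fins = 1', 'Venomous = 1']
```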

Coincidence vs Interesting
10000 transactions:
  A appears 9500 times
  B appears 9500 times
  AB appears 9000 times
A → B (confidence = 9000/9500 ≈ 0.95)
Would we consider this rule interesting?
What if AB appears 9010 times?
Under an assumption of independence, A and B are expected to appear together about 9025 times (0.95 × 0.95 × 10000).
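The arithmetic behind the 9025 figure is short enough to check directly. This sketch uses the counts from the slide; the helper names are illustrative.

```python
def expected_cooccurrence(n, count_a, count_b):
    """Expected number of transactions containing both A and B
    if A and B were independent."""
    return count_a * count_b / n

def confidence(count_ab, count_a):
    """Confidence of the rule A -> B."""
    return count_ab / count_a

n, count_a, count_b = 10_000, 9_500, 9_500
expected = expected_cooccurrence(n, count_a, count_b)
print(expected)  # 9025.0

for observed in (9000, 9010):
    # Both observed counts fall below the independence expectation,
    # even though the confidence looks high.
    print(observed, round(confidence(observed, count_a), 2), observed < expected)
```

A confidence of about 0.95 therefore says little here: B already appears in 95% of all transactions, so the rule is no better than chance.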

Probability of Collision
Under an assumption of independence, the probability that A and B will occur together exactly c times is:

$$\mathrm{Pcc}(c \mid N, a, b) = \frac{\binom{a}{c}\binom{N-a}{b-c}}{\binom{N}{b}}$$

Here N is the number of transactions, a is the number containing A, and b is the number containing B.
Given N = 1000, a = b = 500, and c = 250, the probability of A and B occurring together exactly 250 times is 0.05.
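The collision probability can be evaluated exactly with integer binomial coefficients. This sketch assumes the hypergeometric form Pcc(c | N, a, b) = C(a, c) · C(N − a, b − c) / C(N, b), which is how the formula on this slide is read here; the function name `pcc` is illustrative.

```python
from math import comb

def pcc(c, n, a, b):
    """Probability that A and B co-occur in exactly c of n transactions,
    given that A occurs in a and B occurs in b, assuming independence."""
    if c < 0 or c > min(a, b):
        return 0.0
    # comb returns 0 when b - c exceeds n - a, so impossible cases vanish.
    return comb(a, c) * comb(n - a, b - c) / comb(n, b)

p = pcc(250, 1000, 500, 500)
print(round(p, 2))  # 0.05
```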

Minimum Absolute Support
To find the number of collisions for which Pcc is smaller than some value p (e.g., 0.0001):

$$\mathrm{minabssup}(N, a, b, p) = \min\left\{ m \;\middle|\; \sum_{i=0}^{m} \mathrm{Pcc}(i \mid N, a, b) \ge 1.0 - p \right\}$$

Given N = 1000, a = b = 500, and p = 0.0001, the minabssup value is 274.
Candidate itemsets that appear above the minabssup requirement are retained.
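A direct way to compute this threshold is to accumulate Pcc until the cumulative probability reaches 1.0 − p. The sketch below assumes the hypergeometric form of Pcc and the "smallest m whose cumulative sum reaches 1.0 − p" reading of the definition; both are interpretations of the slide, so the number it prints is not guaranteed to reproduce the figure quoted there.

```python
from math import comb

def pcc(i, n, a, b):
    """Probability of exactly i co-occurrences under independence."""
    if i < 0 or i > min(a, b):
        return 0.0
    return comb(a, i) * comb(n - a, b - i) / comb(n, b)

def cumulative_pcc(m, n, a, b):
    """Probability of at most m co-occurrences."""
    return sum(pcc(i, n, a, b) for i in range(m + 1))

def minabssup(n, a, b, p):
    """Smallest m such that the cumulative probability reaches 1.0 - p."""
    cumulative = 0.0
    upper = min(a, b)
    for m in range(upper + 1):
        cumulative += pcc(m, n, a, b)
        if cumulative >= 1.0 - p:
            return m
    return upper  # fallback if rounding keeps the sum just below 1.0 - p

threshold = minabssup(1000, 500, 500, 0.0001)
print(threshold)
```

Co-occurrence counts above the returned threshold are unlikely to be the result of chance alone at level p.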

Rare pattern
Given a user-specified minimum support threshold minsup ∈ [0, 1], X is called a rare itemset or rare pattern in D if sup(X, D) < minsup.

Roadmap for rare pattern mining

Mining Negative Rules

Negative vs Rare Patterns
Rare patterns: very low support but interesting
  E.g., buying Rolex watches
  Mining: set individual-based or special group-based support thresholds for valuable items
Negative patterns:
  Since it is unlikely that one buys a Ford Expedition (an SUV) and a Toyota Prius (a hybrid car) together, Ford Expedition and Toyota Prius are likely negatively correlated patterns.
  Negatively correlated patterns that are infrequent tend to be more interesting than those that are frequent.

Negatively Correlated Patterns
Definition 1 (support-based):
  If itemsets X and Y are both frequent but rarely occur together, i.e., sup(X ∪ Y) < sup(X) * sup(Y), then X and Y are negatively correlated.
Problem: a store sold two needle packages, A and B, 100 times each, and only one transaction contained both A and B.
  With 200 transactions in total: s(A ∪ B) = 0.005 and s(A) * s(B) = 0.25, so s(A ∪ B) < s(A) * s(B).
  With 10^5 transactions in total: s(A ∪ B) = 1/10^5 and s(A) * s(B) = 1/10^3 * 1/10^3 = 1/10^6, so s(A ∪ B) > s(A) * s(B).
Where is the problem? Null transactions: the support-based definition is not null-invariant.
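The flip in the needle-package example can be reproduced with exact fractions. This is a small sketch using the counts from the slide (100 sales each for A and B, one joint transaction); the function name is illustrative.

```python
from fractions import Fraction

def support_based_negative(n_transactions, count_a=100, count_b=100, count_ab=1):
    """Definition 1: negatively correlated if s(A u B) < s(A) * s(B)."""
    s_a = Fraction(count_a, n_transactions)
    s_b = Fraction(count_b, n_transactions)
    s_ab = Fraction(count_ab, n_transactions)
    return s_ab < s_a * s_b

# Same purchases, different numbers of null transactions.
print(support_based_negative(200))      # True:  1/200 < 1/4
print(support_based_negative(10 ** 5))  # False: 1/10^5 > 1/10^6
```

Only the number of transactions containing neither A nor B changed between the two calls, yet the verdict reversed.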

Negatively Correlated Patterns
Definition 2 (negative itemset-based):
  X is a negative itemset if (1) X = Ā ∪ B, where B is a set of positive items and Ā is a set of negative items with |Ā| ≥ 1, and (2) s(X) ≥ μ.
  Itemset X is negatively correlated if [formula missing from the transcript].
  This definition suffers from a similar null-invariance problem.
Definition 3 (Kulczynski measure-based):
  If itemsets X and Y are frequent, but (P(X|Y) + P(Y|X))/2 < ε, where ε is a negative pattern threshold, then X and Y are negatively correlated.
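The Kulczynski-based test depends only on the counts of X, Y, and their co-occurrence, so null transactions cannot change it. Below is a minimal sketch applied to two items A and B that each occur 100 times and co-occur once; the threshold `EPSILON` is a hypothetical value chosen for illustration, and the frequency precondition on X and Y is assumed to have been checked separately.

```python
EPSILON = 0.05  # hypothetical negative pattern threshold

def kulczynski(count_x, count_y, count_xy):
    """Average of the two conditional probabilities P(X|Y) and P(Y|X)."""
    p_x_given_y = count_xy / count_y
    p_y_given_x = count_xy / count_x
    return (p_x_given_y + p_y_given_x) / 2

def is_negatively_correlated(count_x, count_y, count_xy, epsilon=EPSILON):
    """Definition 3, assuming X and Y are already known to be frequent."""
    return kulczynski(count_x, count_y, count_xy) < epsilon

# The total number of transactions does not appear anywhere in the
# computation, so the result is the same for 200 or 10^5 transactions.
print(kulczynski(100, 100, 1))                # 0.01
print(is_negatively_correlated(100, 100, 1))  # True
```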

Mining Sequential Patterns

Sequential Patterns
Transaction databases, time-series databases vs. sequence databases
Frequent patterns vs. (frequent) sequential patterns
Applications of sequential pattern mining:
  Customer shopping sequences: first buy a computer, then a CD-ROM, and then a digital camera, within 3 months
  Medical treatments, natural disasters (e.g., earthquakes), science and engineering processes, stocks and markets, etc.
  Telephone calling patterns, Weblog click streams
  Program execution sequence data sets
  DNA sequences and gene structures

What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences.

A sequence database:
SID   Sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

A sequence: <(ef)(ab)(df)cb>
  An element may contain a set of items.
  Items within an element are unordered, and we list them alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
Sequential pattern mining: find the complete set of patterns satisfying the minimum support (frequency) threshold.
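The subsequence relation and the support count can be made concrete with a short sketch. Sequences are represented here as lists of Python sets, one set per element; the helpers `seq`, `is_subsequence`, and `support` are illustrative names, and the greedy left-to-right matching is one straightforward way to test containment.

```python
def seq(*elements):
    """Build a sequence from strings, e.g. seq("a", "abc") for <a(abc)>."""
    return [set(e) for e in elements]

def is_subsequence(pattern, sequence):
    """True if each pattern element is contained in a distinct element of
    the sequence, with the order of elements preserved."""
    pos = 0
    for element in pattern:
        # Advance to the earliest element that contains this pattern element.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

# The sequence database from the slide.
db = {
    10: seq("a", "abc", "ac", "d", "cf"),
    20: seq("ad", "c", "bc", "ae"),
    30: seq("ef", "ab", "df", "c", "b"),
    40: seq("e", "g", "af", "c", "b", "c"),
}

def support(pattern):
    """Number of database sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in db.values())

print(is_subsequence(seq("a", "bc", "d", "c"), db[10]))  # True
print(support(seq("ab", "c")))                           # 2
```

With min_sup = 2, the support of 2 confirms that <(ab)c> is a sequential pattern; it is contained in sequences 10 and 30.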

The Apriori Property of Sequential Patterns
A basic property: Apriori
  If a sequence S is not frequent, then none of the super-sequences of S is frequent.
  E.g., <hb> is infrequent, so <hab> and <(ah)b> are infrequent as well.

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

Given support threshold min_sup = 2
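The property lets a miner skip counting any candidate that contains a known infrequent sequence. The sketch below checks this on the five-sequence database from the slide; sequences are lists of sets, and `seq`, `contains`, `support`, and `can_skip` are hypothetical helper names.

```python
def seq(*elements):
    """Build a sequence from strings, e.g. seq("ah", "b") for <(ah)b>."""
    return [set(e) for e in elements]

def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence (greedy matching)."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

db = [
    seq("bd", "c", "b", "ac"),
    seq("bf", "ce", "b", "fg"),
    seq("ah", "bf", "a", "b", "f"),
    seq("be", "ce", "d"),
    seq("a", "bd", "b", "c", "b", "ade"),
]
MIN_SUP = 2

def support(pattern):
    return sum(contains(s, pattern) for s in db)

def can_skip(candidate, infrequent):
    """Apriori pruning: a candidate that contains an infrequent sequence
    cannot be frequent, so its support never needs to be counted."""
    return any(contains(candidate, s) for s in infrequent)

hb = seq("h", "b")
infrequent = [hb] if support(hb) < MIN_SUP else []

for candidate in (seq("h", "a", "b"), seq("ah", "b")):
    print(can_skip(candidate, infrequent), support(candidate) < MIN_SUP)
```

Both lines print `True True`: each super-sequence of <hb> is skipped by the pruning test, and a full count confirms that it is indeed below min_sup.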

Readings
Data Mining, Ian Witten: Section 6.3
Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Chapter 6
R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.
H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94.
J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00.
M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94.
S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97.
E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE'03.
Charu C. Aggarwal and Jiawei Han. Frequent Pattern Mining. Springer, 2014.