CS570 Introduction to Data Mining

Similar documents
Chapter 4: Mining Frequent Patterns, Associations and Correlations

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

Data Mining: Concepts and Techniques. (3rd ed.) Chapter 6

Basic Concepts: Association Rules. What Is Frequent Pattern Analysis? COMP 465: Data Mining Mining Frequent Patterns, Associations and Correlations

Performance and Scalability: Apriori Implementation

Data Mining: Concepts and Techniques. Chapter 5. SS Chung. April 5, 2013 Data Mining: Concepts and Techniques 1

Association Rules. A. Bellaachia Page: 1

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Association rules. Marco Saerens (UCL), with Christine Decaestecker (ULB)

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

Frequent Pattern Mining

CSE 5243 INTRO. TO DATA MINING

Association Rule Mining. Entscheidungsunterstützungssysteme (Decision Support Systems)

Data Mining Part 3. Associations Rules

BCB 713 Module Spring 2011

Roadmap. PCY Algorithm

Product presentations can be more intelligently planned

Fundamental Data Mining Algorithms

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

Decision Support Systems

Data Mining Techniques

Data Mining: Concepts and Techniques. (3rd ed.) Chapter 6

Frequent Pattern Mining

Frequent Pattern Mining. Slides by: Shree Jaswal

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

Mining Frequent Patterns without Candidate Generation

CS145: INTRODUCTION TO DATA MINING

Improved Frequent Pattern Mining Algorithm with Indexing

CS6220: DATA MINING TECHNIQUES

Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques. (3rd ed.) Chapter 6

Tutorial on Assignment 3 in Data Mining 2009 Frequent Itemset and Association Rule Mining. Gyozo Gidofalvi Uppsala Database Laboratory

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING

AN ENHANCED SEMI-APRIORI ALGORITHM FOR MINING ASSOCIATION RULES

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets

Chapter 4: Association analysis:

Scalable Frequent Itemset Mining Methods

Nesnelerin İnternetinde Veri Analizi (Data Analysis in the Internet of Things)

Optimized Frequent Pattern Mining for Classified Data Sets

Roadmap DB Sys. Design & Impl. Association rules - outline. Citations. Association rules - idea. Association rules - idea.

Association Rule Mining

A Taxonomy of Classical Frequent Item set Mining Algorithms

Mining Association Rules in Large Databases

Mining of Web Server Logs using Extended Apriori Algorithm

Association Rules Apriori Algorithm

Knowledge Discovery in Databases

Market baskets Frequent itemsets FP growth. Data mining. Frequent itemset Association&decision rule mining. University of Szeged.

Data Mining for Knowledge Management. Association Rules

Chapter 7: Frequent Itemsets and Association Rules

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

Association Rules. Berlin Chen References:

Association Rule Mining

Advance Association Analysis

Information Management course

Association mining rules

Association Rule Mining from XML Data

significantly higher than it would be if items were placed at random into baskets. For example, we

Dynamic Itemset Counting and Implication Rules For Market Basket Data

Jarek Szlichta Acknowledgments: Jiawei Han, Micheline Kamber and Jian Pei, Data Mining - Concepts and Techniques

Ascending Frequency Ordered Prefix-tree: Efficient Mining of Frequent Patterns

2 CONTENTS

Association Pattern Mining. Lijun Zhang

Improved Algorithm for Frequent Item sets Mining Based on Apriori and FP-Tree

Parallel Mining of Maximal Frequent Itemsets in PC Clusters

PLT- Positional Lexicographic Tree: A New Structure for Mining Frequent Itemsets

Lecture notes for April 6, 2005

DATA MINING II - 1DL460

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

An Improved Algorithm for Mining Association Rules Using Multiple Support Values

Lecture 2 Wednesday, August 22, 2007

CS 6093 Lecture 7 Spring 2011 Basic Data Mining. Cong Yu 03/21/2011

Improvements to A-Priori. Bloom Filters Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms Compacting Results

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

Parallel Algorithms for Discovery of Association Rules

Hash-Based Improvements to A-Priori. Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application

Association Rule Mining (ARM) Komate AMPHAWAN

Effectiveness of Freq Pat Mining

A Modern Search Technique for Frequent Itemset using FP Tree

OPTIMISING ASSOCIATION RULE ALGORITHMS USING ITEMSET ORDERING

Parallel Mining Association Rules in Calculation Grids

Salah Alghyaline, Jun-Wei Hsieh, and Jim Z. C. Lai

Information Management course

Privacy Preserving Frequent Itemset Mining Using SRD Technique in Retail Analysis

CSCI6405 Project - Association rules mining

Mining High Average-Utility Itemsets

A Comparative Study of Association Rules Mining Algorithms

Mining Frequent Patterns Based on Data Characteristics

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

An Improved Apriori Algorithm for Association Rules

1. Interpret single-dimensional Boolean association rules from transactional databases

H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. Paper's goals. H-Mine characteristics. Why a new algorithm?

Ajay Acharya et al. / International Journal on Computer Science and Engineering (IJCSE)

WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

Transcription:

CS570 Introduction to Data Mining
Frequent Pattern Mining and Association Analysis
Cengiz Gunay
Partial slide credits: Li Xiong, Jiawei Han, Micheline Kamber, and George Kollios
1

Mining Frequent Patterns, Associations and Correlations
Basic concepts
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Constraint-based association mining
2

What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
Frequent sequential pattern; frequent structured pattern
Motivation: finding inherent regularities in data
What products were often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications: basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
3

Frequent Itemset Mining
Frequent itemset mining: finding frequent sets of items in a transaction data set
First proposed by Agrawal, Imielinski, and Swami in SIGMOD 1993
SIGMOD Test of Time Award 2003: "This paper started a field of research. In addition to containing an innovative algorithm, its subject matter brought data mining to the attention of the database community. It even led several years ago to an IBM commercial, featuring supermodels, that touted the importance of work such as that contained in this paper."
R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD '93.
Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
4

Basic Concepts: Frequent Patterns and Association Rules

Transaction-id | Items bought
10 | A, B, D
20 | A, C, D
30 | A, D, E
40 | B, E, F
50 | B, C, D, E, F

[Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both]

Itemset: X = {x1, ..., xk} (k-itemset)
Frequent itemset: X with minimum support count
Support count (absolute support): count of transactions containing X
Association rule: A ⇒ B with minimum support and confidence
Support: probability that a transaction contains A ∪ B, s = P(A ∪ B)
Confidence: conditional probability that a transaction having A also contains B, c = P(B | A)
Association rule mining process: find all frequent patterns (the more costly step), then generate strong association rules
5

Illustration of Frequent Itemsets and Association Rules

Transaction-id | Items bought
10 | A, B, D
20 | A, C, D
30 | A, D, E
40 | B, E, F
50 | B, C, D, E, F

Frequent itemsets (minimum support count = 3)? {A:3, B:3, D:4, E:3, AD:3}
Association rules (minimum support = 50%, minimum confidence = 50%)?
A ⇒ D (support 60%, confidence 100%)
D ⇒ A (support 60%, confidence 75%)
A ⇒ C?
6
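The numbers above can be checked by brute force. The following is a minimal sketch (my own illustration, not part of the slides) that recomputes the support counts and the statistics of A ⇒ D and D ⇒ A on the five example transactions:

```python
from itertools import combinations

# The five example transactions (Tid 10-50)
transactions = [
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"A", "D", "E"},
    {"B", "E", "F"},
    {"B", "C", "D", "E", "F"},
]

def support_count(itemset):
    """Absolute support: number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Frequent itemsets with minimum support count 3 (checking 1- and 2-itemsets for brevity)
items = sorted(set().union(*transactions))
for k in (1, 2):
    for combo in combinations(items, k):
        c = support_count(set(combo))
        if c >= 3:
            print("".join(combo), c)        # A 3, B 3, D 4, E 3, AD 3

# Rule A => D: support = P(A ∪ D), confidence = P(D | A)
n = len(transactions)
print(support_count({"A", "D"}) / n)                        # 0.6  -> 60% support
print(support_count({"A", "D"}) / support_count({"A"}))     # 1.0  -> 100% confidence
```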

Mining Frequent Patterns, Associations and Correlations
Basic concepts
Efficient and scalable frequent itemset mining methods
Mining various kinds of association rules
From association mining to correlation analysis
Constraint-based association mining
Summary
7

Scalable Methods for Mining Frequent Patterns
Scalable mining methods for frequent patterns:
Apriori (Agrawal & Srikant @ VLDB '94) and variations
Frequent pattern growth (FP-growth: Han, Pei & Yin @ SIGMOD '00)
Algorithms using the vertical data format
Closed and maximal patterns and their mining methods
FIMI Workshop and implementation repository
8

Apriori: The Apriori Property
Apriori: use prior knowledge to reduce the search by pruning candidates that cannot be frequent
The Apriori property of frequent patterns: any nonempty subset of a frequent itemset must also be frequent
If {beer, diaper, nuts} is frequent, so is {beer, diaper}
Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated/tested!
Bottom-up search strategy
9

Apriori: Level-Wise Search Method
Level-wise search method:
Initially, scan the DB once to get the frequent 1-itemsets (L1) with minimum support
Generate length-(k+1) candidate itemsets from length-k frequent itemsets (e.g., find L2 from L1, etc.)
Test the candidates against the DB
Terminate when no frequent or candidate set can be generated
10

The Apriori Algorithm
Pseudo-code (Ck: candidate k-itemsets, Lk: frequent k-itemsets):

L1 = frequent 1-itemsets
for (k = 2; Lk-1 != ∅; k++):
    Ck = candidates generated from Lk-1
    for each transaction t in the database:
        find all candidates in Ck that are subsets of t and increment their count
    Lk = candidates in Ck with min_support
return ∪k Lk
11
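A compact Python sketch of this level-wise loop (an assumed illustration, not code from the course; function and variable names are my own; the pruning refinement of candidate generation is shown on the Candidate Set Generation slide):

```python
from collections import defaultdict

def apriori(transactions, min_count):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # L1: frequent 1-itemsets from a first scan
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    L = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(L)

    k = 2
    while L:
        # Generate Ck by self-joining Lk-1 on the first k-2 items
        prev = sorted(tuple(sorted(s)) for s in L)
        Ck = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                if prev[i][:k - 2] == prev[j][:k - 2]:      # first k-2 items agree
                    Ck.add(frozenset(prev[i]) | frozenset(prev[j]))
        # One scan over the DB to count the candidates, then filter by min_count
        counts = defaultdict(int)
        for t in transactions:
            for cand in Ck:
                if cand <= t:
                    counts[cand] += 1
        L = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(L)
        k += 1
    return frequent
```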

The Apriori Algorithm: An Example (min_support = 2)

Transaction DB:
Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan → C2 counts: {A, B}:1, {A, C}:2, {A, E}:1, {B, C}:2, {B, E}:3, {C, E}:2
L2: {A, C}:2, {B, C}:2, {B, E}:3, {C, E}:2

C3 (from L2): {B, C, E}
3rd scan → {B, C, E}:2
L3: {B, C, E}:2
12
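Running the `apriori` sketch shown after the pseudo-code above on this four-transaction database with min_support = 2 reproduces L1, L2, and L3:

```python
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, count in sorted(apriori(tdb, min_count=2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
# A 2, B 3, C 3, E 3, AC 2, BC 2, BE 3, CE 2, BCE 2
```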

Important Details of Apriori How to generate candidate sets? How to count supports for candidate sets? 13

Candidate Set Generation
Step 1: self-joining Lk-1: assuming items and itemsets are sorted in order, two itemsets are joinable only if their first k-2 items are in common
Step 2: pruning: prune a candidate if it has an infrequent subset
Example: generate C4 from L3 = {abc, abd, acd, ace, bcd}
Step 1 (self-joining L3*L3): abcd from abc and abd; acde from acd and ace
Step 2 (pruning): acde is removed because ade is not in L3
C4 = {abcd}
14
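A hedged sketch of the two-step generation on this example (the helper name `generate_candidates` is my own):

```python
from itertools import combinations

def generate_candidates(L_prev, k):
    """Self-join L(k-1) on the first k-2 items, then prune any candidate
    that has an infrequent (k-1)-subset."""
    prev = sorted(tuple(sorted(s)) for s in L_prev)
    prev_set = set(prev)
    Ck = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            if prev[i][:k - 2] == prev[j][:k - 2]:          # join step
                cand = tuple(sorted(set(prev[i]) | set(prev[j])))
                # prune step: every (k-1)-subset must itself be frequent
                if all(sub in prev_set for sub in combinations(cand, k - 1)):
                    Ck.add(cand)
    return Ck

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(generate_candidates(L3, 4))    # {('a', 'b', 'c', 'd')} -- acde is pruned
```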

How to Count Supports of Candidates?
Why is counting the supports of candidates a problem?
The total number of candidates can be huge
Each transaction may contain many candidates
Method: build a hash tree for the candidate itemsets
Leaf nodes contain lists of itemsets
Interior nodes contain a hash function determining which branch to follow
Subset function: for each transaction, find all the candidates contained in the transaction using the hash tree
15

Prefix Tree (Trie)
Prefix tree (trie, from retrieval)
Keys are usually strings
All descendants of a node share a common prefix
Advantages: fast lookup; less space when storing a large number of short strings; helps with longest-prefix matching
Applications: storing dictionaries, approximate matching algorithms (including spell checking)
16

Example: Counting Supports of Candidates
[Figure: a hash tree over candidate 3-itemsets; the interior hash function sends items 1, 4, 7 to one branch, 2, 5, 8 to another, and 3, 6, 9 to the third, and the leaves hold the candidate itemsets. The subset function walks this tree to find all candidates contained in a transaction such as {2, 3, 5, 6, 7}.]
17
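A simplified sketch of the same counting idea, using a plain prefix tree instead of the slides' hash tree (my own illustration; it walks the trie so that only candidates sharing a prefix with the transaction are ever visited):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = None      # None = interior node; an int marks a stored candidate

def build_trie(candidates):
    """Insert each sorted candidate itemset as a path in the trie."""
    root = TrieNode()
    for cand in candidates:
        node = root
        for item in sorted(cand):
            node = node.children.setdefault(item, TrieNode())
        node.count = 0
    return root

def count_in_transaction(node, transaction, start=0):
    """Increment the count of every stored candidate contained in the (sorted) transaction."""
    if node.count is not None:
        node.count += 1
    for i in range(start, len(transaction)):
        child = node.children.get(transaction[i])
        if child is not None:
            count_in_transaction(child, transaction, i + 1)

root = build_trie([("B", "C", "E")])
for t in [("A", "C", "D"), ("B", "C", "E"), ("A", "B", "C", "E"), ("B", "E")]:
    count_in_transaction(root, tuple(sorted(t)))
# The node for the candidate {B, C, E} now holds count 2
```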

Improving Efficiency of Apriori
Bottlenecks:
Multiple scans of the transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Shrink the number of candidates
Reduce the number of passes over the transaction database
Reduce the number of transactions
Facilitate support counting of candidates
18

DHP: Reduce the Number of Candidates
DHP (direct hashing and pruning): hash k-itemsets into buckets; a k-itemset whose bucket count is below the threshold cannot be frequent
Especially useful for 2-itemsets:
Generate a hash table of 2-itemsets during the scan for frequent 1-itemsets
If the count of a bucket is below the minimum support count, the itemsets in that bucket should not be included in the candidate 2-itemsets
J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD '95.
19
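A hedged sketch of the 2-itemset hashing idea (the number of buckets and the hash function are illustrative choices, not the ones from the DHP paper):

```python
from collections import defaultdict
from itertools import combinations

def first_scan_with_hashing(transactions, num_buckets=11):
    """Count 1-itemsets and, in the same scan, hash every 2-itemset of each
    transaction into a bucket and count the bucket."""
    item_counts = defaultdict(int)
    bucket_counts = [0] * num_buckets
    for t in transactions:
        items = sorted(t)
        for item in items:
            item_counts[item] += 1
        for pair in combinations(items, 2):
            bucket_counts[hash(pair) % num_buckets] += 1
    return item_counts, bucket_counts

def dhp_prune(candidate_pairs, bucket_counts, min_count, num_buckets=11):
    """A 2-itemset whose bucket count is below min_count cannot be frequent."""
    return [p for p in candidate_pairs
            if bucket_counts[hash(tuple(sorted(p))) % num_buckets] >= min_count]
```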

DHP: Reducing number of candidates 20

DHP: Reducing the Transactions
If an item occurs in a frequent (k+1)-itemset, it must occur in at least k candidate k-itemsets (necessary but not sufficient)
Discard an item if it does not occur in at least k candidate k-itemsets during support counting
J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD '95.
21

DIC: Reduce Number of Scans
DIC (dynamic itemset counting): add new candidate itemsets at partition points of the database scan
Once both A and D are determined frequent, the counting of AD begins
Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
[Figure: itemset lattice over items A, B, C, D, from {} up to ABCD, plus a timeline comparing Apriori (one itemset length per full scan: 1-itemsets, then 2-itemsets, ...) with DIC (counting of 2- and 3-itemsets starts partway through earlier scans).]
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD '97.
22

Partitioning: Reduce Number of Scans
Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
Scan 1: partition the database into n disjoint partitions and find the local frequent patterns in each (what is the local minimum support count?)
Scan 2: determine the global frequent patterns from the collection of all local frequent patterns
A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB '95.
23
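A hedged two-scan sketch of the partitioning idea, reusing the earlier `apriori` sketch as the local miner (here min_support is a fraction, and the local count threshold is scaled to the partition size, which answers the question above):

```python
def partition_mining(transactions, min_support, n_partitions=4):
    """Scan 1: mine each partition locally; Scan 2: count the union of all
    local frequent itemsets over the full database and keep the global ones."""
    size = (len(transactions) + n_partitions - 1) // n_partitions
    candidates = set()
    for start in range(0, len(transactions), size):
        part = transactions[start:start + size]
        local_min = max(1, int(min_support * len(part)))     # local support count
        candidates |= set(apriori(part, local_min))

    counts = {c: 0 for c in candidates}
    for t in transactions:
        t = frozenset(t)
        for c in candidates:
            if c <= t:
                counts[c] += 1
    global_min = min_support * len(transactions)
    return {c: n for c, n in counts.items() if n >= global_min}
```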

Sampling for Frequent Patterns
Select a sample of the original database and mine frequent patterns within the sample using Apriori
Scan the full database once to verify the frequent itemsets found in the sample
Use a lower support threshold than the minimum support on the sample
Trade off accuracy against efficiency
H. Toivonen. Sampling large databases for association rules. In VLDB '96.
24
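A simplified sketch of the sample-then-verify idea (the sampling fraction and lowering factor are illustrative assumptions; Toivonen's full method additionally checks itemsets missed by the sample, which is not shown here):

```python
import random

def sample_and_verify(transactions, min_support, sample_frac=0.1, lower_factor=0.8):
    """Mine a random sample with a lowered threshold, then make one full scan
    to keep only the itemsets that are frequent in the whole database."""
    sample_size = max(1, int(sample_frac * len(transactions)))
    sample = random.sample(transactions, sample_size)
    lowered = max(1, int(lower_factor * min_support * len(sample)))
    candidates = apriori(sample, lowered)                    # reuse the earlier sketch

    counts = {c: 0 for c in candidates}
    for t in transactions:
        t = frozenset(t)
        for c in counts:
            if c <= t:
                counts[c] += 1
    return {c: n for c, n in counts.items() if n >= min_support * len(transactions)}
```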

Scalable Methods for Mining Frequent Patterns
Scalable mining methods for frequent patterns:
Apriori (Agrawal & Srikant @ VLDB '94) and variations
Frequent pattern growth (FP-growth: Han, Pei & Yin @ SIGMOD '00)
Algorithms using the vertical data format
Closed and maximal patterns and their mining methods
FIMI Workshop and implementation repository
25