Data Mining Techniques
CS 6220 - Section 3 - Fall 2016
Lecture 16: Association Rules
Jan-Willem van de Meent
(credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec et al.)

Apriori: Summary

[Diagram: All items → (count the items) → C1 → filter → L1 → construct → C2 (all pairs of items from L1) → (count the pairs) → filter → L2 → construct → C3 (all pairs of sets that differ by one element) → ...]

1. Set k = 0
2. Define C1 as all itemsets of size 1
3. While Ck+1 is not empty:
4.     Set k = k + 1
5.     Scan the DB to determine the subset Lk ⊆ Ck with support ≥ s
6.     Construct candidates Ck+1 by combining sets in Lk that differ by one element
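As a concrete (if deliberately naive) illustration of this loop, here is a minimal Python sketch; the function name and the representation of baskets as Python sets are assumptions for illustration, not the lecture's own code.

    from itertools import combinations

    def apriori(baskets, s):
        """Return all itemsets with support >= s; baskets is a list of sets."""
        items = {i for basket in baskets for i in basket}
        candidates = [frozenset([i]) for i in items]   # C1: all size-1 itemsets
        frequent, k = [], 0
        while candidates:                              # while Ck+1 is not empty
            k += 1
            counts = {c: 0 for c in candidates}        # scan DB: count each candidate
            for basket in baskets:
                for c in candidates:
                    if c <= basket:
                        counts[c] += 1
            Lk = [c for c, n in counts.items() if n >= s]
            frequent.extend(Lk)
            # Ck+1: unions of sets in Lk that differ by exactly one element
            candidates = list({a | b for a, b in combinations(Lk, 2)
                               if len(a | b) == k + 1})
        return frequent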

Apriori: Bottlenecks

Same pipeline and pseudocode as above, with the two bottlenecks marked: scanning the DB on every pass (step 5) is I/O limited, and holding counts for all candidate pairs (the k = 2 pass in particular) is memory limited.

Apriori: Main-Memory Bottleneck

For many frequent-itemset algorithms, main memory is the critical resource. As we read baskets, we need to count something, e.g., occurrences of pairs of items, and the number of different things we can count is limited by main memory. For typical market baskets and a reasonable support threshold (e.g., 1%), the k = 2 pass requires the most memory. Swapping counts in and out is a disaster: pair counts are updated in essentially random order as baskets stream by, so almost every increment would touch a swapped-out page.

Counting Pairs in Memory

Two approaches:
Approach 1: Count all pairs using a triangular matrix.
Approach 2: Keep a table of triples [i, j, c], meaning "the count of the pair of items {i, j} is c". If integers and item ids are 4 bytes each, we need approximately 12 bytes per pair with count > 0, plus some additional overhead for the hash table.
Note: Approach 1 requires only 4 bytes per pair; Approach 2 uses 12 bytes per pair, but only for pairs with count > 0.

Comparing the 2 Approaches

[Diagram: the two memory layouts side by side. Triangular matrix: 4 bytes per pair. Triples: 12 bytes per occurring pair.]

Comparing the two approaches

Approach 1: Triangular matrix
- n = total number of items
- Count the pair of items {i, j} only if i < j
- Keep pair counts in lexicographic order: {1,2}, {1,3}, ..., {1,n}, {2,3}, {2,4}, ..., {2,n}, {3,4}, ...
- Pair {i, j} is at position (i - 1)(n - i/2) + j - i
- Total number of pairs: n(n - 1)/2; total bytes ≈ 2n^2
- The triangular matrix requires 4 bytes per pair

Approach 2 uses 12 bytes per occurring pair (but only for pairs with count > 0), so it beats Approach 1 if fewer than 1/3 of the possible pairs actually occur.
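A small Python sketch of the two layouts, using the slide's 1-based item ids; the variable names are illustrative, and the index arithmetic is done with integers to avoid rounding.

    n = 5  # total number of items, with ids 1..n

    # Approach 1: triangular matrix, one 4-byte slot for every pair {i, j}, i < j
    tri = [0] * (n * (n - 1) // 2)

    def tri_index(i, j):
        """0-based slot for pair {i, j}, 1 <= i < j <= n.
        Equivalent to the slide's (i-1)(n - i/2) + j - i, shifted to 0-based."""
        return (i - 1) * (2 * n - i) // 2 + (j - i) - 1

    tri[tri_index(2, 4)] += 1   # space is used whether or not the pair ever occurs

    # Approach 2: a table of triples [i, j, c], only for pairs with count > 0
    triples = {}
    triples[(2, 4)] = triples.get((2, 4), 0) + 1   # ~12 bytes per occurring pair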

Main-Memory: Picture of Apriori

[Diagram: main memory across the two passes. Pass 1: item counts. Pass 2: frequent items, plus counts of pairs of frequent items (the candidate pairs).]

PCY (Park-Chen-Yu) Algorithm

Observation: in pass 1 of Apriori, most memory is idle; we store only the individual item counts. Can we use that idle memory to reduce the number of candidates C2 (and therefore the memory required) in pass 2?

Pass 1 of PCY: in addition to the item counts, maintain a hash table with as many buckets as fit in memory. Keep a count for each bucket into which pairs of items are hashed. For each bucket, keep just the count, not the actual pairs that hash to the bucket!

PCY Algorithm First Pass

FOR (each basket):
    FOR (each item in the basket):
        add 1 to the item's count
    FOR (each pair of items in the basket):        # new in PCY
        hash the pair to a bucket
        add 1 to the count for that bucket

A few things to note: pairs of items need to be generated from the input file; they are not present in the file. And we are not just interested in the presence of a pair, but in whether it occurs at least s (the support threshold) times.
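A minimal Python sketch of this first pass, assuming baskets are sets of sortable, hashable item ids; the parameter n_buckets and Python's built-in hash stand in for "as many buckets as fit in memory" and the slide's hash function.

    from collections import defaultdict
    from itertools import combinations

    def pcy_pass1(baskets, n_buckets):
        item_counts = defaultdict(int)
        bucket_counts = [0] * n_buckets
        for basket in baskets:
            for item in basket:
                item_counts[item] += 1
            # pairs are generated on the fly; only the bucket count is stored
            for pair in combinations(sorted(basket), 2):
                bucket_counts[hash(pair) % n_buckets] += 1
        return item_counts, bucket_counts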

Eliminating Candidates using Buckets

Observation: if a bucket contains a frequent pair, then the bucket is surely frequent. However, even without any frequent pair, a bucket can still be frequent. So we cannot use the hash to eliminate any member (pair) of a frequent bucket. But for a bucket with total count less than s, none of its pairs can be frequent: pairs that hash to such a bucket can be eliminated as candidates (even if the pair consists of two frequent items).

Pass 2: only count pairs that hash to frequent buckets.

PCY Algorithm Between Passes

Replace the bucket counts by a bit-vector: 1 means the bucket count reached the support threshold s (call it a frequent bucket); 0 means it did not. The 4-byte integer counts are replaced by single bits, so the bit-vector requires 1/32 of the memory. Also decide which items are frequent, and list them for the second pass.
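A sketch of this between-passes step, packing one bit per bucket into a bytearray (an illustrative stand-in for a real bit-vector implementation):

    def to_bitmap(bucket_counts, s):
        """One bit per bucket: 1 iff the bucket is frequent (count >= s)."""
        bitmap = bytearray((len(bucket_counts) + 7) // 8)
        for b, count in enumerate(bucket_counts):
            if count >= s:
                bitmap[b // 8] |= 1 << (b % 8)
        return bitmap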

PCY Algorithm Pass 2

Count all pairs {i, j} that meet the conditions for being a candidate pair:
1. Both i and j are frequent items.
2. The pair {i, j} hashes to a bucket whose bit in the bit-vector is 1 (i.e., a frequent bucket).
Both conditions are necessary for the pair to have a chance of being frequent.
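Continuing the sketch, pass 2 checks both conditions before counting a pair; frequent_items is assumed to be the set of items found frequent in pass 1, and bitmap the bytearray built between passes.

    from collections import defaultdict
    from itertools import combinations

    def pcy_pass2(baskets, frequent_items, bitmap, n_buckets, s):
        pair_counts = defaultdict(int)
        for basket in baskets:
            kept = sorted(i for i in basket if i in frequent_items)  # condition 1
            for pair in combinations(kept, 2):
                b = hash(pair) % n_buckets
                if (bitmap[b // 8] >> (b % 8)) & 1:                  # condition 2
                    pair_counts[pair] += 1
        return {p: c for p, c in pair_counts.items() if c >= s}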

PCY Algorithm Summary

1. Set k = 0
2. Define C1 as all itemsets of size 1
3. Scan the DB to construct L1 ⊆ C1 and a hash table of pair counts        # new in PCY
4. Convert the pair counts to a bit-vector and construct candidates C2     # new in PCY
5. While Ck+1 is not empty:
6.     Set k = k + 1
7.     Scan the DB to determine the subset Lk ⊆ Ck with support ≥ s
8.     Construct candidates Ck+1 by combining sets in Lk that differ by one element

Main-Memory: Picture of PCY

[Diagram: main memory across the two passes. Pass 1: item counts, plus the hash table for pairs. Pass 2: frequent items, the bitmap, and counts of candidate pairs.]

Main-Memory Details

Buckets require a few bytes each (note: we do not have to count past s), and #buckets is O(main-memory size). On the second pass, a table of (item, item, count) triples is essential: we cannot use the triangular-matrix approach, because the matrix would still allocate a slot for every pair of frequent items, so the pairs eliminated by hashing would save no space. Since triples cost 12 bytes per pair versus 4, the hash table must eliminate approximately 2/3 of the candidate pairs for PCY to beat A-Priori.

Refinement: Multistage Algorithm

Goal: limit the number of candidates to be counted. Remember: memory is the bottleneck. We still need to generate all the itemsets, but we only want to count/keep track of the ones that could be frequent.

Key idea: after pass 1 of PCY, rehash (with a second, independent hash function) only those pairs that qualify for pass 2 of PCY: i and j are frequent, and {i, j} hashes to a frequent bucket from pass 1. On this middle pass, fewer pairs contribute to the buckets, so there are fewer false positives. Requires 3 passes over the data.
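A sketch of the middle pass, in the same style as the PCY sketches above; salting the tuple before hashing is an illustrative way to get a second hash function independent of the first.

    from itertools import combinations

    def multistage_pass2(baskets, frequent_items, bitmap1, n_buckets):
        bucket_counts2 = [0] * n_buckets
        for basket in baskets:
            kept = sorted(i for i in basket if i in frequent_items)
            for pair in combinations(kept, 2):
                b1 = hash(pair) % n_buckets
                if (bitmap1[b1 // 8] >> (b1 % 8)) & 1:   # qualified for pass 2 of PCY
                    # rehash with a second, independent hash function
                    bucket_counts2[hash(('h2',) + pair) % n_buckets] += 1
        return bucket_counts2

    # Pass 3 then counts pair {i, j} only if its bits are set in BOTH bitmaps.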

Main-memory: Multistage PCY

[Diagram: main memory across the three passes.]
Pass 1: item counts + first hash table. Count items; hash each pair {i, j} into the first hash table.
Pass 2: frequent items + Bitmap 1 + second hash table. Hash pair {i, j} into the second hash table iff i and j are frequent and {i, j} hashes to a frequent bucket in Bitmap 1.
Pass 3: frequent items + Bitmap 1 + Bitmap 2 + counts of candidate pairs. Count pair {i, j} iff i and j are frequent, {i, j} hashes to a frequent bucket in Bitmap 1, and {i, j} hashes to a frequent bucket in Bitmap 2.

FP-Growth Algorithm Overview

Apriori requires one pass over the data for each k (and the multistage PCY variants spend two or more passes on the pair-counting stage). Can we find all frequent itemsets in fewer passes over the data?

FP-Growth Algorithm:
Pass 1: count items, keep those with support ≥ s, and sort the frequent items in descending order of count.
Pass 2: insert each basket's frequent items, in that sorted order, into a frequent-pattern tree (FP-tree). Then mine patterns from the FP-tree.

FP-Tree Construction

[Figure: the FP-tree grows as TID = 1, TID = 2, TID = 3, ..., TID = 10 are inserted; the final tree has root children a:8 and b:2, with paths such as a:8 → b:5 → c:3.]

TID | Items Bought  | Frequent Items
1   | {a,b,f}       | {a,b}
2   | {b,g,c,d}     | {b,c,d}
3   | {h,a,c,d,e}   | {a,c,d,e}
4   | {a,d,p,e}     | {a,d,e}
5   | {a,b,c}       | {a,b,c}
6   | {a,b,q,c,d}   | {a,b,c,d}
7   | {a}           | {a}
8   | {a,m,b,c}     | {a,b,c}
9   | {a,b,n,d}     | {a,b,d}
10  | {b,c,e}       | {b,c,e}

Item counts: a: 8, b: 7, c: 6, d: 5, e: 3, f: 1, g: 1, h: 1, m: 1, n: 1
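A compact Python sketch of the construction (FPNode and build_fp_tree are illustrative names, not the lecture's code): pass 1 counts items, pass 2 inserts each basket's frequent items in descending order of count.

    from collections import Counter

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}          # item -> child FPNode

    def build_fp_tree(baskets, s):
        # Pass 1: count items and keep those with support >= s
        counts = Counter(i for basket in baskets for i in basket)
        freq = {i: c for i, c in counts.items() if c >= s}
        root = FPNode(None, None)
        # Pass 2: insert each basket's frequent items, most frequent first
        for basket in baskets:
            path = sorted((i for i in basket if i in freq),
                          key=lambda i: (-freq[i], i))
            node = root
            for i in path:
                node = node.children.setdefault(i, FPNode(i, node))
                node.count += 1
        return root, freq

With the ten baskets from the table and s = 2 (the threshold the worked example appears to use), the frequent items are a, b, c, d, e, matching the right-hand column.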

Mining Patterns from the FP-Tree

Step 1: extract the subtree ending in each item (the item's prefix paths).

[Figure: the full tree and the subtrees for e, d, c, b, and a.]

Item counts: a: 8, b: 7, c: 6, d: 5, e: 3, f: 1, g: 1, h: 1, m: 1, n: 1

Mining Patterns from the FP-Tree

Step 2: construct the conditional FP-tree for each item.

[Figure: full tree → subtree for e → conditional FP-tree for e.]

Conditional pattern base for e: acd: 1, ad: 1, bc: 1
Conditional node counts: a: 2, b: 1, c: 2, d: 2
Procedure: calculate the counts along paths ending in e, remove the leaf (e) nodes, and prune nodes with count < s.

Mining Patterns from the FP-Tree

Step 3: recursively mine the conditional FP-tree for each item.

[Figure: the conditional tree for e, and the subtrees and conditional trees for de, ce, and ae.]
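A sketch of this recursive step in Python. For clarity it mines (path, count) lists directly (each path being a basket's frequent items in the tree's insertion order) rather than walking the compressed tree; the results match, and the helper names are illustrative.

    from collections import Counter

    def fp_mine(pattern_base, s, suffix, out):
        """pattern_base: list of (path, count); all path items precede `suffix`."""
        counts = Counter()
        for path, n in pattern_base:
            for i in path:
                counts[i] += n
        for item, c in counts.items():
            if c < s:
                continue                        # prune: item infrequent in this base
            itemset = suffix | {item}
            out.append(itemset)
            # conditional pattern base: the prefix of each path up to `item`
            cond = [(path[:path.index(item)], n)
                    for path, n in pattern_base if item in path]
            fp_mine(cond, s, itemset, out)      # recurse on the conditional base

    # Seed with whole paths, e.g. base = [(('a','b'), 1), (('b','c','d'), 1), ...]
    # out = []; fp_mine(base, 2, frozenset(), out)
    # For suffix e this yields the base acd:1, ad:1, bc:1 from the table, and the
    # itemsets {e}, {d,e}, {a,d,e}, {c,e}, {a,e} with s = 2.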

Mining Patterns from the FP-Tree

[Figure: the full FP-tree.]

Suffix | Conditional Pattern Base
e      | acd: 1, ad: 1, bc: 1
d      | abc: 1, ab: 1, ac: 1, a: 1, bc: 1
c      | ab: 3, a: 1, b: 2
b      | a: 5
a      | (empty)

Suffix | Frequent Itemsets
e      | {e}, {d,e}, {a,d,e}, {c,e}, {a,e}
d      | {d}, {c,d}, {b,c,d}, {a,c,d}, {b,d}, {a,b,d}, {a,d}
c      | {c}, {b,c}, {a,b,c}, {a,c}
b      | {b}, {a,b}
a      | {a}

Projecting Sub-trees

Cutting and pruning trees requires that we create copies/mirrors of the subtrees, so mining patterns requires additional memory.

FP-Growth vs Apriori

[Figure: performance comparison on simulated data: 10k baskets, 25 items per basket on average (from Han, Kamber & Pei, Chapter 6).]

FP-Growth vs Apriori

[Figure: benchmark from http://singularities.com/blog/2015/08/apriori-vs-fpgrowth-for-frequent-item-set-mining]

FP-Growth vs Apriori

Advantages of FP-Growth:
- Only 2 passes over the dataset
- Stores a compact version of the dataset
- No candidate generation
- Faster than Apriori

Disadvantages of FP-Growth:
- The FP-tree may not be compact enough to fit in memory
- Even more memory is required to construct the subtrees in the mining phase