Introduction to Data Mining. Lecture #13: Frequent Itemsets (Part 2). Seoul National University

In This Lecture. Efficient algorithms for finding frequent itemsets: A-Priori, PCY, and 2-pass algorithms (Random Sampling, SON).

Outline: A-Priori Algorithm; PCY Algorithm; Frequent Itemsets in < 2 Passes.

A-Priori Algorithm (1). A two-pass approach called A-Priori limits the need for main memory. Key idea (monotonicity): if a set of items I appears at least s times, so does every subset J of I. E.g., if {A,C} is frequent, then {A} is frequent (and so is {C}). Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets. E.g., if {A} is not frequent, then {A,C} is not frequent. So, how does A-Priori find frequent pairs?

A-Priori Algorithm (2). Pass 1: read baskets and count in main memory the occurrences of each individual item. Requires only memory proportional to the number of items. Items that appear ≥ s times are the frequent items. Pass 2: read baskets again and count in main memory only those pairs where both elements are frequent (from Pass 1). Requires memory proportional to the square of the number of frequent items (for the counts), plus a list of the frequent items (so you know what must be counted).
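To make the two passes concrete, here is a minimal Python sketch (not taken from the lecture; the names apriori_pairs, baskets, and s are illustrative), assuming the baskets are an in-memory list of item lists:

from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    # Pass 1: count occurrences of each individual item.
    item_counts = Counter()
    for basket in baskets:
        item_counts.update(set(basket))
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose members are both frequent.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}

baskets = [["bread", "milk"], ["bread", "diaper", "beer"], ["milk", "diaper", "beer"]]
print(apriori_pairs(baskets, s=2))   # {('beer', 'diaper'): 2}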

Main-Memory: Picture of A-Priori. [Figure: main-memory layout. Pass 1 holds the item counts; Pass 2 holds the frequent items plus the counts of pairs of frequent items (candidate pairs).]

Detail for A-Priori. You can use the triangular-matrix method with n = number of frequent items. May save space compared with storing triples. Trick: re-number the frequent items 1, 2, ..., and keep a table relating the new numbers to the original item numbers. [Figure: main-memory layout. Pass 1 holds the item counts; Pass 2 holds the frequent items, the old item numbers, and the counts of pairs of frequent items.]
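A small sketch of that triangular-matrix trick (assumed names, not from the slides): after the frequent items are re-numbered 0..n-1, every unordered pair maps to a unique slot in a flat array of length n(n-1)/2.

def pair_index(i, j, n):
    # i, j in 0..n-1 with i != j; maps the pair {i, j} to a unique slot
    if i > j:
        i, j = j, i
    return i * (2 * n - i - 1) // 2 + (j - i - 1)

n = 4                                    # number of frequent items after re-numbering
pair_counts = [0] * (n * (n - 1) // 2)   # one counter per unordered pair
pair_counts[pair_index(2, 0, n)] += 1    # record one occurrence of the pair {0, 2}
print(pair_counts)                       # [0, 1, 0, 0, 0, 0]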

Frequent Triples, Etc. For each k, we construct two sets of k-tuples (sets of size k): C_k = candidate k-tuples, those that might be frequent sets (support ≥ s) based on information from the pass for k-1; L_k = the set of truly frequent k-tuples. [Figure: the A-Priori pipeline. C_1 (all items) → count the items → filter → L_1 → construct → C_2 (all pairs of items from L_1) → count the pairs → filter → L_2 → construct → C_3 → ... The "construct" step is explained below.]

Example: hypothetical steps of the A-Priori algorithm. C_1 = { {b}, {c}, {j}, {m}, {n}, {p} }. Count the support of itemsets in C_1; prune non-frequent: L_1 = { b, c, j, m }. Generate C_2 = { {b,c}, {b,j}, {b,m}, {c,j}, {c,m}, {j,m} }. Count the support of itemsets in C_2; prune non-frequent: L_2 = { {b,c}, {b,m}, {c,j}, {c,m} }. Generate C_3 = { {b,c,m}, {b,c,j}, {b,m,j}, {c,m,j} }. Count the support of itemsets in C_3; prune non-frequent: L_3 = { {b,c,m} }. Note: here we generate new candidates by building C_k from L_{k-1} and L_1. But one can be more careful with candidate generation; for example, in C_3 we already know {b,m,j} cannot be frequent, since {m,j} is not frequent.

Generating C_3 From L_2. Assume {x1, x2, x3} is frequent. Then {x1,x2}, {x1,x3}, and {x2,x3} are frequent, too. Therefore, if any of {x1,x2}, {x1,x3}, {x2,x3} is NOT frequent, then {x1,x2,x3} is NOT frequent! So, to generate C_3 from L_2: find two frequent pairs of the form {a,b} and {a,c} (this can be done efficiently if we sort L_2), then check whether {b,c} is also frequent; if yes, include {a,b,c} in C_3.
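A hedged sketch of this generation step (function and variable names are illustrative); checking all three 2-subsets of each triple against L_2 yields the same C_3 as the pair-join plus the {b,c} check described above.

from itertools import combinations

def gen_c3(L2):
    L2 = {frozenset(p) for p in L2}
    items = sorted(set().union(*L2))
    C3 = set()
    for a, b, c in combinations(items, 3):
        sub_pairs = [frozenset({a, b}), frozenset({a, c}), frozenset({b, c})]
        if all(p in L2 for p in sub_pairs):   # every 2-subset must be frequent
            C3.add(frozenset({a, b, c}))
    return C3

L2 = [{"b", "c"}, {"b", "m"}, {"c", "j"}, {"c", "m"}]
print(gen_c3(L2))   # only {b, c, m}; {b, m, j} is never generated since {m, j} is not in L2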

A-Priori for All Frequent Itemsets. One pass for each k (itemset size). Needs room in main memory to count each candidate k-tuple. For typical market-basket data and a reasonable minimum support (e.g., 1%), k = 2 requires the most memory. Many possible extensions: association rules with intervals (for example: men over 60 have 2 cars); association rules when items are in a taxonomy (Bread, Butter → FruitJam; BakedGoods, MilkProduct → PreservedGoods); lowering the min. support s as the itemsets get bigger.

Outline: A-Priori Algorithm; PCY Algorithm; Frequent Itemsets in < 2 Passes.

PCY (Park-Chen-Yu) Algorithm. Observation: in Pass 1 of A-Priori, most memory is idle; we store only individual item counts. Can we use the idle memory to reduce the memory required in Pass 2? Pass 1 of PCY: in addition to item counts, maintain a hash table with as many buckets as fit in memory. Keep a count for each bucket into which pairs of items are hashed. For each bucket, just keep the count, not the actual pairs that hash to the bucket!

PCY Algorithm, First Pass.
FOR (each basket):
    FOR (each item in the basket):
        add 1 to the item's count;
    FOR (each pair of items in the basket):   (new in PCY)
        hash the pair to a bucket;
        add 1 to the count for that bucket;
A few things to note: pairs of items need to be generated from the input as the baskets are read; they are not present explicitly in the file. We are not just interested in the presence of a pair; we need to see whether it is present at least s (support) times.
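A rough Python sketch of PCY's first pass (assumed names; Python's built-in hash stands in for a real hash function, and baskets is an iterable of item lists):

from collections import Counter
from itertools import combinations

def pcy_pass1(baskets, n_buckets):
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for basket in baskets:
        items = sorted(set(basket))
        for item in items:
            item_counts[item] += 1              # same as A-Priori Pass 1
        for pair in combinations(items, 2):     # pairs are generated, not read from the file
            b = hash(pair) % n_buckets          # hash the pair to a bucket
            bucket_counts[b] += 1               # keep only the count, never the pair itself
    return item_counts, bucket_counts

item_counts, bucket_counts = pcy_pass1([["b", "m"], ["b", "d", "beer"]], n_buckets=8)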

Example. Assume min. support = 10, and the pair supports are sup(1,2) = 10, sup(3,5) = 10, sup(2,3) = 5, sup(1,5) = 4, sup(1,6) = 7, sup(4,5) = 8. Suppose the pairs hash into three buckets: {1,2} and {3,5} share one bucket (total count 20), {2,3} and {1,5} share another (total count 9), and {1,6} and {4,5} share a third (total count 15). Note that {2,3} and {1,5} cannot be frequent itemsets. (Why?)

Observations about Buckets. Observation: if a bucket contains a frequent pair, then the bucket is surely frequent. However, even without any frequent pair, a bucket can still be frequent. So, we cannot use the hash to eliminate any member (pair) of a frequent bucket. But for a bucket with total count less than s, none of its pairs can be frequent: pairs that hash to this bucket can be eliminated from the candidates (even if the pair consists of two frequent items). E.g., even though {A} and {B} are frequent, the count of the bucket containing {A,B} might be < s. Pass 2: only count pairs that hash to frequent buckets.

PCY Algorithm Between Passes. Replace the buckets by a bit-vector: 1 means the bucket count met or exceeded the support s (call it a frequent bucket); 0 means it did not. 4-byte integer counts are replaced by bits, so the bit-vector requires 1/32 of the memory. Also, decide which items are frequent and list them for the second pass.
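A small sketch of this between-pass step (assumed names), using a Python integer as the bit-vector and the bucket counts from the earlier example:

def buckets_to_bitmap(bucket_counts, s):
    bitmap = 0
    for b, count in enumerate(bucket_counts):
        if count >= s:
            bitmap |= 1 << b          # bucket b is a frequent bucket
    return bitmap

def bucket_is_frequent(bitmap, b):
    return (bitmap >> b) & 1 == 1

bitmap = buckets_to_bitmap([20, 9, 15], s=10)
print(bucket_is_frequent(bitmap, 0), bucket_is_frequent(bitmap, 1))   # True False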

PCY Algorithm Pass 2. Count all pairs {i, j} that meet the conditions for being a candidate pair: 1. both i and j are frequent items; 2. the pair {i, j} hashes to a bucket whose bit in the bit-vector is 1 (i.e., a frequent bucket). Both conditions are necessary for the pair to have a chance of being frequent.
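Continuing the hypothetical sketch (reusing frequent_items, bitmap, and n_buckets from the Pass-1 sketch, with the same stand-in hash), Pass 2 counts a pair only when both conditions hold:

from collections import Counter
from itertools import combinations

def pcy_pass2(baskets, frequent_items, bitmap, n_buckets, s):
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)          # condition 1: both items frequent
        for pair in combinations(kept, 2):
            if (bitmap >> (hash(pair) % n_buckets)) & 1:     # condition 2: frequent bucket
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}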

Main-Memory: Picture of PCY. [Figure: main-memory layout. Pass 1 holds the item counts and the hash table for pairs; Pass 2 holds the frequent items, the bitmap, and the counts of candidate pairs.]

Main-Memory Details. Buckets require only a few bytes each: note that we do not have to count past s, so if s < 256, we need at most 1 byte per bucket. The number of buckets is O(main-memory size). A large number of buckets helps. (How?)

Refinement: Multistage Algorithm. Limit the number of candidates to be counted. Remember: memory is the bottleneck; we only want to count/keep track of the pairs that have a chance of being frequent. Key idea: after Pass 1 of PCY, rehash (into a second hash table) only those pairs that qualify for Pass 2 of PCY, i.e., i and j are frequent and {i, j} hashes to a frequent bucket from Pass 1. On the middle pass, fewer pairs contribute to the buckets, so there are fewer false positives. Requires 3 passes over the data.

Main-Memory: Multistage. [Figure: main-memory layout over three passes. Pass 1: item counts and the first hash table; count items and hash pairs {i,j}. Pass 2: frequent items, Bitmap 1, and the second hash table; hash a pair {i,j} into the second table iff i and j are frequent and {i,j} hashes to a frequent bucket in Bitmap 1. Pass 3: frequent items, Bitmap 1, Bitmap 2, and the counts of candidate pairs; count a pair {i,j} iff i and j are frequent, {i,j} hashes to a frequent bucket in Bitmap 1, and {i,j} hashes to a frequent bucket in Bitmap 2.]

Multistage Pass 3. Count only those pairs {i, j} that satisfy these candidate-pair conditions: 1. both i and j are frequent items; 2. using the first hash function, the pair hashes to a bucket whose bit in the first bit-vector is 1; 3. using the second hash function, the pair hashes to a bucket whose bit in the second bit-vector is 1.
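A sketch of that three-way candidate test (h1 and h2 are assumed, independent hash functions returning bucket numbers; bitmap1 and bitmap2 are the two bit-vectors):

def is_multistage_candidate(i, j, frequent_items, bitmap1, bitmap2, h1, h2):
    pair = (i, j)
    return (i in frequent_items and j in frequent_items   # condition 1
            and (bitmap1 >> h1(pair)) & 1 == 1            # condition 2: frequent bucket under h1
            and (bitmap2 >> h2(pair)) & 1 == 1)           # condition 3: frequent bucket under h2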

Important Points. 1. The two hash functions have to be independent. 2. We need to check both hashes on the third pass; if not, we may end up counting pairs of items that hashed first to an infrequent bucket but happened to hash second to a frequent bucket.

Refinement: Multihash. Key idea: use several independent hash tables on the first pass. Risk: halving the number of buckets doubles the average count; we have to be sure most buckets will still not reach count s. If so, we can get a benefit like Multistage, but in only 2 passes.
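A sketch of a Multihash first pass (illustrative names; the per-table salted Python hashes are only a stand-in for truly independent hash functions):

from itertools import combinations

def multihash_pass1(baskets, n_buckets, n_tables=2):
    tables = [[0] * n_buckets for _ in range(n_tables)]
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            for t in range(n_tables):
                tables[t][hash((t, pair)) % n_buckets] += 1   # each table gets its own hash
    return tables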

Main-Memory: Multihash. [Figure: main-memory layout. Pass 1 holds the item counts plus the first and second hash tables; Pass 2 holds the frequent items, Bitmap 1, Bitmap 2, and the counts of candidate pairs.]

PCY: Extensions. Either Multistage or Multihash can use more than two hash functions. In Multistage, there is a point of diminishing returns, since the bit-vectors eventually consume all of main memory: if we spend too much space on bit-vectors, we run out of space for candidate pairs. For Multihash, the bit-vectors occupy exactly the space of one PCY bitmap, but too many hash functions make all the bucket counts exceed s.

Outline: A-Priori Algorithm; PCY Algorithm; Frequent Itemsets in < 2 Passes.

Frequent Itemsets in < 2 Passes. A-Priori, PCY, etc., take k passes to find frequent itemsets of size k. Can we use fewer passes? Methods that use 2 or fewer passes for all sizes: Random Sampling; SON (Savasere, Omiecinski, and Navathe); Toivonen's algorithm (see the textbook).

Random Sampling (1). Take a random sample of the market baskets. Run A-Priori or one of its improvements in main memory, so we don't pay for disk I/O each time we increase the size of the itemsets. Reduce the min. support proportionally to match the sample size. [Figure: main memory holds a copy of the sample baskets plus space for counts.]
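A minimal sketch of the sampling idea (assumed names; apriori_pairs is the earlier hypothetical sketch, and p is the sampling fraction): keep each basket with probability p and scale the support threshold down by the same factor.

import random

def sample_frequent_pairs(baskets, s, p=0.01, seed=0):
    rng = random.Random(seed)
    sample = [b for b in baskets if rng.random() < p]   # in-memory copy of the sample baskets
    return apriori_pairs(sample, max(1, int(s * p)))    # proportionally reduced support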

Random Sampling (2). Optionally, verify that the candidate pairs are truly frequent in the entire data set with a second pass (avoids false positives). But you cannot catch sets that are frequent in the whole data yet not in the sample (false negatives cannot be avoided). A smaller min. support (e.g., s/125 when the proportional threshold would be s/100) helps catch more truly frequent itemsets, but requires more space.

SON Algorithm (1). Repeatedly read small subsets of the baskets into main memory and run an in-memory algorithm to find all frequent itemsets. We are not sampling; we are processing the entire file in memory-sized chunks. The min. support decreases to s/k for k chunks. An itemset becomes a candidate if it is found to be frequent in any one or more subsets of the baskets.

SON Algorithm (2). On a second pass, count all the candidate itemsets and determine which are frequent in the entire set. Key monotonicity idea: an itemset cannot be frequent in the entire set of baskets unless it is frequent in at least one subset. Task: find frequent (≥ s) itemsets among n baskets. The n baskets are divided into k subsets; load n/k baskets in memory and look for locally frequent (≥ s/k) itemsets.

SON: Distributed Version. SON lends itself to distributed data mining, with the baskets distributed among many nodes. Phase 1: find candidate itemsets. Phase 2: find the truly frequent itemsets; distribute the candidates to all nodes and accumulate the counts of all candidates.

SON: Map/Reduce. Phase 1: find candidate itemsets. Map: each machine finds the locally frequent itemsets for the subset of baskets assigned to it. Reduce: collect and output the candidate frequent itemsets (removing duplicates). Phase 2: find the truly frequent itemsets. Map: output (candidate_itemset, count) for the subset of baskets assigned to it. Reduce: sum up the counts and output the truly frequent (≥ s) itemsets.
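A hedged, single-machine sketch of the two SON phases (not actual MapReduce code; apriori_pairs is the earlier hypothetical in-memory miner, and chunks is a list of basket lists):

def son(chunks, s, total_baskets):
    # Phase 1: mine each chunk with a proportionally lowered threshold; the union of
    # the locally frequent itemsets (duplicates removed) is the candidate set.
    candidates = set()
    for chunk in chunks:
        local_s = max(1, s * len(chunk) // total_baskets)
        candidates |= set(apriori_pairs(chunk, local_s))

    # Phase 2: count every candidate over all chunks and keep the truly frequent ones.
    counts = {c: 0 for c in candidates}
    for chunk in chunks:
        for basket in chunk:
            basket_set = set(basket)
            for cand in candidates:
                if set(cand) <= basket_set:
                    counts[cand] += 1
    return {c: n for c, n in counts.items() if n >= s}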

What You Need to Know. Frequent itemsets: one of the most classical and important data mining tasks. Association rules: {A} → {B}; confidence, support, interestingness. Algorithms for finding frequent itemsets: A-Priori, PCY, and 2-pass algorithms (Random Sampling, SON).

Questions?