Sampling Large Databases for Association Rules


Overview

- Paper presented at the VLDB Conference in 1996
- Proposes a methodology for finding association rules in a sample
- Contributions:
  - Negative border
  - Sample size
  - Lowered support threshold

Paper author: Hannu Toivonen
Presented by: Scott Bass
Presented: March 28, 2007

Introduction

Large data sets
- Are necessary for reliable results
- Lead to decreased efficiency when scanning the database
- The presented algorithm can use only one full pass through the database

Level-wise algorithms
- Example: Apriori
- Standard algorithms use K or K+1 database scans, where K is the size of the largest frequent item set
- The number of scans can be reduced by scanning multiple levels at a time
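To make the scan count of a level-wise algorithm concrete, here is a minimal Apriori-style sketch in Python. It is an illustration only, not the implementation used in the paper or the talk; the function name `apriori`, the toy transactions, and the threshold are all assumptions.

```python
from itertools import combinations

def apriori(transactions, min_fr):
    """Level-wise search for frequent item sets; one data pass per level."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}   # item set -> relative frequency
    passes = 0
    while candidates:
        passes += 1
        counts = {c: 0 for c in candidates}
        for t in transactions:              # one full scan for this level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c for c, k in counts.items() if k / n >= min_fr}
        frequent.update({c: counts[c] / n for c in level})
        # Next level: join frequent sets, keep those whose subsets are all frequent.
        size = len(next(iter(level))) + 1 if level else 0
        joined = {a | b for a in level for b in level if len(a | b) == size}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
    return frequent, passes

# Hypothetical toy data: every pair is frequent, the triple is not.
demo_transactions = ["ABC", "AB", "AC", "BC", "ABC"]
demo_frequent, demo_passes = apriori(demo_transactions, min_fr=0.5)
```

On this toy input the largest frequent sets have size K = 2, and a third pass is still needed to count the candidate {A,B,C} and find it infrequent, which is the K+1 case.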

Partition Algorithm

- Select partitions of the database small enough to fit in memory
- First scan:
  - Loads a partition into memory
  - Frequent item set generation proceeds normally on the transactions in memory
- Second scan:
  - Uses the union of the locally frequent item sets as the candidate set for the entire database
- Results in a reduction in database activity

[Figure: Partition Algorithm. Partitions 1 to n each yield locally frequent sets 1 to n; the union of frequent sets 1 to n forms the candidate set, and the entire DB is scanned again to count frequencies.]

Sampling of frequent sets

- Take a random sample of the relation
- Use level-wise algorithms (e.g., Apriori) to determine the locally frequent item sets
- Approximates properties (e.g., frequent item sets) of the entire relation
- Can be used to efficiently determine a superset of the frequent item set collection

[Figure: a single partition (the sample) is used to represent the entire database.]
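For comparison with sampling, the two-scan Partition algorithm described on this slide can be sketched as follows. This is a minimal sketch under stated assumptions: the helper `frequent_sets` is a brute-force stand-in for an in-memory miner such as Apriori, and the names `partition_mine` and `demo_db` are hypothetical.

```python
from itertools import combinations

def frequent_sets(transactions, min_fr):
    """Brute-force frequent item sets of an in-memory list of transactions."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for c in combinations(items, k):
                key = frozenset(c)
                counts[key] = counts.get(key, 0) + 1
    return {c for c, k in counts.items() if k >= min_fr * n}

def partition_mine(database, n_partitions, min_fr):
    """Two scans: local mining per partition, then a global count."""
    size = -(-len(database) // n_partitions)   # ceiling division
    # First scan: mine each partition in memory, collect locally frequent sets.
    candidates = set()
    for start in range(0, len(database), size):
        candidates |= frequent_sets(database[start:start + size], min_fr)
    # Second scan: count the union of local results over the entire database.
    counts = {c: 0 for c in candidates}
    for t in database:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return {c for c, k in counts.items() if k >= min_fr * len(database)}

# Hypothetical toy database of six transactions, split into two partitions.
demo_db = [frozenset(t) for t in ["AB", "ABC", "AC", "BC", "ABC", "AD"]]
demo_result = partition_mine(demo_db, n_partitions=2, min_fr=0.5)
```

The second scan is sufficient because any set that is frequent in the whole database must be frequent in at least one partition at the same relative threshold, so the union of local results is a superset of the true frequent sets.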

Sampling of frequent sets

- Need to lower the support threshold
- Trades accuracy for efficiency
- Useful for:
  - Using samples small enough to fit in memory to provide reasonably accurate results
  - Approximating the results from a sample to tune a more complete discovery phase (e.g., tune algorithm parameters, select an appropriate algorithm)

Negative border

- Consists of the minimal item sets such that:
  - The sets are not in the collection of sets frequent in the sample
  - All of their subsets are in that collection
- Includes all 1-item sets that are not frequent in the sample

Example
- Assume the frequent item set collection from a sample contains: {A,D}, {B,D}
- Since the threshold is lowered, it is likely that these sets form a superset of the true frequent sets
- Sample frequent sets: {A,D}, {B,D}
- The negative border is therefore: {B,F}, {C,D}, {D,F}, {E}
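The definition of the negative border can be sketched in a few lines of Python. The collection below is an illustrative input chosen for this sketch, not the one on the slide, and the function name `negative_border` is an assumption. The sketch also assumes the collection is closed under subsets, as a collection of frequent sets always is, so checking the subsets one item smaller is enough.

```python
from itertools import combinations

def negative_border(items, collection):
    """Minimal item sets not in `collection` whose proper subsets all are.

    Assumes `collection` is downward closed (every subset of a member is a
    member); the empty set is treated as trivially frequent.
    """
    closed = {frozenset(s) for s in collection} | {frozenset()}
    max_size = max(len(s) for s in closed) + 1
    border = set()
    for k in range(1, max_size + 1):
        for combo in combinations(sorted(items), k):
            c = frozenset(combo)
            if c in closed:
                continue
            # With a downward-closed collection, the (k-1)-subsets suffice.
            if all(frozenset(s) in closed for s in combinations(combo, k - 1)):
                border.add(c)
    return border

# Illustrative (hypothetical) collection over the items A..F.
demo_items = "ABCDEF"
demo_collection = ["A", "B", "C", "F", "AB", "AC", "AF", "CF", "ACF"]
demo_border = negative_border(demo_items, demo_collection)
```

For this input the border consists of {D} and {E}, which are infrequent single items, plus {B,C} and {B,F}, whose single-item subsets are all in the collection.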

Example (scenario 1):
- Full scan counts the frequency of: {A,D}, {B,D}, {B,F}, {C,D}, {D,F}, {E}
- Full DB frequent sets: {A,B}, {A,C,F} and their subsets
- All frequent sets have been found, since none of the sets in the negative border were frequent

Example (scenario 2):
- Full scan counts the frequency of: {A,D}, {B,D}, {B,F}, {C,D}, {D,F}, {E}
- Full DB frequent sets: {A,B}, {A,C,F}, {B,F} and their subsets
- {B,F} is a miss, since it is frequent and in the negative border
- Since {A,B}, {A,F}, and {B,F} are frequent, {A,B,F} may also be frequent but was never counted, so another pass is required

Sample size

- Assume sampling with replacement
- Database is large enough not to impact sampling

$$ s \ge \frac{1}{2\varepsilon^2} \ln \frac{2}{\delta} $$

where $s$ is the sample size, $\varepsilon$ is the error bound, and $\delta$ is the maximum probability of an error exceeding the bound.

Sample size

- For a maximum probability of 1% of obtaining an error greater than 1%, a sample of at least 27,000 transactions is required from an infinitely large database

Removing the assumption of an infinitely large database:

$$ s \ge \frac{1}{2\varepsilon^2} \ln \frac{2S}{\delta} $$

where $s$ is the sample size, $S$ is the database size, $\varepsilon$ is the error bound, and $\delta$ is the probability of exceeding the error bound.

- For a maximum probability of 1% of obtaining an error greater than 1% from a database containing 100,000 transactions, a sample of at least 84,000 transactions is required (84% of the database)
- If we change the database size to 1,000,000 transactions, the sample size is 95,569 transactions (9.6% of the database)

Lowered frequency

- Probability of a miss is at most $\delta$ for a lowered frequency of

$$ low\_fr < min\_fr - \sqrt{\frac{1}{2s} \ln \frac{1}{\delta}} $$
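The bounds on this slide are easy to evaluate numerically. The sketch below assumes the formulas read s ≥ ln(2/δ) / (2ε²), s ≥ ln(2S/δ) / (2ε²), and low_fr < min_fr − sqrt(ln(1/δ) / (2s)), and that the slide's examples use ε = δ = 1%; with those assumptions the computed sample sizes agree with the slide's figures to within rounding. The function names are hypothetical.

```python
import math

def sample_size(eps, delta):
    """Smallest integer s with s >= ln(2/delta) / (2 * eps^2)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def sample_size_finite(eps, delta, db_size):
    """Same bound with the database size S included: ln(2S/delta) / (2 * eps^2)."""
    return math.ceil(math.log(2 * db_size / delta) / (2 * eps ** 2))

def lowered_frequency(min_fr, s, delta):
    """Upper limit for low_fr: min_fr - sqrt(ln(1/delta) / (2 * s))."""
    return min_fr - math.sqrt(math.log(1 / delta) / (2 * s))

# Assumed parameters: error bound and failure probability both 1%.
EPS, DELTA = 0.01, 0.01
size_infinite = sample_size(EPS, DELTA)
size_100k = sample_size_finite(EPS, DELTA, 100_000)
size_1m = sample_size_finite(EPS, DELTA, 1_000_000)
print(size_infinite, size_100k, size_1m)
```

Because the sketch rounds up to the next whole transaction, its result for the 1,000,000-transaction database is one larger than the 95,569 reported on the slide. `lowered_frequency` takes the original threshold, the sample size, and the acceptable miss probability, and returns the value the lowered threshold must stay below.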

Lowered frequency

Example: For a given threshold of 2%, a miss probability of 1%, and a sample size of 50,000, the lowered frequency for the sample will be

$$ low\_fr < 0.02 - \sqrt{\frac{1}{2(50000)} \ln \frac{1}{0.01}} \approx 0.0132 $$

Remarks

- Empirical evidence supports the proposed probability bounds
- Defined bounds for sample size and lowered support threshold, given an error threshold, the probability of this error, and optionally the database size
- Efficiency is improved using DB sampling

Questions?
