
CS 6604: Data Mining, Fall 2007
Lecture 2, Wednesday, August 22, 2007
Lecturer: Naren Ramakrishnan
Scribe: Clifford Owens

1 Searching for Sets

The canonical data mining problem is to search for frequent subsets from a set of sets (sometimes called a structured set). The typical instantiation is known as market basket analysis, where we are given a set (database) of sets (transactions) of items. An example is:

  transaction ID   items
  t1               {a, b}
  t2               {a, c}
  t3               {b, c, a}

Databases are typically stored in one of two formats. A horizontal representation, as shown above, is indexed by transaction ID and stores the items in each transaction. A vertical representation, on the other hand, is indexed by item and stores the transactions in which each item occurs. Different algorithms are tuned to work with different representations. In this lecture, we will focus on horizontal representations.

Given I, the set of all items in the database, the frequency (or support) of an itemset s ⊆ I is the number of transactions that contain s, as a fraction of the total number of transactions. The frequent itemset mining problem is to find all subsets of I that have support greater than or equal to some specified threshold θ.

It is folklore in data mining that, in mining consumer transactions from supermarkets, the itemset {beer, diapers} was observed to be frequent. Given this observation, a marketer would like to know whether it is more likely that beer purchasers also buy diapers (represented by the rule beer ⇒ diapers), or the other way round (i.e., diapers ⇒ beer), or both. The confidence captures these distinctions: the confidence of a rule s ⇒ t, with s, t ⊆ I, is given by support(s ∪ t) / support(s). Observe that the confidence of beer ⇒ diapers and the confidence of diapers ⇒ beer use the same numerator but have different denominators. Similarly, from an itemset of size k, we can investigate 2^k − 2 possible rules.

The association rule mining problem is to first find frequent itemsets that satisfy a given support threshold and then, for each itemset, identify confident rules. We will skip the second problem since the first one is more intensive (why?) and because there is a version of the first problem hiding inside the second (think about it).

2 Algorithms for Searching Set Spaces

A trivial algorithm would work as follows. It scans one transaction at a time. For each transaction, it increments the count of every subset occurring in that transaction. For a transaction that has k items, it will increment the counts of 2^k − 1 subsets. It is easy to see that this algorithm requires only one pass through the database. However, it is very wasteful, since most subsets will not be frequent but will be counted anyway.
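To make these definitions concrete, here is a small Python sketch (not part of the original notes; it simply uses the toy three-transaction database above) that computes support and confidence, along with the trivial one-pass algorithm that counts every nonempty subset of every transaction:

```python
from itertools import combinations
from collections import Counter

# The toy horizontal database from the text: transaction ID -> set of items.
db = {"t1": {"a", "b"}, "t2": {"a", "c"}, "t3": {"b", "c", "a"}}

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= txn for txn in db.values()) / len(db)

def confidence(s, t, db):
    """Confidence of the rule s => t, i.e. support(s union t) / support(s)."""
    return support(set(s) | set(t), db) / support(s, db)

def naive_counts(db):
    """The trivial one-pass algorithm: increment the count of every nonempty
    subset (2^k - 1 of them for a k-item transaction), frequent or not."""
    counts = Counter()
    for txn in db.values():
        items = sorted(txn)
        for r in range(1, len(items) + 1):
            for sub in combinations(items, r):
                counts[sub] += 1
    return counts

print(support({"a", "b"}, db))       # 2/3
print(confidence({"a"}, {"b"}, db))  # support({a,b}) / support({a}) = 2/3
print(naive_counts(db)[("a", "c")])  # 2
```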

The AIS (Agrawal-Imielinski-Swami) algorithm, presented in [1] and named after its authors, is credited with establishing the modern-day focus of data mining from secondary storage. It makes multiple passes through the database, and in each pass it generates and evaluates candidates. In the first pass, all singleton itemsets, such as {a}, {b}, etc., are considered candidates. For each transaction read, for every item in that transaction, the count of the corresponding singleton itemset is incremented. At the end of the pass, those singleton itemsets that do not have sufficient support are discarded. In the second pass, the algorithm constructs itemsets of size 2 (using the results of the first pass) and counts their support. It does this as follows: for each transaction read, it determines whether any of the frequent singleton itemsets is present in it. If so, that singleton itemset is extended with every other item present in the transaction, and these extensions become the candidates for this pass. For instance, if {a} is found frequent in the first pass and we encounter a transaction containing {a, b, d} in the second pass, then we create two candidates: {a, b} and {a, d}. Similarly, at pass 3, we create and evaluate candidates of size 3 by extending those itemsets found frequent in pass 2 by one more item. In general, at pass k, we extend the frequent itemsets of size k − 1 by one more item.

It does not seem to be widely known that the algorithm presented in [1] actually offers a variety of ways to configure this search. First, any given pass can be made more aggressive by extending itemsets with not just one more item but several more. The reason for doing this is that even if the extended-by-one itemset is frequent, yet another pass is required to ensure that its extensions are not frequent. In other words, we need to step across the border to know that we have crossed the border. The approach in [1] is to first conduct a probabilistic modeling of the frequencies of the individual items, i.e., the singleton itemsets. Using these estimates, given a frequent itemset, we can determine which of its extensions are expected to be frequent. To allow for stepping across the border, the AIS algorithm identifies those extensions that are expected to be frequent and also those that are expected to be infrequent; all of them are evaluated in the current pass.

The second configurable aspect of the AIS algorithm is how to determine which itemsets are to be extended in a given pass (these are called frontier itemsets in [1]). So far we have implicitly assumed that we will attempt to extend only those found frequent in the previous pass. This might be wasteful (why?). Another option is to extend only those that are maximal, i.e., those not contained in another itemset that has also been determined to be frequent. The paper [1] shows that this can lead to missing some itemsets. The correct approach is to include, besides the maximal itemsets, those itemsets that were expected to be infrequent but turned out to be frequent.

We have slyly introduced three important lessons in data mining. First, the support criterion for itemsets is anti-monotonic, i.e., if an itemset does not have support, then none of its supersets can have support. This is also known as a downward closure property (since all subsets of a frequent set are frequent). The second lesson derives from the first.
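As a rough illustration, the extension-and-count step of a single pass might look like the sketch below. This is a simplification of my own: it drops the expected-support estimation and the frontier bookkeeping that [1] actually uses to decide what to extend and by how much.

```python
from collections import Counter

def ais_pass(db, frontier, minsup):
    """One simplified AIS-style pass: each frontier itemset found in a
    transaction is extended with the items of that transaction that come
    after its last element (so no candidate is generated twice); the
    extensions are counted and the frequent ones are returned.
    `db` maps transaction IDs to item sets; itemsets are sorted tuples."""
    counts = Counter()
    for txn in db.values():
        items = sorted(txn)
        for itemset in frontier:
            if set(itemset) <= txn:
                for item in items:
                    if not itemset or item > itemset[-1]:
                        counts[itemset + (item,)] += 1
    return {c for c, n in counts.items() if n / len(db) >= minsup}

# Pass 1 starts from the empty frontier; pass 2 extends the frequent singletons.
db = {"t1": {"a", "b"}, "t2": {"a", "c"}, "t3": {"a", "b", "c"}}
f1 = ais_pass(db, [()], minsup=0.5)        # {('a',), ('b',), ('c',)}
f2 = ais_pass(db, sorted(f1), minsup=0.5)  # {('a', 'b'), ('a', 'c')}
```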
Due to the downward closure, there is a border separating frequent patterns from infrequent patterns, and hence algorithms that can quickly find this border are preferred. Finally, effective data mining is often a matter of efficiency in counting: the more itemsets that can be discarded without counting, the better the algorithm. Data mining algorithms are heavily tuned in this regard. Even for those itemsets that do have to be counted, we can ensure that we never encounter the same extension twice by ordering the items in some default (e.g., lexicographic) order.

The next algorithm we will study is the Apriori algorithm, due to Agrawal and Srikant [2]. Unlike AIS, it does not try to jump multiple levels in a given pass; it proceeds strictly one level per pass. But it is slightly more clever. Whereas the AIS algorithm extends itemsets by looking at which other items occur in a given transaction, the Apriori algorithm determines the candidates for a given pass before looking at even a single transaction (for that pass). Given the frequent itemsets at level k, F_k, candidates at level k + 1 are obtained by picking two sets from F_k that have the same first k − 1 elements and unioning them (keep in mind that we assume sets are lexicographically ordered). For example, if we have {n, q, r, t}, {n, q, s, x}, and {n, q, s, z} as frequent in the fourth pass, we will create {n, q, s, x, z} as a candidate in the fifth pass, but there will be no candidates containing {n, q, r}. This is sufficient to create candidates, in the sense that no valid candidate will be overlooked. However, it is not very strict, as some candidates may be generated unnecessarily. This is because we are using only two sets from F_k to create a candidate at level k + 1, whereas any itemset that is frequent at level k + 1 must have k + 1 subsets of size k that are all frequent, i.e., all present in F_k. This is again something that can be verified before actually counting (i.e., before undertaking another pass through the database). To use the previous example, given {n, q, s, x, z}, we can check whether {n, q, x, z}, {n, s, x, z}, and {q, s, x, z} are all frequent. Notice that just picking two sets from F_k that have any k − 1 elements in common is not as aggressive as picking two sets from F_k that have the same first k − 1 elements; this idea was used in [3].
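A minimal sketch of that join-and-prune candidate generation (the function name and the sorted-tuple representation are mine, not from [2]):

```python
from itertools import combinations

def apriori_gen(Fk):
    """Generate (k+1)-candidates from the frequent k-itemsets `Fk`
    (each a lexicographically sorted tuple): join two sets that agree on
    their first k-1 elements, then prune any candidate that has an
    infrequent k-subset."""
    Fk = set(Fk)
    k = len(next(iter(Fk)))
    candidates = set()
    for a in Fk:
        for b in Fk:
            # Join step: same first k-1 items, and a's last item smaller.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                c = a + (b[-1],)
                # Prune step: every k-subset of the candidate must be in Fk.
                if all(sub in Fk for sub in combinations(c, k)):
                    candidates.add(c)
    return candidates

# The example from the text: only {n,q,s,x} and {n,q,s,z} join, and the
# resulting candidate {n,q,s,x,z} is pruned because {n,q,x,z} is not in F4.
F4 = {("n", "q", "r", "t"), ("n", "q", "s", "x"), ("n", "q", "s", "z")}
print(apriori_gen(F4))  # set()
```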

Unlike a traditional algorithms course, we will assess the complexity of data mining algorithms in terms of the number of patterns that are generated; this is often referred to as output sensitivity in the algorithms literature. Ideally, for an algorithm that generates M patterns, we would like the complexity to be O(M). What is the complexity of the above algorithms? This underscores the importance of setting good parameters for data mining, such as the support threshold: too small a threshold will produce too many patterns, while too large a threshold might not produce any patterns at all.

3 Classes of patterns

One way to avoid finding too many patterns is to define a new pattern class that is smaller than the set of all frequent patterns. Consider the database:

  1: ACTW
  2: CDW
  3: ACTW
  4: ACDW
  5: ACDTW
  6: CDT

Assuming a support threshold of 0.5, the following itemsets are frequent:

  Level 1: A C D T W
  Level 2: AC AT AW CD CT CW DW TW
  Level 3: ACT ACW ATW CDW CTW
  Level 4: ACTW

These frequent itemsets can be arranged in a lattice, as shown in Fig. 1. Recall that a lattice is a poset in which every pair of elements has a greatest lower bound and a least upper bound.
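The level listing above can be reproduced with a short brute-force sketch (deliberately not an efficient miner; with only five items it is cheap to count every candidate directly):

```python
from itertools import combinations

# The example database from the text; a threshold of 0.5 means 3 of 6 transactions.
db = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}
minsup = 3

def frequent_levels(db, minsup):
    """Count every k-itemset over the items of `db` and keep those appearing
    in at least `minsup` transactions, level by level."""
    items = sorted(set("".join(db.values())))
    levels = []
    for k in range(1, len(items) + 1):
        level = {c for c in combinations(items, k)
                 if sum(set(c) <= set(t) for t in db.values()) >= minsup}
        if not level:
            break
        levels.append(level)
    return levels

for k, level in enumerate(frequent_levels(db, minsup), start=1):
    print(f"Level {k}:", sorted("".join(c) for c in level))
# Level 1: ['A', 'C', 'D', 'T', 'W']
# Level 2: ['AC', 'AT', 'AW', 'CD', 'CT', 'CW', 'DW', 'TW']
# Level 3: ['ACT', 'ACW', 'ATW', 'CDW', 'CTW']
# Level 4: ['ACTW']
```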

[Figure 1: Lattice of frequent itemsets with their supports. The support threshold requires that every itemset be present in at least 3 transactions. Note: all maximal itemsets are closed.]

Instead of mining all frequent itemsets, we can try to find only the maximal ones, i.e., those frequent itemsets that are not contained in any other frequent itemset. In Fig. 1, CDW and ACTW are maximal. Many times it is good enough just to find the maximal frequent patterns. While they do not tell us everything, given the maximal frequent itemsets we know that all supersets of them are infrequent and all subsets of them are frequent; hence the maximal frequent itemsets identify the border. The number of maximal frequent itemsets can be an order of magnitude smaller than the number of frequent itemsets.

However, the maximal frequent itemsets are not a lossless representation: given just the maximal frequent itemsets and their frequencies (supports), we cannot determine the frequencies of the other frequent sets. Another pattern class, the closed frequent itemsets, serves this purpose. A closed frequent itemset is a frequent itemset that has no superset with exactly the same support, i.e., no superset that is present in exactly the same set of transactions. From the above example, we see that the sets A and AC occur in the same transactions:

  A:  1, 3, 4, 5
  AC: 1, 3, 4, 5

Hence A is not closed; its support is recoverable from a closed superset. (In this database, ACW also occurs in exactly these transactions, so the closed itemset subsuming both A and AC is ACW, and it suffices to store ACW and its support.) Fig. 2 depicts the relationship between frequent itemsets, maximal frequent itemsets, and closed frequent itemsets. The closed frequent itemsets can be an order of magnitude smaller than the frequent itemsets, and the maximal frequent itemsets can be an order of magnitude smaller than the closed ones.

[Figure 2: Classes of itemsets: frequent ⊇ closed ⊇ maximal.]
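Continuing with the same example, the maximal and closed frequent itemsets can be extracted directly from the full set of frequent itemsets and their supports; here is a brute-force sketch under the same assumptions as before (the database is hard-coded and nothing is optimized):

```python
from itertools import combinations

db = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}
minsup = 3
items = sorted(set("".join(db.values())))

# Support of every frequent itemset (brute force over the five items).
support = {}
for k in range(1, len(items) + 1):
    for c in combinations(items, k):
        n = sum(set(c) <= set(t) for t in db.values())
        if n >= minsup:
            support[c] = n
frequent = set(support)

# Maximal: no frequent proper superset at all.
maximal = {s for s in frequent if not any(set(s) < set(t) for t in frequent)}

# Closed: no proper superset with exactly the same support.  (Checking only
# frequent supersets suffices: a superset with equal support is itself frequent.)
closed = {s for s in frequent
          if not any(set(s) < set(t) and support[t] == support[s] for t in frequent)}

print(sorted("".join(s) for s in maximal))
# ['ACTW', 'CDW']
print(sorted("".join(s) for s in closed))
# ['ACTW', 'ACW', 'C', 'CD', 'CDW', 'CT', 'CW']  (A and AC are not closed)
```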

4 Food for thought

Think about algorithms for mining rules; we have only investigated algorithms for mining frequent itemsets. Why should we search bottom-up? Can we search top-down? What is the overriding consideration when deciding whether to search bottom-up or top-down? Check out [4], which searches from both directions.

References

[1] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining Association Rules between Sets of Items in Large Databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207-216, 1993.

[2] Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), pages 487-499, 1994.

[3] Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. Efficient Algorithms for Discovering Association Rules. In Knowledge Discovery in Databases: Papers from the 1994 AAAI Workshop, July 1994.

[4] Dao-I Lin and Zvi M. Kedem. Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set. In Advances in Database Technology (EDBT '98), 6th International Conference on Extending Database Technology, pages 105-119, 1998.