Tutorial on Assignment 3 in Data Mining 2009 Frequent Itemset and Association Rule Mining. Gyozo Gidofalvi Uppsala Database Laboratory

Similar documents
DATA MINING II - 1DL460

CS570 Introduction to Data Mining

Data Mining Part 3. Associations Rules

Chapter 4: Mining Frequent Patterns, Associations and Correlations

Data Mining Techniques

Association Rule Mining. Introduction 46. Study core 46

Performance and Scalability: Apriori Implementation

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

Mining Frequent Patterns without Candidate Generation

DATA MINING II - 1DL460

BCB 713 Module Spring 2011

Frequent Pattern Mining

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar

Roadmap. PCY Algorithm

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Mining of Web Server Logs using Extended Apriori Algorithm

Frequent Pattern Mining

Chapter 4: Association analysis:

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

Market baskets Frequent itemsets FP growth. Data mining. Frequent itemset Association&decision rule mining. University of Szeged.

Mining Association Rules in Large Databases

OPTIMISING ASSOCIATION RULE ALGORITHMS USING ITEMSET ORDERING

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

Association Pattern Mining. Lijun Zhang

Data Mining: Concepts and Techniques. Chapter 5. SS Chung. April 5, 2013 Data Mining: Concepts and Techniques 1

Association Rules. A. Bellaachia Page: 1

Association Rule Mining. Entscheidungsunterstützungssysteme

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Data Mining: Concepts and Techniques. (3rd ed.) Chapter 6

Basic Concepts: Association Rules. What Is Frequent Pattern Analysis? COMP 465: Data Mining Mining Frequent Patterns, Associations and Correlations

Chapter 7: Frequent Itemsets and Association Rules

Improved Frequent Pattern Mining Algorithm with Indexing

Fundamental Data Mining Algorithms

Induction of Association Rules: Apriori Implementation

Mining High Average-Utility Itemsets

Association rules. Marco Saerens (UCL), with Christine Decaestecker (ULB)

Optimized Frequent Pattern Mining for Classified Data Sets

CSE 5243 INTRO. TO DATA MINING

Association Rule Mining

Product presentations can be more intelligently planned

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets

Association rule mining

A Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases *

Association Rule Discovery

Association Rule Mining

A NEW ASSOCIATION RULE MINING BASED ON FREQUENT ITEM SET

PLT- Positional Lexicographic Tree: A New Structure for Mining Frequent Itemsets

A Literature Review of Modern Association Rule Mining Techniques

SQL Based Frequent Pattern Mining with FP-growth

Using Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment

Appropriate Item Partition for Improving the Mining Performance

Association Rule Mining from XML Data

Study on Mining Weighted Infrequent Itemsets Using FP Growth

Salah Alghyaline, Jun-Wei Hsieh, and Jim Z. C. Lai

Association mining rules

Association Rule Mining: FP-Growth

The Relation of Closed Itemset Mining, Complete Pruning Strategies and Item Ordering in Apriori-based FIM algorithms (Extended version)

Association Rule Mining (ARM) Komate AMPHAWAN

Association Rule Discovery

An Algorithm for Mining Large Sequences in Databases

Data Mining for Knowledge Management. Association Rules

DATA MINING II - 1DL460

Association Rules Apriori Algorithm

APPLYING BIT-VECTOR PROJECTION APPROACH FOR EFFICIENT MINING OF N-MOST INTERESTING FREQUENT ITEMSETS

Association Rules. Berlin Chen References:

International Journal of Scientific Research and Reviews

Effectiveness of Freq Pat Mining

Association Rule Learning

CS 6093 Lecture 7 Spring 2011 Basic Data Mining. Cong Yu 03/21/2011

DATA MINING II - 1DL460

CHAPTER 8. ITEMSET MINING 226

CSCI6405 Project - Association rules mining

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

Chapter 6: Association Rules

AN ENHANCED SEMI-APRIORI ALGORITHM FOR MINING ASSOCIATION RULES

SETM*-MaxK: An Efficient SET-Based Approach to Find the Largest Itemset

A Reconfigurable Platform for Frequent Pattern Mining

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

Maintenance of fast updated frequent pattern trees for record deletion

Memory issues in frequent itemset mining

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

Mining Closed Itemsets: A Review

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

PFPM: Discovering Periodic Frequent Patterns with Novel Periodicity Measures

Discovery of Association Rules in Temporal Databases 1

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

Improved Algorithm for Frequent Item sets Mining Based on Apriori and FP-Tree

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Hierarchical Online Mining for Associative Rules

SS-FIM: Single Scan for Frequent Itemsets Mining in Transactional Databases

Chapter 13, Sequence Data Mining

Frequent Pattern Mining. Slides by: Shree Jaswal

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA

Machine Learning: Symbolische Ansätze

Monotone Constraints in Frequent Tree Mining

Temporal Weighted Association Rule Mining for Classification

A Graph-Based Approach for Mining Closed Large Itemsets

Transforming Quantitative Transactional Databases into Binary Tables for Association Rule Mining Using the Apriori Algorithm

Transcription:

Tutorial on Assignment 3 in Data Mining 2009 Frequent Itemset and Association Rule Mining Gyozo Gidofalvi Uppsala Database Laboratory

Announcements Updated material for assignment 3 is on the lab course home page. Sign-up sheets for the labs and examinations for assignment 3 are posted outside P1321. Please make sure you sign up for a slot. There is a limited number of slots, so sign up early! Tight assignment deadline? Attend more labs! 11/20/2009 Gyozo Gidofalvi 2

Outline Association rules and Apriori-based frequent itemset mining Pattern growth by database projections Frequent itemset mining - elements of a DB-projection based implementation using Amos II The assignment 11/20/2009 Gyozo Gidofalvi 3

Association Rules and Apriori-Based Frequent Itemset Mining

Association Rules The Basic Idea By examining transactions, or shopping carts, we can find which items are commonly purchased together. This knowledge can be used in advertising or in goods placement in stores. Association rules have the general form: I1 → I2, where I1 and I2 are disjoint sets of items that, for example, can be purchased in a store. The rule should be read as: Given that someone has bought the items in the set I1, they are likely to also buy the items in the set I2. 11/20/2009 Gyozo Gidofalvi 5

Frequent Itemsets Transaction: a set of items purchased together by one customer. Transaction database D: a set of transactions. Itemset: a set of items. Support count of an itemset i: the number of transactions in D that contain i, i.e., |{t in D : i ⊆ t}|. Support of an itemset i: the support count of i relative to the total number of transactions in D, i.e., supp(I) = (number of transactions containing I) / (total number of transactions). An itemset i is a frequent itemset if supp(i) ≥ min_supp. 11/20/2009 Gyozo Gidofalvi 6
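
To make the definitions concrete, here is a minimal Python sketch (illustrative only; the assignment itself is implemented in Amos II/AmosQL, and the function names support_count and support are assumptions, not part of the tutorial material):

    def support_count(itemset, transactions):
        # Number of transactions that contain every item of the itemset.
        items = set(itemset)
        return sum(1 for t in transactions if items <= set(t))

    def support(itemset, transactions):
        # Relative support: support count divided by the total number of transactions.
        return support_count(itemset, transactions) / len(transactions)

    # An itemset i is frequent if support(i, transactions) >= min_supp.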

Finding the Frequent Itemsets The Brute Force Approach Just take all items and form all possible combinations of them and count away. Unfortunately, this will take some time... Given n items, how many possible itemsets are there? A better approach: The Apriori Algorithm Basic idea: An itemset can only be a frequent itemset if all its subsets are frequent itemsets 11/20/2009 Gyozo Gidofalvi 7

Example Assume that we have a transaction database with 100 transactions and the items a, b and c. Assume that the minimum support is set to 0.60, which gives us a minimum support count of 60. The support counts are: {a}: 65, {b}: 50, {c}: 80. Since the support count of {b} is below the minimum support count, no itemset containing b can be a frequent itemset. This means that when we look for itemsets of size 2 we do not have to consider any sets containing b, which in turn leaves us with the only possible candidate {a,c}. 11/20/2009 Gyozo Gidofalvi 8

The Apriori Algorithm A general outline of the algorithm is the following: 1. Count all itemsets of size K. 2. Prune the unsupported itemsets of size K. 3. Generate new candidates of size K+1. 4. Prune the new candidates based on supported subsets. 5. Repeat from 1 until no more candidates or frequent itemsets are found. 6. When all supported itemsets have been found, generate the rules. 11/20/2009 Gyozo Gidofalvi 9

Candidate Generation Assume that we have the frequent itemsets of size 2 {a,b}, {b,c}, {a,c} and {c,d} From these sets we can form the candidates of size 3 {a,b,c}, {a,c,d} and {b,c,d} But... Why not {a,b,d}? Answer: 11/20/2009 Gyozo Gidofalvi 10

Candidate Pruning Again, assume that we have the frequent 2-itemsets {a,b}, {b,c}, {a,c} and {c,d} And the 3-candidates {a,b,c}, {a,c,d} and {b,c,d} Can all of these 3-candidates be frequent itemsets? Answer: 11/20/2009 Gyozo Gidofalvi 11

Find Out if the Itemset is Frequent When we have found our final candidates, we simply count all occurrences of each candidate itemset in the transaction database and then remove those whose support count is too low. If at least one of the candidates has a high enough support count, we loop again until no more candidates or frequent itemsets can be found. 11/20/2009 Gyozo Gidofalvi 12
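
As a compact illustration of steps 1-5 of the outline above (a conceptual sketch only, not the Amos II solution the assignment requires; representing itemsets as frozensets and using an absolute threshold min_supp_count are assumptions):

    from itertools import combinations

    def apriori(transactions, min_supp_count):
        # Illustrative Apriori sketch: returns {frequent itemset: support count}.
        transactions = [set(t) for t in transactions]
        candidates = list({frozenset([i]) for t in transactions for i in t})
        frequent = {}
        k = 1
        while candidates:
            # Steps 1-2: count the size-k candidates and prune the unsupported ones.
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            level = {c: n for c, n in counts.items() if n >= min_supp_count}
            frequent.update(level)
            # Steps 3-4: generate size-(k+1) candidates by joining frequent k-itemsets
            # and prune those with an infrequent k-subset (the Apriori property).
            prev = set(level)
            next_candidates = set()
            for a, b in combinations(prev, 2):
                union = a | b
                if len(union) == k + 1 and all(frozenset(s) in prev for s in combinations(union, k)):
                    next_candidates.add(union)
            candidates = list(next_candidates)
            k += 1
        return frequent

Step 6, rule generation, corresponds to the rule metrics discussed on the following slides.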

Rule Metrics When we have our frequent itemsets we want to form association rules from them. As we said earlier the association rule has the form I1 → I2. The support, Supp_tot, of the rule is the support of the itemset I_tot, where I_tot = I1 ∪ I2. The confidence, C, of the rule is C = Supp_tot / Supp(I1). 11/20/2009 Gyozo Gidofalvi 13
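
Continuing the illustrative Python sketch (again, not the AmosQL solution; support, rule_support and rule_confidence are assumed helper names):

    def support(itemset, transactions):
        # Relative support of an itemset: fraction of transactions that contain it.
        items = set(itemset)
        return sum(1 for t in transactions if items <= set(t)) / len(transactions)

    def rule_support(lhs, rhs, transactions):
        # Support of the rule I1 -> I2: the support of the itemset I_tot = I1 U I2.
        return support(set(lhs) | set(rhs), transactions)

    def rule_confidence(lhs, rhs, transactions):
        # Confidence of the rule I1 -> I2: Supp_tot / Supp(I1).
        return rule_support(lhs, rhs, transactions) / support(lhs, transactions)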

Rule Metrics (cont.) How can we interpret the support and the confidence of a rule? The support is how common the rule is in the transaction database. The confidence is how often the left hand side of the rule is associated with the right hand side. In what situations can we have a rule with: High support and low confidence? Low support and high confidence? High support and high confidence? 11/20/2009 Gyozo Gidofalvi 14

Pattern Growth by Database Projections

The Frequent Pattern Growth Approach to FIM The bottleneck of Apriori is candidate generation and testing: mining long itemsets is costly. Idea: Instead of working bottom-up, extend a frequent prefix in a top-down fashion by adding a single locally frequent item to it. Question: What does locally mean? Answer: To find the frequent itemsets that contain an item i, the only transactions that need to be considered are the transactions that contain i. Definition: A frequent item i-related projected transaction table, denoted PT_i, contains all frequent items (larger than i) in the transactions that contain i. Let's look at an example! 11/20/2009 Gyozo Gidofalvi 16
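
As a rough illustration of this definition (a sketch only, not the AmosQL the assignment asks for; the representation of transactions as Python sets of integer item ids and the name project are assumptions), PT_i could be computed as follows:

    def project(transactions, i, min_supp_count):
        # PT_i sketch: from the transactions that contain item i, keep only the
        # items that are larger than i and frequent in the current table.
        counts = {}
        for t in transactions:
            for item in t:
                counts[item] = counts.get(item, 0) + 1
        return [{item for item in t if item > i and counts[item] >= min_supp_count}
                for t in transactions if i in t]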

Example (figure): a transaction table shown in relational format, its frequent items, the filtered transactions, the items co-occurring with item 1, the projected table on item 1, the items co-occurring with item 3 (and 1), and the projected table on item 3 (and 1). 11/20/2009 Gyozo Gidofalvi 17

Frequent Itemset Tree (Depth-First) Discover all frequent itemsets by recursively filtering and projecting transactions in a depth-first-search manner, as long as there are frequent items in the filtered/projected table. 11/20/2009 Gyozo Gidofalvi 18
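
A hedged Python sketch of this recursion (conceptual only; the function name mine, the absolute threshold min_supp_count, and transactions as sets of integer item ids are assumptions, and the assignment itself requires an AmosQL implementation):

    from collections import Counter

    def mine(transactions, min_supp_count, prefix=()):
        # Depth-first, projection-based frequent itemset mining (illustrative sketch).
        # Yields (itemset, support count) for every frequent itemset extending prefix.
        counts = Counter(item for t in transactions for item in t)
        for item in sorted(counts):
            if counts[item] < min_supp_count:
                continue
            itemset = prefix + (item,)
            yield itemset, counts[item]
            # Filter and project: keep the frequent items larger than `item`
            # from the transactions that contain `item`, then recurse.
            projected = [{j for j in t if j > item and counts[j] >= min_supp_count}
                         for t in transactions if item in t]
            yield from mine(projected, min_supp_count, itemset)

    # Example: list(mine([{1,2,3}, {1,3}, {2,3}], 2)) yields exactly the itemsets
    # with support count at least 2: (1,), (1,3), (2,), (2,3), (3,).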

Frequent Itemset Mining Elements of a DB-Projection Based Implementation Using Amos II / AmosQL

Transactions
Stored function (relation) to store transactions:
  create function transact(integer tid)->bag of Integer as stored;
Population of the transaction table:
  add transact(1)=in({1,2,3});
  remove transact(1)=2;
Selection of a transaction:
  transact(1);
Transactions as a bag of tuples:
  create function transact_bt()->bag of <Integer tid, Integer item>
    as select tid, item where transact(tid)=item;
11/20/2009 Gyozo Gidofalvi 20

Support Counting
Function to calculate the support counts of items in bt:
  create function itemsupps(bag of <Integer, Integer> bt)
    -> Bag of <Integer/*item*/, Integer/*supp*/>
    as groupby((select item, tid
                from Integer item, Integer tid
                where <tid,item> in bt), #'count');
11/20/2009 Gyozo Gidofalvi 21

Item-Related Projection
  create function irpft(bag of <Integer, Integer> bt, Integer minsupp, Integer pitem)
    -> Bag of <Integer, Integer>
    /* Calculates the pitem-related frequent item projection of the bag of
       transactions bt according to minsupp. */
    as select tid, item
       from Integer tid, Integer item, Integer s
       where <tid, item> in bt
         and <item, s> in freq_items(bt, minsupp)
         and tid in item_suppby(bt, pitem)
         and item > pitem;
11/20/2009 Gyozo Gidofalvi 22

Sample Association Rule DB
  create function FIS(Vector)->Integer as stored;
  add FIS({'beer'})=5;
  add FIS({'other'})=2;
  add FIS({'diapers'})=3;
  add FIS({'beer','diapers'})=2;
  add FIS({'beer','other'})=1;
  add FIS({'beer','diapers','other'})=1;
  add FIS({'diapers','other'})=1;

  create function showfis()->bag of <Vector, Integer>
    as select fis, supp
       from Vector fis, Integer supp
       where FIS(fis)=supp;

  create function vunion(vector v1, Vector v2)->vector
    as sort(select distinct i from object i where i in v1 or i in v2);

  create function ARconf(vector prec, vector cons) -> real
    as FIS(vunion(prec,cons))/FIS(prec);

  ARconf({'beer'},{'diapers'});
  ARconf({'diapers'},{'beer'});
11/20/2009 Gyozo Gidofalvi 23

Subset Generation
  create function bproject(bag b, Object p)->bag
    /* Returns a bag of the elements of b that are greater than p. */
    as (select o from object o where o in b and o > p);

  create function bsubset_tcf(bag b, Vector rv)->bag of <Bag, Vector>
    /* Generates all the children of a node in the subset-tree. */
    as select pb, concat(rv, {p})
       from Object p, Bag pb
       where p in b and pb = materialize(bproject(b,p));

  create function bsubset_traverse(bag b)->bag of <Vector, Vector>
    /* Generates all the subsets of the elements of bag b by traversing
       the subset-tree defined by bsubset_tcf(). */
    as select vectorof(pb), rv
       from Bag pb, Vector rv
       where <pb, rv> in traverse(#'bsubset_tcf', b, {});
11/20/2009 Gyozo Gidofalvi 24
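
For comparison, the same subset-tree idea can be sketched in Python (an illustrative analogue of bproject()/bsubset_tcf()/bsubset_traverse(), not a translation of them; the name subsets is an assumption): each node extends the current subset with one element p and then recurses only on the elements greater than p.

    def subsets(elements, current=()):
        # Enumerate all non-empty subsets of elements by traversing the subset tree:
        # extend current with one element p, then recurse on the elements greater than p.
        elements = sorted(elements)
        for idx, p in enumerate(elements):
            child = current + (p,)
            yield child
            yield from subsets(elements[idx + 1:], child)

    # Example: list(subsets(['beer', 'diapers', 'other'])) produces all 7 non-empty subsets.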

The Assignment

The Assignment Implement database-projection-based frequent itemset and association rule mining according to the provided skeleton (a3arm.osql) in Amos II. The program must run in a few minutes, since we are going to run it during the examination. Programs that are too slow will be rejected. The algorithm is easy to get wrong, and a wrong implementation can show super-exponential behavior that makes the execution time blow up! There will be example runs on the lab course's home page. Your solution must be able to reproduce these results before the examination. 11/20/2009 Gyozo Gidofalvi 26

Exercise Find the rules with minimum support 0.5 and minimum confidence 0.75 in the following database:
TID  Transaction
1    {a, b, c, d}
2    {a, c, d}
3    {a, b, c}
4    {b, c, d}
5    {a, b, c}
6    {a, b, c}
7    {c, d, e}
8    {a, c}
11/20/2009 Gyozo Gidofalvi 27

Reference and Further Reading
1. Efficient Frequent Pattern Mining in Relational Databases. X. Shang, K.-U. Sattler, and I. Geist. 5. Workshop des GI-Arbeitskreis Knowledge Discovery (AK KD) im Rahmen der LWA 2004.
2. Mining Association Rules between Sets of Items in Large Databases. R. Agrawal, T. Imielinski, and A. Swami. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 207-216, May 1993.
3. Fast Algorithms for Mining Association Rules. R. Agrawal and R. Srikant. In Proceedings of the 20th International Conference on Very Large Databases, pp. 487-499, September 1994.
4. An Effective Hash Based Algorithm for Mining Association Rules. J. S. Park, M.-S. Chen, and P. S. Yu. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 175-186, May 1995.
5. Mining Frequent Patterns without Candidate Generation. J. Han, J. Pei, and Y. Yin. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'00), pp. 1-12, May 2000.
11/20/2009 Gyozo Gidofalvi 28