Market baskets, frequent itemsets, FP-growth. Data mining: frequent itemset and association rule mining. University of Szeged.


Frequent itemset, association and decision rule mining. University of Szeged.

What could frequent itemsets be used for? Features/observations frequently co-occurring in some database can gain us useful insights. How can a marketing person make use of them? What about a data scientist?
Transaction ID, Items:
1: {milk, bread, salami}
2: {beer, diapers}
3: {beer, wurst}
4: {beer, baby food, diapers}
5: {diapers, coke, bread}

Possible forms. Horizontal:
Transaction ID, Items:
1: {milk, bread, salami}
2: {beer, diapers}
3: {beer, wurst}
4: {beer, baby food, diapers}
5: {diapers, coke, bread}

Possible forms. Vertical/inverted:
Item, Baskets:
milk: {1}
bread: {1, 5}
salami: {1}
beer: {2, 3, 4}
diapers: {2, 4, 5}
wurst: {3}
baby food: {4}
coke: {5}

Possible forms. Relational:
Basket, Item:
1 milk; 1 bread; 1 salami; 2 beer; 2 diapers; 3 beer; 3 wurst; 4 beer; 4 baby food; 4 diapers; 5 diapers; 5 coke; 5 bread

Collecting association rules. Goal: find itemsets of the transactional database with high support and confidence.
Support of a set X: s(X) = |{t_i : t_i ∈ T, X ⊆ t_i}|
Normalized support: s(X) divided by the number of transactions |T|
Confidence of a rule A → B: c(A → B) = s(A ∪ B) / s(A) ≈ P(B | A)
The meaning of a rule A → B: given that the items of set A are put in the basket, chances are high that the items of B are in the same basket as well.
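These definitions can be sketched in a few lines of Python on the toy database above (the function names are mine, not from the slides):

```python
# Toy transaction database from the slides
T = [
    {"milk", "bread", "salami"},
    {"beer", "diapers"},
    {"beer", "wurst"},
    {"beer", "baby food", "diapers"},
    {"diapers", "coke", "bread"},
]

def support(X, T):
    """s(X): number of transactions containing every item of X."""
    X = set(X)
    return sum(1 for t in T if X <= t)

def confidence(A, B, T):
    """c(A -> B) = s(A u B) / s(A), an estimate of P(B | A)."""
    return support(set(A) | set(B), T) / support(A, T)

print(support({"beer"}, T))                  # 3
print(confidence({"beer"}, {"diapers"}, T))  # 2/3
```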

Interestingness of association rules. Are all rules with high support and confidence equally interesting? The confidence of a rule can be high simply because the items on its right-hand side are frequently purchased, independently of the items on the left-hand side. Any example?
Interest score of a rule A → B: I(A → B) = c(A → B) − P(B)
What is the interpretation of this score? (Interesting rules will have high absolute values. Why?)
Choose the threshold so that the number of rules kept is manageable.
There are plenty of other scores for measuring interestingness: χ², κ, ...
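The interest score is a small delta on top of the previous definitions; a self-contained sketch (helper names are mine):

```python
T = [
    {"milk", "bread", "salami"},
    {"beer", "diapers"},
    {"beer", "wurst"},
    {"beer", "baby food", "diapers"},
    {"diapers", "coke", "bread"},
]

def support(X):
    return sum(1 for t in T if set(X) <= t)

def interest(A, B):
    """I(A -> B) = c(A -> B) - P(B): how much A shifts the chance of B."""
    conf = support(set(A) | set(B)) / support(A)
    p_b = support(B) / len(T)
    return conf - p_b

# beer -> diapers: confidence 2/3 vs. P(diapers) = 3/5, a small positive shift
print(round(interest({"beer"}, {"diapers"}), 3))  # 0.067
```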

What can frequent itemsets be utilized for? Plagiarism detection: somewhat counter-intuitively, let the sentences be the baskets and the documents the items.
Conclusion: sometimes we need to be flexible about the concept of items comprising baskets. Our goal is to examine the relation of items to each other, not that of baskets (we covered that case before). How could frequent itemsets be interpreted if items and baskets were determined the other way around?

What else could frequent itemsets be used for? Data mining: we can use them to build a simple classifier; missing feature values can be estimated knowing how features typically co-occur with each other; and we can merge features if they show highly similar behavior in the database.

The general schema of producing association rules: collect the set F of frequent itemsets having a (normalized) support surpassing some threshold t, then partition each frequent itemset into non-empty, disjoint subsets (antecedent and consequent) and calculate the confidence of the rules so determined.
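The second step, splitting one frequent itemset into every antecedent/consequent pair, can be sketched as follows (a minimal illustration on the toy database; `rules_from` is a name of my choosing):

```python
from itertools import combinations

T = [
    {"milk", "bread", "salami"},
    {"beer", "diapers"},
    {"beer", "wurst"},
    {"beer", "baby food", "diapers"},
    {"diapers", "coke", "bread"},
]

def support(X):
    return sum(1 for t in T if set(X) <= t)

def rules_from(itemset, min_conf):
    """Split `itemset` into every non-empty antecedent A and
    consequent B = itemset \\ A, keeping rules above min_conf."""
    itemset = frozenset(itemset)
    s_full = support(itemset)
    out = []
    for r in range(1, len(itemset)):
        for A in combinations(sorted(itemset), r):
            conf = s_full / support(A)
            if conf >= min_conf:
                out.append((set(A), set(itemset) - set(A), conf))
    return out

for A, B, c in rules_from({"beer", "diapers"}, min_conf=0.6):
    print(A, "->", B, round(c, 2))
```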

Why does the naive approach fail? What would a naive approach look like? With d items there are 3^d − 2^(d+1) + 1 possible rules (e.g. d = 9 gives 18660 possibilities):
R = Σ_{k=1..d−1} C(d,k) · Σ_{j=1..d−k} C(d−k,j) = Σ_{k=1..d−1} C(d,k) · (2^(d−k) − 1) = 3^d − 2^(d+1) + 1
Proof hint: (1 + x)^d = Σ_j C(d,j) x^j.
[Figure: the number of possible association rules, plotted on a log10 scale, grows exponentially in d.]
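The closed form is easy to double-check by summing the choices directly (pick a non-empty antecedent of k items, then a non-empty consequent from the remaining d − k):

```python
from math import comb

def rule_count(d):
    """Count rules A -> B with A, B non-empty and disjoint, over d items."""
    return sum(comb(d, k) * (2 ** (d - k) - 1) for k in range(1, d))

# Matches the closed form 3^d - 2^(d+1) + 1 for every small d
for d in range(2, 12):
    assert rule_count(d) == 3 ** d - 2 ** (d + 1) + 1

print(rule_count(9))  # 18660
```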

If itemset I is frequent, then all itemsets J ⊆ I are frequent. What can we say if itemset I is not frequent? (No superset of I can be frequent.)
Anti-monotone property: a function f is said to be anti-monotone if for all X, Y ∈ P(U): X ⊆ Y implies f(X) ≥ f(Y). Support is such a function.
[Figure: the itemset lattice over {a, b, c}: {a}, {b}, {c}, {a,b}, {a,c}, {b,c}, {a,b,c}.]

The Apriori principle in action. Assume some frequency threshold t.
[Table: the frequency of each individual item: beer, bread, coke, diapers, milk, wurst.]

The Apriori principle in action, continued. With the same frequency threshold t, only pairs of frequent items need to be considered as candidates.
[Table: the frequencies of the candidate pairs {beer, bread}, {beer, diapers}, {beer, milk}, {bread, diapers}, {bread, milk}, {diapers, milk}.]

Algorithm 1 Pseudocode for the calculation of frequent itemsets
Input: set of possible items U, transaction database T, frequency threshold t
Output: frequent itemsets
1: C_1 := U
2: Calculate the support of C_1
3: F_1 := {x | x ∈ C_1, s(x) ≥ t}
4: for (k = 2; k < |U| && F_{k−1} ≠ ∅; k++) do
5:   Determine C_k based on F_{k−1}
6:   Calculate the support of C_k
7:   F_k := {X | X ∈ C_k, s(X) ≥ t}
8: end for
9: return ∪_{i=1..k} F_i
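A compact Python sketch of Algorithm 1 on the toy database (the candidate step here is a simple self-join on F_{k−1}; all names are my own):

```python
def apriori(T, t):
    """Minimal Apriori sketch: T is a list of transactions (sets of items),
    t an absolute support threshold. Returns {frozenset: support}."""
    def support(X):
        return sum(1 for tr in T if X <= tr)

    # C1 = all individual items; F1 = the frequent ones
    items = {i for tr in T for i in tr}
    frequent = {frozenset([i]): support(frozenset([i]))
                for i in items if support(frozenset([i])) >= t}
    result = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation by combining F(k-1) with itself,
        # then counting and filtering against the threshold
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        frequent = {c: support(c) for c in candidates if support(c) >= t}
        result.update(frequent)
        k += 1
    return result

T = [
    {"milk", "bread", "salami"},
    {"beer", "diapers"},
    {"beer", "wurst"},
    {"beer", "baby food", "diapers"},
    {"diapers", "coke", "bread"},
]
print(apriori(T, 2))
```

With threshold 2 this finds the singletons {bread}, {beer}, {diapers} and the pair {beer, diapers}.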

Possible ways of determining C_k:
Based on the elements of F_1 alone. Avoid this!
Combining the elements of F_{k−1} and F_1. Somewhat better.
Combining the elements of F_{k−1} with F_{k−1} itself. This generates fewer candidates but does not miss any candidate which has a chance to be a frequent itemset. Combine two itemsets A = {a_1, a_2, ..., a_{k−1}} and B = {b_1, b_2, ..., b_{k−1}} from F_{k−1} if a_i = b_i for 1 ≤ i ≤ k−2 and a_{k−1} ≠ b_{k−1}.
We need an ordering over the elements (e.g. by converting them to integers) and we store the itemsets in that sorted form.
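The F_{k−1} × F_{k−1} join can be sketched with itemsets kept as sorted tuples (a minimal illustration; `join_step` is my name for it):

```python
def join_step(F_prev):
    """Merge two (k-1)-itemsets, stored as sorted tuples, when they share
    their first k-2 items and differ only in the last one."""
    candidates = set()
    F_prev = sorted(F_prev)
    for i in range(len(F_prev)):
        for j in range(i + 1, len(F_prev)):
            a, b = F_prev[i], F_prev[j]
            if a[:-1] == b[:-1] and a[-1] != b[-1]:
                candidates.add(a[:-1] + tuple(sorted((a[-1], b[-1]))))
    return candidates

F2 = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
print(sorted(join_step(F2)))
# [('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'c', 'd')]
```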

Possible ways of calculating the frequencies for C_k. The pair counts built from F_{k−1} might not fit into memory, so the representation matters. Depending on the ratio of non-zero counts, we can either store them explicitly in a matrix or in the form of (i, j, c) triplets, meaning that items i and j have a co-occurrence frequency of c.
In a one-dimensional triangular matrix over n frequent items, the frequency of the item pair (i, j), i < j, is stored at index (i − 1)(n − i/2) + j − i.
If the proportion of zero frequencies surpasses 2/3, the triplet representation pays off (each non-zero count takes three integers instead of one).
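The triangular indexing can be checked by enumerating all pairs: the formula assigns them consecutive positions with no gaps (a small sketch; the function name is mine):

```python
def tri_index(i, j, n):
    """1-based position of pair (i, j), 1 <= i < j <= n, in the
    one-dimensional triangular matrix of all n*(n-1)/2 item pairs."""
    assert 1 <= i < j <= n
    # Row i starts after rows 1..i-1; (i-1)(2n-i) is always even
    return (i - 1) * (2 * n - i) // 2 + (j - i)

n = 5
# Enumerating all pairs in order hits positions 1..n*(n-1)/2 exactly once
positions = [tri_index(i, j, n) for i in range(1, n) for j in range(i + 1, n + 1)]
print(positions)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```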

Calculating C_k, an example. Suppose there are 10^7 baskets with 10 items per basket, and 10^5 different items in total. Using the triangular matrix representation, roughly (10^5)^2 / 2 = 5 × 10^9 integers are required. In the worst case there are at most 10^7 × C(10, 2) = 4.5 × 10^8 different pairs of items in the transaction dataset, so at most 3 × 4.5 × 10^8 = 1.35 × 10^9 integers suffice to store the non-zero frequencies of item pairs as triplets.
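The arithmetic of the example, spelled out:

```python
from math import comb

baskets, items_per_basket, n_items = 10**7, 10, 10**5

triangular = n_items * (n_items - 1) // 2             # one count per possible pair
distinct_pairs = baskets * comb(items_per_basket, 2)  # worst case: all pairs distinct
triplets = 3 * distinct_pairs                         # (i, j, c) per non-zero pair

print(triangular)  # 4999950000  (~5e9 integers)
print(triplets)    # 1350000000  (~1.35e9 integers)
```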

Compressing frequent itemsets.
Maximal frequent itemsets: {I | I is frequent and there is no frequent J ⊃ I}
Closed itemsets: {I | there is no J ⊃ I with s(J) = s(I)}
Closed frequent itemsets: closed itemsets having a support above some frequency threshold.
Can a maximal frequent itemset be non-closed? Can a closed itemset be non-maximal?
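These definitions translate directly into set comprehensions; a brute-force sketch over the toy database (fine for tiny item universes only):

```python
from itertools import combinations

T = [
    {"milk", "bread", "salami"},
    {"beer", "diapers"},
    {"beer", "wurst"},
    {"beer", "baby food", "diapers"},
    {"diapers", "coke", "bread"},
]

def support(X):
    return sum(1 for tr in T if X <= tr)

items = {i for tr in T for i in tr}
all_itemsets = [frozenset(c) for r in range(1, len(items) + 1)
                for c in combinations(sorted(items), r)]
t = 2
frequent = [I for I in all_itemsets if support(I) >= t]

# Maximal frequent: no frequent proper superset
maximal = [I for I in frequent if not any(I < J for J in frequent)]
# Closed: no proper superset with the same support
closed = [I for I in all_itemsets if support(I) >= 1 and
          not any(I < J and support(J) == support(I) for J in all_itemsets)]

print(sorted(map(sorted, maximal)))  # [['beer', 'diapers'], ['bread']]
```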

Compressing frequent itemsets, an example (with some frequency threshold t).
[Table: the itemsets A, B, C, AB, AC, BC, ABC with their frequencies, each marked (+/−) as maximal, closed, and closed frequent.]

The relation of the different classes to each other: maximal frequent itemsets form a subset of the closed frequent itemsets, which in turn form a subset of the frequent itemsets.
[Figure: Venn diagram of frequent, closed frequent, and maximal frequent itemsets.]

PCY (Park-Chen-Yu) algorithm, an extension to Apriori. Besides counting the frequencies of standalone items, keep track of the frequencies of the buckets into which pairs of items get hashed according to some hash function. What can be said based on the aggregated bucket counts? What cannot be said for sure? Only pairs that consist of frequent items and which got hashed to a frequent bucket have a chance to be frequent in the end.
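A sketch of the two PCY passes (the hash function, bucket count, and all names here are arbitrary choices of mine, not prescribed by the algorithm):

```python
from itertools import combinations

def pcy_pass1(T, n_buckets, t):
    """Pass 1: count single items, and hash every pair occurrence
    into one of n_buckets counters."""
    item_counts = {}
    bucket_counts = [0] * n_buckets
    for basket in T:
        for item in basket:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(basket), 2):
            bucket_counts[hash(pair) % n_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= t}
    frequent_buckets = {b for b, c in enumerate(bucket_counts) if c >= t}
    return frequent_items, frequent_buckets

def pcy_candidates(T, n_buckets, t):
    """Pass 2: a pair survives only if both items are frequent AND
    it hashes to a frequent bucket."""
    freq_items, freq_buckets = pcy_pass1(T, n_buckets, t)
    cands = set()
    for basket in T:
        for pair in combinations(sorted(basket), 2):
            if (set(pair) <= freq_items
                    and hash(pair) % n_buckets in freq_buckets):
                cands.add(pair)
    return cands

T = [
    {"milk", "bread", "salami"},
    {"beer", "diapers"},
    {"beer", "wurst"},
    {"beer", "baby food", "diapers"},
    {"diapers", "coke", "bread"},
]
print(pcy_candidates(T, n_buckets=13, t=2))
```

Note that a frequent bucket does not guarantee a frequent pair (counts of colliding pairs add up), but an infrequent bucket does rule its pairs out.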

Practical considerations. We might as well sample from the transaction database, sacrificing correctness and completeness. Different strategies can be applied for traversing the itemset lattice when extracting frequent itemsets.
[Figure: the frequent itemset border in the lattice between null and {a1, a2, ..., an}, explored (a) general-to-specific / top-down, (b) specific-to-general / bottom-up, (c) bidirectional / hybrid.]

Generalizing Apriori. It is not necessary to do the expansion according to the different levels of the item lattice; we can do the expansion according to any equivalence classes. Originally we defined the equivalence classes based on the size of the itemsets; alternatively, we can define them based on the size of the overlap of the prefixes/suffixes.

Generalizing Apriori: the prefix and suffix view.
[Figure: (a) the prefix tree and (b) the suffix tree over the itemsets of {A, B, C, D}, from null down to ABCD.]

FP trees. Use a data structure which makes the extraction of frequent itemsets easy. FP trees are an alternative, condensed representation: the tree can be constructed by processing the rows of the transaction database in one pass, and transactions are represented as paths in the tree. Certain item subsets show up multiple times, so overlapping paths provide compressibility. Hopefully, as a result, the whole transaction dataset can be stored in main memory.

Constructing FP trees. Useful heuristic: order items according to their decreasing support. Processing one line of the transaction database:
If the line shares a prefix with an existing path, continue that path starting from the root: increment the frequencies stored in the visited nodes and set the frequency of any newly created node to 1.
Otherwise start a new path from the root node, assigning 1 as the frequency value of the new nodes.
Include pointers between the nodes of the same item.
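The construction steps above can be sketched as follows (a minimal FP-tree insert; the `header` dict stands in for the same-item pointer chains, and all names are my own):

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1
        self.children = {}

def build_fp_tree(transactions):
    """Items in each basket are sorted by decreasing overall support,
    then inserted as a path from the root."""
    support = {}
    for t in transactions:
        for i in t:
            support[i] = support.get(i, 0) + 1
    root = Node(None, None)
    header = {}  # item -> list of its nodes (the node-link pointers)
    for t in transactions:
        ordered = sorted(t, key=lambda i: (-support[i], i))
        node = root
        for item in ordered:
            if item in node.children:
                node.children[item].count += 1   # shared prefix: bump the count
            else:
                node.children[item] = Node(item, node)  # new node, count 1
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
    return root, header

transactions = [{"A", "B"}, {"B", "C", "D"}, {"A", "B", "C"}, {"A", "B", "D"}]
root, header = build_fp_tree(transactions)
# Summing a node chain recovers the item's total support
print({i: sum(n.count for n in header[i]) for i in header})
# {'B': 4, 'A': 3, 'C': 2, 'D': 2}
```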

FP tree example (based on the slides of Tan, Steinbach, Kumar: Introduction to Data Mining).

FP tree after processing the last basket.
TID / Items: 1 {A,B}, 2 {B,C,D}, 3 {A,C,D,E}, 4 {A,D,E}, 5 {A,B,C}, 6 {A,B,C,D}, 7 {B,C}, 8 {A,B,C}, 9 {A,B,D}, 10 {B,C,E}
[Figure: the resulting FP tree, with branches A:7 and B:3 leaving the root and e.g. B:5 hanging under A.]
What trees would we get if the items within the baskets were ordered according to the increasing/decreasing order of their overall support? (The overall supports are A:7, B:8, C:7, D:5, E:3.)

FP-Growth algorithm. A divide-and-conquer algorithm working on the FP tree in a bottom-up manner. If the FP tree reveals that an itemset is frequent, check its supersets for their frequency. Examine FP trees conditioned on some already known frequent itemsets. E.g. if it turns out that {E} is frequent, check the frequency of the sets {A,E}, {B,E}, {C,E} and {D,E}.

FP trees conditioned on some target item(s): the part of the FP tree that we would get if we looked only at the transactions containing the target item(s), obtained without building the tree from scratch. Forget about the parts of the tree not related to the target item(s).
[Figure: the FP tree of the 10-transaction database conditioned on {E}; the baskets containing E are TID 3 {A,C,D,E}, TID 4 {A,D,E} and TID 10 {B,C,E}.]

FP trees conditioned on some target item(s), continued. Let the support of a node be the sum of the updated supports of its descendants.
[Figure: the FP tree conditioned on {E} with updated counts: A:2 with children C:1 (followed by D:1) and D:1, and B:1 with child C:1.]

FP trees conditioned on some target item(s), continued. Eliminate the items that fall below the frequency threshold (e.g., with a frequency threshold of 2, item B is dropped from the tree conditioned on {E}). Then create FP trees conditioned on item pairs containing {E}, based on which frequent item triples can be determined (e.g. {D,E} → {A,D,E}).
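The slides derive the conditional tree by walking the FP tree bottom-up; the equivalent computation can be sketched directly from the database (the function name and approach of recomputing from transactions are mine, shown only to make the conditional pattern base concrete):

```python
def conditional_pattern_base(transactions, order, target):
    """For each basket containing `target`, emit the prefix path
    (the other items of the basket in the global item order) with count 1."""
    base = []
    for t in transactions:
        if target in t:
            prefix = [i for i in order if i in t and i != target]
            base.append((tuple(prefix), 1))
    return base

transactions = [
    {"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"}, {"A","B","C"},
    {"A","B","C","D"}, {"B","C"}, {"A","B","C"}, {"A","B","D"}, {"B","C","E"},
]
order = ["A", "B", "C", "D", "E"]
print(conditional_pattern_base(transactions, order, "E"))
# [(('A', 'C', 'D'), 1), (('A', 'D'), 1), (('B', 'C'), 1)]
```

Summing the item counts over these prefix paths gives A:2, B:1, C:2, D:2, matching the conditional tree on the slide.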