Frequent Pattern Mining


Outline
- Frequent Patterns
- The Apriori Algorithm
- The FP-growth Algorithm
- Sequential Pattern Mining
- Summary

Netflix Prize
Users evaluate movies from time to time (http://www.netflixprize.com/). Can we predict how much a user likes a movie?
[Garbled table: Good/Bad ratings of movies A-E by users U1-U4; user U4's rating of movie E is unknown.]
If we find that the pattern (A=Good) AND (C=Bad) => (E=Good) holds for many users, we can recommend movie E to user U4!

Transaction Data Analysis
- Transactions: customers' purchases of commodities, e.g., bread, milk, and cheese if they are bought together
- Frequent patterns are product combinations that are frequently purchased together by customers
- Generally, frequent patterns are patterns (sets of items, sequences, etc.) that occur frequently in a database [AIS93]

Frequent Itemsets
Transaction database TDB:

TID  Items bought
100  f, a, c, d, g, i, m, p
200  a, b, c, f, l, m, o
300  b, f, h, j, o
400  b, c, k, s, p
500  a, f, c, e, l, p, m, n

- Itemset: a set of items, e.g., acm = {a, c, m}
- Support of an itemset: the number of transactions containing it, e.g., sup(acm) = 3
- Given min_sup = 3, acm is a frequent pattern
- Frequent pattern mining: finding all frequent patterns in a given database with respect to a given support threshold
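
Support is just a containment count over the database. Here is a minimal Python sketch of the definition, using the example TDB above:

```python
# A minimal sketch: counting the support of itemset {a, c, m} in the
# example database TDB above.
TDB = [
    {"f", "a", "c", "d", "g", "i", "m", "p"},
    {"a", "b", "c", "f", "l", "m", "o"},
    {"b", "f", "h", "j", "o"},
    {"b", "c", "k", "s", "p"},
    {"a", "f", "c", "e", "l", "p", "m", "n"},
]

def support(itemset, db):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in db if itemset <= t)

print(support({"a", "c", "m"}, TDB))  # -> 3, so acm is frequent for min_sup = 3
```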

A Naïve Attempt
- Generate all possible itemsets, then test their supports against the database
- How can we hold such a large number of itemsets in main memory? If there are 100 items, there are 2^100 - 1 possible itemsets
- How can we test the supports of a huge number of itemsets against a large database, say one containing 100 million transactions? For a transaction of length 20, we need to update the supports of 2^20 - 1 = 1,048,575 itemsets

Transactions in Real Applications
- A large department store often carries more than 100 thousand different kinds of items
- Amazon.com carries more than 17,000 books relevant to data mining
- Walmart has more than 20 million transactions per day
- AT&T produces more than 275 million calls per day
- Mining large transaction databases of many items is a real demand

How to Obtain an Efficient Method?
- Reduce the number of itemsets that need to be checked
- Check the supports of the selected itemsets efficiently

An Anti-Monotonic Property
- Any subset of a frequent itemset must also be frequent: the anti-monotonic property
- Example: a transaction containing {beer, diaper, nuts} also contains {beer, diaper}, so if {beer, diaper, nuts} is frequent, {beer, diaper} must also be frequent
- In other words, any superset of an infrequent itemset must also be infrequent
- Hence no superset of an infrequent itemset needs to be generated or tested: many item combinations can be pruned!

Candidate Generation & Test (the Apriori Principle)
- Find the frequent items
- Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
- Test the candidates against the DB
- Repeat until no more frequent itemsets are found

The Apriori Algorithm: Example (min_sup = 2)

Database D:
TID  Items
10   a, c, d
20   b, c, e
30   a, b, c, e
40   b, e

Scan D for the 1-candidates and their supports: a:2, b:3, c:3, d:1, e:3
Frequent 1-itemsets: a:2, b:3, c:3, e:3
2-candidates: ab, ac, ae, bc, be, ce
Scan D and count: ab:1, ac:2, ae:1, bc:2, be:3, ce:2
Frequent 2-itemsets: ac:2, bc:2, be:3, ce:2
3-candidates: bce
Scan D and count: frequent 3-itemsets: bce:2

The Apriori Algorithm [AgSr94]
Require: transaction database TDB, minimum support threshold min_sup
{C_k: the set of length-k candidate itemsets; L_k: the set of length-k frequent itemsets}
L_1 <- {frequent items}; k <- 1
while L_k is not empty do
  {candidate generation}
  C_{k+1} <- candidates generated from L_k
  sup(X) <- 0 for each X in C_{k+1}
  for all transactions t in TDB and itemsets X in C_{k+1} do
    if X is a subset of t then sup(X)++
  end for
  L_{k+1} <- {X in C_{k+1} | sup(X) >= min_sup}; k <- k + 1
end while
return the union of all L_i, i = 1..k
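
The loop above is easy to turn into code. Below is a compact, runnable Python sketch of the candidate-generation-and-test cycle (a plain implementation for illustration, not tuned for performance; candidate generation and pruning are folded into one step):

```python
from itertools import combinations

def apriori(db, min_sup):
    """Return {frozenset: support} for all frequent itemsets in db."""
    db = [frozenset(t) for t in db]
    # L1: frequent items
    counts = {}
    for t in db:
        for x in t:
            counts[frozenset([x])] = counts.get(frozenset([x]), 0) + 1
    Lk = {X: c for X, c in counts.items() if c >= min_sup}
    result = dict(Lk)
    k = 1
    while Lk:
        # candidate generation: extend each frequent k-itemset by one item,
        # keep a candidate only if all of its k-subsets are frequent (pruning)
        items = sorted({x for X in Lk for x in X})
        Ck = set()
        for X in Lk:
            for x in items:
                Y = X | {x}
                if len(Y) == k + 1 and all(frozenset(s) in Lk for s in combinations(Y, k)):
                    Ck.add(frozenset(Y))
        # test the candidates against the database
        sup = dict.fromkeys(Ck, 0)
        for t in db:
            for X in Ck:
                if X <= t:
                    sup[X] += 1
        Lk = {X: c for X, c in sup.items() if c >= min_sup}
        result.update(Lk)
        k += 1
    return result

D = [{"a","c","d"}, {"b","c","e"}, {"a","b","c","e"}, {"b","e"}]
print(apriori(D, 2))   # reproduces the example: a, b, c, e, ac, bc, be, ce, bce
```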

How to Find Frequent Items?
Finding the frequent items using a one-dimensional counter array:
for all items x do c[x] <- 0 end for
for all transactions t and all items x in t do c[x]++ end for
return {x | c[x] >= min_sup}
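
In Python the counter array is naturally a Counter; a minimal sketch, reusing the example database D from the Apriori sketch above:

```python
from collections import Counter

def frequent_items(db, min_sup):
    c = Counter()
    for t in db:
        c.update(set(t))          # each item counted once per transaction
    return {x for x, n in c.items() if n >= min_sup}

D = [{"a","c","d"}, {"b","c","e"}, {"a","b","c","e"}, {"b","e"}]
print(frequent_items(D, 2))       # -> {'a', 'b', 'c', 'e'}
```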

How to Find Length-2 Frequent Itemsets?
Use a 2-dimensional triangle matrix: for items i, j (i < j), c[i, j] is the count of itemset ij.
for all item pairs i < j do c[i, j] <- 0 end for
for all transactions t do
  sort the items in t in lexicographic order
  for i = 0 to len(t) - 2 do
    if t[i] is a frequent item then
      for j = i + 1 to len(t) - 1 do
        if t[j] is a frequent item then c[t[i], t[j]]++
      end for
    end if
  end for
end for
return {ij | i < j and c[i, j] >= min_sup}

Implementation
A 2-dimensional triangle matrix over n items can be implemented using a 1-dimensional array: for items i, j with i < j,

  c[i, j] = c[(i-1)(2n-i)/2 + j - i]

For n = 5, the upper triangle maps to array cells 1..10:

      j=2  j=3  j=4  j=5
i=1    1    2    3    4
i=2         5    6    7
i=3              8    9
i=4                  10

Example: c[3, 5] = c[(3-1)(2*5-3)/2 + 5 - 3] = c[9]
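
A small sketch of this layout in Python (the slide's formula is 1-based, so the code subtracts 1 for 0-based list indexing):

```python
# Pair counts for n items stored in a flat array of n*(n-1)/2 cells.
n = 5
counts = [0] * (n * (n - 1) // 2)

def cell(i, j):
    """Map item pair (i, j), 1 <= i < j <= n, to a flat array index."""
    assert 1 <= i < j <= n
    return (i - 1) * (2 * n - i) // 2 + (j - i) - 1   # -1 for 0-based lists

counts[cell(3, 5)] += 1
print(cell(3, 5) + 1)   # -> 9, matching the slide's 1-based example c[9]
```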

Candidate Generation Example
Suppose L3 = {abc, abd, acd, ace, bcd}. How can we generate C4?
- Self-joining L3 with L3: abcd from abc and abd; acde from acd and ace
- Pruning: acde is removed because its subset ade is not in L3
- Result: C4 = {abcd}

Candidate Generation Algorithm
Require: the items in every itemset in L_k are listed in an order R

{self-join L_k}
INSERT INTO C_{k+1}
SELECT p.item_1, p.item_2, ..., p.item_k, q.item_k
FROM L_k p, L_k q
WHERE p.item_1 = q.item_1, ..., p.item_{k-1} = q.item_{k-1}, p.item_k <_R q.item_k

{pruning}
for each itemset X in C_{k+1} do
  for each k-subset X' of X do
    if X' is not in L_k then remove X from C_{k+1}
  end for
end for
return C_{k+1}
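
A runnable sketch of self-join plus prune, assuming itemsets are kept as sorted tuples so that plain lexicographic order plays the role of the order R:

```python
from itertools import combinations

def gen_candidates(Lk):
    Lk = set(Lk)
    k = len(next(iter(Lk)))
    Ck1 = set()
    for p in Lk:
        for q in Lk:
            # join itemsets that agree on the first k-1 items
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                cand = p + (q[-1],)
                # prune: every k-subset of the candidate must be frequent
                if all(s in Lk for s in combinations(cand, k)):
                    Ck1.add(cand)
    return Ck1

L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
print(gen_candidates(L3))   # -> {('a','b','c','d')}; acde is pruned (ade not in L3)
```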

How to Count Supports?
Why is counting the supports of candidates a problem?
- The total number of candidates can be very large
- One transaction may contain many candidates
Method:
- Store the candidate itemsets in a hash-tree
- A leaf node of the hash-tree contains a list of itemsets and counts
- An interior node contains a hash table
- The subset function finds all the candidates contained in a transaction

Example
[Figure: a hash-tree over 3-item candidates, hashing items into buckets 1,4,7 / 2,5,8 / 3,6,9 at each level; the subset function matches transaction 1 2 3 5 6 against the tree by recursively splitting it into prefixes such as 1+{2 3 5 6}, 1 2+{3 5 6}, and 1 3+{5 6}.]
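
A full hash-tree is too long to reproduce here; the sketch below keeps the same interface with a hash set standing in for the tree: it enumerates each transaction's k-subsets and looks them up among the candidates. This is a simplification of the slide's method, not the hash-tree itself:

```python
from itertools import combinations

def count_supports(candidates, db, k):
    """Count, for every candidate k-itemset, how many transactions contain it."""
    sup = {c: 0 for c in candidates}
    for t in db:
        # every k-subset of the transaction is a potential candidate match
        for s in combinations(sorted(t), k):
            if s in sup:
                sup[s] += 1
    return sup

C2 = {("a","c"), ("b","c"), ("b","e"), ("c","e")}
D = [{"a","c","d"}, {"b","c","e"}, {"a","b","c","e"}, {"b","e"}]
print(count_supports(C2, D, 2))   # -> ac:2, bc:2, be:3, ce:2
```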

Bottleneck of Frequent Pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many scans and generates many candidates
- To find the frequent itemset i_1 i_2 ... i_100, 100 scans are needed, and the total number of candidates is C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30
- Bottleneck: candidate generation and test

Search Space of Frequent Pattern Mining
[Figure: the itemset lattice over items A, B, C, D: the empty set {} at the bottom, then A, B, C, D, all pairs AB ... CD, all triples ABC ... BCD, and ABCD at the top.]

Set Enumeration Tree
Use an order on items and enumerate itemsets in lexicographic order: a, ab, abc, abcd, abd, ac, acd, ad, b, bc, bcd, bd, c, cd, d
This reduces the lattice to a tree!
[Figure: the set enumeration tree with root {}, children a, b, c, d, their extensions ab, ac, ad, bc, bd, cd, and so on down to abcd.]
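
A tiny sketch of the set enumeration tree: a depth-first traversal in lexicographic order reproduces the listing above:

```python
# Depth-first enumeration of the set enumeration tree over ordered items.
def enumerate_itemsets(items, prefix=""):
    for i, x in enumerate(items):
        node = prefix + x
        yield node
        # children extend the node with strictly later items only
        yield from enumerate_itemsets(items[i + 1:], node)

print(list(enumerate_itemsets(["a", "b", "c", "d"])))
# -> ['a', 'ab', 'abc', 'abcd', 'abd', 'ac', 'acd', 'ad', 'b', 'bc', ...]
```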

Borders of Frequent Itemsets
- The frequent itemsets form a connected region of the lattice: the empty set is trivially frequent, and for any itemset X on the border, every subset of X is frequent!
[Figure: the set enumeration tree over a, b, c, d with the border separating the frequent itemsets from the infrequent ones.]

Projected Databases
- X-projected database: the set of transactions containing X, TDB_X = {t in TDB | X is a subset of t}
- To test whether itemset Xy is frequent, we can use the X-projected database and check whether item y is frequent in TDB_X!
[Figure: the set enumeration tree again; mining under node X explores TDB_X.]
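
Projection is a one-line filter. A sketch that checks whether acm is frequent via the ac-projected database, using the example TDB from earlier:

```python
TDB = [
    {"f","a","c","d","g","i","m","p"}, {"a","b","c","f","l","m","o"},
    {"b","f","h","j","o"}, {"b","c","k","s","p"},
    {"a","f","c","e","l","p","m","n"},
]

def project(db, X):
    """The X-projected database: transactions containing X."""
    return [t for t in db if X <= t]

TDB_ac = project(TDB, {"a", "c"})
sup_acm = sum(1 for t in TDB_ac if "m" in t)
print(len(TDB_ac), sup_acm)   # -> 3 3: acm is frequent iff m is frequent in TDB_ac
```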

Compressing a Transaction Database by an FP-tree
- The 1st scan finds the frequent items; only frequent items are recorded in the FP-tree
- F-list: the frequent items in frequency descending order: f-c-a-b-m-p
- The 2nd scan constructs the tree: order the frequent items in each transaction w.r.t. the f-list, then insert them, exploring sharing among transactions

TID  Items bought              (ordered) frequent items
100  f, a, c, d, g, i, m, p    f, c, a, m, p
200  a, b, c, f, l, m, o       f, c, a, b, m
300  b, f, h, j, o             f, b
400  b, c, k, s, p             c, b, p
500  a, f, c, e, l, p, m, n    f, c, a, m, p

[Figure: the resulting FP-tree with root, paths f:4 - c:3 - a:3 - m:2 - p:2, f:4 - c:3 - a:3 - b:1 - m:1, f:4 - b:1, and c:1 - b:1 - p:1, plus a header table linking the occurrences of each item in f-list order.]
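
A sketch of the two scans that precede tree construction: compute the f-list, then reorder each transaction's frequent items accordingly. Note that ties between equally frequent items can be broken arbitrarily; the code below breaks them alphabetically, whereas the slide breaks the f/c tie the other way:

```python
from collections import Counter

TDB = [
    ["f","a","c","d","g","i","m","p"], ["a","b","c","f","l","m","o"],
    ["b","f","h","j","o"], ["b","c","k","s","p"],
    ["a","f","c","e","l","p","m","n"],
]
min_sup = 3

# 1st scan: item frequencies, then the f-list (descending frequency)
freq = Counter(x for t in TDB for x in set(t))
flist = sorted((x for x in freq if freq[x] >= min_sup),
               key=lambda x: (-freq[x], x))
rank = {x: i for i, x in enumerate(flist)}

# 2nd scan (preparation): keep and reorder only the frequent items
ordered = [sorted((x for x in t if x in rank), key=rank.get) for t in TDB]
print(flist)    # -> ['c', 'f', 'a', 'b', 'm', 'p']; the slide uses f-c-a-b-m-p
print(ordered)  # e.g. ['c', 'f', 'a', 'm', 'p'] for TID 100
```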

Why an FP-tree?
Completeness:
- Never breaks a long pattern in any transaction
- Preserves complete information for frequent pattern mining: no need to scan the database anymore
Compactness:
- Reduces irrelevant information: infrequent items are removed
- Items are in frequency descending order (the f-list): the more frequently an item occurs, the more likely it is to be shared
- The tree is never larger than the original database (not counting node-links and the count fields)

Partitioning Frequent Patterns
Frequent patterns can be partitioned into subsets according to the f-list f-c-a-b-m-p:
- Patterns containing p
- Patterns having m but no p
- ...
- Patterns having c but none of a, b, m, or p
- Pattern f
This is a depth-first search of the set enumeration tree. The partitioning is complete and has no overlap.

Find Patterns Having Item p
- Only the transactions containing p are needed
- Form the p-projected database: start at entry p of the header table, follow the side-links of item p, and accumulate all the transformed prefix paths of p
- p-projected database TDB_p: fcam:2, cb:1
- Local frequent item: c:3
- Frequent patterns containing p: p:3, pc:3

Find Patterns Having Item m But No p
- Form the m-projected database TDB_m; item p is excluded (why? all patterns containing p were already found in the p-partition)
- TDB_m = {fca:2, fcab:1}
- Local frequent items: f, c, a
- Build an FP-tree for TDB_m: the m-projected FP-tree is the single path f:3 - c:3 - a:3

Recursive Mining
- Patterns having m but no p can be mined recursively
- Optimization: enumerate patterns directly from a single-branch FP-tree
- Enumerate all combinations of the items on the branch; the support of each combination is that of its last (deepest) item
- Example: from the m-projected FP-tree's single path f:3 - c:3 - a:3, we get m, fm, cm, am, fcm, fam, cam, fcam
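
A sketch of the single-branch optimization on the path above (the suffix pattern m and its support of 3 are taken from the slide's example):

```python
from itertools import combinations

path = [("f", 3), ("c", 3), ("a", 3)]     # the m-projected FP-tree's single path
suffix = "m"

for r in range(len(path) + 1):
    for combo in combinations(path, r):
        items = "".join(x for x, _ in combo) + suffix
        # support = count of the deepest included node; sup(m) = 3 if none
        support = combo[-1][1] if combo else 3
        print(items, support)
# -> m 3, fm 3, cm 3, am 3, fcm 3, fam 3, cam 3, fcam 3
```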

Patterns from a Single Prefix
When a (projected) FP-tree has a single prefix, we can reduce the single prefix to one virtual node, mine the prefix and the branching part separately, and join the two sets of mining results.
[Figure: a tree whose single prefix a1:n1 - a2:n2 - a3:n3 sits above a branching part (b1:m1 and c1:k1 with children c2:k2 and c3:k3) is split into the prefix path and the branching subtree rooted at a virtual node r.]

The FP-growth Algorithm
Pattern growth: recursively grow frequent patterns by pattern and database partitioning.
for each frequent item x do
  construct the x-projected database, and then the x-projected FP-tree
  recursively mine the x-projected FP-tree, until the resulting FP-tree is either empty or contains only one path (a single path generates all the combinations of its items, each of which is a frequent pattern)
end for
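
Putting the pieces together, below is a compact, runnable Python sketch of FP-growth. It follows the slide's structure but simplifies the bookkeeping: per-item lists of nodes stand in for the chained node-links, and each projected database is materialized as a weighted list of prefix paths (the conditional pattern base):

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(db, min_sup):
    # 1st scan: frequent items in descending frequency (the f-list)
    freq = defaultdict(int)
    for t in db:
        for x in set(t):
            freq[x] += 1
    flist = sorted((x for x in freq if freq[x] >= min_sup),
                   key=lambda x: (-freq[x], x))
    rank = {x: i for i, x in enumerate(flist)}
    root = FPNode(None, None)
    header = defaultdict(list)           # item -> its nodes (node-link stand-in)
    # 2nd scan: insert each transaction's frequent items in f-list order
    for t in db:
        node = root
        for x in sorted((y for y in set(t) if y in rank), key=rank.get):
            if x not in node.children:
                node.children[x] = FPNode(x, node)
                header[x].append(node.children[x])
            node = node.children[x]
            node.count += 1
    return header, flist

def fp_growth(db, min_sup, suffix=()):
    header, flist = build_tree(db, min_sup)
    for x in reversed(flist):            # least frequent item first
        pattern = (x,) + suffix
        yield pattern, sum(n.count for n in header[x])
        # x-projected database: the prefix paths of x, weighted by count
        proj = []
        for n in header[x]:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            proj.extend([path] * n.count)
        yield from fp_growth(proj, min_sup, pattern)

D = [{"a","c","d"}, {"b","c","e"}, {"a","b","c","e"}, {"b","e"}]
print(dict(fp_growth(D, 2)))   # same answer as the Apriori sketch earlier
```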

From Itemsets to Sequences
- Itemsets are combinations of items with no temporal order; temporal order matters in many situations, such as time-series and sequence databases
- Frequent patterns become (frequent) sequential patterns
- Application example: mining mobile users' trajectories. If a user parks a car, buys a parking ticket, and visits a coffee shop, all within 15 minutes, we can recommend a coffee shop on the user's cell phone
- More applications: medical treatment, natural disasters, science and engineering processes, stocks and markets, telephone calling patterns, Web log clickthrough streams, DNA sequences and gene structures

What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences.

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

A sequence, e.g., <(ef)(ab)(df)cb>, is an ordered list of elements; items inside parentheses form one element and occur together. Given a minimum support threshold min_sup = 2, <(ab)c> is a sequential pattern.
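
Containment of a sequential pattern in a data sequence is an ordered matching of pattern elements to superset elements. A minimal sketch, encoding the example database above as lists of sets:

```python
def contains(seq, pattern):
    """True if pattern's elements match, in order, supersets in seq."""
    i = 0
    for element in seq:
        if i < len(pattern) and pattern[i] <= element:
            i += 1
    return i == len(pattern)

DB = [
    [{"a"}, {"a","b","c"}, {"a","c"}, {"d"}, {"c","f"}],      # SID 10
    [{"a","d"}, {"c"}, {"b","c"}, {"a","e"}],                 # SID 20
    [{"e","f"}, {"a","b"}, {"d","f"}, {"c"}, {"b"}],          # SID 30
    [{"e"}, {"g"}, {"a","f"}, {"c"}, {"b"}, {"c"}],           # SID 40
]
pattern = [{"a","b"}, {"c"}]   # <(ab)c>
print(sum(contains(s, pattern) for s in DB))   # -> 2, frequent for min_sup = 2
```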

An Anti-Monotonic Property of Sequential Patterns
If a sequence s is infrequent, then none of the super-sequences of s is frequent.
Example: let min_sup = 2. <hb> is infrequent, so <hab> and <(ah)b> are infrequent.

Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>

Sequential Pattern Mining Algorithm GSP
GSP mines level by level, Apriori style, over the database below:
- 1st scan: 8 candidates (<a>, <b>, ..., <h>), 6 length-1 sequential patterns
- 2nd scan: 51 candidates, 19 length-2 sequential patterns; 10 candidates (e.g., <af>, <ff>, <(ef)>) do not occur in the DB at all
- 3rd scan: 46 candidates, 19 length-3 sequential patterns; 20 candidates not in the DB at all
- 4th scan: 8 candidates, 6 length-4 sequential patterns
- 5th scan: 1 candidate, 1 length-5 sequential pattern
[Figure: a pyramid of candidates per level, e.g., <aa>, <ab>, <ba>, <bb>, <(ab)> at length 2 and <abb>, <aab>, <aba>, <baa>, <bab> at length 3, marking candidates that cannot pass the support threshold (such as <(bd)cba>, <abba>, <(bd)bc>) and candidates that do not occur in the DB at all.]

Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>

Sequential Pattern Mining Algorithm PrefixSpan

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

- Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
- Partition the search space by prefix: patterns having prefix <a>, prefix <b>, ..., prefix <f>
- The <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
- From it, the length-2 sequential patterns with prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
- Recurse: form the <aa>-projected database for prefix <aa>, ..., the <af>-projected database for prefix <af>
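
A minimal, runnable sketch of the PrefixSpan idea, simplified to sequences of single items (the itemset elements and the (_x) notation of the slide are omitted, so patterns like <(ab)> are out of scope for this sketch):

```python
from collections import Counter

def prefixspan(db, min_sup, prefix=()):
    # count each item once per (suffix) sequence
    counts = Counter(x for s in db for x in set(s))
    for x, sup in sorted(counts.items()):
        if sup < min_sup:
            continue
        pattern = prefix + (x,)
        yield pattern, sup
        # x-projected database: the suffix after the first occurrence of x
        proj = [s[s.index(x) + 1:] for s in db if x in s]
        yield from prefixspan([s for s in proj if s], min_sup, pattern)

DB = ["abcd", "acbd", "adcb", "bd"]   # a hypothetical toy database
for pat, sup in prefixspan(DB, 3):
    print("".join(pat), sup)
# prefix <a> has projected database {"bcd", "cbd", "dcb"}, and so on
```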

Summary
- Frequent patterns: frequent combinations in large transaction databases
- Mining frequent patterns: the anti-monotonic property, the Apriori algorithm, the FP-growth algorithm
- Sequential patterns and their mining: GSP and PrefixSpan

To-Do List
- Read the following paper to understand how PrefixSpan mines sequential patterns: J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach. IEEE Transactions on Knowledge and Data Engineering, 16(11):1424-1440, November 2004.
- There is often redundancy among frequent patterns. Read the following paper to understand how FP-growth can be extended to mine frequent closed itemsets, a type of non-redundant frequent pattern: J. Pei, J. Han, and R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. In Proceedings of the 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Dallas, TX, May 2000.