DATA MINING II - 1DL460


DATA MINING II - 1DL460, Spring 2013. A second course in data mining. http://www.it.uu.se/edu/course/homepage/infoutv2/vt13 Kjell Orsborn, Uppsala Database Laboratory, Department of Information Technology, Uppsala University, Uppsala, Sweden. 4/18/13

Data Mining: Alternative Association Analysis (Tan, Steinbach, Kumar, ch. 6). Kjell Orsborn, Department of Information Technology, Uppsala University, Uppsala, Sweden.

Alternative methods for frequent itemset generation: Traversal of the itemset lattice - general-to-specific vs. specific-to-general. [Figure: three itemset lattices over {a1, a2, ..., an} with the frequent itemset border marked: (a) general-to-specific, (b) specific-to-general, (c) bidirectional.]

Alternative methods for frequent itemset generation: Traversal of the itemset lattice as a prefix or suffix tree implies different equivalence classes. Left: a prefix tree, with equivalence classes defined by prefixes of length k = 1. Right: a suffix tree, with equivalence classes defined by suffixes of length k = 1. [Figure: (a) the prefix tree and (b) the suffix tree over the items A, B, C, D, from singletons up to ABCD.]

Alternative methods for frequent itemset generation: Traversal of the itemset lattice - breadth-first vs. depth-first. [Figure: (a) breadth-first and (b) depth-first traversal of the lattice.]

Alternative methods for frequent itemset generation: Representation of the database - horizontal vs. vertical data layout.

Horizontal data layout (TID: items): 1: A,B,E; 2: B,C,D; 3: C,E; 4: A,C,D; 5: A,B,C,D; 6: A,E; 7: A,B; 8: A,B,C; 9: A,C,D; 10: B

Vertical data layout (item: TIDs): A: 1,4,5,6,7,8,9; B: 1,2,5,7,8,10; C: 2,3,4,5,8,9; D: 2,4,5,9; E: 1,3,6
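As a small illustration, the horizontal layout can be inverted into the vertical (item → TID-list) layout in one pass over the database. A sketch in Python, using the example transactions from this slide (the function name `to_vertical` is just for illustration):

```python
# Invert a horizontal transaction database (TID -> items) into the
# vertical layout (item -> sorted TID list). The support of an itemset
# can then be computed by intersecting TID lists.
from collections import defaultdict

transactions = {
    1: {"A", "B", "E"}, 2: {"B", "C", "D"}, 3: {"C", "E"},
    4: {"A", "C", "D"}, 5: {"A", "B", "C", "D"}, 6: {"A", "E"},
    7: {"A", "B"}, 8: {"A", "B", "C"}, 9: {"A", "C", "D"}, 10: {"B"},
}

def to_vertical(db):
    tidlists = defaultdict(list)
    for tid in sorted(db):          # visit TIDs in order -> sorted lists
        for item in db[tid]:
            tidlists[item].append(tid)
    return dict(tidlists)

vertical = to_vertical(transactions)
# vertical["D"] == [2, 4, 5, 9]; support of {C, D} is
# len(set(vertical["C"]) & set(vertical["D"])) == 4
```

Vertical-layout algorithms such as Eclat exploit exactly this TID-list intersection to count support without rescanning the transactions.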

Characteristics of the Apriori algorithm: Breadth-first search: all frequent itemsets of a given size are kept in the algorithm's processing queue. General-to-specific search: start with itemsets of large support and work towards the lower-support region. Generate-and-test strategy: generate candidate itemsets, then test them by support counting. [Figure: the complete itemset lattice over the items A-E, from the singletons up to ABCDE.]
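The generate-and-test strategy can be sketched as a level-wise loop: join frequent (k-1)-itemsets that share a (k-2)-prefix, prune candidates with an infrequent subset, then count support in a pass over the database. A minimal Python sketch (the function `apriori` and the absolute threshold `minsup` are illustrative names, not from the slides; itemsets are kept as sorted tuples):

```python
# Minimal level-wise Apriori: generate candidates of size k from frequent
# itemsets of size k-1, prune by the Apriori principle, test by counting.
from itertools import combinations

def apriori(transactions, minsup):
    items = sorted({i for t in transactions for i in t})
    # L1: frequent 1-itemsets
    freq = {(i,) for i in items
            if sum(i in t for t in transactions) >= minsup}
    result = set(freq)
    k = 2
    while freq:
        # join step: merge itemsets agreeing on all but the last item
        candidates = {tuple(sorted(set(a) | set(b)))
                      for a in freq for b in freq
                      if a[:-1] == b[:-1] and a[-1] < b[-1]}
        # prune (all (k-1)-subsets frequent) + test by support counting
        freq = {c for c in candidates
                if all(s in result for s in combinations(c, k - 1))
                and sum(set(c) <= t for t in transactions) >= minsup}
        result |= freq
        k += 1
    return result

db = [{"A","B","E"}, {"B","C","D"}, {"C","E"}, {"A","C","D"},
      {"A","B","C","D"}, {"A","E"}, {"A","B"}, {"A","B","C"},
      {"A","C","D"}, {"B"}]
res = apriori(db, 2)
```

Note how each iteration of the loop requires a full pass over `transactions` to count candidate supports, which is exactly the I/O weakness discussed on the next slide.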

Weaknesses of Apriori: Apriori is one of the first algorithms that successfully tackled the exponential size of the frequent itemset space. Nevertheless, Apriori suffers from two main weaknesses: High I/O overhead from the generate-and-test strategy: several passes over the database are required to find the frequent itemsets. Performance can degrade significantly on dense databases, as a large portion of the itemset lattice becomes frequent.

FP-growth algorithm: mining frequent patterns without candidate generation, using a frequent-pattern (FP) tree. FP-growth avoids Apriori's repeated scans of the database by using a compressed representation of the transaction database, a data structure called the FP-tree. Once an FP-tree has been constructed, a recursive divide-and-conquer approach is used to mine the frequent itemsets. The FP-tree is a compressed representation of the transaction database: Each transaction is mapped onto a path in the tree. Each node contains an item and a support count equal to the number of transactions whose prefix corresponds to the path from the root to that node. Nodes with the same item label are cross-linked; this helps in finding the frequent itemsets ending with a particular item.

FP-tree construction. Transaction database:

TID 1: {A,B}; 2: {B,C,D}; 3: {A,C,D,E}; 4: {A,D,E}; 5: {A,B,C}; 6: {A,B,C,D}; 7: {A}; 8: {A,B,C}; 9: {A,B,D}; 10: {B,C,E}

[Figures: the FP-tree after reading each transaction TID=1 through TID=10. After TID=1 the tree is the single path A:1-B:1; each subsequent transaction either follows an existing prefix path, incrementing its counts, or branches off into a new path. After TID=10 the root has two children, A:8 and B:2.]

A header table with an entry for each item (A, B, C, D, E) points to the first node labeled with that item; nodes with the same label are chained by node-links. These pointers are used to assist frequent itemset generation.
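The construction just described can be sketched in Python. This is a minimal version under the slides' assumptions: items are inserted in decreasing order of global support, and the header table keeps the node-links as a plain per-item list of nodes:

```python
# Sketch of FP-tree construction for the slides' 10-transaction database.
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}   # item -> child Node

def build_fptree(transactions):
    support = Counter(i for t in transactions for i in t)
    root, header = Node(None, None), {}
    for t in transactions:
        # insert items by decreasing support (ties broken alphabetically)
        node = root
        for item in sorted(t, key=lambda i: (-support[i], i)):
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)  # node-link
            node = node.children[item]
            node.count += 1
    return root, header

db = [{"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"},
      {"A","B","C"}, {"A","B","C","D"}, {"A"}, {"A","B","C"},
      {"A","B","D"}, {"B","C","E"}]
root, header = build_fptree(db)
# root has two children: A (count 8) and B (count 2), as on the slides;
# summing counts over header["E"] gives sigma(E) = 3
```

Following `header[item]` corresponds to walking the cross-links of the slides, and summing the node counts along it gives the item's support without touching the database again.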

Observations about the FP-tree: The size of the FP-tree depends on how the items are ordered. In the previous example, if the ordering is done in increasing order of support, the resulting FP-tree is different, and for this example it is denser (wider): at the root node the branching factor increases from 2 to 5. Also, ordering by decreasing support count doesn't always lead to the smallest tree.

FP-tree size: The size of an FP-tree is typically smaller than the size of the uncompressed data, because many transactions often share items. Best-case scenario: all transactions have the same set of items, and the FP-tree contains only a single branch of nodes. Worst-case scenario: every transaction has a unique set of items; since no transactions have any items in common, the size of the FP-tree is effectively the same as the size of the original data. The size of an FP-tree also depends on how the items are ordered. If the ordering scheme in the preceding example is reversed, i.e., from lowest to highest support, the resulting FP-tree is probably denser. Not always, though: the ordering is just a heuristic.
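The effect of the ordering heuristic is easy to check empirically. A sketch that counts tree nodes for the slides' example database under both orderings (plain nested dicts stand in for full FP-tree nodes, since only the shape matters here):

```python
# Compare FP-tree size under decreasing- vs increasing-support item order
# for the slides' example database. Counts are omitted; only tree shape
# (number of nodes, root branching factor) is measured.
from collections import Counter

def tree_shape(transactions, increasing):
    support = Counter(i for t in transactions for i in t)
    tree = {}  # nested dict: item -> subtree
    for t in transactions:
        order = sorted(t, key=lambda i: (support[i], i) if increasing
                       else (-support[i], i))
        node = tree
        for item in order:
            node = node.setdefault(item, {})
    def size(node):  # total number of nodes in the nested dict
        return sum(1 + size(child) for child in node.values())
    return size(tree), len(tree)  # (total nodes, root branching factor)

db = [{"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"},
      {"A","B","C"}, {"A","B","C","D"}, {"A"}, {"A","B","C"},
      {"A","B","D"}, {"B","C","E"}]
dec = tree_shape(db, increasing=False)
inc = tree_shape(db, increasing=True)
# decreasing order: 14 nodes, branching factor 2 at the root;
# increasing order: 19 nodes, branching factor 5, as the slide states
```

For this database the reversed ordering is indeed wider and larger, matching the observation above; on other databases the comparison can go the other way, which is why the ordering remains a heuristic.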

FP-tree vs. original database: If the transactions share a significant number of items, the FP-tree can be considerably smaller, as transactions with a common subset of items are likely to share paths. There is a storage overhead from the node-links as well as from the support counts, so in the worst case the FP-tree may even be larger than the original database.

Frequent itemset generation in FP-growth: The algorithm generates frequent itemsets from the FP-tree by traversing it in a bottom-up fashion, extracting frequent itemsets ending in E first and then those ending in D, C, B and A. Since every transaction is mapped onto a single path in the FP-tree, the frequent itemsets ending in, say, E can be found by investigating only the paths containing node E.

Mining frequent patterns using the FP-tree: General idea (divide-and-conquer): recursively grow frequent patterns using the FP-tree, looking for shorter patterns recursively and then concatenating the suffix. For each frequent item, construct its conditional pattern base and then its conditional FP-tree. Repeat the process on each newly created conditional FP-tree until the resulting FP-tree is empty.

Major steps of the FP-growth algorithm: Step 1: Construct the conditional pattern base for each item in the header table: starting at the bottom of the frequent-item header table of the FP-tree, traverse the FP-tree by following the node-links of each frequent item, and accumulate all transformed prefix paths of that item to form its conditional pattern base. Step 2: Construct the conditional FP-tree from each conditional pattern base: for each pattern base, accumulate the count for each item in the base, and construct the conditional FP-tree over the frequent items of the pattern base. Step 3: Recursively mine the conditional FP-trees, growing the frequent patterns obtained so far.
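The three steps can be sketched end-to-end as a compact recursive FP-growth. This is a sketch under the slides' assumptions (minsup = 2; the conditional FP-tree is built by the same routine as the initial tree, applied to the weighted prefix paths), not the reference implementation:

```python
# Compact recursive FP-growth over the slides' example database.
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build(weighted, minsup):
    """FP-tree over (itemset, count) pairs; returns the header table."""
    support = Counter()
    for items, cnt in weighted:
        for i in items:
            support[i] += cnt
    root, header = Node(None, None), {}
    for items, cnt in weighted:
        node = root
        # prune infrequent items, insert by decreasing support
        for i in sorted((i for i in items if support[i] >= minsup),
                        key=lambda i: (-support[i], i)):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header.setdefault(i, []).append(node.children[i])
            node = node.children[i]
            node.count += cnt
    return header

def fpgrowth(weighted, minsup, suffix=()):
    patterns = {}
    for item, nodes in build(weighted, minsup).items():
        patterns[tuple(sorted((item,) + suffix))] = sum(n.count for n in nodes)
        # Step 1: conditional pattern base = weighted prefix paths of `item`
        base = []
        for n in nodes:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append((path, n.count))
        # Steps 2 + 3: build the conditional FP-tree and mine it recursively
        patterns.update(fpgrowth(base, minsup, (item,) + suffix))
    return patterns

db = [({"A","B"}, 1), ({"B","C","D"}, 1), ({"A","C","D","E"}, 1),
      ({"A","D","E"}, 1), ({"A","B","C"}, 1), ({"A","B","C","D"}, 1),
      ({"A"}, 1), ({"A","B","C"}, 1), ({"A","B","D"}, 1), ({"B","C","E"}, 1)]
patterns = fpgrowth(db, 2)
# e.g. patterns[("A", "D", "E")] == 2 and patterns[("A", "D")] == 4
```

Run on the example database, this reproduces the itemsets derived on the following slides: E, DE, ADE, CE, AE for the suffix E, and D, CD, BCD, ACD, BD, ABD, AD for the suffix D.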

Frequent itemset generation in FP-growth: FP-growth uses a divide-and-conquer approach to find frequent itemsets. It searches for frequent itemsets ending with item E first, then itemsets ending with D, C, B, and A; that is, it uses equivalence classes based on length-1 suffixes. The paths corresponding to the different suffixes are extracted from the FP-tree.

Frequent itemset generation in FP-growth: To find all frequent itemsets ending with a given last item (e.g. E), we first compute the support of the item. This is the sum of the support counts of all nodes labeled with the item (σ(E) = 3), found by following the cross-links connecting the nodes with the same item. If the last item is frequent, FP-growth next iteratively looks for all frequent itemsets ending with a length-2 suffix (DE, CE, BE and AE), then recursively with length-3 suffixes, length-4 suffixes, and so on until no more frequent itemsets are found. A conditional FP-tree is constructed for each suffix to speed up the computation.

Frequent itemset generation in FP-growth. [Figures: the subtrees of the FP-tree formed by the paths containing node E, node D, node C, node B and node A, respectively. For example, the paths containing A form the single subtree rooted at A:8, while the paths containing B span both root branches, A:8 and B:2.]

Frequent itemset generation for paths ending in E: Prefix paths ending in E: (A,C,D,E : 1), (A,D,E : 1) and (B,C,E : 1). Conditional pattern base for E: P = {(A,C,D : 1), (A,D : 1), (B,C : 1)}. In the conditional FP-tree for E, the infrequent item B (count 1) is pruned. Recursively applying FP-growth on P yields the frequent itemsets (with sup > 1): E, DE, ADE, CE, AE.

Frequent itemset generation for paths ending in E: Prefix paths ending in DE: (A,C,D : 1) and (A,D : 1), giving the conditional pattern base {(A,C : 1), (A : 1)}. In the conditional FP-tree for DE, A has count 2 and the infrequent item C is pruned; this yields ADE.

Frequent itemset generation for paths ending in E: Prefix paths ending in CE give the conditional pattern base {(A : 1)}. In the conditional FP-tree for CE, A has count 1 and is pruned, so CE has no frequent extensions.

Frequent itemset generation for paths ending in E: Since B was already pruned, there is no need to cover the case BE, and only AE is left. The prefix paths ending in AE collapse to the single node A:2; its conditional pattern base is empty, so the recursion ends with AE (support 2).

Is FP-growth faster than Apriori? As the support threshold goes down, the number of frequent itemsets increases dramatically, and FP-growth benefits from not having to generate candidates and test them. [Figure: run time vs. support threshold for FP-growth and Apriori.]

Is FP-growth faster than Apriori? Both FP-growth and Apriori scale linearly with the number of transactions, but FP-growth is more efficient. [Figure: run time vs. number of transactions.]

Frequent itemset generation for paths ending in D: Prefix paths ending in D: (A,B,C,D : 1), (A,B,D : 1), (A,C,D : 1), (A,D : 1) and (B,C,D : 1). Conditional pattern base for D: P = {(A,B,C : 1), (A,B : 1), (A,C : 1), (A : 1), (B,C : 1)}. Recursively applying FP-growth on P yields the frequent itemsets (sup > 1): D, CD, BCD, ACD, BD, ABD, AD.

[Figures: prefix paths and conditional FP-trees for the suffixes CD, BCD, ACD, BD and AD, obtained from D's conditional FP-tree in the same way as for E. For example, in the conditional FP-tree for BCD the item A (count 1) is pruned, while the prefix paths ending in AD collapse to the single node A:4.]

The tree projection algorithm: generation of frequent itemsets by successive construction of the nodes of a lexicographic tree of itemsets. [Fig. 1: the lexicographic tree.]