OPTIMISING ASSOCIATION RULE ALGORITHMS USING ITEMSET ORDERING

Similar documents
Tree Structures for Mining Association Rules

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

DATA MINING II - 1DL460

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar

Chapter 4: Association analysis:

Data Structure for Association Rule Mining: T-Trees and P-Trees

Data Mining Techniques

AB AC AD BC BD CD ABC ABD ACD ABCD

CS570 Introduction to Data Mining

Data Mining Part 3. Associations Rules

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

Chapter 4: Mining Frequent Patterns, Associations and Correlations

Association rule mining

Market baskets Frequent itemsets FP growth. Data mining. Frequent itemset Association&decision rule mining. University of Szeged.

2. Discovery of Association Rules

Association Pattern Mining. Lijun Zhang

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Temporal Approach to Association Rule Mining Using T-tree and P-tree

Mining Frequent Patterns without Candidate Generation

Association Rules. A. Bellaachia Page: 1

CHAPTER 8. ITEMSET MINING 226

Performance and Scalability: Apriori Implementation

Effectiveness of Freq Pat Mining

Tutorial on Assignment 3 in Data Mining 2009 Frequent Itemset and Association Rule Mining. Gyozo Gidofalvi Uppsala Database Laboratory

Chapter 7: Frequent Itemsets and Association Rules

Data Mining for Knowledge Management. Association Rules

Data Warehousing and Data Mining

Chapter 6: Association Rules

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

Improved Methods for Extracting Frequent Itemsets from Interim-Support Trees

Association rules. Marco Saerens (UCL), with Christine Decaestecker (ULB)

Mining Association Rules in Large Databases

Frequent Pattern Mining

BCB 713 Module Spring 2011

Association Analysis: Basic Concepts and Algorithms

Frequent Pattern Mining

H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. Paper s goals. H-mine characteristics. Why a new algorithm?

A linear delay algorithm for building concept lattices

Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining

Association Rule Mining

Pattern Lattice Traversal by Selective Jumps

APPLYING BIT-VECTOR PROJECTION APPROACH FOR EFFICIENT MINING OF N-MOST INTERESTING FREQUENT ITEMSETS

Induction of Association Rules: Apriori Implementation

Association Rule Learning

Basic Concepts: Association Rules. What Is Frequent Pattern Analysis? COMP 465: Data Mining Mining Frequent Patterns, Associations and Correlations

Building Roads. Page 2. I = {;, a, b, c, d, e, ab, ac, ad, ae, bc, bd, be, cd, ce, de, abd, abe, acd, ace, bcd, bce, bde}

Association Rules Apriori Algorithm

Association Rule Mining: FP-Growth

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

Production rule is an important element in the expert system. By interview with

Efficient Computation of Data Cubes. Network Database Lab

Data Warehousing & Mining. Data integration. OLTP versus OLAP. CPS 116 Introduction to Database Systems

Information Sciences

On Canonical Forms for Frequent Graph Mining

Association Rule Mining. Introduction 46. Study core 46

Roadmap. PCY Algorithm

Ascending Frequency Ordered Prefix-tree: Efficient Mining of Frequent Patterns

CS 6093 Lecture 7 Spring 2011 Basic Data Mining. Cong Yu 03/21/2011

PLT- Positional Lexicographic Tree: A New Structure for Mining Frequent Itemsets

CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM

1.4 Euler Diagram Layout Techniques

This paper proposes: Mining Frequent Patterns without Candidate Generation

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL

An Algorithm for Mining Large Sequences in Databases

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Web Page Classification using FP Growth Algorithm. Akansha Garg, Computer Science Department, Swami Vivekanad Subharti University, Meerut, India

Jeffrey D. Ullman Stanford University

An Empirical Comparison of Methods for Iceberg-CUBE Construction. Leah Findlater and Howard J. Hamilton Technical Report CS August, 2000

An Automated Support Threshold Based on Apriori Algorithm for Frequent Itemsets

A fast parallel algorithm for frequent itemsets mining

Weighted Association Rule Mining from Binary and Fuzzy Data

CLOSET+:Searching for the Best Strategies for Mining Frequent Closed Itemsets

Association Rule Discovery

We will be releasing HW1 today It is due in 2 weeks (1/25 at 23:59pm) The homework is long

Association Rules. Comp 135 Machine Learning Computer Science Tufts University. Association Rules. Association Rules. Data Model.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Parallelizing Frequent Itemset Mining with FP-Trees

Association Rules Apriori Algorithm

A Reconfigurable Platform for Frequent Pattern Mining

Improvements to A-Priori. Bloom Filters Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms Compacting Results

FastLMFI: An Efficient Approach for Local Maximal Patterns Propagation and Maximal Patterns Superset Checking

Association Rule Mining (ARM) Komate AMPHAWAN

Chapter 5, Data Cube Computation

CHAPTER 3 ASSOCIATION RULE MINING WITH LEVELWISE AUTOMATIC SUPPORT THRESHOLDS

Parallel Bifold: Large-Scale Parallel Pattern Mining with Constraints

PFPM: Discovering Periodic Frequent Patterns with Novel Periodicity Measures

Association Rules Outline

MS-FP-Growth: A Multi-Support Version of the FP-Growth Algorithm

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 6

Rule induction. Dr Beatriz de la Iglesia

High dim. data. Graph data. Infinite data. Machine learning. Apps. Locality sensitive hashing. Filtering data streams.

Data Mining: Concepts and Techniques. Chapter 5. SS Chung. April 5, 2013 Data Mining: Concepts and Techniques 1

Frequent Item Sets & Association Rules

Hash-Based Improvements to A-Priori. Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms

Parallel Algorithms for Discovery of Association Rules

Mining Closed Itemsets: A Review

Advance Association Analysis

Mining High Average-Utility Itemsets

Optimized Frequent Pattern Mining for Classified Data Sets

Performance Analysis of Data Mining Algorithms

Transcription:

OPTIMISING ASSOCIATION RULE ALGORITHMS USING ITEMSET ORDERING. ES2001, Peterhouse College, Cambridge. Frans Coenen, Paul Leng and Graham Goulbourne, Department of Computer Science, The University of Liverpool

Introduction: The archetypal problem is shopping basket analysis: which items tend to occur together in shopping baskets? Examine a database of purchase transactions and look for associations. Find association rules such as PQ -> X: when P and Q occur together, X is likely to occur also.

Support and Confidence: The support for a rule A -> B is the number (proportion) of cases in which AB occur together. The confidence for a rule is the ratio of the support for the rule to the support for its antecedent. The problem: find all rules for which support and confidence exceed some threshold. Computing support (i.e. finding the frequent sets) is the difficult part; confidence follows.
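To make the definitions concrete, here is a minimal Python sketch; the five-basket database is the classic textbook toy example, and the function names are ours, not the paper's:

    # Toy transaction database: five shopping baskets.
    transactions = [
        {"bread", "milk"},
        {"bread", "diaper", "beer", "eggs"},
        {"milk", "diaper", "beer", "coke"},
        {"bread", "milk", "diaper", "beer"},
        {"bread", "milk", "diaper", "coke"},
    ]

    def support(itemset, db):
        # Proportion of records containing every item of `itemset`.
        return sum(itemset <= t for t in db) / len(db)

    def confidence(antecedent, consequent, db):
        # support(antecedent union consequent) / support(antecedent).
        return support(antecedent | consequent, db) / support(antecedent, db)

    print(support({"bread", "milk"}, transactions))       # 0.6
    print(confidence({"bread"}, {"milk"}, transactions))  # 0.75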

Lattice of attribute-subsets [diagram: the lattice of all subsets of {A, B, C, D}: A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD]

Apriori Algorithm: Breadth-first lattice traversal. On each iteration k, examine a candidate set C_k of sets of k attributes: count the support for all members of C_k (one pass of the database, requiring all k-subsets of each record to be examined); find the set L_k of sets with the required support; use this to determine C_{k+1}, the set of sets of size k+1 all of whose subsets are in L_k.
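A compact Python sketch of this loop (a simplification; all names are ours):

    from itertools import combinations

    def apriori(db, minsup):
        # Breadth-first lattice traversal: count C_k, keep L_k, generate C_{k+1}.
        items = sorted({i for t in db for i in t})
        candidates = [frozenset([i]) for i in items]            # C_1
        frequent, k = {}, 1
        while candidates:
            # One database pass: count support for every member of C_k.
            counts = {c: sum(c <= t for t in db) for c in candidates}
            L_k = {c: n for c, n in counts.items() if n >= minsup}
            frequent.update(L_k)
            # C_{k+1}: sets of size k+1 all of whose k-subsets are in L_k.
            unions = {a | b for a in L_k for b in L_k if len(a | b) == k + 1}
            candidates = [c for c in unions
                          if all(frozenset(s) in L_k for s in combinations(c, k))]
            k += 1
        return frequent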

Performance: Requires x+1 database passes (where x is the size of the largest frequent set). Candidate sets can become very large (especially if the database is dense). Examining the k-subsets of a record to identify all members of C_k present is time-consuming. So: unsatisfactory for databases with densely-packed records.

Computing support via partial support totals: Use a single database pass to count the sets present (not their subsets): this gives us m' partial support-counts (m' < m, the database size). Use this set of counts to compute the total support for subsets. Gains when records are duplicated (m' << m). More importantly: allows us to reorganise the data for efficient computation.
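A sketch of the idea in Python, using a flat Counter of distinct records (the paper organises these counts on a tree; this flat form is our simplification):

    from collections import Counter

    def partial_supports(db):
        # One database pass: count each distinct record once,
        # giving m' partial support-counts (m' <= m).
        return Counter(frozenset(t) for t in db)

    def total_support(itemset, partials):
        # Total support of `itemset`: sum the partial counts of
        # every distinct record that contains it.
        return sum(n for rec, n in partials.items() if itemset <= rec)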

Building the tree: For each record i in the database: find the set i on the tree, incrementing the support-count of all sets on the path to i; if the set is not present on the tree, create a node for it. The tree is built dynamically (size ~ m' rather than 2^n). Building the tree has already counted the support deriving from successor-supersets (leading to the interim support-count Q_i).
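A flat, tree-less sketch of what the interim counts mean (our illustration, assuming the items within each record are kept in a fixed lexicographic order; the real P-tree stores these counts only at its nodes):

    from collections import Counter

    def interim_supports(db):
        # Q_i = support contributed by i itself plus its successor-supersets,
        # i.e. by every record-set having i as a leading subset (prefix).
        records = Counter(tuple(sorted(t)) for t in db)
        Q = Counter()
        for rec, n in records.items():
            for k in range(1, len(rec) + 1):   # every prefix of the record
                Q[rec[:k]] += n
        return Q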

Set enumeration tree: The P-tree [diagram: a set-enumeration tree over {A, B, C, D} with an interim support-count stored at each node]

Set enumeration tree: The P-tree [diagram: another example P-tree with interim support-counts]

Dummy Nodes [diagrams: P-tree fragments showing dummy nodes created for shared leading subsets among A, AB, AC, AD, ABC, ACD, ABCD]

Calculating total support [diagram: P-tree over {A, B, C, D} with partial support-counts] i_TS = i_PS + sum(PS of the predecessor-superset nodes of i); e.g. B_TS = B_PS + AB_PS

Calculating total support [diagram: as above] D_TS = D_PS + CD_PS + BD_PS + BCD_PS + AD_PS + ACD_PS + ABD_PS + ABCD_PS

Computing total supports: The T-tree [diagram: a set-enumeration T-tree over {A, B, C, D}, holding the total supports]

Itemset Ordering: The advantage gained from partial computation is not equally distributed throughout the set of candidates. For candidates early in the lexicographic order, most of the support calculation is already complete. If we know the frequency of the single items, we can order the tree so that the most common items appear first, and thus reduce the effort required for total support counting.

Set enumeration tree: The P-tree [diagram: the P-tree rebuilt after reordering the items, with interim support-counts]
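A sketch of the reordering step (our illustration, not code from the paper):

    from collections import Counter

    def reorder(db):
        # Rank items by descending frequency, then rewrite each record with its
        # items in that order, so the most common items appear first and
        # frequent sets fall early in the lexicographic order, where most of
        # their support has already been accumulated.
        freq = Counter(i for t in db for i in t)
        rank = {item: r for r, (item, _) in enumerate(freq.most_common())}
        return [tuple(sorted(t, key=rank.__getitem__)) for t in db]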

Computing Total Supports: We have already computed the interim support Q_i for set i. Total support T_i = Q_i + sum of Q_j over all predecessor-supersets j of i.

Example [diagram: tree over {A, B, C, D}] To complete the total for BC, we need to add the support stored at ABC.

General summation algorithm: For each node j in the tree: for all sets i in the target set T: if i is a subset of j, and i is not a subset of the parent of j, add Q_j to the total for i.
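In Python, assuming the P-tree has been flattened to a mapping from each node's itemset to its interim count and its parent's itemset (a hypothetical representation chosen for illustration):

    def summation(ptree, targets):
        # ptree: {itemset j (frozenset): (Q_j, parent itemset or None)}
        # targets: the target set T of itemsets whose totals are wanted
        total = {i: 0 for i in targets}
        for j, (Q_j, parent) in ptree.items():
            for i in targets:
                # Add Q_j once, at the topmost node of j's branch containing i;
                # if i is also under the parent, it is counted there instead.
                if i <= j and (parent is None or not i <= parent):
                    total[i] += Q_j
        return total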

Example (2) [diagram: tree over {A, B, C, D}] Add the support stored at ABC to the supports for AC, BC and C. There is no need to add it to A or AB (already counted), or to B (which will have AB added, including ABC).

Modified algorithm: Problem: we still have 2^n totals to count. So use an Apriori-type algorithm: count C_1, C_2, etc. in repeated passes of the tree.

Algorithm Apriori-TFP (Total-from-Partial): For each node j in the P-tree: let i be the attribute of j that is not in its parent node; starting at node i of the T-tree, walk the tree until (the parent of) node j is reached, adding support to all subsets of j at the required level. On completion, prune the tree to remove unsupported sets. Generate the next level and repeat.
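A heavily simplified Python sketch of the overall flow: it differs from the plain Apriori sketch above only in counting each candidate against the partial (distinct-record) counts, and it omits the T-tree walk that localises the updates:

    from collections import Counter
    from itertools import combinations

    def apriori_tfp(db, minsup):
        partials = Counter(frozenset(t) for t in db)   # stand-in for the P-tree
        items = sorted({i for t in db for i in t})
        candidates = [frozenset([i]) for i in items]
        frequent, k = {}, 1
        while candidates:
            # Count candidates against distinct records, weighted by multiplicity.
            counts = {c: sum(n for rec, n in partials.items() if c <= rec)
                      for c in candidates}
            L_k = {c: n for c, n in counts.items() if n >= minsup}
            frequent.update(L_k)
            # Prune, then generate the next level.
            unions = {a | b for a in L_k for b in L_k if len(a | b) == k + 1}
            candidates = [c for c in unions
                          if all(frozenset(s) in L_k for s in combinations(c, k))]
            k += 1
        return frequent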

Illustration [diagram: T-tree over {A, B, C, D}] Pass 1: C is not supported, so do not add AC, BC, CD to the tree. Pass 2: (e.g.) an item from the P-tree is added to AD and BD (the tree is walked from D to BD).

Advantages: 1. Duplication in records reduces the size of the tree. 2. Fewer subsets to be counted: e.g., for a record of r attributes, Apriori counts r(r-1)/2 subset-pairs; our method only r-1. 3. The T-tree provides an efficient localisation of the candidates to be updated in Apriori-TFP.

Related Work: The FP-tree (Han et al.), developed contemporaneously, has similar properties, but: the FP-tree stores a single item only at each node (so more nodes); the FP-tree builds in more links to implement the FP-growth algorithm. Conversely, the P-tree is generic: Apriori-TFP is only one possible algorithm.

Experimental results (1): Size and construction time for the P-tree: almost independent of N (the number of attributes); scale linearly with M (the number of records); seem to scale linearly as database density increases; less than for the FP-tree (because of the extra nodes and links in the latter).

Experimental results (2): time to produce all frequent sets [graph: dataset T25.I10.N1K.D10K]

Continuing work: Optimise using the item ordering heuristic (as used in FP-growth). Explore other algorithms (e.g. Partition) applied to the P-tree. Hybrid methods, using different algorithms for subtrees (exhaustive methods may be effective for small, very densely-populated subtrees).