Mining Top-K Association Rules. Philippe Fournier-Viger 1 Cheng-Wei Wu 2 Vincent Shin-Mu Tseng 2. University of Moncton, Canada


Transcription:

Mining Top-K Association Rules
Philippe Fournier-Viger (1), Cheng-Wei Wu (2), Vincent Shin-Mu Tseng (2)
(1) University of Moncton, Canada; (2) National Cheng Kung University, Taiwan
AI 2012, 28 May 2012

Introduction
A fundamental data mining task with wide applications is association rule mining [Agrawal93]. It consists of finding interesting associations between items in a transaction database.
A transaction database is a set of transactions, where each transaction is a set of items (an itemset).
[Figure: example transaction database]

Association Rules
Association rule: an association X → Y between two itemsets X and Y.
Support: the percentage of transactions of a database in which the rule occurs.
Confidence: the support of the rule divided by the support of its antecedent.
For example: {a, b} → {c}, support = 0.5, confidence = 0.66.

Association Rule Mining
Goal: discover all rules whose support and confidence are greater than or equal to the user-defined thresholds minsup and minconf, respectively.
How?
1. Mine frequent itemsets with an algorithm such as Apriori, FPGrowth or H-Mine.
2. For each frequent itemset X, select an itemset P ⊂ X to generate a rule of the form P → X \ P. Calculate its support and confidence.
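The two measures defined above can be illustrated with a minimal Python sketch. The four-transaction database below is hypothetical (the slide's own database is not part of this transcription); it was chosen only so that {a, b} → {c} reproduces the quoted support of 0.5 and a confidence of 2/3. The function names are illustrative as well.

```python
from fractions import Fraction

# Hypothetical database (an assumption, not taken from the slides).
DATABASE = [
    {"a", "b", "c", "e", "f", "g"},
    {"a", "b", "c", "d", "e", "f"},
    {"a", "b", "e", "f"},
    {"b", "f", "g"},
]


def support(itemset, database):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for transaction in database if itemset <= transaction)
    return Fraction(hits, len(database))


def rule_support(antecedent, consequent, database):
    """Support of the rule antecedent -> consequent."""
    return support(set(antecedent) | set(consequent), database)


def rule_confidence(antecedent, consequent, database):
    """Rule support divided by the support of the antecedent.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    return (rule_support(antecedent, consequent, database)
            / support(antecedent, database))


if __name__ == "__main__":
    print(rule_support({"a", "b"}, {"c"}, DATABASE))     # 1/2
    print(rule_confidence({"a", "b"}, {"c"}, DATABASE))  # 2/3
```

Exact fractions are used so that the confidence appears as 2/3 rather than a rounded decimal.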

Example
minsup = 0.5 and minconf = 0.5
[Figure: a transaction database and some of the association rules found]

How to choose the thresholds?
If they are set too high, too few results are generated and valuable information is omitted.
If they are set too low, algorithms can generate an extremely large number of results and become very slow.
This is a major problem because users have limited resources for analyzing the results, and fine-tuning the parameters is time-consuming.

The problem of setting minsup
Scenario: a user wants to discover the top 1000 rules from a database and does not want to find more than 2000 rules.
For the Connect dataset, the range of minsup values that satisfies the user is 0.5052 to 0.5060.
A user with no a priori knowledge of the database has only a 0.08 % chance of selecting a minsup value in that range.
Too high: not enough rules. Too low: performance deteriorates.

Our proposal
Redefine the problem of association rule mining as mining the top-k association rules.
Two parameters: k, the number of rules to be generated, and minconf.
Related work: some works have used the term "top-k association rules", but they apply it to mining streams or to mining non-standard rules [Webb05, You2010].
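The 0.08 % figure can be recovered with a one-line calculation, assuming the user picks minsup uniformly at random from the interval [0, 1]. That uniform-choice assumption is not stated on the slide; it is simply the reading that reproduces the quoted number.

```latex
\[
  P(\text{satisfying minsup})
  = \frac{0.5060 - 0.5052}{1 - 0}
  = 0.0008
  = 0.08\,\%
\]
```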

Challenges
An algorithm for top-k association rule mining cannot use minsup to prune the search space.
Large search space: with d different items, there can be up to 3^d − 2^(d+1) + 1 rules.
Adapting the two-step process for association rule mining would be inefficient because the first step would have to mine itemsets with minsup = 0.
So how to define an algorithm?

TopKRules
Find larger rules by recursively scanning the database to add a single item at a time to the left or right part of each rule (these processes are called left and right expansions).
[Figure: expansion diagram showing the rules {a} → {b}, {a} → {c}, {a, c} → {b}, {a} → {b, e}, {a} → {b, f} and {a} → {b, e, f}]
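To give a feel for how quickly the search space grows, the sketch below counts rules X → Y with non-empty, disjoint X and Y by brute force and compares the result with the standard closed form 3^d − 2^(d+1) + 1. The function names are illustrative, and the brute-force enumeration is only practical for small d.

```python
from itertools import combinations


def count_rules_brute_force(d):
    """Count rules X -> Y with X and Y non-empty and disjoint, over d items."""
    items = range(d)
    total = 0
    for x_size in range(1, d):
        for x in combinations(items, x_size):
            rest = [i for i in items if i not in x]
            # Every non-empty subset of the remaining items is a valid consequent.
            for y_size in range(1, len(rest) + 1):
                total += sum(1 for _ in combinations(rest, y_size))
    return total


def count_rules_formula(d):
    """Closed form for the same count."""
    return 3 ** d - 2 ** (d + 1) + 1


if __name__ == "__main__":
    for d in range(1, 9):
        print(d, count_rules_brute_force(d), count_rules_formula(d))
```

With only 6 items there are already 602 possible rules, which is why exploring the space without any support threshold is impractical.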

TopKRules (2): Main idea
Set minsup = 0.
Use left/right expansions for rule generation.
Keep a set L that contains the top-k rules found so far.
When k rules have been found, raise minsup to the lowest support of the rules in L.
After that, raise the minsup threshold each time a rule is added to L.

TopKRules (3)
The resulting algorithm has poor execution time because the search space is too large.
Observation: if rules with higher support are found first, minsup can be raised more quickly and more of the search space can be pruned.
How to define which rule is the most promising? Our experiments show that the support is a good choice.
We added a set R containing the k rules having the highest support.
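The ideas on these two slides can be put together in a small Python sketch. It is a simplified illustration, not the authors' implementation: it recomputes supports by rescanning the whole database, avoids duplicate rules with a `seen` set, keeps all pending candidates rather than only k of them, and uses Python's binary heaps (`heapq`) for both L and R. All names are hypothetical.

```python
import heapq
from itertools import count, permutations


def rule_stats(x, y, database):
    """Return (support, confidence) of the rule x -> y."""
    n_xy = sum(1 for t in database if x | y <= t)
    n_x = sum(1 for t in database if x <= t)
    sup = n_xy / len(database)
    conf = n_xy / n_x if n_x else 0.0
    return sup, conf


def top_k_rules(database, k, minconf):
    items = sorted(set().union(*database))
    minsup = 0.0        # starts at 0 and is raised as rules are found
    tie = count()       # tie-breaker so heaps never compare frozensets
    top = []            # L: min-heap of (support, tie, x, y, confidence)
    candidates = []     # R: max-heap on support (stored negated)
    seen = set()        # simplification: remember rules already generated

    def consider(x, y):
        nonlocal minsup
        if (x, y) in seen:
            return
        seen.add((x, y))
        sup, conf = rule_stats(x, y, database)
        if sup == 0 or sup < minsup:
            return      # an expansion can never increase the support
        heapq.heappush(candidates, (-sup, next(tie), x, y))
        if conf >= minconf:
            heapq.heappush(top, (sup, next(tie), x, y, conf))
            if len(top) > k:
                heapq.heappop(top)
            if len(top) == k:
                minsup = top[0][0]   # lowest support among the rules in L

    # Start from all rules with a single item on each side.
    for a, b in permutations(items, 2):
        consider(frozenset([a]), frozenset([b]))

    # Expand the highest-support candidate first.
    while candidates:
        neg_sup, _, x, y = heapq.heappop(candidates)
        if -neg_sup < minsup:
            break       # every remaining candidate has lower support
        for item in items:
            if item in x or item in y:
                continue
            consider(x | {item}, y)   # left expansion
            consider(x, y | {item})   # right expansion

    return sorted(((x, y, sup, conf) for sup, _, x, y, conf in top),
                  key=lambda rule: -rule[2])


if __name__ == "__main__":
    # Hypothetical database, used only for illustration.
    db = [set("abcefg"), set("abcdef"), set("abef"), set("bfg")]
    for x, y, sup, conf in top_k_rules(db, k=2, minconf=0.8):
        print(sorted(x), "->", sorted(y), sup, conf)
```

Rules that fail minconf are still kept as candidates, because adding an item to the left side can raise the confidence even though it can never raise the support.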

TopKRules (4): Optimizations
The choice of data structures for implementing L and R is also very important:
L: Fibonacci heap, with O(1) amortized time for insertion and for finding the minimum, and O(log(n)) amortized time for deletion.
R: red-black tree, with O(log(n)) worst-case time for insertion, deletion, minimum and maximum.
Merging database scans.

Experimental evaluation
4 real-life datasets and 1 synthetic dataset.

Results: influence of k
Execution time and maximum memory usage grow linearly with k.

Results: influence of minconf
Execution time and memory usage increase when minconf increases, because more rules have to be generated. How much they increase depends on the dataset.

Results: influence of database size
Execution time and memory usage increase slowly if the number of rules stays more or less the same.

Performance comparison
We compared TopKRules with an implementation of association rule generation based on FPGrowth, run with an optimally chosen minsup.
[Figure: runtime and maximum memory usage]

Performance comparison (2)
When minsup is chosen optimally, FPGrowth has better performance.
However, setting minsup is very difficult.
If minsup is set too high, FPGrowth finds too few rules or none at all.
If minsup is set too low, too many rules are found and performance deteriorates.

Conclusion
We proposed an algorithm that lets the user set k, the number of rules to be found.
Excellent scalability: execution time increases linearly with k.
The algorithm runs within reasonable time and memory limits for k values of up to 5000 on all datasets.

Thank you. Questions?
Thanks to the FQRNT funding programs.
Open source Java data mining software, 46 algorithms: http://www.phillippe-fournier-viger.com/spmf/