Efficient Mining of Top-K Sequential Rules

Similar documents
Mining Top-K Association Rules. Philippe Fournier-Viger 1 Cheng-Wei Wu 2 Vincent Shin-Mu Tseng 2. University of Moncton, Canada

TKS: Efficient Mining of Top-K Sequential Patterns

TNS: Mining Top-K Non-Redundant Sequential Rules Philippe Fournier-Viger 1 and Vincent S. Tseng 2 1

CMRules: Mining Sequential Rules Common to Several Sequences


FHM: Faster High-Utility Itemset Mining using Estimated Utility Co-occurrence Pruning

CMRULES: An Efficient Algorithm for Mining Sequential Rules Common to Several Sequences

ERMiner: Sequential Rule Mining using Equivalence Classes

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

Mining Maximal Sequential Patterns without Candidate Maintenance

FHM: Faster High-Utility Itemset Mining using Estimated Utility Co-occurrence Pruning

PFPM: Discovering Periodic Frequent Patterns with Novel Periodicity Measures

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

Data Mining for Knowledge Management. Association Rules

Chapter 4: Mining Frequent Patterns, Associations and Correlations

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

A Graph-Based Approach for Mining Closed Large Itemsets

VMSP: Efficient Vertical Mining of Maximal Sequential Patterns

Efficient Mining of High-Utility Sequential Rules

International Journal of Scientific Research and Reviews

Chapter 7: Frequent Itemsets and Association Rules

Association Rule Mining. Entscheidungsunterstützungssysteme

Association Rule Mining

A Survey of Sequential Pattern Mining

EFIM: A Fast and Memory Efficient Algorithm for High-Utility Itemset Mining

CS145: INTRODUCTION TO DATA MINING

An Approach To Build Sequence Database From Web Log Data For Webpage Access Prediction

Frequent Itemsets Melange

Sequential Pattern Mining Methods: A Snap Shot

Chapter 6: Association Rules

Performance Based Study of Association Rule Algorithms On Voter DB

Association mining rules

USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS

Chapter 4: Association analysis:

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

[Shingarwade, 6(2): February 2019] ISSN DOI /zenodo Impact Factor

Association Rule Mining: FP-Growth

Data Mining Part 3. Associations Rules

Interestingness Measurements

Performance evaluation of top-k sequential mining methods on synthetic and real datasets

Optimization using Ant Colony Algorithm

MEIT: Memory Efficient Itemset Tree for Targeted Association Rule Mining

Web Usage Mining. Overview Session 1. This material is inspired from the WWW 16 tutorial entitled Analyzing Sequential User Behavior on the Web

CARPENTER Find Closed Patterns in Long Biological Datasets. Biological Datasets. Overview. Biological Datasets. Zhiyu Wang

Medical Data Mining Based on Association Rules

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Data Mining: Mining Association Rules. Definitions. .. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..

Part 2. Mining Patterns in Sequential Data

RHUIET : Discovery of Rare High Utility Itemsets using Enumeration Tree

Nesnelerin İnternetinde Veri Analizi

An Efficient Algorithm for finding high utility itemsets from online sell

gspan: Graph-Based Substructure Pattern Mining

Data Mining: Concepts and Techniques. Chapter Mining sequence patterns in transactional databases

FP-Growth algorithm in Data Compression frequent patterns

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

A Comprehensive Survey on Sequential Pattern Mining

Tutorial on Association Rule Mining

Chapter 7: Frequent Itemsets and Association Rules

Knowledge Discovery in Databases II Winter Term 2015/2016. Optional Lecture: Pattern Mining & High-D Data Mining

Generation of Potential High Utility Itemsets from Transactional Databases

DATA MINING II - 1DL460

Data warehouse and Data Mining

Minig Top-K High Utility Itemsets - Report

DISCOVERING ACTIVE AND PROFITABLE PATTERNS WITH RFM (RECENCY, FREQUENCY AND MONETARY) SEQUENTIAL PATTERN MINING A CONSTRAINT BASED APPROACH

FOSHU: Faster On-Shelf High Utility Itemset Mining with or without Negative Unit Profit

Salah Alghyaline, Jun-Wei Hsieh, and Jim Z. C. Lai

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar

An Algorithm for Mining Large Sequences in Databases

Maintenance of the Prelarge Trees for Record Deletion

Chapter 2. Related Work

Mining Associated Ranking Patterns from Wireless Sensor Networks

Frequent Pattern Mining

Chapter 13, Sequence Data Mining

Research and Application of E-Commerce Recommendation System Based on Association Rules Algorithm

A mining method for tracking changes in temporal association rules from an encoded database

A NOVEL ALGORITHM FOR MINING CLOSED SEQUENTIAL PATTERNS

A Fast Algorithm for Data Mining. Aarathi Raghu Advisor: Dr. Chris Pollett Committee members: Dr. Mark Stamp, Dr. T.Y.Lin

International Journal of Computer Engineering and Applications,

A Study on Mining of Frequent Subsequences and Sequential Pattern Search- Searching Sequence Pattern by Subset Partition

Effective Mining Sequential Pattern by Last Position Induction

WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

Temporal Weighted Association Rule Mining for Classification

Non-redundant Sequential Association Rule Mining. based on Closed Sequential Patterns

SLPMiner: An Algorithm for Finding Frequent Sequential Patterns Using Length-Decreasing Support Constraint Λ

CS490D: Introduction to Data Mining Prof. Chris Clifton

Appropriate Item Partition for Improving the Mining Performance

Sensitive Rule Hiding and InFrequent Filtration through Binary Search Method

Machine Learning: Symbolische Ansätze

Mining Frequent Patterns without Candidate Generation

Value Added Association Rules

Association Rule Discovery

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

Efficient Algorithm for Mining High Utility Itemsets from Large Datasets Using Vertical Approach

Association Rule Discovery

An Improved Apriori Algorithm for Association Rules

Sequential PAttern Mining using A Bitmap Representation

Transcription:

Session 3A 14:00 FIT 1-315 Efficient Mining of Top-K Sequential Rules Philippe Fournier-Viger 1 Vincent Shin-Mu Tseng 2 1 University of Moncton, Canada 2 National Cheng Kung University, Taiwan 18 th December 2011 ADMA 2011 - Beijing

Sequence Database A set of sequences. Each sequence is an ordered list of transactions. A transaction is an unordered set of items (symbols), considered to occur simultaneously. Ex.: click-stream data, market basket analysis, stock market data, bioinformatics, e-learning 2

Sequential Pattern Mining SPM finds subsequences that are common to more than minsup sequences (Spade, PrefixSpan, GSP...). Ex.: {b}, {f}, {e} is a seq. pattern with support = 50 %. However, this pattern is misleading because {b}, {f} also appear 50 % of the time without {e} following. To solve this problem, we would need to consider the confidence of sequential patterns. 3

Sequential Rule Mining A sequential rule typically has the form X Y and has a confidence and a support. Algorithms for mining seq. rules in a (1) single sequence, (2) across sequences or (3) common to multiple sequences. Several applications: stock market analysis (Das et al., 1998; Hsieh et al., 2006), weather observation (Hamilton & Karimi, 2005), drought management (Harms et al. 2002), alarm analysis, etc. 4

Mining sequential rule common to several sequences RuleGen proposed by Zaki (2001). Rules of the form X Y where X and Y are sequential patterns. Ex.: {a},{b},{c} {d, e} Problem: rules can be too specific. unlikely to match with a new sequence, can have a low support value because they are very specific, despite that they may describe a common situation. - 5

The definition used in this paper A sequential rule X Y is a relationship between two disjoint and unordered itemsets X,Y. A sequential rule X Y has two properties: Support: the number of sequences where X occurs before Y, divided by the number of sequences. Confidence the number of sequences where X occurs before Y, divided by the number of sequences where X occurs. The task: finding all rules with a support and confidence not less than user-defined thresholds minsup and minconf. 6

Example minsup= 0.5 and minconf= 0.5: A sequence database Some rules found 7

Current Algorithms CMRules: An association rule mining based algorithm for the discovery of sequential rules. CMDeo: An Apriori based algorithm for the discovery of sequential rules. RuleGrowth: A pattern-growth based algorithm. (current best) 8

RuleGrowth Find larger rules by recursively scanning the database for adding a single item at a time to the left or right part of each rule (these processes are called left and right expansions). {a} {b}. {} {a} {c}. {a, c} {b}. {a} {b, e}. {a} {b, e, f}. {a} {b, f}.

The problem of setting minsup Scenario: A user wants to discover the top 1000 rules from a database and do not want to find more than 2000 rules. For BMSWebView1, the range of minsup values that will satisfy the user is 0.0011 to 0.0009. A user having no a priori knowledge of the database has only a 0.02 % chance of selecting a minsup value that will make him satisfied. Too high, not enough rules. Too low, performance deteriorates. 10

Main idea set minsup = 0. TopSeqRules use the RuleGrowth strategy for rule generation. keep a set L that contains the current top-k rules found until now. when k rules are found, raise minsup to the lowest support of the rules in L. after that, for each rules added to L, raise the minsup threshold. 11

TopSeqRules (2) The resulting algorithm has poor execution time because the search space is too large. Observation: if we can find rules with higher support first, we can raise minsup more quickly and prune the search space. How to define what is the most promising? Our experiment show that the support is a good choice. We added a set R containing the k rules having the highest support. 12

TopSeqRules (3) - optimizations We found that the choice of data structures for implementing L and R is also very important: L : fibonnaci heap : O(1) amortized time for insertion and minimum, and O(log(n)) for deletion. R: red-black tree: O(log(n)) worst case time complexity for insertion, deletion, min and max. Merging database scans 13

Experimental evaluation Three real-life datasets: 14

Results influence of k Execution time an maximum memory usage grow linearly with k. 15

Results influence of minconf Execution time and memory usage increase exponentially when minconf increases because more rules have to be generated. 16

Results influence of database size Execution time and memory increases slowly if the number of rules stay more or less the same. 17

Performance comparison 18

Performance comparison (2) When minsup is chosen optimally, RuleGrowth has slightly better performance. However, setting minsup is very difficult. If minsup is set too low, RuleGrowth will not find any rule. If minsup is set too high, too many rules will be found and the performance deterioates 19

Conclusion We proposed an algorithm that let the user set k, the number of rules rules to be found. Excellent scalability: execution time linearly increases with k. the algorithm has no problem running in reasonable time and memory limits for k values of up to 5000 for all datasets. 20

Thank you. Questions? Thanks to the organizers of ADMA 2011! Thanks to the FQRNT funding programs. Open source Java data mining software, 43 algorithms http://www.phillippe-fournier-viger.com/spmf/