Frequent Pattern Mining with Uncertain Data


Charu C. Aggarwal (1), Yan Li (2), Jianyong Wang (2), Jing Wang (3)
(1) IBM T. J. Watson Research Center  (2) Tsinghua University  (3) New York University
Frequent Pattern Mining with Uncertain Data, ACM KDD Conference, 2009

Introduction
- Uncertainty is everywhere: errors in instrumentation, derived data sets, links between privacy and uncertain data mining, intentionally incorporated uncertainty
- The results of data mining algorithms are highly impacted by the uncertainty
- The frequent pattern mining problem is one for which performance is significantly impacted by uncertain representations

Problem Definition for Uncertain Data
- Associate existential probabilities with the items in transactions
- The probability of presence of item i in transaction T_k is p(i, T_k)
- The expected support of itemset I in T_k is p(I, T_k) = ∏_{i ∈ I} p(i, T_k)
- The expected support of itemset I is Σ_k p(I, T_k)
- Definition: determine all frequent patterns with expected support above a user-defined threshold
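Under the item-independence assumption, the expected support defined above can be computed directly. A minimal sketch, assuming each uncertain transaction is represented as a dict mapping item to existential probability (a representation chosen here for illustration, not the paper's data structure):

```python
from math import prod

def expected_support(itemset, transactions):
    """Expected support of an itemset over an uncertain database.

    Each transaction maps item -> existential probability p(i, T_k).
    Under independence, p(I, T_k) is the product of p(i, T_k) over i in I,
    and the expected support of I is the sum of p(I, T_k) over all k.
    """
    return sum(
        prod(t[i] for i in itemset) if all(i in t for i in itemset) else 0.0
        for t in transactions
    )

# Two uncertain transactions over items a and b.
db = [{"a": 0.9, "b": 0.5}, {"a": 0.4, "b": 1.0}]
print(expected_support({"a", "b"}, db))  # 0.9*0.5 + 0.4*1.0 = 0.85
```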

Deterministic Algorithm Classes
- Candidate generate-and-test algorithms: join-based, tree-based
- Pattern growth algorithms: H-Mine, FP-Growth

Contributions
- Discuss extensions of broad classes of frequent pattern mining algorithms
- Compare the broad classes of frequent pattern mining algorithms
- Stress-test on a computationally difficult case: high uncertainty probabilities
- Memory is an important resource in the uncertain case: test memory requirements

Key Takeaways
- Algorithms which work well on deterministic data (FP-Growth) may not work as well on uncertain data
- Pruning tricks which work for low uncertainty probabilities become an overhead in the case of high uncertainty probabilities
- The pattern-growth paradigm can be leveraged if it is used in the proper context
- Extensions of the H-Mine algorithm turn out to be the most effective in terms of the combination of memory and computational requirements

Apriori Extensions
- Standard candidate generate-and-test can be extended directly; the main difference lies in the counting
- Chui et al. proposed several pruning techniques
- Transaction trimming methods: the key is pruning infrequent items
- Support pruning methods: compute upper bounds on the expected support of itemsets, and prune when they fall below the minimum support
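The extension described above changes only the counting step: each transaction contributes p(I, T_k) instead of 0 or 1, while candidate generation and the Apriori subset pruning are unchanged (the downward-closure property also holds for expected support). A simplified level-wise sketch, assuming the same dict-per-transaction representation as before and a plain union-based join (the paper's pruning techniques are not implemented here):

```python
from itertools import combinations
from math import prod

def uapriori(transactions, min_esup):
    """Level-wise mining of frequent itemsets by expected support (a sketch)."""
    def esup(itemset):
        # Each transaction contributes its itemset probability, not 0/1.
        return sum(
            prod(t[i] for i in itemset) if all(i in t for i in itemset) else 0.0
            for t in transactions
        )

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if esup([i]) >= min_esup]
    frequent = list(level)
    k = 2
    while level:
        # Join step: unions of frequent (k-1)-itemsets that form k-itemsets.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Apriori pruning: all (k-1)-subsets must themselves be frequent.
        candidates = [c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
        level = [c for c in candidates if esup(c) >= min_esup]
        frequent.extend(level)
        k += 1
    return frequent

db = [{"a": 0.9, "b": 0.8}, {"a": 0.7, "b": 0.6}, {"b": 1.0}]
print(uapriori(db, min_esup=1.0))  # singletons a and b, plus {a, b}
```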

Tree-Based Generate-and-Test Algorithms
- Tree-based algorithms generate a trie of candidate itemsets
- Can be generalized directly to the uncertain case
- Pruning conditions for the deterministic case hold for the uncertain case
- Projected databases can be constructed as in the deterministic case, except that the uncertainty probabilities also need to be maintained

The FP-Tree Technique: Challenges
- The FP-Tree technique generates a compressed representation of the database by sharing information about prefixes
- Uncertain challenge: the prefixes contain probability information that is specific to each transaction, which implies that effective sharing is not possible

Straightforward Solution
- Treat each distinct probability as a separate node, with no sharing between two transactions that contain the same item with distinct probabilities (Leung et al.)
- Criticism: effective only if many items have exactly the same probability value
- Otherwise the compression of the FP-Tree is poor, leading to too much overhead
- In a continuous probability domain, the assumption of exactly equal probability values is not reasonable

Our Solution
- Create clustered ranges of probabilities
- Construct a node for each clustered range (allows some node sharing)
- Use the FP-Tree algorithm to generate a close superset of the frequent itemsets
- Key: prove an upper-bound property of the expected supports
- Remove irrelevant itemsets in a final pass
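The upper-bound property above is what makes the clustered tree safe: if every probability in a cluster range is replaced by the range's maximum, every expected support is over-estimated, so mining on the clustered values can only produce a superset of the true frequent itemsets, and one exact pass removes the false positives. A toy illustration, assuming k equal-width ranges over (0, 1] (the equal-width scheme and k = 4 are assumptions for illustration; the paper's clustering of probability values may differ):

```python
from math import prod, ceil

def cluster_upper(p, k=4):
    """Round a probability up to the top of its cluster range.

    With k equal-width ranges over (0, 1], p falls in ((j-1)/k, j/k]
    and is replaced by j/k >= p.
    """
    return ceil(p * k) / k

def esup(itemset, db, f=lambda p: p):
    # Expected support, optionally computed from transformed probabilities.
    return sum(prod(f(t[i]) for i in itemset) if all(i in t for i in itemset) else 0.0
               for t in db)

db = [{"a": 0.62, "b": 0.55}, {"a": 0.71, "b": 0.30}]
true_val = esup({"a", "b"}, db)               # exact expected support
upper = esup({"a", "b"}, db, cluster_upper)   # computed from cluster maxima
assert upper >= true_val                      # the upper-bound property
print(true_val, upper)
```

Any itemset whose clustered upper bound falls below the minimum expected support can be pruned without ever computing its exact support.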

Two Variants
- UFP-growth algorithm: adopts the recursive pattern-growth method used in FP-growth
- UCFP-growth algorithm: constructs only the conditional FP-Tree for each frequent item at the first level, and mines frequent itemsets from each conditional tree

Observations
- The key selling point of the FP-Tree is transaction database compression by information sharing: not effective in the uncertain environment
- Another selling point is the use of the pattern-growth paradigm
- Is it possible to leverage the pattern-growth paradigm without worrying about the node-sharing issue of the FP-Tree?
- Solution: extend H-Mine

Uncertain Extension of H-Mine
- The H-Mine structure uses the linkage behavior among transactions corresponding to a branch of the FP-Tree, without actually creating a projected database
- Uncertain extension: maintain the item probabilities in the original database, and use the linkage behavior to traverse the database efficiently
- Prefix probabilities can be computed on the fly using the information associated with the original transaction
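The on-the-fly prefix probabilities can be sketched as follows: instead of materializing projected databases, each recursion carries only (transaction index, prefix probability) pairs over the original database, and extends the prefix probability by one multiplication per item. This is a simplified pattern-growth sketch of the idea, not the paper's UH-Mine implementation (which uses H-Mine's hyperlink structure rather than explicit index lists); lexicographic item order is assumed:

```python
from collections import defaultdict

def uh_mine(db, min_esup):
    """Pattern growth with prefix probabilities computed on the fly (a sketch)."""
    results = {}

    def grow(prefix, entries):
        # entries: (transaction index, probability of prefix in that transaction)
        ext = defaultdict(list)
        for k, p in entries:
            for item in db[k]:
                # Extend only in lexicographic order so each itemset is built once.
                if not prefix or item > prefix[-1]:
                    ext[item].append((k, p * db[k][item]))
        for item in sorted(ext):
            s = sum(p for _, p in ext[item])  # expected support of the extension
            if s >= min_esup:
                newset = prefix + (item,)
                results[newset] = s
                grow(newset, ext[item])

    # Every transaction initially contains the empty prefix with probability 1.
    grow((), [(k, 1.0) for k in range(len(db))])
    return results

db = [{"a": 0.9, "b": 0.5}, {"a": 0.4, "b": 1.0}]
print(uh_mine(db, min_esup=0.8))  # expected supports of a, b, and (a, b)
```

The item probabilities stay in the original transactions; no copies of the data are made during the recursion.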

Observations (UH-Mine)
- Overall effect: uses the linkages to traverse the transaction set effectively, without worrying about the information sharing of the FP-Tree
- This approach is better than the FP-Tree even in the deterministic case, when the compression from the FP-Tree is not high
- This turns out to be particularly true in the uncertain case

Experimental Results
- Datasets: Connect4, kosarak, and T40.I10.D100K
- Generate dense uncertainty probabilities
- A difficult case, in which a rapid fall-off of probabilities with increasing pattern length is not available

T40.I10.D100K
[Figure: running time (in seconds) and memory (in % of 2 GB) vs. support threshold (0.1%-5%) for UCP-Apriori, UApriori, UH-mine, UFP-tree, and UCFP-tree]

Scalability
[Figure: running time (in seconds) and memory (in % of 2 GB) vs. number of transactions (20K-320K) for UCP-Apriori, UApriori, UH-mine, UFP-tree, and UCFP-tree]

Connect4
[Figure: running time (in seconds) and memory (in % of 2 GB) vs. support threshold (0.2-0.8) for UCP-Apriori, UApriori, UH-mine, UFP-tree, and UCFP-tree]

Kosarak
[Figure: running time (in seconds) and memory (in % of 2 GB) vs. support threshold (0.03%-1%) for UCP-Apriori, UApriori, UH-mine, UFP-tree, and UCFP-tree]

Conclusions
- Algorithms which work well on deterministic data (FP-Growth) may not work as well on uncertain data
- Pruning tricks which work for low uncertainty probabilities become an overhead in the case of high uncertainty probabilities
- Extensions of the H-Mine algorithm turn out to be the most effective in terms of the combination of memory and computational requirements