UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

Similar documents
Frequent Pattern Mining with Uncertain Data

Appropriate Item Partition for Improving the Mining Performance

Mining Frequent Itemsets from Uncertain Databases using probabilistic support

Mining High Average-Utility Itemsets

A NOVEL ALGORITHM FOR MINING CLOSED SEQUENTIAL PATTERNS

Sequential Pattern Mining Methods: A Snap Shot

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

DISCOVERING ACTIVE AND PROFITABLE PATTERNS WITH RFM (RECENCY, FREQUENCY AND MONETARY) SEQUENTIAL PATTERN MINING A CONSTRAINT BASED APPROACH

A Study on Mining of Frequent Subsequences and Sequential Pattern Search- Searching Sequence Pattern by Subset Partition

Sequential Pattern Mining: A Survey on Issues and Approaches

Efficient Mining of Top-K Sequential Rules

Sequential PAttern Mining using A Bitmap Representation

Comparative Study of Subspace Clustering Algorithms

An Improved Apriori Algorithm for Association Rules

Upper bound tighter Item caps for fast frequent itemsets mining for uncertain data Implemented using splay trees. Shashikiran V 1, Murali S 2

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

Mining Frequent Itemsets Along with Rare Itemsets Based on Categorical Multiple Minimum Support

Review Paper Approach to Recover CSGM Method with Higher Accuracy and Less Memory Consumption using Web Log Mining

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA

An Effective Process for Finding Frequent Sequential Traversal Patterns on Varying Weight Range

An Algorithm for Frequent Pattern Mining Based On Apriori

Medical Data Mining Based on Association Rules

Vertical Mining of Frequent Patterns from Uncertain Data

Mining of Web Server Logs using Extended Apriori Algorithm

Improving Efficiency of Apriori Algorithms for Sequential Pattern Mining

A Comprehensive Survey on Sequential Pattern Mining

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

Mining Associated Ranking Patterns from Wireless Sensor Networks

Keshavamurthy B.N., Mitesh Sharma and Durga Toshniwal

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases

International Journal of Scientific Research and Reviews

Comparing the Performance of Frequent Itemsets Mining Algorithms

Fast Accumulation Lattice Algorithm for Mining Sequential Patterns

USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS

SLPMiner: An Algorithm for Finding Frequent Sequential Patterns Using Length-Decreasing Support Constraint Λ

Review of Algorithm for Mining Frequent Patterns from Uncertain Data

TKS: Efficient Mining of Top-K Sequential Patterns

Applying Data Mining to Wireless Networks

Improved Frequent Pattern Mining Algorithm with Indexing

Web page recommendation using a stochastic process model

Maintenance of fast updated frequent pattern trees for record deletion

PTclose: A novel algorithm for generation of closed frequent itemsets from dense and sparse datasets

MINING FREQUENT MAX AND CLOSED SEQUENTIAL PATTERNS

Incremental Mining of Frequent Patterns Without Candidate Generation or Support Constraint

A Graph-Based Approach for Mining Closed Large Itemsets

Mining Top-K Association Rules. Philippe Fournier-Viger 1 Cheng-Wei Wu 2 Vincent Shin-Mu Tseng 2. University of Moncton, Canada

CARPENTER Find Closed Patterns in Long Biological Datasets. Biological Datasets. Overview. Biological Datasets. Zhiyu Wang

DATA MINING II - 1DL460

Associating Terms with Text Categories

Concurrent Processing of Frequent Itemset Queries Using FP-Growth Algorithm

Probabilistic Graph Summarization

An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction

Part 2. Mining Patterns in Sequential Data

FHM: Faster High-Utility Itemset Mining using Estimated Utility Co-occurrence Pruning

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery

PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth

Temporal Weighted Association Rule Mining for Classification

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

SeqIndex: Indexing Sequences by Sequential Pattern Analysis

Mining Maximal Sequential Patterns without Candidate Maintenance

APPLYING BIT-VECTOR PROJECTION APPROACH FOR EFFICIENT MINING OF N-MOST INTERESTING FREQUENT ITEMSETS

Performance Analysis of Frequent Closed Itemset Mining: PEPP Scalability over CHARM, CLOSET+ and BIDE

Mining Quantitative Association Rules on Overlapped Intervals

2. Department of Electronic Engineering and Computer Science, Case Western Reserve University

An Efficient Algorithm for finding high utility itemsets from online sell

A Fast Algorithm for Data Mining. Aarathi Raghu Advisor: Dr. Chris Pollett Committee members: Dr. Mark Stamp, Dr. T.Y.Lin

A Hierarchical Document Clustering Approach with Frequent Itemsets

Frequent Pattern Mining with Uncertain Data

An Algorithm for Mining Large Sequences in Databases

FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN. School of Computing, SASTRA University, Thanjavur , India

WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity

Data Mining for Knowledge Management. Association Rules

AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR UTILITY MINING. Received April 2011; revised October 2011

Sequential Pattern Mining A Study

Parallel Mining of Maximal Frequent Itemsets in PC Clusters

A Survey of Sequential Pattern Mining

A Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases *

Tutorial on Assignment 3 in Data Mining 2009 Frequent Itemset and Association Rule Mining. Gyozo Gidofalvi Uppsala Database Laboratory

A Comparative study of CARM and BBT Algorithm for Generation of Association Rules

Discover Sequential Patterns in Incremental Database

Data Mining Techniques

Using Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment

Parallelizing Frequent Itemset Mining with FP-Trees

A Quantified Approach for large Dataset Compression in Association Mining

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM

Item Set Extraction of Mining Association Rule

Comparative Study of Techniques to Discover Frequent Patterns of Web Usage Mining

ClaSP: An Efficient Algorithm for Mining Frequent Closed Sequences

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

DESIGN AND CONSTRUCTION OF A FREQUENT-PATTERN TREE

IMPLEMENTATION AND COMPARATIVE STUDY OF IMPROVED APRIORI ALGORITHM FOR ASSOCIATION PATTERN MINING

Efficient Remining of Generalized Multi-supported Association Rules under Support Update

Improved Algorithm for Frequent Item sets Mining Based on Apriori and FP-Tree

ML-DS: A Novel Deterministic Sampling Algorithm for Association Rules Mining

Analysis of Dendrogram Tree for Identifying and Visualizing Trends in Multi-attribute Transactional Data

Survey: Efficent tree based structure for mining frequent pattern from transactional databases

Chapter 13, Sequence Data Mining

Transcription:

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

METANAT HOOSHSADAT, SAMANEH BAYAT, PARISA NAEIMI, MAHDIEH S. MIRIAN, OSMAR R. ZAÏANE
Computing Science Department, University of Alberta, Canada
Email: {hooshsad, samaneh, naiemi, mirianho, zaiane}@cs.ualberta.ca

Uncertainty in various domains implies the necessity for data mining techniques and algorithms that can handle uncertain datasets. Many studies on uncertain datasets have focused on modeling, query ranking, frequent pattern discovery, classification models, clustering, etc. However, despite the existing need, not many studies have considered uncertainty in sequential data. This paper introduces UAprioriAll, a method to mine frequent sequences in the presence of uncertainty in transactions. UAprioriAll scales linearly in time with the size of the dataset.

1. Introduction

Statistical studies on uncertain data have recently attracted significant attention because the data produced and collected in modern applications are often uncertain or noisy. Uncertainty arises from limitations of the equipment, privacy protection, information conversion or extraction, and so on. In a probabilistic transactional database, an item in a transaction can have a probability attached to it, indicating the existential uncertainty of its appearance in that transaction. This common form of uncertainty is called existential uncertainty. As an example, consider a health-related database in which data is extracted from hand-written or text-based medical records using machine learning approaches [1]. Each attribute in this database describes a fact about the patient and may be inaccurate for several reasons, including inaccuracy of the information extraction method.

Sequential Pattern Mining (SPM) is a well-known and important problem in data mining and has been addressed by many studies [2-5]. However, mining frequent sequences from uncertain datasets is still an open problem.

In this paper, we propose a solution to this problem by introducing a new algorithm called UAprioriAll. UAprioriAll is designed to mine frequent sequences in datasets with existential uncertainty. The proposed algorithm uses expected support as the measure of frequentness of transactions and sequences. Expected support measures the expected frequency of an itemset or sequence in an uncertain dataset; it is memory efficient and very fast to compute [6].

In the remainder of this paper, we briefly overview the studies in the field of uncertain datasets and then introduce and describe our proposed algorithm in Section 2. We discuss the experiments designed to evaluate the proposed algorithm in Section 3.

2. Related Works

Frequent sequential pattern mining, or SPM [2-5], deals with datasets in which each transaction is associated with an id, and each id may be associated with a sequence of transactions. The problem is defined as follows: given a set of sequences, where each sequence consists of a list of elements and each element consists of a set of items, and given a user-specified minimum support threshold, sequential pattern mining finds all frequent subsequences, i.e., the subsequences whose occurrence frequency in the set of sequences is no less than the minimum support.

Uncertain datasets have attracted much attention recently [6]. Some studies have addressed the SPM problem for specific and limited types of uncertain datasets [2,3]. The main difference between the related works and our study lies in the nature of the problem and the data modeling. In our model, each item has a probability of existence in each transaction, which implies that the model spaces of the previous works are subspaces of our model.

The data model we propose for a realistic capture of uncertainty in sequential datasets is the existential probabilistic dataset, in which each item exists within a transaction with a probability. Equation 1 shows the general form of our datasets: each dataset D of size |D| is a set of |D| sequences S_i, where each sequence S_i of size |S_i| contains |S_i| transactions t_{i,j}, and each transaction pairs its items x with their existential probabilities p.

D = \{ S_i : i = 1..|D| \}, \quad S_i = \langle t_{i,j} : j = 1..|S_i| \rangle, \quad t_{i,j} = \{ (x_{i,j,k}, p_{i,j,k}) : k = 1..|t_{i,j}| \}    (1)
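To make the data model concrete, the following minimal sketch (not the authors' code) represents a dataset of the form given in Equation 1 and computes the probability that an itemset is present in a single transaction; the item names, probabilities, and the item-independence assumption are illustrative.

```python
# A minimal sketch of the existential probabilistic data model of Eq. 1:
# a dataset is a list of sequences, a sequence is an ordered list of
# transactions, and a transaction maps each item to its existence probability.

from typing import Dict, List

Transaction = Dict[str, float]   # item -> existential probability p_{i,j,k}
Sequence = List[Transaction]     # ordered transactions t_{i,1}, t_{i,2}, ...
Dataset = List[Sequence]         # sequences S_1, ..., S_|D|

D: Dataset = [
    [                                  # S_1
        {"a": 0.9, "b": 0.4},          # t_{1,1}
        {"b": 0.7, "c": 1.0},          # t_{1,2}
    ],
    [                                  # S_2
        {"a": 0.5},                    # t_{2,1}
        {"a": 0.8, "c": 0.3},          # t_{2,2}
        {"b": 0.6},                    # t_{2,3}
    ],
]

def prob_itemset_in_transaction(itemset: List[str], t: Transaction) -> float:
    """P(x ⊆ t): product of item probabilities, assuming independent items."""
    p = 1.0
    for item in itemset:
        p *= t.get(item, 0.0)          # an item absent from t has probability 0
    return p

if __name__ == "__main__":
    print(prob_itemset_in_transaction(["a", "b"], D[0][0]))   # 0.9 * 0.4 = 0.36
```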

Our novel algorithm, UAprioriAll, has three phases: (a) U-Litemset; (b) U-Transformation; (c) U-Sequence.

2.1. U-Litemset: Mining Single Sequences

In this phase, the sequences of size 1 (containing only one transaction) are evaluated. Each single sequence (itemset) is marked as a candidate if its expected support is above the minimum threshold. In a probabilistic dataset D, the expected support is computed by Equation 2:

ES(x) = \sum_{S \in D} P_S(x), \qquad P_S(x) = 1 - \prod_{j=1}^{|S|} \bigl(1 - P(x \subseteq t_j)\bigr), \qquad P(x \subseteq t_j) = \prod_{i \in x} p(i, t_j)    (2)

where p(i, t_j) denotes the existential probability of item i in transaction t_j. We mine the probabilistic frequent patterns using a UApriori-based technique [7]. These itemsets are put in a set called L_1. Next, each pattern in L_1 is mapped to a unique integer, and L_1 is transformed using this map. The set L_1 is the output of this phase.
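As an illustration of Equation 2, the sketch below computes the expected support of a single itemset. It is our reading of the formula under the same item-independence assumption as above, with helper names of our choosing, not the authors' implementation.

```python
# Sketch of Eq. 2: expected support of a size-1 sequence (a single itemset x)
# over an existential probabilistic dataset.

from typing import Dict, List

Transaction = Dict[str, float]           # item -> existential probability

def prob_itemset_in_transaction(x: List[str], t: Transaction) -> float:
    """P(x ⊆ t_j): product of item probabilities (independence assumed)."""
    p = 1.0
    for item in x:
        p *= t.get(item, 0.0)
    return p

def prob_itemset_in_sequence(x: List[str], S: List[Transaction]) -> float:
    """P_S(x) = 1 - prod_j (1 - P(x ⊆ t_j)): probability that x appears in
    at least one transaction of the sequence S."""
    absent_everywhere = 1.0
    for t in S:
        absent_everywhere *= 1.0 - prob_itemset_in_transaction(x, t)
    return 1.0 - absent_everywhere

def expected_support(x: List[str], dataset: List[List[Transaction]]) -> float:
    """ES(x): sum of P_S(x) over all sequences S in the dataset (Eq. 2)."""
    return sum(prob_itemset_in_sequence(x, S) for S in dataset)

# L1 collects every single itemset x whose expected_support(x, D) is at least
# the user-specified minimum support threshold.
```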

2.2. U-Transformation: Simplifying the Dataset

In this phase, we transform the sequential dataset based on the set L_1. The transformed dataset differs from the original dataset in two major ways. First, all infrequent itemsets of the original dataset, i.e., those not contained in L_1, are removed. Second, the frequent itemsets are mapped to integer numbers, as in L_1.

2.3. U-Sequence: Mining the Sequences

In this phase, we mine the frequent sequences from the transformed dataset (the output of U-Transformation) using a UApriori-like algorithm. At each step, the new candidate set C_k is filled based on the frequent sequences of the previous level, L_{k-1}: we join L_{k-1} with L_{k-1} over all pairs that have k-2 items in common, and then remove the candidates that have an infrequent subset. The procedure starts from L_1, which was calculated in the U-Litemset phase, and continues until the set L_k is empty. To form L_k, we choose c from C_k if the expected support of c is greater than the minimum value.

To calculate the expected support of Equation 2, we need the value of P_S(c) for candidates c whose size is greater than 1. The probability that the first k itemsets of a candidate c of size m (m ≥ k) appear, in order, within the first j transactions of a sequence s is denoted by P_{j,k}(c,s) and is computed by the recursion in Equation 3, which allows us to benefit from a dynamic programming scheme:

P_s(c) = P_{|s|,|c|}(c, s)
P_{j,k}(c, s) = P_{j-1,k-1}(c, s) \cdot P(c[k] \subseteq s[j]) + P_{j-1,k}(c, s) \cdot \bigl(1 - P(c[k] \subseteq s[j])\bigr)
P_{j,k}(c, s) = 0 \quad \text{for } k > j
P_{j,1}(c, s) = 1 - \prod_{i=1}^{j} \bigl(1 - P(c[1] \subseteq s[i])\bigr)    (3)

The recursion is obtained by splitting the problem into two mutually exclusive states: state (a) is when s[j] contains c[k], and state (b) is otherwise. The probability P_{j,k}(c,s) is the sum of the probabilities of the two states. State (a) requires two events: c[k] ⊆ s[j], and s[1..j-1] (the first j-1 transactions of s) contains at least one appearance of c[1..k-1] (the first k-1 elements of c). State (b) also requires two events: c[k] ⊄ s[j], and c[1..k] being contained in s[1..j-1]. It is evident that the two states are mutually exclusive.
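A possible bottom-up implementation of this recursion is sketched below. It reflects our reading of Equation 3 rather than the authors' code: the function names and table layout are ours, and the extra base case P_{j,0} = 1 is an assumption that, together with P_{j,k} = 0 for k > j, reproduces the closed form given for P_{j,1}.

```python
# Sketch of the dynamic program for Eq. 3: P[j][k] is the probability that the
# first k itemsets of candidate c appear, in order, within the first j
# transactions of sequence s.

from typing import Dict, List

Transaction = Dict[str, float]                   # item -> existential probability

def prob_itemset_in_transaction(x: List[str], t: Transaction) -> float:
    """P(c[k] ⊆ s[j]): product of item probabilities (independence assumed)."""
    p = 1.0
    for item in x:
        p *= t.get(item, 0.0)
    return p

def prob_candidate_in_sequence(c: List[List[str]], s: List[Transaction]) -> float:
    """P_s(c) = P_{|s|,|c|}(c, s), filled bottom-up."""
    n, m = len(s), len(c)
    # P[j][k] for 0 <= j <= n, 0 <= k <= m; entries with k > j stay 0 (base case).
    P = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(n + 1):
        P[j][0] = 1.0                            # empty candidate is always contained
    for j in range(1, n + 1):
        for k in range(1, min(j, m) + 1):
            q = prob_itemset_in_transaction(c[k - 1], s[j - 1])
            # state (a): s[j] contains c[k];  state (b): it does not
            P[j][k] = P[j - 1][k - 1] * q + P[j - 1][k] * (1.0 - q)
    return P[n][m]

# Summing prob_candidate_in_sequence(c, S) over all sequences S gives the
# expected support of the candidate sequence c, as in Eq. 2.
```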

The total number of P_{j,k}(c,s) values that must be computed (all pairs with k ≤ |c| and k ≤ j ≤ |s|) can be assessed by Equation 4:

N(c, s) = \sum_{k=1}^{|c|} (|s| - k + 1) = \frac{|c| \, (2|s| - |c| + 1)}{2}    (4)

Relying on the mathematical meaningfulness of the expected support, the purpose of UAprioriAll is to find all sequences whose expected support exceeds the predefined threshold. It is easily verifiable that our algorithm is sound, and an induction using the downward closure lemma [7] proves its completeness. In addition, the algorithm terminates when the set L_k becomes empty; in the worst case this happens at the K_max-th level, where K_max is the maximum length of the sequences. Therefore, total correctness can be proven for UAprioriAll.

3. Experiments

Since no uncertain sequential datasets are publicly available, we used synthetic datasets in our experiments, as do other studies on uncertain data (such as [8]). Each generated synthetic dataset is characterized by two parameters: the number of sequences L and the total number of items I. Our goal is to investigate the effect of L on the time consumption, so we set I to 20 and vary L from 50 to 10000.

Figure 1. Time consumption of UAprioriAll with different support values

The number of transactions in each sequence is a pseudo-random number drawn between 1 and a fixed maximum. The items within each transaction are selected at random, with all items having equal chances. The existence probability attached to each item within a transaction is a randomly generated number between 0 and 1. To increase the reliability of the results, for each value of L we generated 5 datasets, applied the method 5 times to each dataset, and averaged the results. In this process we used the scaling method, which scales a large dataset down by randomly eliminating some transactions to obtain a smaller dataset with a lower number of sequences. The experiments were carried out on a machine with a 2.66 GHz clock speed and 8 GB of RAM. In our implementation of the algorithm we adopted a free online Java implementation of Apriori [9].
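For concreteness, a generator along these lines could look like the sketch below. The parameter names, the maximum number of transactions per sequence, and the maximum number of items per transaction are our assumptions, since the exact bounds are not recoverable from the transcription.

```python
# Sketch of a synthetic existential probabilistic dataset generator in the
# spirit of the experiments: L sequences over I items, each item in a
# transaction carrying a random existence probability in (0, 1).

import random
from typing import Dict, List

Transaction = Dict[str, float]
Sequence = List[Transaction]

def generate_dataset(L: int, I: int,
                     max_transactions: int = 10,     # assumed bound, not the paper's
                     max_items: int = 5,             # assumed bound, not the paper's
                     seed: int = 0) -> List[Sequence]:
    rng = random.Random(seed)
    items = [f"i{n}" for n in range(I)]
    dataset: List[Sequence] = []
    for _ in range(L):
        seq: Sequence = []
        for _ in range(rng.randint(1, max_transactions)):
            chosen = rng.sample(items, rng.randint(1, min(max_items, I)))
            # each chosen item gets a random existence probability
            seq.append({item: rng.random() for item in chosen})
        dataset.append(seq)
    return dataset

if __name__ == "__main__":
    D = generate_dataset(L=50, I=20)
    print(f"generated {len(D)} sequences")
```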

The experiments measure the time consumption of the algorithm for different values of the minimum support. This measure is important because decreasing the minimum support may cause a dramatic drop in performance and may affect the consistency of the behavior of UAprioriAll; setting the minimum support in real-world applications depends greatly on the domain. Figure 1 shows the time scalability of UAprioriAll with different values of minimum support: UAprioriAll grows linearly with the number of sequences.

4. Conclusion

In this paper, we proposed a novel uncertain sequential pattern mining algorithm employing the expected support. UAprioriAll considers attribute-level (existential) probability and mines the frequent sequential patterns. We showed the feasibility of our new algorithm in terms of time consumption. Based on the results of the experiments, UAprioriAll's runtime scales linearly with the number of sequences in the dataset.

5. References

1. A. M. Cohen and W. R. Hersh, A survey of current work in biomedical text mining, Briefings in Bioinformatics 6, 57 (2005).
2. J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M. C. Hsu, PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth, in ICDE '01: Proceedings of the 17th International Conference on Data Engineering, (Heidelberg, Germany, 2001).
3. J. Yang, W. Wang, P. S. Yu and J. Han, Mining long sequential patterns in a noisy environment, in SIGMOD '02: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, (Madison, Wisconsin, USA, 2002).
4. M. J. Zaki, SPADE: an efficient algorithm for mining frequent sequences, Machine Learning 42, 31 (2001).
5. R. Agrawal and R. Srikant, Mining sequential patterns, in Proceedings of the Eleventh International Conference on Data Engineering, (Taipei, Taiwan, 1995).
6. C. C. Aggarwal and P. S. Yu, A survey of uncertain data algorithms and applications, IEEE Transactions on Knowledge and Data Engineering 21, 609 (May 2009).
7. C. C. Aggarwal, Y. Li, J. Wang and J. Wang, Frequent pattern mining with uncertain data, in KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (Paris, France, 2009).
8. Q. Zhang, F. Li and K. Yi, Finding frequent items in probabilistic data, in SIGMOD '08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, (Vancouver, Canada, 2008).
9. P. Fournier-Viger, Frequent itemset mining algorithms (2010), http://www.philippe-fournier-viger.com/spmf/index.php.