Fast Accumulation Lattice Algorithm for Mining Sequential Patterns

Similar documents
Discover Sequential Patterns in Incremental Database

A NOVEL ALGORITHM FOR MINING CLOSED SEQUENTIAL PATTERNS

A Comprehensive Survey on Sequential Pattern Mining

Keshavamurthy B.N., Mitesh Sharma and Durga Toshniwal

Data Mining: Concepts and Techniques. Chapter Mining sequence patterns in transactional databases

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

Mining Maximal Sequential Patterns without Candidate Maintenance

Mining Closed Itemsets: A Review

ISSN: (Online) Volume 2, Issue 7, July 2014 International Journal of Advance Research in Computer Science and Management Studies

ClaSP: An Efficient Algorithm for Mining Frequent Closed Sequences

PTclose: A novel algorithm for generation of closed frequent itemsets from dense and sparse datasets

CS145: INTRODUCTION TO DATA MINING

An Effective Process for Finding Frequent Sequential Traversal Patterns on Varying Weight Range

A Novel Boolean Algebraic Framework for Association and Pattern Mining

Sequential Pattern Mining Methods: A Snap Shot

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

Sequential Pattern Mining: A Survey on Issues and Approaches

ETP-Mine: An Efficient Method for Mining Transitional Patterns

Frequent Pattern Mining

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

International Journal of Scientific Research and Reviews

Sequential PAttern Mining using A Bitmap Representation

International Journal of Electrical, Electronics ISSN No. (Online): and Computer Engineering 4(1): 14-19(2015)

Sensitive Rule Hiding and InFrequent Filtration through Binary Search Method

Mining High Average-Utility Itemsets

PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

Ascending Frequency Ordered Prefix-tree: Efficient Mining of Frequent Patterns

DISCOVERING ACTIVE AND PROFITABLE PATTERNS WITH RFM (RECENCY, FREQUENCY AND MONETARY) SEQUENTIAL PATTERN MINING A CONSTRAINT BASED APPROACH

PSEUDO PROJECTION BASED APPROACH TO DISCOVERTIME INTERVAL SEQUENTIAL PATTERN

CLOLINK: An Adapted Algorithm for Mining Closed Frequent Itemsets

Data Mining for Knowledge Management. Association Rules

APPLYING BIT-VECTOR PROJECTION APPROACH FOR EFFICIENT MINING OF N-MOST INTERESTING FREQUENT ITEMSETS

and maximal itemset mining. We show that our approach with the new set of algorithms is efficient to mine extremely large datasets. The rest of this p

Improved Frequent Pattern Mining Algorithm with Indexing

A Novel Fast Constructive Algorithm for Neural Classifier

Roadmap. PCY Algorithm

Finding frequent closed itemsets with an extended version of the Eclat algorithm

Comparative Study of Techniques to Discover Frequent Patterns of Web Usage Mining

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 6

International Journal of Computer Engineering and Applications,

An effective algorithm for mining sequential generators

Performance Analysis of Frequent Closed Itemset Mining: PEPP Scalability over CHARM, CLOSET+ and BIDE

MINING FREQUENT MAX AND CLOSED SEQUENTIAL PATTERNS

Performance Analysis of Apriori Algorithm with Progressive Approach for Mining Data

Mining of Web Server Logs using Extended Apriori Algorithm

On Frequent Itemset Mining With Closure

WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA

Part 2. Mining Patterns in Sequential Data

Sequential Pattern Mining A Study

A Fast Algorithm for Mining Rare Itemsets

Review Paper Approach to Recover CSGM Method with Higher Accuracy and Less Memory Consumption using Web Log Mining

USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining

Memory issues in frequent itemset mining

Basic Concepts: Association Rules. What Is Frequent Pattern Analysis? COMP 465: Data Mining Mining Frequent Patterns, Associations and Correlations

Survey: Efficent tree based structure for mining frequent pattern from transactional databases

CSE 5243 INTRO. TO DATA MINING

2. Department of Electronic Engineering and Computer Science, Case Western Reserve University

Empirical analysis of the Concurrent Edge Prevision and Rear Edge Pruning (CEG&REP) Performance

CS570 Introduction to Data Mining

Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods

DESIGN AND CONSTRUCTION OF A FREQUENT-PATTERN TREE

Scalable Frequent Itemset Mining Methods

Association Rules Mining:References

Distributed frequent sequence mining with declarative subsequence constraints. Alexander Renz-Wieland April 26, 2017

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. Paper s goals. H-mine characteristics. Why a new algorithm?

Improving Efficiency of Apriori Algorithms for Sequential Pattern Mining

CGT: a vertical miner for frequent equivalence classes of itemsets

An Algorithm for Frequent Pattern Mining Based On Apriori

SeqIndex: Indexing Sequences by Sequential Pattern Analysis

Mining association rules using frequent closed itemsets

Keywords: Mining frequent itemsets, prime-block encoding, sparse data

CHUIs-Concise and Lossless representation of High Utility Itemsets

TensorFlow and Keras-based Convolutional Neural Network in CAT Image Recognition Ang LI 1,*, Yi-xiang LI 2 and Xue-hui LI 3

An Algorithm for Mining Large Sequences in Databases

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Mining Frequent Patterns without Candidate Generation

Theoretical Analysis of Local Search and Simple Evolutionary Algorithms for the Generalized Travelling Salesperson Problem

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

Association rules. Marco Saerens (UCL), with Christine Decaestecker (ULB)

Analysis of Dendrogram Tree for Identifying and Visualizing Trends in Multi-attribute Transactional Data

Maintenance of the Prelarge Trees for Record Deletion

Adaption of Fast Modified Frequent Pattern Growth approach for frequent item sets mining in Telecommunication Industry

Generation of Potential High Utility Itemsets from Transactional Databases

Non-redundant Sequential Association Rule Mining. based on Closed Sequential Patterns

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM

Appropriate Item Partition for Improving the Mining Performance

A High-Speed VLSI Fuzzy Inference Processor for Trapezoid-Shaped Membership Functions *

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey

Incremental Mining of Frequent Patterns Without Candidate Generation or Support Constraint

Data Mining Part 3. Associations Rules

Pattern Lattice Traversal by Selective Jumps

A New Fast Vertical Method for Mining Frequent Patterns

Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach *

MS-FP-Growth: A multi-support Vrsion of FP-Growth Agorithm

Transcription:

Proceedings of the 6th WSEAS International Conference on Applied Coputer Science, Hangzhou, China, April 15-17, 2007 229 Fast Accuulation Lattice Algorith for Mining Sequential Patterns NANCY P. LIN, WEI-HUA HAO, HUNG-JEN CHEN Departent of Coputer Science and Inforation Engineering Takang University 151 Ying-chuan Road, Tasui, Taipei, TAIWAN nancylin@ail.tku.edu.tw, 889190111@s89.tku.edu.tw, chenhj@sju.edu.tw Abstract: - Sequential Patterns has any diverse applications in any fields recently. And it has becoe one of the ost iportant issues of Data Mining. The ajor proble in previous studies of ining sequential patterns is too any candidates sequences has been generated during the ining process, costing coputing power and increasing runtie. In this paper we propose a new algorith, Fast Accuulation Lattice (FAL) to alleviate this proble. FAL scan sequential database only once to construct the lattice structure which is a quasi-copressed data representation of original sequential database. The advantages of FAL are: reduce scan ties, reduce searching space and iniize requireent of eory for searching frequent sequences, and axial frequent sequences as well. Keywords: - sequential patterns, sequence ining, data ining, axial frequent sequence 1 Introduction Frequent Sequences Mining is an iportant task of data ining, in the point of view of applications, including Learning Patterns, Web access patterns, Custoer behavior analysis and others tie related data process. The proble can be state as Sequential pattern ining is to discover frequent subsequences as patterns in a sequence database [11]. There are any previous studies of ining sequential patterns efficiently [2][3][5][8]. Most, alost all, of the previous studies of ining sequential patterns, tie related sequence, are adopting apriori-like principle which denote that any super-sequence of an infrequent sequence is also infrequent. The apriori-like principle is based on generation-and-prune ethod; the first scan is finding all of the frequent, also know as large in association rule ining, 1-sequence and which is assebled to generate 2-sequence candidates. Those candidates not satisfying the iniu support threshold will be pruned in the ining process. Repeat the process until no ore candidates were generated. The apriori-like sequential pattern ining ethods has suffered fro several ajor drawbacks: (1) generate a huge set of candidates fro a sequence database, (2) poor tie efficiency due to ultiple scans of sequence database, (3) exponentially generating cobinational candidate sequences in the process of long sequential patterns ining; low threshold as well. In this paper, we propose a novel algorith, FAL, differ fro previous studies. Main goal is to reduce searching space and runtie via a lattice structure algorith. The reainder of this paper is organized as follows: In section 2, preliinary concepts are introduced. In the section 3, the Maxial Sequence is defined. In Section 4, our algorith, Fast Accuulation Lattice, is introduced. In Section 5, the FAL algorith is illustrated with an exaple. The conclusion is in the Section 6.

Proceedings of the 6th WSEAS International Conference on Applied Coputer Science, Hangzhou, China, April 15-17, 2007 230 2 Preliinary Let I { i i,..., = be a set of all ites. A subset 1, 2 i n of I is called an iteset. A sequence is defined as s = e, 1 e2,..., e, e I, e I,..., e 1 2 I. The length of a sequence is the total nuber of ites in the sequence. If a sequence a = a, 1 a2,..., a contains sequence b = b, 1 b2,..., bn then we denote this relationship as a p b, if and only if i 1, i 2,..., i, and that 1 i < i <... < i 1 2 n and a1 bi 1, a,, a. We denote that sequence a is 2 bi 2 b i a subsequence of b, b is supersequence of a. A sequence database is a set of sequences. Suppose D is a sequential database. D is representing the nuber of sequences in the sequence database D. The support of a sequence s is the nuber of sequences in D which contains sequence s. A iniu support, insup, is the threshold of frequent sequence. All sequences with support no less than insup are called frequent sequence, FS. The Maxial Frequent Sequences, MS, is one of the outputs of FAL. The set of axial frequent sequence is defined as { s s FS and s' FS such that s p s' MFS =. The ain issue of ining axial frequent sequence is to find MFS with support no less than iniu support threshold. In this paper we introduced upward closure principle and downward closure principle. The upward closure principle is also known as the ariopri principle; all supersequences of an infrequent sequence are also infrequent. The downward closure indicates that all subsequences of a frequent sequence are also frequent. In our FAL algorith will apply these two principles as an essential part of algorith. 3 Maxial Sequences In previous studies of sequential patterns ining algoriths focus on ining the full set of frequent subsequences that support no less than a given iniu support in a sequence database. On the other hand, a frequent long sequence contains a cobinatorial nuber of frequent subsequences, cost expensively in both tie and space, is the ajor inevitable drawback. A axial sequence approach was introduced to eet these probles. Let S, T S = s, s 1,,, be a s n S and { 1 L set of all frequent sequences. If we delete all subsequences of each sequence of S; the reaining sequences are called Maxial Sequences. In a sense, MS is the copressed representation of FS. This result is due to the downward closure principle. In any cases, the idea of axial sequences is recoended to represent the whole frequent subsequences for the sake of condense inforation. For instance, a axial frequent n-sequence has subsequences, =2 n -1. The condense rate is, which eans that space of axial frequent sequence is ties saller than the whole set of its subsequences. The longer the axial frequent sequence the better condense rate will be. With axial sequences, the lattice can be ore easily fit into eory. To our percept, a structure of axial frequent sequences is a lossless ethod to contain original inforation. 4 Fast Accuulation Lattice Algorith In this section, we described the concept of Fast Accuulation Lattice. FAL has 3 sequential phases: Growing Phase, Pruning Phase and Maxial Phase. In Growing Phase, all sequences are read fro database into a lattice data structure. Each node in the lattice accuulates the count of sequences and passes the count to all its subsequences nodes. Those nodes with count no less than insup, and all its subsequences nodes, will be arked as frequent sequence node. So, to speed up the FAL algorith it doesn t have to pass the accuulating count to these arked nodes. Second, in Pruning Phase, prune off those infrequent sequence nodes in the lattice. Finally, Maxial Phase, delete each node s subsequences to find out axial frequent sequences. In this paper we applied the idea of axial sequence in FAL to iprove the efficiency proble of apriori-like sequence ining. Algorith FAL(D, insup) //Input: D, insup //Output: MFSMaxial Sequences // Growing Phase Initiate Lattice // create root node ConstructLattice(D,insup){ Repeat until the end of D{ read sequence SP fro D if SP Lattice {

Proceedings of the 6th WSEAS International Conference on Applied Coputer Science, Hangzhou, China, April 15-17, 2007 231 search node that node.sequence=sp; if node.frequent=false{ node.count++; if node.count insup For this node and all of it s subsequence node.frequent=true; else { new node; node.sequence=sp; node.count=1; //construct all subsequences nodes of this new node; ConstructLattice(subsequences of node, insup); //Pruning Phase For (k=1:k MaxFrequentSequence -1 :k++) For all nodes of length k If node.frequent=false{ Delete node; Delete all supersequences nodes // Maxial sequence phase For each node{ Delete all subsequences nodes Maxial sequences = all nodes reain in the Lattice 5 exaple For exaple, table 1 is a learning sequence database table. SID represents Sequence Identifier. In this database has included only 5 ites of A, B, C, D and E. In this exaple, the length of the longest sequence is 4. We set insup =2 in this exaple. GROWING PHASE: read in the first sequence ACD fro database. Link the sequence node to root node and all its subsequence are linked to this node and so on so forth. This lattice is grow into a 8 nodes lattice including one root node, one 3-ite sequence{ ACD, 3 2-ite sequences { AC, AD, CD, 3 1-ite sequence { A, C, D. The accuulator of each node has increased by 1 fro 0. The lattice is shown as Fig.1. Fig. 1 Lattice with ACD and subsequences After read in the second sequence ABCE fro sequence database, the first step is to check whether ABCE is already exist in the lattice, not in the lattice or it is a subsequence of a node in the lattice. Second, find out the coon subsequence of each node of the lattice, for now only ACD, and ABCE. In this exaple the coon sequence is AC. The accuulators of node AC and all its subsequence are counted as 2 which has reached the iniu support. So far, there are 3 frequent sequence nodes in the lattice. The lattice structure is shown in Fig.2. Since AC, A and C are frequent sequence node these 3 nodes are shown as white node. Gray node represents infrequent sequences. The nuber on the upper right corner represents accuulated count nuber. Table 1 saple sequence database SID Sequence 1 ACD 2 ABCE 3 BCE 4 BE Fig. 2 Lattice of ACD and ABCE

Proceedings of the 6th WSEAS International Conference on Applied Coputer Science, Hangzhou, China, April 15-17, 2007 232 The Lattice after reading BCE and BE is shown in fig.3. White nodes, AC A C BCE BC BE CE B and E,are denoted as frequent sequences since their accuulator has count nuber no less than insup. Fig. 3 Coplete Lattice Pruning Phase: Scan Lattice fro short sequence nodes, delete infrequent sequence nodes. A pruned Lattice is shown in Fig.4 The advantages are: Sall search space and faster search speed via the pre-knowledge of iniu support. 6 Conclusion In this paper we had introduced a novel lattice structure to represent the original sequence database in a sense of copress ethod. We also investigated issues for ining axial frequent sequential patterns in sequence database and highlight the essential proble of possible inefficiency and redundancy of ining frequent sequential patterns. To the best of our knowledge, this is the first study to solve axial frequent sequences proble with lattice structure. In theory, FAL is outperfor the previous apriori-like algorith for ining sequential patterns by lot ore less space occupation, searching space and runtie. The new algorith consists of three ajor phases; (1) growing phase: scan the sequence database to build a lattice structure with a given iniu support threshold and take the advantage of this pre-knowledge to accelerate the building speed and achieve a ore copact structure. (2) pruning phase : prune all infrequent sequences via apriori principle, start fro short sequence, delete all infrequent sequence and all it s supersequences. (3) axial sequence phase: for each sequence in FS delete all it s subsequences, the reaining is called MS. Fig. 4 Lattice of frequent sequence Maxial Sequences: Scan fro long sequence nodes, delete each node s subsequence, surrounded with a big red circle in Fig. 4. The reaining sequences BCE and AC are called axial sequences. Lattice of MFS is shown as Fig.5. Fig. 5 Lattice of Maxial Sequences Reference: [1] Jiawei Han and Micheline Kaber, Data Mining, Concepts and Techniques, 2 nd edition, Morgan Kaufann Published, 2006. [2] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 1995 Int. Conf. Data Engineering (ICDE 95), pages 3 14, Taipei, Taiwan, Mar. 1995. [3] J. Han, J. Pri, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu, FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining, Proc. 2000 ACM SIGKDD Int l Conf. Knowledge Discovery in Database (KDD 00), pp. 355-359, Aug. 2000. [4] J. Han, J. Pei and Y. Yin, Mining Frequent Patterns without Candidate Generation, Proc. 2000 ACM-SIGMOD Int l Conf. Manageent of Data (SIGMOD 00), pp.1-12, May 2000. [5] Wang, J.; Han, J.a, BIDE: efficient ining of frequent closed sequences, Data Engineering, 2004. Proceedings. 20th International

Proceedings of the 6th WSEAS International Conference on Applied Coputer Science, Hangzhou, China, April 15-17, 2007 233 Conference on 30 March-2 April 2004 Page(s):79 90. [6] N. Pasquier, Y. Bastide, R. Taouil and L. Lakhal, Discoving frequent closed itesets for association rules. In ICDT ' 99, Jerusale, Israel, Jan. 1999. [7] J. Wang, J. Han, and J. Pei, CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itesets. In KDD ' 03,Washington, DC, Aug. 2003. [8] X. Yan, J. Han, and R. Afshar, CloSpan: Mining Closed Sequential Patterns in Large Databases. In SDM'03, San Francisco, CA, May 2003. [9] M. Zaki, and C. Hsiao, CHARM: An efficient algorith for closed iteset ining. In SDM' 02, Arlington, VA, April 2002. [10] Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Janyong Wang, Helen Pinto, Qiing Chen, Ueshwar Dayal, Mei-Chun Hsu, Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach, IEEE Transactions on Knowledge and Data Engineering, vol. 16, No. 11, Noveber 2004. [11] M. Zaki. SPADE: An efficient algorith for ining frequent sequences. Machine Learning, 40:31 60, 2001. [12] Maged El-Sayed, Carolina Ruiz, Elke A. Rundensteiner, Web ining and clustering: FS-Miner: efficient and increental ining of frequent sequence patterns in web logs Proceedings of the 6th annual ACM international workshop on Web inforation and data anageent, Noveber 2004. [13] R. Agrawal and R. Srikant,Fast Algoriths for Mining Association Rules, Proc. 1994 Int l Conf. Very Large Data Bases(VLDB 94), pp.487-499. 1994. [14] R. Agrawal and R. Srikant, Mining Sequential Patterns, Proc. 1995 Int l Conf. Data Eng. (ICDE 95), pp.3-14, Mar. 1995.