International Journal of Computer Engineering and Applications,


AN EFFICIENT MINING FOR MAXIMAL FREQUENT SEQUENCE PATTERN USING BINARY DIGIT REPRESENTATION AND SAME SUPPORT VALUE

S. Ramesh 1, N. Jayaveeran 2
Research Scholar 1, Assistant Professor 2
Department of Computer Science, Khadir Mohideen College, Adhirampattinam, Tamilnadu

ABSTRACT: Mining sequential frequent patterns returns a large number of patterns to the user, which complicates decision making in business and other data mining applications. For this reason, many researchers have proposed maximal closed frequent sequential pattern mining. However, maximal patterns mined from a vast sequence database are still numerous. This paper proposes the Efficient Maximal Closed Frequent Sequence Pattern algorithm (EMaxSPAN), which reduces both the processing time and the number of patterns by combining only candidates that share the same support value, subject to a user-given minimum support value. Its efficiency is evaluated on real-world sequence databases.

Keywords: Pattern Mining, Same Support Value, Maximal, Closed Sequential, Frequent pattern.

[1] INTRODUCTION

Mining useful frequent patterns is a demanding research task, widely used in business, the biological sciences and other fields. Frequent sequential pattern mining was introduced by Agrawal and Srikant [1]. A sub-sequence is called a sequential pattern, or frequent sequence, if it appears in a sequence database in no fewer than a user-defined number of sequences, min_sup. Many algorithms have been developed and are used for decision making in various sectors. Sequential pattern mining plays a significant role in data mining. It is important to a wide range of applications, such as market basket analysis, web click-streams, medical data, e-learning and biological data analysis.

The well-known algorithms for sequential pattern mining are SPAM (Sequential Pattern Mining) [1], GSP (Generalized Sequential Pattern algorithm) [2], PrefixSpan [3] and SPADE (Sequential PAttern Discovery using Equivalence classes) [4]. The popular closed pattern algorithms are BIDE (BI-Directional Extension based frequent closed sequence mining) [5], ClaSP (Closed Sequential Patterns algorithm) [6], CloSpan (Closed Sequential Pattern mining) [7] and CM-ClaSP (Co-occurrence MAP ClaSP) [8]. Some of the maximal closed sequence pattern algorithms are MaxSP (Maximal Sequential Pattern Miner) [9], VMSP (Vertical mining of Maximal Sequential Patterns) [10], MSPX (Maximal Sequential Patterns by using Multiple Samples) [11] and MFSPAN (Maximal Frequent Sequential Pattern Mining Algorithm) [12]. The EMaxSPAN algorithm proposed here discovers maximal sequential patterns by same support value.

The rest of the paper is organized as follows. Section 2 covers the preliminaries of frequent sequence patterns. Section 3 describes the problem statements of the previous work. Section 4 details the proposed approach. Section 5 explains the EMaxSPAN algorithm. Section 6 presents the experimental study of the proposed algorithm, and Section 7 gives the conclusion and future work.

[2] PRELIMINARY CONCEPTS

A sequence database D is a set of sequences S = {s1, s2, ..., sN} over a set of items I = {I1, I2, ..., IM}. The length of a sequence is the number of item sets it contains, and D holds N sequences. A sequence X = (x1, x2, ..., xi) is a sub-sequence of another sequence Y = (y1, y2, ..., yj) if there exist integers 1 ≤ k1 < k2 < ... < ki ≤ j such that x1 = y_k1, x2 = y_k2, ..., xi = y_ki.
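The sub-sequence relation can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: sequences are modelled as strings of single-character items, and the function name is_subsequence is hypothetical rather than taken from the paper.

```python
from typing import Hashable, Iterable


def is_subsequence(x: Iterable[Hashable], y: Iterable[Hashable]) -> bool:
    """Return True if x can be obtained from y by deleting items, keeping order."""
    remaining = iter(y)
    # `item in remaining` consumes the iterator up to the first match, so each
    # later item of x must be matched at a strictly later position of y.
    return all(item in remaining for item in x)


# Assumption for illustration: a sequence is a string of one-letter items.
print(is_subsequence("BD", "BCDA"))  # True: B ... D appear in this order
print(is_subsequence("DB", "BCDA"))  # False: no B occurs after D
```

The check runs in a single pass over y, which is why the order of items matters while gaps between them do not.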
A sequence is an ordered list of items S = <I1, I2, ..., In> such that Ik ∈ I (1 ≤ k ≤ n). A sequence database D contains a set of sequences, and the support of a sequence S is the number of sequences in D that contain S. A frequent sequential pattern is a sequence whose support is not less than the minimum support threshold, min_sup. A closed sequential pattern is a frequent sequence that is not strictly included in another pattern having the same frequency. A maximal sequential pattern P in a sequence database D is a closed sequential pattern that is not strictly included in another closed pattern. The maximal patterns therefore form a small subset of the closed sequential patterns.

[3] PROBLEM STATEMENTS

The actual challenge is to mine the maximal frequent patterns without candidate generation, large memory consumption or processing delay. The previous closed and maximal pattern algorithms return a large number of patterns to the user, which complicates analysis and decision making on business and biological sequence databases. Maximal patterns are needed in many fields, for example in supermarket basket analysis to obtain the longest associated patterns, and in biology to obtain the longest protein sequences. The MaxSP [9] algorithm mines maximal sequential patterns without storing intermediate candidates in main memory, but it needs to scan the database twice. This paper aims to mine maximal patterns effectively by scanning the sequence database only once.

[4] PROPOSED APPROACH

The proposed algorithm (EMaxSPAN) mines the essential maximal closed frequent sequential patterns for business applications and biological sequence data analysis. It combines a user-specified min_sup threshold with a binary representation and a same-support-value candidate generation technique, so that maximal patterns are retrieved from a large sequence database efficiently and effectively.

Consider the sample sequence database shown in [Table-1], in which, for example, S1 = <BCDA>. Each sequence is explored from left to right. Every distinct item gets its own column, and the column stores one binary value per sequence (1 if the item is found in the sequence, otherwise 0). The width of the base table may therefore vary depending on the patterns present in the database. This approach scans the sequence database only once to generate the base table, that is, the binary representation and the support for all the sequences in the database. The generated base table is shown in [Table-2].

Candidates are generated as in Apriori candidate construction, but only sequences with the same support value are considered. The distinct length-1 candidate items are generated and stored in the base table itself. All other candidates are generated level by level (length-2, length-3, ..., length-n) by combining candidates that share the same support value, and the maximal closed frequent pattern is mined from the binary representation base table. Candidate pruning happens automatically, since only candidates with the same support value are considered for candidate construction and pattern extraction.

Table: 1. Sample Sequence Database

Sid   Sequence
S1    <BCDA>
S2    <AEA>
S3    <BDCA>
S4    <BCD>
S5    <EEE>

Table: 2. Binary Representation Table for Sample Database

Sid      B  C  D  A  E
S1       1  1  1  1  0
S2       0  0  0  1  1
S3       1  1  1  1  0
S4       1  1  1  0  0
S5       0  0  0  0  1
Support  3  3  3  3  2
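The single-scan construction of the binary base table can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' VB.NET implementation: each sequence is modelled as a string of single-character items, and the names build_base_table, support and SAMPLE_DB are hypothetical.

```python
from typing import Dict, List


def build_base_table(db: Dict[str, str]) -> Dict[str, List[int]]:
    """Scan db once and return one 0/1 column per distinct item.

    Position r of a column is 1 when the item occurs in the r-th sequence.
    """
    sids = list(db)
    table: Dict[str, List[int]] = {}
    for row, sid in enumerate(sids):
        for item in db[sid]:  # explore the sequence left to right
            column = table.setdefault(item, [0] * len(sids))
            column[row] = 1  # repeated items simply set the same bit again
    return table


def support(column: List[int]) -> int:
    """Support of an item = number of sequences whose bit is set."""
    return sum(column)


# Assumption for illustration: the sample database of Table 1,
# with each sequence written as a string of one-letter items.
SAMPLE_DB = {"S1": "BCDA", "S2": "AEA", "S3": "BDCA", "S4": "BCD", "S5": "EEE"}

base = build_base_table(SAMPLE_DB)
for item, bits in base.items():
    print(item, bits, support(bits))
```

Columns are created in order of first appearance, so the sample database yields the items B, C, D, A, E with supports 3, 3, 3, 3 and 2.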

Table: 3. A Sample Candidate Generation Operation (Logical AND with Candidate Item B and C)

Sid      B  C  BC
S1       1  1  1
S2       0  0  0
S3       1  1  1
S4       1  1  1
S5       0  0  0
Support  3  3  3

From the base table, item E is pruned: its support value is 2, which is equal to min_sup, but no other item has the same support value. A candidate is kept only when its support is at least min_sup and at least two candidate items share that support value.

Table: 4. Length-2 Candidates

BC  BD  BA  CD  CA  DA

Table: 5. Length-3 Candidates

BCD  BAC  BAD  CAD

Table: 6. Length-4 Candidate

BACD

From the base table [Table-2], the length-1 candidates are {B, C, D, A} with support values {3, 3, 3, 3}. The length-2 candidates are then {BC, BD, BA, CD, CA, DA} with support values {3, 3, 2, 3, 2, 2}. These length-2 candidates have more than one support value, so they are divided into two segments according to support: part one {BC, BD, CD} with support value 3, and part two {BA, CA, DA} with support value 2, as shown in [Table-4]. From the first part we get the length-3 candidate {BCD} with support 3, and from the second part we get {BAC, BAD, CAD} with supports {2, 2, 2}; the length-3 candidates are shown in [Table-5]. The length-4 candidate is generated from the length-3 candidates with the same support, {BAC, BAD, CAD}, giving {BACD} with support value 2, as shown in [Table-6]. Hence the maximal frequent sequence pattern on this sample sequence database is BACD.

The constraints in building and generating candidates are as follows:

1. The candidate generating operation is done with the logical AND operation shown in [Table-3]. If the binary value in both positions is 1, the output is 1, otherwise 0.
2. Candidates are pruned when the support value is less than min_sup, or when there is no more than one candidate with the same support value.
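The level-wise, same-support candidate generation described above can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the names base_table, next_level and mine_maximal are hypothetical, sequences are modelled as strings of single-character items, and candidates are treated as unordered item combinations. Like the worked example, the sketch only ANDs presence bits, so the order of items inside a sequence is not checked.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Bits = Tuple[int, ...]
Level = Dict[FrozenSet[str], Bits]


def base_table(db: List[str]) -> Level:
    """Length-1 candidates: one presence bit per sequence for each item."""
    items = sorted({item for seq in db for item in seq})
    return {frozenset([i]): tuple(int(i in seq) for seq in db) for i in items}


def next_level(level: Level, min_sup: int) -> Level:
    """Join candidates that share a support value, using a bitwise AND."""
    groups: Dict[int, List[FrozenSet[str]]] = {}
    for cand, bits in level.items():
        groups.setdefault(sum(bits), []).append(cand)  # segment by support

    out: Level = {}
    for sup, cands in groups.items():
        if sup < min_sup or len(cands) < 2:
            continue  # pruned: infrequent, or no partner with the same support
        for a, b in combinations(cands, 2):
            merged = a | b
            if len(merged) != len(a) + 1:
                continue  # Apriori-style join must grow by exactly one item
            bits = tuple(x & y for x, y in zip(level[a], level[b]))
            if sum(bits) >= min_sup:
                out[merged] = bits
    return out


def mine_maximal(db: List[str], min_sup: int) -> Level:
    """Return the candidates of the last (longest) non-empty level."""
    level = {c: b for c, b in base_table(db).items() if sum(b) >= min_sup}
    while True:
        nxt = next_level(level, min_sup)
        if not nxt:
            return level
        level = nxt


# Assumption for illustration: the sample database of Table 1.
SAMPLE_DB = ["BCDA", "AEA", "BDCA", "BCD", "EEE"]
for cand, bits in mine_maximal(SAMPLE_DB, min_sup=2).items():
    print("".join(sorted(cand)), sum(bits))  # prints: ABCD 2
```

With min_sup = 2 the sketch ends on the single item combination {A, B, C, D} with support 2, which the text writes as BACD. With min_sup = 3 it ends on {B, C, D} with support 3.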

[5] EMaxSPAN ALGORITHM

In the first step, EMaxSPAN scans the complete sequence database to collect the distinct items and their positions, and builds the binary representation base table. In the second step, the candidates are separated into segments by support value, and candidates within the same segment are combined with the logical AND operation. This is repeated for all candidate generations of length-1, length-2, ..., length-n, and candidates that have no same-support partner or whose support is less than the user-defined min_sup are pruned. In the third step, the maximal closed patterns with the same support value are extracted from the sequence database as the candidates of the last (longest) length.

EMaxSPAN Pseudo Code

Algorithm EMaxSPAN(D, min_sup)
// D = Sequence Database, min_sup = Minimum Support
    /* Scan the sequence database to generate length-1 candidates
       and to build the binary representation base table */
    BTABLE = BaseTableGeneration(D, min_sup)
    For each CLENGTH                         // CLENGTHi to CLENGTHn
        CandidateGeneration(BTABLE, CLENGTHi, min_sup)
        If SupportCount(CLENGTHi) == SupportCount(CLENGTHj) then
            Combine(CLENGTHi, CLENGTHj)
            CandidateGeneration(BTABLE, CLENGTHij, min_sup)
        // Else pruned
        End if
    End for
    Return (CLENGTHn)                        // Max length item
END

/* Binary Representation Base Table Generation */
/* D = Sequence Database, min_sup = Minimum Support Threshold */
FUNCTION BaseTableGeneration(D, min_sup)
    For each SEQUENCE in D                   // Si to Sn
        For each ITEM in SEQUENCE            // ITEMj to ITEMn
            If (Distinct(ITEM) == ITEMj) then
                Position[ITEMi, Sj] = 1
            Else
                Position[ITEMi, Sj] = 0
            End if
        End for
    End for
END

// Function Support Count for Distinct Itemsets
/* BTABLE = Generated BaseTable, CITEM = Candidate Item */
FUNCTION SupportCount(BTABLE, CITEM)
    For each CITEM in BTABLE                 // CITEMi to CITEMn
        For each Sequence in BTABLE          // Si to Sn
            Support[CITEMi] = SUM(Si:Sn)
        End for
    End for
END

/* Function for Candidate Generations */
/* BTABLE = BaseTable, CLENGTH = Candidate Length, min_sup = Minimum Support */
FUNCTION CandidateGeneration(BTABLE, CLENGTH, min_sup)
    For each Sequence in BTABLE              // Si to Sn
        For each CITEM in BTABLE             // CITEMi to CITEMn
            If (LogicalAND(CITEMi, CITEMj)) then
                Combine(CITEMi, CITEMj)
                SupportCount(CITEMij)
            // Else pruned
            End if
        End for
    End for
END

[6] EXPERIMENTAL EVALUATION

The proposed algorithm is implemented in VB.NET on a personal computer with an Intel dual-core 2.66 GHz processor and 2 GB RAM, running the Windows 7 Ultimate 32-bit operating system. The experimental evaluation uses real-world UCI (University of California, Irvine) data, downloaded in ARFF (Attribute-Relation File Format) and converted to a native SQL Server database. The transformed mushroom dataset contains 8,124 instances and 23 attributes, including the class attribute. The data transformation is shown in [Table-7] and the details of the mushroom data are shown in [Table-8]. A second dataset, E. coli promoter gene sequences (DNA), is also used to evaluate EMaxSPAN, and its description is shown in [Table-9].

Table 7: Sample mushroom Data Sets

Classes  Edible  Poisonous  Bell  Conical  Convex  Knobbed
Items    e       p          b     c        x       k

Table 8: The Details of Mushroom Data

SNo  Descriptions            Value
1    Total No. of Instances  8,124
2    Number of Attributes    22

SNo  Descriptions            Value
1    Total No. of Instances
2    Number of Attributes    59

Table 9: DNA gene sequence data

The experimental study measures the running time of the proposed algorithm on the Mushroom and Gene DNA Sequence datasets. [Figure-1] shows the performance analysis on the Mushroom data, and [Figure-2] shows the running time on the DNA sequence data as the support value changes from 0.01 to 0.05. EMaxSPAN is compared with the previous algorithm ClospanSSV (Closed Sequential Pattern by Same Support Value). Figure 1 and Figure 2 show that when the minimum support (min_sup) value is low, EMaxSPAN outperforms the previous ClospanSSV algorithm.

Figure: 1. Performance Analysis on Mushroom Data sets (Runtime (s) against Support in %; series: ClospanSSV, EMaxSPAN)

Figure: 2. Performance Analysis on Gene Sequence (DNA) Datasets (Runtime (s) against Support in %; series: ClospanSSV, EMaxSPAN)

[7] CONCLUSION AND FUTURE WORK

The EMaxSPAN algorithm is proposed to reduce processing time and to mine the essential frequent patterns in huge sequence datasets. It is valuable where similar permutations of frequent sequential patterns need to be extracted. The major advantage of this algorithm is that it scans the database only once to generate the binary representation base table. The algorithm obtains the maximal closed sequential patterns that share exactly the same support from the sequence database, given a user-defined minimum support threshold value. However, the key challenge of this algorithm is that intermediate data, together with its support counts, must be kept for each candidate generation. The maximal closed sequential patterns found by same support value depend on the candidate length and on the length of the sequences. In future work, this algorithm can be enhanced with candidate fusion for colossal candidates in extremely large DNA sequence databases.

REFERENCES

[1] R. Agrawal and R. Srikant, "Mining Sequential Patterns," in Yu, P.S. and Chen, A.S.P., editors, 11th International Conference on Data Engineering ICDE 1995, Taipei, Taiwan, pages 3-14, IEEE Computer Society Press, 1995.
[2] R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalization and Performance Improvements," in Proc. of EDBT 96, pp. 3-17, 1996.
[3] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M.-C. Hsu, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth," in Proc. of IEEE ICDE 01, 2001.
[4] M. Zaki, "An Efficient Algorithm for Mining Frequent Sequences," Machine Learning, Vol. 40, 2000.
[5] J. Ayres, J. Gehrke, T. Yiu and J. Flannick, "Sequential Pattern Mining using Bitmap Representation," in Proc. of ACM SIGKDD 02.
[6] P. Fournier-Viger, C. W. Lin, A. Gomariz, A. Soltani, Z. Deng, H. T. Lam, "The SPMF open source data mining library version 2," The European Conference on Principles of Data Mining and Knowledge Discovery.
[7] X. Yan, J. Han, and R. Afshar, "CloSpan: Mining Closed Sequential Patterns in Large Data Base," SIAM International Conference on Data Mining.
[8] P. Fournier-Viger, A. Gomariz, M. Campos, and R. Thomas, "Fast Vertical Mining of Sequential Patterns Using Co-occurrence Information," The Pacific-Asia Conference on Knowledge Discovery and Data Mining.
[9] P. Fournier-Viger, C.-W. Wu, and V. S. Tseng, "Mining Maximal Sequential Patterns without Candidate Maintenance," The International Conference on Advanced Data Mining and Applications.
[10] P. Fournier-Viger, C.-W. Wu, A. Gomariz, and V. S. Tseng, "VMSP: Efficient vertical mining of maximal sequential patterns," The Canadian Conference on Artificial Intelligence.
[11] C. Luo and S. Chung, "Efficient mining of maximal sequential patterns using multiple samples," SIAM International Conference on Data Mining.
[12] En-Zheng Guan, Xiao-Yu Chang, Zhe Wang, Chun-Guang Zhou, "Mining Maximal Sequential Patterns," IEEE.
[13] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal, "Discovering frequent closed itemsets for association rules," Proceedings of the 7th International Conference on Database Theory (ICDT '99).
[14] K. Subramanian, E. Elakkiya, "Modified Sequential Pattern Mining Using Direct Bit Position Method," International Journal of Science and Research (IJSR).
[15] J. Wang, J. Han, and Chun Li, "Frequent closed sequence mining without candidate maintenance," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 8, Aug.
[16] J. Pei, J. Han, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.C. Hsu, "PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth," in ICDE 01, Heidelberg, Germany, April 2001.
[17] Philippe Fournier-Viger, Jerry Chun-Wei Lin, Rage Uday Kiran, Yun Sing Koh, Rincy Thomas, "A Survey of Sequential Pattern Mining," Data Science and Pattern Recognition, Ubiquitous International, Volume 1, Number 1, February.
[18] Charu C. Aggarwal, Jiawei Han (eds.), Frequent Pattern Mining, Springer International Publishing, 2014.
[19] Mihika Shah, Lynette D'mello, "A Study of Sequential Pattern Mining Algorithms," IJIACS, Volume 4, Issue 11, November.
[20] Zhu Zhenxin, Lü Jiaguo, "Closed Sequential Pattern Mining Algorithm Based Positional Data," in Y. Wu (Ed.): International Conference on WTCS 2009, AISC 116, Springer-Verlag Berlin Heidelberg.

Author 1: Mr. S. RAMESH
He is a Research Scholar (PhD) in Computer Science at Khadir Mohideen College of Arts and Science, Adhirampattinam, affiliated to Bharathidasan University, Tiruchirappalli. He completed his M.Sc (CS) at Khadir Mohideen College in 1997 and his MPhil (CS) at Periyar University, Salem, in 2006. He is presently working as Assistant Professor and Head, Department of Computer Science, at Bharathidasan University Model College, Aranthangi, Pudukkottai, Tamilnadu.

Author 2: Dr. N. JEYAVEERAN
He is working as Associate Professor and Head in the Department of Computer Science at Khadir Mohideen College of Arts and Science, Adhirampattinam. He has completed M.Sc (Maths), M.Phil (Maths) and M.Phil (CS), and holds a PhD degree in Computer Science from Bharathidasan University. He is a research supervisor in Computer Science.
