A Study on Mining of Frequent Subsequences and Sequential Pattern Search: Searching Sequence Pattern by Subset Partition

S. Vigneswaran 1, M. Yashothai 2
1 Research Scholar (SRF), Anna University, Chennai.
2 Research Scholar, Department of Computer Science, Erode Arts College, Erode. yashooraj@gmail.com

ABSTRACT

Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition as well as statistical and mathematical techniques. There is a wide range of well-established business applications for data mining. Various data mining techniques have been developed and used in various projects, including classification, clustering, genetic algorithms, neural networks, association analysis, outlier analysis, prediction and evolution analysis. Discovering sequential patterns from a large database of sequences has been recognized as an important problem in the field of knowledge discovery and data mining. The concept of sequence data mining was introduced by Rakesh Agrawal and Ramakrishnan Srikant in 1995. The problem was first introduced in the context of market analysis: it aimed to retrieve frequent patterns in the sequences of products purchased by customers through time-ordered transactions. Sequential pattern mining is a significant data mining task that determines time-related behaviours in sequence databases. It has been applied in numerous domains, such as web-log analysis, analysis of customer purchase behaviour, telecommunication, network detection, DNA research, process analysis of scientific experiments, medical analysis and so on. The growing number of applications of sequential pattern mining requires a sound understanding of the problem and a clear identification of the strengths and weaknesses of existing algorithms.
This paper reviews various algorithms for discovering sequential patterns from large sequence databases, which is a vital problem in the field of knowledge discovery and data mining.

Keywords: Data Mining, Sequential Pattern Mining Algorithms, Apriori Based Mining Algorithm, FP-Growth Based Mining Algorithm.

I. INTRODUCTION

Data mining is the extraction of hidden predictive information from large databases. It has attracted a great deal of attention in the information industry and in society as a whole in recent years, due to the wide availability of huge amounts of data and the pressing need to turn such data into useful information and knowledge. Data mining involves many different algorithms that accomplish different tasks. All of these algorithms attempt to fit a model to the data: they examine the data to be processed and determine the model that is closest to the characteristics of the data being examined. Data mining uses two kinds of models, namely predictive models and descriptive models.

Predictive modelling is a commonly used statistical technique for predicting future behaviour. A predictive model makes predictions about values of data using known results found from different data. Predictive modelling may also be based on other historical data. A descriptive model identifies patterns or relationships in data. In contrast to the predictive model, a descriptive model serves as a way to explore the properties of the data examined, not to predict new properties.

II. DATA MINING AND PATTERN MINING

Data mining is the analysis step of the Knowledge Discovery in Databases process. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Data mining includes [6] medical data mining, spatial data mining, sensor data mining, visual data mining and sequence data mining. Data mining is used in financial data analysis, the retail industry, science and engineering, medical applications, the telecommunication industry and biological data analysis. Sequence data mining tasks include sequential pattern mining, association rule mining, frequent itemset mining, sequential rule mining and clustering.

SEQUENTIAL PATTERN MINING: Sequential pattern mining finds frequent subsequences and relevant patterns in the given sequences.

III. SEQUENTIAL PATTERN-MINING ALGORITHMS

Sequential pattern mining deals with finding statistically relevant patterns between data examples where the values are delivered in sequence. It is closely related to time series mining and is a special case of structured data mining. Several sequential pattern mining algorithms have been proposed; they vary mainly in two ways:

(1) The way in which candidate sequences are generated and stored. The most important goal is to reduce the number of candidate sequences generated so as to minimize I/O cost.

(2) The way in which support is counted and how candidate sequences are tested for frequency. The key strategy here is to eliminate any database or data structure that has to be maintained all the time for support counting purposes only.

Based on these criteria, sequential pattern mining algorithms can be divided broadly into two groups [3]: Apriori based and pattern growth based. The main idea of the pattern growth approach is to examine only the prefix subsequences that have minimum support.
Most sequential pattern mining methods follow the Apriori-based approach, which leads to many database scans and to the generation and testing of a very large number of candidate sequences, both of which degrade performance. The very first was the Apriori algorithm, put forward by the originators of the problem themselves. Later, more scalable algorithms for complex applications were developed, e.g. GSP, SPADE and SPAM. Pattern-growth based methods address the above problems and, in addition, work on projected databases, which reduces the search space.

Sequential pattern mining is the method of finding interesting sequential patterns in large databases. It finds frequent subsequences as patterns in a sequence database. The identified patterns are expressed as subsequences of the data sequences and are ordered; that is, the order of the elements of a pattern must be respected in all instances where it appears. A pattern is considered frequent if it appears in a number of instances above a given threshold value, usually defined by the user.

Basic concepts of Sequential Pattern Mining [1]

Let I = {x_1, ..., x_n} be a set of items, each possibly associated with a set of attributes such as value, price, profit, calling distance, period, etc. The value of attribute A of item x is denoted by x.A. An itemset is a

non-empty subset of items, and an itemset with k items is called a k-itemset. A sequence α = <A_1 ... A_l> is an ordered list of itemsets. An itemset A_i (1 ≤ i ≤ l) in the sequence is called a transaction, a term that originated from analyzing customers' shopping sequences in a transaction database. A transaction A_i may have a special attribute, timestamp, denoted by A_i.time, which registers the time at which the transaction was executed. For a sequence α = <A_1 ... A_l>, assume A_i.time < A_j.time for 1 ≤ i < j ≤ l. The length of a sequence is the number of transactions in it. A sequence of length l is called an l-sequence. For an l-sequence β, the length of β is denoted by len(β) = l, and its j-th itemset is denoted by β[j]. An item can occur at most once in an itemset, but can occur multiple times in different itemsets of a sequence.

A sequential pattern [5] is a sequence whose statistical significance is above a user-specified threshold. Two alternative measures of statistical significance for sequential patterns are support and number of occurrences. The number of sequences can be very large, and users have different interests and requirements.

IV. CLASSIFICATION OF SEQUENTIAL PATTERN-MINING ALGORITHMS

Pattern mining is a data mining technique that involves finding existing patterns in data [2]. In this context, patterns often means association rules. The original motivation for searching for association rules came from the desire to analyze supermarket transaction data, that is, to examine customer behaviour in terms of the purchased products. For example, an association rule coke => crisps (80%) states that four out of five customers who bought coke also bought crisps.

1. APRIORI-BASED METHODS

The Apriori property for sequences states that if a sequence S is not frequent, then no super-sequence of S is frequent; equivalently, every subsequence of a frequent sequence is itself frequent. This is also known as the anti-monotonic (or downward-closed) property.
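As an illustration of the definitions in this section, the following minimal Python sketch checks subsequence containment and counts support, where support is taken to be the number of database sequences that contain the pattern. The names is_subsequence, support and make_seq, as well as the toy database, are assumptions introduced for this example and are not taken from the cited works. On this toy data the longer pattern has a support no larger than that of its subsequence, which is the behaviour the anti-monotonic property describes.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[str]
SeqT = Sequence[Itemset]


def is_subsequence(pattern: SeqT, sequence: SeqT) -> bool:
    """Return True if `pattern` is contained in `sequence`.

    Each itemset of the pattern must be a subset of a distinct itemset of
    the sequence, and the matched itemsets must appear in the same order.
    """
    pos = 0
    for itemset in pattern:
        # Greedily advance to the earliest itemset that contains this one.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # later pattern itemsets must match strictly later
    return True


def support(pattern: SeqT, database: Sequence[SeqT]) -> int:
    """Number of database sequences that contain `pattern`."""
    return sum(1 for s in database if is_subsequence(pattern, s))


def make_seq(*itemsets: str) -> List[Itemset]:
    """Hypothetical helper: make_seq("a", "bc") builds <{a} {b, c}>."""
    return [frozenset(s) for s in itemsets]


# Toy sequence database (illustrative data only).
DB = [
    make_seq("a", "bc", "d"),
    make_seq("a", "c"),
    make_seq("b", "a", "d"),
]
short_pattern = make_seq("a", "d")
long_pattern = make_seq("a", "c", "d")  # a super-sequence of short_pattern

print(support(short_pattern, DB))  # 2
print(support(long_pattern, DB))   # 1
```

With a minimum support of 2, the pattern <{a} {d}> is frequent while its super-sequence <{a} {c} {d}> is not, so no extension of the latter needs to be examined.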
The initial pass of the algorithm simply counts the occurrences of the items to determine the frequent 1-itemsets. A subsequent pass k consists of two steps. First, the frequent itemsets L_(k-1) found in the (k-1)-th pass are used to produce the candidate itemsets C_k using Apriori candidate generation. Then the database is scanned and the support of the candidates in C_k is counted. The candidate itemsets are pruned so that all subsets of each remaining candidate are already known to be frequent. If a sequence fails the user-specified minimum support test, then all of its super-sequences will also fail the test.

Key features of Apriori-based algorithms [3] are:

Breadth-first search: The algorithms in the apriori-based

approach are described as breadth-first search algorithms because they construct all the k-sequences in the k-th iteration of the algorithm as they traverse the search space.

Generate-and-test: Algorithms based on this feature exhibit inefficient pruning. They produce an explosive number of candidate sequences and then test them one by one against the user-specified constraints. This consumes a lot of memory in the early stages of mining.

Multiple scans of the database: The original database is scanned repeatedly to check whether each of a long list of generated candidate sequences is frequent. This is a very undesirable characteristic of most apriori-based algorithms and requires a lot of processing time and I/O cost.

In the Apriori-based approach, there are three important algorithms, namely GSP (Generalized Sequential Pattern), SPADE (Sequential Pattern Discovery using Equivalence classes) and SPAM (Sequential Pattern Mining).

2. PATTERN GROWTH BASED APPROACH

Pattern-growth algorithms [4] emerged in the early 2000s as a solution to the generate-and-test problem. The key idea is to avoid the candidate generation step altogether and to focus the search on a restricted part of the initial database. In this category of algorithms, search space partitioning plays an essential role. Such an algorithm starts by building a representation of the database to be mined, then determines a way to partition the search space, and generates candidate sequences by growing the frequent sequences already mined. The early algorithms used projected databases, namely FreeSpan and PrefixSpan, the latter being the more powerful. Pattern-growth approaches can be regarded as depth-first traversal algorithms, since they recursively generate a projected database for every length-k pattern in order to discover length-(k+1) patterns.
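The recursive projection described above can be sketched as follows. This is a simplified, PrefixSpan-style illustration rather than the published algorithm: it assumes that every transaction holds a single item, so a sequence is just a list of items, and it copies suffixes instead of using pointer-based pseudo-projection. The function name prefix_growth and the toy database are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def prefix_growth(database: Sequence[Sequence[str]],
                  min_support: int) -> Dict[Tuple[str, ...], int]:
    """Mine frequent sequences of single items by prefix projection."""
    results: Dict[Tuple[str, ...], int] = {}

    def grow(prefix: Tuple[str, ...], projected: List[List[str]]) -> None:
        # Count each item at most once per projected suffix.
        counts: Counter = Counter()
        for suffix in projected:
            counts.update(set(suffix))

        for item, count in sorted(counts.items()):
            if count < min_support:
                continue  # infrequent extensions are never grown further
            pattern = prefix + (item,)
            results[pattern] = count

            # Project: keep what follows the first occurrence of `item`.
            new_projected = []
            for suffix in projected:
                if item in suffix:
                    rest = suffix[suffix.index(item) + 1:]
                    if rest:
                        new_projected.append(rest)

            # Depth-first step: length-k pattern -> length-(k+1) patterns.
            grow(pattern, new_projected)

    grow((), [list(s) for s in database])
    return results


# Toy database (illustrative data only).
TOY_DB = [["a", "b", "c", "d"], ["a", "c", "d"], ["b", "a", "d"]]
patterns = prefix_growth(TOY_DB, min_support=2)

for pattern, count in sorted(patterns.items()):
    print(pattern, count)
```

No candidate set is generated up front: each recursive call only looks at items that actually occur in the projected database of its prefix, which is why the search space shrinks as patterns grow.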
This focuses the search on a restricted portion of the initial database in order to avoid the expensive candidate generation and test step. The pattern-growth approach is a more incremental method of producing the possible frequent sequences, and it uses the divide-and-conquer technique. Pattern-growth algorithms make projections of the database in an attempt to shrink the search

space. The main algorithm of this approach is PrefixSpan.

Key features of pattern-growth based algorithms are:

Search space partitioning: It allows the generated search space of large candidate sequences to be partitioned for efficient memory management. There are different ways to partition the search space. After partitioning, the smaller partitions can be mined in parallel. Advanced techniques for search space partitioning include projected databases and conditional search, which are referred to as split-and-project techniques.

Tree projection: Tree projection usually accompanies pattern-growth algorithms. Here the algorithms implement a physical tree data structure representing the search space, which is then traversed breadth-first or depth-first in search of frequent sequences, and pruning is based on the Apriori property.

Depth-first traversal: This approach helps in the early pruning of candidate sequences as well as in the mining of closed sequences. The main reason is that depth-first traversal uses much less memory and a more directed search space, so fewer candidate sequences are generated than with the breadth-first or post-order traversals used by some early algorithms.

Candidate sequence pruning: Pattern-growth algorithms try to use a data structure that allows them to prune candidate sequences early in the mining process. This results in a smaller search space early on and maintains a more directed and narrower search procedure.

3. PROPOSED ALGORITHM

Although sequential pattern mining improves efficiency in numerous circumstances, it still faces hard challenges in terms of effectiveness and efficiency. In this paper, two fundamental tasks of sequence mining are considered, namely sequence generation and sequence searching.
In the sequence generation process, significant and frequent sequences are generated from the database depending on the user-defined minimum support. Several applications need to check whether a given search sequence is present in the sequence database, and some applications need to count the subset occurrences of a given search sequence in the sequence database. The proposed pattern segmentation structure is used to generate valuable information on customer purchasing activities for managerial decision-making. The method consists of two actions, namely partitioning and searching.

V. ALGORITHM

Searching Sequence Pattern by Subset Partition

Input: Sequence database Sdb and search sequence Ss.

Output: (i) Subset search successful or unsuccessful (ii) Sequence count.

Method:
1. Consider the input sequence database Sdb = <s_1, s_2, s_3, ..., s_n>, where s_j ⊆ I for j = 1 to n is the set of sequences and I = <i_1, i_2, i_3, ..., i_m> is the set of items.
2. Initialize Count = 0, L = 0.
3. Partitioning:
3.1 Partition Sdb into T tables based on the sequence length.
3.2 Consider the search sequence Ss and calculate its length (L).
4. Sequence searching:
4.1 Start the search from T_L to T_n.
4.2 If (Ss ⊆ T_L) then display the search sequence Ss; increment the value of Count and L;
4.3 If (L > n) then display the value of Count; else go to step 4.2;
4.4 Else increment the value of L and go to step 4.2;
End

VI. CONCLUSION

This paper has discussed what sequential pattern mining is and the various types of algorithms for it. The concept was introduced in 1995, and sequential pattern mining is gaining importance today since it assists in finding relationships among data in an effective manner. On the basis of how they approach the problem, sequential pattern mining algorithms are categorized into two main groups: Apriori based algorithms and pattern growth based algorithms. In this work, the sequence generation algorithms have been discussed, and a new sequence search algorithm, Searching Sequence Pattern by Subset Partition (SSPSP), has been proposed to perform the sequence search operation. Sequential pattern mining tries to find the relationships between occurrences of sequential events and to determine whether any specific order of occurrences exists. It can find the sequential patterns of specific individual items as well as sequential patterns across different items. With over a decade of extensive research, there have been hundreds of research publications and tremendous research, development and application activities in this domain.
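As a supplementary illustration, the partition-and-search procedure of Section V can be sketched in Python. This is one possible reading of the pseudocode, not the authors' implementation. It assumes that each sequence is a list of single items, that the test in step 4.2 means the search sequence occurs in a stored sequence as an order-preserving subsequence, and that the tables T_L to T_n are the partitions holding sequences of length L and above. The names partition_by_length, contains and sspsp_search are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def partition_by_length(
        sdb: Sequence[Sequence[str]]) -> Dict[int, List[Tuple[str, ...]]]:
    """Step 3.1: group the sequences of Sdb into tables keyed by length."""
    tables: Dict[int, List[Tuple[str, ...]]] = defaultdict(list)
    for seq in sdb:
        tables[len(seq)].append(tuple(seq))
    return dict(tables)


def contains(seq: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if `pattern` occurs in `seq` in order (gaps allowed)."""
    it = iter(seq)
    # `item in it` consumes the iterator up to the first match, so each
    # pattern item has to be found after the previous one.
    return all(item in it for item in pattern)


def sspsp_search(sdb: Sequence[Sequence[str]],
                 ss: Sequence[str]) -> Tuple[bool, int]:
    """Return (search successful?, number of sequences containing ss)."""
    tables = partition_by_length(sdb)
    length_l = len(ss)  # step 3.2
    count = 0
    # Steps 4.1 to 4.4: tables of shorter sequences cannot contain ss,
    # so only the tables from length L upwards are visited.
    for length in sorted(tables):
        if length < length_l:
            continue
        for seq in tables[length]:
            if contains(seq, ss):
                count += 1
    return count > 0, count


# Toy sequence database (illustrative data only).
SDB = [["a", "b"], ["a", "c", "b"], ["b", "a"], ["a", "b", "c", "d"], ["c"]]
print(sspsp_search(SDB, ["a", "b"]))  # (True, 3)
print(sspsp_search(SDB, ["d", "a"]))  # (False, 0)
```

The benefit of the partitioning step shows up in the loop: every table whose sequences are shorter than the search sequence is skipped without inspecting any of its rows.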
REFERENCES

[1] Chetna Chand, Amit Thakkar, Amit Ganatra, Sequential Pattern Mining: Survey and Current Research Challenges, International Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307, Volume 2, Issue 1, March 2012.
[2] R. Agrawal and R. Srikant, Mining sequential patterns, in P. S. Yu and A. L. P. Chen, editors, Proceedings of the Eleventh International Conference on Data Engineering, Taipei, Taiwan, pages 3-14, IEEE Computer Society, 1995.
[3] Nizar R. Mabroukeh and C. I. Ezeife, A Taxonomy of Sequential Pattern Mining Algorithms, ACM Computing Surveys, Vol. 43, No. 1, Article 3, November 2010.
[4] Vishal S. Motegaonkar, Madhav V. Vaidya, A Survey on Sequential Pattern Mining Algorithms, International Journal of Computer Science and Information Technologies, Vol. 5 (2), 2014, 2486-2492.
[5] Tadeusz Morzy, Marek Wojciechowski, Maciej Zakrzewicz, Fast Discovery of Sequential Patterns Using Materialized Data Mining Views, Poznan University of Technology, Institute of Computing Science, ul. Piotrowo 3a, 60-965 Poznan, Poland.
[6] V. Uma, M. Kalaivany and G. Aghila, Survey of Sequential Pattern Mining Algorithms and an Extension to Time Interval Based Mining Algorithm, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 12, December 2013, ISSN: 2277 128X.