Mining Generalized Sequential Patterns using Genetic Programming


Sandra de Amo
Universidade Federal de Uberlândia, Faculdade de Computação
Uberlândia MG - Brazil
deamo@ufu.br

Ary dos Santos Rocha Jr.
Universidade Federal de Uberlândia, Faculdade de Engenharia Elétrica
Uberlândia MG - Brazil
ary@cripta.com.br

Abstract

We propose a new kind of sequential pattern, which we call Generalized Sequential Pattern, and we introduce the problem of mining generalized sequential patterns over temporal databases. A classical sequential pattern consists of a sequence of itemsets. This kind of pattern can be discovered in a database of customer transactions where each transaction consists of a transaction-id, a transaction time and the items bought in the transaction. Our generalized sequential pattern, on the other hand, consists of a sequence of SQL expressions and can be discovered in a large temporal database. We present the genetic algorithm SEG-GEN to solve the problem of mining generalized sequential patterns. We show that SEG-GEN performs better than the classical algorithm AprioriAll for mining simple sequential patterns when the minimum support threshold is low.

Keywords: Data Mining, Temporal Mining, Sequential Pattern, Genetic Programming

1 Introduction

The problem of discovering sequential patterns in temporal data has been extensively studied in several recent papers [3, 4, 5, 6, 9]. Its importance is fully justified by the great number of potential application domains where mining sequential patterns is a crucial issue, such as the financial market (evolution of stock market quotations), retailing (evolution of clients' purchases), medicine (variations of patients' symptoms), local weather forecasting, telecommunications (frequent sequences of alarms output by network switches), etc. Different kinds of sequential patterns have been proposed, as well as general formalisms and algorithms for expressing and mining them [7].
Roughly speaking, the problem of mining sequential patterns over a large amount of temporal data can be viewed as follows: (a) we are given a table of transactions Trans(IdCl, Time, Itemsets), where IdCl stands for the client identifier, Time is the time associated with the transaction and Itemsets is the set of items bought by the client IdCl at time Time; (b) we are interested in discovering which sequences of itemsets are frequently purchased by the clients. For instance, we could discover that 70% of clients buy a TV and a CD player, followed by a VCR and VCR tapes, followed by a DVD. The dataset may store other kinds of data: for instance, instead of clients and sets of items we could have patients and sets of symptoms. The problem of mining sequential patterns has already been treated in the past and a number of efficient algorithms have been proposed to solve it [3, 4, 6, 9].

In this paper, we introduce a new type of sequential pattern, which we call generalized sequential pattern. Roughly speaking, a generalized sequential pattern differs from a classical sequential pattern in that it can capture temporal regularities where different types of information evolve in time, in contrast with a classical sequential pattern, which is designed to capture regularities where

only one type of information (the one captured by the Itemsets attribute) evolves in time. A typical example of a generalized sequential pattern is: clients having a low income buy a Fiat and afterwards, having a high income, buy a Mercedes. Here the two attributes Income and Itemsets both specify information evolving in time. Another important point is that the sample dataset where generalized sequential patterns are discovered can contain several tables, and not only one table as in the classical case.

Related Work. In [3] the problem of mining sequential patterns was introduced and three different algorithms for mining them were presented. One of these algorithms, AprioriAll, finds all frequent sequential patterns, and its performance is better than or comparable to that of the other two. In [4], the GSP algorithm was introduced for mining sequential patterns; its performance is, in some cases, far better than AprioriAll's. The GSP algorithm allows the user to interact with the mining process by imposing constraints on the patterns to be discovered. These algorithms are based on the so-called Apriori Property or Antimonotony Property, which states that if a sequence is frequent then all its subsequences must be frequent as well. Other algorithms ([6, 9]), based on different principles, have been proposed to mine sequential patterns; most of them perform better than the Apriori family of algorithms (AprioriAll, GSP, etc.). None of these algorithms, however, has been designed to mine generalized sequential patterns over multiple tables. In [5], a more general kind of sequential pattern was introduced, the multidimensional sequential pattern, which involves a table with more than one non-key attribute, in contrast with the classical case, where only the non-key attribute Itemsets is allowed.
The main difference between our generalized sequential pattern and the multidimensional sequential pattern is that, in the former, all non-key attributes can depend on the time attribute, i.e., the values associated with these attributes differ from time to time, whereas in the latter only one non-key attribute, Itemsets, can depend on time; the other ones are fixed with respect to time and depend only on the non-temporal key attribute (IdCl).

The algorithm SEG-GEN that we propose for mining generalized sequential patterns uses genetic programming tools. The use of this technique is justified by the fact that all the existing methods for mining sequential patterns are designed to produce all frequent patterns, and so they involve a combinatorial process whose inherent cost is enormous regardless of the implementation techniques. A genetic programming technique makes it possible to solve the more complex problem of mining generalized sequential patterns in an approximate but satisfactory way. It is important to notice that genetic programming has already been used to solve other temporal mining problems, e.g., Temporal Constraint Mining [2] and Temporal Patterns of Time Series Events [8].

Our Contribution. In this paper we introduce a new type of temporal pattern which generalizes the classical sequential patterns of [3, 4] as well as the multidimensional sequential patterns of [5]. We also propose the genetic algorithm SEG-GEN to solve the problem of mining generalized sequential patterns. The paper is organized as follows: in Section 2, we give a formal description of the problem of mining generalized sequential patterns; in Section 3, we describe the algorithm SEG-GEN, which uses genetic programming tools for discovering such patterns.
We conclude the paper by presenting some experimental results and comparing the performance of SEG-GEN to that of the classical AprioriAll algorithm [3] in the particular case where the input database reduces to a single table of transactions and, consequently, the sequential patterns reduce to the classical ones.
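To make the classical setting recalled in this introduction concrete, the following minimal Python sketch computes the support of a sequence of itemsets over a table Trans(IdCl, Time, Itemsets). The sample rows and the helper names (`customer_sequences`, `contains`, `support`) are illustrative assumptions and do not come from [3]; support is taken here as the fraction of clients whose time-ordered transactions contain the pattern.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Itemset = FrozenSet[str]
Row = Tuple[int, int, Iterable[str]]  # (IdCl, Time, Itemsets)

def customer_sequences(trans: List[Row]) -> Dict[int, List[Itemset]]:
    """Group the rows by client and order each client's itemsets by time."""
    by_client: Dict[int, List[Tuple[int, Itemset]]] = {}
    for id_cl, time, items in trans:
        by_client.setdefault(id_cl, []).append((time, frozenset(items)))
    return {
        c: [itemset for _, itemset in sorted(rows, key=lambda r: r[0])]
        for c, rows in by_client.items()
    }

def contains(sequence: List[Itemset], pattern: List[Itemset]) -> bool:
    """True if each pattern itemset is included, in order, in a distinct transaction."""
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)

def support(trans: List[Row], pattern: List[Itemset]) -> float:
    """Fraction of clients whose sequence contains the pattern."""
    seqs = customer_sequences(trans)
    return sum(contains(s, pattern) for s in seqs.values()) / len(seqs)

# Hypothetical sample data, not taken from the paper.
TRANS: List[Row] = [
    (1, 1, {"TV", "CD Player"}), (1, 2, {"VCR", "VCR tapes"}), (1, 3, {"DVD"}),
    (2, 1, {"TV", "CD Player"}), (2, 5, {"DVD"}),
    (3, 2, {"VCR"}), (3, 1, {"TV", "CD Player", "Radio"}),
]
PATTERN: List[Itemset] = [frozenset({"TV", "CD Player"}), frozenset({"DVD"})]

print(support(TRANS, PATTERN))      # clients 1 and 2 support the pattern
print(support(TRANS, PATTERN[:1]))  # every client supports the one-itemset prefix
```

In this sample, two of the three clients support the full pattern while all three support its prefix, which is the behaviour the Apriori (antimonotony) property relies on: a subsequence is never less frequent than the sequence containing it.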

2 Problem Statement

We suppose the reader is familiar with the classical database terminology [1]. Let $R = \{R_1, \ldots, R_m\}$ be a database schema. If $A$ is an attribute, we denote by $\mathit{type}(A)$ the set of values which $A$ can take (for instance, $\mathit{type}(A)$ can be the set of integers, the set of all strings, etc.). A PL (Pattern Language) expression $E$ is either a SELECT-FROM-WHERE expression over $R$, where the condition in the WHERE clause is a boolean combination of simple conditions like A = B, A = a, A a, etc., or a union or intersection of such expressions. For more details on the specification language PL, see [2]. The schema of a PL expression $E$ (denoted by $\mathit{schema}(E)$) is the set of attributes appearing in the SELECT clause. If $I$ is a database instance and $E$ is a PL expression, we denote by $E(I)$ the set of answers of $E$ when applied to $I$. Two PL expressions $E_1$ and $E_2$ are said to be compatible if $\mathit{schema}(E_1) = \{A_1, \ldots, A_k\}$, $\mathit{schema}(E_2) = \{B_1, \ldots, B_k\}$ and $\mathit{type}(A_i) = \mathit{type}(B_i)$ for each $i = 1, \ldots, k$. A generalized sequential pattern over the database schema $R$ is a sequence $\langle E_1, E_2, \ldots, E_n \rangle$ of compatible PL expressions.

Example 2.1 Let R = {Client(CliCod, CliName, Income), Buy(CliCod, CarCod), Car(CarCod, Model, Year)} be a database schema. Then $\langle E_1, E_2 \rangle$ given below is a generalized sequential pattern over R:

    E_1:  SELECT Client.CliCod
          FROM Client, Buy, Car
          WHERE (Client.CliCod = Buy.CliCod) AND (Car.CarCod = Buy.CarCod)
            AND (Car.Model = 'Fiat') AND (Client.Income = 'low')

    E_2:  SELECT Client.CliCod
          FROM Client, Buy, Car
          WHERE (Client.CliCod = Buy.CliCod) AND (Car.CarCod = Buy.CarCod)
            AND (Car.Model = 'Mercedes') AND (Client.Income = 'high')

Following the snapshot approach, a temporal database $D$ is a sequence of database instances $D = (D_1, \ldots, D_n)$ over a database schema $R = \{R_1, \ldots, R_m\}$ (notice that we have $m$ different tables for each instant $i = 1, \ldots, n$). We denote by $\mathit{dom}(D)$ (the domain of $D$) the set of all elements appearing in the tables of $D_i$, for each $i = 1, \ldots, n$.
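As an illustration of the notation $E(I)$ and of the snapshot view of a temporal database, the sketch below evaluates the two expressions of Example 2.1 on two tiny instances using Python's standard `sqlite3` module. The sample rows, the helper names (`make_snapshot`, `answers`) and the use of SQLite are assumptions made for illustration; the query assumes that Client is joined to Buy on CliCod.

```python
import sqlite3

# Schema of Example 2.1.
SCHEMA = """
CREATE TABLE Client(CliCod INTEGER, CliName TEXT, Income TEXT);
CREATE TABLE Buy(CliCod INTEGER, CarCod INTEGER);
CREATE TABLE Car(CarCod INTEGER, Model TEXT, Year INTEGER);
"""

# Shape shared by E_1 and E_2; only the two constants differ.
SQL = (
    "SELECT Client.CliCod FROM Client, Buy, Car "
    "WHERE Client.CliCod = Buy.CliCod AND Car.CarCod = Buy.CarCod "
    "AND Car.Model = ? AND Client.Income = ?"
)

def make_snapshot(clients, buys, cars):
    """Build one database instance D_i as an in-memory SQLite database."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany("INSERT INTO Client VALUES (?, ?, ?)", clients)
    db.executemany("INSERT INTO Buy VALUES (?, ?)", buys)
    db.executemany("INSERT INTO Car VALUES (?, ?, ?)", cars)
    return db

def answers(db, constants):
    """E(I): the set of answer tuples of the expression on the instance."""
    return set(db.execute(SQL, constants).fetchall())

# Hypothetical data: two instants of a temporal database D = (D_1, D_2).
CARS = [(10, "Fiat", 1998), (20, "Mercedes", 2000)]
D1 = make_snapshot([(1, "Ana", "low"), (2, "Bruno", "low")], [(1, 10), (2, 10)], CARS)
D2 = make_snapshot([(1, "Ana", "high"), (2, "Bruno", "low")], [(1, 20), (2, 10)], CARS)

E1 = ("Fiat", "low")        # constants of E_1
E2 = ("Mercedes", "high")   # constants of E_2

print(answers(D1, E1))  # clients 1 and 2 satisfy E_1 at the first instant
print(answers(D2, E2))  # only client 1 satisfies E_2 at the second instant
```

Both expressions select the single attribute Client.CliCod, so their schemas have the same length and matching types, which is what the compatibility condition requires.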
Let us suppose we are given a temporal database $D = (D_1, \ldots, D_n)$ and a generalized sequential pattern $\sigma = \langle E_1, \ldots, E_m \rangle$, with $m \le n$. Let $u$ be a tuple over $\mathit{schema}(E_1)$ (as the expressions $E_i$ have compatible schemas, $u$ is also a tuple over $\mathit{schema}(E_j)$ for $j = 2, \ldots, m$). We say that $u$ supports $\sigma$ w.r.t. the dataset $D$ if there exist $j_1, \ldots, j_m$ such that

    $u \in E_1(D_{j_1}) \cap E_2(D_{j_2}) \cap \cdots \cap E_m(D_{j_m})$.

Let $N$ be the number of tuples over $\mathit{schema}(E_1)$ taking values in $\mathit{dom}(D)$. We define the support of $\sigma$ w.r.t. $D$ (denoted by $\mathit{sup}(\sigma, D)$) as:

    $\mathit{sup}(\sigma, D) = \dfrac{|\{u \mid u \text{ supports } \sigma\}|}{N}$

We say that $\sigma$ is frequent w.r.t. the dataset $D$ if $\mathit{sup}(\sigma, D) \ge \alpha$, where $\alpha$ is a given threshold, $0 \le \alpha \le 1$.

The problem of mining generalized sequential patterns can be stated as follows: given a temporal database $D$ over a database schema $R$ and a threshold $\alpha$ such that $0 \le \alpha \le 1$, find the frequent generalized sequential patterns with respect to $D$ and $\alpha$.

3 The Algorithm SEG-GEN

Before presenting the algorithm SEG-GEN, which mines generalized sequential patterns over a temporal dataset $D$, we first define the usual genetic programming concepts used in the algorithm. A chromosome is a generalized sequential pattern as defined in Section 2. A population is a set of chromosomes. The mutation operation is performed over a chromosome $\sigma$ in the following way: an arbitrary element (an attribute, a constant, a relation) of an expression $E$ appearing in $\sigma$ is chosen and afterwards replaced by a different element of the same kind. The reproduction operation is defined as usual. The crossing operation is performed over two chromosomes as follows: arbitrary positions $i$, $j$ are chosen in the first and second chromosomes respectively. The portion of the first chromosome on the left side of $i$ is exchanged with the portion of the second chromosome on the right side of $j$, and vice versa. The fitness of a chromosome $\sigma$ is measured by its support.

Now we are ready to give an informal description of the algorithm SEG-GEN. For lack of space, the implementation details of the different procedures used in the algorithm are omitted.

    Procedure SEG-GEN
    Input: maxgen (the maximal number of generations),
           perfit (optimal percentage of fit chromosomes),
           minsup (support threshold).

    GENPOP;      % generates the initial population
    SUPCAL;      % calculates the support of each chromosome
    ORDERPOP;    % orders the population by increasing support
    N := number of chromosomes;
    U := number of unfit chromosomes;
    G := 1;
    p := U / N;
    while ((G <= maxgen) and (p < perfit)) do
        Choose one of the following operations:
            CROSSING or MUTATION or REPRODUCTION;
        ORDERPOP;
        G := G + 1;
        U := number of unfit chromosomes;
        p := U / N;
    BACKTRACKING;

At each generation, unfit chromosomes are stored in a table; afterwards they are used in the BACKTRACKING procedure, which gives these chromosomes the opportunity to produce some other frequent patterns.

4 Experimental Results

In this section, we present some experimental results of SEG-GEN with synthetic data sets. For this purpose, we have created 24 different databases. The synthetic data have been produced by an algorithm based on the ideas of [3]. The tests have been performed on an Intel Pentium III 650 MHz workstation with 128 MB of main memory, running Microsoft Windows 2000 Professional.
Data was stored on a 20 GB Quantum AT Fireball LCT hard disk and was accessed via ODBC. Two groups of tests have been executed: the first group concerns the execution of SEG-GEN over multiple tables; the second one concerns the execution of SEG-GEN over a single table, where the mined sequences are simple sequential patterns as in [3]. The objective of this second group of tests is to study the relative performance of SEG-GEN and AprioriAll. For lack of space, we present here only the results for some selected databases.

First Group: Figure 1 shows that the execution time of SEG-GEN increases as the support is decreased. However, for support values between 0.33 and 0.5 the increase in execution time is small. The parameters used are shown in Table 1. Figure 2 shows the results of SEG-GEN executed over 4 different datasets, with distinct total numbers of records in their tables. The parameters used are shown in Table 2. We can verify that the execution time of SEG-GEN scales almost linearly.

Second Group: Figure 3 shows the executions of SEG-GEN and AprioriAll over the same dataset, where the minimum support is decreased starting from 0.75. The parameters used are shown in Table 3. We can notice that for values of minimum support smaller than 0.07, SEG-GEN is faster than AprioriAll. Figure 4 shows the executions of the two algorithms over several datasets. The parameters used are shown in Table 4. These results show that AprioriAll is faster than SEG-GEN, but the difference between them decreases as the number of records in the dataset increases.
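The support measure of Section 2 and the generational loop of Section 3 can be combined in a small Python sketch. It is not the authors' implementation, whose details are omitted in the paper, and it rests on several stated assumptions: every snapshot is a single table keyed by client, an expression is a conjunction of attribute = value conditions, the instants $j_1 < \cdots < j_m$ are taken to be strictly increasing, $N$ is simplified to the number of distinct clients, the crossing is reduced to a one-point variant, the loop stops once the fraction of chromosomes reaching minsup attains perfit, and BACKTRACKING is left out. All data and function names are hypothetical.

```python
import random

# Hypothetical temporal database D = (D_1, D_2, D_3): one snapshot per instant,
# each mapping a client id to that client's attribute values at that instant.
D = [
    {1: {"income": "low", "model": "Fiat"}, 2: {"income": "low", "model": "Fiat"},
     3: {"income": "high", "model": "Mercedes"}, 4: {"income": "low", "model": "Fiat"}},
    {1: {"income": "high", "model": "Mercedes"}, 2: {"income": "low", "model": "Fiat"},
     3: {"income": "high", "model": "Mercedes"}, 4: {"income": "low", "model": "Fiat"}},
    {1: {"income": "high", "model": "Mercedes"}, 2: {"income": "high", "model": "Mercedes"},
     3: {"income": "low", "model": "Fiat"}, 4: {"income": "low", "model": "Fiat"}},
]
DOMAINS = {"income": ["low", "high"], "model": ["Fiat", "Mercedes"]}
PATTERN = [{"income": "low", "model": "Fiat"}, {"income": "high", "model": "Mercedes"}]

def answers(expr, snapshot):
    """E(D_j): clients whose row satisfies every attribute = value condition."""
    return {c for c, row in snapshot.items()
            if all(row.get(a) == v for a, v in expr.items())}

def supports(client, pattern, db):
    """Greedy search for instants j_1 < ... < j_m with client in E_k(D_{j_k})."""
    k = 0
    for snapshot in db:
        if k < len(pattern) and client in answers(pattern[k], snapshot):
            k += 1
    return k == len(pattern)

def support(pattern, db):
    """Fraction of clients supporting the pattern (N simplified to the client count)."""
    clients = set().union(*(s.keys() for s in db))
    return sum(supports(c, pattern, db) for c in clients) / len(clients)

def random_expr(rng):
    attrs = rng.sample(sorted(DOMAINS), rng.randint(1, len(DOMAINS)))
    return {a: rng.choice(DOMAINS[a]) for a in attrs}

def mutate(pattern, rng):
    """Replace one constant of one expression by a different value of the same attribute."""
    child = [dict(e) for e in pattern]
    expr = rng.choice(child)
    attr = rng.choice(sorted(expr))
    expr[attr] = rng.choice([v for v in DOMAINS[attr] if v != expr[attr]])
    return child

def cross(p1, p2, rng):
    """Simplified one-point crossing: a prefix of p1 followed by a suffix of p2."""
    i, j = rng.randint(1, len(p1)), rng.randint(0, len(p2) - 1)
    return p1[:i] + p2[j:]

def seg_gen(db, maxgen=50, perfit=0.5, minsup=0.3, popsize=20, seed=0):
    rng = random.Random(seed)
    pop = [[random_expr(rng) for _ in range(rng.randint(1, len(db)))]
           for _ in range(popsize)]
    for _ in range(maxgen):
        pop.sort(key=lambda p: support(p, db))  # increasing support, as in ORDERPOP
        n_fit = sum(support(p, db) >= minsup for p in pop)
        if n_fit / len(pop) >= perfit:
            break
        half = len(pop) // 2
        parents, children = pop[half:], []      # the better half breeds
        for _ in range(half):
            op = rng.choice(["crossing", "mutation", "reproduction"])
            if op == "crossing":
                child = cross(rng.choice(parents), rng.choice(parents), rng)[:len(db)]
            elif op == "mutation":
                child = mutate(rng.choice(parents), rng)
            else:
                child = [dict(e) for e in rng.choice(parents)]
            children.append(child)
        pop = children + parents
    return [p for p in pop if support(p, db) >= minsup]

print(support(PATTERN, D))  # 0.5: clients 1 and 2 out of 4 support the pattern
for found in seg_gen(D):
    print(round(support(found, D), 2), found)
```

Because the search is randomized, the patterns returned depend on the seed, and unlike AprioriAll the loop gives no guarantee of finding every frequent pattern; this is the approximate behaviour that the paper accepts in exchange for avoiding exhaustive enumeration.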



Table 1: Parameters used in Figure 1

    Number of tables       2
    Number of records      1000
    Number of patterns     100
    Number of instances    12

Table 2: Parameters used in Figure 2

    Number of tables       2
    Number of patterns     100
    Number of instances    12
    Support                33%

Table 3: Parameters used in Figure 3

    Number of items        1000
    Number of itemsets     2500
    Number of patterns     3000
    Number of records      2500

Table 4: Parameters used in Figure 4

    Number of items        1000
    Number of itemsets     2500
    Number of patterns     3000
    Support                33%

Figure 1    Figure 2    Figure 3    Figure 4

References

[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley.
[2] Sandra de Amo, Márcia Fernandes, Flávio Silva, and João Nunes. Mining temporal constraints in databases using genetic programming. XV Simpósio Brasileiro de Banco de Dados SBBD 2000, João Pessoa, Brazil, October 2000, pages 172-187.
[3] R. Agrawal and R. Srikant. Mining sequential patterns. Research Report RJ 9910, IBM Almaden Research Center, San Jose, California, October.
[4] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proc. of the Fifth Int'l Conference on Extending Database Technology (EDBT), Avignon, France, March 1996.
[5] H. Pinto, J. Han, J. Pei, K. Wang, Q. Chen, and U. Dayal. Multi-dimensional sequential pattern mining. In Proc. Int. Conf. on Information and Knowledge Management (CIKM'01), Atlanta, November.
[6] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. In Proc. Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, August.
[7] M. V. Joshi, G. Karypis, and V. Kumar. A Universal Formulation of Sequential Patterns. Technical Report, Department of Computer Science, University of Minnesota.
[8] R. J. Povinelli. Using Genetic Algorithms to Find Temporal Patterns Indicative of Time Series Events. In Proceedings of Genetic and Evolutionary Computation Conference (GECCO-2000) Workshop Program, Las Vegas, Nevada, 2000.
[9] Mohammed J. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning Journal, special issue on Unsupervised Learning (Doug Fisher, ed.), Vol. 42, Nos. 1/2.
