Incremental and Interactive Sequence Mining


S. Parthasarathy, M. J. Zaki, M. Ogihara, S. Dwarkadas
Computer Science Dept., University of Rochester, Rochester, NY 14627
Computer Science Dept., Rensselaer Polytechnic Institute, Troy, NY 12180

Contact author: S. Parthasarathy, srini@cs.rochester.edu. This work was supported in part by NSF and DARPA grants, and by an external research grant from Digital Equipment Corporation.

Abstract

The discovery of frequent sequences in temporal databases is an important data mining problem. Most current work assumes that the database is static, and a database update requires rediscovering all the patterns by scanning the entire old and new database. In this paper, we propose novel techniques for maintaining sequences in the presence of a) database updates, and b) user interaction (e.g., modifying mining parameters). This is a very challenging task, since such updates can invalidate existing sequences or introduce new ones. In both of the above scenarios, we avoid re-executing the algorithm on the entire dataset, thereby reducing execution time. Experimental results confirm that our approach yields execution time improvements of up to several orders of magnitude in practice.

1 Introduction

Sequence mining is an important data mining task, where one attempts to discover frequent sequences, over time, of attribute sets in large databases. This problem was originally motivated by applications in the retail industry, including attached mailing, add-on sales, and customer satisfaction. It also applies to many scientific and business domains. For instance, in the health care industry it can be used for predicting the onset of disease from a sequence of symptoms, and in the financial industry it can be used for predicting investment risk based on a sequence of stock market events. Discovering all frequent sequences in a very large database can be very compute and I/O intensive, because the search space size is essentially exponential in the length of the longest transaction sequence in the database. This high computational cost may be acceptable when the database is static, since the discovery is done only once, and several approaches to this problem have been presented in the literature. However, many domains such as electronic commerce, stock analysis, and collaborative surgery impose soft real-time constraints on the mining process. In such domains, where the databases are updated on a regular basis and user interactions modify the search parameters, running the discovery program all over again is infeasible. Hence, there is a need for algorithms that maintain valid mined information across i) database updates, and ii) user interactions (modifying/constraining the search space). Although the incremental [2, 3, 11] and interactive [1] mining problems have been studied for association rules, no work has been done on incremental and interactive sequential patterns.

In this paper, we present a method for incremental and interactive sequence mining. Our goal is to minimize the I/O and computation requirements for handling incremental updates. Our algorithm accomplishes this goal by maintaining information about maximally frequent and minimally infrequent sequences. When incremental data arrives, the incremental part is scanned once to incorporate the new information. The new data is combined with the maximal and minimal information in order to determine the portions of the original database that need to be re-scanned.
This process is aided by the use of a vertical database layout, where attributes are associated with the list of transactions in which they occur. The result is an improvement in execution time of up to several orders of magnitude in practice, both for handling increments to the database and for handling interactive queries.

The rest of the paper is organized as follows. In Section 2, we formulate the sequence discovery problem. In Section 3, we describe the SPADE algorithm upon which we build our incremental approach. Section 4 describes our incremental sequence mining algorithm. In Section 5, we describe how we support online querying. An experimental evaluation is presented in Section 6. We discuss related work in Section 7, and conclude in Section 8.

2 Problem Formulation

In this section, we define the incremental sequence mining problem that this paper is concerned with. We begin by defining the notation we use. Let the items, denoted I, be the set of all possible attributes. We assume a fixed enumeration of all members of I and identify the items with their indices in the enumeration. An itemset is a set of items, and is denoted by the enumeration of its elements in increasing order.

[Figure 1: Original Database. The example database of four customers over the items A, B, and C, together with its sequence lattice, frequent set, and negative border.]

[Figure 2: Original plus Incremental Database and Lattice.]

For an itemset i, its size, denoted |i|, is the number of elements in it. An itemset of size k is called a k-itemset. A sequence is an ordered list (ordered in time) of nonempty itemsets. A sequence of itemsets α_1, ..., α_n is denoted by (α_1 → ... → α_n). The length of a sequence is the sum of the sizes of its itemsets. For each integer k, a sequence of length k is called a k-sequence. A sequence α is a subsequence of a sequence β, denoted α ⪯ β, if α can be constructed from β by striking out some (or none) of the items in β, and then eliminating any empty itemsets that result. For example, A → C is a subsequence of AB → E → BCD. We say that α is a proper subsequence of β, denoted α ≺ β, if α ⪯ β and α ≠ β. For k ≥ 2, the generating subsequences of a length-k sequence α are the two length-(k−1) subsequences of α obtained by dropping exactly one of its first or second items. By definition, the generating subsequences share a common suffix of length k−2. For example, the two generating subsequences of AB → CD → E are A → CD → E and B → CD → E, and they share the common suffix CD → E. A sequence is maximal in a collection C of sequences if it is not a subsequence of any other sequence in C.

Our database is a collection of customers, each with a sequence of transactions, each of which is an itemset. For a database D and a sequence α, the support or frequency of α in D, denoted support_D(α), is the number of customers in D whose sequences contain α as a subsequence. The minimum support, denoted min_support, is a user-specified threshold used to define frequent sequences: a sequence is frequent in D if its support in D is at least min_support. A rule involving a sequence A and a sequence B is said to have confidence c if c% of the customers that contain A also contain B.

Suppose that new data δ is to be added to a database D. Then we call D the original database and δ the incremental database. The updated database is denoted by D + δ. For each k, F_k denotes the collection of all frequent sequences of length k in the updated database. Also, FS denotes the set of all frequent sequences in the updated database. The negative border (NB) is the collection of all sequences that are not frequent but both of whose generating subsequences are frequent. By the old sequences, we mean the set of all frequent sequences in the original database, and by the new sequences, we mean the set of all frequent sequences in the join of the original database and the increment.

For example, consider the customer database shown in Figure 1. The database has three items (A, B, C) and four customers. The figure also shows the Increment Sequence Lattice (ISL) with all the frequent sequences (the frequency is shown with each node) and the negative border, when a minimum support of 75%, or 3 customers, is used. For each frequent sequence, the figure shows its two generating subsequences in bold lines.
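To make these definitions concrete, here is a minimal Python sketch (not from the paper) of the subsequence test and the support count over a toy customer database. The representation (sequences as lists of frozensets), the names is_subsequence and support, and the example data are illustrative assumptions only.

def is_subsequence(alpha, beta):
    """True if sequence alpha (a list of itemsets) is a subsequence of beta."""
    i = 0
    for itemset in beta:
        if i < len(alpha) and alpha[i] <= itemset:   # itemset containment, in time order
            i += 1
    return i == len(alpha)

def support(db, alpha):
    """Number of customers whose sequence contains alpha as a subsequence."""
    return sum(1 for seq in db.values() if is_subsequence(alpha, seq))

# Toy database in the spirit of Figure 1: cid -> time-ordered list of itemsets.
db = {
    1: [frozenset("AB"), frozenset("B")],
    2: [frozenset("B"), frozenset("AC")],
    3: [frozenset("AB"), frozenset("BC")],
    4: [frozenset("A"), frozenset("C")],
}
print(support(db, [frozenset("A"), frozenset("C")]))   # support of A -> C is 2 here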
Figure 2 shows how the frequent set and the negative border change when we mine over the combined original and incremental database (highlighted in dark grey).

For example, C is not frequent in the original database D, but C (along with some of its supersequences) becomes frequent after the update δ. The update also causes some elements to move from NB to the new FS.

Incremental Sequence Discovery Problem: Given an original database D of sequences and an increment δ to the database, find all frequent sequences in the database D + δ, with minimum possible recomputation and I/O.

3 The SPADE Algorithm

In this section we describe SPADE [13], an algorithm for the fast discovery of frequent sequences, which forms the basis for our incremental algorithm.

Sequence Lattice: SPADE uses the observation that the subsequence relation defines a partial order on the set of sequences, i.e., if β is a frequent sequence, then all subsequences α ⪯ β are also frequent. The algorithm systematically searches the sequence lattice spanned by the subsequence relation, from the most general (single items) to the most specific frequent sequences (maximal sequences), in a depth-first manner. For instance, in Figure 1, the bold lines correspond to the lattice for the example dataset.

Support Counting: Most current sequence mining algorithms [9] assume a horizontal database layout such as the one shown in Figure 1. In the horizontal format, the database consists of a set of customers (cids). Each customer has a set of transactions (tids), along with the items contained in each transaction. In contrast, we use a vertical database layout, where we associate with each item X in the sequence lattice its idlist, denoted L(X), which is a list of all customer (cid) and transaction identifier (tid) pairs containing the item. For example, the idlist for the item C in the original database (Figure 1) consists of one <cid, tid> pair for each transaction that contains C. Given the per-item idlists, we can iteratively determine the support of any k-sequence from the idlists of two of its (k−1)-length subsequences. In particular, we combine (intersect) the two (k−1)-length subsequences that share a common suffix (the generating sequences) to compute the support of a new k-length sequence. A simple check on the support of the resulting idlist tells us whether the new sequence is frequent or not.

If we had enough main memory, we could enumerate all the frequent sequences by traversing the lattice and performing intersections to obtain sequence supports. In practice, however, we only have a limited amount of main memory, and all the intermediate idlists will not fit in memory. SPADE breaks up this large search space into small, manageable chunks that can be processed independently in memory. This is accomplished via suffix-based equivalence classes (henceforth denoted simply as classes). We say that two k-length sequences are in the same class if they share a common (k−1)-length suffix. The key observation is that each class is a sub-lattice of the original sequence lattice and can be processed independently. Each suffix class is independent in the sense that it has complete information for generating all frequent sequences that share the same suffix (see Zaki [13]). For example, if a class [X] has the elements Y → X and Z → X as its only sequences, the only possible frequent sequences at the next step are Y → Z → X, Z → Y → X, and (YZ) → X. It should be obvious that no other item Q can lead to a frequent sequence with the suffix X, unless (QX) or Q → X is also in [X].
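As a concrete illustration of support counting over idlists, here is a simplified Python sketch of a temporal join that builds an idlist for the 2-sequence A → C from the per-item idlists of A and C, and counts distinct cids for the support. This is an assumption-laden illustration rather than SPADE's actual join machinery; the names temporal_join and support_of and the sample idlists are hypothetical.

def temporal_join(idlist_y, idlist_x):
    """(cid, tid) pairs of X occurrences preceded by a Y occurrence for that cid."""
    first_y = {}                                # cid -> earliest tid at which Y occurs
    for cid, tid in idlist_y:
        first_y[cid] = min(tid, first_y.get(cid, tid))
    return [(cid, tid) for cid, tid in idlist_x
            if cid in first_y and first_y[cid] < tid]

def support_of(idlist):
    return len({cid for cid, _ in idlist})      # support counts distinct customers

L = {"A": [(1, 10), (1, 20), (2, 15), (3, 10)],
     "C": [(1, 30), (2, 15), (3, 20)]}
ac = temporal_join(L["A"], L["C"])              # idlist of the 2-sequence A -> C
print(ac, support_of(ac))                       # [(1, 30), (3, 20)] 2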
BEGIN Enumerate-Frequent-Seq([S]):
    for all elements A_i in [S] do
        [A_i] = ∅;
        for all elements A_j in [S] do
            R = A_j ∪ A_i;  /* sequence R formed from the generating subsequences A_j and A_i, with A_i as the suffix */
            idlist(R) = idlist(A_j) ∩ idlist(A_i);
            if support(idlist(R)) ≥ min_sup then [A_i] = [A_i] ∪ {R};
    for all [A_i] do Enumerate-Frequent-Seq([A_i]);
END Enumerate-Frequent-Seq([S])

Figure 3: Enumerating Frequent Sequences

SPADE recursively decomposes the sequences at each new level into even smaller independent classes. For instance, at level one it uses suffix classes of length one (X, Y), at level two it uses suffix classes of length two (X → Y, XY), and so on. We refer to the level-one suffix classes as parent classes. These suffix classes are processed one by one. Figure 3 shows the pseudo-code (simplified for exposition; see [13] for exact details) for the main procedure of the SPADE algorithm. The input to the procedure is a class, along with the idlist for each of its elements. Frequent sequences are generated by intersecting the idlists of all distinct pairs of sequences in each class and checking the support of the resulting idlist against min_sup. The sequences found to be frequent at the current level form classes for the next level. This level-wise process is recursively repeated until all frequent sequences have been enumerated. In terms of memory management, it is easy to see that we need memory to store intermediate idlists for at most two consecutive levels. Once all the frequent sequences for the next level have been generated, the sequences at the current level can be deleted. For more details on SPADE, see [13].

4 Incremental Mining Algorithm

Our purpose is to minimize re-computation and re-scanning of the original database when mining sequences in the presence of increments to the database (the increments are assumed to be appended to the database, i.e., later in time). To accomplish this, we use an efficient memory management scheme that indexes into the database efficiently, and we create an Increment Sequence Lattice (ISL), exploiting its properties to prune the search space for potential new sequences. The ISL consists of all elements in the negative border and the frequent set, and is initially constructed using SPADE. In the ISL, the children of each nonempty sequence are its generating subsequences. Each node of the ISL contains the support for the given sequence.

PHASE 1:
    compute D'(i) and D''(i) for all items i
    for all items i in I' do Q.enqueue(i);
    while (Q is not empty)
        p = Q.dequeue(); Compute_Support(p);
        if (support(p) ≥ min_sup)
            k = length(p);
            if (p is in the negative border)
                NB-to-FS[k].enqueue(p);
            else if (D'(p) ≠ ∅)
                for all (k+1)-sequences S in the ISL that are generating ascendants of p
                    Q.enqueue(S);

PHASE 2:
    for each item i in NB-to-FS[1]
        construct the suffix class [i];
        NB-to-FS[2].enqueue([i]);
    for (k = 2, 3, ...)
        for each class C in NB-to-FS[k]
            Enumerate-Frequent-Seq(C);

Compute_Support(p):
    A = one generating subsequence of p;
    B = the other generating subsequence of p;
    support_D'(p) = intersect(D'(A), D'(B));
    support_D''(p) = intersect(D''(A), D''(B));
    support(p) = support(p) + support_D'(p) - support_D''(p);

Figure 4: The ISM Algorithm

Theory of Incremental Sequences

Let C', T', and I' be the sets of all cids, tids, and items, respectively, that appear in the incremental part δ. Define D' to be the set of all records (in D + δ) whose cid is in C', and let D'' = D' \ δ. For the sake of simplicity, assume that no new customer is added to the database. This implies that infrequent sequences can become frequent, but not the other way around. We use the following properties of the lattice to efficiently perform incremental sequence mining. By set inclusion-exclusion, we can update the support of a sequence in FS or NB based on its support in D' and D'':

support_{D+δ}(X) = support_D(X) + support_{D'}(X) - support_{D''}(X)    (1)

This allows us to compute the support at any node in the lattice quickly, by limiting re-access to the original database to D''.

Proposition 1: For every sequence X, if support_{D+δ}(X) > support_D(X), then the last item of X belongs to I'. This allows us to limit the nodes in the ISL that are re-examined to those whose last item is in I'.

We call a sequence Y a generating descendant of X if there exists a list [Z_1, Z_2, ..., Z_m] of sequences such that Z_1 = Y, Z_m = X, and for every i, 1 ≤ i < m, Z_i is a generating subsequence of Z_{i+1}. We show that if a sequence has become a member of FS ∪ NB in the updated database, but was not a member before the update, then one of its generating descendants was in NB and is now in FS.

Proposition 2: Let X be a sequence of length at least 2. If X is in FS_{D+δ} ∪ NB_{D+δ} but not in FS_D ∪ NB_D, then X has a generating descendant in NB_D ∩ FS_{D+δ}.

Proof: The proof is by induction on k, the length of X. Let X_1 and X_2 be the two generating subsequences of X. Note that if both X_1 and X_2 belong to FS_D, then X is in FS_D ∪ NB_D, which contradicts our assumption. Therefore, either X_1 or X_2 is not in FS_D. For the base case, k = 2: since X_1 and X_2 are of length 1, by definition both belong to FS_D ∪ NB_D, and by the above, at least one of them must be in NB_D. For X to be in FS_{D+δ} ∪ NB_{D+δ}, X_1 and X_2 must be in FS_{D+δ} by definition. Thus the claim holds, since either X_1 or X_2 must be in NB_D ∩ FS_{D+δ}, and they are generating descendants of X. For the induction step, suppose k > 2 and that the claim holds for all lengths smaller than k. First suppose X_1 and X_2 are both in FS_D ∪ NB_D. Then either X_1 or X_2 is in NB_D. We know X ∈ FS_{D+δ} ∪ NB_{D+δ}, so X_1 and X_2 belong to FS_{D+δ}. Since X_1 and X_2 are generating subsequences of X, the claim holds for X. Finally, consider the case where X_1 or X_2 is not in FS_D ∪ NB_D. We know that, since X ∈ FS_{D+δ} ∪ NB_{D+δ}, both X_1 and X_2 belong to FS_{D+δ} ∪ NB_{D+δ}. Now suppose that X_1 ∉ FS_D ∪ NB_D. Since X_1 is in FS_{D+δ} ∪ NB_{D+δ} and has length less than k, the induction hypothesis gives the claim for X_1. Let Y be a generating descendant satisfying the claim for X_1. Since X_1 is a generating subsequence of X, Y is also a generating descendant of X. Thus the claim holds for k. The same argument applies when X_2 ∉ FS_D ∪ NB_D.
Proposition 2 limits the additional sequences (not found in the original ISL) that need to be examined in order to update the ISL.

Memory Management

SPADE simply requires per-item idlists. For incremental mining, in order to limit accesses to the customers and items in the increment, we use a two-level disk-file indexing scheme. Since the number of customers is unbounded, we use the hashing mechanism described below. The vertical database is partitioned into a number of blocks such that each individual block fits in memory. Each block contains the vertical representation of all transactions involving a set of customers. Within each block there is an item dereferencing array, pointing to the first entry for each item. Given a customer and an item, we first identify the block containing the customer's transactions using a first-level cid-index (a hash function). The second-level item-index then locates the item within the given block. After this, we perform a linear search for the exact customer identifier. Using this two-level indexing scheme, we can quickly jump to only that portion of the database that will be affected by the update, without having to touch the entire database. Note that with the vertical data format we are able to efficiently retrieve the cids for all affected items without touching the entire database. This is not possible in the horizontal format, since a given item can appear in any transaction, which can only be found by scanning the entire data.
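The following Python sketch illustrates the two-level lookup just described: a hash on the cid picks the block that holds a customer's transactions, and a per-block item index points at that item's entries inside the block, followed by a linear scan on the customer identifier. It is a hedged, in-memory illustration only; the class and function names (Block, block_of, lookup) and the block count are assumptions, not the paper's implementation.

class Block:
    def __init__(self):
        # item -> list of (cid, tid) entries, i.e., the vertical layout of this block;
        # this dictionary plays the role of the item dereferencing array
        self.item_entries = {}

NUM_BLOCKS = 4
blocks = [Block() for _ in range(NUM_BLOCKS)]

def block_of(cid):
    return blocks[hash(cid) % NUM_BLOCKS]        # first-level cid-index

def add_record(cid, tid, item):
    block_of(cid).item_entries.setdefault(item, []).append((cid, tid))

def lookup(cid, item):
    """Entries of `item` for customer `cid`, touching only one block."""
    entries = block_of(cid).item_entries.get(item, [])   # second-level item-index
    return [(c, t) for c, t in entries if c == cid]      # final linear scan on cid

add_record(7, 10, "C"); add_record(7, 30, "A"); add_record(9, 10, "C")
print(lookup(7, "C"))                            # -> [(7, 10)]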

The Incremental Sequence Mining (ISM) Algorithm

Our incremental algorithm maintains the Increment Sequence Lattice (ISL), which consists of all the frequent sequences and all sequences in the negative border of the original database. The support of each member is also kept in the lattice. There are two properties of increments we are concerned with: whether new customers are added and whether new transactions are added. We first check whether a new customer has been added. If so, the minimum support, in terms of the absolute number of customers, is raised. We examine the entire ISL from the 1-sequences towards longer and longer sequences to determine where each sequence now belongs. More precisely, for each sequence X that has been reclassified from frequent to infrequent, if its two generating subsequences are still frequent we make X a negative border element; otherwise, X is eliminated from the lattice. Then we default to the algorithm described below (see Figure 4).

The algorithm consists of two phases. Phase 1 updates the supports of elements in NB and FS, and Phase 2 adds to NB and FS beyond what was done in Phase 1. To describe the algorithm, for a sequence p we denote by D'(p) and D''(p) the vertical id-list of p in D' and in D'', respectively. Phase 1 begins by generating the single-item components of the ISL: for each single-item sequence p, we compute D'(p) and D''(p) and put p into a queue Q, which is empty at the beginning of the computation. Then we repeat the following until Q is empty. We dequeue one element p from Q and update its support using the subroutine Compute_Support, which computes the support based on Equation (1). Once the support is updated, if the sequence p (of length k) is in the frequent set, all length-(k+1) sequences that are already in the ISL and that are generating ascendants of p are queued into Q. If the sequence p is in the negative border and its updated support indicates that it is frequent, then this element is placed in NB-to-FS[k]. At the end of Phase 1, we have exact and up-to-date supports for all elements in the ISL. We further have a list of elements that were in the negative border but have become frequent as a result of the database increment (in NB-to-FS). In the example of Figures 1 and 2, the supports of six elements, including the item C, were updated; of these, three elements, among them C, moved from the negative border to the frequent set.

We next describe Phase 2 (see Figure 4). At the end of Phase 1, NB-to-FS is a list (an array indexed by length) of hash tables containing the elements that have moved from NB to FS. By Proposition 2, these are the only suffix-based classes we need to examine. For each 1-sequence that has moved, we intersect it with all other frequent 1-sequences, and we add the resulting frequent 2-sequences to the queue NB-to-FS[2] for further processing. In our running example of Figures 1 and 2, two frequent 2-sequences involving C are added to the NB-to-FS[2] table; at the same time, all other evaluated 2-sequences involving C that were not frequent are placed in NB_{D+δ}. The next step in Phase 2 is, starting with the hash table containing the length-2 sequences, to pick an element that has not yet been processed and create the list of frequent sequences, along with their associated id-lists from D + δ, in its equivalence class. The resulting equivalence class is then passed to Enumerate-Frequent-Seq, which adds any new frequent sequences and new negative border elements, along with their associated elements, to the ISL. We repeat this until all the NB-to-FS tables are empty.
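Before turning to an example, here is a minimal sketch of the Phase 1 support update from Equation (1), assuming idlists restricted to the affected customers' full data (D') and old data (D'') are available. The names update_support and distinct_cids and the sample idlists are illustrative, not the paper's code.

def distinct_cids(idlist):
    return len({cid for cid, _ in idlist})

def update_support(old_support_D, idlist_D1, idlist_D2):
    # support_{D+delta}(p) = support_D(p) + support_{D'}(p) - support_{D''}(p)
    return old_support_D + distinct_cids(idlist_D1) - distinct_cids(idlist_D2)

old_support = 2                             # support of p in the original database D
D1 = [(2, 40), (3, 15)]                     # occurrences over D' (old + new data of affected customers)
D2 = [(3, 15)]                              # occurrences over D'' (their old data only)
print(update_support(old_support, D1, D2))  # -> 3: customer 2 now contains p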
As an example of Phase 2, consider the equivalence class associated with one of the 2-sequences involving C that moved to the frequent set. From Figures 1 and 2 we see that its suffix class contains only one other frequent sequence. As both of these sequences are frequent, they are placed in FS_{D+δ}. Recursively enumerating the frequent sequences of this class adds its frequent 3-sequences to FS_{D+δ}; similarly, the evaluated sequences of the class that turn out to be infrequent are added to NB_{D+δ}.

5 Interactive Sequence Mining

The idea in interactive sequence mining is that an end user is allowed to query the database for association rules at differing values of support and confidence. The goal is to allow such interaction without excessive I/O or computation. Interactive usage of the system normally involves a lot of manual tuning of parameters and resubmission of queries, which can be very demanding on the memory subsystem of the server. In most current algorithms, multiple passes have to be made over the database for each <support, confidence> pair. This leads to unacceptable response times for online queries. Our approach to supporting such queries efficiently is to create pre-processed summaries that can quickly answer online queries. A typical set of queries that such a system could support includes: i) Simple Queries: identify the rules for support x% and confidence y%; ii) Refined Queries: the same procedure with the support value modified (x + y or x − y); iii) Quantified Queries: identify the k most important rules in terms of (support, confidence) pairs, or find for which support/confidence values exactly k rules can be generated; iv) Including Queries: find the rules including itemsets i_1, ..., i_n; v) Excluding Queries: compute the rules excluding itemsets i_1, ..., i_n; and vi) Hierarchical Queries: treat items i_1 (e.g., coke), ..., i_n (e.g., pepsi) as one item (e.g., cola) and return the new rules.

Our approach is to adapt the Increment Sequence Lattice. The preprocessing step of the algorithm computes such a lattice for a small enough support S_min, such that all future queries will involve a support S larger than S_min. In order to handle certain queries (Including, Excluding, etc.), we modify the lattice to allow links from a k-length sequence to all of its k subsequences of length k−1 (rather than just its two generating subsequences). Given such a lattice, we can answer all but one (the Hierarchical queries) of the query types described above at interactive speeds, without going back to the original database. This is easy to see, as all of these queries basically involve a form of pruning over the lattice. A lattice, as opposed to a flat file containing the relevant sequences, is an important data structure because it permits rapid pruning of the relevant sequences. Exactly how we do this is discussed in more detail in [8].
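As an illustration of such lattice-based pruning, here is a hedged Python sketch of answering a few of these queries from a precomputed lattice mapping each sequence to its support, mined once at a low support S_min. The function names (simple_query, including_query, top_k_query) and the tiny example lattice are assumptions for illustration only.

def simple_query(lattice, min_sup):
    """Frequent sequences at any support >= S_min, with no database access."""
    return {s: c for s, c in lattice.items() if c >= min_sup}

def including_query(lattice, min_sup, required_items):
    return {s: c for s, c in simple_query(lattice, min_sup).items()
            if required_items <= {item for itemset in s for item in itemset}}

def top_k_query(lattice, k):
    """Quantified query: the k sequences with highest support (sorting dominates)."""
    return sorted(lattice.items(), key=lambda sc: -sc[1])[:k]

# Example lattice: sequences are tuples of frozensets, values are supports.
lattice = {(frozenset("A"),): 4,
           (frozenset("A"), frozenset("C")): 3,
           (frozenset("B"), frozenset("A")): 3}
print(top_k_query(lattice, 2))
print(including_query(lattice, 3, {"C"}))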

Hierarchical queries require the algorithm to treat a set of related items as one super-item. For example, we may want to treat chips, cookies, peanuts, etc. together as a single item called snacks, and we would like to know the frequent sequences involving this super-item. To generate the resulting sequences, we have to modify the SPADE algorithm. We reconstruct the id-list for the new item (i_1, ..., i_n) via a special union operator, we remove the individual items i_1, ..., i_n from consideration, and we then rerun the equivalence class algorithm for this new item and return the set of frequent sequences.

6 Experimental Evaluation

All the experiments were performed on a single Alpha processor of a DECstation; no other user processes were running at the time. We used several synthetic databases of different sizes, generated using the procedure described in [9]. Although our benchmark databases fit in memory, our goal is to work with out-of-core databases; hence, we assume that the database resides on disk. The datasets are generated using the following process. First, N_I itemsets of average size I are generated by choosing from N items. Then N_S sequences of average length S are created by assigning itemsets from N_I to each sequence. Next, a customer with an average of T transactions is created, and sequences in N_S are assigned to different customer elements, respecting the average transaction size. The generation stops when C customers have been generated. Table 1 shows the databases used and their properties: the total number of transactions is denoted D, the average transaction size per customer is T, and the total number of customers is C. Please see [9] for further details on the dataset generation process and the parameter settings.

[Table 1: Database properties. For each synthetic dataset: the number of customers C, the average transaction size per customer T, the total number of transactions D, and the total size.]

To evaluate the incremental algorithm, we modified the database generation mechanism to construct two datasets: one corresponding to the original database, and one corresponding to the increment database. The input to the generator also included an increment percentage, roughly corresponding to the number of customers in the increment, and the percentage of transactions for each such customer that belongs in the increment database. For example, with an increment percentage of 5%, roughly 5% of the customers would have records in the increment database, and each such customer would have the specified percentage of its transactions in the increment. The actual number of customers in the increment is determined by drawing from a uniform distribution (with the increment percentage as parameter). Similarly, for each customer in the increment, the number of transactions belonging to the increment is also drawn from a uniform distribution (with the transaction percentage as parameter).

Incremental Performance: For the first experiment (see Figure 5), we varied the increment percentage for four databases while keeping the transaction percentage fixed. We ran the SPADE algorithm on the entire database (original plus increment) combined, and evaluated the cost of running just the incremental algorithm (after constructing the ISL from the original database) for increment values of five, three, and one percent.
For each database, we also evaluated the breakdown of the cost of the incremental algorithm's phases. The results show that the speedups obtained by the incremental algorithm, in comparison to rerunning the SPADE algorithm over the entire database, range from a factor of 7 to over two orders of magnitude. As expected, moving from a larger increment value to a smaller one increases the improvement, since a smaller increment introduces fewer new sequences. The breakdown figures reveal that the Phase 1 time is negligible, requiring under a second for all the datasets at all increment values. They also show that the Phase 2 times, while an order of magnitude larger than the Phase 1 times, are still much faster than re-executing the entire algorithm. Further, while increasing the database size does increase the overall running time of Phase 2, it does not increase at the same rate as re-executing the entire algorithm on these datasets.

The second experiment varied the support values for a given increment size, on two databases. The results are documented in Figure 6. For both databases, as the support is increased, the execution times of Phase 1 and Phase 2 rapidly approach zero. This is not surprising when one considers that at higher supports there are fewer elements in the ISL (affecting Phase 1) and far fewer new sequences (affecting Phase 2).

The third experiment kept the support, the number of customers, and the transaction percentage constant, and varied the number of transactions per customer. Figure 7 depicts the breakdown of the two phases of the ISM algorithm for varying increment values. We see that as the number of transactions per customer increases, the execution time of both phases progressively increases for all increment sizes. This is because the ISL contains more sequences (affecting Phase 1) and there are more new sequences (affecting Phase 2).

Interactive Performance: In this section, we evaluate the performance of the interactive queries described in Section 5. All the interactive query experiments were performed on a SUN UltraSPARC workstation, a much slower machine with less memory than the DEC machine used above. We envisage off-loading the interactive querying feature onto client machines, as opposed to executing the queries on the server and shipping the results to the data mining client; thus we wanted to evaluate the cost of executing interactive queries on a slower machine.

[Figure 5: Incremental Algorithm Performance Comparison (times in seconds). For each database, the total SPADE time versus the incremental algorithm at 5%, 3%, and 1% increments, with the Phase 1 / Phase 2 breakdown.]

[Figure 6: Effect of Varying Support. Phase 1 and Phase 2 times (in seconds) versus support, for two databases.]

[Figure 7: Effect of Varying Transaction Length. Phase 1 and Phase 2 breakdown versus the number of transactions per customer, for several increment sizes.]

[Table 2: Interactive Performance (times in seconds). For each database, the time for simple, refined, priority, including, and excluding queries, together with the cost of rerunning SPADE.]

Another reason for evaluating the queries on a slower machine is that the relative speeds of the various interactive queries are better seen there (on the DEC machine all queries executed in negligible time). Since hierarchical queries simply entail a modified execution of Phase 2, we do not evaluate them separately. We evaluated simple querying over a range of supports, refined querying (with the support refined to a lower value for all the datasets), priority querying (querying for the k sequences with highest support), including queries (including a random item), and excluding queries (excluding a random item). Results are presented in Table 2, along with the cost of rerunning the SPADE algorithm on the DEC machine. We see that the querying times for refined, priority, including, and excluding queries are very low and capable of achieving interactive speeds. The priority query takes more time, since it has to sort the sequences according to support value, and this sorting dominates the computation time. Compared with rerunning SPADE (on a much faster DEC machine), interactive querying is several orders of magnitude faster, in spite of executing on a much slower machine.

7 Related Work

Sequence Mining: The concept of sequence mining as defined in this paper was first described in [9]. Recently, SPADE [13] was shown to outperform the algorithm presented in [9] by a factor of two in the general case, and by a factor of ten with a pre-processing step. The problem of finding frequent episodes in a single long sequence of events was presented in [5]. The problem of discovering patterns in multiple event sequences was studied in [7]; they search the rule space directly instead of searching the sequence space and then forming the rules.

Incremental Sequence Mining: There has been almost no work addressing the incremental mining of sequences. One related proposal [12] uses a dynamic suffix-tree based approach to incremental mining in a single long sequence. However, we are dealing with sequences across different customers, i.e., multiple sequences of sets of items, as opposed to a single long sequence of items. The other closest work is in incremental association mining [2, 3, 11]. However, there are important differences that make incremental sequence mining a more difficult problem. While association rules discover only intra-transaction patterns (itemsets), we now also have to discover inter-transaction patterns (sequences). The set of all frequent sequences is an unbounded superset of the (bounded) set of frequent itemsets. Hence, the sequence search is much more complex and challenging than the itemset search, further necessitating fast algorithms.

Interactive Sequence Mining: A mine-and-examine paradigm for interactive exploration of associations and sequence episodes was presented in [4]. Similar paradigms have been proposed exclusively for associations [6, 10]. Our interactive approach tackles a different problem (sequences across different customers) and supports a wider range of interactive querying features.

8 Conclusions

In this paper, we propose novel techniques that maintain data structures for mining sequences in the presence of a) database updates, and b) user interaction.
The results obtained show speedups from several factors to two orders of magnitude for incremental mining, compared with re-executing a state-of-the-art sequence mining algorithm. The results for the interactive approach are even better: at the cost of maintaining a summary data structure, the interactive approach performs several orders of magnitude faster than any current sequence mining algorithm. One of the limitations of the incremental approach proposed in this paper is the size of the negative border, and the resulting memory utilization. We are currently investigating methods to alleviate this problem, either by refining the algorithm or by performing the computation out of core.

References

[1] C. Aggarwal and P. Yu. Online generation of associations. In 14th ICDE, 1998.
[2] D. Cheung, J. Han, V. Ng, and C. Wong. Maintenance of discovered association rules in large databases: an incremental updating technique. In 12th ICDE, 1996.
[3] R. Feldman, Y. Aumann, A. Amir, and H. Mannila. Efficient algorithms for discovering frequent sets in incremental databases. In 2nd DMKD Workshop, 1997.
[4] M. Klemettinen, et al. Finding interesting rules from large sets of discovered association rules. In 3rd CIKM, 1994.
[5] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DMKD Journal, 1(3):259-289, 1997.
[6] R. T. Ng, et al. Exploratory mining and pruning optimizations of constrained association rules. In SIGMOD Conference, 1998.
[7] T. Oates, et al. A family of algorithms for finding temporal structure in data. In 6th Workshop on AI and Statistics, 1997.
[8] S. Parthasarathy, et al. Incremental and interactive sequence mining. TR75, CS Dept., University of Rochester, June 1999.
[9] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In 5th EDBT, 1996.
[10] R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. In 3rd KDD, 1997.
[11] S. Thomas, S. Bodagala, K. Alsabti, and S. Ranka. An efficient algorithm for the incremental updation of association rules in large databases. In 3rd KDD, 1997.
[12] K. Wang. Discovering patterns from large and dynamic sequential data. Journal of Intelligent Information Systems, 9(1), August 1997.
[13] M. J. Zaki. Efficient enumeration of frequent sequences. In 7th CIKM, 1998.


More information

Interestingness Measurements

Interestingness Measurements Interestingness Measurements Objective measures Two popular measurements: support and confidence Subjective measures [Silberschatz & Tuzhilin, KDD95] A rule (pattern) is interesting if it is unexpected

More information

Efficient subset and superset queries

Efficient subset and superset queries Efficient subset and superset queries Iztok SAVNIK Faculty of Mathematics, Natural Sciences and Information Technologies, University of Primorska, Glagoljaška 8, 5000 Koper, Slovenia Abstract. The paper

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 1/8/2014 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 Supermarket shelf

More information

Discovery of Frequent Itemset and Promising Frequent Itemset Using Incremental Association Rule Mining Over Stream Data Mining

Discovery of Frequent Itemset and Promising Frequent Itemset Using Incremental Association Rule Mining Over Stream Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 5, May 2014, pg.923

More information

An Improved Apriori Algorithm for Association Rules

An Improved Apriori Algorithm for Association Rules Research article An Improved Apriori Algorithm for Association Rules Hassan M. Najadat 1, Mohammed Al-Maolegi 2, Bassam Arkok 3 Computer Science, Jordan University of Science and Technology, Irbid, Jordan

More information

CSCI6405 Project - Association rules mining

CSCI6405 Project - Association rules mining CSCI6405 Project - Association rules mining Xuehai Wang xwang@ca.dalc.ca B00182688 Xiaobo Chen xiaobo@ca.dal.ca B00123238 December 7, 2003 Chen Shen cshen@cs.dal.ca B00188996 Contents 1 Introduction: 2

More information

Roadmap. PCY Algorithm

Roadmap. PCY Algorithm 1 Roadmap Frequent Patterns A-Priori Algorithm Improvements to A-Priori Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms Compacting Results Data Mining for Knowledge Management 50 PCY

More information

FM-WAP Mining: In Search of Frequent Mutating Web Access Patterns from Historical Web Usage Data

FM-WAP Mining: In Search of Frequent Mutating Web Access Patterns from Historical Web Usage Data FM-WAP Mining: In Search of Frequent Mutating Web Access Patterns from Historical Web Usage Data Qiankun Zhao Nanyang Technological University, Singapore and Sourav S. Bhowmick Nanyang Technological University,

More information

Data Access Paths for Frequent Itemsets Discovery

Data Access Paths for Frequent Itemsets Discovery Data Access Paths for Frequent Itemsets Discovery Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science {marekw, mzakrz}@cs.put.poznan.pl Abstract. A number

More information

Efficient Remining of Generalized Multi-supported Association Rules under Support Update

Efficient Remining of Generalized Multi-supported Association Rules under Support Update Efficient Remining of Generalized Multi-supported Association Rules under Support Update WEN-YANG LIN 1 and MING-CHENG TSENG 1 Dept. of Information Management, Institute of Information Engineering I-Shou

More information

Distributed minimum spanning tree problem

Distributed minimum spanning tree problem Distributed minimum spanning tree problem Juho-Kustaa Kangas 24th November 2012 Abstract Given a connected weighted undirected graph, the minimum spanning tree problem asks for a spanning subtree with

More information

Discovering Periodic Patterns in System Logs

Discovering Periodic Patterns in System Logs Discovering Periodic Patterns in System Logs Marcin Zimniak 1, Janusz R. Getta 2, and Wolfgang Benn 1 1 Faculty of Computer Science, TU Chemnitz, Germany {marcin.zimniak,benn}@cs.tu-chemnitz.de 2 School

More information

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the Chapter 6: What Is Frequent ent Pattern Analysis? Frequent pattern: a pattern (a set of items, subsequences, substructures, etc) that occurs frequently in a data set frequent itemsets and association rule

More information

A new algorithm for gap constrained sequence mining

A new algorithm for gap constrained sequence mining 24 ACM Symposium on Applied Computing A new algorithm for gap constrained sequence mining Salvatore Orlando Dipartimento di Informatica Università Ca Foscari Via Torino, 155 - Venezia, Italy orlando@dsi.unive.it

More information

The Fuzzy Search for Association Rules with Interestingness Measure

The Fuzzy Search for Association Rules with Interestingness Measure The Fuzzy Search for Association Rules with Interestingness Measure Phaichayon Kongchai, Nittaya Kerdprasop, and Kittisak Kerdprasop Abstract Association rule are important to retailers as a source of

More information

An algorithm for Performance Analysis of Single-Source Acyclic graphs

An algorithm for Performance Analysis of Single-Source Acyclic graphs An algorithm for Performance Analysis of Single-Source Acyclic graphs Gabriele Mencagli September 26, 2011 In this document we face with the problem of exploiting the performance analysis of acyclic graphs

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information

Efficiently Mining Frequent Trees in a Forest

Efficiently Mining Frequent Trees in a Forest Efficiently Mining Frequent Trees in a Forest Mohammed J. Zaki Computer Science Department, Rensselaer Polytechnic Institute, Troy NY 8 zaki@cs.rpi.edu, http://www.cs.rpi.edu/ zaki ABSTRACT Mining frequent

More information

An Improved Algorithm for Mining Association Rules Using Multiple Support Values

An Improved Algorithm for Mining Association Rules Using Multiple Support Values An Improved Algorithm for Mining Association Rules Using Multiple Support Values Ioannis N. Kouris, Christos H. Makris, Athanasios K. Tsakalidis University of Patras, School of Engineering Department of

More information

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining Miss. Rituja M. Zagade Computer Engineering Department,JSPM,NTC RSSOER,Savitribai Phule Pune University Pune,India

More information

Temporal Weighted Association Rule Mining for Classification

Temporal Weighted Association Rule Mining for Classification Temporal Weighted Association Rule Mining for Classification Purushottam Sharma and Kanak Saxena Abstract There are so many important techniques towards finding the association rules. But, when we consider

More information

Closed Pattern Mining from n-ary Relations

Closed Pattern Mining from n-ary Relations Closed Pattern Mining from n-ary Relations R V Nataraj Department of Information Technology PSG College of Technology Coimbatore, India S Selvan Department of Computer Science Francis Xavier Engineering

More information

Performance Analysis of Frequent Closed Itemset Mining: PEPP Scalability over CHARM, CLOSET+ and BIDE

Performance Analysis of Frequent Closed Itemset Mining: PEPP Scalability over CHARM, CLOSET+ and BIDE Volume 3, No. 1, Jan-Feb 2012 International Journal of Advanced Research in Computer Science RESEARCH PAPER Available Online at www.ijarcs.info ISSN No. 0976-5697 Performance Analysis of Frequent Closed

More information

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

ANU MLSS 2010: Data Mining. Part 2: Association rule mining ANU MLSS 2010: Data Mining Part 2: Association rule mining Lecture outline What is association mining? Market basket analysis and association rule examples Basic concepts and formalism Basic rule measurements

More information

Efficient Incremental Mining of Top-K Frequent Closed Itemsets

Efficient Incremental Mining of Top-K Frequent Closed Itemsets Efficient Incremental Mining of Top- Frequent Closed Itemsets Andrea Pietracaprina and Fabio Vandin Dipartimento di Ingegneria dell Informazione, Università di Padova, Via Gradenigo 6/B, 35131, Padova,

More information

A Safe-Exit Approach for Efficient Network-Based Moving Range Queries

A Safe-Exit Approach for Efficient Network-Based Moving Range Queries Data & Knowledge Engineering Data & Knowledge Engineering 00 (0) 5 A Safe-Exit Approach for Efficient Network-Based Moving Range Queries Duncan Yung, Man Lung Yiu, Eric Lo Department of Computing, Hong

More information

APPLYING BIT-VECTOR PROJECTION APPROACH FOR EFFICIENT MINING OF N-MOST INTERESTING FREQUENT ITEMSETS

APPLYING BIT-VECTOR PROJECTION APPROACH FOR EFFICIENT MINING OF N-MOST INTERESTING FREQUENT ITEMSETS APPLYIG BIT-VECTOR PROJECTIO APPROACH FOR EFFICIET MIIG OF -MOST ITERESTIG FREQUET ITEMSETS Zahoor Jan, Shariq Bashir, A. Rauf Baig FAST-ational University of Computer and Emerging Sciences, Islamabad

More information

An Algorithm for Frequent Pattern Mining Based On Apriori

An Algorithm for Frequent Pattern Mining Based On Apriori An Algorithm for Frequent Pattern Mining Based On Goswami D.N.*, Chaturvedi Anshu. ** Raghuvanshi C.S.*** *SOS In Computer Science Jiwaji University Gwalior ** Computer Application Department MITS Gwalior

More information

Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports

Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports R. Uday Kiran P. Krishna Reddy Center for Data Engineering International Institute of Information Technology-Hyderabad Hyderabad,

More information

FastLMFI: An Efficient Approach for Local Maximal Patterns Propagation and Maximal Patterns Superset Checking

FastLMFI: An Efficient Approach for Local Maximal Patterns Propagation and Maximal Patterns Superset Checking FastLMFI: An Efficient Approach for Local Maximal Patterns Propagation and Maximal Patterns Superset Checking Shariq Bashir National University of Computer and Emerging Sciences, FAST House, Rohtas Road,

More information

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph.

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph. Trees 1 Introduction Trees are very special kind of (undirected) graphs. Formally speaking, a tree is a connected graph that is acyclic. 1 This definition has some drawbacks: given a graph it is not trivial

More information

Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach *

Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach * Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach * Hongyan Liu 1 Jiawei Han 2 Dong Xin 2 Zheng Shao 2 1 Department of Management Science and Engineering, Tsinghua

More information

Memory Placement Techniques for Parallel Association Mining

Memory Placement Techniques for Parallel Association Mining Memory Placement Techniques for Parallel ssociation Mining Srinivasan Parthasarathy, Mohammed J. Zaki, and Wei Li omputer Science epartment, University of Rochester, Rochester NY 14627 srini,zaki,wei @cs.rochester.edu

More information

Eager, Lazy and Hybrid Algorithms for Multi-Criteria Associative Classification

Eager, Lazy and Hybrid Algorithms for Multi-Criteria Associative Classification Eager, Lazy and Hybrid Algorithms for Multi-Criteria Associative Classification Adriano Veloso 1, Wagner Meira Jr 1 1 Computer Science Department Universidade Federal de Minas Gerais (UFMG) Belo Horizonte

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 6

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 6 Data Mining: Concepts and Techniques (3 rd ed.) Chapter 6 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2013-2017 Han, Kamber & Pei. All

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University A large set of items, e.g., things sold in a supermarket. A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on

More information

A mining method for tracking changes in temporal association rules from an encoded database

A mining method for tracking changes in temporal association rules from an encoded database A mining method for tracking changes in temporal association rules from an encoded database Chelliah Balasubramanian *, Karuppaswamy Duraiswamy ** K.S.Rangasamy College of Technology, Tiruchengode, Tamil

More information

AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR UTILITY MINING. Received April 2011; revised October 2011

AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR UTILITY MINING. Received April 2011; revised October 2011 International Journal of Innovative Computing, Information and Control ICIC International c 2012 ISSN 1349-4198 Volume 8, Number 7(B), July 2012 pp. 5165 5178 AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR

More information

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets American Journal of Applied Sciences 2 (5): 926-931, 2005 ISSN 1546-9239 Science Publications, 2005 Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets 1 Ravindra Patel, 2 S.S.

More information