Incremental and Interactive Sequence Mining


S. Parthasarathy, M. J. Zaki, M. Ogihara, S. Dwarkadas
Computer Science Dept., University of Rochester, Rochester, NY 14627
Computer Science Dept., Rensselaer Polytechnic Institute, Troy, NY 12180

Contact author: S. Parthasarathy, srini@cs.rochester.edu. This work was supported in part by NSF and DARPA grants, and by an external research grant from Digital Equipment Corporation.

Abstract

The discovery of frequent sequences in temporal databases is an important data mining problem. Most current work assumes that the database is static, and a database update requires rediscovering all the patterns by scanning the entire old and new database. In this paper, we propose novel techniques for maintaining sequences in the presence of a) database updates, and b) user interaction (e.g., modifying mining parameters). This is a very challenging task, since such updates can invalidate existing sequences or introduce new ones. In both of the above scenarios, we avoid re-executing the algorithm on the entire dataset, thereby reducing execution time. Experimental results confirm that our approach yields execution time improvements of up to several orders of magnitude in practice.

1 Introduction

Sequence mining is an important data mining task, where one attempts to discover frequent sequences, over time, of attribute sets in large databases. This problem was originally motivated by applications in the retail industry, including attached mailing, add-on sales, and customer satisfaction. It also applies to many scientific and business domains. For instance, in the health care industry it can be used for predicting the onset of disease from a sequence of symptoms, and in the financial industry it can be used for predicting investment risk based on a sequence of stock market events. Discovering all frequent sequences in a very large database can be very compute and I/O intensive, because the search space size is essentially exponential in the length of the longest transaction sequence in the database. This high computational cost may be acceptable when the database is static, since the discovery is done only once, and several approaches to this problem have been presented in the literature. However, many domains such as electronic commerce, stock analysis, and collaborative surgery impose soft real-time constraints on the mining process. In such domains, where the databases are updated on a regular basis and user interactions modify the search parameters, running the discovery program all over again is infeasible. Hence, there is a need for algorithms that maintain valid mined information across i) database updates, and ii) user interactions (modifying/constraining the search space). Although the incremental [2, 3, 11] and interactive [1] mining problems have been studied for association rules, no work has been done on incremental and interactive sequential patterns.

In this paper, we present a method for incremental and interactive sequence mining. Our goal is to minimize the I/O and computation requirements for handling incremental updates. Our algorithm accomplishes this goal by maintaining information about maximally frequent and minimally infrequent sequences. When incremental data arrives, the incremental part is scanned once to incorporate the new information. The new data is combined with the maximal and minimal information in order to determine the portions of the original database that need to be re-scanned.
This process is aided by the use of a vertical database layout, where attributes are associated with the list of transactions in which they occur. The result is an improvement in execution time of up to several orders of magnitude in practice, both for handling increments to the database and for handling interactive queries.

The rest of the paper is organized as follows. In Section 2, we formulate the sequence discovery problem. In Section 3, we describe the SPADE algorithm upon which we build our incremental approach. Section 4 describes our incremental sequence mining algorithm. In Section 5, we describe how we support online querying. An experimental evaluation is presented in Section 6. We discuss related work in Section 7, and conclude in Section 8.

2 Problem Formulation

In this section, we define the incremental sequence mining problem that this paper is concerned with. We begin by defining the notation we use. Let the items, denoted I, be the set of all possible attributes. We assume a fixed enumeration of all members of I and identify the items with their indices in the enumeration. An itemset is a set of items, and is denoted by the enumeration of its elements in increasing order.

[Figure 1: Original Database. The example database of four customers over the items A, B, and C, together with its sequence lattice, frequent set, and negative border.]

[Figure 2: Original plus Incremental Database and Lattice.]

For an itemset i, its size, denoted |i|, is the number of elements in it. An itemset of size k is called a k-itemset. A sequence is an ordered list (ordered in time) of nonempty itemsets. A sequence of itemsets α_1, ..., α_n is denoted by (α_1 → ... → α_n). The length of a sequence is the sum of the sizes of its itemsets. For each integer k, a sequence of length k is called a k-sequence. A sequence α is a subsequence of a sequence β, denoted α ⪯ β, if α can be constructed from β by striking out some (or none) of the items in β, and then eliminating any empty itemsets that result. For example, A → C is a subsequence of AB → E → BCD. We say that α is a proper subsequence of β, denoted α ≺ β, if α ⪯ β and α ≠ β. For k ≥ 2, the generating subsequences of a length-k sequence α are the two length-(k−1) subsequences of α obtained by dropping exactly one of its first or second items. By definition, the generating subsequences share a common suffix of length k−2. For example, the two generating subsequences of AB → CD → E are A → CD → E and B → CD → E, and they share the common suffix CD → E. A sequence is maximal in a collection C of sequences if it is not a subsequence of any other sequence in C.

Our database is a collection of customers, each with a sequence of transactions, each of which is an itemset. For a database D and a sequence α, the support or frequency of α in D, denoted support_D(α), is the number of customers in D whose sequences contain α as a subsequence. The minimum support, denoted min_support, is a user-specified threshold used to define frequent sequences: a sequence is frequent in D if its support in D is at least min_support. A rule involving a sequence A and a sequence B is said to have confidence c if c% of the customers that contain A also contain B.

Suppose that new data δ is to be added to a database D. Then we call D the original database and δ the incremental database. The updated database is denoted by D + δ. For each k, F_k denotes the collection of all frequent sequences of length k in the updated database. Also, FS denotes the set of all frequent sequences in the updated database. The negative border (NB) is the collection of all sequences that are not frequent but both of whose generating subsequences are frequent. By the old sequences, we mean the set of all frequent sequences in the original database, and by the new sequences, we mean the set of all frequent sequences in the join of the original database and the increment.

For example, consider the customer database shown in Figure 1. The database has three items (A, B, C) and four customers. The figure also shows the Increment Sequence Lattice (ISL) with all the frequent sequences (the frequency is shown with each node) and the negative border, when a minimum support of 75%, or 3 customers, is used. For each frequent sequence, the figure shows its two generating subsequences in bold lines.
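To make these definitions concrete, here is a minimal Python sketch (not from the paper) of the subsequence test and the support count over a toy customer database. The representation (sequences as lists of frozensets), the names is_subsequence and support, and the example data are illustrative assumptions only.

def is_subsequence(alpha, beta):
    """True if sequence alpha (a list of itemsets) is a subsequence of beta."""
    i = 0
    for itemset in beta:
        if i < len(alpha) and alpha[i] <= itemset:   # itemset containment, in time order
            i += 1
    return i == len(alpha)

def support(db, alpha):
    """Number of customers whose sequence contains alpha as a subsequence."""
    return sum(1 for seq in db.values() if is_subsequence(alpha, seq))

# Toy database in the spirit of Figure 1: cid -> time-ordered list of itemsets.
db = {
    1: [frozenset("AB"), frozenset("B")],
    2: [frozenset("B"), frozenset("AC")],
    3: [frozenset("AB"), frozenset("BC")],
    4: [frozenset("A"), frozenset("C")],
}
print(support(db, [frozenset("A"), frozenset("C")]))   # support of A -> C is 2 here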
Figure 2 shows how the frequent set and the negative border change when we mine over the combined original and incremental database (highlighted in dark grey).

For example, C is not frequent in the original database D, but C (along with some of its supersequences) becomes frequent after the update δ. The update also causes some elements to move from NB to the new FS.

Incremental Sequence Discovery Problem: Given an original database D of sequences and an increment δ to the database, find all frequent sequences in the database D + δ, with minimum possible recomputation and I/O.

3 The SPADE Algorithm

In this section we describe SPADE [13], an algorithm for the fast discovery of frequent sequences, which forms the basis for our incremental algorithm.

Sequence Lattice: SPADE uses the observation that the subsequence relation defines a partial order on the set of sequences, i.e., if β is a frequent sequence, then all subsequences α ⪯ β are also frequent. The algorithm systematically searches the sequence lattice spanned by the subsequence relation, from the most general (single items) to the most specific frequent sequences (maximal sequences), in a depth-first manner. For instance, in Figure 1, the bold lines correspond to the lattice for the example dataset.

Support Counting: Most current sequence mining algorithms [9] assume a horizontal database layout such as the one shown in Figure 1. In the horizontal format, the database consists of a set of customers (cids). Each customer has a set of transactions (tids), along with the items contained in each transaction. In contrast, we use a vertical database layout, where we associate with each item X in the sequence lattice its idlist, denoted L(X), which is a list of all customer (cid) and transaction identifier (tid) pairs containing the item. For example, the idlist for the item C in the original database (Figure 1) consists of one <cid, tid> pair for each transaction that contains C. Given the per-item idlists, we can iteratively determine the support of any k-sequence from the idlists of two of its (k−1)-length subsequences. In particular, we combine (intersect) the two (k−1)-length subsequences that share a common suffix (the generating sequences) to compute the support of a new k-length sequence. A simple check on the support of the resulting idlist tells us whether the new sequence is frequent or not.

If we had enough main memory, we could enumerate all the frequent sequences by traversing the lattice and performing intersections to obtain sequence supports. In practice, however, we only have a limited amount of main memory, and all the intermediate idlists will not fit in memory. SPADE breaks up this large search space into small, manageable chunks that can be processed independently in memory. This is accomplished via suffix-based equivalence classes (henceforth denoted simply as classes). We say that two k-length sequences are in the same class if they share a common (k−1)-length suffix. The key observation is that each class is a sub-lattice of the original sequence lattice and can be processed independently. Each suffix class is independent in the sense that it has complete information for generating all frequent sequences that share the same suffix (see Zaki [13]). For example, if a class [X] has the elements Y → X and Z → X as its only sequences, the only possible frequent sequences at the next step are Y → Z → X, Z → Y → X, and (YZ) → X. It should be obvious that no other item Q can lead to a frequent sequence with the suffix X, unless (QX) or Q → X is also in [X].
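As a concrete illustration of support counting over idlists, here is a simplified Python sketch of a temporal join that builds an idlist for the 2-sequence A → C from the per-item idlists of A and C, and counts distinct cids for the support. This is an assumption-laden illustration rather than SPADE's actual join machinery; the names temporal_join and support_of and the sample idlists are hypothetical.

def temporal_join(idlist_y, idlist_x):
    """(cid, tid) pairs of X occurrences preceded by a Y occurrence for that cid."""
    first_y = {}                                # cid -> earliest tid at which Y occurs
    for cid, tid in idlist_y:
        first_y[cid] = min(tid, first_y.get(cid, tid))
    return [(cid, tid) for cid, tid in idlist_x
            if cid in first_y and first_y[cid] < tid]

def support_of(idlist):
    return len({cid for cid, _ in idlist})      # support counts distinct customers

L = {"A": [(1, 10), (1, 20), (2, 15), (3, 10)],
     "C": [(1, 30), (2, 15), (3, 20)]}
ac = temporal_join(L["A"], L["C"])              # idlist of the 2-sequence A -> C
print(ac, support_of(ac))                       # [(1, 30), (3, 20)] 2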
BEGIN Enumerate-Frequent-Seq([S]):
    for all elements A_i in [S] do
        [A_i] = ∅;
        for all elements A_j in [S] do
            R = A_j ∪ A_i;  /* sequence R formed from the generating subsequences A_j and A_i, with A_i as the suffix */
            idlist(R) = idlist(A_j) ∩ idlist(A_i);
            if support(idlist(R)) ≥ min_sup then [A_i] = [A_i] ∪ {R};
    for all [A_i] do Enumerate-Frequent-Seq([A_i]);
END Enumerate-Frequent-Seq([S])

Figure 3: Enumerating Frequent Sequences

SPADE recursively decomposes the sequences at each new level into even smaller independent classes. For instance, at level one it uses suffix classes of length one (X, Y), at level two it uses suffix classes of length two (X → Y, XY), and so on. We refer to the level-one suffix classes as parent classes. These suffix classes are processed one by one. Figure 3 shows the pseudo-code (simplified for exposition; see [13] for exact details) for the main procedure of the SPADE algorithm. The input to the procedure is a class, along with the idlist for each of its elements. Frequent sequences are generated by intersecting the idlists of all distinct pairs of sequences in each class and checking the support of the resulting idlist against min_sup. The sequences found to be frequent at the current level form classes for the next level. This level-wise process is recursively repeated until all frequent sequences have been enumerated. In terms of memory management, it is easy to see that we need memory to store intermediate idlists for at most two consecutive levels. Once all the frequent sequences for the next level have been generated, the sequences at the current level can be deleted. For more details on SPADE, see [13].

4 Incremental Mining Algorithm

Our purpose is to minimize re-computation and re-scanning of the original database when mining sequences in the presence of increments to the database (the increments are assumed to be appended to the database, i.e., later in time). To accomplish this, we use an efficient memory management scheme that indexes into the database efficiently, and we create an Increment Sequence Lattice (ISL), exploiting its properties to prune the search space for potential new sequences. The ISL consists of all elements in the negative border and the frequent set, and is initially constructed using SPADE. In the ISL, the children of each nonempty sequence are its generating subsequences. Each node of the ISL contains the support for the given sequence.

PHASE 1:
    compute D'(i) and D''(i) for all items i
    for all items i in I' do Q.enqueue(i);
    while (Q is not empty)
        p = Q.dequeue(); Compute_Support(p);
        if (support(p) ≥ min_sup)
            k = length(p);
            if (p is in the negative border)
                NB-to-FS[k].enqueue(p);
            else if (D'(p) ≠ ∅)
                for all (k+1)-sequences S in the ISL that are generating ascendants of p
                    Q.enqueue(S);

PHASE 2:
    for each item i in NB-to-FS[1]
        construct the suffix class [i];
        NB-to-FS[2].enqueue([i]);
    for (k = 2, 3, ...)
        for each class C in NB-to-FS[k]
            Enumerate-Frequent-Seq(C);

Compute_Support(p):
    A = one generating subsequence of p;
    B = the other generating subsequence of p;
    support_D'(p) = intersect(D'(A), D'(B));
    support_D''(p) = intersect(D''(A), D''(B));
    support(p) = support(p) + support_D'(p) - support_D''(p);

Figure 4: The ISM Algorithm

Theory of Incremental Sequences

Let C', T', and I' be the sets of all cids, tids, and items, respectively, that appear in the incremental part δ. Define D' to be the set of all records (in D + δ) whose cid is in C', and let D'' = D' \ δ. For the sake of simplicity, assume that no new customer is added to the database. This implies that infrequent sequences can become frequent, but not the other way around. We use the following properties of the lattice to efficiently perform incremental sequence mining. By set inclusion-exclusion, we can update the support of a sequence in FS or NB based on its support in D' and D'':

support_{D+δ}(X) = support_D(X) + support_{D'}(X) - support_{D''}(X)    (1)

This allows us to compute the support at any node in the lattice quickly, by limiting re-access to the original database to D''.

Proposition 1: For every sequence X, if support_{D+δ}(X) > support_D(X), then the last item of X belongs to I'. This allows us to limit the nodes in the ISL that are re-examined to those whose last item is in I'.

We call a sequence Y a generating descendant of X if there exists a list [Z_1, Z_2, ..., Z_m] of sequences such that Z_1 = Y, Z_m = X, and for every i, 1 ≤ i < m, Z_i is a generating subsequence of Z_{i+1}. We show that if a sequence has become a member of FS ∪ NB in the updated database, but was not a member before the update, then one of its generating descendants was in NB and is now in FS.

Proposition 2: Let X be a sequence of length at least 2. If X is in FS_{D+δ} ∪ NB_{D+δ} but not in FS_D ∪ NB_D, then X has a generating descendant in NB_D ∩ FS_{D+δ}.

Proof: The proof is by induction on k, the length of X. Let X_1 and X_2 be the two generating subsequences of X. Note that if both X_1 and X_2 belong to FS_D, then X is in FS_D ∪ NB_D, which contradicts our assumption. Therefore, either X_1 or X_2 is not in FS_D. For the base case, k = 2: since X_1 and X_2 are of length 1, by definition both belong to FS_D ∪ NB_D, and by the above, at least one of them must be in NB_D. For X to be in FS_{D+δ} ∪ NB_{D+δ}, X_1 and X_2 must be in FS_{D+δ} by definition. Thus the claim holds, since either X_1 or X_2 must be in NB_D ∩ FS_{D+δ}, and they are generating descendants of X. For the induction step, suppose k > 2 and that the claim holds for all lengths smaller than k. First suppose X_1 and X_2 are both in FS_D ∪ NB_D. Then either X_1 or X_2 is in NB_D. We know X ∈ FS_{D+δ} ∪ NB_{D+δ}, so X_1 and X_2 belong to FS_{D+δ}. Since X_1 and X_2 are generating subsequences of X, the claim holds for X. Finally, consider the case where X_1 or X_2 is not in FS_D ∪ NB_D. We know that, since X ∈ FS_{D+δ} ∪ NB_{D+δ}, both X_1 and X_2 belong to FS_{D+δ} ∪ NB_{D+δ}. Now suppose that X_1 ∉ FS_D ∪ NB_D. Since X_1 is in FS_{D+δ} ∪ NB_{D+δ} and has length less than k, the induction hypothesis gives the claim for X_1. Let Y be a generating descendant satisfying the claim for X_1. Since X_1 is a generating subsequence of X, Y is also a generating descendant of X. Thus the claim holds for k. The same argument applies when X_2 ∉ FS_D ∪ NB_D.
Proposition 2 limits the additional sequences (not found in the original ISL) that need to be examined in order to update the ISL.

Memory Management

SPADE simply requires per-item idlists. For incremental mining, in order to limit accesses to the customers and items in the increment, we use a two-level disk-file indexing scheme. Since the number of customers is unbounded, we use the hashing mechanism described below. The vertical database is partitioned into a number of blocks such that each individual block fits in memory. Each block contains the vertical representation of all transactions involving a set of customers. Within each block there is an item dereferencing array, pointing to the first entry for each item. Given a customer and an item, we first identify the block containing the customer's transactions using a first-level cid-index (a hash function). The second-level item-index then locates the item within the given block. After this, we perform a linear search for the exact customer identifier. Using this two-level indexing scheme, we can quickly jump to only that portion of the database that will be affected by the update, without having to touch the entire database. Note that with the vertical data format we are able to efficiently retrieve the cids for all affected items without touching the entire database. This is not possible in the horizontal format, since a given item can appear in any transaction, which can only be found by scanning the entire data.
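The following Python sketch illustrates the two-level lookup just described: a hash on the cid picks the block that holds a customer's transactions, and a per-block item index points at that item's entries inside the block, followed by a linear scan on the customer identifier. It is a hedged, in-memory illustration only; the class and function names (Block, block_of, lookup) and the block count are assumptions, not the paper's implementation.

class Block:
    def __init__(self):
        # item -> list of (cid, tid) entries, i.e., the vertical layout of this block;
        # this dictionary plays the role of the item dereferencing array
        self.item_entries = {}

NUM_BLOCKS = 4
blocks = [Block() for _ in range(NUM_BLOCKS)]

def block_of(cid):
    return blocks[hash(cid) % NUM_BLOCKS]        # first-level cid-index

def add_record(cid, tid, item):
    block_of(cid).item_entries.setdefault(item, []).append((cid, tid))

def lookup(cid, item):
    """Entries of `item` for customer `cid`, touching only one block."""
    entries = block_of(cid).item_entries.get(item, [])   # second-level item-index
    return [(c, t) for c, t in entries if c == cid]      # final linear scan on cid

add_record(7, 10, "C"); add_record(7, 30, "A"); add_record(9, 10, "C")
print(lookup(7, "C"))                            # -> [(7, 10)]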

The Incremental Sequence Mining (ISM) Algorithm

Our incremental algorithm maintains the Increment Sequence Lattice (ISL), which consists of all the frequent sequences and all sequences in the negative border of the original database. The support of each member is also kept in the lattice. There are two properties of increments we are concerned with: whether new customers are added and whether new transactions are added. We first check whether a new customer has been added. If so, the minimum support, in terms of the absolute number of customers, is raised. We examine the entire ISL from the 1-sequences towards longer and longer sequences to determine where each sequence now belongs. More precisely, for each sequence X that has been reclassified from frequent to infrequent, if its two generating subsequences are still frequent we make X a negative border element; otherwise, X is eliminated from the lattice. Then we default to the algorithm described below (see Figure 4).

The algorithm consists of two phases. Phase 1 updates the supports of elements in NB and FS, and Phase 2 adds to NB and FS beyond what was done in Phase 1. To describe the algorithm, for a sequence p we denote by D'(p) and D''(p) the vertical id-list of p in D' and in D'', respectively. Phase 1 begins by generating the single-item components of the ISL: for each single-item sequence p, we compute D'(p) and D''(p) and put p into a queue Q, which is empty at the beginning of the computation. Then we repeat the following until Q is empty. We dequeue one element p from Q and update its support using the subroutine Compute_Support, which computes the support based on Equation (1). Once the support is updated, if the sequence p (of length k) is in the frequent set, all length-(k+1) sequences that are already in the ISL and that are generating ascendants of p are queued into Q. If the sequence p is in the negative border and its updated support indicates that it is frequent, then this element is placed in NB-to-FS[k]. At the end of Phase 1, we have exact and up-to-date supports for all elements in the ISL. We further have a list of elements that were in the negative border but have become frequent as a result of the database increment (in NB-to-FS). In the example of Figures 1 and 2, the supports of six elements, including the item C, were updated; of these, three elements, among them C, moved from the negative border to the frequent set.

We next describe Phase 2 (see Figure 4). At the end of Phase 1, NB-to-FS is a list (an array indexed by length) of hash tables containing the elements that have moved from NB to FS. By Proposition 2, these are the only suffix-based classes we need to examine. For each 1-sequence that has moved, we intersect it with all other frequent 1-sequences, and we add the resulting frequent 2-sequences to the queue NB-to-FS[2] for further processing. In our running example of Figures 1 and 2, two frequent 2-sequences involving C are added to the NB-to-FS[2] table; at the same time, all other evaluated 2-sequences involving C that were not frequent are placed in NB_{D+δ}. The next step in Phase 2 is, starting with the hash table containing the length-2 sequences, to pick an element that has not yet been processed and create the list of frequent sequences, along with their associated id-lists from D + δ, in its equivalence class. The resulting equivalence class is then passed to Enumerate-Frequent-Seq, which adds any new frequent sequences and new negative border elements, along with their associated elements, to the ISL. We repeat this until all the NB-to-FS tables are empty.
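Before turning to an example, here is a minimal sketch of the Phase 1 support update from Equation (1), assuming idlists restricted to the affected customers' full data (D') and old data (D'') are available. The names update_support and distinct_cids and the sample idlists are illustrative, not the paper's code.

def distinct_cids(idlist):
    return len({cid for cid, _ in idlist})

def update_support(old_support_D, idlist_D1, idlist_D2):
    # support_{D+delta}(p) = support_D(p) + support_{D'}(p) - support_{D''}(p)
    return old_support_D + distinct_cids(idlist_D1) - distinct_cids(idlist_D2)

old_support = 2                             # support of p in the original database D
D1 = [(2, 40), (3, 15)]                     # occurrences over D' (old + new data of affected customers)
D2 = [(3, 15)]                              # occurrences over D'' (their old data only)
print(update_support(old_support, D1, D2))  # -> 3: customer 2 now contains p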
As an example of Phase 2, consider the equivalence class associated with one of the 2-sequences involving C that moved to the frequent set. From Figures 1 and 2 we see that its suffix class contains only one other frequent sequence. As both of these sequences are frequent, they are placed in FS_{D+δ}. Recursively enumerating the frequent sequences of this class adds its frequent 3-sequences to FS_{D+δ}; similarly, the evaluated sequences of the class that turn out to be infrequent are added to NB_{D+δ}.

5 Interactive Sequence Mining

The idea in interactive sequence mining is that an end user is allowed to query the database for association rules at differing values of support and confidence. The goal is to allow such interaction without excessive I/O or computation. Interactive usage of the system normally involves a lot of manual tuning of parameters and resubmission of queries, which can be very demanding on the memory subsystem of the server. In most current algorithms, multiple passes have to be made over the database for each <support, confidence> pair. This leads to unacceptable response times for online queries. Our approach to supporting such queries efficiently is to create pre-processed summaries that can quickly answer online queries. A typical set of queries that such a system could support includes: i) Simple Queries: identify the rules for support x% and confidence y%; ii) Refined Queries: the same procedure with the support value modified (x + y or x − y); iii) Quantified Queries: identify the k most important rules in terms of (support, confidence) pairs, or find for which support/confidence values exactly k rules can be generated; iv) Including Queries: find the rules including itemsets i_1, ..., i_n; v) Excluding Queries: compute the rules excluding itemsets i_1, ..., i_n; and vi) Hierarchical Queries: treat items i_1 (e.g., coke), ..., i_n (e.g., pepsi) as one item (e.g., cola) and return the new rules.

Our approach is to adapt the Increment Sequence Lattice. The preprocessing step of the algorithm computes such a lattice for a small enough support S_min, such that all future queries will involve a support S larger than S_min. In order to handle certain queries (Including, Excluding, etc.), we modify the lattice to allow links from a k-length sequence to all of its k subsequences of length k−1 (rather than just its two generating subsequences). Given such a lattice, we can answer all but one (the Hierarchical queries) of the query types described above at interactive speeds, without going back to the original database. This is easy to see, as all of these queries basically involve a form of pruning over the lattice. A lattice, as opposed to a flat file containing the relevant sequences, is an important data structure because it permits rapid pruning of the relevant sequences. Exactly how we do this is discussed in more detail in [8].
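As an illustration of such lattice-based pruning, here is a hedged Python sketch of answering a few of these queries from a precomputed lattice mapping each sequence to its support, mined once at a low support S_min. The function names (simple_query, including_query, top_k_query) and the tiny example lattice are assumptions for illustration only.

def simple_query(lattice, min_sup):
    """Frequent sequences at any support >= S_min, with no database access."""
    return {s: c for s, c in lattice.items() if c >= min_sup}

def including_query(lattice, min_sup, required_items):
    return {s: c for s, c in simple_query(lattice, min_sup).items()
            if required_items <= {item for itemset in s for item in itemset}}

def top_k_query(lattice, k):
    """Quantified query: the k sequences with highest support (sorting dominates)."""
    return sorted(lattice.items(), key=lambda sc: -sc[1])[:k]

# Example lattice: sequences are tuples of frozensets, values are supports.
lattice = {(frozenset("A"),): 4,
           (frozenset("A"), frozenset("C")): 3,
           (frozenset("B"), frozenset("A")): 3}
print(top_k_query(lattice, 2))
print(including_query(lattice, 3, {"C"}))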

Hierarchical queries require the algorithm to treat a set of related items as one super-item. For example, we may want to treat chips, cookies, peanuts, etc. together as a single item called snacks, and we would like to know the frequent sequences involving this super-item. To generate the resulting sequences, we have to modify the SPADE algorithm. We reconstruct the id-list for the new item (i_1, ..., i_n) via a special union operator, we remove the individual items i_1, ..., i_n from consideration, and we then rerun the equivalence class algorithm for this new item and return the set of frequent sequences.

6 Experimental Evaluation

All the experiments were performed on a single Alpha processor of a DECstation; no other user processes were running at the time. We used several synthetic databases of different sizes, generated using the procedure described in [9]. Although our benchmark databases fit in memory, our goal is to work with out-of-core databases; hence, we assume that the database resides on disk. The datasets are generated using the following process. First, N_I itemsets of average size I are generated by choosing from N items. Then N_S sequences of average length S are created by assigning itemsets from N_I to each sequence. Next, a customer with an average of T transactions is created, and sequences in N_S are assigned to different customer elements, respecting the average transaction size. The generation stops when C customers have been generated. Table 1 shows the databases used and their properties: the total number of transactions is denoted D, the average transaction size per customer is T, and the total number of customers is C. Please see [9] for further details on the dataset generation process and the parameter settings.

[Table 1: Database properties. For each synthetic dataset: the number of customers C, the average transaction size per customer T, the total number of transactions D, and the total size.]

To evaluate the incremental algorithm, we modified the database generation mechanism to construct two datasets: one corresponding to the original database, and one corresponding to the increment database. The input to the generator also included an increment percentage, roughly corresponding to the number of customers in the increment, and the percentage of transactions for each such customer that belongs in the increment database. For example, with an increment percentage of 5%, roughly 5% of the customers would have records in the increment database, and each such customer would have the specified percentage of its transactions in the increment. The actual number of customers in the increment is determined by drawing from a uniform distribution (with the increment percentage as parameter). Similarly, for each customer in the increment, the number of transactions belonging to the increment is also drawn from a uniform distribution (with the transaction percentage as parameter).

Incremental Performance: For the first experiment (see Figure 5), we varied the increment percentage for four databases while keeping the transaction percentage fixed. We ran the SPADE algorithm on the entire database (original plus increment) combined, and evaluated the cost of running just the incremental algorithm (after constructing the ISL from the original database) for increment values of five, three, and one percent.
For each database, we also evaluated the breakdown of the cost of the incremental algorithm's phases. The results show that the speedups obtained by the incremental algorithm, in comparison to rerunning the SPADE algorithm over the entire database, range from a factor of 7 to over two orders of magnitude. As expected, moving from a larger increment value to a smaller one increases the improvement, since a smaller increment introduces fewer new sequences. The breakdown figures reveal that the Phase 1 time is negligible, requiring under a second for all the datasets at all increment values. They also show that the Phase 2 times, while an order of magnitude larger than the Phase 1 times, are still much faster than re-executing the entire algorithm. Further, while increasing the database size does increase the overall running time of Phase 2, it does not increase at the same rate as re-executing the entire algorithm on these datasets.

The second experiment varied the support values for a given increment size, on two databases. The results are documented in Figure 6. For both databases, as the support is increased, the execution times of Phase 1 and Phase 2 rapidly approach zero. This is not surprising when one considers that at higher supports there are fewer elements in the ISL (affecting Phase 1) and far fewer new sequences (affecting Phase 2).

The third experiment kept the support, the number of customers, and the transaction percentage constant, and varied the number of transactions per customer. Figure 7 depicts the breakdown of the two phases of the ISM algorithm for varying increment values. We see that as the number of transactions per customer increases, the execution time of both phases progressively increases for all increment sizes. This is because the ISL contains more sequences (affecting Phase 1) and there are more new sequences (affecting Phase 2).

Interactive Performance: In this section, we evaluate the performance of the interactive queries described in Section 5. All the interactive query experiments were performed on a SUN UltraSPARC workstation, a much slower machine with less memory than the DEC machine used above. We envisage off-loading the interactive querying feature onto client machines, as opposed to executing the queries on the server and shipping the results to the data mining client; thus we wanted to evaluate the cost of executing interactive queries on a slower machine.

[Figure 5: Incremental Algorithm Performance Comparison (times in seconds). For each database, the total SPADE time versus the incremental algorithm at 5%, 3%, and 1% increments, with the Phase 1 / Phase 2 breakdown.]

[Figure 6: Effect of Varying Support. Phase 1 and Phase 2 times (in seconds) versus support, for two databases.]

[Figure 7: Effect of Varying Transaction Length. Phase 1 and Phase 2 breakdown versus the number of transactions per customer, for several increment sizes.]

[Table 2: Interactive Performance (times in seconds). For each database, the time for simple, refined, priority, including, and excluding queries, together with the cost of rerunning SPADE.]

Another reason for evaluating the queries on a slower machine is that the relative speeds of the various interactive queries are better seen there (on the DEC machine all queries executed in negligible time). Since hierarchical queries simply entail a modified execution of Phase 2, we do not evaluate them separately. We evaluated simple querying over a range of supports, refined querying (with the support refined to a lower value for all the datasets), priority querying (querying for the k sequences with highest support), including queries (including a random item), and excluding queries (excluding a random item). Results are presented in Table 2, along with the cost of rerunning the SPADE algorithm on the DEC machine. We see that the querying times for refined, priority, including, and excluding queries are very low and capable of achieving interactive speeds. The priority query takes more time, since it has to sort the sequences according to support value, and this sorting dominates the computation time. Compared with rerunning SPADE (on a much faster DEC machine), interactive querying is several orders of magnitude faster, in spite of executing on a much slower machine.

7 Related Work

Sequence Mining: The concept of sequence mining as defined in this paper was first described in [9]. Recently, SPADE [13] was shown to outperform the algorithm presented in [9] by a factor of two in the general case, and by a factor of ten with a pre-processing step. The problem of finding frequent episodes in a single long sequence of events was presented in [5]. The problem of discovering patterns in multiple event sequences was studied in [7]; they search the rule space directly instead of searching the sequence space and then forming the rules.

Incremental Sequence Mining: There has been almost no work addressing the incremental mining of sequences. One related proposal [12] uses a dynamic suffix-tree based approach to incremental mining in a single long sequence. However, we are dealing with sequences across different customers, i.e., multiple sequences of sets of items, as opposed to a single long sequence of items. The other closest work is in incremental association mining [2, 3, 11]. However, there are important differences that make incremental sequence mining a more difficult problem. While association rules discover only intra-transaction patterns (itemsets), we now also have to discover inter-transaction patterns (sequences). The set of all frequent sequences is an unbounded superset of the (bounded) set of frequent itemsets. Hence, the sequence search is much more complex and challenging than the itemset search, further necessitating fast algorithms.

Interactive Sequence Mining: A mine-and-examine paradigm for interactive exploration of associations and sequence episodes was presented in [4]. Similar paradigms have been proposed exclusively for associations [6, 10]. Our interactive approach tackles a different problem (sequences across different customers) and supports a wider range of interactive querying features.

8 Conclusions

In this paper, we propose novel techniques that maintain data structures for mining sequences in the presence of a) database updates, and b) user interaction.
The results obtained show speedups from several factors to two orders of magnitude for incremental mining, compared with re-executing a state-of-the-art sequence mining algorithm. The results for the interactive approach are even better: at the cost of maintaining a summary data structure, the interactive approach performs several orders of magnitude faster than any current sequence mining algorithm. One of the limitations of the incremental approach proposed in this paper is the size of the negative border, and the resulting memory utilization. We are currently investigating methods to alleviate this problem, either by refining the algorithm or by performing the computation out of core.

References

[1] C. Aggarwal and P. Yu. Online generation of associations. In 14th ICDE, 1998.
[2] D. Cheung, J. Han, V. Ng, and C. Wong. Maintenance of discovered association rules in large databases: an incremental updating technique. In 12th ICDE, 1996.
[3] R. Feldman, Y. Aumann, A. Amir, and H. Mannila. Efficient algorithms for discovering frequent sets in incremental databases. In 2nd DMKD Workshop, 1997.
[4] M. Klemettinen, et al. Finding interesting rules from large sets of discovered association rules. In 3rd CIKM, 1994.
[5] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DMKD Journal, 1(3):259-289, 1997.
[6] R. T. Ng, et al. Exploratory mining and pruning optimizations of constrained association rules. In SIGMOD Conference, 1998.
[7] T. Oates, et al. A family of algorithms for finding temporal structure in data. In 6th Workshop on AI and Statistics, 1997.
[8] S. Parthasarathy, et al. Incremental and interactive sequence mining. TR75, CS Dept., University of Rochester, June 1999.
[9] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In 5th EDBT, 1996.
[10] R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. In 3rd KDD, 1997.
[11] S. Thomas, S. Bodagala, K. Alsabti, and S. Ranka. An efficient algorithm for the incremental updation of association rules in large databases. In 3rd KDD, 1997.
[12] K. Wang. Discovering patterns from large and dynamic sequential data. Journal of Intelligent Information Systems, 9(1), August 1997.
[13] M. J. Zaki. Efficient enumeration of frequent sequences. In 7th CIKM, 1998.


More information

Interestingness Measurements

Interestingness Measurements Interestingness Measurements Objective measures Two popular measurements: support and confidence Subjective measures [Silberschatz & Tuzhilin, KDD95] A rule (pattern) is interesting if it is unexpected

More information

Efficient subset and superset queries

Efficient subset and superset queries Efficient subset and superset queries Iztok SAVNIK Faculty of Mathematics, Natural Sciences and Information Technologies, University of Primorska, Glagoljaška 8, 5000 Koper, Slovenia Abstract. The paper

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 1/8/2014 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 Supermarket shelf

More information

Discovery of Frequent Itemset and Promising Frequent Itemset Using Incremental Association Rule Mining Over Stream Data Mining

Discovery of Frequent Itemset and Promising Frequent Itemset Using Incremental Association Rule Mining Over Stream Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 5, May 2014, pg.923

More information

An Improved Apriori Algorithm for Association Rules

An Improved Apriori Algorithm for Association Rules Research article An Improved Apriori Algorithm for Association Rules Hassan M. Najadat 1, Mohammed Al-Maolegi 2, Bassam Arkok 3 Computer Science, Jordan University of Science and Technology, Irbid, Jordan

More information

CSCI6405 Project - Association rules mining

CSCI6405 Project - Association rules mining CSCI6405 Project - Association rules mining Xuehai Wang xwang@ca.dalc.ca B00182688 Xiaobo Chen xiaobo@ca.dal.ca B00123238 December 7, 2003 Chen Shen cshen@cs.dal.ca B00188996 Contents 1 Introduction: 2

More information

Roadmap. PCY Algorithm

Roadmap. PCY Algorithm 1 Roadmap Frequent Patterns A-Priori Algorithm Improvements to A-Priori Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms Compacting Results Data Mining for Knowledge Management 50 PCY

More information

FM-WAP Mining: In Search of Frequent Mutating Web Access Patterns from Historical Web Usage Data

FM-WAP Mining: In Search of Frequent Mutating Web Access Patterns from Historical Web Usage Data FM-WAP Mining: In Search of Frequent Mutating Web Access Patterns from Historical Web Usage Data Qiankun Zhao Nanyang Technological University, Singapore and Sourav S. Bhowmick Nanyang Technological University,

More information

Data Access Paths for Frequent Itemsets Discovery

Data Access Paths for Frequent Itemsets Discovery Data Access Paths for Frequent Itemsets Discovery Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science {marekw, mzakrz}@cs.put.poznan.pl Abstract. A number

More information

Efficient Remining of Generalized Multi-supported Association Rules under Support Update

Efficient Remining of Generalized Multi-supported Association Rules under Support Update Efficient Remining of Generalized Multi-supported Association Rules under Support Update WEN-YANG LIN 1 and MING-CHENG TSENG 1 Dept. of Information Management, Institute of Information Engineering I-Shou

More information

Distributed minimum spanning tree problem

Distributed minimum spanning tree problem Distributed minimum spanning tree problem Juho-Kustaa Kangas 24th November 2012 Abstract Given a connected weighted undirected graph, the minimum spanning tree problem asks for a spanning subtree with

More information

Discovering Periodic Patterns in System Logs

Discovering Periodic Patterns in System Logs Discovering Periodic Patterns in System Logs Marcin Zimniak 1, Janusz R. Getta 2, and Wolfgang Benn 1 1 Faculty of Computer Science, TU Chemnitz, Germany {marcin.zimniak,benn}@cs.tu-chemnitz.de 2 School

More information

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the Chapter 6: What Is Frequent ent Pattern Analysis? Frequent pattern: a pattern (a set of items, subsequences, substructures, etc) that occurs frequently in a data set frequent itemsets and association rule

More information

A new algorithm for gap constrained sequence mining

A new algorithm for gap constrained sequence mining 24 ACM Symposium on Applied Computing A new algorithm for gap constrained sequence mining Salvatore Orlando Dipartimento di Informatica Università Ca Foscari Via Torino, 155 - Venezia, Italy orlando@dsi.unive.it

More information

The Fuzzy Search for Association Rules with Interestingness Measure

The Fuzzy Search for Association Rules with Interestingness Measure The Fuzzy Search for Association Rules with Interestingness Measure Phaichayon Kongchai, Nittaya Kerdprasop, and Kittisak Kerdprasop Abstract Association rule are important to retailers as a source of

More information

An algorithm for Performance Analysis of Single-Source Acyclic graphs

An algorithm for Performance Analysis of Single-Source Acyclic graphs An algorithm for Performance Analysis of Single-Source Acyclic graphs Gabriele Mencagli September 26, 2011 In this document we face with the problem of exploiting the performance analysis of acyclic graphs

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information

Efficiently Mining Frequent Trees in a Forest

Efficiently Mining Frequent Trees in a Forest Efficiently Mining Frequent Trees in a Forest Mohammed J. Zaki Computer Science Department, Rensselaer Polytechnic Institute, Troy NY 8 zaki@cs.rpi.edu, http://www.cs.rpi.edu/ zaki ABSTRACT Mining frequent

More information

An Improved Algorithm for Mining Association Rules Using Multiple Support Values

An Improved Algorithm for Mining Association Rules Using Multiple Support Values An Improved Algorithm for Mining Association Rules Using Multiple Support Values Ioannis N. Kouris, Christos H. Makris, Athanasios K. Tsakalidis University of Patras, School of Engineering Department of

More information

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining Miss. Rituja M. Zagade Computer Engineering Department,JSPM,NTC RSSOER,Savitribai Phule Pune University Pune,India

More information

Temporal Weighted Association Rule Mining for Classification

Temporal Weighted Association Rule Mining for Classification Temporal Weighted Association Rule Mining for Classification Purushottam Sharma and Kanak Saxena Abstract There are so many important techniques towards finding the association rules. But, when we consider

More information

Closed Pattern Mining from n-ary Relations

Closed Pattern Mining from n-ary Relations Closed Pattern Mining from n-ary Relations R V Nataraj Department of Information Technology PSG College of Technology Coimbatore, India S Selvan Department of Computer Science Francis Xavier Engineering

More information

Performance Analysis of Frequent Closed Itemset Mining: PEPP Scalability over CHARM, CLOSET+ and BIDE

Performance Analysis of Frequent Closed Itemset Mining: PEPP Scalability over CHARM, CLOSET+ and BIDE Volume 3, No. 1, Jan-Feb 2012 International Journal of Advanced Research in Computer Science RESEARCH PAPER Available Online at www.ijarcs.info ISSN No. 0976-5697 Performance Analysis of Frequent Closed

More information

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

ANU MLSS 2010: Data Mining. Part 2: Association rule mining ANU MLSS 2010: Data Mining Part 2: Association rule mining Lecture outline What is association mining? Market basket analysis and association rule examples Basic concepts and formalism Basic rule measurements

More information

Efficient Incremental Mining of Top-K Frequent Closed Itemsets

Efficient Incremental Mining of Top-K Frequent Closed Itemsets Efficient Incremental Mining of Top- Frequent Closed Itemsets Andrea Pietracaprina and Fabio Vandin Dipartimento di Ingegneria dell Informazione, Università di Padova, Via Gradenigo 6/B, 35131, Padova,

More information

A Safe-Exit Approach for Efficient Network-Based Moving Range Queries

A Safe-Exit Approach for Efficient Network-Based Moving Range Queries Data & Knowledge Engineering Data & Knowledge Engineering 00 (0) 5 A Safe-Exit Approach for Efficient Network-Based Moving Range Queries Duncan Yung, Man Lung Yiu, Eric Lo Department of Computing, Hong

More information

APPLYING BIT-VECTOR PROJECTION APPROACH FOR EFFICIENT MINING OF N-MOST INTERESTING FREQUENT ITEMSETS

APPLYING BIT-VECTOR PROJECTION APPROACH FOR EFFICIENT MINING OF N-MOST INTERESTING FREQUENT ITEMSETS APPLYIG BIT-VECTOR PROJECTIO APPROACH FOR EFFICIET MIIG OF -MOST ITERESTIG FREQUET ITEMSETS Zahoor Jan, Shariq Bashir, A. Rauf Baig FAST-ational University of Computer and Emerging Sciences, Islamabad

More information

An Algorithm for Frequent Pattern Mining Based On Apriori

An Algorithm for Frequent Pattern Mining Based On Apriori An Algorithm for Frequent Pattern Mining Based On Goswami D.N.*, Chaturvedi Anshu. ** Raghuvanshi C.S.*** *SOS In Computer Science Jiwaji University Gwalior ** Computer Application Department MITS Gwalior

More information

Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports

Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports R. Uday Kiran P. Krishna Reddy Center for Data Engineering International Institute of Information Technology-Hyderabad Hyderabad,

More information

FastLMFI: An Efficient Approach for Local Maximal Patterns Propagation and Maximal Patterns Superset Checking

FastLMFI: An Efficient Approach for Local Maximal Patterns Propagation and Maximal Patterns Superset Checking FastLMFI: An Efficient Approach for Local Maximal Patterns Propagation and Maximal Patterns Superset Checking Shariq Bashir National University of Computer and Emerging Sciences, FAST House, Rohtas Road,

More information

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph.

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph. Trees 1 Introduction Trees are very special kind of (undirected) graphs. Formally speaking, a tree is a connected graph that is acyclic. 1 This definition has some drawbacks: given a graph it is not trivial

More information

Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach *

Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach * Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach * Hongyan Liu 1 Jiawei Han 2 Dong Xin 2 Zheng Shao 2 1 Department of Management Science and Engineering, Tsinghua

More information

Memory Placement Techniques for Parallel Association Mining

Memory Placement Techniques for Parallel Association Mining Memory Placement Techniques for Parallel ssociation Mining Srinivasan Parthasarathy, Mohammed J. Zaki, and Wei Li omputer Science epartment, University of Rochester, Rochester NY 14627 srini,zaki,wei @cs.rochester.edu

More information

Eager, Lazy and Hybrid Algorithms for Multi-Criteria Associative Classification

Eager, Lazy and Hybrid Algorithms for Multi-Criteria Associative Classification Eager, Lazy and Hybrid Algorithms for Multi-Criteria Associative Classification Adriano Veloso 1, Wagner Meira Jr 1 1 Computer Science Department Universidade Federal de Minas Gerais (UFMG) Belo Horizonte

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 6

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 6 Data Mining: Concepts and Techniques (3 rd ed.) Chapter 6 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2013-2017 Han, Kamber & Pei. All

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University A large set of items, e.g., things sold in a supermarket. A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on

More information

A mining method for tracking changes in temporal association rules from an encoded database

A mining method for tracking changes in temporal association rules from an encoded database A mining method for tracking changes in temporal association rules from an encoded database Chelliah Balasubramanian *, Karuppaswamy Duraiswamy ** K.S.Rangasamy College of Technology, Tiruchengode, Tamil

More information

AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR UTILITY MINING. Received April 2011; revised October 2011

AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR UTILITY MINING. Received April 2011; revised October 2011 International Journal of Innovative Computing, Information and Control ICIC International c 2012 ISSN 1349-4198 Volume 8, Number 7(B), July 2012 pp. 5165 5178 AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR

More information

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets American Journal of Applied Sciences 2 (5): 926-931, 2005 ISSN 1546-9239 Science Publications, 2005 Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets 1 Ravindra Patel, 2 S.S.

More information