
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 18, NO. 4, APRIL 2006

A Transaction Mapping Algorithm for Frequent Itemsets Mining

Mingjun Song and Sanguthevar Rajasekaran, Senior Member, IEEE

Abstract: In this paper, we present a novel algorithm for mining complete frequent itemsets. This algorithm is referred to as the TM (Transaction Mapping) algorithm from here on. In this algorithm, the transaction ids of each itemset are mapped and compressed to continuous transaction intervals in a different space, and the counting of itemsets is performed by intersecting these interval lists in a depth-first order along the lexicographic tree. When the compression coefficient becomes smaller than the average number of comparisons for intervals intersection at a certain level, the algorithm switches to transaction id intersection. We have evaluated the algorithm against two popular frequent itemset mining algorithms, FP-growth and dEclat, using a variety of data sets with short and long frequent patterns. Experimental data show that the TM algorithm outperforms these two algorithms.

Index Terms: Algorithms, association rule mining, data mining, frequent itemsets.

1 INTRODUCTION

Association rules mining is a very popular data mining technique that finds relationships among the different entities of records (for example, transaction records). Since the introduction of frequent itemsets in 1993 by Agrawal et al. [1], it has received a great deal of attention in the field of knowledge discovery and data mining. One of the first algorithms proposed for association rules mining was the AIS algorithm [1]; the problem of association rules mining was introduced in [1] as well. This algorithm was later improved to obtain the Apriori algorithm [2]. The Apriori algorithm employs the downward closure property: if an itemset is not frequent, any superset of it cannot be frequent either.
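As a small illustration of the downward closure property, the following Python sketch checks whether a candidate (k+1)-itemset can be discarded because one of its k-subsets is not frequent. The function name `has_infrequent_subset` and the toy set `frequent_2` are assumptions made for this example only; they do not come from any of the cited implementations.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_k):
    """Return True if some k-subset of a (k+1)-item candidate is not frequent.

    By downward closure, such a candidate cannot be frequent and can be
    dropped without counting its support.
    """
    k = len(candidate) - 1
    return any(frozenset(sub) not in frequent_k
               for sub in combinations(candidate, k))

# Hypothetical set of frequent 2-itemsets.
frequent_2 = {frozenset(p) for p in [(1, 2), (1, 3), (2, 3), (2, 4)]}

print(has_infrequent_subset((1, 2, 3), frequent_2))  # False: all 2-subsets are frequent
print(has_infrequent_subset((1, 2, 4), frequent_2))  # True: {1, 4} is not frequent
```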
The Apriori algorithm performs a breadth-first search in the search space by generating candidate (k+1)-itemsets from frequent k-itemsets. The frequency of an itemset is computed by counting its occurrence in each transaction. Many variants of the Apriori algorithm have been developed, such as AprioriTid, AprioriHybrid, direct hashing and pruning (DHP), dynamic itemset counting (DIC), the Partition algorithm, etc. For a survey on association rules mining algorithms, please see [3].

FP-growth [4] is a well-known algorithm that uses the FP-tree data structure to achieve a condensed representation of the database transactions and employs a divide-and-conquer approach to decompose the mining problem into a set of smaller problems. In essence, it mines all the frequent itemsets by recursively finding all frequent 1-itemsets in the conditional pattern base that is efficiently constructed with the help of a node link structure. A variant of FP-growth is the H-mine algorithm [5]. It uses array-based and trie-based data structures to deal with sparse and dense data sets, respectively. PatriciaMine [6] employs a compressed Patricia trie to store the data sets. FP-growth* [7] uses an array technique to reduce the FP-tree traversal time. In FP-growth-based algorithms, recursive construction of the FP-tree affects the algorithm's performance.

The authors are with the Department of Computer Science and Engineering, University of Connecticut, Storrs, CT. E-mail: {mjsong, rajasek}@engr.uconn.edu. Manuscript received 14 Dec. 2004; revised 23 June 2005; accepted 18 Oct. 2005; published online 17 Feb. For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE.

Eclat [8] is the first algorithm to find frequent patterns by a depth-first search and it has been shown to perform well. It uses a vertical database representation and counts the itemset supports using the intersection of tids.
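The tid-intersection idea can be sketched in a few lines of Python. This is only an illustration of support counting on a vertical layout, not Eclat itself; the helper names `vertical_layout` and `support` and the four-transaction database are assumptions of the sketch.

```python
def vertical_layout(transactions):
    """Turn {tid: items} into {item: set of tids} (a tidset per item)."""
    tidsets = {}
    for tid, items in transactions.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets

def support(itemset, tidsets):
    """Support count of an itemset = size of the intersection of its tidsets."""
    return len(set.intersection(*(tidsets[i] for i in itemset)))

# Hypothetical toy database: transaction id -> items.
db = {1: {"a", "b", "c"}, 2: {"a", "c"}, 3: {"b", "c"}, 4: {"a", "b", "c"}}
tidsets = vertical_layout(db)

print(support(("a", "b"), tidsets))  # 2 (transactions 1 and 4)
print(support(("a", "c"), tidsets))  # 3 (transactions 1, 2, and 4)
```

The cost of each support count is driven by the length of the tid lists, which is the quantity the interval representation introduced later tries to reduce.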
However, because of the depth-first search, the pruning used in the Apriori algorithm is not applicable during the candidate itemsets generation.

VIPER [9] and Mafia [10] also use the vertical database layout and the intersection to achieve a good performance. The only difference is that they use compressed bitmaps to represent the transaction list of each itemset. However, their compression scheme has limitations, especially when tids are uniformly distributed.

Zaki and Gouda [11] developed a new approach called dEclat using the vertical database representation. They store the difference of tids, called the diffset, between a candidate k-itemset and its prefix (k-1)-frequent itemset, instead of the tids intersection set, denoted here as the tidset. They compute the support by subtracting the cardinality of the diffset from the support of its prefix (k-1)-frequent itemset. This algorithm has been shown to gain significant performance improvements over Eclat. However, when the database is sparse, the diffset loses its advantage over the tidset.

In this paper, we present a novel approach that maps and compresses the transaction id list of each itemset into an interval list using a transaction tree and counts the support of each itemset by intersecting these interval lists. The frequent itemsets are found in a depth-first order along a lexicographic tree, as done in the Eclat algorithm. The basic idea is to save the intersection time in Eclat by mapping transaction ids into continuous transaction intervals. When these intervals become scattered, we switch to transaction ids as in Eclat. We call the new algorithm the TM (transaction mapping) algorithm. The rest of the paper is arranged as

follows: Section 2 introduces the basic concepts of association rules mining, two types of data representation, and the lexicographic tree used in our algorithm. Section 3 addresses how the transaction id list of each itemset is compressed to a continuous interval list and gives the details of the TM algorithm. Section 4 gives an analysis of the compression efficiency of transaction mapping. Section 5 experimentally compares the TM algorithm with two popular algorithms, FP-growth and dEclat. In Section 6, we provide some general comments, and Section 7 concludes the paper.

2 BASIC PRINCIPLES

2.1 Association Rules Mining

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items and let $D$ be a database having a set of transactions where each transaction $T$ is a subset of $I$. An association rule is an association relationship of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$, and $X \cap Y = \emptyset$. The support of rule $X \Rightarrow Y$ is defined as the percentage of transactions containing both $X$ and $Y$ in $D$. The confidence of $X \Rightarrow Y$ is defined as the percentage of transactions containing $X$ that also contain $Y$ in $D$. The task of association rules mining is to find all strong association rules that satisfy a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf). Mining association rules consists of two phases. In the first phase, all frequent itemsets that satisfy the min_sup are found. In the second phase, strong association rules are generated from the frequent itemsets found in the first phase. Most research considers only the first phase because, once frequent itemsets are found, mining association rules is trivial.

2.2 Data Representation

Two types of database layouts are employed in association rules mining: horizontal and vertical. In the traditional horizontal database layout, each transaction consists of a set of items and the database contains a set of transactions. Most Apriori-like algorithms use this type of layout. In the vertical database layout, each item maintains a set of transaction ids (denoted by tidset) where this item is contained. This layout could be maintained as a bitvector. Eclat uses tidsets, while VIPER and Mafia use compressed bitvectors. It has been shown that the vertical layout performs generally better than the horizontal format [8], [9]. Table 1, Table 2, and Table 3 show examples for different types of layouts.

TABLE 1: Horizontal Representation
TABLE 2: Vertical Tidset Representation
TABLE 3: Vertical Bitvector Representation

2.3 Lexicographic Prefix Tree

In this paper, we employ a lexicographic prefix tree data structure to efficiently generate candidate itemsets and count their frequency, which is very similar to the lexicographic tree used in the TreeProjection algorithm [12]. This tree structure is also used in many other algorithms such as Eclat [8]. An example of this tree is shown in Fig. 1. Each node in the tree stores a collection of frequent itemsets together with the support of these itemsets. The root contains all frequent 1-itemsets. Itemsets in level $l$ (for any $l$) are frequent $l$-itemsets. Each edge in the tree is labeled with an item. Itemsets in any node are stored as singleton sets with the understanding that the actual itemset also contains all the items found on the edges from this node to the root. For example, consider the leftmost node in level 2 of the tree in Fig. 1. There are four 2-itemsets in this node, namely, {1, 2}, {1, 3}, {1, 4}, and {1, 5}. The singleton sets in each node of the tree are stored in lexicographic order. If the root contains {1}, {2}, ..., {n}, then the nodes in level 2 will contain {2}, {3}, ..., {n}; {3}, {4}, ..., {n}; ...; {n}, and so on. For each candidate itemset, we also store a list of transaction ids (i.e., ids of transactions in which all the items of the itemset occur). This tree will not be generated in full. The tree is generated in a depth-first order and, at any given time, we only store

minimum information needed to continue the search. In particular, this means that, at any instant, at most one path of the tree is stored. As the search progresses, if the expansion of a node cannot possibly lead to the discovery of itemsets that have minimum support, then the node is not expanded and the search backtracks. As soon as a frequent itemset that meets the minimum support requirement is found, it is output. Candidate itemsets generated by depth-first search are the same as those generated by the joining step (without pruning) of the Apriori algorithm.

Fig. 1. Illustration of lexicographic tree.
TABLE 4: A Sample Transaction Database

3 TM ALGORITHM

Our contribution is that we compress tids (transaction ids) for each itemset to continuous intervals by mapping transaction ids into a different space by means of a transaction tree. Frequent itemsets are found by intersecting these interval lists instead of intersecting the transaction id lists (as in the Eclat algorithm). We begin with the construction of a transaction tree.

3.1 Transaction Tree

The transaction tree is similar to the FP-tree except that there is no header table or node link. The transaction tree can be thought of as a compact representation of all the transactions in the database. Each node in the tree has an id corresponding to an item and a counter that keeps the number of transactions that contain this item in this path. Adapted from [4], the construction of the transaction tree (called constructTransactionTree) is as follows:

1. Scan through the database once, identify all the frequent 1-itemsets, and sort them in descending order of frequency. At the beginning, the transaction tree consists of just a single node (which is a dummy root).
2. Scan through the database for a second time.
For each transaction, select the items that are in frequent 1-itemsets, sort them according to the order of frequent 1-itemsets, and insert them into the transaction tree. When inserting an item, start from the root. At the beginning, the root is the current node. In general, if the current node has a child node whose id is equal to this item, then just increment the count of this child by 1; otherwise, create a new child node and set its counter to 1. In either case, that child becomes the current node for the next item of the transaction.

Table 4 and Fig. 2 illustrate the construction of a transaction tree. Table 4 shows an example of a transaction database and Fig. 2 displays the constructed transaction tree assuming the minimum support count is 2. The number before the colon in each node is the item id and the number after the colon is the count of this item in this path.

Fig. 2. A transaction tree for the database shown in Table 4.

3.2 Transaction Mapping and the Construction of Interval Lists

After the transaction tree is constructed, all the transactions that contain an item are represented with an interval list. Each interval corresponds to a contiguous sequence of relabeled ids. Each node in the transaction tree is associated with an interval. The construction of interval lists for each item is done recursively starting from the root in a depth-first order. The process is described as follows: Consider a node u whose number of transactions is c and

whose associated interval is $[s, e]$. Here, $s$ is the relabeled start id and $e$ is the relabeled end id, with $e - s + 1 = c$. Assume that $u$ has $m$ children, with child $i$ having $c_i$ transactions, for $i = 1, 2, \ldots, m$. It is obvious that $\sum_{i=1}^{m} c_i \le c$. If the intervals associated with the children of $u$ are $[s_1, e_1], [s_2, e_2], \ldots, [s_m, e_m]$, these intervals are constructed as follows:

$$s_1 = s \tag{1}$$
$$e_1 = s_1 + c_1 - 1 \tag{2}$$
$$s_i = e_{i-1} + 1, \quad i = 2, 3, \ldots, m \tag{3}$$
$$e_i = s_i + c_i - 1, \quad i = 2, 3, \ldots, m \tag{4}$$

For the root, $s = 1$. For example, in Fig. 2, the root has two children. For the first child, $s_1 = 1$ and $e_1 = 1 + 5 - 1 = 5$, so the interval is $[1, 5]$; for the second child, $s_2 = 5 + 1 = 6$ and $e_2 = 6 + 3 - 1 = 8$, so the interval is $[6, 8]$. The compressed transaction id list of each item is ordered by the start id of each associated interval. In addition, if two intervals are contiguous, they are merged and replaced with a single interval. For example, each interval associated with each node is shown in Fig. 2. The two intervals of item 3, $[1, 2]$ and $[3, 3]$, are merged into $[1, 3]$.

To illustrate the efficiency of this mapping process more clearly, assume that the eight transactions of the example database shown in Table 4 repeat 100 times each. In this case, the transaction tree becomes the one shown in Fig. 3. The mapped transaction interval lists for each item are shown in Table 5, where $[1, 300]$ of item 3 results from the merging of $[1, 200]$ and $[201, 300]$.

Fig. 3. Transaction tree for illustration.
TABLE 5: Example of Transaction Mapping

We now summarize a procedure (called mapTransactionIntervals) that computes the interval lists for each item as follows: Using depth-first order, traverse the transaction tree. For each node, create an interval composed of a start id and an end id.
If it is the first child of its parent, then the start id of the interval is equal to the start id of the parent (1) and the end id is computed by (2). If not, the start id is computed by (3) and the end id is computed by (4). Insert this interval into the interval list of the corresponding item.

Once the interval lists for frequent 1-itemsets are constructed, frequent i-itemsets (for any i) are found by intersecting interval lists along the lexicographic tree. Details are provided in the next section.

3.3 Interval Lists Intersection

In addition to the items described above, each element of a node in the lexicographic tree also stores a transaction interval list (corresponding to the itemset denoted by the element). By constructing the lexicographic tree in a depth-first order, the support count of the candidate itemset is computed by intersecting the interval lists of the two elements. For example, element 2 in the second level of the lexicographic tree in Fig. 1 represents the itemset {1, 2}, whose support count is computed by intersecting the interval lists of itemset {1} and itemset {2}. In contrast, Eclat uses a tid list intersection. Interval lists intersection is more efficient. Note that, since the interval is constructed from the transaction tree, it cannot partially contain or be partially contained in another interval. There are only three possible relationships between any two intervals $A = [s_1, e_1]$ and $B = [s_2, e_2]$:

1. $A \cap B = \emptyset$. In this case, interval A and interval B come from different paths of the transaction tree. For instance, interval $[1, 500]$ and interval $[501, 800]$ in Table 5.
2. $A \supseteq B$. In this case, interval A comes from the ancestor nodes of interval B in the transaction tree. For instance, interval $[1, 500]$ and interval $[1, 300]$ in Table 5.
3. $A \subseteq B$. In this case, interval A comes from the descendant nodes of interval B in the transaction tree. For instance, interval $[1, 300]$ and interval $[1, 500]$ in Table 5.
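The tree construction of Section 3.1, the mapping of Section 3.2, and the intersection described above can be sketched together in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Node`, `build_transaction_tree`, `map_intervals`, `intersect`, and `support` are invented here, ties in item frequency are broken by item id, and the six-transaction database is hypothetical (it is not Table 4). The intersection uses the general overlap rule (maximum of the starts, minimum of the ends), which covers the three cases listed above without relying on them.

```python
from collections import Counter

class Node:
    """Transaction-tree node: an item id, a count, and child nodes."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}  # item -> Node, kept in insertion order

def build_transaction_tree(transactions, min_count):
    # Pass 1: count items, keep the frequent ones, most frequent first
    # (ties broken by item id; that tie-break is an assumption of this sketch).
    freq = Counter(i for t in transactions for i in set(t))
    order = sorted((i for i in freq if freq[i] >= min_count),
                   key=lambda i: (-freq[i], i))
    rank = {item: r for r, item in enumerate(order)}
    # Pass 2: insert each transaction's frequent items along one path.
    root = Node()
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item)
            child.count += 1
            node = child
    return root

def map_intervals(root):
    """Depth-first relabeling; returns item -> sorted list of (start, end)."""
    lists = {}
    def visit(node, start):
        for child in node.children.values():
            end = start + child.count - 1          # equations (2) and (4)
            ivs = lists.setdefault(child.item, [])
            if ivs and ivs[-1][1] + 1 == start:
                ivs[-1] = (ivs[-1][0], end)        # merge contiguous intervals
            else:
                ivs.append((start, end))
            visit(child, start)                    # equation (1) for its first child
            start = end + 1                        # equation (3) for the next sibling
    visit(root, 1)
    return lists

def intersect(a, b):
    """Intersect two interval lists sorted by start id."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        s, e = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if s <= e:
            out.append((s, e))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def support(ivs):
    return sum(e - s + 1 for s, e in ivs)

# Hypothetical database, minimum support count 2.
db = [[1, 2, 3], [1, 2, 3], [2, 3], [2], [1, 3], [1]]
lists = map_intervals(build_transaction_tree(db, min_count=2))
print(lists)                                  # {1: [(1, 4)], 2: [(1, 2), (5, 6)], 3: [(1, 3), (5, 5)]}
print(intersect(lists[2], lists[3]))          # [(1, 2), (5, 5)]
print(support(intersect(lists[2], lists[3]))) # 3
```

In this toy run, the two intervals of item 3 under item 1, [1, 2] and [3, 3], are merged into [1, 3], and the support of {2, 3} is obtained from two interval comparisons that produce output rather than from a scan of two four-element tid lists.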
Considering the above three cases, the average number of comparisons for two intervals is 2.

3.4 Switching

After a certain level of the lexicographic tree, the transaction interval lists of elements in any node are expected to become scattered. There could be many transaction intervals that contain only single tids. At this point, the interval representation loses its advantage over the single tid representation because the intersection of two segments

will use three comparisons in the worst case while the intersection of two single tids only needs one comparison. Therefore, we need to switch to the single tid representation at some point. Here, we define a coefficient of compression for one node in the lexicographic tree, denoted by coeff, as follows: Assume that a node has $m$ elements, and let $s_i$ represent the support of the $i$th element and $l_i$ the size of the transaction list of the $i$th element. Then,

$$\mathrm{coeff} = \frac{1}{m} \sum_{i=1}^{m} \frac{s_i}{l_i}.$$

For the intersection of two interval lists, the average number of comparisons is 2, so we switch to tid set intersection when coeff is less than 2.

3.5 Details of the TM Algorithm

Now, we provide details on the steps involved in the TM algorithm. There are four steps:

1. Scan through the database and identify all frequent 1-itemsets.
2. Construct the transaction tree with counts for each node.
3. Construct the transaction interval lists. Merge intervals if they are mergeable (i.e., if the intervals are contiguous).
4. Construct the lexicographic tree in a depth-first order, keeping only the minimum amount of information necessary to complete the search. This, in particular, means that no more than a path in the lexicographic tree will ever be stored. At any node, if further expansion of that node will not be fruitful, the search backtracks. When processing a node in the tree, for every element in the node, the corresponding interval lists are computed by interval intersections. As the search progresses, itemsets with enough support are output. When the compression coefficient of a node becomes less than 2, switch to tid list intersection.

In the next section, we provide an analysis to indicate how TM can provide computational efficiency.

4 COMPRESSION AND TIME ANALYSIS OF TRANSACTION MAPPING

Suppose the transaction tree is fully filled in the worst case, as illustrated in Fig.
4, where the subscript of $C$ is the possible itemset and $C$ represents the count for this itemset. Assume that there are $n$ frequent 1-itemsets with supports $S_1, S_2, \ldots, S_n$, respectively. Then, we have the following relationships:

$$S_1 = C_1 = |T_1|,$$
$$S_2 = C_2 + C_{1,2} = |T_2| + |T_{1,2}|,$$
$$S_3 = C_3 + C_{1,3} + C_{1,2,3} + C_{2,3} = |T_3| + |T_{1,3}| + |T_{1,2,3}| + |T_{2,3}|,$$
$$\vdots$$
$$S_n = C_n + C_{1,n} + C_{2,n} + \cdots + C_{n-1,n} + C_{1,2,n} + C_{1,3,n} + \cdots = |T_n| + |T_{1,n}| + |T_{2,n}| + \cdots + |T_{n-1,n}| + |T_{1,2,n}| + |T_{1,3,n}| + \cdots.$$

Here, each $T$ represents the interval for a node and $|T|$ represents the length of $T$, which is equal to $C$.

Fig. 4. Full transaction tree.

The maximum number of intervals possible for each frequent 1-itemset $i$ is $2^{i-1}$. The average compression ratio is

$$\text{Avg ratio} \ge S_1 + \frac{S_2}{2} + \frac{S_3}{2^2} + \cdots + \frac{S_i}{2^{i-1}} + \cdots + \frac{S_n}{2^{n-1}} \ge S_n \left(1 + \frac{1}{2} + \frac{1}{2^2} + \cdots + \frac{1}{2^{n-1}}\right) = 2 S_n \left(1 - 2^{-n}\right),$$

where the second inequality uses $S_i \ge S_n$ for every $i$ (the items are sorted in descending order of frequency) and the last step is the sum of a geometric series. When $S_n$, which is equal to min_sup, is high, the compression ratio will be large and, thus, the intersection time will be less.

On the other hand, because the compression ratio for any itemset cannot be less than 1, we assume that, for frequent 1-itemset $i$, the compression ratio is equal to 1, i.e., $S_i / 2^{i-1} = 1$. Then, for all frequent 1-itemsets (in the first level of the lexicographic tree) whose ID number is less than $i$, the compression ratio is greater than 1 and, for all frequent 1-itemsets whose ID number is larger than $i$, the compression ratio is equal to 1. Therefore, we have:

$$\text{Avg ratio} \ge S_1 + \frac{S_2}{2} + \frac{S_3}{2^2} + \cdots + \frac{S_i}{2^{i-1}} + (n - i) \ge 2 S_i \left(1 - 2^{-i}\right) + n - i = 2^i - 1 + n - i.$$

Since $2^i > i$, when $i$ is large, i.e., when fewer of the frequent 1-itemsets have compression equal to 1, the transaction tree is narrow. In the worst case, when the transaction tree is fully filled, the compression ratio reaches the minimum value. Intuitively, when the size of the data set is large and there are more repetitive patterns, the transaction tree will be narrow. In general, market data has this kind of characteristic.
In summary, when the minimum support is large or the items are sparsely associated and there are more repetitive patterns (as in the case of market data), the algorithm runs faster.
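To connect this analysis with the switching rule of Section 3.4, the following Python sketch computes the compression coefficient of a node from the interval lists of its elements and converts the lists to plain tid lists once the coefficient drops below 2. The function names `compression_coeff` and `to_tids` and the two example nodes are assumptions made for illustration; the interval values are not taken from the tables.

```python
def compression_coeff(interval_lists):
    """coeff = (1/m) * sum(s_i / l_i) over the m elements of a node.

    s_i is the support (total length of the intervals) and l_i is the number
    of intervals in the list of the i-th element. Lists are assumed non-empty.
    """
    ratios = []
    for ivs in interval_lists:
        s = sum(end - start + 1 for start, end in ivs)
        ratios.append(s / len(ivs))
    return sum(ratios) / len(ratios)

def to_tids(ivs):
    """Expand an interval list into the explicit list of relabeled tids."""
    return [t for start, end in ivs for t in range(start, end + 1)]

# A well-compressed node: few long intervals.
compact = [[(1, 300)], [(1, 200), (501, 600)]]
# A scattered node: mostly single-tid intervals.
scattered = [[(1, 1), (4, 4), (9, 10)], [(2, 2), (7, 7)]]

print(compression_coeff(compact))    # 225.0
print(compression_coeff(scattered))  # about 1.17, below the threshold of 2

if compression_coeff(scattered) < 2:
    tid_lists = [to_tids(ivs) for ivs in scattered]
    print(tid_lists)                 # [[1, 4, 9, 10], [2, 7]]
```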

5 EXPERIMENTS AND PERFORMANCE EVALUATION

5.1 Comparison with dEclat and FP-Growth

We used five sets of data in our experiments. Three of these sets are synthetic data (T10I4D100K, T25I10D10K, and T40I10D100K). These synthetic data resemble market basket data with short frequent patterns. The other two data sets are real data (Mushroom and Connect-4 data) which are dense in long frequent patterns. These data sets were often used in previous studies of association rules mining and were downloaded from and sets.php. Some characteristics of these data sets are shown in Table 6.

TABLE 6: Characteristics of Experiment Data Sets

We have compared the TM algorithm mainly with two popular algorithms, dEclat and FP-growth, the implementations of which were downloaded from implemented by Goethals, B. using std libraries. They were compiled in Visual C++. The TM algorithm was implemented based on these two codes. Small modifications were made to implement the transaction tree and interval lists construction, interval lists intersection, and switching. The same std libraries were used to make the comparison fair. Implementations that employ other libraries and data structures might be faster than Goethals' implementation; comparing such implementations with the TM implementation would be unfair. The FP-growth code was modified slightly to read the whole database into memory at the beginning so that the comparison of all three algorithms is fair. We did not compare with Eclat because it was shown in [11] that dEclat outperforms Eclat. Both TM and dEclat use the same optimization techniques described below:

1. Early stopping. This technique was used earlier in Eclat [8]. The intersection between two tid sets can be stopped if the number of mismatches in one set is greater than the support of this set minus the minimum support threshold.
For instance, assume that the minimum support threshold is 50 and the supports of two itemsets AB and AC are 60 and 80, respectively. If the number of mismatches in AB has reached 11, then itemset ABC cannot be frequent. For interval lists intersection, the number of mismatches is harder to record because of the complicated set relationships. Thus, we have used the following rule: If the number of transactions not intersected yet is less than the minimum support threshold minus the number of matches, the intersection is stopped.
2. Dynamic ordering. Reordering all the items in every node at each level of the lexicographic tree in ascending order of support can reduce the number of generated candidate itemsets and, hence, reduce the number of needed intersections. This property was first used by Bayardo [13].
3. Save intersection with combination. This technique comes from the following corollary [3]. If the support

of the itemset $X \cup Y$ is equal to the support of $X$, then the support of the itemset $X \cup Y \cup Z$ is equal to the support of the itemset $X \cup Z$. For example, if the support of itemset {1, 2} is equal to the support of {1}, then the support of the itemset {1, 2, 3} is equal to the support of itemset {1, 3}. So, we do not need to conduct the intersection between {1, 2} and {1, 3}. Correspondingly, if the supports of several itemsets are all equal to the support of their common prefix itemset (subset) that is frequent, then any combination of these itemsets will be frequent. For example, if the supports of itemsets {1, 2}, {1, 3}, and {1, 4} are all equal to the support of the frequent itemset {1}, then {1, 2}, {1, 3}, {1, 4}, {1, 2, 3}, {1, 2, 4}, {1, 3, 4}, and {1, 2, 3, 4} are all frequent itemsets. This optimization is similar to the single path solution in the FP-growth algorithm.

All experiments were performed on a DELL 2.4GHz Pentium PC with 1G of memory, running Windows. All times shown include the time for outputting all the frequent itemsets. The results are presented in Table 7, Table 8, Table 9, Table 10, and Table 11 and Fig. 5, Fig. 6, Fig. 7, Fig. 8, Fig. 9, and Fig. 10.

TABLE 8: Runtime(s) for T25I10D10K Data

Table 7 shows the running time of the compared algorithms on T10I4D100K data with different minimum supports, represented as a percentage of the total transactions. Under large minimum supports, dEclat runs faster than FP-growth, while it runs slower than FP-growth under small minimum supports. The TM algorithm runs faster than both algorithms under almost all minimum support values. On average, the TM algorithm runs almost 2 times faster than the faster of FP-growth and dEclat. Two graphs (Fig. 5 and Fig. 6) are employed to display the performance comparison under large minimum support and small minimum support, respectively. Table 8 and Fig.
7 show the performance comparison of the compared algorithms on T25I10D10K data. dEclat runs, in general, faster than FP-growth, with some exceptions at some minimum support values. The TM algorithm runs twice as fast as dEclat on average. Table 9 and Fig. 8 show the performance comparison on T40I10D100K data. The TM algorithm runs faster when the minimum support is larger and slower when the minimum support is smaller.

TABLE 9: Runtime(s) for T40I10D100K Data
TABLE 10: Runtime(s) for Mushroom Data

Table 10 and Fig. 9 compare the algorithms of interest on mushroom data. dEclat is better than FP-growth, while TM is better than dEclat. Table 11 and Fig. 10 show the relative performance of the algorithms on Connect-4 data. Connect-4 data is very dense and, hence, the smallest minimum support is 40 percent in this experiment. Similar to the result on mushroom data, dEclat is faster than FP-growth while TM is faster than dEclat, though the difference is not significant.

TABLE 11: Runtime(s) for Connect-4 Data
Fig. 5. Runtime for T10I4D100K data (1).
Fig. 6. Runtime for T10I4D100K data (2).
Fig. 7. Runtime for T25I10D10K data.
Fig. 8. Runtime for T40I10D100K data.

5.2 Experiments with dTM

We have combined the TM algorithm with the dEclat algorithm in the following way: We represent the diffset [11] in dEclat between a candidate k-itemset and its prefix (k-1)-frequent itemset using mapped transaction intervals and compute the support by subtracting the cardinality of the diffset from the support of its prefix (k-1)-frequent itemset. We name the corresponding algorithm the dTM algorithm. We ran the dTM algorithm on the five data sets and the runtimes are shown in Table 7, Table 8, Table 9, Table 10, and Table 11. Unexpectedly, the performance of dTM is worse than that of TM. The reason is that the computation of the difference interval sets between two itemsets is more complicated than the computation of the intersection and has more overhead. For instance, consider interval set1 = [s1, e1] and interval set2 = [s2, e2], [s3, e3]. Both [s2, e2] and [s3, e3] are within [s1, e1]. The difference between

[s3, e3] and [s1, e1] is dependent on the difference between [s2, e2] and [s1, e1]. So, there are more cases to consider here than in the computation of the intersection of two sets.

Fig. 9. Runtime for mushroom data.

5.3 Experiments with MAFIA and FP-Growth*

In this experiment, we experimented with two other algorithms mentioned in the introduction: MAFIA and FP-growth*. The comparison, however, is just for reference because the implementations of MAFIA and FP-growth* use different libraries and data structures, which makes the comparison unfair. The implementation of MAFIA was downloaded from Mafia/#download, and the implementation of FP-growth* was downloaded from dbdm/dm.html. The runtimes for these two algorithms are also shown in Table 7, Table 8, Table 9, Table 10, and Table 11. TM is faster than MAFIA for four data sets, while slower than MAFIA for just the mushroom data set. FP-growth* is the fastest among all the algorithms with which we experimented. The comparison, however, is unfair. For example, FP-tree construction should be slower than the transaction tree construction, but, in FP-growth*, the implementation of FP-tree construction is faster than our implementation of transaction tree construction. In the case of a minimum support of 0.5 percent, FP-growth* runs in 1.187s, while the construction of the transaction tree alone in the TM algorithm takes 1.281s. The runtime difference between FP-growth and FP-growth* is not as large in the paper on FP-growth* [7] as in this experiment ([7] uses a different implementation of FP-growth), which indicates that the implementation plays a great role.

6 DISCUSSION

6.1 Overhead of Constructing Interval Lists and Interval Comparison

One may be concerned with the fact that it takes extra effort to relabel while constructing interval lists.
Fortunately, the transaction tree is constructed just once and the relabeling of transactions is done just once by traversing the transaction tree in depth-first order. Relabeling time is negligible compared to the intersection time. For example, for connect_4 data with a support of 0.5, the construction of the transaction tree takes 0.734s, constructing interval lists takes less than 0.001s, and generating frequent sets takes s. In the FP-growth algorithm, constructing the first FP-tree takes s, which is longer than the time to construct the transaction tree because of building the header table and node links.

Fig. 10. Runtime for Connect-4 data.

There is an overhead of interval comparison: the average number of comparisons per interval intersection is 2, given the three possible relationships between two intervals, which is greater than the number of comparisons for id intersection (which is 1) used in the Eclat algorithm. During the first several levels, however, the interval compression ratio is bigger than 2. We also keep track of this compression ratio (coefficient) and, when it becomes less than 2, we switch to id list intersection as in the Eclat algorithm. Therefore, our algorithm combines, to some extent, the advantages of both FP-growth and Eclat. When data can be compressed well by the transaction tree (one advantage of FP-growth is its use of the FP-tree to compress the data), we use interval lists intersection; when it cannot, we switch to id lists intersection as in Eclat.

6.2 Runtime

The data sets we have used in our experiments have often been used in previous research and the times shown include the time needed in all the steps. Our algorithm outperforms FP-growth and dEclat. Actually, it is also much faster than Eclat. We did not show the comparison with Eclat because dEclat was claimed to outperform Eclat in [11]. We believe that our algorithm will be faster than the Apriori algorithm.
We did not compare TM and Apriori since the algorithms FP-growth, Eclat, and dEclat have been shown to outperform Apriori [4], [8], [11].

6.3 Storage Cost

Storage cost for maintaining the intervals of itemsets is less than that for maintaining id lists in the Eclat algorithm, because once an interval is generated, its corresponding node in the transaction tree is deleted. Once all the interval lists are generated, the transaction tree is removed, so we only need to store interval lists. The storage is also less than that of the FP-tree (the FP-tree has a header table and node links). We use the lexicographic tree just to illustrate the DFS procedure, as in the Eclat algorithm. This tree is built on the fly and is never built or stored in full.

6.4 About Comparisons

This paper focuses on algorithmic concepts rather than on implementations. For the same algorithm, the runtime is different for different implementations. We downloaded the dEclat and FP-growth implementations of Goethals and implemented our algorithm based on his code. The data structures (set, vector, and multisets) and libraries (std) used are the same and only the algorithmic parts are different. This makes the comparison fair. Although the implementations of MAFIA and FP-growth* used in this experiment are all in C/C++, the data structures and libraries used are different, which makes the comparison unfair for the algorithms. For example, FP-tree construction should be slower than transaction tree construction. But, in FP-growth*, the implementation of FP-tree construction is faster than our implementation of transaction tree construction. Another example is that [7] uses a different implementation of FP-growth, so the runtime difference between FP-growth and FP-growth* there is not as large as in this experiment. For the TM algorithm, we just modified Goethals's implementation of FP-tree construction and did not use faster implementations because we wanted to make the comparison between TM and FP-growth fair. Our implementation, however, could be improved to make the runtime faster. We feel that if we develop an implementation tailored for the TM algorithm instead of just modifying the downloaded code, TM will be competitive with FP-growth*.

7 CONCLUSIONS AND FUTURE WORK

In this paper, we have presented a new algorithm, TM, using the vertical database representation. Transaction ids of each itemset are transformed and compressed to continuous transaction interval lists in a different space using the transaction tree, and frequent itemsets are found by transaction interval intersection along a lexicographic tree in depth-first order. This compression greatly reduces the intersection time.
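As a rough illustration of the interval intersection summarized above, the following Python sketch intersects two sorted lists of disjoint closed intervals and derives a support count from the result. It is not the authors' implementation: the function names (`intersect_intervals`, `support`, `compression_coefficient`) are hypothetical, the two-pointer scan is a generic technique rather than the three-case comparison discussed in Section 6.1, and the definition of the compression coefficient as the average number of transaction ids per interval is an assumption made for this example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # closed interval (start, end) of mapped transaction ids


def intersect_intervals(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted lists of disjoint closed intervals (generic two-pointer scan)."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start <= end:  # the two current intervals share at least one id
            out.append((start, end))
        # Advance the list whose current interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def support(intervals: List[Interval]) -> int:
    """Number of transaction ids covered by an interval list."""
    return sum(end - start + 1 for start, end in intervals)


def compression_coefficient(intervals: List[Interval]) -> float:
    """Assumed definition: average number of transaction ids per interval."""
    return support(intervals) / len(intervals) if intervals else 0.0


if __name__ == "__main__":
    x = [(1, 10), (15, 20)]                    # hypothetical interval list of itemset X
    y = [(3, 5), (8, 8), (15, 17), (30, 31)]   # hypothetical interval list of itemset Y
    xy = intersect_intervals(x, y)
    print(xy, support(xy), compression_coefficient(xy))
```

Under the switching rule of Section 6.1, a caller could compare `compression_coefficient(xy)` against the threshold of 2 and fall back to plain transaction id lists when the value drops below it.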
Through experiments, the TM algorithm has been shown to gain significant performance improvement over FP-growth and dEclat on data sets with short frequent patterns and also some improvement on data sets with long frequent patterns. We have also performed the compression and time analysis of transaction mapping using the transaction tree and proven that transaction mapping can greatly compress the transaction ids into continuous transaction intervals, especially when the minimum support is high. Although FP-growth* is faster than TM in this experiment, the comparison is unfair. In our future work, we plan to improve the implementation of the TM algorithm and make a fair comparison with FP-growth*.

ACKNOWLEDGMENTS

This work has been supported in part by US National Science Foundation Grants CCR and ITR

REFERENCES

[1] R. Agrawal, T. Imielinski, and A.N. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp , May
[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Int'l Conf. Very Large Data Bases, pp ,
[3] B. Goethals, "Survey on Frequent Pattern Mining," manuscript,
[4] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 1-12, May
[5] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang, "H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases," Proc. IEEE Int'l Conf. Data Mining, pp ,
[6] A. Pietracaprina and D. Zandolin, "Mining Frequent Itemsets Using Patricia Tries," Proc. ICDM 2003 Workshop Frequent Itemset Mining Implementations, Dec
[7] G. Grahne and J. Zhu, "Efficiently Using Prefix-Trees in Mining Frequent Itemsets," Proc. ICDM 2003 Workshop Frequent Itemset Mining Implementations, Dec
[8] M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, "New Algorithms for Fast Discovery of Association Rules," Proc. Third Int'l Conf. Knowledge Discovery and Data Mining, pp ,
[9] P. Shenoy, J.R. Haritsa, S. Sudarshan, G. Bhalotia, M. Bawa, and D. Shah, "Turbo-Charging Vertical Mining of Large Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp , May
[10] D. Burdick, M. Calimlim, and J. Gehrke, "MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases," Proc. Int'l Conf. Data Eng., pp , Apr
[11] M.J. Zaki and K. Gouda, "Fast Vertical Mining Using Diffsets," Proc. Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp ,
[12] R. Agrawal, C. Aggarwal, and V. Prasad, "A Tree Projection Algorithm for Generation of Frequent Item Sets," Parallel and Distributed Computing, pp ,
[13] R.J. Bayardo, "Efficiently Mining Long Patterns from Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp , June

Mingjun Song received his first PhD degree in remote sensing from the University of Connecticut. He is working at the ADE Corporation as a software research engineer and is in his second PhD program in computer science and engineering at the University of Connecticut. His research interests include algorithms and complexity, data mining, pattern recognition, image processing, remote sensing, and geographical information systems.

Sanguthevar Rajasekaran received the ME degree in automation from the Indian Institute of Science, Bangalore, in 1983, and the PhD degree in computer science from Harvard University in He is a full professor and UTC Chair Professor of Computer Science and Engineering (CSE) at the University of Connecticut (UConn). He is also the director of the Booth Engineering Center for Advanced Technologies (BECAT) at UConn. Before joining UConn, he served as a faculty member in the CISE Department at the University of Florida and in the CIS Department at the University of Pennsylvania. From , he was the chief scientist for Arcot Systems.
His research interests include parallel algorithms, bioinformatics, data mining, randomized computing, computer simulations, and combinatorial optimization. He has published more than 130 articles in journals and conferences. He has coauthored two texts on algorithms and coedited four books on algorithms and related topics. He is an elected member of the Connecticut Academy of Science and Engineering (CASE). He is a senior member of the IEEE.


More information

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Virendra Kumar Shrivastava 1, Parveen Kumar 2, K. R. Pardasani 3 1 Department of Computer Science & Engineering, Singhania

More information

A NOVEL ALGORITHM FOR MINING CLOSED SEQUENTIAL PATTERNS

A NOVEL ALGORITHM FOR MINING CLOSED SEQUENTIAL PATTERNS A NOVEL ALGORITHM FOR MINING CLOSED SEQUENTIAL PATTERNS ABSTRACT V. Purushothama Raju 1 and G.P. Saradhi Varma 2 1 Research Scholar, Dept. of CSE, Acharya Nagarjuna University, Guntur, A.P., India 2 Department

More information

The Relation of Closed Itemset Mining, Complete Pruning Strategies and Item Ordering in Apriori-based FIM algorithms (Extended version)

The Relation of Closed Itemset Mining, Complete Pruning Strategies and Item Ordering in Apriori-based FIM algorithms (Extended version) The Relation of Closed Itemset Mining, Complete Pruning Strategies and Item Ordering in Apriori-based FIM algorithms (Extended version) Ferenc Bodon 1 and Lars Schmidt-Thieme 2 1 Department of Computer

More information

An Algorithm for Mining Large Sequences in Databases

An Algorithm for Mining Large Sequences in Databases 149 An Algorithm for Mining Large Sequences in Databases Bharat Bhasker, Indian Institute of Management, Lucknow, India, bhasker@iiml.ac.in ABSTRACT Frequent sequence mining is a fundamental and essential

More information

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set To Enhance Scalability of Item Transactions by Parallel and Partition using Dynamic Data Set Priyanka Soni, Research Scholar (CSE), MTRI, Bhopal, priyanka.soni379@gmail.com Dhirendra Kumar Jha, MTRI, Bhopal,

More information

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, statistics foundations 5 Introduction to D3, visual analytics

More information

On Mining Max Frequent Generalized Itemsets

On Mining Max Frequent Generalized Itemsets On Mining Max Frequent Generalized Itemsets Gene Cooperman Donghui Zhang Daniel Kunkle College of Computer & Information Science Northeastern University, Boston, MA 02115 {gene, donghui, kunkle}@ccs.neu.edu

More information

An Efficient Algorithm for finding high utility itemsets from online sell

An Efficient Algorithm for finding high utility itemsets from online sell An Efficient Algorithm for finding high utility itemsets from online sell Sarode Nutan S, Kothavle Suhas R 1 Department of Computer Engineering, ICOER, Maharashtra, India 2 Department of Computer Engineering,

More information

Modified Frequent Itemset Mining using Itemset Tidset pair

Modified Frequent Itemset Mining using Itemset Tidset pair IJCST Vo l. 8, Is s u e 1, Ja n - Ma r c h 2017 ISSN : 0976-8491 (Online ISSN : 2229-4333 (Print Modified Frequent Itemset Mining using Itemset Tidset pair 1 Dr. Jitendra Agrawal, 2 Dr. Shikha Agrawal

More information

Chapter 4: Association analysis:

Chapter 4: Association analysis: Chapter 4: Association analysis: 4.1 Introduction: Many business enterprises accumulate large quantities of data from their day-to-day operations, huge amounts of customer purchase data are collected daily

More information

A Further Study in the Data Partitioning Approach for Frequent Itemsets Mining

A Further Study in the Data Partitioning Approach for Frequent Itemsets Mining A Further Study in the Data Partitioning Approach for Frequent Itemsets Mining Son N. Nguyen, Maria E. Orlowska School of Information Technology and Electrical Engineering The University of Queensland,

More information

A binary based approach for generating association rules

A binary based approach for generating association rules A binary based approach for generating association rules Med El Hadi Benelhadj, Khedija Arour, Mahmoud Boufaida and Yahya Slimani 3 LIRE Laboratory, Computer Science Department, Mentouri University, Constantine,

More information

Pattern Lattice Traversal by Selective Jumps

Pattern Lattice Traversal by Selective Jumps Pattern Lattice Traversal by Selective Jumps Osmar R. Zaïane Mohammad El-Hajj Department of Computing Science, University of Alberta Edmonton, AB, Canada {zaiane, mohammad}@cs.ualberta.ca ABSTRACT Regardless

More information

Association Rule Mining: FP-Growth

Association Rule Mining: FP-Growth Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong We have already learned the Apriori algorithm for association rule mining. In this lecture, we will discuss a faster

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 6

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 6 Data Mining: Concepts and Techniques (3 rd ed.) Chapter 6 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2013-2017 Han, Kamber & Pei. All

More information

ALGORITHM FOR MINING TIME VARYING FREQUENT ITEMSETS

ALGORITHM FOR MINING TIME VARYING FREQUENT ITEMSETS ALGORITHM FOR MINING TIME VARYING FREQUENT ITEMSETS D.SUJATHA 1, PROF.B.L.DEEKSHATULU 2 1 HOD, Department of IT, Aurora s Technological and Research Institute, Hyderabad 2 Visiting Professor, Department

More information

Concurrent Processing of Frequent Itemset Queries Using FP-Growth Algorithm

Concurrent Processing of Frequent Itemset Queries Using FP-Growth Algorithm Concurrent Processing of Frequent Itemset Queries Using FP-Growth Algorithm Marek Wojciechowski, Krzysztof Galecki, Krzysztof Gawronek Poznan University of Technology Institute of Computing Science ul.

More information

SURVEY OF FREQUENT PATTERN MINING ALGORITHMS IN HORIZONTAL AND VERTICAL DATA LAYOUTS

SURVEY OF FREQUENT PATTERN MINING ALGORITHMS IN HORIZONTAL AND VERTICAL DATA LAYOUTS Available Online at http://warse.org/ijacst/static/pdf/file/ijacst03442015.pdf SURVEY OF FREQUENT PATTERN MINING ALGORITHMS IN HORIZONTAL AND VERTICAL DATA LAYOUTS ABSTRACT A.Meenakshi Department of Computer

More information

Appropriate Item Partition for Improving the Mining Performance

Appropriate Item Partition for Improving the Mining Performance Appropriate Item Partition for Improving the Mining Performance Tzung-Pei Hong 1,2, Jheng-Nan Huang 1, Kawuu W. Lin 3 and Wen-Yang Lin 1 1 Department of Computer Science and Information Engineering National

More information

Survey on Frequent Pattern Mining

Survey on Frequent Pattern Mining Survey on Frequent Pattern Mining Bart Goethals HIIT Basic Research Unit Department of Computer Science University of Helsinki P.O. box 26, FIN-00014 Helsinki Finland 1 Introduction Frequent itemsets play

More information

A Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases *

A Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases * A Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases * Shichao Zhang 1, Xindong Wu 2, Jilian Zhang 3, and Chengqi Zhang 1 1 Faculty of Information Technology, University of Technology

More information

AN ENHANCED SEMI-APRIORI ALGORITHM FOR MINING ASSOCIATION RULES

AN ENHANCED SEMI-APRIORI ALGORITHM FOR MINING ASSOCIATION RULES AN ENHANCED SEMI-APRIORI ALGORITHM FOR MINING ASSOCIATION RULES 1 SALLAM OSMAN FAGEERI 2 ROHIZA AHMAD, 3 BAHARUM B. BAHARUDIN 1, 2, 3 Department of Computer and Information Sciences Universiti Teknologi

More information

Novel Techniques to Reduce Search Space in Multiple Minimum Supports-Based Frequent Pattern Mining Algorithms

Novel Techniques to Reduce Search Space in Multiple Minimum Supports-Based Frequent Pattern Mining Algorithms Novel Techniques to Reduce Search Space in Multiple Minimum Supports-Based Frequent Pattern Mining Algorithms ABSTRACT R. Uday Kiran International Institute of Information Technology-Hyderabad Hyderabad

More information

WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity

WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity Unil Yun and John J. Leggett Department of Computer Science Texas A&M University College Station, Texas 7783, USA

More information

Mining Top-K Association Rules Philippe Fournier-Viger 1, Cheng-Wei Wu 2 and Vincent S. Tseng 2 1 Dept. of Computer Science, University of Moncton, Canada philippe.fv@gmail.com 2 Dept. of Computer Science

More information

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Abstract - The primary goal of the web site is to provide the

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Mining Frequent Patterns and Associations: Basic Concepts (Chapter 6) Huan Sun, CSE@The Ohio State University Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan

More information