
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 18, NO. 4, APRIL 2006

A Transaction Mapping Algorithm for Frequent Itemsets Mining

Mingjun Song and Sanguthevar Rajasekaran, Senior Member, IEEE

Abstract: In this paper, we present a novel algorithm for mining complete frequent itemsets. This algorithm is referred to as the TM (Transaction Mapping) algorithm from here on. In this algorithm, the transaction ids of each itemset are mapped and compressed to continuous transaction intervals in a different space, and the counting of itemsets is performed by intersecting these interval lists in a depth-first order along the lexicographic tree. When the compression coefficient becomes smaller than the average number of comparisons for interval intersection at a certain level, the algorithm switches to transaction id intersection. We have evaluated the algorithm against two popular frequent itemset mining algorithms, FP-growth and dEclat, using a variety of data sets with short and long frequent patterns. Experimental data show that the TM algorithm outperforms these two algorithms.

Index Terms: Algorithms, association rule mining, data mining, frequent itemsets.

1 INTRODUCTION

ASSOCIATION rules mining is a very popular data mining technique that finds relationships among the different entities of records (for example, transaction records). Since the introduction of frequent itemsets in 1993 by Agrawal et al. [1], it has received a great deal of attention in the field of knowledge discovery and data mining. The problem of association rules mining was introduced in [1], together with one of the first algorithms proposed for it, the AIS algorithm. AIS was later improved to obtain the Apriori algorithm [2]. The Apriori algorithm employs the downward closure property: if an itemset is not frequent, any superset of it cannot be frequent either.
The Apriori algorithm performs a breadth-first search in the search space by generating candidate (k+1)-itemsets from frequent k-itemsets. The frequency of an itemset is computed by counting its occurrences in each transaction. Many variants of the Apriori algorithm have been developed, such as AprioriTid, AprioriHybrid, direct hashing and pruning (DHP), dynamic itemset counting (DIC), the Partition algorithm, etc. For a survey on association rules mining algorithms, please see [3].

FP-growth [4] is a well-known algorithm that uses the FP-tree data structure to achieve a condensed representation of the database transactions and employs a divide-and-conquer approach to decompose the mining problem into a set of smaller problems. In essence, it mines all the frequent itemsets by recursively finding all frequent 1-itemsets in the conditional pattern base, which is efficiently constructed with the help of a node link structure. A variant of FP-growth is the H-mine algorithm [5]. It uses array-based and trie-based data structures to deal with sparse and dense data sets, respectively. PatriciaMine [6] employs a compressed Patricia trie to store the data sets. FPgrowth* [7] uses an array technique to reduce the FP-tree traversal time. In FP-growth-based algorithms, recursive construction of the FP-tree affects the algorithm's performance.

(Footnote) The authors are with the Department of Computer Science and Engineering, University of Connecticut, Storrs, CT {mjsong, rajasek}@engr.uconn.edu. Manuscript received 14 Dec. 2004; revised 23 June 2005; accepted 18 Oct. 2005; published online 17 Feb. For information on obtaining reprints of this article, please send to: tkde@computer.org, and reference IEEECS Log Number TKDE.

Eclat [8] is the first algorithm to find frequent patterns by a depth-first search and it has been shown to perform well. It uses a vertical database representation and counts the itemset supports using the intersection of tids.
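To make the tid-intersection idea concrete, the following is a minimal Python sketch, not taken from the Eclat implementation. The helper names `vertical_layout` and `intersect_tids` and the five-transaction toy database are assumptions introduced only for illustration. It builds a vertical layout and obtains the support of an itemset as the size of the intersection of two sorted tid lists.

```python
from collections import defaultdict

def vertical_layout(transactions):
    """Map each item to the sorted list of ids (1-based) of transactions containing it."""
    tidsets = defaultdict(list)
    for tid, items in enumerate(transactions, start=1):
        for item in set(items):
            tidsets[item].append(tid)
    return dict(tidsets)

def intersect_tids(a, b):
    """Merge-style intersection of two sorted tid lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical toy database (not from the paper).
db = [[1, 2, 3], [1, 2], [2, 3], [1, 3], [1, 2, 3]]
tidsets = vertical_layout(db)
tids_12 = intersect_tids(tidsets[1], tidsets[2])
support_12 = len(tids_12)  # support count of itemset {1, 2}
```

Each step of the merge loop costs one comparison per tid, which is the baseline that the interval representation introduced later tries to beat.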
However, because of the depth-first search, the pruning used in the Apriori algorithm is not applicable during candidate itemset generation. VIPER [9] and Mafia [10] also use the vertical database layout and intersection to achieve good performance. The only difference is that they use compressed bitmaps to represent the transaction list of each itemset. However, their compression scheme has limitations, especially when tids are uniformly distributed. Zaki and Gouda [11] developed a new approach called dEclat using the vertical database representation. They store the difference of tids, called the diffset, between a candidate k-itemset and its prefix (k-1)-frequent itemsets, instead of the tids intersection set, denoted here as the tidset. They compute the support by subtracting the cardinality of the diffset from the support of its prefix (k-1)-frequent itemset. This algorithm has been shown to gain significant performance improvements over Eclat. However, when the database is sparse, the diffset loses its advantage over the tidset.

In this paper, we present a novel approach that maps and compresses the transaction id list of each itemset into an interval list using a transaction tree and counts the support of each itemset by intersecting these interval lists. The frequent itemsets are found in a depth-first order along a lexicographic tree, as done in the Eclat algorithm. The basic idea is to save the intersection time in Eclat by mapping transaction ids into continuous transaction intervals. When these intervals become scattered, we switch to transaction ids as in Eclat. We call the new algorithm the TM (transaction mapping) algorithm.

The rest of the paper is arranged as follows: Section 2 introduces the basic concept of association rules mining, two types of data representation, and the lexicographic tree used in our algorithm. Section 3 addresses how the transaction id list of each itemset is compressed to a continuous interval list and gives the details of the TM algorithm. Section 4 gives an analysis of the compression efficiency of transaction mapping. Section 5 experimentally compares the TM algorithm with two popular algorithms, FP-Growth and dEclat. In Section 6, we provide some general comments and Section 7 concludes the paper.

2 BASIC PRINCIPLES

2.1 Association Rules Mining

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items and let $D$ be a database having a set of transactions where each transaction $T$ is a subset of $I$. An association rule is an association relationship of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$, and $X \cap Y = \emptyset$. The support of rule $X \Rightarrow Y$ is defined as the percentage of transactions containing both $X$ and $Y$ in $D$. The confidence of $X \Rightarrow Y$ is defined as the percentage of transactions containing $X$ that also contain $Y$ in $D$. The task of association rules mining is to find all strong association rules that satisfy a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf). Mining association rules consists of two phases. In the first phase, all frequent itemsets that satisfy min_sup are found. In the second phase, strong association rules are generated from the frequent itemsets found in the first phase. Most research considers only the first phase because, once frequent itemsets are found, mining association rules is trivial.

2.2 Data Representation

Two types of database layouts are employed in association rules mining: horizontal and vertical. In the traditional horizontal database layout, each transaction consists of a set of items and the database contains a set of transactions. Most Apriori-like algorithms use this type of layout. In the vertical database layout, each item maintains the set of ids of the transactions in which it is contained (denoted by tidset). This layout can also be maintained as a bitvector. Eclat uses tidsets, while VIPER and Mafia use compressed bitvectors. It has been shown that the vertical layout generally performs better than the horizontal format [8], [9]. Table 1, Table 2, and Table 3 show examples of the different types of layouts.

TABLE 1 Horizontal Representation
TABLE 2 Vertical Tidset Representation
TABLE 3 Vertical Bitvector Representation

2.3 Lexicographic Prefix Tree

In this paper, we employ a lexicographic prefix tree data structure to efficiently generate candidate itemsets and count their frequency. It is very similar to the lexicographic tree used in the TreeProjection algorithm [12]. This tree structure is also used in many other algorithms such as Eclat [8]. An example of this tree is shown in Fig. 1. Each node in the tree stores a collection of frequent itemsets together with the support of these itemsets. The root contains all frequent 1-itemsets. Itemsets in level $l$ (for any $l$) are frequent $l$-itemsets. Each edge in the tree is labeled with an item. Itemsets in any node are stored as singleton sets with the understanding that the actual itemset also contains all the items found on the edges from this node to the root. For example, consider the leftmost node in level 2 of the tree in Fig. 1. There are four 2-itemsets in this node, namely, {1, 2}, {1, 3}, {1, 4}, and {1, 5}. The singleton sets in each node of the tree are stored in lexicographic order. If the root contains {1}, {2}, ..., {n}, then the nodes in level 2 will contain {2}, {3}, ..., {n}; {3}, {4}, ..., {n}; ...; {n}, and so on. For each candidate itemset, we also store a list of transaction ids (i.e., ids of transactions in which all the items of the itemset occur). This tree will not be generated in full. The tree is generated in a depth-first order and, at any given time, we only store the minimum information needed to continue the search. In particular, this means that, at any instant, at most one path of the tree will be stored. As the search progresses, if the expansion of a node cannot possibly lead to the discovery of itemsets that have minimum support, then the node will not be expanded and the search will backtrack. As soon as a frequent itemset that meets the minimum support requirement is found, it is output. Candidate itemsets generated by depth-first search are the same as those generated by the joining step (without pruning) of the Apriori algorithm.

Fig. 1. Illustration of lexicographic tree.

3 TM ALGORITHM

Our contribution is that we compress the tids (transaction ids) of each itemset to continuous intervals by mapping transaction ids into a different space by means of a transaction tree. Frequent itemsets are found by intersecting these interval lists instead of intersecting the transaction id lists (as in the Eclat algorithm). We begin with the construction of a transaction tree.

TABLE 4 A Sample Transaction Database

3.1 Transaction Tree

The transaction tree is similar to the FP-tree except that there is no header table or node link. The transaction tree can be thought of as a compact representation of all the transactions in the database. Each node in the tree has an id corresponding to an item and a counter that keeps the number of transactions that contain this item in this path. Adapted from [4], the construction of the transaction tree (called constructTransactionTree) is as follows:

1. Scan through the database once, identify all the frequent 1-itemsets, and sort them in descending order of frequency. At the beginning, the transaction tree consists of just a single node (which is a dummy root).
2. Scan through the database for a second time.
For each transaction, select the items that are in frequent 1-itemsets, sort them according to the order of frequent 1-itemsets, and insert them into the transaction tree. When inserting an item, start from the root. At the beginning, the root is the current node. In general, if the current node has a child node whose id is equal to this item, then just increment the count of this child by 1; otherwise, create a new child node and set its counter to 1.

Table 4 and Fig. 2 illustrate the construction of a transaction tree. Table 4 shows an example of a transaction database and Fig. 2 displays the constructed transaction tree assuming the minimum support count is 2. The number before the colon in each node is the item id and the number after the colon is the count of this item in this path.

Fig. 2. A transaction tree for the database shown in Table 4.

3.2 Transaction Mapping and the Construction of Interval Lists

After the transaction tree is constructed, all the transactions that contain an item are represented with an interval list. Each interval corresponds to a contiguous sequence of relabeled ids. Each node in the transaction tree is associated with an interval. The construction of interval lists for each item is done recursively, starting from the root, in a depth-first order. The process is described as follows: Consider a node $u$ whose number of transactions is $c$ and whose associated interval is $[s, e]$. Here, $s$ is the relabeled start id and $e$ is the relabeled end id, with $e - s + 1 = c$. Assume that $u$ has $m$ children, with child $i$ having $c_i$ transactions, for $i = 1, 2, \ldots, m$. It is obvious that $\sum_{i=1}^{m} c_i \le c$. If the intervals associated with the children of $u$ are $[s_1, e_1], [s_2, e_2], \ldots, [s_m, e_m]$, these intervals are constructed as follows:

$$s_1 = s, \tag{1}$$
$$e_1 = s_1 + c_1 - 1, \tag{2}$$
$$s_i = e_{i-1} + 1, \quad \text{for } i = 2, 3, \ldots, m, \tag{3}$$
$$e_i = s_i + c_i - 1, \quad \text{for } i = 2, 3, \ldots, m. \tag{4}$$

For the root, $s = 1$. For example, in Fig. 2, the root has two children. For the first child, $s_1 = 1$ and $e_1 = 1 + 5 - 1 = 5$, so the interval is $[1, 5]$; for the second child, $s_2 = 5 + 1 = 6$ and $e_2 = 6 + 3 - 1 = 8$, so the interval is $[6, 8]$. The compressed transaction id list of each item is ordered by the start id of each associated interval. In addition, if two intervals are contiguous, they are merged and replaced with a single interval. For example, the interval associated with each node is shown in Fig. 2. The two intervals of item 3, $[1, 2]$ and $[3, 3]$, will be merged to $[1, 3]$.

To illustrate the efficiency of this mapping process more clearly, assume that the eight transactions of the example database shown in Table 4 repeat 100 times each. In this case, the transaction tree becomes the one shown in Fig. 3. The mapped transaction interval lists for each item are shown in Table 5, where $[1, 300]$ of item 3 results from the merging of $[1, 200]$ and $[201, 300]$.

Fig. 3. Transaction tree for illustration.
TABLE 5 Example of Transaction Mapping

We now summarize a procedure (called mapTransactionIntervals) that computes the interval lists for each item as follows: Using depth-first order, traverse the transaction tree. For each node, create an interval composed of a start id and an end id. If it is the first child of its parent, then the start id of the interval is equal to the start id of the parent (1) and the end id is computed by (2). If not, the start id is computed by (3) and the end id is computed by (4). Insert this interval into the interval list of the corresponding item.

Once the interval lists for frequent 1-itemsets are constructed, frequent i-itemsets (for any i) are found by intersecting interval lists along the lexicographic tree. Details are provided in the next section.

3.3 Interval Lists Intersection

In addition to the items described above, each element of a node in the lexicographic tree also stores a transaction interval list (corresponding to the itemset denoted by the element). By constructing the lexicographic tree in a depth-first order, the support count of a candidate itemset is computed by intersecting the interval lists of the two elements. For example, element 2 in the second level of the lexicographic tree in Fig. 1 represents the itemset {1, 2}, whose support count is computed by intersecting the interval lists of itemset {1} and itemset {2}. In contrast, Eclat uses a tid list intersection. Interval list intersection is more efficient. Note that, since the interval is constructed from the transaction tree, it cannot partially contain or be partially contained in another interval. There are only three possible relationships between any two intervals $A = [s_1, e_1]$ and $B = [s_2, e_2]$:

1. $A \cap B = \emptyset$. In this case, interval A and interval B come from different paths of the transaction tree. For instance, interval $[1, 500]$ and interval $[501, 800]$ in Table 5.
2. $A \supseteq B$. In this case, interval A comes from the ancestor nodes of interval B in the transaction tree. For instance, interval $[1, 500]$ and interval $[1, 300]$ in Table 5.
3. $A \subseteq B$. In this case, interval A comes from the descendant nodes of interval B in the transaction tree. For instance, interval $[1, 300]$ and interval $[1, 500]$ in Table 5.
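The steps of Sections 3.1 to 3.3 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation (which the paper says was written in C++ on top of Goethals' code). The names `Node`, `build_transaction_tree`, `map_transaction_intervals`, `intersect_interval_lists`, and `support` are hypothetical, ties in item frequency are broken by item id (the paper does not specify a tie rule), and the seven-transaction database is invented because the contents of Table 4 are not available here.

```python
from collections import Counter, defaultdict

class Node:
    """Transaction-tree node: an item id, a count, and ordered children."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}  # item id -> Node, kept in insertion order

def build_transaction_tree(transactions, min_count):
    # Scan 1: frequent items in descending frequency (ties by item id: an assumption).
    freq = Counter(i for t in transactions for i in set(t))
    order = sorted((i for i in freq if freq[i] >= min_count),
                   key=lambda i: (-freq[i], i))
    rank = {item: r for r, item in enumerate(order)}
    # Scan 2: insert the frequent items of each transaction in that order.
    root = Node()
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item)
            node = node.children[item]
            node.count += 1
    return root

def map_transaction_intervals(root):
    lists = defaultdict(list)

    def visit(node, start):
        s = start                       # eq. (1): first child starts at the parent's start
        for child in node.children.values():
            e = s + child.count - 1     # eqs. (2) and (4)
            lists[child.item].append((s, e))
            visit(child, s)
            s = e + 1                   # eq. (3)

    visit(root, 1)
    merged = {}
    for item, ivs in lists.items():
        ivs = sorted(ivs)               # order by start id
        out = [ivs[0]]
        for s, e in ivs[1:]:
            if s == out[-1][1] + 1:     # contiguous intervals are merged
                out[-1] = (out[-1][0], e)
            else:
                out.append((s, e))
        merged[item] = out
    return merged

def intersect_interval_lists(a, b):
    """Intersect two interval lists that are sorted by start id."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        (s1, e1), (s2, e2) = a[i], b[j]
        if e1 < s2:                     # A entirely before B
            i += 1
        elif e2 < s1:                   # B entirely before A
            j += 1
        else:                           # overlap: keep the common part
            out.append((max(s1, s2), min(e1, e2)))
            if e1 <= e2:
                i += 1
            else:
                j += 1
    return out

def support(intervals):
    return sum(e - s + 1 for s, e in intervals)

# Hypothetical database (not Table 4), minimum support count 2.
db = [[1, 2, 3], [1, 2, 3], [2, 3], [2, 3], [2, 4], [1, 3], [1, 4]]
intervals = map_transaction_intervals(build_transaction_tree(db, min_count=2))
both_1_3 = intersect_interval_lists(intervals[1], intervals[3])
```

In this toy run, item 1 receives the intervals (1, 2), (6, 6), and (7, 7), and the last two are merged into (6, 7). The intersection takes the common part of overlapping intervals with `max`/`min` instead of relying on strict nesting, which is a deliberately conservative design choice for the sketch. Note that the ids are relabeled ids, not the original transaction ids.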
Considering the above three cases, the average number of comparisons for two intervals is 2.

3.4 Switching

After a certain level of the lexicographic tree, the transaction interval lists of elements in any node can be expected to become scattered. There could be many transaction intervals that contain only single tids. At this point, the interval representation loses its advantage over the single tid representation because the intersection of two segments will use three comparisons in the worst case, while the intersection of two single tids needs only one comparison. Therefore, we need to switch to the single tid representation at some point. Here, we define a coefficient of compression for one node in the lexicographic tree, denoted by coeff, as follows: Assume that a node has $m$ elements, and let $s_i$ represent the support of the $i$th element and $l_i$ the size of the transaction list of the $i$th element. Then,

$$\mathit{coeff} = \frac{1}{m} \sum_{i=1}^{m} \frac{s_i}{l_i}.$$

For the intersection of two interval lists, the average number of comparisons is 2, so we switch to tid set intersection when coeff is less than 2.

3.5 Details of the TM Algorithm

Now, we provide details on the steps involved in the TM algorithm. There are four steps:

1. Scan through the database and identify all frequent 1-itemsets.
2. Construct the transaction tree with counts for each node.
3. Construct the transaction interval lists. Merge intervals if they are mergeable (i.e., if the intervals are contiguous).
4. Construct the lexicographic tree in a depth-first order, keeping only the minimum amount of information necessary to complete the search. This, in particular, means that no more than one path in the lexicographic tree will ever be stored. If, at any node, further expansion will not be fruitful, the search backtracks. When processing a node in the tree, for every element in the node, the corresponding interval lists are computed by interval intersections. As the search progresses, itemsets with enough support are output. When the compression coefficient of a node becomes less than 2, switch to tid list intersection.

In the next section, we provide an analysis to indicate how TM can provide computational efficiency.

4 COMPRESSION AND TIME ANALYSIS OF TRANSACTION MAPPING

Suppose the transaction tree is fully filled in the worst case, as illustrated in Fig. 4, where the subscript of $C$ is the possible itemset and $C$ represents the count for this itemset. Assume that there are $n$ frequent 1-itemsets with supports $S_1, S_2, \ldots, S_n$, respectively. Then, we have the following relationships:

$$S_1 = C_1 = |T_1|,$$
$$S_2 = C_2 + C_{1,2} = |T_2| + |T_{1,2}|,$$
$$S_3 = C_3 + C_{1,3} + C_{1,2,3} + C_{2,3} = |T_3| + |T_{1,3}| + |T_{1,2,3}| + |T_{2,3}|,$$
$$\vdots$$
$$S_n = C_n + C_{1,n} + C_{2,n} + \cdots + C_{n-1,n} + C_{1,2,n} + C_{1,3,n} + \cdots = |T_n| + |T_{1,n}| + |T_{2,n}| + \cdots + |T_{n-1,n}| + |T_{1,2,n}| + |T_{1,3,n}| + \cdots.$$

Fig. 4. Full transaction tree.

Here, each $T$ represents the interval for a node and $|T|$ represents the length of $T$, which is equal to $C$. The maximum number of intervals possible for each frequent 1-itemset $i$ is $2^{i-1}$. The average compression ratio is

$$\text{Avg ratio} \ge S_1 + \frac{S_2}{2} + \frac{S_3}{4} + \cdots + \frac{S_i}{2^{i-1}} + \cdots + \frac{S_n}{2^{n-1}} \ge S_n \left(1 + \frac{1}{2} + \frac{1}{4} + \cdots + \frac{1}{2^{n-1}}\right) = 2 S_n \left(1 - \frac{1}{2^n}\right).$$

When $S_n$, which is equal to min_sup, is high, the compression ratio will be large and, thus, the intersection time will be less. On the other hand, because the compression ratio for any itemset cannot be less than 1, we assume that, for frequent 1-itemset $i$, the compression ratio is equal to 1, i.e., $S_i / 2^{i-1} = 1$. Then, for all frequent 1-itemsets (in the first level of the lexicographic tree) whose ID number is less than $i$, the compression ratio is greater than 1 and, for all frequent 1-itemsets whose ID number is larger than $i$, the compression ratio is equal to 1. Therefore, we have:

$$\text{Avg ratio} \ge S_1 + \frac{S_2}{2} + \frac{S_3}{4} + \cdots + \frac{S_i}{2^{i-1}} + (n - i) \ge 2 S_i \left(1 - \frac{1}{2^i}\right) + n - i = 2^i - 1 + n - i.$$

Since $2^i > i$, when $i$ is large, i.e., when fewer of the frequent 1-itemsets have compression equal to 1, the transaction tree is narrow. In the worst case, when the transaction tree is fully filled, the compression ratio reaches its minimum value. Intuitively, when the size of the data set is large and there are more repetitive patterns, the transaction tree will be narrow. In general, market data has this kind of characteristic.
In summary, when the minimum support is large or the items are sparsely associated and there are more repetitive patterns (as in the case of market data), the algorithm runs faster.
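The switching rule of Section 3.4 can be sketched as follows. This is a hedged illustration rather than the authors' code. It assumes that $l_i$ is the number of intervals in the interval list of element $i$, and the names `compression_coefficient`, `expand_to_tids`, and `SWITCH_THRESHOLD`, as well as the two example nodes, are hypothetical.

```python
SWITCH_THRESHOLD = 2  # average number of comparisons per interval intersection (Section 3.3)

def support(intervals):
    """Number of relabeled transaction ids covered by an interval list."""
    return sum(e - s + 1 for s, e in intervals)

def compression_coefficient(node_elements):
    """coeff = (1/m) * sum_i (s_i / l_i), with l_i taken as the number of intervals."""
    m = len(node_elements)
    return sum(support(ivs) / len(ivs) for ivs in node_elements) / m

def expand_to_tids(intervals):
    """Fall back to a plain sorted tid list once intervals no longer pay off."""
    return [tid for s, e in intervals for tid in range(s, e + 1)]

# Hypothetical nodes of the lexicographic tree, each a list of interval lists.
dense_node = [[(1, 300), (501, 800)], [(1, 1), (5, 5), (9, 9)]]
scattered_node = [[(1, 1), (3, 4)], [(7, 7)]]

dense_coeff = compression_coefficient(dense_node)          # (600/2 + 3/3) / 2
scattered_coeff = compression_coefficient(scattered_node)  # (3/2 + 1/1) / 2

if scattered_coeff < SWITCH_THRESHOLD:
    # From this node downward, intersect tid lists as in Eclat.
    scattered_tids = [expand_to_tids(ivs) for ivs in scattered_node]
```

With these inputs the dense node stays in interval form, while the scattered node falls below the threshold of 2 and is expanded to tid lists.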

5 EXPERIMENTS AND PERFORMANCE EVALUATION

5.1 Comparison with dEclat and FP-Growth

We used five sets of data in our experiments. Three of these sets are synthetic data (T10I4D100K, T25I10D10K, and T40I10D100K). These synthetic data resemble market basket data with short frequent patterns. The other two data sets are real data (Mushroom and Connect-4 data), which are dense in long frequent patterns. These data sets were often used in previous studies of association rules mining and were downloaded from and sets.php. Some characteristics of these data sets are shown in Table 6.

TABLE 6 Characteristics of Experiment Data Sets

We have compared the TM algorithm mainly with two popular algorithms, dEclat and FP-growth, the implementations of which were downloaded from implemented by Goethals, B. using std libraries. They were compiled in Visual C++. The TM algorithm was implemented based on these two codes. Small modifications were made to implement the transaction tree and interval list construction, interval list intersection, and switching. The same std libraries were used to make the comparison fair. Implementations that employ other libraries and data structures might be faster than Goethals' implementation. Comparing such implementations with the TM implementation would be unfair. The FP-growth code was modified slightly to read the whole database into memory at the beginning so that the comparison of all three algorithms is fair. We did not compare with Eclat because it was shown in [11] that dEclat outperforms Eclat. Both TM and dEclat use the same optimization techniques, described below:

1. Early stopping. This technique was used earlier in Eclat [8]. The intersection between two tid sets can be stopped if the number of mismatches in one set is greater than the support of this set minus the minimum support threshold.
For instance, assume that the minimum support threshold is 50 and the supports of two itemsets AB and AC are 60 and 80, respectively. If the number of mismatches in AB has reached 11, then itemset ABC cannot be frequent. For interval list intersection, the number of mismatches is harder to record because of the more complicated set relationships. Thus, we have used the following rule: If the number of transactions not yet intersected is less than the minimum support threshold minus the number of matches, the intersection is stopped.
2. Dynamic ordering. Reordering all the items in every node at each level of the lexicographic tree in ascending order of support can reduce the number of generated candidate itemsets and, hence, the number of needed intersections. This property was first used by Bayardo [13].
3. Save intersection with combination. This technique comes from the following corollary [3]. If the support of the itemset $X \cup Y$ is equal to the support of $X$, then the support of the itemset $X \cup Y \cup Z$ is equal to the support of the itemset $X \cup Z$. For example, if the support of itemset {1, 2} is equal to the support of {1}, then the support of the itemset {1, 2, 3} is equal to the support of itemset {1, 3}. So, we do not need to conduct the intersection between {1, 2} and {1, 3}. Correspondingly, if the supports of several itemsets are all equal to the support of their common prefix itemset (subset) that is frequent, then any combination of these itemsets will be frequent. For example, if the supports of itemsets {1, 2}, {1, 3}, and {1, 4} are all equal to the support of the frequent itemset {1}, then {1, 2}, {1, 3}, {1, 4}, {1, 2, 3}, {1, 2, 4}, {1, 3, 4}, and {1, 2, 3, 4} are all frequent itemsets. This optimization is similar to the single path solution in the FP-growth algorithm.

All experiments were performed on a DELL 2.4 GHz Pentium PC with 1 GB of memory, running Windows. All times shown include the time for outputting all the frequent itemsets. The results are presented in Table 7, Table 8, Table 9, Table 10, and Table 11 and in Fig. 5, Fig. 6, Fig. 7, Fig. 8, Fig. 9, and Fig. 10.

Table 7 shows the running time of the compared algorithms on T10I4D100K data with different minimum supports, expressed as a percentage of the total transactions. Under large minimum supports, dEclat runs faster than FP-Growth, while it runs slower than FP-Growth under small minimum supports. The TM algorithm runs faster than both algorithms under almost all minimum support values. On average, the TM algorithm runs almost 2 times faster than the faster of FP-Growth and dEclat. Two graphs (Fig. 5 and Fig. 6) display the performance comparison under large minimum support and small minimum support, respectively.

Table 8 and Fig. 7 show the performance comparison of the compared algorithms on T25I10D10K data. dEclat runs, in general, faster than FP-Growth, with some exceptions at some minimum support values. The TM algorithm runs twice as fast as dEclat on average.

Table 9 and Fig. 8 show the performance comparison on T40I10D100K data. The TM algorithm runs faster when the minimum support is larger and slower when the minimum support is smaller.

TABLE 7 Runtime(s) for T10I4D100K Data
TABLE 8 Runtime(s) for T25I10D10K Data
TABLE 9 Runtime(s) for T40I10D100K Data
TABLE 10 Runtime(s) for Mushroom Data
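The early-stopping rule for tid sets (optimization 1 in Section 5.1) can be illustrated with a short Python sketch. It is not taken from the Eclat, dEclat, or TM code. The function name `intersect_with_early_stop` and the small inputs are assumptions made for illustration. The intersection is abandoned as soon as either list has accumulated more mismatches than its support minus the minimum support threshold.

```python
def intersect_with_early_stop(a, b, min_sup):
    """Intersect sorted tid lists a and b.

    Returns the intersection, or None as soon as it is certain that
    the result cannot contain at least min_sup tids.
    """
    allowed_a = len(a) - min_sup  # mismatches list a can afford
    allowed_b = len(b) - min_sup  # mismatches list b can afford
    if allowed_a < 0 or allowed_b < 0:
        return None
    miss_a = miss_b = 0
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            miss_a += 1
            i += 1
            if miss_a > allowed_a:
                return None       # too many tids of a are unmatched
        else:
            miss_b += 1
            j += 1
            if miss_b > allowed_b:
                return None       # too many tids of b are unmatched
    return out if len(out) >= min_sup else None

# Hypothetical inputs with a minimum support count of 3.
stopped = intersect_with_early_stop([1, 2, 3, 4], [2, 4, 6, 8, 9], min_sup=3)
kept = intersect_with_early_stop([1, 2, 3, 4], [1, 2, 3, 9], min_sup=3)
```

In the first call the list `[1, 2, 3, 4]` can afford only one mismatch, so the scan stops at its second unmatched tid. With the numbers of the example above (support 60, threshold 50), the same rule stops at the eleventh mismatch.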


TABLE 11 Runtime(s) for Connect-4 Data

Table 10 and Fig. 9 compare the algorithms of interest on mushroom data. dEclat is better than FP-Growth, while TM is better than dEclat. Table 11 and Fig. 10 show the relative performance of the algorithms on Connect-4 data. Connect-4 data is very dense and, hence, the smallest minimum support is 40 percent in this experiment. Similar to the result on mushroom data, dEclat is faster than FP-Growth, while TM is faster than dEclat, though the difference is not significant.

Fig. 5. Runtime for T10I4D100K data (1).
Fig. 6. Runtime for T10I4D100K data (2).
Fig. 7. Runtime for T25I10D10K data.
Fig. 8. Runtime for T40I10D100K data.
Fig. 9. Runtime for mushroom data.

5.2 Experiments with dTM

We have combined the TM algorithm with the dEclat algorithm in the following way: We represent the diffset [11] in dEclat between a candidate k-itemset and its prefix (k-1)-frequent itemset using mapped transaction intervals and compute the support by subtracting the cardinality of the diffset from the support of its prefix (k-1)-frequent itemset. We name the corresponding algorithm the dTM algorithm. We ran the dTM algorithm on the five data sets and the runtimes are shown in Table 7, Table 8, Table 9, Table 10, and Table 11. Unexpectedly, the performance of dTM is worse than that of TM. The reason is that the computation of the difference interval sets between two itemsets is more complicated than the computation of the intersection and has more overhead. For instance, consider interval set1 = $[s_1, e_1]$ and interval set2 = $[s_2, e_2], [s_3, e_3]$. Both $[s_2, e_2]$ and $[s_3, e_3]$ are within $[s_1, e_1]$. The difference between $[s_3, e_3]$ and $[s_1, e_1]$ is dependent on the difference between $[s_2, e_2]$ and $[s_1, e_1]$. So, there are more cases to consider here than in the computation of the intersection of two sets.

5.3 Experiments with MAFIA and FP-Growth*

In this experiment, we tested two other algorithms mentioned in the introduction: MAFIA and FP-growth*. The comparison, however, is just for reference because the implementations of MAFIA and FP-growth* use different libraries and data structures, which makes the comparison unfair. The implementation of MAFIA was downloaded from Mafia/#download, and the implementation of FP-growth* was downloaded from dbdm/dm.html. The runtimes for these two algorithms are also shown in Table 7, Table 8, Table 9, Table 10, and Table 11. TM is faster than MAFIA on four data sets and slower than MAFIA on just the mushroom data set. FP-growth* is the fastest among all the algorithms with which we experimented. The comparison, however, is unfair. For example, FP-tree construction should be slower than transaction tree construction but, in FP-growth*, the implementation of FP-tree construction is faster than our implementation of transaction tree construction. In the case of a minimum support of 0.5 percent, FP-growth* runs in 1.187s, while the construction of the transaction tree alone in the TM algorithm takes 1.281s. The runtime difference between FP-growth and FP-growth* is not as large in the FP-growth* paper [7] as in this experiment ([7] uses a different implementation of FP-growth), which indicates that the implementation plays a great role.

6 DISCUSSION

6.1 Overhead of Constructing Interval Lists and Interval Comparison

One may be concerned with the fact that it takes extra effort to relabel transactions while constructing interval lists.
Fortunately, the transaction tree is constructed just once and the relabeling of transactions is done just once by traversing the transaction tree in depth-first order. Relabeling time is negligible compared to the intersection time. For example, for connect_4 data with a support of 0.5, the construction of the transaction tree takes 0.734s, constructing interval lists takes less than 0.001s, and generating frequent sets takes s. In the FP-growth algorithm, constructing the first FP-tree takes s, which is longer than the time to construct the transaction tree because of the need to build the header table and node links.

Fig. 10. Runtime for Connect-4 data.

There is an overhead of interval comparison: the average number of interval comparisons is 2, given the three possible relationships between two intervals, which is greater than the number of comparisons for id intersection (which is 1) used in the Eclat algorithm. During the first several levels, however, the interval compression ratio is bigger than 2. We also keep track of this compression ratio (coefficient) and, when it becomes less than 2, we switch to single tid intersection as in the Eclat algorithm. Therefore, our algorithm to some extent combines the advantages of both FP-growth and Eclat. When the data can be compressed well by the transaction tree (one advantage of FP-growth is its use of the FP-tree to compress the data), we use interval list intersection; when it cannot, we switch to id list intersection as in Eclat.

6.2 Runtime

The data sets we have used in our experiments have often been used in previous research, and the times shown include the time needed for all the steps. Our algorithm outperforms FP-growth and dEclat. It is also much faster than Eclat. We did not show the comparison with Eclat because dEclat was claimed to outperform Eclat in [11]. We believe that our algorithm will be faster than the Apriori algorithm.
We did not compare TM and Apriori since the algorithms FP-growth, Eclat, and dEclat have been shown to outperform Apriori [4], [8], [11].

6.3 Storage Cost

The storage cost for maintaining the intervals of itemsets is less than that for maintaining id lists in the Eclat algorithm because, once an interval is generated, its corresponding node in the transaction tree is deleted. Once all the interval lists are generated, the transaction tree is removed, so we only need to store the interval lists. The storage is also less than that of the FP-tree (the FP-tree has a header table and node links). We use the lexicographic tree just to illustrate the DFS procedure, as in the Eclat algorithm. This tree is built on the fly and is never built or stored in full.

6.4 About Comparisons

This paper focuses on algorithmic concepts rather than on implementations. For the same algorithm, the runtime is different for different implementations. We downloaded the dEclat and FP-growth implementations of Goethals and implemented our algorithm based on his code. The data structures (set, vector, and multisets) and libraries (std) used are the same and only the algorithmic parts are different. This makes the comparison fair. Although the implementations of MAFIA and FP-growth* used in this experiment are all in C/C++, the data structures and libraries used are different, which makes the comparison unfair for the algorithms. For example, FP-tree construction should be slower than transaction tree construction, but, in FP-growth*, the implementation of FP-tree construction is faster than our implementation of transaction tree construction. Another example is that [7] uses a different implementation of FP-growth, so the runtime difference between FP-growth and FP-growth* reported there is not as large as in this experiment. For the TM algorithm, we just modified Goethals' implementation of FP-tree construction and did not use faster implementations because we wanted to make the comparison between TM and FP-growth fair. Our implementation, however, could be improved to make the runtime faster. We feel that if we develop an implementation tailored for the TM algorithm instead of just modifying the downloaded code, TM will be competitive with FP-growth*.

7 CONCLUSIONS AND FUTURE WORK

In this paper, we have presented a new algorithm, TM, using the vertical database representation. Transaction ids of each itemset are transformed and compressed to continuous transaction interval lists in a different space using the transaction tree, and frequent itemsets are found by transaction interval intersection along a lexicographic tree in depth-first order. This compression greatly reduces the intersection time.
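To make the mapping summarized above more concrete, here is a minimal Python sketch of one plausible way to turn transactions into interval lists. Everything in it is an assumption for illustration, since this section does not restate the exact labeling scheme: transactions are inserted into a prefix tree with per-node counts, each node receives a block of consecutive new ids as long as its count, and the blocks of a node's children are packed from the start of the parent's block. The names `Node`, `build_tree`, and `map_intervals` are hypothetical.

```python
class Node:
    """Prefix-tree node: number of transactions through it, plus children by item."""
    def __init__(self):
        self.count = 0
        self.children = {}


def build_tree(transactions):
    """Insert each transaction (items in a fixed sorted order) into a prefix tree."""
    root = Node()
    for transaction in transactions:
        node = root
        for item in sorted(set(transaction)):
            node = node.children.setdefault(item, Node())
            node.count += 1
    return root


def map_intervals(root):
    """Depth-first traversal that assigns each node a closed interval of new ids.

    Returns a dict mapping every item to its sorted list of (start, end) intervals.
    """
    lists = {}

    def visit(node, start):
        nxt = start
        for item in sorted(node.children):
            child = node.children[item]
            # The child's block has length child.count and nests inside the parent's.
            lists.setdefault(item, []).append((nxt, nxt + child.count - 1))
            visit(child, nxt)
            nxt += child.count

    visit(root, 1)
    return lists


if __name__ == "__main__":
    data = [["a", "b"], ["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]
    for item, intervals in sorted(map_intervals(build_tree(data)).items()):
        print(item, intervals)
```

Under this scheme the total length of an item's intervals equals its support, and the interval of a node always lies inside the interval of each of its ancestors, which is why two intervals produced this way are either nested or disjoint.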
Through experiments, the TM algorithm has been shown to gain significant performance improvement over FP-growth and dEclat on data sets with short frequent patterns and also some improvement on data sets with long frequent patterns. We have also performed the compression and time analysis of transaction mapping using the transaction tree and proven that transaction mapping can greatly compress the transaction ids into continuous transaction intervals, especially when the minimum support is high. Although FP-growth* is faster than TM in this experiment, the comparison is unfair. In our future work, we plan to improve the implementation of the TM algorithm and make a fair comparison with FP-growth*.

ACKNOWLEDGMENTS

This work has been supported in part by US National Science Foundation Grants CCR and ITR

REFERENCES

[1] R. Agrawal, T. Imielinski, and A.N. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp , May
[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Int'l Conf. Very Large Data Bases, pp ,
[3] B. Goethals, "Survey on Frequent Pattern Mining," manuscript,
[4] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 1-12, May
[5] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang, "H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases," Proc. IEEE Int'l Conf. Data Mining, pp ,
[6] A. Pietracaprina and D. Zandolin, "Mining Frequent Itemsets Using Patricia Tries," Proc. ICDM 2003 Workshop Frequent Itemset Mining Implementations, Dec
[7] G. Grahne and J. Zhu, "Efficiently Using Prefix-Trees in Mining Frequent Itemsets," Proc. ICDM 2003 Workshop Frequent Itemset Mining Implementations, Dec
[8] M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, "New Algorithms for Fast Discovery of Association Rules," Proc. Third Int'l Conf. Knowledge Discovery and Data Mining, pp ,
[9] P. Shenoy, J.R. Haritsa, S. Sudarshan, G. Bhalotia, M. Bawa, and D. Shah, "Turbo-Charging Vertical Mining of Large Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp , May
[10] D. Burdick, M. Calimlim, and J. Gehrke, "MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases," Proc. Int'l Conf. Data Eng., pp , Apr
[11] M.J. Zaki and K. Gouda, "Fast Vertical Mining Using Diffsets," Proc. Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp ,
[12] R. Agrawal, C. Aggarwal, and V. Prasad, "A Tree Projection Algorithm for Generation of Frequent Item Sets," Parallel and Distributed Computing, pp ,
[13] R.J. Bayardo, "Efficiently Mining Long Patterns from Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp , June

Mingjun Song received his first PhD degree in remote sensing from the University of Connecticut. He is working at the ADE Corporation as a software research engineer and is in his second PhD program in computer science and engineering at the University of Connecticut. His research interests include algorithms and complexity, data mining, pattern recognition, image processing, remote sensing, and geographical information systems.

Sanguthevar Rajasekaran received the ME degree in automation from the Indian Institute of Science, Bangalore, in 1983, and the PhD degree in computer science from Harvard University. He is a full professor and UTC Chair Professor of Computer Science and Engineering (CSE) at the University of Connecticut (UConn). He is also the director of the Booth Engineering Center for Advanced Technologies (BECAT) at UConn. Before joining UConn, he served as a faculty member in the CISE Department at the University of Florida and in the CIS Department at the University of Pennsylvania. He was also the chief scientist for Arcot Systems. His research interests include parallel algorithms, bioinformatics, data mining, randomized computing, computer simulations, and combinatorial optimization. He has published more than 130 articles in journals and conferences. He has coauthored two texts on algorithms and coedited four books on algorithms and related topics. He is an elected member of the Connecticut Academy of Science and Engineering (CASE). He is a senior member of the IEEE.
