Ascending Frequency Ordered Prefix-tree: Efficient Mining of Frequent Patterns


Guimei Liu, Hongjun Lu
Dept. of Computer Science
The Hong Kong Univ. of Science & Technology
Hong Kong, China
{cslgm,

Yabo Xu, Jeffrey Xu Yu
Dept. of SEEM
The Chinese University of Hong Kong
Hong Kong, China
{ybxu,

Abstract

Mining frequent patterns is a fundamental and important problem in many data mining applications. Many algorithms adopt the pattern growth approach, which has been shown to be significantly superior to the candidate generate-and-test approach. In this paper, we identify the key factors that influence the performance of the pattern growth approach, and optimize them to further improve performance. Our algorithm uses a simple yet compact data structure, the ascending frequency ordered prefix-tree (AFOPT), to organize the conditional databases, in which we use arrays to store single branches to further save space. We traverse our prefix-tree structure using a top-down strategy. Our experimental results show that the combination of the top-down traversal strategy and the ascending frequency item ordering method achieves significant performance improvement over previous works.

1. Introduction

Mining frequent patterns plays an important role in many data mining applications. It can also serve as a feature selection tool for classification [9] and clustering. The problem was first introduced by Agrawal et al. in the context of transactional databases [2]. It can be stated as follows: let I = {i1, i2, ..., in} be a set of items. Each subset of I is referred to as an itemset, and an itemset containing k items is called a k-itemset. Given a database D of customer transactions, where each transaction is a subset of I, the support of an itemset is defined as the percentage of transactions in D supporting it (or as the absolute number of transactions supporting it). Here, we say an itemset s is supported by a customer transaction t if s is a subset of t.
If the support of an itemset exceeds a user-specified minimum support threshold, then the itemset is called a frequent itemset. (This work was partly supported by the Research Grant Council of the Hong Kong SAR, China (Grants DAG0/02.EG4) and the National 973 Fundamental Research Program of China (G ).) The task of the frequent pattern mining problem is, given a minimum support threshold, to enumerate all the frequent itemsets in the given database. The main issues in frequent itemset mining are: (1) reducing the number of database scans, since in many cases the transactional database is too large to fit into main memory and scanning data from disk is very costly; (2) reducing the search space, since every subset of I can be frequent and the number of subsets is exponential in the size of I; (3) counting the support of itemsets efficiently, since naïve subset matching is quite costly due to the large size of the database and the large number of potentially frequent itemsets. Extensive efforts have been put into developing efficient algorithms for mining frequent itemsets. Most, if not all, of the proposed algorithms prune the search space based on the Apriori heuristic: if an itemset is not frequent, then none of its supersets is frequent. Two typical approaches have been proposed: the candidate generate-and-test approach and the pattern growth approach; the latter has been shown to be significantly superior to the former, especially on dense datasets or with a low minimum support threshold. The basic idea of the pattern growth approach is to grow a pattern from its prefix. It constructs a conditional database for each frequent itemset t; the mining of the patterns that have t as prefix is then performed only on t's conditional database. The key factors that influence the performance of the pattern growth approach are the total number of conditional databases built during the whole mining process, and the cost of mining a single conditional database.
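As a concrete illustration of the definitions above, the following sketch (illustrative only; the item names and helper names are arbitrary, not from the paper) counts the support of an itemset and checks it against a minimum support threshold given as a percentage:

```python
# Illustrative sketch of the basic definitions (not the authors' code).
def support(itemset, database):
    """Absolute support: number of transactions containing every item of itemset."""
    s = set(itemset)
    return sum(1 for t in database if s.issubset(t))

def is_frequent(itemset, database, min_sup):
    """min_sup is a fraction of |D| (the percentage form of the definition)."""
    return support(itemset, database) >= min_sup * len(database)

D = [{"a", "c", "e"}, {"a", "b", "c"}, {"a", "d"}, {"b", "c"}]
print(support({"a", "c"}, D))           # 2 transactions contain both a and c
print(is_frequent({"a", "c"}, D, 0.5))  # 2 >= 0.5 * 4, so True
```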
The former depends on how we order the items and how we explore the search space. The latter depends on the representation and the traversal strategy of the conditional databases. In this paper, we propose a compact data structure, the Ascending Frequency Ordered Prefix-Tree (AFOPT), to represent the conditional databases, and the tree is traversed top-down. We show that the combination of the top-down traversal strategy and the ascending frequency ordering method minimizes both the total number of conditional databases and the traversal cost of a single conditional database. The rest of the paper is organized as follows: Section 2 introduces related work; in Section 3, we describe how to use our AFOPT structure to mine the frequent patterns; how to handle very large databases is described in Section 4; Section 5 compares our algorithm with previous works; finally, Section 6 concludes this paper.

2. Related work

The search space of the frequent pattern mining problem can be represented using a set enumeration tree [1, 4, 5, 7]. For example, given a set of items I = {a, b, c, d, e} that are ordered lexicographically, the search space can be represented as the tree shown in Figure 1.

Figure 1. Search space tree

The root of the search space tree represents the empty set, and each node at level l (the root is at level 0, its children are at level 1, and so on) represents an l-itemset. In the remainder of this paper, we will not distinguish between an itemset and the node representing it when there is no ambiguity. Given a node p, if it is frequent and an item x can be appended to p to form a longer frequent itemset, then x is called a frequent extension of p. Suppose q = p ∪ {x} is a child of p and it is frequent; then any frequent extension of p after x is possibly a frequent extension of q. These items are called the candidate extensions of q. For example, if c, d, e are frequent extensions of a, then d and e are candidate extensions of ac, but c is not a candidate extension of ad or ae. Based on the Apriori heuristic, two typical approaches for mining frequent patterns have been proposed: the candidate generate-and-test approach [2, 3, 10] and the pattern growth approach [6, 4, 2]. Both approaches work in an iterative manner. They first generate all frequent 1-itemsets. In each subsequent iteration of the candidate generate-and-test approach, pairs of k-itemsets are joined to form candidate (k+1)-itemsets, then the database is scanned to verify the support of all the candidate (k+1)-itemsets, and the set of resulting frequent (k+1)-itemsets is used as the input for the next iteration.
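The candidate-extension rule above can be sketched as a depth-first walk over the set enumeration tree. This is an illustrative toy miner, not the paper's algorithm: it recounts supports by scanning the full database instead of building conditional databases, and all names are ours.

```python
# Toy depth-first exploration of the set enumeration tree.
def support(itemset, database):
    s = set(itemset)
    return sum(1 for t in database if s.issubset(t))

def pattern_growth(prefix, cand_exts, database, min_count, out):
    # Frequent extensions of prefix, kept in the given item order.
    freq_exts = [x for x in cand_exts
                 if support(prefix | {x}, database) >= min_count]
    for i, x in enumerate(freq_exts):
        q = prefix | {x}
        out.append(frozenset(q))
        # Only extensions *after* x are candidate extensions of q.
        pattern_growth(q, freq_exts[i + 1:], database, min_count, out)

D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
patterns = []
pattern_growth(set(), ["a", "b", "c"], D, 2, patterns)
print(sorted("".join(sorted(p)) for p in patterns))
# ['a', 'ab', 'ac', 'b', 'bc', 'c']   (abc has support 1 and is pruned)
```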
The drawbacks of the candidate generate-and-test algorithms are: (1) they all need to generate a large number of candidate itemsets, many of which prove to be infrequent after scanning the database; (2) they need to scan the database multiple times, in the worst case as many times as the maximal length of the frequent patterns. In contrast, the pattern growth approach avoids the cost of generating a large number of candidate itemsets by growing a frequent itemset from its prefix. It constructs a conditional database for each frequent itemset t, denoted as D_t, which is a collection of projected transactions, each of which contains only the candidate extensions of t. All the patterns that have t as prefix can be mined from D_t alone. The key to the pattern growth approach is how to reduce the traversal and construction cost of a single conditional database, and how to reduce the total number of conditional databases constructed. However, it is always hard to reduce the traversal cost and the construction cost at the same time, because saving construction cost usually incurs more traversal cost, and vice versa. In [6], Han et al. use a compact data structure, the FP-tree, to represent the conditional databases; it is a combination of a prefix-tree structure and node-links. To improve the possibility of prefix sharing, the items are ordered in descending order of their frequencies. The traversal of the FP-tree is from bottom to top along node-links. When the mining of an FP-tree is finished, the number of times a node q has been visited is equal to the number of q's descendants. Another drawback is that at each node we need to maintain a pointer to its parent as well as the node-link. FP-growth is not efficient on sparse databases due to its high tree construction cost. Another algorithm, H-Mine, was proposed [2] to alleviate this problem by using the hyper-structure (a combination of arrays and hyper-links). There is no prefix sharing among different transactions in the hyper-structure.
The H-Mine algorithm constructs new conditional databases via link adjustment, so its construction cost is very low. The negative effect of this pseudo construction is that the conditional databases of most itemsets are unfiltered, which incurs additional traversal cost. The hyper-structure does not change except for the hyper-links, so in the H-Mine algorithm the items can only be ordered by a fixed order. The hyper-structure is not efficient on dense databases, so the H-Mine algorithm switches to the FP-tree structure on dense datasets. Another work that adopts the pattern growth approach is the tree projection algorithm [4]. It physically constructs the conditional databases but uses arrays to store them. It has been shown that this algorithm is consistently outperformed by FP-growth, so we will not study it further. A newly proposed algorithm [8] combines the advantages of H-Mine and FP-growth. It adaptively chooses an array-based or tree-based structure, and pseudo projection or physical projection of the conditional databases. This opportunistic projection technique makes it efficient for both sparse and dense datasets. In this paper, we propose another pattern growth algorithm. We also use a prefix-tree structure to organize the conditional databases. Contrary to the FP-tree structure, our AFOPT structure adopts the top-down traversal strategy and the ascending frequency ordering method. The top-down traversal strategy is capable of minimizing the traversal cost of a conditional database, and we do not need to maintain parent links and node-links at each node. The ascending frequency ordering method is capable of minimizing the total number of conditional databases. One drawback of this

ordering method is that it reduces node sharing among different transactions compared with the descending frequency ordering method. To alleviate this problem, we use arrays to store single branches. This representation leads to great space savings and also reduces the tree construction cost. We will describe the details of our algorithm in the next section. There are also a couple of papers focusing on mining only the frequent closed itemsets, e.g. [, 6, 3], or only the maximal patterns, e.g. [7, 7,, 4, 5]. In this paper, we focus on the problem of mining the complete set of frequent itemsets. However, our algorithm can be easily extended to mine only the frequent closed itemsets or maximal frequent itemsets by incorporating existing pruning techniques.

3. Mining frequent patterns using AFOPT

In this section, we describe our algorithm in detail. We also analyze the time complexity of our algorithm to see why it can be faster than previous ones.

3.1 The mining algorithm

Our algorithm first scans the database to find the frequent items, and sorts them in ascending order of their frequencies. Then the database is scanned a second time to construct a prefix-tree representing the conditional databases of these frequent items. Only the frequent items are included in the prefix-tree. We use arrays to store single branches in the prefix-tree to save space and construction cost. Each node in the prefix-tree contains three pieces of information: the item id, the support of the itemset corresponding to the path from the root to the node, and the pointers to the node's children. Each entry of the arrays keeps only the first two pieces of information. We use an example to illustrate the construction of the prefix-tree from the database. Given a database D of customer transactions as shown in Figure 2 and a minimum support threshold of 40%, the frequent items are F = {c, e, f, d, m, a}, sorted in ascending order of their frequencies.
In the second database scan, transaction 1 is first read into memory; after removing the infrequent items h and p and sorting, it becomes {c, e, f, m, a}. Since the tree is empty, a single branch is created, with each element having support 1. Transaction 2 becomes {c, f, m, a} after removing infrequent items and sorting. It shares the prefix c with transaction 1, so node c's support is set to 2 and it gets two children, e and f. After all the transactions are processed, the whole prefix-tree is as shown in Figure 2. Suppose T is the prefix-tree constructed from the original database, and p1, ..., pm is the set of nodes in T whose item id is i; then the conditional database of item i is the union of the subtrees rooted at p1, ..., pm. For example, in Figure 2, the conditional database of item c is completely represented by the first subtree of the root, since only its root is item c. The conditional database of item e contains two subtrees: one is the first subtree of node c, and the other is the second subtree of the root.

TID  Transactions           Projected Transactions
1    a, c, e, f, h, m, p    c, e, f, m, a
2    a, b, c, f, g, m       c, f, m, a
3    a, d, e, f, q, s, t    e, f, d, a
4    a, b, d, k, m, o, r    d, m, a
5    a, e, f, h, m, q, y    e, f, m, a
6    a, c, d, j, m, t, w    c, d, m, a
7    a, d, u, x, z          d, a

min_sup = 40%, F = {c:3, e:3, f:4, d:4, m:5, a:7}

Figure 2. The AFOPT structure construction

Note that the prefix-tree is only a compact representation of the conditional databases, and it contains the complete information needed for mining frequent itemsets from the original database. The size of the prefix-tree is bounded by, but usually much smaller than, the total number of frequent item occurrences in the database. After the prefix-tree is constructed, the remaining mining is performed only on the prefix-tree. Our algorithm traverses the search space in depth-first order.
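The two-scan construction described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses plain pointer nodes throughout (without the array optimization for single branches), and ties in frequency are broken lexicographically, so the relative order of the equally frequent items d and f may differ from Figure 2.

```python
from collections import Counter

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_afopt(database, min_count):
    # First scan: frequent items, sorted in ascending order of frequency
    # (ties broken lexicographically here; the paper leaves ties unspecified).
    freq = Counter(i for t in database for i in t)
    items = sorted((i for i in freq if freq[i] >= min_count),
                   key=lambda i: (freq[i], i))
    rank = {i: r for r, i in enumerate(items)}
    # Second scan: project each transaction onto the frequent items,
    # sort by ascending-frequency rank, and insert it as a path.
    root = Node(None)
    for t in database:
        node = root
        for i in sorted((i for i in t if i in rank), key=rank.get):
            if i not in node.children:
                node.children[i] = Node(i)
            node = node.children[i]
            node.count += 1
    return root, items

# The example database of Figure 2 (7 transactions, min_sup 40% -> count 3).
D = [set("acefhmp"), set("abcfgm"), set("adefqst"), set("abdkmor"),
     set("aefhmqy"), set("acdjmtw"), set("aduxz")]
root, items = build_afopt(D, 3)
print(items)                      # ['c', 'e', 'd', 'f', 'm', 'a']
print(root.children["c"].count)   # 3, as in Figure 2
```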
Mining is first performed on the first item's conditional database D_c (the first subtree of the root), then on c's children's conditional databases, until all the patterns having c as prefix are mined. After the mining on D_c is finished, D_c is no longer needed, and all of its subtrees are merged with its siblings; we call this step the push-right step. Now the second subtree of the root becomes the first subtree of the root, and it is the complete representation of the second item's conditional database; mining is then performed on it. There are three steps when mining an itemset t's conditional database. In the counting step, D_t is traversed top-down. Each node in D_t is visited once and only once, and the supports of all the items in D_t are counted. The output of this step is the set of frequent extensions of t. In the construction step, D_t is traversed a second time, and a new prefix-tree T' is constructed that contains only the frequent extensions of t. After all the patterns having t as prefix are mined, we have the push-right step, in which the subtrees of t are merged with t's right siblings. The merging operation involves only pointer adjustment and support updates. It can be done by traversing the two trees at the same time: during the traversal, the supports of identical nodes are accumulated, and pointers are adjusted if two nodes with the same item id do not have the same set of children. So the time complexity of the merging operation is proportional to the size of the smaller tree in the worst case, and is much smaller than the size of the smaller tree in the average case. Note that during the merging process no new nodes are created; only the duplicated nodes are disposed of. We still use the example shown in Figure 2 to illustrate the mining process. First, all the patterns containing item c are mined on the first subtree of the root.
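The push-right merge described above can be sketched as follows (illustrative only, using dictionary children rather than the paper's sibling pointers): supports of coinciding nodes are accumulated, and children without a counterpart are re-linked by pointer adjustment, so no new nodes are allocated.

```python
class Node:
    def __init__(self, item, count):
        self.item, self.count, self.children = item, count, {}

def merge(dst, src):
    """Fold subtree src into dst; both roots carry the same item id."""
    dst.count += src.count
    for item, child in src.children.items():
        if item in dst.children:
            merge(dst.children[item], child)  # accumulate duplicated nodes
        else:
            dst.children[item] = child        # pointer adjustment only

# Tiny example: merge a finished subtree e:1 -> {f:1, d:1}
# into its right sibling e:2 -> {f:2}.
dst = Node("e", 2); dst.children["f"] = Node("f", 2)
src = Node("e", 1); src.children["f"] = Node("f", 1); src.children["d"] = Node("d", 1)
merge(dst, src)
print(dst.count, dst.children["f"].count, dst.children["d"].count)  # 3 3 1
```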
The frequent extensions of c are {a, m}, and a new prefix-tree is built which contains only a single branch. We can directly output all the subsets of {c, a, m}, and the mining on D_c is finished. Then all the subtrees of node c are merged with node c's right-

siblings, as illustrated in Figure 3.

Figure 3. Merging subtree ce with subtree e

The subtree rooted at node ce is merged with subtree e, subtree cf becomes the second child of the root, and subtree cd is merged with subtree d. Now the subtree rooted at node e becomes the first subtree of the root, and it completely represents e's conditional database. Mining is then performed on this subtree. The frequent extensions of e are f and a, so we build a new prefix-tree containing only f and a. It also has only one branch, so we can directly output all the subsets of {e, a, f}, and the mining on D_e is finished. Then node e's subtrees are merged with node e's right siblings, the mining is performed on the new first subtree of the root, and so on. Putting everything together, we get our AFOPT algorithm as shown in Algorithm 1. The correctness of our algorithm is guaranteed by the fact that whenever a subtree becomes the first subtree of the root, if the itemset corresponding to this subtree is t, then the subtree completely represents t's conditional database. Remember that in our algorithm, given a prefix-tree, the conditional database of an item i includes all the subtrees whose root's item id is i. Suppose p is one such subtree and p is also a child of the root; then all the other subtrees must appear among p's left siblings, according to the property of the prefix-tree structure. In our algorithm, all the subtrees of node p's left siblings are integrated with the subtree rooted at node p progressively by the merging operation. So every time we perform mining on the first subtree of the root, it contains the complete information of the corresponding conditional database.
Algorithm 1 AFOPT
Input: root — the root of the prefix tree; min_sup — the minimum support threshold;
Description:
1:  if there is only one branch in tree root then
2:      Output patterns and return;
3:  end if
4:  for all children c of root do
5:      Traverse the subtree rooted at c, find the set of frequent items F, and sort the items in F in ascending order of their frequencies;
6:      if |F| > 1 then
7:          Traverse the subtree rooted at c and build a new prefix tree newroot which contains only the items in F;
8:          AFOPT(newroot, min_sup);
9:      end if
10:     for all children subroot of c do
11:         sibroot = the right sibling of c whose item is equal to subroot.item;
12:         Merge(sibroot, subroot);
13:     end for
14: end for

3.2 Time complexity analysis

We now show that using our AFOPT structure results in the minimal traversal cost, and that the ascending frequency ordering method leads to building the minimal number of conditional databases. We have the following lemma:

Lemma 1. Given a conditional database D, suppose we use three different structures to represent it, each with a particular traversal strategy: (1) the hyper-structure with the top-down traversal strategy; (2) the FP-tree structure with the bottom-up traversal strategy; (3) the AFOPT structure with the top-down traversal strategy. Assume the cost of visiting one node is the same in all three structures, denoted cost_0. Then (3) needs the least traversal cost.

Proof. Suppose the total number of item occurrences in D is N. Obviously the traversal cost of (1) is N·cost_0, and the traversal cost of (2) and (3) is no more than N·cost_0. We now show that the traversal cost of (2) is always no less than that of (3). Suppose the number of leaf nodes in the FP-tree representing D is n_fp; then the bottom-up traversal strategy needs to traverse n_fp branches. According to the traversal order, the items in these n_fp branches are ordered in ascending order of their frequencies. If we use these branches to build a prefix-tree, it is exactly our AFOPT structure. Since we use a top-down traversal strategy, the number of node visits is equal to the size of the tree, which is smaller than the total length of the n_fp branches. So (3) needs less traversal cost than (2).

The total number of conditional databases built during the mining process depends on how we order the items. Intuitively, an itemset with high frequency will very probably have more frequent extensions than an itemset with lower frequency. That means that if we take the most frequent item as the first item, it is very likely that we have to construct bigger and/or more prefix-trees in subsequent mining. If we put the most infrequent item in front, the number of its frequent extensions cannot be very large, so we only need to build smaller and/or fewer prefix-trees in the following mining. As mining goes on, the items become more and more frequent, but their candidate extension sets become smaller and smaller, so their prefix-trees cannot be very large either. This also explains why the FP-growth algorithm uses the bottom-up traversal strategy: in FP-growth, the items are ordered in descending order of their frequencies, and the candidate extensions of an item are defined as all the frequent items before it.

If a node p's conditional database D_p can be represented by a single branch, we can enumerate all the frequent patterns contained in the single branch easily; then we need not build new conditional databases from D_p even if p has more than one frequent extension. Our AFOPT structure can help further reduce the number of conditional databases constructed. We have the following lemma:

Lemma 2. Given a conditional database D, if it can be represented by a single branch in an FP-tree, then it can also be represented by a single branch in our AFOPT structure, but not vice versa.

Proof. From Lemma 1, we know that using our AFOPT structure, the number of node visits is minimal. If D can be represented by a single branch using an FP-tree, then it must also be representable by a single branch using our structure; otherwise, it would conflict with Lemma 1. On the other hand, if D can be represented by a single branch in our structure, it may contain multiple branches in an FP-tree. Consider an example: the set of frequent extensions of a node p is {f, a, b, c, d}, in ascending order of their frequencies. The AFOPT structure representing f's conditional database D_f contains a single branch {a : 3, b : 3, c : 2, d : 2}, where the numbers after the colons are the frequencies of the items. If we use an FP-tree to represent D_f, we have two branches: {d : 2, c : 2, b : 2, a : 2} and {b : 1, a : 1}.

Our AFOPT structure may contain more nodes than the FP-tree, since the ascending frequency ordering method reduces the chances for node sharing among different transactions. However, the FP-growth algorithm needs to maintain parent links and node-links at each node, which incurs additional construction cost. To further save space, we use arrays to store single branches in our structure, which leads to significant space and construction cost savings.

3.3 Optimizations

The main overhead of our algorithm lies in the tree construction operation and the push-right operation.
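The single-branch shortcut above can be made concrete: along one branch, counts are non-increasing from the root down, so the support of any subset of the branch is simply the count of its deepest item. A small sketch (enumerate_branch is our illustrative name, not from the paper):

```python
from itertools import combinations

def enumerate_branch(prefix, branch):
    """branch: list of (item, count) pairs from the root downward,
    with non-increasing counts.  Returns every pattern extending prefix
    with a subset of the branch; its support is the deepest item's count."""
    patterns = {}
    items = [i for i, _ in branch]
    count = dict(branch)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):   # preserves branch order
            patterns[tuple(prefix) + combo] = count[combo[-1]]
    return patterns

# The single branch of D_f from the example in the text.
pats = enumerate_branch(("f",), [("a", 3), ("b", 3), ("c", 2), ("d", 2)])
print(pats[("f", "a")])       # 3
print(pats[("f", "a", "c")])  # 2
print(len(pats))              # 2^4 - 1 = 15 patterns extending prefix f
```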
However, the construction and push-right operations are worth doing, since they can reduce the traversal cost dramatically. To further improve the performance of our algorithm, we adopt two optimization techniques to reduce the cost of these two operations.

Bucket Counting Technique: this technique was first proposed in [1]. The basic idea is that if the number of frequent extensions of an itemset is small enough, we can maintain a counter for every combination of the frequent extensions instead of building an AFOPT structure. The bucket counting can be implemented very efficiently compared with the tree construction and traversal operations. However, the condition for using the bucket counting technique should be chosen carefully, since its time complexity is exponential in the number of frequent extensions. In our experiments, we use the bucket counting technique when the number of frequent extensions is no greater than 8.

Combining with the Hyper-structure: the construction and push-right cost of our AFOPT structure is higher than that of the hyper-structure, especially on sparse datasets, where the compression ratio of the AFOPT structure is low and the traversal cost saved is not worth the construction and push-right cost paid. To solve this problem, on sparse datasets we first construct a hyper-structure from the database, with the items in the hyper-structure ordered in ascending order of their frequencies, and then construct AFOPT structures from the hyper-structure.

4. Handling very large databases

One common problem with all the pattern growth algorithms is that when the conditional databases cannot be held in main memory, they suffer from a large number of page swaps between disk and main memory. In this section, we describe how to handle such cases. A partition-based algorithm, Partition, has been proposed in [5] to handle very large databases.
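The bucket counting idea can be sketched as follows (an illustrative reconstruction, not the cited implementation): with k ≤ 8 frequent extensions, each projected transaction maps to a k-bit mask with one counter per bucket; a superset-sum pass then turns exact-mask counts into the support of every combination.

```python
def bucket_count(transactions, extensions):
    """extensions: the (<= 8) frequent extensions of the current prefix.
    Returns (cnt, bit), where cnt[mask] is the number of transactions
    containing every item whose bit is set in mask."""
    k = len(extensions)
    bit = {x: 1 << i for i, x in enumerate(extensions)}
    cnt = [0] * (1 << k)
    for t in transactions:                 # one counter update per transaction
        m = 0
        for x in t:
            m |= bit.get(x, 0)
        cnt[m] += 1
    # Superset-sum (zeta) transform: fold each bit out in turn, so that
    # cnt[m] accumulates the counts of all masks that are supersets of m.
    for i in range(k):
        for m in range(1 << k):
            if m & (1 << i):
                cnt[m ^ (1 << i)] += cnt[m]
    return cnt, bit

trans = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "b", "c"}]
cnt, bit = bucket_count(trans, ["a", "b", "c"])
print(cnt[bit["a"] | bit["b"]])  # 2 transactions contain both a and b
```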
It divides the whole database into several disjoint partitions such that each partition can be held in main memory, then mines local frequent patterns on each partition. After all the partitions have been processed, the local frequent patterns together form a superset of the global frequent patterns. Another database scan is then performed to verify the supports of the local patterns. The whole mining process needs to scan the original database only twice. The H-Mine algorithm handles large databases by a similar approach. The above approach is based on random partitioning. One assumption behind it is that there are not many patterns in the database and they occupy very little memory. But in dense domains, the number of patterns can easily exceed tens of thousands or even more; when all the patterns cannot be held in memory, the Partition algorithm becomes very inefficient. In this section, we propose to partition the database according to the search space. One advantage of our AFOPT structure over the FP-tree is its good data locality. More specifically, the FP-growth algorithm requires that the whole FP-tree be resident in main memory during mining. For very large and not-so-dense databases, it is possible that the FP-tree is too large to be held in main memory, so many random I/Os would be involved. With our AFOPT structure, by contrast, only the first subtree of the root is traversed at any time; the other subtrees are not needed until the subtrees before them have been processed. So when the whole AFOPT structure cannot be held in main memory, we can keep only the first several subtrees in main memory and put all the other data on disk. When the subtrees in memory have been processed, the next few subtrees are fetched from disk. Note that the subtrees in memory also contain information for future mining; the portion needed for immediate mining is kept in main memory and merged with the data just read from disk, and the other data is written back to disk.

The key problem is how many subtrees should be kept in main memory. The first item has the largest candidate extension set, but it is the least frequent one. As mining goes on, the items become more and more frequent, but their candidate extension sets become smaller and smaller. So none of the subtrees can be very large. Given the large amount of memory available nowadays, it is almost impossible for a single subtree to be too large to fit into memory. We can partition the database as follows: (1) scan the database to find the frequent items and sort them in ascending order of their frequencies, denoted F = {i_1, i_2, ..., i_m}; (2) find p_1, p_2, ..., p_l with p_i < p_{i+1}, such that the database is partitioned into l disjoint partitions, where every transaction in the j-th partition must contain one of the items in {i_{p_{j-1}+1}, ..., i_{p_j}} but no item before i_{p_{j-1}+1}. The constraint for choosing the p_i's is that the whole data structure used for mining each partition must fit in memory. The difficulty is that during the mining process new prefix-trees are constructed, and their size and number are almost impossible to calculate accurately before mining. We know the size of a prefix-tree is bounded by the total number of item occurrences in it, so we can estimate the size of a prefix-tree built from partition j as follows: let sup(i) be the support of item i; then the total number of item occurrences in partition j can be estimated as M_j = (Σ_{t=p_{j-1}+1}^{p_j} sup(i_t)) · L_avg, where L_avg is the average length of the transactions. During the mining process, once the mining on a subtree is finished, it is disposed of to make space for mining the other subtrees. So when choosing the p_i's, we only need to ensure that we have enough memory for mining the first subtree. In practice, we have observed that the maximal memory consumed when mining a prefix-tree hardly exceeds twice the size of the original prefix-tree.
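The boundary-selection step can be sketched greedily (illustrative only; choose_partitions and budget are our names and assumptions, not the paper's): extend the current partition while the estimate M_j = (sum of sup(i_t)) × L_avg stays within the per-partition memory budget.

```python
def choose_partitions(supports, l_avg, budget):
    """supports: supports of the frequent items, already in ascending
    frequency order.  budget: memory allowed per partition, in the same
    unit as the item-occurrence estimate.  Returns lists of item indices;
    each list j satisfies sum(sup) * l_avg <= budget (single items that
    exceed the budget still get their own partition)."""
    partitions, current, occ = [], [], 0
    for i, s in enumerate(supports):
        if current and (occ + s) * l_avg > budget:
            partitions.append(current)     # close partition j, open j+1
            current, occ = [], 0
        current.append(i)
        occ += s
    if current:
        partitions.append(current)
    return partitions

# Supports of the Figure 2 frequent items, ascending: 3, 3, 4, 4, 5, 7.
print(choose_partitions([3, 3, 4, 4, 5, 7], l_avg=4, budget=40))
# [[0, 1, 2], [3, 4], [5]]
```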
Note that in our approach the patterns mined from each partition are also global patterns with respect to the entire database, and the support counted over the partition is the final support. So there is no need to keep the patterns in memory, and no need to perform another database scan to verify the supports of the patterns. When the number of patterns is large, the cost of the support verification process is also very high, so our approach is able to perform much better than the random partition approach when the number of frequent patterns is large.

5. Experimental results

In this section, we conduct a set of experiments to compare the performance of our algorithm with other algorithms. All the experiments were conducted on a 1GHz Pentium III with 256MB memory running Microsoft Windows 2000 Professional. All code was compiled using Microsoft Visual C++. Table 1 shows the datasets used for the performance study.

Table 1. Data sets (several cell values were lost in transcription)
Data Sets      Size   #Trans   #Items  AvgTlen  MaxTlen
BMS-WebView-1  1.28M  -        -       -        -
T10I4D100k     5.06M  -        -       -        -
T25I20D100k    -      100,000  -       -        -
Pumsb          4.75M  -        -       -        -
Connect-4      2.4M   -        -       -        -
Mushroom       0.83M  -        -       -        -

BMS-WebView-1 is a web log [8], and it is very sparse. T10I4D100k and T25I20D100k are two synthetic datasets generated by the IBM Quest Synthetic Data Generation Code. T10I4D100k is very sparse, while T25I20D100k is relatively dense. Pumsb, Connect-4 and Mushroom are three dense datasets obtained from the UCI machine learning repository. Table 1 lists some statistical information about the datasets; the last two columns are the average and maximal length of the transactions, respectively.

5.1 Performance comparison

We compare the efficiency of our algorithm with the FP-growth, H-Mine and OP algorithms. Figure 4 shows the running time of the algorithms over the six datasets. The running time shown in the figures includes only the input time and CPU time.
We can see that the running time of our algorithm is rather stable with respect to the decrease of the minimum support threshold, and it outperforms the other three algorithms on all the testing datasets. On the BMS-WebView-1 dataset, when the minimum support is greater than 0.1%, FP-growth shows the worst performance; with decreasing minimum support, H-Mine becomes the slowest one. When the minimum support is decreased to 0.05%, the number of patterns increases sharply from 4652 (at 0.06%). Our algorithm can still finish mining in less than 5 seconds, while the other three algorithms need more than 1000 seconds. On the synthetic sparse dataset T10I4D100k, FP-growth performs the worst, and our algorithm is nearly 3 times faster than the closest competitor. On dataset T25I20D100k, our algorithm and OP show similar performance, and both are significantly faster than the other two algorithms; the inefficiency of FP-growth here is caused by its high tree construction cost. On the Pumsb dataset, when the minimum support is higher than 60%, our algorithm is much faster than the other algorithms; when the minimum support threshold gets lower, the running time of our algorithm becomes close to that of the closest competitor, but remains significantly faster than the rest. On the Connect-4 and Mushroom datasets, at the lowest minimum support threshold (0.1% for Mushroom and 30% for Connect-4), our algorithm is about 10 times faster than the closest competitor, and about 60 and 40 times faster than the slowest algorithm, respectively.

Figure 4. Performance comparison on (a) BMS-WebView-1, (b) T10I4D100k, (c) T25I20D100k, (d) Pumsb, (e) Connect-4 and (f) Mushroom.

We also studied the maximal space usage of the algorithms. We found that on relatively sparse datasets, such as BMS-WebView-1, T10I4D100k and T25I20D100k, our algorithm and H-Mine use a similar amount of space, since on sparse datasets our algorithm also first constructs a hyper-structure from the database and then builds prefix-tree structures from the hyper-structure. Since we order the items in ascending order of their frequencies, the prefix-tree constructed from the hyper-structure is always very small compared with the whole hyper-structure. The space consumed by FP-growth is almost double that of our algorithm. On dense datasets, the space consumed by our algorithm is very close to that of FP-growth. This is due to two reasons: (1) we use arrays to store single branches; (2) in our algorithm, once a subtree of the root has been processed, it is disposed of to make room for mining the other subtrees, whereas in FP-growth the FP-tree cannot be released until all the patterns contained in that tree have been mined.

5.2 Scalability

We study the scalability of our algorithm by varying the total number of transactions and the size of the available memory. The experimental results are shown in Figure 5. We use the IBM synthetic data generator to produce a set of datasets with different numbers of transactions. For these datasets, we set the average transaction length to 20 and the average pattern length to 10, and vary the total number of transactions from 200k to 1000k; all other parameters take their default values. We fix the minimum support threshold to 0.1%. The results are shown in Figure 5(a). Our algorithm is about 2 times faster than the next fastest algorithm across all database sizes.
All four algorithms show stable performance with respect to the database size, except that when the number of transactions exceeds 1000k, the data structures that H-Mine and one other algorithm use during mining can no longer be held in main memory, and the running time of these two algorithms grows sharply due to the large number of page swaps between disk and main memory. To evaluate how well our partition technique handles the cases where the whole mining data structure cannot reside in main memory, we limit the memory available to our algorithm to 4MB, 8MB, 12MB and 16MB, respectively. For comparison we implemented a random partition algorithm, which uses H-Mine to mine each partition; we do not count the space for storing the local patterns, i.e., the available memory is used only for the hyper-structure. We use dataset T25I20D100k for this experiment. As Figure 5(b) shows, our algorithm is significantly faster than the random partition based algorithm. When the available memory is 16MB, the whole data structure used by our algorithm can be held in main memory. When the minimum support threshold is greater than 0.5%, even with a memory size of 12MB, our

algorithm needs to scan the database only twice. An interesting observation from Figure 5(b) is that when the minimum support threshold is lower than 0.5% and the memory size is increased from 8MB to 12MB, the running time of the Partition algorithm actually increases. The reason is that with random partitioning there is no guarantee on the number of local patterns in each partition. With minimum support 0.05%, both the total number of local patterns and the total number of distinct local patterns are larger with 12MB of memory than with 8MB, and both are much greater than the total number of global patterns, so the random partition algorithm incurs a high global pattern verification cost.

Figure 5. Scalability study — (a) running time w.r.t. #transactions (min_sup = 0.1%); (b) running time w.r.t. memory size (4MB, 8MB, 12MB and 16MB) for our algorithm and the random Partition algorithm.

6. Conclusions

In this paper, we proposed an efficient algorithm for mining the complete set of frequent patterns, and it shows significant performance improvement over previous works. Our algorithm adopts the pattern growth approach, uses an ascending frequency ordered prefix-tree to represent the conditional databases, and traverses the tree top-down. Our analysis and experimental results show that this combination is more efficient than the combination of the bottom-up traversal strategy and the descending frequency ordering method adopted by the FP-growth algorithm. Both of these combinations are much more efficient than the other two: the top-down traversal strategy combined with the descending frequency ordering method, and the bottom-up traversal strategy combined with the ascending frequency ordering method.

The performance of our algorithm can be further improved by incorporating the opportunistic projection technique proposed in [8], e.g., by adaptively choosing between the array-based and structure-based representations, and between pseudo unfiltered projection and filtered projection. We also described how to handle very large databases with our algorithm, which partitions the database more intelligently than previous works: it can handle the cases where the domain is dense and a large number of patterns exist in the database, whereas previous random partition based approaches would fail or become very inefficient in such situations.

Acknowledgments

We would like to thank Jiawei Han for providing us with his FP-growth algorithm, and Junqiang Liu for providing us with his OP algorithm.

References

[1] R. Agrawal, C. Aggarwal, and V. Prasad. Depth first generation of long patterns. In Proc. of ACM SIGKDD Conf., pages 108-118, 2000.
[2] R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In Proc. of ACM SIGMOD Conf., 1993.
[3] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of ACM SIGMOD Conf., 1997.
[4] D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In Proc. of ICDE Conf., 2001.
[5] K. Gouda and M. J. Zaki. Efficiently mining maximal frequent itemsets. In Proc. of ICDM Conf., pages 163-170, 2001.
[6] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. of ACM SIGMOD Conf., pages 1-12, 2000.
[7] R. J. Bayardo Jr. Efficiently mining long patterns from databases. In Proc. of ACM SIGMOD Conf., pages 85-93, 1998.
[8] J. Liu, Y. Pan, K. Wang, and J. Han. Mining frequent item sets by opportunistic projection. In Proc. of KDD Conf., 2002.
[9] D. Meretakis, D. Fragoudis, H. Lu, and S. Likothanassis. Scalable association-based text classification. In Proc. of CIKM Conf., 2000.
[10] J. S. Park, M. Chen, and P. S. Yu. An effective hash based algorithm for mining association rules. In Proc. of ACM SIGMOD Conf., pages 175-186, 1995.
[11] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In Proc. of ICDT Conf., 1999.
[12] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-Mine: Hyper-structure mining of frequent patterns in large databases. In Proc. of ICDM Conf., 2001.
[13] J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In Proc. of ACM SIGMOD DMKD Workshop, pages 21-30, 2000.
[14] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for finding frequent itemsets. Journal of Parallel and Distributed Computing, 61(3):350-371, 2001.
[15] A. Savasere, E. Omiecinski, and S. B. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. of VLDB Conf., 1995.
[16] M. J. Zaki and C. Hsiao. CHARM: An efficient algorithm for closed association rule mining. Technical report, Rensselaer Polytechnic Institute, 1999.
[17] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In Proc. of KDD Conf., 1997.
[18] Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In Proc. of KDD Conf., 2001.
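Appendix: a sketch of the partition property. The claim recalled at the start of Section 5 — that patterns mined from each partition are global patterns with final supports, needing no verification scan — can be checked on a toy example if each partition is taken to be a conditional database: all transactions containing an item i, projected onto the items that follow i in ascending frequency order. The code below is an illustration of ours, not the paper's implementation; a deliberately naive brute-force miner stands in for the tree-based one, and all function names are hypothetical:

```python
from collections import Counter
from itertools import combinations

def brute_force(db, min_sup):
    """Reference miner: count every itemset occurring in db
    (exponential; usable only on tiny examples)."""
    counts = Counter()
    for t in db:
        for k in range(1, len(t) + 1):
            counts.update(combinations(sorted(t), k))
    return {s: c for s, c in counts.items() if c >= min_sup}

def mine_by_partitions(db, min_sup):
    """Mine each item's conditional database independently; the supports
    found inside a partition are already the global supports."""
    freq = Counter(i for t in db for i in t)
    order = sorted((i for i in freq if freq[i] >= min_sup),
                   key=lambda i: (freq[i], i))          # ascending frequency
    rank = {i: r for r, i in enumerate(order)}
    result = {}
    for r, i in enumerate(order):
        result[(i,)] = freq[i]
        # Partition for i: transactions containing i, projected onto
        # items ranked after i.  Every transaction supporting {i} U s
        # lands here, so local support == global support.
        part = [sorted(j for j in t if j in rank and rank[j] > r)
                for t in db if i in t]
        for s, c in brute_force(part, min_sup).items():
            result[tuple(sorted((i,) + s))] = c
    return result

db = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'c'}, {'a'}]
assert mine_by_partitions(db, 2) == brute_force(db, 2)
```

Because no itemset can cross partitions, the partitions can be mined one at a time (and their structures discarded afterwards), which is the basis of the memory-limited experiments in Section 5.2; a random partition, by contrast, only yields local candidates that must still be verified globally.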


Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, statistics foundations 5 Introduction to D3, visual analytics

More information

A NOVEL ALGORITHM FOR MINING CLOSED SEQUENTIAL PATTERNS

A NOVEL ALGORITHM FOR MINING CLOSED SEQUENTIAL PATTERNS A NOVEL ALGORITHM FOR MINING CLOSED SEQUENTIAL PATTERNS ABSTRACT V. Purushothama Raju 1 and G.P. Saradhi Varma 2 1 Research Scholar, Dept. of CSE, Acharya Nagarjuna University, Guntur, A.P., India 2 Department

More information

Product presentations can be more intelligently planned

Product presentations can be more intelligently planned Association Rules Lecture /DMBI/IKI8303T/MTI/UI Yudho Giri Sucahyo, Ph.D, CISA (yudho@cs.ui.ac.id) Faculty of Computer Science, Objectives Introduction What is Association Mining? Mining Association Rules

More information

AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR UTILITY MINING. Received April 2011; revised October 2011

AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR UTILITY MINING. Received April 2011; revised October 2011 International Journal of Innovative Computing, Information and Control ICIC International c 2012 ISSN 1349-4198 Volume 8, Number 7(B), July 2012 pp. 5165 5178 AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR

More information

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets American Journal of Applied Sciences 2 (5): 926-931, 2005 ISSN 1546-9239 Science Publications, 2005 Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets 1 Ravindra Patel, 2 S.S.

More information

AN ENHANCED SEMI-APRIORI ALGORITHM FOR MINING ASSOCIATION RULES

AN ENHANCED SEMI-APRIORI ALGORITHM FOR MINING ASSOCIATION RULES AN ENHANCED SEMI-APRIORI ALGORITHM FOR MINING ASSOCIATION RULES 1 SALLAM OSMAN FAGEERI 2 ROHIZA AHMAD, 3 BAHARUM B. BAHARUDIN 1, 2, 3 Department of Computer and Information Sciences Universiti Teknologi

More information

Processing Load Prediction for Parallel FP-growth

Processing Load Prediction for Parallel FP-growth DEWS2005 1-B-o4 Processing Load Prediction for Parallel FP-growth Iko PRAMUDIONO y, Katsumi TAKAHASHI y, Anthony K.H. TUNG yy, and Masaru KITSUREGAWA yyy y NTT Information Sharing Platform Laboratories,

More information

Frequent Pattern Mining with Uncertain Data

Frequent Pattern Mining with Uncertain Data Charu C. Aggarwal 1, Yan Li 2, Jianyong Wang 2, Jing Wang 3 1. IBM T J Watson Research Center 2. Tsinghua University 3. New York University Frequent Pattern Mining with Uncertain Data ACM KDD Conference,

More information

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Abstract - The primary goal of the web site is to provide the

More information

An Improved Frequent Pattern-growth Algorithm Based on Decomposition of the Transaction Database

An Improved Frequent Pattern-growth Algorithm Based on Decomposition of the Transaction Database Algorithm Based on Decomposition of the Transaction Database 1 School of Management Science and Engineering, Shandong Normal University,Jinan, 250014,China E-mail:459132653@qq.com Fei Wei 2 School of Management

More information

ASSOCIATION rules mining is a very popular data mining

ASSOCIATION rules mining is a very popular data mining 472 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 18, NO. 4, APRIL 2006 A Transaction Mapping Algorithm for Frequent Itemsets Mining Mingjun Song and Sanguthevar Rajasekaran, Senior Member,

More information

CLOLINK: An Adapted Algorithm for Mining Closed Frequent Itemsets

CLOLINK: An Adapted Algorithm for Mining Closed Frequent Itemsets Journal of Computing and Information Technology - CIT 20, 2012, 4, 265 276 doi:10.2498/cit.1002017 265 CLOLINK: An Adapted Algorithm for Mining Closed Frequent Itemsets Adebukola Onashoga Department of

More information

FIMI 03: Workshop on Frequent Itemset Mining Implementations

FIMI 03: Workshop on Frequent Itemset Mining Implementations FIMI 3: Workshop on Frequent Itemset Mining Implementations FIMI Repository: http://fimi.cs.helsinki.fi/ FIMI Repository Mirror: http://www.cs.rpi.edu/~zaki/fimi3/ Bart Goethals HIIT Basic Research Unit

More information

Mining Frequent Patterns Based on Data Characteristics

Mining Frequent Patterns Based on Data Characteristics Mining Frequent Patterns Based on Data Characteristics Lan Vu, Gita Alaghband, Senior Member, IEEE Department of Computer Science and Engineering, University of Colorado Denver, Denver, CO, USA {lan.vu,

More information

Efficient Frequent Pattern Mining on Web Log Data

Efficient Frequent Pattern Mining on Web Log Data Efficient Frequent Pattern Mining on Web Log Data Liping Sun RMIT University Xiuzhen Zhang RMIT University Paper ID: 304 Abstract Mining frequent patterns from web log data can help to optimise the structure

More information

Improved Frequent Pattern Mining Algorithm with Indexing

Improved Frequent Pattern Mining Algorithm with Indexing IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VII (Nov Dec. 2014), PP 73-78 Improved Frequent Pattern Mining Algorithm with Indexing Prof.

More information

Efficient Incremental Mining of Top-K Frequent Closed Itemsets

Efficient Incremental Mining of Top-K Frequent Closed Itemsets Efficient Incremental Mining of Top- Frequent Closed Itemsets Andrea Pietracaprina and Fabio Vandin Dipartimento di Ingegneria dell Informazione, Università di Padova, Via Gradenigo 6/B, 35131, Padova,

More information

MEIT: Memory Efficient Itemset Tree for Targeted Association Rule Mining

MEIT: Memory Efficient Itemset Tree for Targeted Association Rule Mining MEIT: Memory Efficient Itemset Tree for Targeted Association Rule Mining Philippe Fournier-Viger 1, Espérance Mwamikazi 1, Ted Gueniche 1 and Usef Faghihi 2 1 Department of Computer Science, University

More information

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules Manju Department of Computer Engg. CDL Govt. Polytechnic Education Society Nathusari Chopta, Sirsa Abstract The discovery

More information

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey G. Shivaprasad, N. V. Subbareddy and U. Dinesh Acharya

More information

H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. Paper s goals. H-mine characteristics. Why a new algorithm?

H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. Paper s goals. H-mine characteristics. Why a new algorithm? H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases Paper s goals Introduce a new data structure: H-struct J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang Int. Conf. on Data Mining

More information

Association Rule Mining: FP-Growth

Association Rule Mining: FP-Growth Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong We have already learned the Apriori algorithm for association rule mining. In this lecture, we will discuss a faster

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 13: 27/11/2012 Data Mining: Concepts and Techniques (3 rd ed.) Chapter

More information

PC Tree: Prime-Based and Compressed Tree for Maximal Frequent Patterns Mining

PC Tree: Prime-Based and Compressed Tree for Maximal Frequent Patterns Mining Chapter 42 PC Tree: Prime-Based and Compressed Tree for Maximal Frequent Patterns Mining Mohammad Nadimi-Shahraki, Norwati Mustapha, Md Nasir B Sulaiman, and Ali B Mamat Abstract Knowledge discovery or

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 16: Association Rules Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec et al.) Apriori: Summary All items Count

More information

Fast Algorithm for Mining Association Rules

Fast Algorithm for Mining Association Rules Fast Algorithm for Mining Association Rules M.H.Margahny and A.A.Mitwaly Dept. of Computer Science, Faculty of Computers and Information, Assuit University, Egypt, Email: marghny@acc.aun.edu.eg. Abstract

More information

PADS: A Simple Yet Effective Pattern-Aware Dynamic Search Method for Fast Maximal Frequent Pattern Mining

PADS: A Simple Yet Effective Pattern-Aware Dynamic Search Method for Fast Maximal Frequent Pattern Mining : A Simple Yet Effective Pattern-Aware Dynamic Search Method for Fast Maximal Frequent Pattern Mining Xinghuo Zeng Jian Pei Ke Wang Jinyan Li School of Computing Science, Simon Fraser University, Canada.

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 18: 01/12/2015 Data Mining: Concepts and Techniques (3 rd ed.) Chapter

More information