Ascending Frequency Ordered Prefix-tree: Efficient Mining of Frequent Patterns


Guimei Liu, Hongjun Lu
Dept. of Computer Science
The Hong Kong Univ. of Science & Technology
Hong Kong, China
{cslgm,

Yabo Xu, Jeffrey Xu Yu
Dept. of SEEM
The Chinese University of Hong Kong
Hong Kong, China
{ybxu,

Abstract

Mining frequent patterns is a fundamental and important problem in many data mining applications. Many algorithms adopt the pattern growth approach, which has been shown to be significantly superior to the candidate generate-and-test approach. In this paper, we identify the key factors that influence the performance of the pattern growth approach, and optimize them to further improve performance. Our algorithm uses a simple yet compact data structure, the ascending frequency ordered prefix-tree (AFOPT), to organize the conditional databases, in which we use arrays to store single branches to further save space. We traverse our prefix-tree structure using a top-down strategy. Our experimental results show that the combination of the top-down traversal strategy and the ascending frequency item ordering method achieves significant performance improvement over previous works.

1. Introduction

Mining frequent patterns plays an important role in many data mining applications. It can also serve as a feature selection tool for classification [9] and clustering. The problem was first introduced by Agrawal et al. in the context of transactional databases [2]. It can be stated as follows: let I = {i1, i2, ..., in} be a set of items. Each subset of I is referred to as an itemset, and an itemset containing k items is called a k-itemset. Given a database D of customer transactions, where each transaction is a subset of I, the support of an itemset is defined as the percentage of transactions in D supporting it (or as the absolute number of transactions supporting it). Here, we say an itemset s is supported by a customer transaction t if s is a subset of t.
If the support of an itemset exceeds a user-specified minimum support threshold, then the itemset is called a frequent itemset. (This work was partly supported by the Research Grant Council of the Hong Kong SAR, China (Grants DAG0/02.EG4) and the National 973 Fundamental Research Program of China (G ).) The task of the frequent pattern mining problem is, given a minimum support threshold, to enumerate all the frequent itemsets in the given database. The main issues in frequent itemset mining are: (1) reducing the number of database scans, since in many cases the transactional database is too large to fit into main memory and scanning data from disk is very costly; (2) reducing the search space, since every subset of I can be frequent and the number of subsets is exponential in the size of I; (3) counting the support of itemsets efficiently, since naïve subset matching is quite costly due to the large size of the database and the large number of potentially frequent itemsets. Extensive efforts have been put into developing efficient algorithms for mining frequent itemsets. Most, if not all, of the proposed algorithms prune the search space based on the Apriori heuristic: if an itemset is not frequent, then none of its supersets is frequent. Two typical approaches have been proposed: the candidate generate-and-test approach and the pattern growth approach; the latter has been shown to be significantly superior to the former, especially on dense datasets or with a low minimum support threshold. The basic idea of the pattern growth approach is to grow a pattern from its prefix. It constructs a conditional database for each frequent itemset t; the mining of the patterns that have t as prefix is then performed only on t's conditional database. The key factors that influence the performance of the pattern growth approach are the total number of conditional databases built during the whole mining process, and the cost of mining a single conditional database.
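As a concrete illustration of the definitions above, the following sketch (illustrative only; the item names and helper names are arbitrary, not from the paper) counts the support of an itemset and checks it against a minimum support threshold given as a percentage:

```python
# Illustrative sketch of the basic definitions (not the authors' code).
def support(itemset, database):
    """Absolute support: number of transactions containing every item of itemset."""
    s = set(itemset)
    return sum(1 for t in database if s.issubset(t))

def is_frequent(itemset, database, min_sup):
    """min_sup is a fraction of |D| (the percentage form of the definition)."""
    return support(itemset, database) >= min_sup * len(database)

D = [{"a", "c", "e"}, {"a", "b", "c"}, {"a", "d"}, {"b", "c"}]
print(support({"a", "c"}, D))           # 2 transactions contain both a and c
print(is_frequent({"a", "c"}, D, 0.5))  # 2 >= 0.5 * 4, so True
```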
The former depends on how we order the items and how we explore the search space. The latter depends on the representation and the traversal strategy of the conditional databases. In this paper, we propose a compact data structure, the Ascending Frequency Ordered Prefix-Tree (AFOPT), to represent the conditional databases, and the tree is traversed top-down. We show that the combination of the top-down traversal strategy and the ascending frequency ordering method minimizes both the total number of conditional databases and the traversal cost of a single conditional database. The rest of the paper is organized as follows: Section 2 introduces related work; in Section 3, we describe how to use our AFOPT structure to mine the frequent patterns; how to handle very large databases is described in Section 4; Section 5 compares our algorithm with previous works; finally, Section 6 concludes this paper.

2. Related work

The search space of the frequent pattern mining problem can be represented using a set enumeration tree [1, 4, 5, 7]. For example, given a set of items I = {a, b, c, d, e} that are ordered lexicographically, the search space can be represented as the tree shown in Figure 1.

Figure 1. Search space tree

The root of the search space tree represents the empty set, and each node at level l (the root is at level 0, its children are at level 1, and so on) represents an l-itemset. In the remainder of this paper, we will not distinguish between an itemset and the node representing it when there is no ambiguity. Given a node p, if it is frequent and an item x can be appended to p to form a longer frequent itemset, then x is called a frequent extension of p. Suppose q = p ∪ {x} is a child of p and it is frequent; then any frequent extension of p after x is possibly a frequent extension of q. These items are called the candidate extensions of q. For example, if c, d, e are frequent extensions of a, then d and e are candidate extensions of ac, but c is not a candidate extension of ad or ae. Based on the Apriori heuristic, two typical approaches for mining frequent patterns have been proposed: the candidate generate-and-test approach [2, 3, 10] and the pattern growth approach [6, 4, 2]. Both approaches work in an iterative manner. They first generate all frequent 1-itemsets. In each subsequent iteration of the candidate generate-and-test approach, pairs of k-itemsets are joined to form candidate (k+1)-itemsets, then the database is scanned to verify the support of all the candidate (k+1)-itemsets, and the set of resulting frequent (k+1)-itemsets is used as the input for the next iteration.
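The candidate-extension rule above can be sketched as a depth-first walk over the set enumeration tree. This is an illustrative toy miner, not the paper's algorithm: it recounts supports by scanning the full database instead of building conditional databases, and all names are ours.

```python
# Toy depth-first exploration of the set enumeration tree.
def support(itemset, database):
    s = set(itemset)
    return sum(1 for t in database if s.issubset(t))

def pattern_growth(prefix, cand_exts, database, min_count, out):
    # Frequent extensions of prefix, kept in the given item order.
    freq_exts = [x for x in cand_exts
                 if support(prefix | {x}, database) >= min_count]
    for i, x in enumerate(freq_exts):
        q = prefix | {x}
        out.append(frozenset(q))
        # Only extensions *after* x are candidate extensions of q.
        pattern_growth(q, freq_exts[i + 1:], database, min_count, out)

D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
patterns = []
pattern_growth(set(), ["a", "b", "c"], D, 2, patterns)
print(sorted("".join(sorted(p)) for p in patterns))
# ['a', 'ab', 'ac', 'b', 'bc', 'c']   (abc has support 1 and is pruned)
```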
The drawbacks of the candidate generate-and-test algorithms are: (1) they all need to generate a large number of candidate itemsets, many of which prove to be infrequent after scanning the database; (2) they need to scan the database multiple times, in the worst case as many times as the maximal length of the frequent patterns. In contrast, the pattern growth approach avoids the cost of generating a large number of candidate itemsets by growing a frequent itemset from its prefix. It constructs a conditional database for each frequent itemset t, denoted as D_t, which is a collection of projected transactions, each of which contains only the candidate extensions of t. All the patterns that have t as prefix can be mined from D_t alone. The key to the pattern growth approach is how to reduce the traversal and construction cost of a single conditional database, and how to reduce the total number of conditional databases constructed. However, it is always hard to reduce the traversal cost and the construction cost at the same time, because saving construction cost usually incurs more traversal cost, and vice versa. In [6], Han et al. use a compact data structure, the FP-tree, to represent the conditional databases; it is a combination of a prefix-tree structure and node-links. To improve the possibility of prefix sharing, the items are ordered in descending order of their frequencies. The traversal of the FP-tree is from bottom to top along node-links. When the mining of an FP-tree is finished, the number of times a node q has been visited is equal to the number of q's descendants. Another drawback is that at each node we need to maintain a pointer to its parent as well as the node-link. FP-growth is not efficient on sparse databases due to its high tree construction cost. Another algorithm, H-Mine, was proposed [2] to alleviate this problem by using the hyper-structure (a combination of arrays and hyper-links). There is no prefix sharing among different transactions in the hyper-structure.
The H-Mine algorithm constructs new conditional databases via link adjustment, so its construction cost is very low. The negative effect of this pseudo construction is that the conditional databases of most itemsets are unfiltered, which incurs additional traversal cost. The hyper-structure does not change except for the hyper-links, so in the H-Mine algorithm the items can only be ordered by a fixed order. The hyper-structure is not efficient on dense databases, so the H-Mine algorithm switches to the FP-tree structure on dense datasets. Another work that adopts the pattern growth approach is the tree projection algorithm [4]. It physically constructs the conditional databases but uses arrays to store them. It has been shown that this algorithm is consistently outperformed by FP-growth, so we will not study it further. A newly proposed algorithm [8] combines the advantages of H-Mine and FP-growth. It adaptively chooses an array-based or tree-based structure, and pseudo projection or physical projection of the conditional databases. This opportunistic projection technique makes it efficient for both sparse and dense datasets. In this paper, we propose another pattern growth algorithm. We also use a prefix-tree structure to organize the conditional databases. Contrary to the FP-tree structure, our AFOPT structure adopts the top-down traversal strategy and the ascending frequency ordering method. The top-down traversal strategy is capable of minimizing the traversal cost of a conditional database, and we do not need to maintain parent links and node-links at each node. The ascending frequency ordering method is capable of minimizing the total number of conditional databases. One drawback of this

ordering method is that it reduces node sharing among different transactions compared with the descending frequency ordering method. To alleviate this problem, we use arrays to store single branches. This representation leads to great space savings and also reduces the tree construction cost. We will describe the details of our algorithm in the next section. There are also a couple of papers focusing on mining only the frequent closed itemsets, e.g. [, 6, 3], or only the maximal patterns, e.g. [7, 7,, 4, 5]. In this paper, we focus on the problem of mining the complete set of frequent itemsets. However, our algorithm can be easily extended to mine only the frequent closed itemsets or maximal frequent itemsets by incorporating existing pruning techniques.

3. Mining frequent patterns using AFOPT

In this section, we describe our algorithm in detail. We also analyze the time complexity of our algorithm to see why it can be faster than previous ones.

3.1 The mining algorithm

Our algorithm first scans the database to find the frequent items, and sorts them in ascending order of their frequencies. Then the database is scanned a second time to construct a prefix-tree representing the conditional databases of these frequent items. Only the frequent items are included in the prefix-tree. We use arrays to store single branches in the prefix-tree to save space and construction cost. Each node in the prefix-tree contains three pieces of information: the item id, the support of the itemset corresponding to the path from the root to the node, and the pointers to the node's children. Each entry of the arrays keeps only the first two pieces of information. We use an example to illustrate the construction of the prefix-tree from the database. Given a database D of customer transactions as shown in Figure 2 and a minimum support threshold of 40%, the frequent items are F = {c, e, f, d, m, a}, sorted in ascending order of their frequencies.
In the second database scan, transaction 1 is first read into memory; after removing the infrequent items h and p and sorting, it becomes {c, e, f, m, a}. Since the tree is empty, a single branch is created, with each element having support 1. Transaction 2 becomes {c, f, m, a} after removing infrequent items and sorting. It shares the prefix c with transaction 1, so node c's support is set to 2 and it gets two children, e and f. After all the transactions are processed, the whole prefix-tree is as shown in Figure 2. Suppose T is the prefix-tree constructed from the original database, and p1, ..., pm is the set of nodes in T whose item id is i; then the conditional database of item i is the union of the subtrees rooted at p1, ..., pm. For example, in Figure 2, the conditional database of item c is completely represented by the first subtree of the root, since only its root is item c. The conditional database of item e contains two subtrees: one is the first subtree of node c, and the other is the second subtree of the root.

TID  Transactions           Projected Transactions
1    a, c, e, f, h, m, p    c, e, f, m, a
2    a, b, c, f, g, m       c, f, m, a
3    a, d, e, f, q, s, t    e, f, d, a
4    a, b, d, k, m, o, r    d, m, a
5    a, e, f, h, m, q, y    e, f, m, a
6    a, c, d, j, m, t, w    c, d, m, a
7    a, d, u, x, z          d, a

min_sup = 40%, F = {c:3, e:3, f:4, d:4, m:5, a:7}

Figure 2. The AFOPT structure construction

Note that the prefix-tree is only a compact representation of the conditional databases, and it contains the complete information needed for mining frequent itemsets from the original database. The size of the prefix-tree is bounded by, but usually much smaller than, the total number of frequent item occurrences in the database. After the prefix-tree is constructed, the remaining mining is performed only on the prefix-tree. Our algorithm traverses the search space in depth-first order.
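The two-scan construction described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses plain pointer nodes throughout (without the array optimization for single branches), and ties in frequency are broken lexicographically, so the relative order of the equally frequent items d and f may differ from Figure 2.

```python
from collections import Counter

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_afopt(database, min_count):
    # First scan: frequent items, sorted in ascending order of frequency
    # (ties broken lexicographically here; the paper leaves ties unspecified).
    freq = Counter(i for t in database for i in t)
    items = sorted((i for i in freq if freq[i] >= min_count),
                   key=lambda i: (freq[i], i))
    rank = {i: r for r, i in enumerate(items)}
    # Second scan: project each transaction onto the frequent items,
    # sort by ascending-frequency rank, and insert it as a path.
    root = Node(None)
    for t in database:
        node = root
        for i in sorted((i for i in t if i in rank), key=rank.get):
            if i not in node.children:
                node.children[i] = Node(i)
            node = node.children[i]
            node.count += 1
    return root, items

# The example database of Figure 2 (7 transactions, min_sup 40% -> count 3).
D = [set("acefhmp"), set("abcfgm"), set("adefqst"), set("abdkmor"),
     set("aefhmqy"), set("acdjmtw"), set("aduxz")]
root, items = build_afopt(D, 3)
print(items)                      # ['c', 'e', 'd', 'f', 'm', 'a']
print(root.children["c"].count)   # 3, as in Figure 2
```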
Mining is first performed on the first item's conditional database D_c (the first subtree of the root), then on c's children's conditional databases, until all the patterns having c as prefix are mined. After the mining on D_c is finished, D_c is no longer needed, and all of its subtrees are merged with its siblings; we call this step the push-right step. Now the second subtree of the root becomes the first subtree of the root, and it is the complete representation of the second item's conditional database; mining is then performed on it. There are three steps when mining an itemset t's conditional database. In the counting step, D_t is traversed top-down. Each node in D_t is visited once and only once, and the supports of all the items in D_t are counted. The output of this step is the set of frequent extensions of t. In the construction step, D_t is traversed a second time, and a new prefix-tree T' is constructed that contains only the frequent extensions of t. After all the patterns having t as prefix are mined, we have the push-right step, in which the subtrees of t are merged with t's right siblings. The merging operation involves only pointer adjustment and support updates. It can be done by traversing the two trees at the same time: during the traversal, the supports of identical nodes are accumulated, and pointers are adjusted if two nodes with the same item id do not have the same set of children. So the time complexity of the merging operation is proportional to the size of the smaller tree in the worst case, and is much smaller than the size of the smaller tree in the average case. Note that during the merging process no new nodes are created; only the duplicated nodes are disposed of. We still use the example shown in Figure 2 to illustrate the mining process. First, all the patterns containing item c are mined on the first subtree of the root.
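The push-right merge described above can be sketched as follows (illustrative only, using dictionary children rather than the paper's sibling pointers): supports of coinciding nodes are accumulated, and children without a counterpart are re-linked by pointer adjustment, so no new nodes are allocated.

```python
class Node:
    def __init__(self, item, count):
        self.item, self.count, self.children = item, count, {}

def merge(dst, src):
    """Fold subtree src into dst; both roots carry the same item id."""
    dst.count += src.count
    for item, child in src.children.items():
        if item in dst.children:
            merge(dst.children[item], child)  # accumulate duplicated nodes
        else:
            dst.children[item] = child        # pointer adjustment only

# Tiny example: merge a finished subtree e:1 -> {f:1, d:1}
# into its right sibling e:2 -> {f:2}.
dst = Node("e", 2); dst.children["f"] = Node("f", 2)
src = Node("e", 1); src.children["f"] = Node("f", 1); src.children["d"] = Node("d", 1)
merge(dst, src)
print(dst.count, dst.children["f"].count, dst.children["d"].count)  # 3 3 1
```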
The frequent extensions of c are {a, m}, and a new prefix-tree is built which contains only a single branch. We can directly output all the subsets of {c, a, m}, and the mining on D_c is finished. Then all the subtrees of node c are merged with node c's right-

siblings, as illustrated in Figure 3.

Figure 3. Merging subtree ce with subtree e

The subtree rooted at node ce is merged with subtree e, subtree cf becomes the second child of the root, and subtree cd is merged with subtree d. Now the subtree rooted at node e becomes the first subtree of the root, and it completely represents e's conditional database. Mining is then performed on this subtree. The frequent extensions of e are f and a, so we build a new prefix-tree containing only f and a. It also has only one branch, so we can directly output all the subsets of {e, a, f}, and the mining on D_e is finished. Then node e's subtrees are merged with node e's right siblings, the mining is performed on the new first subtree of the root, and so on. Putting everything together, we get our AFOPT algorithm as shown in Algorithm 1. The correctness of our algorithm is guaranteed by the fact that whenever a subtree becomes the first subtree of the root, if the itemset corresponding to this subtree is t, then the subtree completely represents t's conditional database. Remember that in our algorithm, given a prefix-tree, the conditional database of an item i includes all the subtrees whose root's item id is i. Suppose p is one such subtree and p is also a child of the root; then all the other subtrees must appear among p's left siblings, according to the property of the prefix-tree structure. In our algorithm, all the subtrees of node p's left siblings are integrated with the subtree rooted at node p progressively by the merging operation. So every time we perform mining on the first subtree of the root, it contains the complete information of the corresponding conditional database.
Algorithm 1 AFOPT
Input: root — the root of the prefix tree; min_sup — the minimum support threshold;
Description:
1:  if there is only one branch in tree root then
2:      Output patterns and return;
3:  end if
4:  for all children c of root do
5:      Traverse the subtree rooted at c, find the set of frequent items F, and sort the items in F in ascending order of their frequencies;
6:      if |F| > 1 then
7:          Traverse the subtree rooted at c and build a new prefix tree newroot which contains only the items in F;
8:          AFOPT(newroot, min_sup);
9:      end if
10:     for all children subroot of c do
11:         sibroot = the right sibling of c whose item is equal to subroot.item;
12:         Merge(sibroot, subroot);
13:     end for
14: end for

3.2 Time complexity analysis

We now show that using our AFOPT structure results in the minimal traversal cost, and that the ascending frequency ordering method leads to building the minimal number of conditional databases. We have the following lemma:

Lemma 1. Given a conditional database D, suppose we use three different structures to represent it, each with a particular traversal strategy: (1) the hyper-structure with the top-down traversal strategy; (2) the FP-tree structure with the bottom-up traversal strategy; (3) the AFOPT structure with the top-down traversal strategy. Assume the cost of visiting one node is the same in all three structures, denoted cost_0. Then (3) needs the least traversal cost.

Proof. Suppose the total number of item occurrences in D is N. Obviously the traversal cost of (1) is N·cost_0, and the traversal cost of (2) and (3) is no more than N·cost_0. We now show that the traversal cost of (2) is always no less than that of (3). Suppose the number of leaf nodes in the FP-tree representing D is n_fp; then the bottom-up traversal strategy needs to traverse n_fp branches. According to the traversal order, the items in these n_fp branches are ordered in ascending order of their frequencies. If we use these branches to build a prefix-tree, it is exactly our AFOPT structure. Since we use a top-down traversal strategy, the number of node visits is equal to the size of the tree, which is smaller than the total length of the n_fp branches. So (3) needs less traversal cost than (2).

The total number of conditional databases built during the mining process depends on how we order the items. Intuitively, an itemset with high frequency will very probably have more frequent extensions than an itemset with lower frequency. That means that if we take the most frequent item as the first item, it is very likely that we have to construct bigger and/or more prefix-trees in subsequent mining. If we put the most infrequent item in front, the number of its frequent extensions cannot be very large, so we only need to build smaller and/or fewer prefix-trees in the following mining. As mining goes on, the items become more and more frequent, but their candidate extension sets become smaller and smaller, so their prefix-trees cannot be very large either. This also explains why the FP-growth algorithm uses the bottom-up traversal strategy: in FP-growth, the items are ordered in descending order of their frequencies, and the candidate extensions of an item are defined as all the frequent items before it.

If a node p's conditional database D_p can be represented by a single branch, we can enumerate all the frequent patterns contained in the single branch easily; then we need not build new conditional databases from D_p even if p has more than one frequent extension. Our AFOPT structure can help further reduce the number of conditional databases constructed. We have the following lemma:

Lemma 2. Given a conditional database D, if it can be represented by a single branch in an FP-tree, then it can also be represented by a single branch in our AFOPT structure, but not vice versa.

Proof. From Lemma 1, we know that using our AFOPT structure, the number of node visits is minimal. If D can be represented by a single branch using an FP-tree, then it must also be representable by a single branch using our structure; otherwise, it would conflict with Lemma 1. On the other hand, if D can be represented by a single branch in our structure, it may contain multiple branches in an FP-tree. Consider an example: the set of frequent extensions of a node p is {f, a, b, c, d}, in ascending order of their frequencies. The AFOPT structure representing f's conditional database D_f contains a single branch {a : 3, b : 3, c : 2, d : 2}, where the numbers after the colons are the frequencies of the items. If we use an FP-tree to represent D_f, we have two branches: {d : 2, c : 2, b : 2, a : 2} and {b : 1, a : 1}.

Our AFOPT structure may contain more nodes than the FP-tree, since the ascending frequency ordering method reduces the chances for node sharing among different transactions. However, the FP-growth algorithm needs to maintain parent links and node-links at each node, which incurs additional construction cost. To further save space, we use arrays to store single branches in our structure, which leads to significant space and construction cost savings.

3.3 Optimizations

The main overhead of our algorithm lies in the tree construction operation and the push-right operation.
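The single-branch shortcut above can be made concrete: along one branch, counts are non-increasing from the root down, so the support of any subset of the branch is simply the count of its deepest item. A small sketch (enumerate_branch is our illustrative name, not from the paper):

```python
from itertools import combinations

def enumerate_branch(prefix, branch):
    """branch: list of (item, count) pairs from the root downward,
    with non-increasing counts.  Returns every pattern extending prefix
    with a subset of the branch; its support is the deepest item's count."""
    patterns = {}
    items = [i for i, _ in branch]
    count = dict(branch)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):   # preserves branch order
            patterns[tuple(prefix) + combo] = count[combo[-1]]
    return patterns

# The single branch of D_f from the example in the text.
pats = enumerate_branch(("f",), [("a", 3), ("b", 3), ("c", 2), ("d", 2)])
print(pats[("f", "a")])       # 3
print(pats[("f", "a", "c")])  # 2
print(len(pats))              # 2^4 - 1 = 15 patterns extending prefix f
```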
However, the construction and push-right operations are worth doing, since they can reduce the traversal cost dramatically. To further improve the performance of our algorithm, we adopt two optimization techniques to reduce the cost of these two operations.

Bucket Counting Technique: this technique was first proposed in [1]. The basic idea is that if the number of frequent extensions of an itemset is small enough, we can maintain a counter for every combination of the frequent extensions instead of building an AFOPT structure. The bucket counting can be implemented very efficiently compared with the tree construction and traversal operations. However, the condition for using the bucket counting technique should be chosen carefully, since its time complexity is exponential in the number of frequent extensions. In our experiments, we use the bucket counting technique when the number of frequent extensions is no greater than 8.

Combining with the Hyper-structure: the construction and push-right cost of our AFOPT structure is higher than that of the hyper-structure, especially on sparse datasets, where the compression ratio of the AFOPT structure is low and the traversal cost saved is not worth the construction and push-right cost paid. To solve this problem, on sparse datasets we first construct a hyper-structure from the database, with the items in the hyper-structure ordered in ascending order of their frequencies, and then construct AFOPT structures from the hyper-structure.

4. Handling very large databases

One common problem with all the pattern growth algorithms is that when the conditional databases cannot be held in main memory, they suffer from a large number of page swaps between disk and main memory. In this section, we describe how to handle such cases. A partition-based algorithm, Partition, has been proposed in [5] to handle very large databases.
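The bucket counting idea can be sketched as follows (an illustrative reconstruction, not the cited implementation): with k ≤ 8 frequent extensions, each projected transaction maps to a k-bit mask with one counter per bucket; a superset-sum pass then turns exact-mask counts into the support of every combination.

```python
def bucket_count(transactions, extensions):
    """extensions: the (<= 8) frequent extensions of the current prefix.
    Returns (cnt, bit), where cnt[mask] is the number of transactions
    containing every item whose bit is set in mask."""
    k = len(extensions)
    bit = {x: 1 << i for i, x in enumerate(extensions)}
    cnt = [0] * (1 << k)
    for t in transactions:                 # one counter update per transaction
        m = 0
        for x in t:
            m |= bit.get(x, 0)
        cnt[m] += 1
    # Superset-sum (zeta) transform: fold each bit out in turn, so that
    # cnt[m] accumulates the counts of all masks that are supersets of m.
    for i in range(k):
        for m in range(1 << k):
            if m & (1 << i):
                cnt[m ^ (1 << i)] += cnt[m]
    return cnt, bit

trans = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "b", "c"}]
cnt, bit = bucket_count(trans, ["a", "b", "c"])
print(cnt[bit["a"] | bit["b"]])  # 2 transactions contain both a and b
```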
It divides the whole database into several disjoint partitions such that each partition can be held in main memory, then mines local frequent patterns on each partition. After all the partitions have been processed, the local frequent patterns together form a superset of the global frequent patterns. Another database scan is then performed to verify the supports of the local patterns. The whole mining process needs to scan the original database only twice. The H-Mine algorithm handles large databases by a similar approach. The above approach is based on random partitioning. One assumption behind it is that there are not many patterns in the database and they occupy very little memory. But in dense domains, the number of patterns can easily exceed tens of thousands or even more; when all the patterns cannot be held in memory, the Partition algorithm becomes very inefficient. In this section, we propose to partition the database according to the search space. One advantage of our AFOPT structure over the FP-tree is its good data locality. More specifically, the FP-growth algorithm requires that the whole FP-tree be resident in main memory during mining. For very large and not-so-dense databases, it is possible that the FP-tree is too large to be held in main memory, so many random I/Os would be involved. With our AFOPT structure, by contrast, only the first subtree of the root is traversed at any time; the other subtrees are not needed until the subtrees before them have been processed. So when the whole AFOPT structure cannot be held in main memory, we can keep only the first several subtrees in main memory and put all the other data on disk. When the subtrees in memory have been processed, the next few subtrees are fetched from disk. Note that the subtrees in memory also contain information for future mining; the portion needed for immediate mining is kept in main memory and merged with the data just read from disk, and the other data is written back to disk.

The key problem is how many subtrees should be kept in main memory. The first item has the largest candidate extension set, but it is the least frequent one. As mining goes on, the items become more and more frequent, but their candidate extension sets become smaller and smaller. So none of the subtrees can be very large. Given the large amount of memory available nowadays, it is almost impossible for a single subtree to be too large to fit into memory. We can partition the database as follows: (1) scan the database to find the frequent items and sort them in ascending order of their frequencies, denoted F = {i_1, i_2, ..., i_m}; (2) find p_1, p_2, ..., p_l with p_i < p_{i+1}, such that the database is partitioned into l disjoint partitions, where every transaction in the j-th partition must contain one of the items in {i_{p_{j-1}+1}, ..., i_{p_j}} but no item before i_{p_{j-1}+1}. The constraint for choosing the p_i's is that the whole data structure used for mining each partition must fit in memory. The difficulty is that during the mining process new prefix-trees are constructed, and their size and number are almost impossible to calculate accurately before mining. We know the size of a prefix-tree is bounded by the total number of item occurrences in it, so we can estimate the size of a prefix-tree built from partition j as follows: let sup(i) be the support of item i; then the total number of item occurrences in partition j can be estimated as M_j = (Σ_{t=p_{j-1}+1}^{p_j} sup(i_t)) · L_avg, where L_avg is the average length of the transactions. During the mining process, once the mining on a subtree is finished, it is disposed of to make space for mining the other subtrees. So when choosing the p_i's, we only need to ensure that we have enough memory for mining the first subtree. In practice, we have observed that the maximal memory consumed when mining a prefix-tree hardly exceeds twice the size of the original prefix-tree.
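The boundary-selection step can be sketched greedily (illustrative only; choose_partitions and budget are our names and assumptions, not the paper's): extend the current partition while the estimate M_j = (sum of sup(i_t)) × L_avg stays within the per-partition memory budget.

```python
def choose_partitions(supports, l_avg, budget):
    """supports: supports of the frequent items, already in ascending
    frequency order.  budget: memory allowed per partition, in the same
    unit as the item-occurrence estimate.  Returns lists of item indices;
    each list j satisfies sum(sup) * l_avg <= budget (single items that
    exceed the budget still get their own partition)."""
    partitions, current, occ = [], [], 0
    for i, s in enumerate(supports):
        if current and (occ + s) * l_avg > budget:
            partitions.append(current)     # close partition j, open j+1
            current, occ = [], 0
        current.append(i)
        occ += s
    if current:
        partitions.append(current)
    return partitions

# Supports of the Figure 2 frequent items, ascending: 3, 3, 4, 4, 5, 7.
print(choose_partitions([3, 3, 4, 4, 5, 7], l_avg=4, budget=40))
# [[0, 1, 2], [3, 4], [5]]
```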
Note that in our approach the patterns mined from each partition are also global patterns with respect to the entire database, and the support counted over the partition is the final support. So there is no need to keep the patterns in memory, and no need to perform another database scan to verify the supports of the patterns. When the number of patterns is large, the cost of the support verification process is also very high, so our approach is able to perform much better than the random partition approach when the number of frequent patterns is large.

5. Experimental results

In this section, we conduct a set of experiments to compare the performance of our algorithm with other algorithms. All the experiments were conducted on a 1GHz Pentium III with 256MB memory running Microsoft Windows 2000 Professional. All code was compiled using Microsoft Visual C++. Table 1 shows the datasets used for the performance study.

Table 1. Data sets (several cell values were lost in transcription)
Data Sets      Size   #Trans   #Items  AvgTlen  MaxTlen
BMS-WebView-1  1.28M  -        -       -        -
T10I4D100k     5.06M  -        -       -        -
T25I20D100k    -      100,000  -       -        -
Pumsb          4.75M  -        -       -        -
Connect-4      2.4M   -        -       -        -
Mushroom       0.83M  -        -       -        -

BMS-WebView-1 is a web log [8], and it is very sparse. T10I4D100k and T25I20D100k are two synthetic datasets generated by the IBM Quest Synthetic Data Generation Code. T10I4D100k is very sparse, while T25I20D100k is relatively dense. Pumsb, Connect-4 and Mushroom are three dense datasets obtained from the UCI machine learning repository. Table 1 lists some statistical information about the datasets; the last two columns are the average and maximal length of the transactions, respectively.

5.1 Performance comparison

We compare the efficiency of our algorithm with the FP-growth, H-Mine and OP algorithms. Figure 4 shows the running time of the algorithms over the six datasets. The running time shown in the figures includes only the input time and CPU time.
We can see that the running time of our algorithm is rather stable with respect to the decrease of the minimum support threshold, and it outperforms the other three algorithms on all the testing datasets. On the BMS-WebView-1 dataset, when the minimum support is greater than 0.1%, FP-growth shows the worst performance; with decreasing minimum support, H-Mine becomes the slowest one. When the minimum support is decreased to 0.05%, the number of patterns increases sharply from 4652 (at 0.06%). Our algorithm can still finish mining in less than 5 seconds, while the other three algorithms need more than 1000 seconds. On the synthetic sparse dataset T10I4D100k, FP-growth performs the worst, and our algorithm is nearly 3 times faster than the closest competitor. On dataset T25I20D100k, our algorithm and OP show similar performance, and both are significantly faster than the other two algorithms; the inefficiency of FP-growth here is caused by its high tree construction cost. On the Pumsb dataset, when the minimum support is higher than 60%, our algorithm is much faster than the other algorithms; when the minimum support threshold gets lower, the running time of our algorithm becomes close to that of the closest competitor, but remains significantly faster than the rest. On the Connect-4 and Mushroom datasets, at the lowest minimum support threshold (0.1% for Mushroom and 30% for Connect-4), our algorithm is about 10 times faster than the closest competitor, and about 60 and 40 times faster than the slowest algorithm, respectively.

Figure 4. Performance comparison on (a) BMS-WebView-1, (b) T10I4D100k, (c) T25I20D100k, (d) Pumsb, (e) Connect-4 and (f) Mushroom.

We also studied the maximal space usage of the algorithms. We found that on relatively sparse datasets, such as BMS-WebView-1, T10I4D100k and T25I20D100k, our algorithm and H-Mine use a similar amount of space, since on sparse datasets our algorithm also first constructs a hyper-structure from the database and then builds prefix-tree structures from the hyper-structure. Since we order the items in ascending order of their frequencies, the prefix-tree constructed from the hyper-structure is always very small compared with the whole hyper-structure. The space consumed by FP-growth is almost double that of our algorithm. On dense datasets, the space consumed by our algorithm is very close to that of FP-growth. This is due to two reasons: (1) we use arrays to store single branches; (2) in our algorithm, once a subtree of the root has been processed, it is disposed of to make room for mining the other subtrees, whereas in FP-growth the FP-tree cannot be released until all the patterns contained in that tree have been mined.

5.2 Scalability

We study the scalability of our algorithm by varying the total number of transactions and the size of the available memory. The experimental results are shown in Figure 5. We use the IBM synthetic data generator to produce a set of datasets with different numbers of transactions. For these datasets, we set the average transaction length to 20 and the average pattern length to 10, and vary the total number of transactions from 200k to 1000k; all other parameters take their default values. We fix the minimum support threshold to 0.1%. The results are shown in Figure 5(a). Our algorithm is about 2 times faster than the next fastest algorithm across all database sizes.
All four algorithms show stable performance with respect to the database size, except that when the number of transactions exceeds 1000k, the data structures that H-Mine and one other algorithm use during mining can no longer be held in main memory, and the running time of these two algorithms grows sharply due to the large number of page swaps between disk and main memory. To evaluate how well our partition technique handles the cases where the whole mining data structure cannot reside in main memory, we limit the memory available to our algorithm to 4MB, 8MB, 12MB and 16MB, respectively. For comparison we implemented a random partition algorithm, which uses H-Mine to mine each partition; we do not count the space for storing the local patterns, i.e., the available memory is used only for the hyper-structure. We use dataset T25I20D100k for this experiment. As Figure 5(b) shows, our algorithm is significantly faster than the random partition based algorithm. When the available memory is 16MB, the whole data structure used by our algorithm can be held in main memory. When the minimum support threshold is greater than 0.5%, even with a memory size of 12MB, our

algorithm needs to scan the database only twice. An interesting observation from Figure 5(b) is that when the minimum support threshold is lower than 0.5% and the memory size is increased from 8MB to 12MB, the running time of the Partition algorithm actually increases. The reason is that with random partitioning there is no guarantee on the number of local patterns in each partition. With minimum support 0.05%, both the total number of local patterns and the total number of distinct local patterns are larger with 12MB of memory than with 8MB, and both are much greater than the total number of global patterns, so the random partition algorithm incurs a high global pattern verification cost.

Figure 5. Scalability study — (a) running time w.r.t. #transactions (min_sup = 0.1%); (b) running time w.r.t. memory size (4MB, 8MB, 12MB and 16MB) for our algorithm and the random Partition algorithm.

6. Conclusions

In this paper, we proposed an efficient algorithm for mining the complete set of frequent patterns, and it shows significant performance improvement over previous works. Our algorithm adopts the pattern growth approach, uses an ascending frequency ordered prefix-tree to represent the conditional databases, and traverses the tree top-down. Our analysis and experimental results show that this combination is more efficient than the combination of the bottom-up traversal strategy and the descending frequency ordering method adopted by the FP-growth algorithm. Both of these combinations are much more efficient than the other two: the top-down traversal strategy combined with the descending frequency ordering method, and the bottom-up traversal strategy combined with the ascending frequency ordering method.

The performance of our algorithm can be further improved by incorporating the opportunistic projection technique proposed in [8], e.g., by adaptively choosing between the array-based and structure-based representations, and between pseudo unfiltered projection and filtered projection. We also described how to handle very large databases with our algorithm, which partitions the database more intelligently than previous works: it can handle the cases where the domain is dense and a large number of patterns exist in the database, whereas previous random partition based approaches would fail or become very inefficient in such situations.

Acknowledgments

We would like to thank Jiawei Han for providing us with his FP-growth algorithm, and Junqiang Liu for providing us with his OP algorithm.

References

[1] R. Agrawal, C. Aggarwal, and V. Prasad. Depth first generation of long patterns. In Proc. of ACM SIGKDD Conf., pages 108-118, 2000.
[2] R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In Proc. of ACM SIGMOD Conf., 1993.
[3] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of ACM SIGMOD Conf., 1997.
[4] D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In Proc. of ICDE Conf., 2001.
[5] K. Gouda and M. J. Zaki. Efficiently mining maximal frequent itemsets. In Proc. of ICDM Conf., pages 163-170, 2001.
[6] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. of ACM SIGMOD Conf., pages 1-12, 2000.
[7] R. J. Bayardo Jr. Efficiently mining long patterns from databases. In Proc. of ACM SIGMOD Conf., pages 85-93, 1998.
[8] J. Liu, Y. Pan, K. Wang, and J. Han. Mining frequent item sets by opportunistic projection. In Proc. of KDD Conf., 2002.
[9] D. Meretakis, D. Fragoudis, H. Lu, and S. Likothanassis. Scalable association-based text classification. In Proc. of CIKM Conf., 2000.
[10] J. S. Park, M. Chen, and P. S. Yu. An effective hash based algorithm for mining association rules. In Proc. of ACM SIGMOD Conf., pages 175-186, 1995.
[11] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In Proc. of ICDT Conf., 1999.
[12] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-Mine: Hyper-structure mining of frequent patterns in large databases. In Proc. of ICDM Conf., 2001.
[13] J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In Proc. of ACM SIGMOD DMKD Workshop, pages 21-30, 2000.
[14] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for finding frequent itemsets. Journal of Parallel and Distributed Computing, 61(3):350-371, 2001.
[15] A. Savasere, E. Omiecinski, and S. B. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. of VLDB Conf., 1995.
[16] M. J. Zaki and C. Hsiao. CHARM: An efficient algorithm for closed association rule mining. Technical report, Rensselaer Polytechnic Institute, 1999.
[17] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In Proc. of KDD Conf., 1997.
[18] Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In Proc. of KDD Conf., 2001.
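Appendix: a sketch of the partition property. The claim recalled at the start of Section 5 — that patterns mined from each partition are global patterns with final supports, needing no verification scan — can be checked on a toy example if each partition is taken to be a conditional database: all transactions containing an item i, projected onto the items that follow i in ascending frequency order. The code below is an illustration of ours, not the paper's implementation; a deliberately naive brute-force miner stands in for the tree-based one, and all function names are hypothetical:

```python
from collections import Counter
from itertools import combinations

def brute_force(db, min_sup):
    """Reference miner: count every itemset occurring in db
    (exponential; usable only on tiny examples)."""
    counts = Counter()
    for t in db:
        for k in range(1, len(t) + 1):
            counts.update(combinations(sorted(t), k))
    return {s: c for s, c in counts.items() if c >= min_sup}

def mine_by_partitions(db, min_sup):
    """Mine each item's conditional database independently; the supports
    found inside a partition are already the global supports."""
    freq = Counter(i for t in db for i in t)
    order = sorted((i for i in freq if freq[i] >= min_sup),
                   key=lambda i: (freq[i], i))          # ascending frequency
    rank = {i: r for r, i in enumerate(order)}
    result = {}
    for r, i in enumerate(order):
        result[(i,)] = freq[i]
        # Partition for i: transactions containing i, projected onto
        # items ranked after i.  Every transaction supporting {i} U s
        # lands here, so local support == global support.
        part = [sorted(j for j in t if j in rank and rank[j] > r)
                for t in db if i in t]
        for s, c in brute_force(part, min_sup).items():
            result[tuple(sorted((i,) + s))] = c
    return result

db = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'c'}, {'a'}]
assert mine_by_partitions(db, 2) == brute_force(db, 2)
```

Because no itemset can cross partitions, the partitions can be mined one at a time (and their structures discarded afterwards), which is the basis of the memory-limited experiments in Section 5.2; a random partition, by contrast, only yields local candidates that must still be verified globally.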


Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, statistics foundations 5 Introduction to D3, visual analytics

More information

A NOVEL ALGORITHM FOR MINING CLOSED SEQUENTIAL PATTERNS

A NOVEL ALGORITHM FOR MINING CLOSED SEQUENTIAL PATTERNS A NOVEL ALGORITHM FOR MINING CLOSED SEQUENTIAL PATTERNS ABSTRACT V. Purushothama Raju 1 and G.P. Saradhi Varma 2 1 Research Scholar, Dept. of CSE, Acharya Nagarjuna University, Guntur, A.P., India 2 Department

More information

Product presentations can be more intelligently planned

Product presentations can be more intelligently planned Association Rules Lecture /DMBI/IKI8303T/MTI/UI Yudho Giri Sucahyo, Ph.D, CISA (yudho@cs.ui.ac.id) Faculty of Computer Science, Objectives Introduction What is Association Mining? Mining Association Rules

More information

AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR UTILITY MINING. Received April 2011; revised October 2011

AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR UTILITY MINING. Received April 2011; revised October 2011 International Journal of Innovative Computing, Information and Control ICIC International c 2012 ISSN 1349-4198 Volume 8, Number 7(B), July 2012 pp. 5165 5178 AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR

More information

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets American Journal of Applied Sciences 2 (5): 926-931, 2005 ISSN 1546-9239 Science Publications, 2005 Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets 1 Ravindra Patel, 2 S.S.

More information

AN ENHANCED SEMI-APRIORI ALGORITHM FOR MINING ASSOCIATION RULES

AN ENHANCED SEMI-APRIORI ALGORITHM FOR MINING ASSOCIATION RULES AN ENHANCED SEMI-APRIORI ALGORITHM FOR MINING ASSOCIATION RULES 1 SALLAM OSMAN FAGEERI 2 ROHIZA AHMAD, 3 BAHARUM B. BAHARUDIN 1, 2, 3 Department of Computer and Information Sciences Universiti Teknologi

More information

Processing Load Prediction for Parallel FP-growth

Processing Load Prediction for Parallel FP-growth DEWS2005 1-B-o4 Processing Load Prediction for Parallel FP-growth Iko PRAMUDIONO y, Katsumi TAKAHASHI y, Anthony K.H. TUNG yy, and Masaru KITSUREGAWA yyy y NTT Information Sharing Platform Laboratories,

More information

Frequent Pattern Mining with Uncertain Data

Frequent Pattern Mining with Uncertain Data Charu C. Aggarwal 1, Yan Li 2, Jianyong Wang 2, Jing Wang 3 1. IBM T J Watson Research Center 2. Tsinghua University 3. New York University Frequent Pattern Mining with Uncertain Data ACM KDD Conference,

More information

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Abstract - The primary goal of the web site is to provide the

More information

An Improved Frequent Pattern-growth Algorithm Based on Decomposition of the Transaction Database

An Improved Frequent Pattern-growth Algorithm Based on Decomposition of the Transaction Database Algorithm Based on Decomposition of the Transaction Database 1 School of Management Science and Engineering, Shandong Normal University,Jinan, 250014,China E-mail:459132653@qq.com Fei Wei 2 School of Management

More information

ASSOCIATION rules mining is a very popular data mining

ASSOCIATION rules mining is a very popular data mining 472 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 18, NO. 4, APRIL 2006 A Transaction Mapping Algorithm for Frequent Itemsets Mining Mingjun Song and Sanguthevar Rajasekaran, Senior Member,

More information

CLOLINK: An Adapted Algorithm for Mining Closed Frequent Itemsets

CLOLINK: An Adapted Algorithm for Mining Closed Frequent Itemsets Journal of Computing and Information Technology - CIT 20, 2012, 4, 265 276 doi:10.2498/cit.1002017 265 CLOLINK: An Adapted Algorithm for Mining Closed Frequent Itemsets Adebukola Onashoga Department of

More information

FIMI 03: Workshop on Frequent Itemset Mining Implementations

FIMI 03: Workshop on Frequent Itemset Mining Implementations FIMI 3: Workshop on Frequent Itemset Mining Implementations FIMI Repository: http://fimi.cs.helsinki.fi/ FIMI Repository Mirror: http://www.cs.rpi.edu/~zaki/fimi3/ Bart Goethals HIIT Basic Research Unit

More information

Mining Frequent Patterns Based on Data Characteristics

Mining Frequent Patterns Based on Data Characteristics Mining Frequent Patterns Based on Data Characteristics Lan Vu, Gita Alaghband, Senior Member, IEEE Department of Computer Science and Engineering, University of Colorado Denver, Denver, CO, USA {lan.vu,

More information

Efficient Frequent Pattern Mining on Web Log Data

Efficient Frequent Pattern Mining on Web Log Data Efficient Frequent Pattern Mining on Web Log Data Liping Sun RMIT University Xiuzhen Zhang RMIT University Paper ID: 304 Abstract Mining frequent patterns from web log data can help to optimise the structure

More information

Improved Frequent Pattern Mining Algorithm with Indexing

Improved Frequent Pattern Mining Algorithm with Indexing IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VII (Nov Dec. 2014), PP 73-78 Improved Frequent Pattern Mining Algorithm with Indexing Prof.

More information

Efficient Incremental Mining of Top-K Frequent Closed Itemsets

Efficient Incremental Mining of Top-K Frequent Closed Itemsets Efficient Incremental Mining of Top- Frequent Closed Itemsets Andrea Pietracaprina and Fabio Vandin Dipartimento di Ingegneria dell Informazione, Università di Padova, Via Gradenigo 6/B, 35131, Padova,

More information

MEIT: Memory Efficient Itemset Tree for Targeted Association Rule Mining

MEIT: Memory Efficient Itemset Tree for Targeted Association Rule Mining MEIT: Memory Efficient Itemset Tree for Targeted Association Rule Mining Philippe Fournier-Viger 1, Espérance Mwamikazi 1, Ted Gueniche 1 and Usef Faghihi 2 1 Department of Computer Science, University

More information

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules Manju Department of Computer Engg. CDL Govt. Polytechnic Education Society Nathusari Chopta, Sirsa Abstract The discovery

More information

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey G. Shivaprasad, N. V. Subbareddy and U. Dinesh Acharya

More information

H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. Paper s goals. H-mine characteristics. Why a new algorithm?

H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. Paper s goals. H-mine characteristics. Why a new algorithm? H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases Paper s goals Introduce a new data structure: H-struct J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang Int. Conf. on Data Mining

More information

Association Rule Mining: FP-Growth

Association Rule Mining: FP-Growth Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong We have already learned the Apriori algorithm for association rule mining. In this lecture, we will discuss a faster

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 13: 27/11/2012 Data Mining: Concepts and Techniques (3 rd ed.) Chapter

More information

PC Tree: Prime-Based and Compressed Tree for Maximal Frequent Patterns Mining

PC Tree: Prime-Based and Compressed Tree for Maximal Frequent Patterns Mining Chapter 42 PC Tree: Prime-Based and Compressed Tree for Maximal Frequent Patterns Mining Mohammad Nadimi-Shahraki, Norwati Mustapha, Md Nasir B Sulaiman, and Ali B Mamat Abstract Knowledge discovery or

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 16: Association Rules Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec et al.) Apriori: Summary All items Count

More information

Fast Algorithm for Mining Association Rules

Fast Algorithm for Mining Association Rules Fast Algorithm for Mining Association Rules M.H.Margahny and A.A.Mitwaly Dept. of Computer Science, Faculty of Computers and Information, Assuit University, Egypt, Email: marghny@acc.aun.edu.eg. Abstract

More information

PADS: A Simple Yet Effective Pattern-Aware Dynamic Search Method for Fast Maximal Frequent Pattern Mining

PADS: A Simple Yet Effective Pattern-Aware Dynamic Search Method for Fast Maximal Frequent Pattern Mining : A Simple Yet Effective Pattern-Aware Dynamic Search Method for Fast Maximal Frequent Pattern Mining Xinghuo Zeng Jian Pei Ke Wang Jinyan Li School of Computing Science, Simon Fraser University, Canada.

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 18: 01/12/2015 Data Mining: Concepts and Techniques (3 rd ed.) Chapter

More information