Efficient Incremental Mining of Top-K Frequent Closed Itemsets

Similar documents
PTclose: A novel algorithm for generation of closed frequent itemsets from dense and sparse datasets

Closed Non-Derivable Itemsets

Mining Closed Itemsets: A Review

Closed Pattern Mining from n-ary Relations

On Multiple Query Optimization in Data Mining

Memory issues in frequent itemset mining

Finding frequent closed itemsets with an extended version of the Eclat algorithm

Appropriate Item Partition for Improving the Mining Performance

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

Temporal Weighted Association Rule Mining for Classification

FP-Growth algorithm in Data Compression frequent patterns

Monotone Constraints in Frequent Tree Mining

Mining Frequent Patterns with Counting Inference at Multiple Levels

Maintenance of the Prelarge Trees for Record Deletion

FastLMFI: An Efficient Approach for Local Maximal Patterns Propagation and Maximal Patterns Superset Checking

APPLYING BIT-VECTOR PROJECTION APPROACH FOR EFFICIENT MINING OF N-MOST INTERESTING FREQUENT ITEMSETS

Distinctive Frequent Itemset Mining from Time Segmented Databases Using ZDD-Based Symbolic Processing. Shin-ichi Minato and Takeaki Uno

Concurrent Processing of Frequent Itemset Queries Using FP-Growth Algorithm


Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

The Relation of Closed Itemset Mining, Complete Pruning Strategies and Item Ordering in Apriori-based FIM algorithms (Extended version)

Item Set Extraction of Mining Association Rule

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM

Comparing Performance of Formal Concept Analysis and Closed Frequent Itemset Mining Algorithms on Real Data

Data Mining Query Scheduling for Apriori Common Counting

Sensitive Rule Hiding and InFrequent Filtration through Binary Search Method

On Frequent Itemset Mining With Closure

2. Discovery of Association Rules

Parallelizing Frequent Itemset Mining with FP-Trees

WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity

A Trie-based APRIORI Implementation for Mining Frequent Item Sequences

An Improved Apriori Algorithm for Association Rules


A Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases *

MEIT: Memory Efficient Itemset Tree for Targeted Association Rule Mining

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining

CHAPTER 3 ASSOCIATION RULE MINING WITH LEVELWISE AUTOMATIC SUPPORT THRESHOLDS


Processing Load Prediction for Parallel FP-growth

Integration of Candidate Hash Trees in Concurrent Processing of Frequent Itemset Queries Using Apriori

Leveraging Set Relations in Exact Set Similarity Join

Mining Frequent Patterns without Candidate Generation


Mining Imperfectly Sporadic Rules with Two Thresholds

ETP-Mine: An Efficient Method for Mining Transitional Patterns

STUDY ON FREQUENT PATTEREN GROWTH ALGORITHM WITHOUT CANDIDATE KEY GENERATION IN DATABASES

Ascending Frequency Ordered Prefix-tree: Efficient Mining of Frequent Patterns

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases

Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach *

Materialized Data Mining Views *

Efficient Remining of Generalized Multi-supported Association Rules under Support Update

CGT: a vertical miner for frequent equivalence classes of itemsets

Fast Discovery of Sequential Patterns Using Materialized Data Mining Views

DESIGN AND CONSTRUCTION OF A FREQUENT-PATTERN TREE

Information Cloaking Technique with Tree Based Similarity

ESTIMATING HASH-TREE SIZES IN CONCURRENT PROCESSING OF FREQUENT ITEMSET QUERIES

Mining Negative Rules using GRD

Mining N-most Interesting Itemsets. Ada Wai-chee Fu, Renfrew Wang-wai Kwong, Jian Tang

LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

The Encoding Complexity of Network Coding

Using Association Rules for Better Treatment of Missing Values

Hierarchical Online Mining for Associative Rules


Knowledge discovery from XML Database

Induction of Association Rules: Apriori Implementation

Web page recommendation using a stochastic process model

Fast Algorithm for Mining Association Rules

Performance Analysis of Frequent Closed Itemset Mining: PEPP Scalability over CHARM, CLOSET+ and BIDE

A CSP Search Algorithm with Reduced Branching Factor

TU/e Algorithms (2IL15) Lecture 2: The Greedy Method

An Algorithm for Mining Large Sequences in Databases

A Novel Texture Classification Procedure by using Association Rules

CSCI6405 Project - Association rules mining

Implementation of object oriented approach to Index Support for Item Set Mining (IMine)

Data Access Paths for Frequent Itemsets Discovery

MySQL Data Mining: Extending MySQL to support data mining primitives (demo)

Finding the boundaries of attributes domains of quantitative association rules using abstraction- A Dynamic Approach


Data Structure for Association Rule Mining: T-Trees and P-Trees

Improved Frequent Pattern Mining Algorithm with Indexing

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

High Utility Web Access Patterns Mining from Distributed Databases


Finding Frequent Patterns Using Length-Decreasing Support Constraints

CMTreeMiner: Mining Both Closed and Maximal Frequent Subtrees

On Distributed Algorithms for Maximizing the Network Lifetime in Wireless Sensor Networks

H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. Paper s goals. H-mine characteristics. Why a new algorithm?

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

Mining Quantitative Maximal Hyperclique Patterns: A Summary of Results

Mining frequent item sets without candidate generation using FP-Trees

Discovery of Multi Dimensional Quantitative Closed Association Rules by Attributes Range Method

Association Rule Mining: FP-Growth

Theorem 2.9: nearest addition algorithm



CHUIs-Concise and Lossless representation of High Utility Itemsets


Results and Discussions on Transaction Splitting Technique for Mining Differential Private Frequent Itemsets


Efficient Incremental Mining of Top-K Frequent Closed Itemsets

Andrea Pietracaprina and Fabio Vandin
Dipartimento di Ingegneria dell'Informazione, Università di Padova, Via Gradenigo 6/B, 35131, Padova, Italy. E-mails: {capri,vandinfa}@dei.unipd.it

Abstract. In this work we study the mining of top-K frequent closed itemsets, a recently proposed variant of the classical problem of mining frequent closed itemsets, where the support threshold is chosen as the maximum value sufficient to guarantee that at least K itemsets are returned in output. We discuss the effectiveness of parameter K in controlling the output size and develop an efficient algorithm for mining top-K frequent closed itemsets in order of decreasing support, which exhibits consistently better performance than the best previously known one, attaining substantial improvements in some cases. A distinctive feature of our algorithm is that it allows the user to dynamically raise the value K with no need to restart the computation from scratch.

1 Introduction

The discovery of frequent (closed) itemsets is a fundamental primitive in data mining. Let I be a set of items, and D a (multi)set of transactions, where each transaction t ∈ D is a subset of I. For an itemset X ⊆ I we define its conditional dataset D_X ⊆ D as the (multi)set of transactions t ∈ D that contain X, and define the support of X w.r.t. D, supp_D(X) for short, as the number of transactions in D_X. An itemset X is closed w.r.t. D if there exists no itemset Y, with X ⊂ Y ⊆ I, such that supp_D(Y) = supp_D(X). The standard formulation of the problem requires to discover, for a given support threshold σ, the set F(D, σ) of all itemsets with support at least σ, which are called frequent itemsets [1]. In order to avoid the redundancy inherent in F(D, σ), it was proposed in [5] to restrict the discovery to the subset FC(D, σ) ⊆ F(D, σ) of all closed itemsets with support at least σ, called frequent closed itemsets.
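To make the definitions concrete, here is a minimal Python sketch of support, closure, and the closedness test. The toy dataset and all function names are our own illustration; they do not come from the paper.

```python
# Toy dataset (our own illustrative data, not from the paper's experiments).
D = [frozenset(t) for t in (
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a", "b", "c"},
)]

def supp(X, D):
    """supp_D(X): number of transactions in the conditional dataset D_X."""
    return sum(1 for t in D if X <= t)

def closure(X, D):
    """Clo_D(X): intersection of all transactions containing X, i.e. the
    smallest closed itemset that contains X."""
    out = None
    for t in D:
        if X <= t:
            out = set(t) if out is None else out & t
    return frozenset(out) if out is not None else frozenset()

def is_closed(X, D):
    """X is closed iff no proper superset has the same support,
    which is equivalent to X being its own closure."""
    return closure(X, D) == X

print(supp(frozenset("b"), D))       # 3
print(is_closed(frozenset("b"), D))  # False: {a, b} has the same support
print(closure(frozenset("b"), D))    # the closure is {a, b}
```

Here {b} is not closed because every transaction containing b also contains a, so {a, b} has the same support 3.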
A challenging aspect of the above formulation of the problem is the difficulty of predicting the actual number of frequent (closed) itemsets for a given dataset D and support threshold σ. Indeed, in some cases setting σ too low could yield an impractically large number of frequent itemsets, possibly exponential in the dataset size [11], while setting σ too high could yield very few or no frequent itemsets.

This work was supported in part by MIUR of Italy under project MAINSTREAM, and by the EU under the EU/IST Project 15964 AEOLUS.

In [10], an elegant variant of the problem has been proposed which, for a given desired output size K ≥ 1, requires to discover the set FC_K(D) of top-K frequent closed itemsets (top-K f.c.i., for short), defined as the set FC(D, σ_K), where σ_K is the maximum support threshold such that |FC(D, σ_K)| ≥ K. Although one is not guaranteed that |FC_K(D)| = K, it is conceivable that parameter K be more effective than an independently fixed support threshold σ in controlling the output size. In the same paper, the authors present an efficient algorithm, called TFP, for mining the top-K f.c.i. TFP discovers frequent itemsets starting with a low support threshold which is progressively increased, as the execution proceeds, by means of several heuristics, until the final value σ_K is reached. TFP also allows the user to specify a minimum length min_ℓ for the closed itemsets to be returned. The main drawbacks of TFP are that no bound is given on the number of non-closed or infrequent itemsets that the algorithm must process, and that an involved itemset closure checking scheme is required. Moreover, TFP does not appear to be able to handle efficiently a dynamic scenario where the user is allowed to raise the value K. A number of results concerning somewhat related problems can be found in [3, 4, 8].

The mining of top-K f.c.i. is the focus of this paper. In Section 2, we study the effectiveness of parameter K in controlling the output size by proving tight bounds on |FC_K(D)|. In Section 3 we present a new algorithm, TopKMiner, for mining top-K f.c.i. in order of decreasing support. Unlike algorithm TFP, TopKMiner features a provable bound on the number of itemsets touched during the mining process and, moreover, it allows the user to dynamically raise the value K efficiently, without restarting the computation from scratch. Section 4 reports some results from extensive experiments which show that TopKMiner consistently exhibits better performance than TFP, with substantial improvements in some cases (more than two orders of magnitude).
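The definition of FC_K(D) can be checked directly on toy data with a brute-force sketch. This is our own illustrative code: TFP (and our algorithm) never enumerate all itemsets like this; the sketch only makes the role of σ_K concrete, including the fact that ties at σ_K can make the output larger than K.

```python
from itertools import combinations

# Toy data (ours, for illustration only).
D = [frozenset(t) for t in (
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"},
)]
I = sorted(set().union(*D))

def supp(X):
    return sum(1 for t in D if X <= t)

def closed_itemsets():
    """All non-empty closed itemsets, by exhaustive (exponential) search:
    X is closed iff adding any item strictly lowers its support."""
    return [
        X for r in range(1, len(I) + 1)
        for X in map(frozenset, combinations(I, r))
        if supp(X) > 0 and all(supp(X | {a}) < supp(X) for a in I if a not in X)
    ]

def top_k_fci(K):
    """FC_K(D): pick sigma_K as the largest threshold leaving at least K
    frequent closed itemsets, then return all of them."""
    supports = sorted((supp(X) for X in closed_itemsets()), reverse=True)
    sigma_K = supports[min(K, len(supports)) - 1]
    return [X for X in closed_itemsets() if supp(X) >= sigma_K], sigma_K

fci, sigma = top_k_fci(4)
print(sigma, len(fci))  # 3 6 : sigma_4 = 3, and |FC_4(D)| = 6 > K due to ties
```

On this dataset there are seven closed itemsets with supports (4, 4, 4, 3, 3, 3, 2), so K = 4 forces σ_K = 3 and an output of six itemsets, illustrating why |FC_K(D)| = K is not guaranteed.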
The efficiency of TopKMiner becomes considerably higher when used in the dynamic scenario¹.

2 Effectiveness of K in controlling the output size

Let D(n) be the family of all datasets D whose transactions comprise n distinct items. Define ρ(n, K) = max_{D ∈ D(n)} (|FC_K(D)|/K), which provides a worst-case estimate of the deviation of the output size from K when mining top-K frequent closed itemsets. Building on a result by [2] we can prove the following theorem.

Theorem 1. For every n ≥ 1 and K ≥ 1, we have ρ(n, K) ≤ n. Moreover, for every n ≥ 1 and every constant c, there are Ω(n^c) distinct values of K such that ρ(n, K) = Ω(n).

We note that in several tests we performed on real and synthetic datasets, with values of K between 100 and 10000, the ratio |FC_K(D)|/K turned out to be much smaller than n and actually very close to 1. In fact, it can be shown that ρ(n, K) = n is attained only for K = 1, and we conjecture that ρ(n, K) is a decreasing function of K.

¹ For lack of space many details have been omitted from the paper and can be found in the companion technical report [6].

3 The algorithm TopKMiner

In this section, we briefly describe our algorithm TopKMiner for mining the top-K f.c.i. of a dataset D defined over the set of items I = {a_1, a_2, ..., a_n} (the indexing of the items is fixed but arbitrary). The algorithm crucially relies on the notion of ppc-extension introduced in [9] and recalled below. For an itemset X ⊆ I, we define its j-th prefix as X(j) = X ∩ {a_i : 1 ≤ i ≤ j}, with 1 ≤ j ≤ n, and define Clo_D(X) = ∩_{t ∈ D_X} t, which is the smallest closed itemset that contains X. The core index of a closed itemset X, denoted by core(X), is defined as the minimum j such that D_X = D_{X(j)}. A closed itemset X is called a prefix-preserving closure extension (ppc-extension) of a closed itemset Y if: (1) X = Clo_D(Y ∪ {a_j}), for some a_j ∉ Y with j > core(Y); and (2) X(j−1) = Y(j−1). Clearly, if X is a ppc-extension of Y, then supp(X) < supp(Y). Let ⊥ = Clo_D(∅), which is the (possibly empty) closed itemset consisting of the items occurring in all transactions. It is shown in [9] that any closed itemset X ≠ ⊥ is the ppc-extension of exactly one closed itemset Y. Hence, all closed itemsets can be conceptually organized in a tree whose root is ⊥, and where the children of a closed itemset Y are its ppc-extensions.

TopKMiner generates the frequent closed itemsets in order of decreasing support by performing a best-first (i.e., highest-support-first) exploration of the nodes of the tree defined above. Specifically, the algorithm receives in input the dataset D, the value K for which the top-K f.c.i. are sought, and a value K_max ≥ K. During the course of the algorithm, the user is allowed to dynamically raise K up to K_max. The algorithm makes use of a priority queue Q (implemented as a max-heap), whose entries correspond to closed itemsets. Let E(Y, s) denote an entry of Q relative to a closed itemset Y with support s. The value s is the key for the entry. Q is initialized with entries corresponding to the ppc-extensions of ⊥.
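The core index and ppc-extension definitions can be reproduced on toy data with the sketch below. This is our own illustrative code (the helper names and the quadratic scans are ours); a real implementation would use the trie-based machinery described later in this section.

```python
# Toy data (ours). Item indexing a_1..a_n is the sorted item order,
# an arbitrary but fixed choice, as the paper allows.
D = [frozenset(t) for t in (
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"},
)]
ITEMS = sorted(set().union(*D))          # a_1, ..., a_n

def cond(X):
    """D_X: the transactions containing X, in dataset order."""
    return [t for t in D if X <= t]

def closure(X):
    """Clo_D(X): intersection of the transactions in D_X."""
    out = set(ITEMS)
    for t in cond(X):
        out &= t
    return frozenset(out)

def prefix(X, j):
    """X(j) = X ∩ {a_1, ..., a_j}."""
    return frozenset(a for a in X if ITEMS.index(a) < j)

def core_index(X):
    """core(X): minimum j with D_{X(j)} = D_X."""
    for j in range(len(ITEMS) + 1):
        if cond(prefix(X, j)) == cond(X):
            return j

def ppc_extensions(Y):
    """Children of closed itemset Y in the tree of [9]."""
    out = []
    for j in range(core_index(Y) + 1, len(ITEMS) + 1):
        a = ITEMS[j - 1]
        if a in Y:
            continue
        X = closure(Y | {a})
        if prefix(X, j - 1) == prefix(Y, j - 1):   # prefix-preserving?
            out.append(X)
    return out

root = closure(frozenset())              # ⊥ = Clo_D(∅), empty here
print(sorted(map(sorted, ppc_extensions(root))))   # [['a'], ['b'], ['c']]
```

On this dataset the tree rooted at ⊥ = ∅ has children {a}, {b}, {c}; {a} has children {a, b} and {a, c}, while {b, c} is reached only from {b}, so every closed itemset indeed has exactly one parent.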
Then, a main loop is executed where in each iteration the entry E(Y, s) with maximum s is extracted from Q, the corresponding itemset Y is generated and returned in output, and entries for all ppc-extensions of Y are inserted into Q. A variable σ̂ is used to maintain an approximation from below of the support of the K_max-th most frequent closed itemset. This variable is initialized by suitable heuristics, similar to those employed in [10], and it is updated in each iteration of the main loop to reflect the support of the K_max-th most frequent closed itemset seen so far. The value σ̂ is used to avoid inserting entries with support smaller than σ̂ into Q. Also, after each update of σ̂, entries with support smaller than σ̂ previously inserted into Q can be removed from the queue. After the K-th closed itemset is generated, its support is stored in a variable σ_K. The main loop ends when the last closed itemset of support σ_K is generated. At this point the user may decide to raise K to a new value K_new ≤ K_max. In this case, the main loop is reactivated and the termination condition will now depend on K_new. It can be shown that TopKMiner processes (i.e., inserts into Q) at most n·K_max closed itemsets², while no such bound is known for TFP [10].

² In fact, with a slight modification of the algorithm the bound can be lowered to n·K.
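The best-first exploration described above can be sketched in a few lines of Python. This is our own toy sketch, not the authors' implementation: it omits the σ̂ pruning, the dynamic raising of K up to K_max, and the Patricia-trie representation, and it recomputes conditional datasets naively.

```python
import heapq
from itertools import count

# Toy data (ours, for illustration only).
D = [frozenset(t) for t in (
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"},
)]
ITEMS = sorted(set().union(*D))

def cond(X):
    return [t for t in D if X <= t]

def closure(X):
    out = set(ITEMS)
    for t in cond(X):
        out &= t
    return frozenset(out)

def prefix(X, j):
    return frozenset(a for a in X if ITEMS.index(a) < j)

def core_index(X):
    for j in range(len(ITEMS) + 1):
        if cond(prefix(X, j)) == cond(X):
            return j

def ppc_extensions(Y):
    for j in range(core_index(Y) + 1, len(ITEMS) + 1):
        a = ITEMS[j - 1]
        if a in Y:
            continue
        X = closure(Y | {a})
        if prefix(X, j - 1) == prefix(Y, j - 1):
            yield X

def top_k_miner(K):
    """Best-first (highest-support-first) exploration of the ppc-extension
    tree; returns the top-K f.c.i. in order of decreasing support.
    (If ⊥ = Clo_D(∅) were non-empty it would be reported first; omitted.)"""
    tie = count()          # unique tiebreaker so heapq never compares sets
    root = closure(frozenset())
    Q = [(-len(cond(X)), next(tie), X) for X in ppc_extensions(root)]
    heapq.heapify(Q)
    out, sigma_K = [], 0
    while Q:
        s, _, Y = heapq.heappop(Q)
        s = -s
        if len(out) >= K and s < sigma_K:
            break          # every itemset of support >= sigma_K was emitted
        out.append((Y, s))
        if len(out) == K:
            sigma_K = s    # support of the K-th itemset generated
        for X in ppc_extensions(Y):
            heapq.heappush(Q, (-len(cond(X)), next(tie), X))
    return out

for Y, s in top_k_miner(3):
    print(sorted(Y), s)    # ['a'] 4 / ['b'] 4 / ['c'] 4
```

Because Python's `heapq` is a min-heap, supports are negated to obtain the max-heap behaviour of Q; the loop keeps emitting itemsets whose support ties with σ_K, matching the termination condition above.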

Implementation. The efficient implementation of TopKMiner is a challenging task. For lack of space, we limit ourselves to briefly mentioning the key ingredients of our implementation; more details can be found in [6]. As in [7], the dataset D is represented through a Patricia trie T_D built on the set of transactions, with items sorted by decreasing support. An entry E(Y, s) of Q, corresponding to some closed itemset Y, is represented by the quadruple (D_Y, s, i, Y(i−1)), where i is the core index of Y. For space and time efficiency, we represent D_Y through the list L_D(Y) of nodes of T_D which contain the core index item a_i of Y and belong to paths associated with transactions in D_Y. We observe that (D_Y, s, i, Y(i−1)) contains only a prefix, Y(i−1), of Y. The actual generation of Y is delayed to the time when the entry (D_Y, s, i, Y(i−1)) is extracted from Q. At this point, a clever traversal of the subtrie of T_D whose leaves are the nodes of L_D(Y) is employed to generate Y and the quadruples for all of its ppc-extensions to be inserted into Q.

4 Experimental evaluation

We experimentally compared the performance of TopKMiner and TFP [10] on all real and synthetic datasets from the FIMI repository (http://fimi.cs.helsinki.fi). The experiments have been conducted on an HP ProLiant server, using one AMD Opteron 2.2GHz processor, with 8GB main memory, 64KB L1 cache and 1MB L2 cache. Both TopKMiner and TFP have been coded in C++, and the source code for TFP has been provided to us by its authors. Due to lack of space, we report only a few representative results, relative to datasets kosarak and accidents. The results for the other datasets are consistent with those reported here. The characteristics of the datasets are reported in the following table, including the values σ_K/|D| for K = 1000 and K = 10000:

Dataset    #Items  Avg. Trans. Length  #Transactions  σ_1000/|D|  σ_10000/|D|
accidents     468                33.8        340,183       0.656        0.483
kosarak    41,270                 8.1        990,002       0.006        0.002

Figures 1.(a) and 1.(b) show the running times of TopKMiner and TFP on kosarak and accidents, respectively, for values of K ranging from 1000 to 10000 with step 1000. For TopKMiner, we imposed K_max = K so as to assess the relative performance of the two algorithms when focused on the basic task of mining top-K f.c.i. In another experiment, we tested the effectiveness of TopKMiner's feature which allows the user to dynamically raise the value K up to a maximum value K_max. To this purpose, we simulated a scenario where K is raised from 1000 to 10000 with step 1000, and ran TopKMiner with K_max = 10000, measuring the running time after the computation for each value of K ended. We compared these running times with those attained by executing TFP from scratch for each value of K and accumulating the running times of previous executions. The results are shown in Figure 1.(c) for dataset accidents. We also analyzed the memory usage of TopKMiner and TFP on all datasets, for K between 1000 and 10000. TopKMiner requires less memory than TFP in almost all cases and, in the worst case, requires a factor 1.5 more memory.

Fig. 1. Running times of TopKMiner and TFP for: (a) kosarak without dynamic update of K; (b) accidents without dynamic update of K; and (c) accidents with dynamic update of K.

References

1. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Intl. Conference on Management of Data, pages 207-216, 1993.
2. E. Boros, V. Gurvich, L. Khachiyan, and K. Makino. On maximal frequent and minimal infrequent sets in binary matrices. Annals of Mathematics and Artificial Intelligence, 39:211-221, 2003.
3. Y. Cheung and A. Fu. Mining frequent itemsets without support threshold: with and without item constraints. IEEE Trans. on Knowledge and Data Engineering, 16(9):1052-1069, 2004.
4. A. Fu, R. Kwong, and J. Tang. Mining n-most interesting itemsets. In Proc. of the Intl. Symp. on Methodologies for Intelligent Systems, pages 59-67, 2000.
5. N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In Proc. of the 7th Intl. Conference on Database Theory, pages 398-416, Jan. 1999.
6. A. Pietracaprina and F. Vandin. Efficient Incremental Mining of Top-K Frequent Closed Itemsets. Technical report, available at http://www.dei.unipd.it/~capri/pietracaprinavtr07.pdf
7. A. Pietracaprina and D. Zandolin. Mining frequent itemsets using Patricia tries. In Proc. of the Workshop on Frequent Itemset Mining Implementations (FIMI03), Vol. 90, Melbourne, USA, Nov. 2003. CEUR-WS Workshop On-line Proceedings.
8. J. Seppanen and H. Mannila. Dense itemsets. In Proc. of the 10th ACM SIGKDD Intl. Conference on Knowledge Discovery and Data Mining, pages 683-688, 2004.
9. T. Uno, T. Asai, Y. Uchida, and H. Arimura. An efficient algorithm for enumerating closed patterns in transaction databases. In Proc. of the 7th Intl. Conf. on Discovery Science, pages 16-31, 2004.
10. J. Wang, J. Han, Y. Lu, and P. Tzvetkov. TFP: An efficient algorithm for mining top-k frequent closed itemsets. IEEE Trans. on Knowledge and Data Engineering, 17(5):652-664, 2005.
11. G. Yang. The complexity of mining maximal frequent itemsets and maximal frequent patterns. In Proc. of the 10th ACM SIGKDD Intl. Conference on Knowledge Discovery and Data Mining, pages 344-353, 2004.