SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases

Jinlong Wang, Congfu Xu, Hongwei Dan, and Yunhe Pan
Institute of Artificial Intelligence, Zhejiang University, Hangzhou, 310027, China
zjupaper@yahoo.com, xucongfu@cs.zju.edu.cn, danhow2008@hotmail.com, panyh@sun.zju.edu.cn

Abstract. Maintaining privacy in frequent itemset mining has attracted considerable attention. In most of this work, only distorted data are available, which complicates the data mining process. In particular, in a dynamically updated distorted database, it is nontrivial to mine frequent itemsets incrementally because of the high counting overhead of recomputing support counts for itemsets. This paper investigates this problem and develops an efficient algorithm, SA-IFIM, for incrementally mining frequent itemsets in update distorted databases. In this algorithm, some additional information is stored during the earlier mining process to support efficient incremental computation. In particular, by introducing the supporting aggregate and representing it with a bit vector, the transaction database is transformed into a machine-oriented model that allows fast support computation. The performance studies show the efficiency of our algorithm.

1 Introduction

Recently, privacy has become one of the prime concerns in data mining. To avoid compromising privacy, most work applies distortion or randomization techniques to the original data set, and only the disguised data are shared for data mining [1-3]. Mining frequent itemset models from distorted databases with reconstruction methods incurs expensive overheads compared with mining the original data sets directly [2]. In [3, 4], basic formulas from set theory are used to eliminate these counting overheads. In reality, however, the databases in many applications are dynamic.
The changes on the data set may invalidate some existing frequent itemsets and introduce new ones, so incremental algorithms [5, 6] were proposed to address the problem. However, it is not efficient to apply these incremental algorithms directly to an update distorted database, because of the high counting overhead of recomputing support for itemsets. Although [7] has proposed an algorithm for incremental updating, its efficiency is still insufficient in practice.

This paper investigates the problem of incremental frequent itemset mining in update distorted databases. We first develop an efficient incremental updating computation method that quickly reconstructs an itemset's support by using additional information stored during the earlier mining process. Then, a new concept, the supporting aggregate (SA), is introduced and represented with a bit vector. In this way, the transaction database is transformed into a machine-oriented model that allows fast support computation. Finally, an efficient algorithm, SA-IFIM (Supporting Aggregate based Incremental Frequent Itemset Mining in update distorted databases), is presented to describe the process. The performance studies show the efficiency of our algorithm.

The remainder of this paper is organized as follows. Section 2 presents the SA-IFIM algorithm step by step. The performance studies are reported in Section 3. Finally, Section 4 concludes this paper.

Supported by the Natural Science Foundation of China (No. 60402010), Zhejiang Provincial Natural Science Foundation of China (Y105250) and the Science-Technology Program of Zhejiang Province of China (No. 2004C31098). Congfu Xu is the corresponding author.

2 The SA-IFIM Algorithm

In this section, the SA-IFIM algorithm is introduced step by step. Before mining, each data set is distorted using the method of EMASK [3]. In the following, we first describe the preliminaries of incremental frequent itemset mining, then investigate the essence of the updating technique and use additional information recorded during the earlier mining, together with set theory, for quick updating computation. Next, we introduce the supporting aggregate and represent it with a bit vector to transform the database into a machine-oriented model that speeds up computation. Finally, the SA-IFIM algorithm is summarized.

2.1 Preliminaries

In this subsection, some preliminaries on incremental frequent itemset mining are presented, summarizing the formal description in [5, 6].
Let $D$ be a set of transactions and $I = \{i_1, i_2, \ldots, i_m\}$ a set of distinct literals (items). For a dynamic database, old transactions $\Delta^-$ are deleted from the database $D$ and new transactions $\Delta^+$ are added. Naturally, $\Delta^- \subseteq D$. Denote the updated database by $D'$, so that $D' = (D - \Delta^-) \cup \Delta^+$, and the unchanged transactions by $D^- = D - \Delta^-$. Let $Fp$ denote the frequent itemsets in the original database $D$, and $Fp_k$ the frequent $k$-itemsets. The problem of incremental mining is to find the frequent itemsets (denoted by $Fp'$) in $D'$, given $\Delta^-$, $D^-$, $\Delta^+$, and the mining result $Fp$, with respect to the same user-specified minimum support $s$. Furthermore, the incremental approach needs to take advantage of previously obtained information to avoid rerunning the mining algorithm on the whole database when the database is updated.

For clarity, we present $s$ as a relative support value, but $\delta_c^+$, $\delta_c^-$, $\sigma_c$, and $\sigma'_c$ as absolute support counts in $\Delta^+$, $\Delta^-$, $D$, and $D'$ respectively. Let $\delta_c$ be the change in the support count of itemset $c$. Then $\delta_c = \delta_c^+ - \delta_c^-$ and $\sigma'_c = \sigma_c + \delta_c^+ - \delta_c^-$.
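As a small illustration of the update rule $\sigma'_c = \sigma_c + \delta_c^+ - \delta_c^-$, the following sketch counts supports on a toy, undistorted database and checks the incremental result against a direct count on $D'$. The data and the helper name `support_count` are hypothetical and are not taken from the paper.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Hypothetical toy data (plain, undistorted transactions).
D = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
delta_minus = [{"a", "b"}]                   # deleted transactions, taken from D
delta_plus = [{"a", "b", "c"}, {"b", "c"}]   # newly added transactions

c = {"a", "b"}
sigma_c = support_count(D, c)                # support count in D
d_plus = support_count(delta_plus, c)        # support count in Delta^+
d_minus = support_count(delta_minus, c)      # support count in Delta^-

# Incremental update: no rescan of the unchanged part of D is needed.
sigma_new = sigma_c + d_plus - d_minus

# Direct count on D' = (D - Delta^-) U Delta^+ for comparison.
D_prime = list(D)
for t in delta_minus:
    D_prime.remove(t)
D_prime.extend(delta_plus)
direct = support_count(D_prime, c)
```

Here `sigma_new` and `direct` agree, which is the point of the update rule: only the changed transactions have to be counted when $\sigma_c$ is already known.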

2.2 Efficient incremental computation

In a dynamically updated environment, the key questions are how to handle the itemsets that are frequent in $D$ (recorded in $Fp$), and how to add the itemsets that are non-frequent in $D$ (not in $Fp$) but frequent in $D'$. In the following, for simplicity, $|\cdot|$ denotes the number of tuples in a transaction database.

1. For the frequent itemsets in $Fp$, find those that become non-frequent and those that remain frequent in the updated database $D'$.

Lemma 1. If $c \in Fp$ (that is, $\sigma_c \ge |D| \times s$) and $\delta_c \ge (|\Delta^+| - |\Delta^-|) \times s$, then $c \in Fp'$.

Proof. $\sigma'_c = \sigma_c + \delta_c^+ - \delta_c^- \ge |D| \times s + |\Delta^+| \times s - |\Delta^-| \times s = (|D| + |\Delta^+| - |\Delta^-|) \times s = |D'| \times s$.

Property 1. When $c \in Fp$ and $\delta_c < (|\Delta^+| - |\Delta^-|) \times s$, then $c \in Fp'$ if and only if $\sigma'_c \ge |D'| \times s$.

2. For itemsets that are non-frequent in $D$, mine the frequent itemsets in the changed database $\Delta^+$ and recompute their support counts by scanning $D^-$.

Lemma 2. If $c \notin Fp$ and $\delta_c < (|\Delta^+| - |\Delta^-|) \times s$, then $c \notin Fp'$.

Proof. Refer to Lemma 1.

Property 2. When $c \notin Fp$ and $\delta_c \ge (|\Delta^+| - |\Delta^-|) \times s$, then $c \in Fp'$ if and only if $\sigma'_c \ge |D'| \times s$.

Under the symbol-specific distortion framework of [3], a 1 and a 0 in the original database are flipped with probabilities $(1 - p)$ and $(1 - q)$ respectively. In incremental frequent itemset mining, the goal is to mine frequent itemsets from the distorted databases using the information obtained during the earlier process. To test the condition of Property 2 for an itemset not in $Fp$, we need to reconstruct the itemset's support in the unchanged database $D^-$ by scanning $D^-$. Not only the distorted support of the itemset itself but also some other counts related to it must be tracked. This makes the support count computation in Property 2 both difficult and of paramount importance in incremental mining, and it is nontrivial to apply traditional incremental algorithms to it directly.
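The four cases above (Lemma 1, Property 1, Lemma 2, Property 2) amount to a two-way test on membership in $Fp$ and on the size of $\delta_c$. The sketch below shows that decision logic only; the function names and return labels are hypothetical, and the sketch works on plain support counts, leaving out the reconstruction of supports from distorted data.

```python
def classify(in_fp, delta_c, n_plus, n_minus, s):
    """Decide which case of Section 2.2 applies to an itemset c.

    in_fp   : True if c was frequent in the original database D
    delta_c : change in support count, delta_c^+ - delta_c^-
    n_plus  : |Delta^+|, number of added transactions
    n_minus : |Delta^-|, number of deleted transactions
    s       : relative minimum support
    """
    threshold = (n_plus - n_minus) * s
    if in_fp:
        if delta_c >= threshold:
            return "frequent"        # Lemma 1: no further check needed
        return "check"               # Property 1: compare sigma'_c with |D'| * s
    if delta_c < threshold:
        return "not frequent"        # Lemma 2: c can be discarded
    return "check after scan"        # Property 2: needs the support in D^-


def is_frequent_after_update(sigma_new, n_d, n_plus, n_minus, s):
    """Final test used by Properties 1 and 2: sigma'_c >= |D'| * s."""
    return sigma_new >= (n_d + n_plus - n_minus) * s
```

For example, with 12 added and 4 deleted transactions and $s = 0.25$, the threshold is 2, so an itemset from $Fp$ whose count grows by 2 stays frequent without any further counting, while an itemset outside $Fp$ whose count grows by only 1 is discarded without scanning $D^-$.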
To address the problem, an efficient incremental updating operation is first developed through computation with the supports in the distorted database; another method that improves the efficiency of support computation is then presented in Section 2.3.

In distorted databases, computing the supports of frequent itemsets is tedious. Motivated by [3], a similar support computation method is used in incremental mining. With this method, computing an itemset's support requires the support counts of all its subsets in the distorted database. However, saving the support counts of all itemsets would be impractical: it would greatly increase cost and degrade indexing efficiency. Thus, in incremental mining, when the frequent itemsets and their support counts are recorded, the corresponding counts in each distorted database are registered at the same time. In this way, for a $k$-itemset not in $Fp$, since all its subsets are frequent in the database, we can use the existing support counts in each distorted database to compute and reconstruct its support in the updated database quickly. Thus, the efficiency is improved.

2.3 Supporting aggregate and database transformation

To improve efficiency, we introduce the concept of the supporting aggregate and use a bit vector to represent it. By virtue of the elementary supporting aggregate based on bit vectors, the database is transformed into a machine-oriented data model, which improves the efficiency of itemset support computation.

In the following, for a transaction database $D$, let $U$ denote a set of objects (universe) that serve as unique identifiers for the transactions. For simplicity, we refer to $U$ and the transactions interchangeably. For an itemset $A \subseteq I$, a transaction $u \in U$ is said to contain $A$ if $A \subseteq u$.

Definition 1. Supporting aggregate (SA). For an attribute itemset $A \subseteq I$, denote $S(A) = \{u \in U \mid A \subseteq u\}$ as its supporting aggregate, where $S(A)$ is the aggregate composed of the transactions that include the attribute itemset $A$. Generally, $S(A) \subseteq U$.

The supporting aggregate of a single attribute item is called an elementary supporting aggregate (ESA). Using ESAs, the original transaction database is vertically inverted and transformed into an attribute-transaction list. From the ESAs, the SA of an itemset can be obtained quickly by set intersection, and the itemset's support can be computed efficiently.

To further improve processing speed, each SA (ESA) is represented as a BV-SA (BV-ESA), a binary vector of $|U|$ dimensions ($|U|$ is the number of transactions in $U$).
If an itemset's SA contains the $i$th transaction, the $i$th dimension of its binary vector is set to 1; otherwise, the corresponding position is set to 0. With this representation, the support count of each attribute item can be computed efficiently. With the vertical database representation, where each row is an attribute's BV-ESA, attribute items can be removed successively owing to the downward closure property [8], which efficiently reduces the size of the data set.

On the other hand, the whole BV-ESA sometimes cannot be loaded into memory because of memory constraints. Our approach addresses this scalability problem by horizontally partitioning the transaction data set into subsets, each composed of part of the objects (transactions), and then loading them partition by partition. Since the partitions are disjoint from each other, the method is suitable for parallel and distributed processing. Furthermore, in practice, an optimized memory swap strategy can be adopted to reduce the I/O cost.
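The bit-vector representation and the partition-by-partition loading described above can be sketched as follows. This is a minimal illustration on plain (undistorted) toy transactions, using Python integers as bit vectors; the function names and the data are hypothetical, and the partitions are simply slices of an in-memory list rather than blocks read from disk.

```python
def build_bv_esa(transactions):
    """Map each item to an integer whose i-th bit is 1 iff transaction i contains it."""
    bv = {}
    for i, t in enumerate(transactions):
        for item in t:
            bv[item] = bv.get(item, 0) | (1 << i)
    return bv


def support_in_partition(bv_esa, itemset, n):
    """Support count inside one partition of n transactions.

    The BV-SA of the itemset is the bitwise AND of its items' BV-ESAs,
    and the support count is the number of 1 bits in that vector.
    """
    vec = (1 << n) - 1                 # start with every transaction selected
    for item in itemset:
        vec &= bv_esa.get(item, 0)     # an item absent from the partition gives 0
    return bin(vec).count("1")


def support(transactions, itemset, partition_size):
    """Sum the per-partition counts; the partitions are disjoint."""
    total = 0
    for start in range(0, len(transactions), partition_size):
        part = transactions[start:start + partition_size]
        total += support_in_partition(build_bv_esa(part), itemset, len(part))
    return total


# Hypothetical toy data.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}, {"a", "b"}]
bv_all = build_bv_esa(transactions)
count_ab = support(transactions, {"a", "b"}, partition_size=2)
```

Because the partitions are disjoint, the per-partition counts can simply be added, which is also what makes the scheme amenable to parallel processing.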

2.4 The process of SA-IFIM algorithm

In this subsection, the algorithm SA-IFIM is summarized as Algorithm 1. When the distorted data sets $D^-$, $\Delta^-$ and $\Delta^+$ are first scanned, they are transformed partition by partition into the corresponding vertical bit vector representations $BV(D^-)$, $BV(\Delta^-)$ and $BV(\Delta^+)$, and saved to hard disk. From these representations, the frequent $k$-itemsets $Fp_k$ can be obtained level by level. Following the candidate generation-and-test approach, the candidate frequent $k$-itemsets ($C_k$) are generated from the frequent $(k-1)$-itemsets ($Fp_{k-1}$).

Algorithm 1: Algorithm SA-IFIM
Input: $D^-$, $\Delta^+$, $\Delta^-$; $Fp$ (frequent itemsets and their support counts in $D$); the frequent itemsets of $Fp$ with their corresponding support counts in $D^-$; minimum support $s$; and distortion parameters $p$, $q$ as in EMASK [3].
Output: $Fp'$ (frequent itemsets and their support counts in $D'$).
Method: As shown in Fig. 1. In the algorithm, temporary files are used to store the support counts in the distorted database for efficiency.

Fig. 1. SA-IFIM algorithm diagram.
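The level-wise step that derives candidate $k$-itemsets from the frequent $(k-1)$-itemsets can be sketched as below. This is a generic Apriori-style join-and-prune in the spirit of [8], written with a simple union-based join rather than the prefix-based join of the original Apriori; it is not claimed to be the exact procedure of Fig. 1, and the function name and data are hypothetical.

```python
from itertools import combinations


def generate_candidates(fp_prev, k):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets.

    Join : unite two frequent (k-1)-itemsets whose union has exactly k items.
    Prune: keep a candidate only if every (k-1)-subset is frequent
           (downward closure property).
    """
    prev = set(fp_prev)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) != k:
                continue
            if all(frozenset(sub) in prev for sub in combinations(union, k - 1)):
                candidates.add(union)
    return candidates


# Hypothetical frequent 2-itemsets.
fp2 = {frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("bd")}
c3 = generate_candidates(fp2, 3)
```

With these inputs only {a, b, c} survives: {a, b, d} and {b, c, d} are pruned because {a, d} and {c, d} are not frequent. The supports of the surviving candidates would then be tested against the bit vector representations.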

3 Performance Evaluation

This section reports comprehensive experiments comparing SA-IFIM with EMASK, whose implementation was provided by its authors [9]. For a better performance evaluation, we also implemented the algorithm IFIM (similar to IPPFIM [7]). All programs were coded in C++ using Cygwin with gcc 2.9.5. The experiments were run on a 3 GHz P4 processor with 1 GB of memory. SA-IFIM and IFIM yield the same itemsets as EMASK for the same data set and the same minimum support parameters.

Our experiments were performed on synthetic data sets produced by the IBM synthetic market-basket data generator [8]. In the following, we use the notation $|D|$ (number of transactions), $|T|$ (average size of the transactions), $|I|$ (average size of the maximal potentially large itemsets), and $N$ (number of items), and set $N = 1000$. In our method, the sizes of $\Delta^+$ and $\Delta^-$ are not required to be the same. Without loss of generality, let $|d| = |\Delta^+| = |\Delta^-|$ for simplicity. For the sake of clarity, TxIyDmdn denotes an original database together with an update database, where the parameters $|T| = x$ and $|I| = y$ are the same for both, and only the number of transactions differs: $|D| = m$ for the original database and $|d| = n$ for the update database. In the following, we used the distorted benchmark data sets as the input databases to the algorithms. The distortion parameters are the same as in EMASK [3], with $p = 0.5$ and $q = 0.97$. In the experiments, for a fair comparison of the algorithms and to reflect scalability requirements, SA-IFIM is run with only 5K transactions loaded into main memory at a time.

3.1 Different support analysis

In Fig. 2, the relative performance of SA-IFIM, IFIM and EMASK is compared on two data sets, T25I4D100Kd10K (sparse) and T40I10D100Kd10K (dense), for various minimum supports. As shown in Fig. 2, SA-IFIM leads to a prominent performance improvement. Specifically, on the sparse data set (T25I4D100Kd10K), IFIM is close to EMASK, and SA-IFIM is orders of magnitude faster than both; on the dense data set (T40I10D100Kd10K), IFIM is faster than EMASK, but SA-IFIM also outperforms IFIM, and the margin grows as the minimum support decreases.

3.2 Effect of the update size

Two data sets, T25I4D100Kdm and T40I10D100Kdm, were used in this experiment, and the results are shown in Fig. 3. As expected, when the same number of transactions is deleted and added, the time for rerunning EMASK remains constant, whereas that of IFIM increases sharply and quickly surpasses EMASK. In Fig. 3, the execution time of SA-IFIM is much less than that of EMASK. SA-IFIM still significantly outperforms EMASK even when the update size is very large.

Fig. 2. Extensive analysis for different support. (a) T25I4D100Kd10K (b) T40I10D100Kd10K

Fig. 3. Different updating tuples analysis. (a) T25I4D100Kdm (s=0.6%) (b) T40I10D100Kdm (s=1.25%)

3.3 Scale up performance

Finally, to assess the scalability of the algorithm SA-IFIM, two experiments, T25I4Dmd(m/10) at s = 0.6% and T40I10Dmd(m/10) at s = 1.25%, were conducted to examine the scale-up performance as the size of the mined data set grows. The scale-up results for the two data sets are shown in Fig. 4, which illustrates the impact of $|D|$ and $|d|$ on the algorithms SA-IFIM and EMASK. In the experiments, the size of the update database is 10% of the original database, and the size m of the transaction database was increased from 100K to 1000K. As shown in Fig. 4, EMASK is very sensitive to the number of updated tuples but SA-IFIM is not, and the execution time of SA-IFIM increases linearly with the database size. This shows that the algorithm can be applied to very large databases and demonstrates its good scalability.

Fig. 4. Scale up performance analysis. (a) T25I4Dmd(m/10) (s=0.6%) (b) T40I10Dmd(m/10) (s=1.25%)

4 Conclusions

In this paper, we explore the issue of frequent itemset mining in dynamically updated distorted databases. We first develop an efficient incremental updating computation method to quickly reconstruct an itemset's support. By introducing the supporting aggregate represented with a bit vector, the databases are transformed into representations that are easier for a computer to access and process, so that support counts can be computed efficiently. The experiments show that SA-IFIM significantly outperforms rerunning EMASK on the whole updated database, and that it also holds an advantage over the incremental algorithm based only on EMASK.

References

1. Agrawal, R., and Srikant, R.: Privacy-preserving data mining. In: Proceedings of SIGMOD. (2000) 439-450
2. Rizvi, S., and Haritsa, J.: Maintaining data privacy in association rule mining. In: Proceedings of VLDB. (2002) 682-693
3. Agrawal, S., Krishnan, V., and Haritsa, J.: On addressing efficiency concerns in privacy-preserving mining. In: Proceedings of DASFAA. (2004) 113-124
4. Xu, C., Wang, J., Dan, H., and Pan, Y.: An improved EMASK algorithm for privacy-preserving frequent pattern mining. In: Proceedings of CIS. (2005) 752-757
5. Cheung, D., Han, J., Ng, V., and Wong, C.: Maintenance of discovered association rules in large databases: An incremental updating technique. In: Proceedings of ICDE. (1996) 104-114
6. Cheung, D., Lee, S., and Kao, B.: A general incremental technique for updating discovered association rules. In: Proceedings of DASFAA. (1997) 106-114
7. Wang, J., Xu, C., and Pan, Y.: An Incremental Algorithm for Mining Privacy-Preserving Frequent Itemsets. In: Proceedings of ICMLC. (2006)
8. Agrawal, R., and Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of VLDB. (1994) 487-499
9. http://dsl.serc.iisc.ernet.in/projects/software/software.html.