SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases

Jinlong Wang, Congfu Xu, Hongwei Dan, and Yunhe Pan
Institute of Artificial Intelligence, Zhejiang University, Hangzhou, 310027, China
zjupaper@yahoo.com, xucongfu@cs.zju.edu.cn, danhow2008@hotmail.com, panyh@sun.zju.edu.cn

Abstract. Maintaining privacy in frequent itemset mining has attracted considerable attention. In most existing work, only distorted data are available, which complicates the data mining process. In particular, when a distorted database is dynamically updated, mining frequent itemsets incrementally is nontrivial because of the high counting overhead of recomputing the support counts of itemsets. This paper investigates this problem and develops an efficient algorithm, SA-IFIM, for incrementally mining frequent itemsets in update distorted databases. The algorithm stores some additional information during the earlier mining process to support efficient incremental computation. Furthermore, by introducing the supporting aggregate and representing it with a bit vector, the transaction database is transformed into a machine oriented model that allows fast support computation. The performance studies show the efficiency of our algorithm.

1 Introduction

Recently, privacy has become one of the prime concerns in data mining. To avoid compromising privacy, most approaches apply distortion or randomization techniques to the original data set, and only the disguised data are shared for data mining [1-3]. Mining frequent itemset models from distorted databases with reconstruction methods incurs expensive overheads compared to directly mining the original data sets [2]. In [3, 4], basic formulas from set theory are used to eliminate these counting overheads. In reality, however, the databases of many applications are dynamic.
Changes to the data set may invalidate some existing frequent itemsets and introduce some new ones, so incremental algorithms [5, 6] were proposed to address the problem. However, it is not efficient to apply these incremental algorithms directly to an update distorted database, because of the high counting overhead of recomputing the support of itemsets. Although [7] proposed an algorithm for incremental updating, its efficiency still does not meet practical requirements.

(Supported by the Natural Science Foundation of China (No. 60402010), the Zhejiang Provincial Natural Science Foundation of China (Y105250), and the Science-Technology Program of Zhejiang Province of China (No. 2004C31098). Congfu Xu is the corresponding author.)

This paper investigates the problem of incremental frequent itemset mining in update distorted databases. We first develop an efficient incremental updating computation method that quickly reconstructs an itemset's support by using additional information stored during the earlier mining process. Then, a new concept, the supporting aggregate (SA), is introduced and represented with a bit vector. In this way, the transaction database is transformed into a machine oriented model that allows fast support computation. Finally, an efficient algorithm, SA-IFIM (Supporting Aggregate based Incremental Frequent Itemset Mining in update distorted databases), is presented to describe the process. The performance studies show the efficiency of our algorithm.

The remainder of this paper is organized as follows. Section 2 presents the SA-IFIM algorithm step by step. The performance studies are reported in Section 3. Finally, Section 4 concludes this paper.

2 The SA-IFIM Algorithm

In this section, the SA-IFIM algorithm is introduced step by step. Before mining, each data set is distorted using the method of EMASK [3]. In the following, we first describe the preliminaries of incremental frequent itemset mining, then investigate the essence of the updating technique, using additional information recorded during the earlier mining together with set theory for quick updating computation. Next, we introduce the supporting aggregate and represent it with a bit vector to transform the database into a machine oriented model that speeds up computation. Finally, the SA-IFIM algorithm is summarized.

2.1 Preliminaries

In this subsection, some preliminaries about incremental frequent itemset mining are presented, summarizing the formal description in [5, 6].
Let $D$ be a set of transactions and $I = \{i_1, i_2, \ldots, i_m\}$ a set of distinct literals (items). For a dynamic database, old transactions $\Delta^-$ are deleted from the database $D$ and new transactions $\Delta^+$ are added. Naturally, $\Delta^- \subseteq D$. Denote the updated database by $D'$, so that $D' = (D - \Delta^-) \cup \Delta^+$, and the unchanged transactions by $D^- = D - \Delta^-$.

Let $F^p$ denote the frequent itemsets in the original database $D$, and $F^p_k$ the frequent $k$-itemsets. The problem of incremental mining is to find the frequent itemsets (denoted by $F^{p'}$) in $D'$, given $\Delta^-$, $D^-$, $\Delta^+$, and the mining result $F^p$, with respect to the same user specified minimum support $s$. Furthermore, the incremental approach needs to take advantage of previously obtained information to avoid rerunning the mining algorithm on the whole database when the database is updated.

For clarity, we present $s$ as a relative support value, but $\delta^+_c$, $\delta^-_c$, $\sigma_c$, and $\sigma'_c$ as absolute support counts of an itemset $c$ in $\Delta^+$, $\Delta^-$, $D$, and $D'$, respectively. Let $\delta_c$ be the change of the support count of itemset $c$. Then

$$\delta_c = \delta^+_c - \delta^-_c, \qquad \sigma'_c = \sigma_c + \delta^+_c - \delta^-_c .$$
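The update formula above can be illustrated with a short Python sketch. The function names and all numbers are hypothetical and chosen only for illustration; the sketch works on plain (undistorted) counts and does not include the reconstruction step needed for distorted data.

```python
def updated_support(sigma_c, delta_plus_c, delta_minus_c):
    """Return sigma'_c = sigma_c + delta+_c - delta-_c (absolute counts)."""
    return sigma_c + delta_plus_c - delta_minus_c


def is_frequent(count, db_size, s):
    """An itemset is frequent when its absolute count reaches db_size * s."""
    return count >= db_size * s


# Hypothetical example: |D| = 1000, 50 transactions deleted, 100 inserted.
D_size, deleted, inserted = 1000, 50, 100
s = 0.05                                   # relative minimum support
sigma_c, d_plus, d_minus = 52, 3, 4        # counts of c in D, Delta+, Delta-

new_size = D_size - deleted + inserted     # |D'|
sigma_new = updated_support(sigma_c, d_plus, d_minus)

print(is_frequent(sigma_c, D_size, s))     # c in the original database D
print(is_frequent(sigma_new, new_size, s)) # c in the updated database D'
```

With these numbers the itemset meets the threshold in $D$ but no longer meets it in $D'$, which is the kind of invalidation that incremental mining has to detect.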

2.2 Efficient incremental computation

Generally, in a dynamically updating environment, the important aspects of mining are how to deal with the itemsets that are frequent in $D$ (recorded in $F^p$), and how to add the itemsets that are non-frequent in $D$ (not in $F^p$) but frequent in $D'$. In the following, for simplicity, we write $|\cdot|$ for the number of tuples in a transaction database.

1. For the frequent itemsets in $F^p$, determine which become non-frequent and which remain frequent in the updated database $D'$.

Lemma 1. If $c \in F^p$ (that is, $\sigma_c \ge |D| \times s$) and $\delta_c \ge (|\Delta^+| - |\Delta^-|) \times s$, then $c \in F^{p'}$.

Proof. $\sigma'_c = \sigma_c + \delta^+_c - \delta^-_c \ge |D| \times s + |\Delta^+| \times s - |\Delta^-| \times s = (|D| + |\Delta^+| - |\Delta^-|) \times s = |D'| \times s$.

Property 1. When $c \in F^p$ and $\delta_c < (|\Delta^+| - |\Delta^-|) \times s$, then $c \in F^{p'}$ if and only if $\sigma'_c \ge |D'| \times s$.

2. For the itemsets that are non-frequent in $D$, mine the frequent itemsets in the changed databases $\Delta^+$ and $\Delta^-$, and recompute their support counts by scanning $D^-$.

Lemma 2. If $c \notin F^p$ and $\delta_c < (|\Delta^+| - |\Delta^-|) \times s$, then $c \notin F^{p'}$.

Proof. Analogous to Lemma 1.

Property 2. When $c \notin F^p$ and $\delta_c \ge (|\Delta^+| - |\Delta^-|) \times s$, then $c \in F^{p'}$ if and only if $\sigma'_c \ge |D'| \times s$.

Under the symbol-specific distortion process of [3], the 1s and 0s in the original database are flipped with probabilities $(1 - p)$ and $(1 - q)$, respectively. In incremental frequent itemset mining, the goal is to mine frequent itemsets from the distorted databases using the information obtained during the earlier process. To test the condition of Property 2 for an itemset not in $F^p$, we need to reconstruct the itemset's support in the unchanged database $D^-$ by scanning $D^-$. Not only the distorted support of the itemset itself, but also some other counts related to it need to be tracked. This makes the support count computation in Property 2 both difficult and of paramount importance in incremental mining, and it is nontrivial to apply traditional incremental algorithms to it directly.
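The four cases of Lemma 1, Property 1, Lemma 2, and Property 2 can be read as a screening step that decides how much work each itemset needs. The following Python sketch is only an illustration under that reading; the function name `screen` and its return labels are hypothetical and do not come from the paper.

```python
def screen(in_fp, delta_c, ins_size, del_size, s):
    """Classify itemset c after an update.

    in_fp    -- True if c was frequent in the original database D
    delta_c  -- delta+_c - delta-_c, the change of the support count of c
    ins_size -- |Delta+|, number of inserted transactions
    del_size -- |Delta-|, number of deleted transactions
    s        -- relative minimum support
    """
    threshold = (ins_size - del_size) * s
    if in_fp:
        if delta_c >= threshold:
            return "frequent"   # Lemma 1: still frequent, no further test
        return "recheck"        # Property 1: test sigma'_c >= |D'| * s
    if delta_c < threshold:
        return "pruned"         # Lemma 2: cannot become frequent
    return "scan"               # Property 2: support in D^- must be obtained


# Hypothetical update: 100 insertions, 50 deletions, s = 5%.
for in_fp, delta in [(True, 3), (True, -1), (False, 1), (False, 5)]:
    print(in_fp, delta, screen(in_fp, delta, 100, 50, 0.05))
```

Only the "scan" case requires the support in the unchanged database $D^-$, which is why the reconstruction cost in Property 2 dominates in the distorted setting.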
To address the problem, an efficient incremental updating operation is first developed, based on computation with the supports in the distorted database; Section 2.3 then presents another method to improve the efficiency of support computation.

In distorted databases, the support computation of frequent itemsets is tedious. Motivated by [3], a similar support computation method is used in incremental mining. With this method, computing an itemset's support requires the support counts of all its subsets in the distorted database. However, saving the support counts of all itemsets would be impractical: it would greatly increase cost and degrade indexing efficiency. Thus, in incremental mining, when the frequent itemsets and their support counts are recorded, the corresponding counts in each distorted database are registered at the same time. In this way, for a $k$-itemset not in $F^p$, since all its subsets are frequent in the database, we can use the existing support counts in each distorted database to compute and reconstruct its support in the updated database quickly. Thus, the efficiency is improved.

2.3 Supporting aggregate and database transformation

To improve efficiency, we introduce the concept of the supporting aggregate and use a bit vector to represent it. By means of elementary supporting aggregates based on bit vectors, the database is transformed into a machine oriented data model, which improves the efficiency of itemset support computation.

In the following, for a transaction database $D$, let $U$ denote a set of objects (the universe) that serve as unique identifiers of the transactions. For simplicity, we refer to $U$ and the transactions interchangeably. For an itemset $A \subseteq I$, a transaction $u \in U$ is said to contain $A$ if $A \subseteq u$.

Definition 1 (supporting aggregate, SA). For an attribute itemset $A \subseteq I$, denote $S(A) = \{u \in U \mid A \subseteq u\}$ as its supporting aggregate, that is, the aggregate composed of the transactions that include the attribute itemset $A$. Generally, $S(A) \subseteq U$.

The supporting aggregate of a single attribute item is called an elementary supporting aggregate (ESA). Using ESAs, the original transaction database is vertically inverted and transformed into an attribute-transaction list. From the ESAs, the SA of an itemset can be obtained quickly by set intersection, and the itemset's support can be computed efficiently.

To further improve processing speed, each SA (ESA) is represented as a BV-SA (BV-ESA), a binary vector of $|U|$ dimensions ($|U|$ is the number of transactions in $U$).
If an itemset's SA contains the $i$th transaction, the $i$th dimension of its binary vector is set to 1; otherwise, the corresponding position is set to 0. With this representation, the support count of each attribute item can be computed efficiently. In the vertical database representation, where each row is an attribute's BV-ESA, attribute items can be removed successively thanks to the downward closure property [8], which efficiently reduces the size of the data set.

On the other hand, the whole set of BV-ESAs sometimes cannot be loaded into memory because of memory constraints. Our approach addresses this scalability problem by horizontally partitioning the transaction data set into subsets, each composed of part of the objects (transactions), and then loading them partition by partition. The partitions are disjoint from one another, which makes the method suitable for parallel and distributed processing. Furthermore, in practice, an optimized memory swap strategy can be adopted to reduce the I/O cost.
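The BV-ESA representation and the partition-by-partition loading can be sketched in a few lines of Python, using arbitrary-precision integers as bit vectors. This is a minimal illustration, not the authors' C++ implementation; the function names and the toy transactions are assumptions, and the counts are plain counts without the distortion-aware reconstruction.

```python
def build_bv_esa(transactions):
    """Map each item to an int whose i-th bit is 1 iff transaction i contains it."""
    bv = {}
    for i, transaction in enumerate(transactions):
        for item in transaction:
            bv[item] = bv.get(item, 0) | (1 << i)
    return bv


def support_count(itemset, bv_esa):
    """AND the BV-ESAs of the items (non-empty itemset) and count the set bits."""
    items = list(itemset)
    acc = bv_esa.get(items[0], 0)
    for item in items[1:]:
        acc &= bv_esa.get(item, 0)   # intersection of supporting aggregates
    return bin(acc).count("1")


def partitioned_support(itemset, transactions, partition_size):
    """Sum the counts over disjoint horizontal partitions loaded one at a time."""
    total = 0
    for start in range(0, len(transactions), partition_size):
        part = transactions[start:start + partition_size]
        total += support_count(itemset, build_bv_esa(part))
    return total


# Toy database with five transactions (hypothetical data).
db = [{"a", "b", "c"}, {"a", "c"}, {"b"}, {"a", "b", "c"}, {"c"}]
bv = build_bv_esa(db)
print(support_count({"a", "c"}, bv))
print(partitioned_support({"a", "c"}, db, partition_size=2))
```

Because the partitions are disjoint, the per-partition counts simply add up, so the partitioned count equals the count over the whole database.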

2.4 The process of SA-IFIM algorithm

In this subsection, the algorithm SA-IFIM is summarized as Algorithm 1. When the distorted data sets $D^-$, $\Delta^-$, and $\Delta^+$ are first scanned, they are transformed partition by partition into the corresponding vertical bit vector representations $BV(D^-)$, $BV(\Delta^-)$, and $BV(\Delta^+)$, which are saved to hard disk. From these representations, the frequent $k$-itemsets $F^p_k$ can be obtained level by level. Following the candidate generation-and-test approach, the candidate frequent $k$-itemsets ($C_k$) are generated from the frequent $(k-1)$-itemsets ($F^p_{k-1}$).

Algorithm 1: Algorithm SA-IFIM
Input: $D^-$, $\Delta^+$, $\Delta^-$; $F^p$ (the frequent itemsets and their support counts in $D$); the itemsets of $F^p$ with their corresponding support counts in the distorted database; the minimum support $s$; and the distortion parameters $p$, $q$ as in EMASK [3].
Output: $F^{p'}$ (the frequent itemsets and their support counts in $D'$).
Method: As shown in Fig. 1. In the algorithm, some temporary files are used to store the support counts in the distorted database for efficiency.

Fig. 1. SA-IFIM algorithm diagram.
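The level-wise step that builds $C_k$ from the frequent $(k-1)$-itemsets can be sketched as follows. The paper states only that a candidate generation-and-test approach is used; the join-and-prune procedure below is the standard Apriori-style generation of [8], assumed here for illustration, and the function name is hypothetical.

```python
from itertools import combinations


def generate_candidates(frequent_prev):
    """Build candidate k-itemsets from a set of frequent (k-1)-itemsets.

    frequent_prev -- set of frozensets, all of the same size k-1
    """
    prev = set(frequent_prev)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) != len(a) + 1:
                continue  # a and b must differ in exactly one item
            # Downward closure: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in prev
                   for sub in combinations(union, len(a))):
                candidates.add(union)
    return candidates


# Hypothetical frequent 2-itemsets.
f2 = {frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("bd")}
print(generate_candidates(f2))
```

In this toy input only {a, b, c} survives, since {a, b, d} and {b, c, d} each contain a 2-subset that is not frequent. Each surviving candidate would then be tested by computing its support, for example from the bit vector representations.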

3 Performance Evaluation

This section reports comprehensive experiments comparing SA-IFIM with EMASK, whose implementation was provided by its authors [9]. For a better performance evaluation, we also implemented the algorithm IFIM (similar to IPPFIM [7]). All programs were coded in C++ using Cygwin with gcc 2.9.5. The experiments were run on a 3 GHz P4 processor with 1 GB of memory. SA-IFIM and IFIM yield the same itemsets as EMASK given the same data set and the same minimum support parameters.

Our experiments were performed on synthetic data sets produced by the IBM synthetic market-basket data generator [8]. In the following, we use the notation $|D|$ (number of transactions), $T$ (average size of the transactions), $I$ (average size of the maximal potentially large itemsets), and $N$ (number of items), and set $N = 1000$. In our method, the sizes of $\Delta^+$ and $\Delta^-$ are not required to be the same. Without loss of generality, let $d = |\Delta^+| = |\Delta^-|$ for simplicity. For the sake of clarity, TxIyDmdn denotes an original database together with an update database, where the parameters $T = x$ and $I = y$ are the same for both, and only the number of transactions differs: $|D| = m$ for the original transaction database and $d = n$ for the update transaction database.

In the following, the distorted benchmark data sets are used as the input databases to the algorithms. The distortion parameters are the same as in EMASK [3], with $p = 0.5$ and $q = 0.97$. In the experiments, for a fair comparison of the algorithms and to reflect scalability requirements, SA-IFIM is run with only 5K transactions loaded into main memory at a time.

3.1 Different support analysis

In Fig. 2, the relative performance of SA-IFIM, IFIM, and EMASK is compared on two different data sets, T25I4D100Kd10K (sparse) and T40I10D100Kd10K (dense), with respect to various minimum supports. As shown in Fig. 2, SA-IFIM leads to a prominent performance improvement.
Specifically, on the sparse data set (T25I4D100Kd10K), IFIM is close to EMASK, and SA-IFIM is orders of magnitude faster than both; on the dense data set (T40I10D100Kd10K), IFIM is faster than EMASK, but SA-IFIM also outperforms IFIM, and the margin grows as the minimum support decreases.

3.2 Effect of the update size

Experiments were run on two data sets, T25I4D100Kdm and T40I10D100Kdm, and the results are shown in Fig. 3. As expected, when the same number of transactions are deleted and added, the time of rerunning EMASK remains constant, whereas that of IFIM increases sharply and quickly surpasses EMASK. In Fig. 3, the execution time of SA-IFIM is much less than that of EMASK. SA-IFIM still significantly outperforms EMASK, even when the update size is very large.

Fig. 2. Extensive analysis for different support: (a) T25I4D100Kd10K; (b) T40I10D100Kd10K.

Fig. 3. Different updating tuples analysis: (a) T25I4D100Kdm (s = 0.6%); (b) T40I10D100Kdm (s = 1.25%).

3.3 Scale up performance

Finally, to assess the scalability of the algorithm SA-IFIM, two experiments, T25I4Dmd(m/10) at $s = 0.6\%$ and T40I10Dmd(m/10) at $s = 1.25\%$, were conducted to examine the scale up performance as the size of the mined data set grows. The scale up results for the two data sets are given in Fig. 4, which shows the impact of $|D|$ and $d$ on the algorithms SA-IFIM and EMASK. In the experiments, the size of the update database is 10% of the original database, and the size $m$ of the transaction database was increased from 100K to 1000K. As shown in Fig. 4, EMASK is very sensitive to the number of updated tuples but SA-IFIM is not, and the execution time of SA-IFIM increases linearly with the database size. This shows that the algorithm scales well and can be applied to very large databases.

Fig. 4. Scale up performance analysis: (a) T25I4Dmd(m/10) (s = 0.6%); (b) T40I10Dmd(m/10) (s = 1.25%).

4 Conclusions

In this paper, we explore the issue of frequent itemset mining in dynamically updated distorted databases. We first develop an efficient incremental updating computation method to quickly reconstruct an itemset's support. Through the introduction of the supporting aggregate represented with a bit vector, the databases are transformed into representations that are more accessible to and more easily processed by a computer, so that support counts can be computed efficiently. The experiments show that SA-IFIM significantly outperforms rerunning EMASK on the whole updated database, and also has an advantage over incremental algorithms based only on EMASK.

References

1. Agrawal, R., and Srikant, R.: Privacy-preserving data mining. In: Proceedings of SIGMOD. (2000) 439-450
2. Rizvi, S., and Haritsa, J.: Maintaining data privacy in association rule mining. In: Proceedings of VLDB. (2002) 682-693
3. Agrawal, S., Krishnan, V., and Haritsa, J.: On addressing efficiency concerns in privacy-preserving mining. In: Proceedings of DASFAA. (2004) 113-124
4. Xu, C., Wang, J., Dan, H., and Pan, Y.: An improved EMASK algorithm for privacy-preserving frequent pattern mining. In: Proceedings of CIS. (2005) 752-757
5. Cheung, D., Han, J., Ng, V., and Wong, C.: Maintenance of discovered association rules in large databases: An incremental updating technique. In: Proceedings of ICDE. (1996) 104-114
6. Cheung, D., Lee, S., and Kao, B.: A general incremental technique for updating discovered association rules. In: Proceedings of DASFAA. (1997) 106-114
7. Wang, J., Xu, C., and Pan, Y.: An Incremental Algorithm for Mining Privacy-Preserving Frequent Itemsets. In: Proceedings of ICMLC. (2006)
8. Agrawal, R., and Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of VLDB. (1994) 487-499
9. http://dsl.serc.iisc.ernet.in/projects/software/software.html