Privacy Preserving Frequent Itemset Mining Using SRD Technique in Retail Analysis


S. Vimala, D. Kerana Hanirex, K. P. Kaliyamurthie
CSE Department, Bharath University, Chennai, Tamil Nadu

Abstract - Frequent itemset mining is one of the essential problems in data mining. The proposed algorithm, called the Privacy Preserving FP algorithm, provides not only high data utility and a high degree of privacy but also high time efficiency. The algorithm consists of a preprocessing phase and a mining phase. In the preprocessing phase, a splitting method is used to transform the database to improve the utility-privacy tradeoff. For a given database, the preprocessing phase needs to be performed only once. In the mining phase, to compensate for the information loss caused by transaction splitting, a run-time calculation method is devised to estimate the actual support of itemsets in the original database. In addition, by leveraging the downward closure property, a dynamic reduction method is used to dynamically reduce the amount of noise added to guarantee privacy during the mining process. The performance can be evaluated on databases that contain long transactions, in terms of scalability and efficiency.

Keywords: Frequent Itemset Mining, Splitting Method, Runtime Calculation, Dynamic Reduction.

I. INTRODUCTION

Data mining, the extraction of hidden predictive information from large databases, is a technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools can answer questions that were conventionally too time-consuming to resolve. They search databases for unknown patterns, discovering predictive information that experts may fail to see because it lies outside their expectations. Implementing these techniques on current software and hardware platforms increases the value of existing information resources, and the techniques can be integrated with new products and systems as they are brought online.

Frequent itemset mining (FIM) is one of the most fundamental problems in data mining. It has practical impact in a wide range of application areas such as text mining and Web mining. Consider a database in which each transaction contains a set of items; FIM tries to find itemsets that occur in transactions more frequently than a given threshold.

Several algorithms have been proposed for mining frequent itemsets. Apriori and FP-growth are the two most important ones. In particular, Apriori is a breadth-first search, candidate set generation-and-test algorithm. It needs l database scans if the maximal length of frequent itemsets is l. In contrast, FP-growth is a depth-first search algorithm that requires no candidate generation. Compared with Apriori, FP-growth performs only two database scans, which makes it faster than Apriori. This feature of FP-growth motivates us to design a privacy preserving FIM algorithm based on FP-growth.

In this project, a privacy preserving FIM algorithm that provides high data utility, a high degree of privacy and high time efficiency is proposed. Existing work presents an Apriori-based private FIM algorithm that enforces a limit on transaction length by truncating transactions. To address the challenges faced by existing work, a privacy preserving FP-growth (PFP-growth) algorithm, consisting of a preprocessing stage and a mining stage, is proposed. In the preprocessing stage, the database is transformed to limit the length of transactions. To enforce such a limit, long transactions should be split rather than truncated. That is, if a transaction has more items than the limit, it is divided into multiple subsets, each of which is guaranteed to be under the limit. To preserve more frequency information in the subsets, a graph-based approach is proposed to reveal the correlation of items within transactions and to use this correlation to guide the splitting process.

In the mining phase, given the transformed database and a user-specified threshold, frequent itemsets are discovered. Despite its potential advantages, transaction splitting may cause loss of frequency information. A run-time calculation method is used to offset this loss. In particular, given the noisy support of an itemset in the database transformed by transaction splitting, the method first estimates its actual support in the transformed database and then computes its actual support in the original database. In addition, by leveraging the downward closure property (i.e., any superset of an infrequent itemset is infrequent), a dynamic reduction method is used to reduce the amount of noise added during mining.

II. LITERATURE SURVEY

A large number of studies have addressed the frequent itemset mining problem from different aspects. The Apriori algorithm [2] was proposed by R. Agrawal and R. Srikant for finding frequent itemsets. Apriori uses a generate-and-test approach: it generates candidate itemsets and tests whether they are frequent. The algorithm terminates when no more candidate itemsets can be constructed for the next round. It scans the database as many times as the length of the largest frequent itemset, so its performance decreases considerably when the largest frequent itemset is long.

The FP-growth (frequent pattern growth) algorithm [3] generates frequent patterns in two steps: construction of the FP-tree, and generation of frequent patterns from the FP-tree. The FP-tree, an extended prefix tree structure, stores the database in a compact form. FP-growth uses a divide-and-conquer technique to decompose both the mining task and the database. It overcomes the two disadvantages of Apriori: it requires only two database scans and generates no candidates, so it is faster than Apriori. It is more effective on dense databases than on sparse ones. Its major cost is the recursive construction of FP-trees.

To overcome the memory problem for large databases that cannot fit into main memory, the Partitioning algorithm [4] divides the database into n parts, because small parts of the database fit easily into main memory.

The direct hashing and pruning (DHP) algorithm [5] uses a hash table structure. It reduces the number of candidates Ck in the early passes (k > 1) and the size of the database. In DHP, support is counted by mapping the items from the candidate list into buckets. When a new itemset occurs, the algorithm checks whether it has been seen earlier; if so, it increases the bucket count, otherwise it inserts the itemset into a new bucket. At the end, the buckets whose support count is less than the minimum support are deleted from the candidate set.

In the Sampling algorithm [6], a random sample is chosen so that it fits in main memory, and frequent patterns are mined from this sample. This removes I/O overhead by checking frequencies on a sample rather than on the complete database.

The Eclat algorithm [7,8] uses a depth-first approach with set intersection and a vertical data format. Each item is stored together with its cover (also called its tid list). The support count of an itemset X can be computed by intersecting the tid lists of any two subsets Y and Z of X such that Y U Z = X.

For mining maximal frequent itemsets, Lin and Kedem [9] presented an approach that combines the top-down and bottom-up approaches; it reduces the difficulty of generating maximal frequent itemsets. The bottom-up approach starts from 1-itemsets, moves one level up in each iteration and proceeds up to n-itemsets, like the Apriori algorithm. The top-down approach starts from n-itemsets, moves many levels down in each iteration and proceeds down to 1-itemsets. Both approaches individually identify the maximal frequent itemsets by examining their candidates.

In [11], the authors proposed a genetic algorithm based approach for finding frequent itemsets. In [12], the authors presented a TDTR approach for mining frequent itemsets. This approach reduces the number of transactions from the original database based on the minimum threshold value, thus improving performance.

III. PRELIMINARIES

3.1 Frequent Itemset Mining

Given the alphabet I = {i_1, ..., i_n}, a transaction t is a subset of I and a transaction database D is a multiset of transactions. Each transaction represents an individual's record. Table 1 shows a simple transaction database. A non-empty set X is called an itemset. The length of an itemset is the number of items in it. An itemset is called a k-itemset if it contains k items. A transaction t contains an itemset X if X is a subset of t. The support of itemset X is the number of transactions in the database that contain X. An itemset is frequent if its support is not less than the user-specified minimum support threshold. Given a transaction database and a user-specified minimum support threshold, the goal of FIM is to find the complete set of frequent itemsets.

Table 1: A Simple Transaction Database

TID   Items
100   f, a, b, c
200   b, c, h
300   e, f, a, b, c
400   b, c, d, h
500   a, g
600   f, a, g

3.2 FP-Growth Algorithm

FP-growth is a partitioning-based, depth-first search algorithm. It adopts a divide-and-conquer approach, decomposing the mining task into many smaller tasks of finding frequent itemsets in conditional pattern bases. A conditional pattern base is a sub-database that consists of the itemsets co-occurring with the prefix itemset. To generate conditional pattern bases efficiently, FP-growth uses two data structures: the header table and the FP-tree. The header table stores items and their supports. In the FP-tree, each branch represents an itemset and each node has a counter. In the header table, each item also holds the head of a list that links all occurrences of the same item in the FP-tree. For example, for the database shown in Table 1, the constructed header table and FP-tree are illustrated in Fig. 1.
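The definitions of Section 3.1 and the header table of Section 3.2 can be checked against Table 1 with a small sketch. This is a brute-force enumeration, not FP-growth, and it makes assumptions that are not in the paper: the minimum support threshold of 3, the alphabetical tie-breaking in the header table, and the names `support`, `frequent_itemsets` and `header_table` are all chosen here for illustration.

```python
from itertools import combinations

# Table 1 from the text: TID -> set of items.
TABLE_1 = {
    100: {"f", "a", "b", "c"},
    200: {"b", "c", "h"},
    300: {"e", "f", "a", "b", "c"},
    400: {"b", "c", "d", "h"},
    500: {"a", "g"},
    600: {"f", "a", "g"},
}


def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for items in db.values() if wanted <= items)


def frequent_itemsets(db, min_support):
    """Brute-force, level-by-level enumeration (illustrative only)."""
    alphabet = sorted(set().union(*db.values()))
    frequent = {}
    for k in range(1, len(alphabet) + 1):
        level = {}
        for candidate in combinations(alphabet, k):
            s = support(candidate, db)
            if s >= min_support:
                level[frozenset(candidate)] = s
        if not level:
            # Downward closure: no frequent k-itemset means
            # no frequent (k+1)-itemset either.
            break
        frequent.update(level)
    return frequent


def header_table(db, min_support):
    """Frequent items with supports, sorted by descending support.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    alphabet = set().union(*db.values())
    counts = {item: support({item}, db) for item in alphabet}
    kept = [(item, c) for item, c in counts.items() if c >= min_support]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


MIN_SUPPORT = 3  # hypothetical threshold, not taken from the paper
result = frequent_itemsets(TABLE_1, MIN_SUPPORT)
table = header_table(TABLE_1, MIN_SUPPORT)
```

With this assumed threshold, the only frequent itemsets of length greater than one in Table 1 are {b, c} and {a, f}, and the header table holds the four items a, b, c and f.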

Fig. 1: The Header Table and FP-tree for Table 1

After that, based on the constructed header table HT and FP-tree FPtree, FP-growth generates the conditional pattern base of every frequent item. Specifically, for the k-th item i_k in the header table HT, all branches that contain item i_k are found by following the linked list starting at i_k in HT. The portion of these branches from i_k to the root forms i_k's conditional pattern base D_ik. Then, for the first (k-1) items in HT, FP-growth computes their supports in D_ik and determines the frequent items in D_ik. For each frequent item i in D_ik, itemset {i, i_k} is a frequent 2-itemset in the original database. Next, based on the frequent items found in D_ik, FP-growth generates the header table HT_ik and the FP-tree FPtree_ik for D_ik. The FP-tree constructed from D_ik is called i_k's conditional FP-tree. By using header table HT_ik and conditional FP-tree FPtree_ik, FP-growth progressively grows each generated frequent 2-itemset by producing and mining its conditional pattern base. The above procedure is applied recursively until no conditional pattern base can be generated.

IV. KEY METHODS

4.1 Splitting Method

A graph-based approach is proposed to reveal the correlation of items and to use the discovered correlation to split transactions. First, an undirected weighted graph is constructed from the database. Each item i is treated as a vertex v_i. An edge e connects two vertices v_i and v_j iff the support of itemset {i, j} is larger than zero. The weight of edge e = (v_i, v_j) is the support of itemset {i, j}. For example, Fig. 2 illustrates the undirected weighted graph constructed for the database shown in Table 1.

Fig. 2: Undirected Weighted Graph for Table 1

After constructing the graph, the Louvain method [13] is used to identify the communities in the graph, and the structure of the communities is used to reflect the correlation of items. The motivation behind this approach is as follows. There is a connection between the frequent itemsets and the communities detected from the graph. By the downward closure property, any subset (e.g., any 2-itemset) of a frequent itemset is frequent. Thus, the items in a frequent itemset are likely to belong to the same community. In turn, the vertices (i.e., items) in the same community are more likely to produce frequent itemsets.

The Louvain method was chosen because it provides good results and has low time complexity [13]. It consists of two steps. In the first step, it assigns a different community to each vertex. Then, to maximize the gain in modularity, which measures the quality of communities, it greedily moves one vertex at a time from its original community to an adjacent community. In the second step, it rebuilds the graph with communities as vertices. These two steps are repeated iteratively until a maximum of modularity is attained.

From the communities detected by the Louvain method, a correlation tree structure named CR-tree is constructed. It is used to measure the correlation of items. The nodes in each level of the CR-tree are the intermediate communities found in each iteration, and the height of the tree is determined by the number of iterations. A parent node denotes the community that is the union of the communities denoted by its children. For example, for the graph in Fig. 2, the CR-tree constructed from the intermediate communities of the Louvain method is shown in Fig. 3.

Fig. 3: CR-tree for Table 1

The correlation of two items is measured by the shortest path length between the leaf nodes containing these two items. The motivation behind this measure is the following observation. In each iteration of the Louvain method, densely connected vertices are greedily placed in one community.
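The weighted graph on which the Louvain method operates can be built directly from the transactions. The sketch below covers only the graph construction step of Section 4.1 (vertices are items, edge weight is the support of the 2-itemset), using the exact supports from Table 1; community detection and the CR-tree are not implemented here. The function name `build_item_graph` and the representation of edges as sorted item pairs are assumptions of this sketch.

```python
from collections import Counter
from itertools import combinations

# Transactions of Table 1 (TIDs omitted, they are not needed here).
TRANSACTIONS = [
    {"f", "a", "b", "c"},
    {"b", "c", "h"},
    {"e", "f", "a", "b", "c"},
    {"b", "c", "d", "h"},
    {"a", "g"},
    {"f", "a", "g"},
]


def build_item_graph(transactions):
    """Return {(i, j): weight} with i < j.

    An edge exists iff items i and j co-occur in at least one
    transaction; its weight is the support of itemset {i, j}.
    """
    edge_weight = Counter()
    for items in transactions:
        for i, j in combinations(sorted(items), 2):
            edge_weight[(i, j)] += 1
    return edge_weight


graph = build_item_graph(TRANSACTIONS)

# List the edges from heaviest to lightest.
for (i, j), w in sorted(graph.items(), key=lambda e: (-e[1], e[0])):
    print(f"{i} - {j}: {w}")
```

In the preprocessing algorithm of Section 5.1 the graph is built from noisy 2-itemset supports rather than exact ones, so the weights used there would differ from the ones computed in this sketch.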
Items with stronger correlation are more densely connected in the graph and therefore tend to be moved into the same community. After constructing the CR-tree CT, it can be used to split transactions.

4.2 Run-time Calculation

Despite its potential advantages, transaction splitting might cause information loss. This information loss comes from two sources. Suppose a transaction t = {a, b, c, d} is divided into t1 = {a, b} and t2 = {c, d} with weights w1 and w2 respectively. On the one hand, assigning weights makes the supports of itemsets {a, b} and {c, d} decrease from 1 to w1 and w2. On the other hand, splitting t makes the support of some itemsets, such as {a, c}, decrease from 1 to 0. To offset the information loss caused by transaction splitting, the run-time calculation method is used. Based on the noisy support of an itemset in the transformed database, the method 1) first estimates its actual support in the transformed database and 2) then computes its actual support in the original database. For each itemset, its average support is estimated to determine whether it is frequent, and its maximal support is estimated to decide whether to use it to generate candidate frequent itemsets.

4.3 Dynamic Reduction

The main idea is to leverage the downward closure property (i.e., the supersets of an infrequent itemset are infrequent) and to dynamically reduce the sensitivity of support computations by decreasing the upper bound on the number of support computations.

V. PRIVACY PRESERVING FP-GROWTH ALGORITHM

The Privacy Preserving FP-growth algorithm comprises two stages. In the first stage, known as preprocessing, statistical information is extracted from the original database and the smart splitting method is applied to transform the database. Note that, for a given database, the preprocessing phase is performed only once. In the mining phase, frequent itemsets are found for a given threshold. The run-time calculation and dynamic reduction methods are used in this phase to improve the quality of the results.
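The two sources of information loss described in Section 4.2 can be reproduced with a few lines of arithmetic. The sketch below assumes that each sub-transaction receives weight 1/|ST|, as in the preprocessing algorithm, so w1 = w2 = 1/2 for the example transaction; the function name `weighted_support` and the list-of-pairs database representation are chosen here for illustration and do not come from the paper.

```python
def weighted_support(itemset, weighted_db):
    """Sum of the weights of the transactions that contain `itemset`."""
    wanted = set(itemset)
    return sum(weight for items, weight in weighted_db if wanted <= items)


# Original database: one transaction t = {a, b, c, d} with weight 1.
original = [({"a", "b", "c", "d"}, 1.0)]

# After splitting t into t1 = {a, b} and t2 = {c, d}, each of the
# |ST| = 2 subsets gets weight 1/|ST| = 0.5.
split = [({"a", "b"}, 0.5), ({"c", "d"}, 0.5)]

for itemset in ({"a", "b"}, {"c", "d"}, {"a", "c"}):
    before = weighted_support(itemset, original)
    after = weighted_support(itemset, split)
    print(sorted(itemset), "before:", before, "after:", after)
```

The supports of {a, b} and {c, d} drop from 1 to their weights, while {a, c}, whose items end up in different subsets, drops to 0. The run-time calculation method is meant to compensate for both effects when estimating supports in the original database.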
The total privacy budget ε is divided into five portions: ε1 is used to compute the maximal length constraint, ε2 is used to estimate the maximal length of frequent itemsets, ε3 is used to reveal the correlation of items within transactions, ε4 is used to compute the µ-vectors of itemsets, and ε5 is used for the support computations.

5.1 Preprocessing Algorithm

Input: Original database D; Percentage n; Privacy budget ε1, ε2, ε3;
Output: Transformed database D';

Pseudo code:

    get α, the noisy number of transactions with different lengths, using ε1;
    get maximal length constraint Lm based on α and n;
    get β, the noisy maximal support of itemsets of different lengths, using ε2;
    compute Z as an r × n matrix using the µ-vectors of itemsets;
    compute D1 = enforce length constraint Lm on D by random truncating;
    Set2 = compute the noisy support of all 2-itemsets in D1 using ε3;
    create an undirected weighted graph G based on Set2;
    CR-tree T = Louvain(G, Lm);
    D' = Ø;
    for each transaction t in D do
        if |t| > Lm then
            SubTransactions ST = Split_One_Transaction(t, T, Lm);
            add each subset in ST with weight 1/|ST| into D';
        else
            add transaction t into D';
    return D';

5.2 Mining Algorithm

Input: Transformed database D'; Threshold λ; Privacy budget ε4, ε5; Maximal length constraint Lm; Array β; Matrix Z;
Output: Frequent itemsets F;

Pseudo code:

    Lf = estimate maximal length of frequent itemsets based on β and λ;
    for i from 1 to Lf do
        {z_i} = get noisy result of row i in Z using ε4/Lf;
    F = Ø; HT = Ø; ε' = ε5/Lf;
    for each item c in the alphabet do
        c.sup_n = c.sup + Lap(Lm/ε');
        c.sup_m = max_supp(c.sup_n, 1);
        c.sup_a = avg_supp(c.sup_n, 1);
        if c.sup_m >= λ then insert(c, HT);
        if c.sup_a >= λ then insert(c, F);
    initialize an up-array using HT;
    sort items in HT in descending order of estimated maximal support;
    generate FP-tree FPtree based on HT;
    for j decreasing from |HT| to 2 do
        item c_j = the j-th item in HT;
        List_cj = copy the first (j-1) items in HT;
        D_cj = generate conditional pattern base of c_j using FPtree, List_cj;
        F' = Mining_Conditional_Pattern_Base(List_cj, D_cj, c_j, ε', λ, up-array);
        F += F';
    return F;

VI. CONCLUSION

In this paper, a privacy preserving FP-growth algorithm has been proposed, which consists of a preprocessing stage and a mining stage. In the preprocessing stage, to improve the utility-privacy tradeoff, a new splitting method is used to transform the database. In the mining stage, a run-time calculation method is proposed to offset the loss of information caused by transaction splitting. Moreover, by leveraging the downward closure property, a dynamic reduction method is used to dynamically reduce the amount of noise added to guarantee privacy during the mining process. Analysis and extensive experiments on real datasets are expected to show that the Privacy Preserving FP-growth algorithm is time-efficient and can achieve both good utility and good privacy.

REFERENCES

[1] Sen Su, Shengzhi Xu, Xiang Cheng, Zhengyi Li, and Fangchun Yang, "Differentially Private Frequent Itemset Mining via Transaction Splitting," IEEE Transactions on Knowledge and Data Engineering, Vol. 27, No. 7, July.
[2] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proc. 20th Int. Conf. Very Large Data Bases, 1994.
[3] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," in Proc. ACM SIGMOD Int. Conf. Manage. Data, pp. 1-12, 2000.
[4] A. Savasere, E. Omiecinski, and S. Navathe, "An efficient algorithm for mining association rules in large databases," in Proc. Int'l Conf. Very Large Data Bases (VLDB).
[5] J. S. Park, M. S. Chen, and P. S. Yu, "An effective hash-based algorithm for mining association rules," in Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD).
[6] H. Toivonen, "Sampling large databases for association rules," in Proc. Int'l Conf. Very Large Data Bases (VLDB).
[7] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, "New Algorithms for Fast Discovery of Association Rules," in Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining (KDD 97), AAAI Press, Menlo Park, CA, USA.
[8] C. Borgelt, "Efficient Implementations of Apriori and Eclat," in Proc. 1st IEEE ICDM Workshop on Frequent Item Set Mining Implementations (FIMI 2003, Melbourne, FL), CEUR Workshop Proceedings 90, Aachen, Germany.
[9] D. Lin and Z. M. Kedem, "Pincer-Search: An Efficient Algorithm for Discovering the Maximum Frequent Set," IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 3.
[10] W. K. Wong, D. W. Cheung, E. Hung, B. Kao, and N. Mamoulis, "An audit environment for outsourcing of frequent itemset mining," Proc. VLDB Endowment, vol. 2, no. 1.
[11] D. Kerana Hanirex and K. P. Kaliyamurthie, "Mining Frequent Itemsets Using Genetic Algorithm," Middle-East Journal of Scientific Research 19 (6), 2014, IDOSI Publications.
[12] D. Kerana Hanirex and K. P. Kaliyamurthie, "An Adaptive Transaction Reduction Approach For Mining Frequent Itemsets: A Comparative Study On Dengue Virus Type1," Int J Pharm Bio Sci, 2015 April; 6(2): (B).
[13] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast unfolding of communities in large networks," J. Statist. Mech.: Theory Experiment, vol. 10, p. P10008.
