Accelerating Closed Frequent Itemset Mining by Elimination of Null Transactions


Binesh Nair (1), Amiya Kumar Tripathy (2)
(1) SIES College of Science, Sion (West), Mumbai University, India
(2) CSRE, Indian Institute of Technology Bombay, Mumbai, India

ABSTRACT
The mining of frequent itemsets is often challenged by the length of the patterns mined and by the number of transactions considered in the mining process. Another acute challenge to the performance of any association rule mining algorithm is the presence of null transactions. This work proposes a closed frequent itemset mining algorithm, the Closed Frequent Itemset Mining and Pruning (CFIM-P) algorithm, which uses the sub-itemset pruning strategy. CFIM-P attempts to eliminate redundant patterns by pruning closed frequent sub-itemsets. It also attempts to eliminate null transactions by using the Vertical Data Format representation when finding the frequent itemsets.

Keywords
Data Mining, Frequent Pattern Growth (FP) tree, Frequent Itemsets, Closed Itemsets, Frequent Patterns

1. INTRODUCTION
A massive volume of data is available, yet the world is still starving for knowledge [1]. With the exponential increase in the use of the Internet, cost-effective storage mechanisms, higher data processing standards, etc., the data being generated is beyond measure [2, 3, 4]. Although data is abundant, comparatively little knowledge is being generated from these huge volumes of data. Knowledge is crucial in making complex decisions related to business, emergencies, calamities, etc. Frequent itemset mining is an advancing area of research in the domain of data mining. It leads to the discovery of associations and correlations among items in large transactional or relational data sets.
With massive amounts of data being constantly collected and stored, many industries are becoming interested in mining such patterns from their databases. The discovery of interesting correlation relationships among huge amounts of business transaction records can help in many business decision-making processes, such as catalog design, cross-marketing, and customer shopping behavior analysis [1]. A typical example of frequent itemset mining is market basket analysis [5]. This process analyses customer buying habits by finding associations between the different items that customers place in their shopping baskets. It helps in finding customer buying patterns. The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers [1]. Many algorithms in the literature mine frequent patterns in an efficient and scalable manner [7, 8, 9]. The most basic is the Apriori algorithm, which mines frequent itemsets for Boolean association rules. Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets [1]. Most frequent pattern mining algorithms are based on Apriori [8]. Another popular approach, which mines frequent itemsets without candidate generation, is the Frequent Pattern Growth (FP-growth) algorithm, which adopts a divide-and-conquer strategy. FP-growth offers better performance than Apriori because it does not depend on candidate generation and because the database is fully scanned just twice. Thus, it is more efficient and scalable than Apriori. The problem of mining frequent patterns in databases is transformed into that of mining the FP-tree [1].
Although the FP-tree is scalable and efficient, and many contemporary algorithms are based on it, its performance deteriorates as the length and number of the patterns increase, with most of the patterns being redundant [2, 3, 4, 9, 10, 11]. Because the FP-tree is a tree-based structure that resides in main memory, it becomes difficult to accommodate a very deep, highly branched tree. Secondly, even though the FP-tree method scans the database just twice, each scan covers the entire set of transactions, including the null transactions, although they do not contain any relevant information; this in turn hampers performance. The proposed work attempts to overcome these drawbacks of the FP-tree. Firstly, the CFIM-P algorithm, which is based on closed frequent itemset mining, is introduced. CFIM-P eliminates the redundant patterns, thereby attempting to minimize the depth of the tree without losing any critical information. Also, before the mining process commences, the proposed framework uses the Vertical Data Representation technique for frequent itemsets to track the null transactions and filter them out of the subsequent mining process. Thus, mining is restricted to the relevant set of transactions, saving time and cost. The paper is organized as follows. Section 2 briefly describes related works, Section 3 presents the methodology of the proposed framework, and Section 4 reports the experimental observations.

2. RELATED WORKS
Many algorithms for mining frequent itemsets from transactional databases exist in the data mining literature. Many of them are Apriori based [5, 11, 12] and depend on the generate-and-test paradigm. That is, they find frequent itemsets by first generating candidates and then checking their support (i.e., their occurrences) against the transaction database. Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets [1]. For this reason, Apriori is not scalable and is impractical for many real-time scenarios. Another popular approach, which mines frequent itemsets without candidate generation, is the Frequent Pattern Growth (FP-growth) algorithm, which adopts a divide-and-conquer strategy as follows. First, it compresses the database representing frequent items into a frequent-pattern tree, or FP-tree, which retains the itemset association information. It then divides the compressed database into a set of conditional databases (a special kind of projected database), each associated with one frequent item or pattern fragment, and mines each such database separately [2, 3, 4, 9, 10, 11, 13]. FP-growth offers better performance than Apriori because it does not depend on candidate generation. Also, the database is fully scanned just twice [1]. However, the FP-tree algorithm does not drop the so-called null transactions for the subsequent scanning of conditional databases. Also, when the patterns are too long and redundant, it is impractical to construct a main-memory-based FP-tree. Algorithms based on mining maximal frequent itemsets perform better than FP-tree based algorithms, since they avoid redundant patterns [6].
However, maximal frequent itemset mining does not give complete information on the frequent itemsets, unlike algorithms based on closed frequent itemset mining [1, 8]. Such algorithms also consider null transactions for mining, which is avoidable. An efficient algorithm for discovering all frequent patterns makes use of bitmaps to represent the transactions that contain a particular itemset [14]. However, the method still requires considering all the transactions in a given transaction dataset. Stream-based algorithms use an FP-tree to represent all frequent itemsets, which is obtained by scanning all transactions, including null transactions [2, 3, 4, 11]. Transactions that contain just a single itemset can be left out even in stream data, since they cannot help in representing any pattern. Zhun et al. (2008) proposed a modified FP-tree, which is likewise built by scanning every transaction, including null transactions [9]. Takeaki et al. (2004) proposed an efficient means to mine closed frequent itemsets by constructing a tree that consists only of closed frequent itemsets, but it still considers all transactions when finding these closed frequent itemsets [7]. Saravanabhavan and Parvathi (2011), in mining utility itemsets, relied upon the FP-tree [15]. This approach, however, requires all transactions to be considered for mining. Null transactions composed of just a 1-itemset could be avoided for enhanced performance, since they convey no interesting patterns. Agrawal and Srikant (1994) proposed a couple of algorithms which require scanning of the database just once [8]. Since these are based on Apriori, they are less scalable.

3. METHODOLOGY OF THE PROPOSED FRAMEWORK
Let $I = \{I_1, I_2, \ldots, I_m\}$ be a set of items. Let D, the task-relevant data, be a set of database transactions where each transaction T is a set of items such that $T \subseteq I$. Each transaction is associated with an identifier, called TID.
Let A be a set of items. A transaction T is said to contain A if and only if $A \subseteq T$. An association rule is an implication of the form $A \Rightarrow B$, where $A \subset I$, $B \subset I$, and $A \cap B = \emptyset$ [1]. The rule $A \Rightarrow B$ holds in the transaction set D with support s, where s is the percentage of transactions in D that contain $A \cup B$ (i.e., the union of sets A and B, or say both A and B). This is taken to be the probability $P(A \cup B)$. The rule $A \Rightarrow B$ has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability $P(B \mid A)$ [1]. That is,

$$\mathrm{support}(A \Rightarrow B) = P(A \cup B) \tag{1}$$

$$\mathrm{confidence}(A \Rightarrow B) = P(B \mid A) \tag{2}$$

A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The set {computer, antivirus} is a 2-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset. This is also known as the frequency, support count, or count of the itemset. Note that the itemset support defined in equation (1) is sometimes referred to as relative support, whereas the occurrence frequency is called the absolute support. If the relative support of an itemset I satisfies a prespecified minimum support threshold (i.e., the absolute support of I satisfies the corresponding minimum support count threshold), then I is a frequent itemset. The set of frequent k-itemsets is commonly denoted by $L_k$. From equation (2), we have

$$\mathrm{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)} = \frac{\mathrm{support\_count}(A \cup B)}{\mathrm{support\_count}(A)} \tag{3}$$
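Equations (1) to (3) can be illustrated with a minimal Python sketch. The four transactions and the function names below are hypothetical and chosen only for illustration (the item "printer" does not appear in the paper); exact fractions are used so the ratios are not blurred by floating-point rounding.

```python
from fractions import Fraction

# Hypothetical toy database: each transaction is a set of items.
transactions = [
    {"computer", "antivirus", "printer"},
    {"computer", "antivirus"},
    {"computer"},
    {"printer"},
]

def support_count(transactions, itemset):
    """Absolute support: number of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(transactions, antecedent, consequent):
    """Relative support of the rule antecedent => consequent, as in equation (1)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return Fraction(support_count(transactions, both), len(transactions))

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule, as in equation (3).

    Raises ZeroDivisionError if the antecedent occurs in no transaction.
    """
    both = frozenset(antecedent) | frozenset(consequent)
    return Fraction(support_count(transactions, both),
                    support_count(transactions, antecedent))

print(support(transactions, {"computer"}, {"antivirus"}))     # 1/2
print(confidence(transactions, {"computer"}, {"antivirus"}))  # 2/3
```

Here {computer, antivirus} occurs in 2 of the 4 transactions and computer occurs in 3, which gives a support of 1/2 and a confidence of 2/3.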


A major challenge in mining frequent itemsets from a large data set is that such mining often generates a huge number of itemsets satisfying the minimum support threshold (min_sup), especially when min_sup is set low. This is because, if an itemset is frequent, each of its subsets is frequent as well. A long itemset will contain a combinatorial number of shorter, frequent sub-itemsets. For example, a frequent itemset of length 100, such as $\{a_1, a_2, \ldots, a_{100}\}$, contains $\binom{100}{1} = 100$ frequent 1-itemsets: $a_1, a_2, \ldots, a_{100}$; $\binom{100}{2}$ frequent 2-itemsets: $(a_1, a_2), (a_1, a_3), \ldots, (a_{99}, a_{100})$; and so on [1]. The total number of frequent itemsets that it contains is thus

$$\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \tag{4}$$

This is too huge a number of itemsets for any computer to compute or store. To overcome this difficulty, the concepts of closed frequent itemsets and maximal frequent itemsets have been introduced [1]. An itemset X is closed in a data set S if there exists no proper super-itemset Y such that Y has the same support count as X. An itemset X is a closed frequent itemset in S if X is both closed and frequent in S. Suppose that a transaction database contains two transactions: $\{(a_1, a_2, \ldots, a_{100}), (a_1, a_2, \ldots, a_{50})\}$. Let the minimum support count threshold be min_sup = 1. We find two closed frequent itemsets and their support counts, that is, $C = \{\{a_1, a_2, \ldots, a_{100}\}: 1;\ \{a_1, a_2, \ldots, a_{50}\}: 2\}$. Compare this with equation (4), where we determined that there are $2^{100} - 1$ frequent itemsets, which is too huge a set to be enumerated.

3.1 Methodology
As shown in Fig. 3.1, the proposed framework consists of three phases. The first phase traces the null transactions and filters them out of the subsequent mining procedures. The second phase uses the CFIM-P algorithm to find closed frequent itemsets. Finally, these closed frequent itemsets are used to form patterns.

Table 3.1: Transaction Database (Modified from Han and Kamber, 2006).
TID     List of item-ids
T100    I1, I2, I5
T200    I2
T300    I2, I3
T400    I1, I2, I4
T500    I1
T600    I2
T700    I2, I1, I5
T800    I1, I5
T900    I6, I7

Assuming that the minimum support count is set to 2, the FP-tree method would consider all the transactions while mining the frequent itemsets. However, the proposed work treats transactions T200, T500 and T600 as null transactions, since they contain just 1-itemsets (TID stands for Transaction ID). Such single itemsets give no information for association, for the simple reason that they are not associated with any other itemset in that particular transaction.

Screening Null Transactions
Null transactions may outnumber the non-null transactions in any real-life merchandise data. For example, in a grocery store, a customer may buy neither coffee powder nor washing soap, even if these are assumed to be two of the frequent itemsets. Null transactions also influence the various association and correlation measures [1]. Thus, the proposed framework attempts to eliminate the null transactions, thereby reducing the processing time for finding frequent k-itemsets. Finding null transactions and eliminating them from further processing is the initial part of the proposed framework. Consider, for instance, an electronic shop that has 100 transactions to begin with, of which 40% are null transactions (assuming the best case). The FP-tree method of mining, or any other related method, would scan all 100 transactions, while the proposed framework considers just the 60 valid transactions that remain after screening out the 40 null transactions. This saves a lot of precious computation time. Null transactions are found by using the Vertical Data Format of representation. In this format, data is represented in the {item-set: Trans-ID} format.
Thus, with this representation it is possible to find the null transactions by finding those transactions that do not appear against any frequent single-itemset.

Fig. 3.1: Work flow diagram for mining frequent patterns
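The screening step can be sketched in Python as follows, using the transactions of Table 3.1 and a minimum support count of 2. This is a minimal sketch and not the authors' Java implementation: the function names are hypothetical, and the two rules (drop single-item transactions first, then drop transactions that appear against no frequent item in the vertical format) are an assumed reading of the description in the text.

```python
from collections import defaultdict

# Table 3.1 from the text: TID -> list of item-ids.
TRANSACTIONS = {
    "T100": ["I1", "I2", "I5"],
    "T200": ["I2"],
    "T300": ["I2", "I3"],
    "T400": ["I1", "I2", "I4"],
    "T500": ["I1"],
    "T600": ["I2"],
    "T700": ["I2", "I1", "I5"],
    "T800": ["I1", "I5"],
    "T900": ["I6", "I7"],
}

def to_vertical(transactions):
    """Invert {TID: items} into the vertical data format {item: [TIDs]}."""
    vertical = defaultdict(list)
    for tid, items in transactions.items():
        for item in items:
            vertical[item].append(tid)
    return dict(vertical)

def screen_null_transactions(transactions, min_sup):
    """Return (kept, nulls, vertical) under the two assumed screening rules."""
    # Rule 1 (assumed): a transaction holding a single item is a null transaction.
    multi = {tid: items for tid, items in transactions.items()
             if len(set(items)) > 1}
    vertical = to_vertical(multi)
    # Rule 2 (assumed): a transaction listed under no frequent item is null.
    covered = set()
    for tids in vertical.values():
        if len(tids) >= min_sup:
            covered.update(tids)
    kept = {tid: items for tid, items in multi.items() if tid in covered}
    nulls = sorted(set(transactions) - set(kept))
    return kept, nulls, vertical

kept, nulls, vertical = screen_null_transactions(TRANSACTIONS, min_sup=2)
print(nulls)           # ['T200', 'T500', 'T600', 'T900']
print(vertical["I1"])  # ['T100', 'T400', 'T700', 'T800']
```

Under these assumptions, five of the nine transactions remain for the later scans, and T900 is dropped because neither I6 nor I7 reaches the minimum support count.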

Table 3.2: Transaction database of an electronic shop in Vertical Data Format

Item    Trans-IDs
I1      T100, T400, T700, T800
I2      T100, T300, T400, T700
I3      T300
I4      T400
I5      T100, T700, T800
I6      T900
I7      T900

Now consider the above Vertical Data Format representation of the same database given in table 3.1. One can notice that the null transactions containing just 1-itemsets have been removed prior to mining. Also, itemsets I3, I4, I6 and I7 do not satisfy the minimum support count of 2 and are hence excluded from mining. The resultant FP-tree formed from the dataset given in table 3.2 is shown below.

Fig. 3.2: FP-tree corresponding to table 3.1. Header table (item: support): I2: 4, I1: 4, I5: 3. Tree nodes: {} (root), I2:4, I1:3, I5:2, I1:1, I5:1.

It is to be noted that the proposed work will not consider transactions T200, T500, T600 and T900 when it scans the dataset (table 3.1) for the second time to construct the FP-tree, since they are null transactions.

Closed Frequent Itemset Mining-Pruning (CFIM-P) Algorithm
The proposed algorithm for finding closed frequent patterns by mining closed frequent itemsets (by the sub-itemset pruning strategy) is given below.

CFIM-P (FP-tree, min-sup)
    for each frequent single-itemset
        construct conditional pattern bases, b = {b_1, b_2, b_3, ..., b_n}
        for each b_i (where i = 1, 2, ..., n)
            if b_i >= min-sup and support-count(b_i) > support-count(b_j), for i > j
                insert b_i to a set of frequent patterns

The CFIM-P algorithm uses the sub-itemset pruning strategy of closed frequent itemset mining. If a frequent itemset X is a proper subset of an already found closed itemset Y and SupportCount(X) = SupportCount(Y), then X and all of X's descendants in the set enumeration tree cannot be frequent closed itemsets and thus can be pruned [1]. This helps in eliminating the redundant patterns and thereby yields refined patterns. Unlike maximal frequent itemset mining algorithms [6], however, there is no loss of information in CFIM-P.
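The sub-itemset pruning rule can be illustrated with a small brute-force Python sketch. It is not the FP-tree-based CFIM-P procedure: it enumerates candidate itemsets directly, longest first, and applies the textbook definition of closedness from Section 3 to the screened transactions of table 3.1. Because it follows that definition literally, it may report closed itemsets (for example single items) that the paper's worked example does not list. The function names are hypothetical.

```python
from itertools import combinations

# Table 3.1 after screening (T200, T500, T600 and T900 removed), per the text.
SCREENED = [
    {"I1", "I2", "I5"},  # T100
    {"I2", "I3"},        # T300
    {"I1", "I2", "I4"},  # T400
    {"I1", "I2", "I5"},  # T700
    {"I1", "I5"},        # T800
]

def support_count(transactions, itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def closed_frequent_itemsets(transactions, min_sup):
    """Brute-force closed frequent itemsets with sub-itemset pruning."""
    items = sorted(set().union(*transactions))
    frequent_items = [i for i in items
                      if support_count(transactions, frozenset([i])) >= min_sup]
    closed = {}
    # Longest candidates first, so every closed superset is known
    # before any of its subsets is examined.
    for size in range(len(frequent_items), 0, -1):
        for combo in combinations(frequent_items, size):
            candidate = frozenset(combo)
            count = support_count(transactions, candidate)
            if count < min_sup:
                continue
            # Sub-itemset pruning: skip X if an already found closed proper
            # superset Y has the same support count.
            if any(candidate < y and count == c for y, c in closed.items()):
                continue
            closed[candidate] = count
    return closed

closed = closed_frequent_itemsets(SCREENED, min_sup=2)
for itemset, count in closed.items():
    print(sorted(itemset), count)
```

With a minimum support count of 2, {I1, I2, I5} is kept with count 2 and {I1, I2} with count 3, while {I2, I5} is pruned because its superset {I1, I2, I5} has the same support count.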
The elimination of redundant patterns is done through the sub-itemset pruning strategy. That is, closed frequent itemsets that are already included in one of their closed frequent supersets are pruned. The CFIM-P algorithm mines closed frequent patterns from the previously constructed FP-tree. The frequent patterns are mined based on the minimum support count. If a mined frequent pattern is a subset of an already mined frequent superset, then the former is eliminated. The mining happens in a top-down manner, i.e., starting from the most frequent itemset and traversing through to the least frequent item. This approach helps in easy tracking of closed frequent itemsets, since the longer patterns are mined first. When a closed frequent itemset is found, it is added to a list of frequent itemsets. With reference to table 3.1 and fig. 3.2, the closed frequent patterns will be {I2, I1: 3} and {I2, I1, I5: 2}. Thus, the redundant patterns that might have been mined by the FP-tree algorithm, like {I2, I1: 2}, {I2, I5: 2} and {I1, I5: 2}, have been omitted, giving much more refined patterns.

4. EXPERIMENTAL OBSERVATIONS
4.1 Data Collection
The real-time data has been collected from a local restaurant. It consists of the set of transactions performed on one day. Each transaction has a transaction-id and an associated list of itemsets. Details pertaining to itemsets, such as price and quantity, are not within the purview of the proposed work. The average length of itemsets for any transaction in the dataset is just under 3. This dataset was chosen because it provides a good means to perform frequent pattern mining and at the same time helps to explain the effectiveness of the proposed work. A customized sample dataset has been used as a test dataset before the real-time dataset is considered. The findings based on the sample dataset are nevertheless pertinent.

4.2 Implementation
Both the CFIM-P algorithm and the FP-tree algorithm have been implemented in Java.
The database is stored in MySQL. A tree data structure has been used for constructing the FP-tree. For each algorithm, the average time taken to find frequent patterns was noted, instead of relying on any one reading, to give a realistic reading. The frequent patterns are also stored in the database. A sample output screen is given as follows.

Fig.: Sample Output Screen (Part 1)
Fig.: Sample Output Screen (Part 2)

4.3 Experimental Analysis
The experiments have been based on two datasets. The first is a sample dataset based on an electronic shop, consisting of 100 transactions and 30 itemsets. The second is a real-life dataset based on a restaurant, which consists of 192 transactions and 64 itemsets.

Table 4.3.1: % of null transactions based on respective minimum support for sample dataset

Minimum Support    % of Null Transactions
0 %                45 %
4 %                48 %
9 %                63 %
13 %               72 %
20 %               78 %

Table 4.3.2: % of null transactions based on respective minimum support for real-time dataset

Minimum Support    % of Null Transactions
0 %                32 %
4 %                37 %
7 %                41 %
10 %               54 %
14 %               55 %
18 %               55 %

Fig.: Impact of elimination of null transactions in the performance of CFIM-P algorithm, in a sample dataset.
Fig.: Impact of elimination of null transactions in the performance of CFIM-P algorithm with real-time dataset.

The two figures on the elimination of null transactions (sample and real-time datasets) show the impact of screening null transactions on the performance of the CFIM-P algorithm. With a higher minimum support count, the number of frequent itemsets decreases, which in turn results in a greater ratio of null transactions in the dataset. These null transactions are ultimately eliminated in the proposed framework, thereby boosting the performance. It can also be observed from these figures that the performance of the FP-tree algorithm lags, since it considers even null transactions for the mining process.

The pattern-distribution figures show the distribution of mined patterns across varying minimum support counts. It can be observed that CFIM-P mines fewer patterns than the FP-tree algorithm in each case, since it eliminates the redundant patterns through the closed frequent itemset mining methodology.

Fig.: Distribution of patterns in FP-tree algorithm for real-time dataset.
Fig.: Distribution of patterns in CFIM-P algorithm for the sample dataset.
Fig.: Distribution of patterns in FP-tree algorithm for sample dataset.
Fig.: Comparison of the execution times of CFIM-P and FP-tree algorithm for the sample dataset.
Fig.: Distribution of patterns in CFIM-P algorithm for real-time dataset.
Fig.: Comparison of the execution times of CFIM-P and FP-tree algorithm for the real-time dataset.

The execution-time figures give a performance comparison between CFIM-P and FP-tree. It can be observed that CFIM-P fares well compared to FP-tree. The performance of CFIM-P improves with a higher percentage of minimum support, since the latter leads to a higher percentage of null transactions.

5. CONCLUSION
Mining frequent patterns requires mining massive datasets. As the minimum support increases, the number of frequent items decreases, which in turn results in a greater ratio of null transactions in a dataset. Frequent pattern mining also encounters the issue of redundant patterns. The proposed work has attempted to resolve both these issues. From the experimental results, it has been observed that the CFIM-P algorithm performs better than the traditional FP-tree algorithm, especially for higher minimum support counts, since a higher minimum support results in a greater ratio of null transactions. Since the CFIM-P algorithm is based on closed frequent itemset mining, it also eliminates the redundant patterns, thereby giving more refined patterns than those obtained through the FP-tree. The proposed framework can therefore be more efficient than the traditional FP-tree in mining massive, real-time merchandise datasets.

REFERENCES
[1] Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", Elsevier Publication, 2nd Edition, 2006.
[2] Hui Chen, "Mining Frequent Patterns in Recent Time Window over Data Streams", 10th IEEE International Conference on High Performance Computing and Communications, September 2008, Dalian, China.
[3] Leung C.K., S. Boyu Hao, "Mining of Frequent Itemsets from Streams of Uncertain Data", Proceedings of the 2009 IEEE International Conference on Data Engineering, 29 March - April 2009, Shanghai, China.
[4] Jia-Dong Ren, Hui-Ling He, Chang-Zhen Hu, Li-Na Xu, Li-Bo Wang, "Mining Frequent Patterns based on Fading Factor in Data Streams", Proceedings of the Eighth International Conference on Machine Learning and Cybernetics, July 2009, Baoding, China, Volume 04.
[5] Liu Yongmei, Guan Yong, "Application in Market Basket Research Based on FP-Growth Algorithm", Proceedings of the 2009 WRI World Congress on Computer Science and Information Engineering, 31 March - April 2009, Los Angeles, California, USA, Volume 04.
[6] Yan Hu, Ruixue Han, "An Improved Algorithm for Mining Maximal Frequent Patterns", International Joint Conference on Artificial Intelligence, April 2009, Hainan Island, China.
[7] Takeaki Uno, Tatsuya Asai, Yuzo Uchida, Hiroki Arimura, "LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets", Proceedings of Workshop on Frequent Itemset Mining Implementations, Japan, Volume 54.
[8] Rakesh Agrawal, Ramakrishnan Srikant, "Fast Algorithms for Mining Association Rules", Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
[9] Zhun Zhou, Bingru Yang, Yunfeng Zhao, Wei Hou, "Research on Algorithms for Association Rules Mining Based on FP-tree", 2nd International Symposium on Systems and Control in Aerospace and Astronautics, December 2008, Shenzhen, China, Pages: 1-5.
[10] Chen Hong-ye, Jin Guo-ying, "Incremental FP_Growth Mining Algorithm Based on Web Information Extraction", Second International Conference on Information and Computing Science, May 2009, Manchester, UK, Volume 1.
[11] Cong-Rui Ji, Zhi-Hong Deng, "Mining Frequent Patterns without Candidate Generation", Fourth International Conference on Fuzzy Systems and Knowledge Discovery, August 2007, Haikou, China, Volume 1.
[12] Xu Yusheng, Ma Zhixin, Chen Xiaoyun, Li Lian, "Improving Frequent Patterns Mining by LFP", 4th International Conference on Wireless Communications, Networking and Mobile Computing, October 2008, Dalian, China.
[13] Show-Jane Yen, Yue-Shi Lee, Chiu-Kuang Wang, Jung-Wei Wu, "An Efficient Approach for Mining Frequent Patterns Based on Traversing a Frequent Pattern Tree", International Conference on Computer Science and Software Engineering, December 2008, Wuhan, Hubei, China, Volume 1.
[14] Fuzan Chen, Minqiang Li, Jisong Kou, "An Efficient Algorithm for Discovering all frequent patterns", Proceedings of the 2009 WRI Global Congress on Intelligent Systems, May 2009, Xiamen, China, Volume 02.
[15] C. Saravanabhavan, R.M.S. Parvathi, "Utility FP-Tree: An Efficient Approach to Mine Weighted Utility Itemsets", European Journal of Scientific Research, ISSN X, Volume 50.
