Bit Stream Mask-Search Algorithm in Frequent Itemset Mining


European Journal of Scientific Research ISSN X Vol.27 No.2 (2009), pp EuroJournals Publishing, Inc

E.Ramaraj
Director, Computer Centre, Alagappa University, Karaikudi, Tamilnadu, India

N.Venkatesan
Asst.Prof and Head, Dept of IT, Bharathiyar College of Engg and Technology, Karaikal, Pondichery, India

Abstract

Association rules in data mining are generated by identifying relationships among sets of items in a transaction database. Finding frequent itemsets is computationally the most expensive step in association rule discovery and has therefore attracted significant research attention. Although several techniques have emerged, they are all inherently dependent on memory availability. This paper describes an efficient algorithmic approach called Bit Stream Mask Search, which sorts the transaction database by transforming it to numeric attributes. In the next step, the frequent itemsets are found, the algorithms are generated and the data is hidden during processing. During the search process, Masked Itemset Processing (MIP) searches the itemsets with a low execution time. Experimental evaluations show that this approach is faster and occupies less memory space during interaction than Apriori-like and related algorithms.

Keywords: Data Mining, Association Rules, Frequent Itemsets, Apriori, BitStreamMask, MIPSearch.

1. Introduction

Data mining is used to extract knowledge automatically from large data sets. Association rules, classification and clustering are major areas of interest in data mining. The process of mining association rules [4] consists of two steps:
1. Finding the frequent itemsets in the database using support.
2. Constructing the association rules from the frequent itemsets with a specified confidence.
Frequent itemset finding [18][19] is the more expensive of the two steps, since the number of itemsets grows exponentially with the number of items.
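The two measures used in these steps can be illustrated with a minimal sketch. The function names and the small grocery-style transaction list are assumptions made for illustration only; they do not come from the paper.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical toy database: each transaction is a set of items.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

print(support({"bread", "milk"}, transactions))       # step 1: is the itemset frequent?
print(confidence({"bread"}, {"milk"}, transactions))  # step 2: is the rule strong enough?
```

An itemset is kept in step 1 when its support reaches the chosen minimum support, and a rule is kept in step 2 when its confidence reaches the chosen minimum confidence.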
A large number of algorithms to mine frequent itemsets have been developed over the years [1][2][5]. The Apriori algorithm and the FP-Growth algorithm are two key algorithms commonly used in frequent itemset mining. Apriori algorithms generate candidate itemsets from frequent itemsets. The frequency of any itemset is computed by counting its occurrences in each transaction. Many variants of the Apriori algorithm have been developed [3][10][12], such as AprioriTid, AprioriHybrid, Direct Hashing and Pruning (DHP), Dynamic Itemset Counting (DIC), the Partition algorithm, AprioriTrie, etc. FP-Growth [13] uses the FP-tree data structure to achieve a condensed representation of the database transactions and employs a divide-and-conquer approach to decompose the mining problem into a set of smaller problems. In essence, it mines all the frequent itemsets by recursively finding all frequent 1-itemsets in the conditional pattern base, which is efficiently constructed with the help of a node-link structure. A variant of FP-Growth is the H-mine algorithm.

Data structures [17] used for mining frequent itemsets are either array based or tree based. This paper presents a new array-based optimization technique called BitStreamMask and Masked Itemset Processing (MIP) Search for mining complete frequent itemsets. Its data representation, which stores sixteen elements in one process memory array location, has been compared with the available implementations of the Apriori and FP-Growth algorithms. This technique shows improved performance in mining frequent itemsets on a number of typical datasets.

Section 2 reviews previous work. Section 3 describes the new approach and its data structure. Section 4 discusses the experimental results of BitStreamMask-Search and gives a detailed performance comparison of the new approach with the other algorithms. Section 5 concludes the paper with a concise idea of future enhancements.

2. Previous Works

2.1. Apriori-Trie

Apriori [12] generates the candidate itemsets by joining the large itemsets of the previous pass and deleting those candidates that have a subset which was small in the previous pass, without considering the transactions in the database. An association rule is valid if its confidence and support are greater than or equal to the corresponding threshold values. The Apriori steps are as follows:
a) Counting all item occurrences to determine the frequent itemsets.
b) Generating the candidates.
c) Counting the support of each itemset, with a pruning process that ensures the subsets of every candidate are already known to be frequent itemsets.
d) Every subset of a frequent itemset is also frequent.
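The candidate generation and pruning in steps b) to d) can be sketched as follows. This is a minimal illustration of the general Apriori join-and-prune idea, not the implementation evaluated in the paper; the function name `generate_candidates` and the sample itemsets are assumptions.

```python
from itertools import combinations


def generate_candidates(frequent_k):
    """Build (k+1)-item candidates from a collection of frequent k-itemsets.

    Join: two k-itemsets are merged when their union has exactly k+1 items.
    Prune: a candidate is kept only if every k-item subset is itself frequent,
    which is property d) above.
    """
    frequent_k = set(frequent_k)
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) != len(a) + 1:
                continue
            subsets = (frozenset(s) for s in combinations(union, len(a)))
            if all(s in frequent_k for s in subsets):
                candidates.add(union)
    return candidates


# Hypothetical frequent 2-itemsets over numeric items.
L2 = {frozenset(p) for p in [(1, 2), (1, 3), (2, 3), (2, 4)]}
print(generate_candidates(L2))
```

In this example {1, 2, 4} and {2, 3, 4} are pruned because {1, 4} and {3, 4} are not frequent, so only {1, 2, 3} has to be counted against the database.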
Figure 2.1: Frequent item set mapping

The trie data structure used in the Apriori algorithm [14][15] is a rooted, downward-directed tree, like a hash tree. The root is defined to be at depth 0, and a node at depth d can point to nodes at depth d+1. A pointer is also called an edge or link, and it is labeled by a letter. There exists a special letter * which represents an end character. If node u points to node v, then we call u the parent of v, and v is a child node of u.
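A minimal sketch of such a trie, applied to itemsets rather than words: each edge is labeled by an item, and items are inserted in increasing order so that sets with a common prefix share a path. The class and method names are illustrative assumptions, and the `is_end` flag stands in for the special end character `*`.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # edge label (an item) -> child node one level deeper
        self.is_end = False  # plays the role of the '*' end marker


class ItemsetTrie:
    def __init__(self):
        self.root = TrieNode()  # depth 0

    def insert(self, itemset):
        node = self.root
        for item in sorted(itemset):  # fixed order gives one path per set
            node = node.children.setdefault(item, TrieNode())
        node.is_end = True

    def contains(self, itemset):
        node = self.root
        for item in sorted(itemset):
            node = node.children.get(item)
            if node is None:
                return False
        return node.is_end


trie = ItemsetTrie()
trie.insert({1, 2, 3})
trie.insert({1, 2, 4})
print(trie.contains({3, 2, 1}), trie.contains({1, 2}))
```

The two stored sets share the path 1, 2 from the root and only branch at the last edge.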

Every leaf l represents a word which is the concatenation of the letters on the path from the root to l. Note that if the first k letters are the same in two words, then the first k steps on their paths are the same as well. Tries are suitable for storing and retrieving not only words but any finite ordered sets. In this setting a link is labeled by an element of the set, and the trie contains a set if there exists a path whose links are labeled by the elements of the set, in increasing order.

2.2. FP-Growth

The FP-tree algorithm [6][18] scans the database twice. In the first scan it determines the frequent items that will be used to create the FP-tree and sorts them in frequency order. The top node of the graph is the root. The first node underneath the root is the most frequent item of each record scanned, along with a count. Similarly, many records are sorted and the most frequent items identified. The basic process involves laying out each record in frequency order and creating a node for each item under the root. As more items are added, there will be common prefixes. For instance, a record {A,B,C} has a common prefix with {A,B,D}, namely {A,B}. Nodes are not repeated, but the counts for the A and B nodes are incremented. When the C node is reached, a new node is created at the same level as C with the value D. Note that non-frequent items are ignored in the FP-tree construction. In addition, a linked list of frequent items is maintained, so that every occurrence of A is linked to the other nodes for A.

The inherent advantages of this structure are the relatively compact representation of the database and the exclusion of non-frequent items. This makes it easy to fit the FP-tree into memory and easy to scan for rule development. After construction is complete, the tree is mined for frequent patterns by:
a) Deriving a set of conditional paths. These are suffix patterns from the FP-tree.
b) Constructing a conditional FP-tree for the conditional paths.
c) Exploring the conditional tree recursively to find the frequent patterns and determine the support level for each pattern.

Note that the tree contains only frequent items, so no step is wasted on non-frequent items. In addition, since the most frequent items are near the top or root of the tree, the mining algorithm works well, but there are some limitations:
a) The database must be scanned twice.
b) Updating the database requires a complete repetition of the scan process and construction of a new tree, because the frequent items may change with the database update.
c) Lowering the minimum support level requires a complete rescan and construction of a new tree.
d) The mining algorithm is designed to work in memory and performs poorly if heavy memory paging is required.

3. Bit Stream Mask-Search Algorithms

BitStreamMask is a novel approach in which the input file is first transformed into numerical data. After this, the transaction file is compressed into an array for further processing. This approach increases the overall efficiency of the Apriori algorithm in terms of time and space complexity. The algorithms are implemented based on the following theorem.

Theorem

If $t_1, t_2, \ldots, t_n$ are n transactions, a set of transactions $t'_1, t'_2, \ldots, t'_n$ can be found such that the $t'_i$ are expressed in terms of $t_1, t_2, \ldots, t_n$ and $t'_i \cap t'_{i-1} = \emptyset$.

Proof

Choose
$t'_1 = t_1$
$t'_2 = t_2 - t_1$
$\ldots$
$t'_n = t_n - t_{n-1}$
But $t'_1 \cup t'_2 \cup \ldots \cup t'_n = t_1 \cup t_2 \cup \ldots \cup t_n$.

Claim: $t'_i \cap t'_{i-1} = \emptyset$.
Suppose $t'_i \cap t'_{i-1} \neq \emptyset$. Let $x \in t'_i$ and $x \in t'_{i-1}$.
$x \in t'_i \Rightarrow x \in t_i$ but $x \notin t_{i-1}$.
$x \in t'_{i-1} \Rightarrow x \in t_{i-1}$ but $x \notin t_{i-2}$.
So $x \in t_{i-1}$ and $x \notin t_{i-1}$, which is a contradiction. Therefore, such an x cannot exist. Hence $t'_i \cap t'_{i-1} = \emptyset$.

This theorem is applied in the following algorithms to reduce search time.

Algorithm 1: Transforming the Items as Unique Integer Values

This algorithm transforms each item in the database into a numerical value counting from (1,2,...,n). It checks whether a value is already assigned to the item and, if not, assigns a value which is the value of the previous new item + 1. The input is a text file; each item in it is given a unique number. The numerical dataset is the output.

Numerical_transform(database)
{
    for each item in db
    {
        if item already scanned
            reuse the number assigned to it
        else
            assign new number = old number + 1
    }
    return (numerical file)
}

For example, when patients' symptom data is chosen for implementation, the sample outputs are as follows.

Table 3.1: Example Medical Data set Transformation

Symptoms | Transformed Items
Fever, Cough, throat pain |
Fever, Cough, breathlessness |
Swallowing difficulty, fever, neck swelling, Breathlessness |
Cough, vomiting |
Cyanosis, noisy breading, chest retraction |
Cough, ear pain, ear discharge |
Breathlessness, nasal block, cough, noisy breading, fever |
Breathlessness, cyanosis |
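The numbering rule of Algorithm 1 can be sketched in a few lines. This is an illustrative reading of the pseudocode, not the authors' code: the function name, the in-memory list input (instead of a text file) and the case-insensitive matching of item names are all assumptions.

```python
def numerical_transform(transactions):
    """Replace each item by a unique integer, numbered 1, 2, ... in order of first appearance."""
    item_ids = {}  # item name -> assigned number
    numeric = []
    for transaction in transactions:
        row = []
        for item in transaction:
            key = item.strip().lower()  # assumption: "Fever" and "fever" are the same item
            if key not in item_ids:
                item_ids[key] = len(item_ids) + 1  # previous new number + 1
            row.append(item_ids[key])
        numeric.append(row)
    return numeric, item_ids


# Two symptom records used as sample input.
records = [
    ["Fever", "Cough", "throat pain"],
    ["Fever", "Cough", "breathlessness"],
]
numeric, item_ids = numerical_transform(records)
print(numeric)
print(item_ids)
```

Keeping the `item_ids` dictionary makes it possible to translate the mined numeric itemsets back to symptom names afterwards.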

Using the above output, the following algorithms are implemented.

Algorithm 2: BitStreamMask

This algorithm reads the transaction file generated by Algorithm 1. For each transaction it takes items 1 to n and transforms them into Bit Stream format, which optimizes the overall checking of item combinations for all itemsets (1 to n).

Input: Numerical dataset file formed by the above algorithm

// allocate memory for storing the masked information
BitStreamMask( )
{
    BitStreamMask [no of Transaction] [((Maxitem-1)/16)+1]
    for each transaction in input file
    {
        for each item in transaction
        {
            pos = (item-1)/16;
            if (item%16 == 0) then item = 16; else item = item%16;
            BitStreamMask [transaction][pos] += power(2, item-1)
        }
    }
    return (bitstreammask array)
}

Item k within a block of sixteen items contributes the value 2^(k-1), so that items 1 to 16 of a block fit in one 16-bit value; this is the convention used in the worked examples below.

Steps for BitStreamMask transformation:
Before the transformation, allocate space for the MIP array. If the number of unique items in the database is N and the number of transactions in the database is T, then the BitStreamMask array is declared as BitStreamMask [T][((N-1)/16)+1].
Step 1: read each item in the transaction, 1 to N.
Step 2: compress 16 items into one single value.

Consider a transaction having items 1, 3, 6, 12, 21, 25, 28, 33 and 36. This can be stored in the MIP array as follows:
BitStreamMask[0][0] = (1+4+32+2048) = 2085
BitStreamMask[0][1] = (16+256+2048) = 2320
BitStreamMask[0][2] = (1+8) = 9

In BitStreamMask[0][0], the items 1 to 16 are masked.
In BitStreamMask[0][1], the items 17 to 32 are masked, where 17,18,19,...,32 are taken as 1,2,3,...,16.
In BitStreamMask[0][2], the items 33 to 48 are masked, where 33,34,35,...,48 are taken as 1,2,3,...,16.
The above transformation is done for each transaction.

Normally, search algorithms explore the whole database for each combination of itemsets to gather the required itemsets. This process has several disadvantages in the form of increased search time, memory occupation, etc. MIPSearch, in contrast, picks out the required itemsets at a single glance. This technique uses a code to count the occurrences of a particular subset in itemset i.

Algorithm 3: Searching of itemset k, Masked Item Processing [MIP]

MIPSearch (i-th item combination in itemset k, minimum support)
{
    MIPSearch [((Maxitem-1)/16)+1]
    for each item in itemset i combination
    {
        pos = (item-1)/16;
        if (item%16 == 0) item = 16; else item = item%16;
        MIPSearch [pos] += power(2, item-1)
    }
    for each transaction in MIP array
    {
        if ((MIPSearch [0,1,..n] & MIP [transaction][0,1,..n]) == MIPSearch [0,1,..n])
            itemset_count++;
    }
    if (itemset_count >= minsupport)
        add itemset i
    else
        delete itemset i
}

Step 1: mask the item subset (masked_subset).
Consider the 2-item subset (2,3). This is masked as 2 + 4 = 6, and the position to search in the MIP array is 0 because the items are between 1 and 16.

Step 2: perform an AND operation between masked_subset and each transaction in the MIP array.
For (2,3), masked_subset = 6 and the position is 0, so AND 6 with MIP[1,2,...,n][0]. If the result is the same as masked_subset, i.e., 6, then the item subset is present in that transaction; otherwise it is not present.

Itemset_2 and Itemset_3 to N

The join step of Apriori joins items in the L_k itemset to form items in the L_{k+1} itemset. If L_k itemsets have common items then they are not combined. They are combined only if they have a different item.

Frequent Itemset

This is used to check whether the subsets formed in the subset module are frequent or not. This is done to make sure that an itemset is frequent only if its subsets are frequent. If the subsets are found to be frequent, then the corresponding itemset is added to the candidate itemsets; otherwise it is discarded, thus reducing the search space and hence the time involved in searching.
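The masking and search procedures of Algorithms 2 and 3 can be sketched together as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, bits are set with `|=` instead of `+=` so that a repeated item cannot corrupt the word, and the bit index is computed as `(item - 1) % 16`, which is equivalent to the remap-to-1..16 step followed by `power(2, item-1)`.

```python
WORD_BITS = 16  # sixteen items are packed into one array location


def build_mask(items, max_item):
    """Pack numeric items (1..max_item) into a list of 16-bit words."""
    words = [0] * ((max_item - 1) // WORD_BITS + 1)
    for item in items:
        pos = (item - 1) // WORD_BITS     # which word holds this item
        bit = (item - 1) % WORD_BITS      # item 1 of a block -> bit value 1
        words[pos] |= 1 << bit
    return words


def contains(transaction_mask, itemset_mask):
    """True if every bit of the itemset is also set in the transaction (AND test)."""
    return all((t & s) == s for t, s in zip(transaction_mask, itemset_mask))


def mip_search(transaction_masks, itemset, max_item, min_support):
    """Count transactions containing `itemset`; report whether it is frequent."""
    target = build_mask(itemset, max_item)
    count = sum(1 for mask in transaction_masks if contains(mask, target))
    return count, count >= min_support


# The worked values from the text: one transaction over items 1..48, and subset (2,3).
print(build_mask([1, 3, 6, 12, 21, 25, 28, 33, 36], 48))
print(build_mask([2, 3], 16))

# Hypothetical three-transaction database with an absolute minimum support of 2.
db = [build_mask(t, 17) for t in ([1, 2, 3], [1, 2, 4], [2, 3, 17])]
print(mip_search(db, [2, 3], 17, 2))
```

Each candidate itemset is masked once, and every transaction is then tested with one AND and one comparison per word, regardless of how many items the candidate has.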
4. Experimental Results and Performance Analysis

All these algorithms were tested on six data sets, which exhibit different characteristics, and the results were evaluated. The data sets used were T10100K, T40I200200K, pump, chess, connect and mushroom, obtained from the FIMI web site. The experiments were run on an Intel Pentium 2.5 GHz processor under Windows XP with 256 MB RAM. The results for these data sets are shown in Figures 4.1 to 4.6. Each figure presents the results of the Apriori-Trie, FP-Growth and BitStreamMask-Search (BSMS) implementations on the respective dataset. The diagrams compare execution time, given in seconds, at various support levels.

4.1. Comparison with AprioriTrie and FP-Growth

Six sets of data were used in our experiments. Two of these sets [13] are synthetic data (T10I4D100K and T40I10D100k).

Table 4.1: Characteristics of Experiment Data Sets

Data | #items | avg. trans. length | # transactions
T10100K | | | ,000
Pump | | | ,219
T40I200200K | | | ,000
mushroom | | | ,124
Connect | | | ,557
Chess | | | ,225

The other datasets are real data (Pump, Mushroom, Chess and Connect-4), which are dense in long frequent patterns. These data sets were often used in previous studies of association rule mining and were downloaded from and palmeri/datam/dci /datasets.php. Some characteristics of these datasets are shown in Table 4.1. The Bit Stream Mask Search algorithm was mainly compared with two popular algorithms, AprioriTrie and FP-Growth, whose implementations were downloaded from fimi.cs.helsinki.fi and run on these datasets. They were compiled in Visual C++. The BitStreamMask-Search algorithm was implemented based on these codes.

Table 4.2: Run Time (S) for T10100K Data

Support(%) | AprioriTrie | FP Growth | BSMS

Table 4.2 shows the running time of the compared algorithms on T10100K data with different minimum supports, expressed as a percentage of the total transactions. BSMS runs faster than both AprioriTrie and FP-Growth under almost all support values. On average, BSMS runs almost two times faster than AprioriTrie and three times faster than FP-Growth. Figure 4.1 shows the performance comparison under various support levels.

Table 4.3 and Fig. 4.2 show the performance comparison of the algorithms on T40I200200K data. BSMS runs faster at the various minimum support values. On average, BSMS runs twice as fast as AprioriTrie and three times as fast as FP-Growth.

Figure 4.1: Comparison of Run Time (S) for T10100K Data (x-axis: support (%); series: AprioriTrie, FP Growth, BSMS)

Table 4.3: Run Time (S) for T40I200200K Data

Support(%) | AprioriTrie | FP Growth | BSMS

Figure 4.2: Comparison of Run Time (S) for T40I200200K Data (x-axis: support (%); series: AprioriTrie, FP Growth, BSMS)

Table 4.4: Run Time (S) for Pump Data

Support(%) | AprioriTrie | FP Growth | BSMS

Table 4.4 and Fig. 4.3 show the performance comparison of the algorithms on Pump data. For this dataset, BSMS runs faster than the other two algorithms at all support levels.

Figure 4.3: Comparison of Run Time (S) for Pump Data (x-axis: support (%); y-axis: execution time (seconds); series: AprioriTrie, FP Growth, BSMS)

Table 4.5: Run Time (S) for Connect-4 Data

Support(%) | AprioriTrie | FP Growth | BSMS

Table 4.5 and Fig. 4.4 show the relative performance of the algorithms on Connect-4 data. Connect-4 data is very dense. In this implementation, BSMS runs faster than AprioriTrie at all support levels and three times faster than FP-Growth.

Figure 4.4: Comparison of Run Time (S) for Connect-4 Data (x-axis: support (%); y-axis: execution time (seconds); series: AprioriTrie, FP Growth, BSMS)

Table 4.6 and Fig. 4.5 show the relative performance of the algorithms on Mushroom data. In this implementation, BSMS is faster than FP-Growth at almost all support levels. AprioriTrie, however, has a lower execution time on this dataset.

Table 4.6: Run Time (S) for Mushroom Data

Support(%) | AprioriTrie | FP Growth | BSMS

Figure 4.5: Comparison of Run Time (S) for Mushroom Data (x-axis: support (%); series: AprioriTrie, FP Growth, BSMS)

Table 4.7 and Fig. 4.6 compare the algorithms of interest on Chess data. AprioriTrie is better than FP-Growth, and BSMS is also better than FP-Growth.

Table 4.7: Run Time (S) for Chess Data

Support(%) | AprioriTrie | FP Growth | BSMS

Figure 4.6: Comparison of Run Time (S) for Chess Data (x-axis: support (%); series: AprioriTrie, FP Growth, BSMS)

4.2. Run Time

The data sets used in the above experiments have often been used in previous research, and the times shown include the time needed for all the steps. The BSMS algorithm outperforms both FP-Growth and AprioriTrie. Bit Stream Mask Search algorithms are expected to be faster than the other Apriori algorithms because they search for one or more itemsets in a transaction only once.

4.3. Storage Cost

The storage cost for maintaining the itemsets is less than that for maintaining id lists in the AprioriTrie algorithm. Because bits are packed into process memory at a ratio of 1:16, every item number is stored as a single bit. Once the bits are generated, the corresponding operations are done efficiently. The storage is also less than that of the FP-tree (the FP-tree has a header table and node links). In the Bit Stream procedure, memory usage is reduced to one sixteenth of the actual storage cost.
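The one-sixteenth figure can be checked with a short calculation. The sketch below assumes the baseline is one 16-bit array location per item slot in every transaction; the function names and the sample sizes (1,000 transactions, 48 or 50 items) are hypothetical and are not taken from the experiments.

```python
def words_per_transaction(num_items, word_bits=16):
    """Array locations needed per transaction: ((N-1)/16)+1 with integer division."""
    return (num_items - 1) // word_bits + 1


def storage_words(num_transactions, num_items, word_bits=16):
    """Return (masked storage, one-location-per-item storage), both in array locations."""
    masked = num_transactions * words_per_transaction(num_items, word_bits)
    unpacked = num_transactions * num_items
    return masked, unpacked


masked, unpacked = storage_words(1000, 48)
print(masked, unpacked, unpacked / masked)

masked50, unpacked50 = storage_words(1000, 50)
print(masked50, unpacked50, unpacked50 / masked50)
```

The ratio is exactly 16 when the number of items is a multiple of 16 and somewhat lower otherwise, because the last location of each transaction is only partly used. A sparse list that stores only the items actually present in a transaction would give a different baseline.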

4.4 About Comparisons

This paper focuses on algorithmic concepts with detailed implementations. For the same algorithm, the run time differs between implementations. In this paper, the AprioriTrie and FP-Growth implementations were downloaded from the FIMI open source code and then run. For example, FP-tree construction should be slower than the Bit Stream process. For the BSMS algorithm, we argue that our implementation is faster than tree construction, and in terms of run time it is the fastest of the implementations compared.

5. Conclusion and Future Work

This paper has proposed a novel data structure and the BitStreamMask-Search algorithm to mine frequent itemsets. The experiments give quantitative evidence that BitStreamMask-Search is superior to AprioriTrie and FP-Growth:
a) Execution time is reduced by half for the T40I200200K, T10100K and PUMP data sets.
b) Execution time is reduced by 8.5% for the Connect-4 data set.
c) BSMS execution time is lower than that of FP-Growth but slightly higher than that of AprioriTrie for the CHESS and MUSHROOM data sets, where the data sets are small.
d) The search space is reduced during each iteration.
e) The memory space for finding the frequent itemsets is reduced.
f) Efficiency is increased by the MIP-based search.
g) Time complexity is decreased.

The advantages of BitStreamMask-Search over existing algorithms listed above are good evidence of its efficiency. BitStreamMask-Search scales well, especially when transactions are large, and outperforms other algorithms on such transactions. Extending this new data structure to other algorithms and to closed itemsets may reveal new dimensions in the future.
References

[1] Zhi-Choa Li, Pi-Lian He, Ming Lei, A High Efficient AprioriTID Algorithm for mining Association rule, Proceedings of 4th International Conference on Machine Learning and Cybernetics, Aug.
[2] He Li-jian, Chen Li-chao, Liu Shuang-ying, Improvement of AprioriTid Algorithm for Mining Association Rules, Journal of Yantai University, Vol.16, No.4.
[3] R. Agrawal, J. Shafer, Parallel mining of association rules, IEEE Transactions on Knowledge and Data Engineering, 8(6), December.
[4] R. Agrawal, T. Imielinski, and A.N. Swami, Mining association rules between sets of items in large databases. In P. Buneman and S. Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, volume 22(2) of SIGMOD Record, ACM Press.
[5] Ke Su, Fengsdhan Bai, Mining weighted Association Rules, IEEE Transactions on KDE, April 2008.
[6] J. Han, J. Pei, and Y. Yin, Mining frequent patterns without candidate generation, Proceedings of ACM SIGMOD International Conference on Management of Data, ACM Press, Dallas, Texas, pp. 1-12, May.
[7] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang, Hmine: Hyper-structure mining of frequent patterns in large databases, Proc. of IEEE Intl. Conference on Data Mining.
[8] A. Pietracaprina and D. Zandolin, Mining frequent itemsets using Patricia Tries, FIMI 03, Frequent Itemset Mining Implementations, Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, Melbourne, Florida, December 2003.
[9] G. Grahne and J. Zhu, Efficiently using prefix-trees in mining frequent itemsets, FIMI 03, Frequent Itemset Mining Implementations, Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, Melbourne, Florida, December.
[10] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo, Fast discovery of association rules. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, MIT Press.
[11] R. Agrawal and R. Srikant, Fast algorithms for mining association rules, Proceedings 20th International Conference on Very Large Data Bases, Morgan Kaufmann.
[12] C. Borgelt and R. Kruse, Induction of association rules: Apriori implementation. In W. Härdle and B. Rönz, editors, Proceedings of the 15th Conference on Computational Statistics, Physica-Verlag.
[13] R. Agrawal and R. Srikant, Quest Synthetic Data Generator, IBM Almaden Research Center, San Jose, California.
[14] Christian Borgelt, Efficient implementation of Apriori and Eclat, FIMI 04.
[15] Bart Goethals, Survey on Frequent Pattern Mining, HIIT Basic Research Unit, University of Helsinki, Finland.
[16] Ja-Hwung Su, Wen-Yang Lin, CBW: An efficient algorithm for Frequent Itemset Mining, Proceedings of 37th Hawaii International Conference on System Science.
[17] Jiawei Han, Micheline Kamber, Data Mining Concepts and Techniques, 2004 Edn.
[18] Mingju Song and Sanguthevar Rajasekaran, A transaction mapping for frequent itemsets mining, IEEE Transactions on Knowledge and Data Engineering, 18(4), April 2006.
