Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree


Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

Virendra Kumar Shrivastava 1, Parveen Kumar 2, K. R. Pardasani 3
1 Department of Computer Science & Engineering, Singhania University, Rajasthan
2 Department of Computer Science and Engineering, Meerut Institute of Engg. & Technology, Meerut
3 Department of Mathematics & MCA, Maulana Azad National Institute of Technology, Bhopal
1 vk_shrivastava@yahoo.com, 2 drparveen@apiit.edu.in, 3 kamalrajp@hotmail.com

Abstract

The problem of discovering association rules at a single level has received significant research attention, and several efficient algorithms for mining frequent itemsets have been developed. However, discovering association rules at multiple concept levels may lead to the mining of more specific and concrete knowledge from datasets, and multi-level association rules are useful in many applications. In most studies of multi-level association rule mining, the database is scanned repeatedly, which affects the efficiency of the mining process. In this paper, a method for discovering multi-level association rules from a primitive level FP-tree is proposed in order to reduce main memory usage and speed up execution.

Keywords: Data mining, discovery of association rules, multiple-level association rules, FP-tree, FP(l)-tree, COFI-tree, primitive level FP-tree.

1. Introduction

Data mining [4] is the search for relationships and global patterns that exist in large databases but are hidden among the vast amounts of data, such as the relationship between patient data and medical diagnoses. These relationships represent valuable knowledge about the database. Data mining, the extraction of hidden predictive information from large databases, is a powerful technology with great potential for analyzing important information in a data warehouse.
Association rule mining is one of the important techniques of data mining. It can be used to discover unknown or hidden correlations between items found in a database of transactions. An association rule [1, 3, 4] is a rule which implies certain association relationships among a set of objects in a database (such as "occur together" or "one implies the other"). Association rule mining [1, 5] is the discovery of associations among attribute-value conditions that occur frequently together in a given data set, and it is widely used for market basket or transaction data analysis. One of the basic algorithms for mining frequent itemsets is Apriori, proposed by Agrawal and Srikant in 1994. Also called the level-wise algorithm, it is the most popular and influential algorithm for finding all frequent sets. The discovery of multi-level association rules involves items at different levels of abstraction. For many applications, it is difficult to find strong associations among items at the low, or primitive, level of abstraction due to the sparsity of data in the multilevel dimension, while strong associations discovered at higher levels may represent common-sense knowledge. In the discovery of multi-level frequent patterns from a primitive level FP-tree, the first requirement is an efficient method for generating frequent items at multiple levels of abstraction. This requirement can be fulfilled by providing concept taxonomies from the primitive level concepts to higher levels. There are many possible ways to explore efficient discovery of multi-level association rules. One way is to apply existing single-level association rule mining methods to mine multi-level association rules. However, applying the same minimum support and minimum confidence thresholds (as at a single level) to all the levels may lead to some undesirable results.
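As a reference point for the multi-level setting, the classical level-wise Apriori algorithm mentioned above can be sketched as follows. This is a minimal illustrative sketch in Python, not the authors' implementation; the function and variable names are our own.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Minimal level-wise Apriori: returns {itemset: support count}
    for every itemset whose support count is at least minsup."""
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    result = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: size-k sets all of whose (k-1)-subsets
        # were frequent at the previous level (the Apriori property).
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))]
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= minsup}
        result.update(frequent)
        k += 1
    return result
```

Each pass over the data corresponds to one "level" of itemset size, which is why repeated database scans become costly and motivate FP-tree-based methods.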
For example, if we apply the Apriori algorithm [2, 7] to find data items at multiple levels of abstraction under the same minimum support and minimum confidence thresholds, it may generate some uninteresting associations at higher or intermediate levels. If, instead, we want to find strong relationships at relatively low levels of the hierarchy, the minimum support threshold must be reduced substantially; however, this may generate many uninteresting associations (such as mouse ⇒ antivirus) and many strong association rules at the primitive concept level. In order to remove the uninteresting rules generated in the mining process, one should apply different minimum support thresholds at different concept levels. Several algorithms have been developed along these lines; progressively reducing the minimum support threshold at deeper levels of abstraction is one such approach [6, 8, 9].

Figure 1: Concept hierarchy of an electronics store

This paper is organized as follows. Section two describes the basic concepts of multi-level association rules. In section

three, a method for the discovery of multi-level association rules from a primitive level FP-tree is proposed. Section four discusses the experimental results and section five presents the conclusions of the proposed research work.

2. Multiple-level Association Rules

To understand multi-level association rule mining, let us assume that the database contains:

i. an item data set which contains the description of each item in I in the form <A_i, description_i>, where A_i ∈ I, and
ii. a transaction data set T, which consists of a set of transactions <T_i, {A_p, ..., A_q}>, where T_i is a transaction identifier and A_j ∈ I (for j = p, ..., q).

2.1 Definition: A pattern, or itemset, A is one item A_i or a set of conjunctive items A_i ∧ ... ∧ A_j, where A_i, ..., A_j ∈ I. The support of a pattern A in a set S, s(A/S), is the number of transactions (in S) which contain A versus the total number of transactions in S. The confidence of A → B in S, c(A → B/S), is the ratio of s(A ∪ B/S) versus s(A/S), i.e., the probability that pattern B occurs in S when pattern A occurs in S. To generate relatively frequently occurring patterns and reasonably strong rule implications, one may specify two thresholds: minimum support s and minimum confidence c. Observe that, for the discovery of multi-level association rules, different minimum support and/or minimum confidence thresholds can be specified at different levels.

2.2 Definition: A pattern A is frequent in a set S at level l if the support of A is no less than its corresponding minimum support threshold s. A rule A → B/S is strong if, for the set S, each ancestor (i.e., the corresponding higher-level item) of every item in A and B, if any, is frequent at its corresponding level, A ∪ B/S is frequent at the current level, and the confidence of A → B/S is no less than the minimum confidence threshold at the current level.

For example, we assume that the database contains an item data set with the description of each item in I in the form <A_i, description_i>, where A_i is the product code.
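To make the support and confidence measures of Definitions 2.1 and 2.2 concrete, both can be computed directly from a transaction list. The following is our own illustrative Python sketch; the function names are not from the paper.

```python
def support(pattern, transactions):
    """s(A/S): fraction of transactions in S that contain every item of A."""
    pattern = set(pattern)
    return sum(1 for t in transactions if pattern <= set(t)) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """c(A -> B/S) = s(A ∪ B / S) / s(A/S)."""
    a, b = set(antecedent), set(consequent)
    return support(a | b, transactions) / support(a, transactions)
```

For instance, with four transactions in which {a} appears three times and {a, b} twice, the rule a → b has confidence (2/4)/(3/4) = 2/3.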
A transaction data set T = {T_1, T_2, T_3, T_4, ..., T_n} consists of a set of transactions T_i = {A_x, ..., A_z} in the format (TID, A_i), where TID is the transaction number and A_i is an item from the item data set. A multi-level database stores a hierarchy-information-encoded transaction table instead of the original transaction table [9, 10, 11]; in the transaction table, each item is encoded as a sequence of digits. The product category table, depicted in Table 1, contains the product category codes and a description for each code, which is only needed for the final display.

For the concept hierarchy depicted in figure 1, we first need to encode the transaction data. For example, we encode the hierarchy information by assigning a number to each item of each level. In figure 1, we have 3 levels (excluding All, since there is only 1 item at the root level). At level 3, we have computer, printer, accessories and software; we can assign 1 to computer, 2 to printer, 3 to accessories and so on.

Table 1: Product category codes of an electronics store

Code   Description
0      All electronics
1      Computer
2      Printer
3      Accessories
4      Software
11     Desktop computer
12     Laptop computer
111    IBM Desktop computer
112    IBM Laptop computer
211    HP Inkjet Printer
221    HP Laser Printer

At the sub-level for computer, level 2, we assign 1 to desktop computer, 2 to laptop computer, 3 to server computer, and so on. Furthermore, at the sub-level for desktop computer, we assign 1 to IBM, 2 to HP and 3 to Dell. As a result, 121 represents an IBM laptop computer, while 211 represents an HP inkjet printer. The encoding can be extended to permit more than 10 alternatives per level. The difference between this encoding method and the single-level case is that in the single-level case all items are encoded explicitly.
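The digit encoding just described, with levels numbered upward from the primitive level (level 1), can be sketched as follows. This is our own illustrative Python, assuming three-digit codes as in the running example.

```python
def generalize(code, level, depth=3):
    """Generalize a digit-encoded item to a concept level by keeping the
    first (depth - level + 1) digits and padding the rest with '*'.
    With depth=3, level 1 is the primitive code itself and level 3 the
    most general category (e.g. '112' -> '11*' at level 2, '1**' at 3)."""
    keep = depth - level + 1
    return code[:keep] + "*" * (depth - keep)

def project_transaction(transaction, level, depth=3):
    """Project a primitive-level transaction to a concept level, merging
    items that generalize to the same code."""
    return sorted({generalize(item, level, depth) for item in transaction})
```

For example, `project_transaction(["111", "112", "311", "411"], 2)` yields `["11*", "31*", "41*"]`: the two desktop/laptop codes collapse to the same level-2 item.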
However, in the multi-level context we add "*" to encode higher-level items. For example, at level 3 we have items such as 1**, 2**, 3** and 4**; at level 2, items such as 11*, 13* and 31*; and at level 1, items such as 111, 112 and 113. Consequently, we get the encoded transaction table shown in Table 2.2.

Table 2.2: Encoded transaction table

TID   Items
T1    111, 112, 311, 411
T2    112, 121, 312, 411, 422
T3    111, 112, 311, 411
T4    113, 112, 311, 411
T5    112, 121, 312, 411, 423
T6    111, 112, 311, 411
T7    112, 121, 312, 411, 422
T8    111, 112, 311, 411
T9    112, 121, 312, 411, 421
T10   111, 112, 311, 411

The distinct items per level refers to the number of different items at each concept level, and the total transaction elements refers to the items that appear in the itemsets, counted with the repetitions occurring within the levels themselves. For example, if we assume that the distinct items at level 1 in an encoded transaction table are 111, 112, 131, 132, 211, 212, 221, 222, 311, 312, 321, 322, 412 and 423, then the number of distinct items at level 1 is 14.
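The per-level bookkeeping above can be sketched in Python. This is our own illustrative code (not from the paper), assuming three-digit codes as in Table 2.2.

```python
def distinct_items_at_level(encoded_db, level, depth=3):
    """Count the distinct generalized codes appearing at a concept level.
    Each code keeps its first (depth - level + 1) digits and the remaining
    positions are padded with '*' (level 1 = primitive, level 3 = top)."""
    keep = depth - level + 1
    return len({item[:keep] + "*" * (depth - keep)
                for t in encoded_db for item in t})
```

On the data of Table 2.2, level 1 has 10 distinct codes, while level 3 has only 3 (1**, 3** and 4**), illustrating how sparsity at the primitive level gives way to dense, general items higher up.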

3. Method for discovery of multi-level association rules from primitive level FP-tree

In this section, we propose a method for the discovery of multi-level association rules from a primitive level FP-tree. The method uses a hierarchy-information-encoded transaction table instead of the original transaction table [10, 11, 12]. It is often beneficial to use an encoded table, although our method does not rely on the derivation of such a table, because the encoding can always be performed on the fly. The method works in a bottom-up manner. Given the FP-tree of the primitive level and the support threshold s and confidence threshold c for a level l, it discovers multi-level association rules as follows:

Input: FP-tree of the primitive level, support threshold s and confidence threshold c for level l.
Output: frequent pattern association rules for level l.

1. Construct the FP(l)-tree from the FP-tree of the primitive level by transforming all items in the header table and all nodes of the primitive level FP-tree.
2. If there are repeated nodes in a path, keep the top node and remove the others.
3. For each item in the header table, if its support count is less than s, eliminate the item and its related nodes from both the header table and the FP(l)-tree.
4. Merge identical items and their related nodes in the FP(l)-tree according to each item in the new header table.
5. Eliminate the duplicate items and related nodes, accumulating the support counts of the related nodes.
6. Arrange the header table and the nodes of the FP(l)-tree in descending order of item frequency.
7. Adjust the node-links and path-links in the FP(l)-tree.
8. Generate frequent itemsets.
9. For each frequent 1-item:
10.    Call COFI-tree (Algorithm 2).
11. Generate frequent pattern association rules for level l.
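Steps 1 to 6, which recount item supports after generalizing every node to level l, can be approximated on the encoded transactions themselves. The following is our own illustrative Python sketch: it operates on the transaction list rather than on the FP-tree, and all names in it are ours, not the paper's.

```python
from collections import Counter

def level_l_header_table(encoded_db, level, minsup, depth=3):
    """Sketch of steps 1-6: generalize every primitive item to `level`
    (keep the first depth-level+1 digits, '*'-pad the rest), merge
    duplicates within each transaction, drop items whose support falls
    below `minsup`, and return the survivors with their counts in
    descending frequency order, as in the new header table."""
    keep = depth - level + 1
    counts = Counter()
    for t in encoded_db:
        # Using a set per transaction mirrors the merging of repeated
        # nodes and duplicate items in steps 2, 4 and 5.
        counts.update({item[:keep] + "*" * (depth - keep) for item in t})
    return [(item, c) for item, c in counts.most_common() if c >= minsup]
```

On the data of Table 2.2 at level 2 with minsup = 5, only 11*, 31* and 41* survive, each with support count 10.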
Method 1: Multi-level frequent pattern association rule mining from a primitive level FP-tree

Algorithm 2: COFI-tree // build the Co-Occurrence Frequent Item tree with pruning, and mine the COFI-trees of the FP(l)-tree
Input: FP(l)-tree, the support threshold s for level l.
Output: frequent itemsets for level l.

1. LFI = the least frequent item in the header table of the FP(l)-tree.
2. While (true) do
   2.1 Add up the frequencies of all items that share a path with item LFI. The frequencies of all items that share the same path are the same as the frequency of the LFI item.
   2.2 Eliminate all non-locally-frequent items from the frequent list of item LFI.
   2.3 Create a root node for the LFI-COFI-tree with both frequency-count and contribution-count = 0.
       2.3.1 LOFI is the list of locally frequent items on the path from item LFI to the root.
       2.3.2 The items on LOFI form a prefix of the LFI-COFI-tree.
       2.3.3 If the prefix is new, set frequency-count = frequency of the LFI node and contribution-count = 0 for all nodes in the path; else
       2.3.4 Update the frequency-count of the previously existing part of the path.
       2.3.5 Update the pointers of the header list if needed.
       2.3.6 Find the next node for item LFI in the FP(l)-tree and go to 2.3.1.
   2.4 Call Mine_COFI_tree (LFI) // described in Algorithm 3.
   2.5 Release the LFI-COFI-tree.
   2.6 LFI = the next frequent item in the header table of the FP(l)-tree.
3. Goto 2.

Algorithm 3: Mine_COFI_tree (LFI)
1. nodeLFI = select next node.
2. While (there are still nodes) do
   2.1 D = the set of nodes from nodeLFI to the root.
   2.2 F = nodeLFI.frequency-count − nodeLFI.contribution-count.
   2.3 Generate all candidate patterns X from the items in D; patterns that do not contain LFI are discarded.
   2.4 Patterns in X that do not exist in the LFI candidate list are added to it with frequency = F; otherwise, their frequency is incremented by F.
   2.5 Increment the contribution-count by F for all items in LOFI.
   2.6 nodeLFI = select next node.
3. Goto 2.
4.
Based on the support threshold s, remove non-frequent patterns from the LFI candidate list.

4. Experimental Results

To test the performance of the proposed method for the discovery of multi-level association rules from a primitive level frequent pattern tree, we collected data from two sources. First, we collected the sales database of S. K. Technologies. Second, we generated synthetic transaction databases using a randomized itemset generation algorithm similar to that described in [2]. The transaction database is converted into an encoded transaction table according to the information about the generalized items in the item description table. The maximum level of the conceptual hierarchy in the item table is set to 5. The encoded transaction database DB1 has the parameter settings shown in Table 3. We generated three synthetic databases, DB2, DB3 and DB4, with the parameter settings shown in Table 4. The experiments were run on an IBM 8175 desktop with an Intel P-IV 2.8 GHz processor, an 865 motherboard, 512 MB of DDR RAM @ 400 MHz and a Western Digital 40 GB IDE hard drive, using the Microsoft Windows XP operating system and Turbo C++.

Table 3: Parameter settings for encoded database DB1

Parameter   Value
N           100
T           150000
I           10
L           5
B           0

Table 4: Parameter settings for DB2

Database   N     T        I   L   B
DB2        100   200000   5   5   0

Figure 2: Running time on DB1
Figure 3: Memory usage in KB on DB1
Figure 4: Running time on DB2
Figure 5: Memory usage in KB on DB2

Experimental results are shown in figures 2, 3, 4 and 5. These figures show that our proposed method performs better than ML_T2L1 [10] in terms of main memory usage and also runs faster.

5. Conclusion

In this research work, a method for the discovery of multi-level association rules from a primitive level FP-tree is proposed. The proposed method constructs the FP(l)-tree from the FP-tree of the primitive level. To generate frequent patterns at multiple levels it uses the COFI-tree method, which reduces memory usage in comparison to ML_T2L1; therefore, it can mine larger databases with less main memory available, and it also runs faster. The method uses a non-recursive mining process, and with a simple traversal of the COFI-tree a full set of frequent items can be generated. It also uses an efficient pruning method that removes all locally non-frequent patterns, leaving the COFI-tree with only locally frequent items. It thus reaps the advantages of both FP-growth and the COFI-tree. Our experimental results show that the proposed method works efficiently in reducing main memory usage and also reduces execution complexity.

References

[1] R. Agrawal, T. Imielinski and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207-216, Washington, DC, May 1993.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), pages 487-499, Santiago de Chile, Chile, September 1994.
[3] A. Savasere, E. Omiecinski and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995.
[4] R. Srikant, Q. Vu and R. Agrawal. Mining association rules with item constraints. IBM Almaden Research Center, San Jose, CA 95120, USA.
[5] A. K. Pujari. Data Mining Techniques. University Press (India) Pvt. Ltd., 2001.
[6] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995.
[7] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA, 2001.

[8] Y. Wan, Y. Liang and L. Ding. Mining multilevel association rules with dynamic concept hierarchy. In Proceedings of the Seventh International Conference on Machine Learning and Cybernetics, Kunming, 12-15 July 2008.
[9] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. IEEE Transactions on Knowledge and Data Engineering, 11(5):798-804, 1999.
[10] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In ACM SIGMOD Conference on Management of Data, May 2000.
[11] R. S. Thakur, R. C. Jain and K. R. Pardasani. Fast algorithm for mining multi-level association rules in large databases. Asian Journal of Information Management, 1(1):19-26, 2007.
[12] M. El-Hajj and O. R. Zaïane. COFI-tree mining: a new approach to pattern growth in the context of interactive mining. In Proceedings of the 2003 International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD), August 2003.