Utility Mining: An Enhanced UP Growth Algorithm for Finding Maximal High Utility Itemsets

C. Sivamathi 1, Dr. S. Vijayarani 2
1 Ph.D Research Scholar, 2 Assistant Professor, Department of CSE, Bharathiar University, Coimbatore, India
1 c.sivamathi@gmail.com 2 vijimohan_2@yahoo.com

Abstract: Utility mining is the efficient discovery of high utility itemsets from a database. An itemset whose utility value is greater than the utility threshold is known as a high utility itemset. In recent years, several utility mining algorithms have been proposed. In a transaction database, utility mining algorithms extract highly profitable itemsets. In a large transaction database, utility mining produces a large set of itemsets. To retrieve compact patterns, maximal high utility itemsets are introduced. An itemset is defined as a maximal high utility itemset if no proper superset of it is a high utility itemset. The main objective of this work is to retrieve maximal high utility itemsets from a transaction database using the UP-Growth algorithm. This work uses the UP-Growth approach to store the utility information of itemsets. From the resulting high utility itemsets, it retrieves only the maximal high utility itemsets. A Chess dataset is used for performance analysis of the algorithm. The experimental results discuss performance factors such as execution time, memory space, and the number of maximal high utility itemsets retrieved by the algorithm.

Keywords: Utility Mining, High Utility Itemsets, UP-Growth, Maximal High Utility Itemsets, Pruning strategy.

I. INTRODUCTION

Utility mining results in a huge number of high utility itemsets. Post-processing must be applied to these itemsets to discover the necessary high utility itemsets [1]. Moreover, some itemsets may be irrelevant to users. Hence compact patterns such as maximal and closed itemsets are introduced. An itemset is called maximal if it is not a subset of any other high utility itemset.
An itemset is called closed if it has no superset with the same utility value. These two compact patterns do not reduce the huge number of utility itemsets, but they provide meaningful results. Moreover, if a transaction database is very dense and the minimum utility threshold is very low, then mining all high utility itemsets might not be a good idea. For example, if there is a high utility itemset of size n, then all 2^n - 1 nonempty subsets of the itemset have to be generated. Hence, if every subset of a high utility itemset is itself high utility, it is sufficient to discover only the maximal high utility itemsets. Mining high utility itemsets can thus be reduced to mining a border in the itemset lattice. All itemsets above the border are high utility itemsets and those below the border are low utility itemsets. However, mining only maximal itemsets has the following deficiency: from a maximal itemset and its utility threshold, it is known that all its subsets are high utility and that their utility is more than the minimum threshold, but their exact utility values are not known. In this work, the algorithm discovers maximal high utility itemsets, i.e., itemsets that are not only high utility but also maximal. The algorithm is based on the UP-Growth algorithm. To our knowledge, this is the first work to retrieve maximal high utility itemsets in data mining.

II. RELATED WORK

Association rule mining is considered an interesting research area and has been studied widely [2] [3]. In recent years, some relevant methods have been proposed for mining high utility itemsets from transaction databases. In 1994, R. Agrawal et al. proposed the Apriori algorithm, which exploits the downward closure property [3][4] and is the pioneer for efficiently mining association rules from large databases. This algorithm generates and tests candidate itemsets iteratively. It may scan the database multiple times, so the computational cost is high.
In order to overcome the disadvantages of the Apriori algorithm and efficiently find frequent itemsets without generating candidate itemsets, Frequent Pattern Growth (FP-Growth) was proposed by J. Han et al. [5]. FP-Growth compresses a database into a tree structure and shows better performance than Apriori. However, it has two limitations: (i) it treats all items as having the same price; (ii) within a transaction, each item appears in binary (0/1) form, i.e., either present or absent. In the real world, each item in a supermarket has a different price, and a single customer may buy the same item multiple times. Therefore, finding only traditional frequent patterns in a database cannot fulfill the requirement of finding the most valuable customers/itemsets that contribute the most to the total profit in a retail business.

In 2006, H. Yao et al. proposed the UMining [6] algorithm to find almost all the high utility itemsets from an original database. However, it cannot guarantee capturing the complete set of high utility itemsets. Later, in 2010, V. S. Tseng et al. [7] proposed the UP-Growth algorithm to rectify the problems of FP-Growth. Another algorithm, named Two-Phase, is also able to find high utility itemsets. The Two-Phase algorithm prunes the number of candidates and obtains the complete set of high utility itemsets. In the first phase, the transaction-weighted downward closure property of the search space is used to expedite the identification of candidates. In the second phase, one extra database scan
is performed to identify high utility itemsets. However, this algorithm cannot deal with negative item values in utility mining: when high utility itemsets with negative item values are sought, some candidate itemsets are lost. Hence, the Two-Phase algorithm focuses on positive item values and is not suited to negative item values in utility mining [8].

An algorithm named THUI (Temporal High Utility Itemsets) was proposed [9]; it is the first algorithm for finding temporal high utility itemsets in temporal databases. The algorithm integrates the advantages of the Two-Phase algorithm and the SWF algorithm and augments them with incremental mining techniques for mining temporal high utility itemsets efficiently. However, this algorithm also focuses only on high utility itemsets with positive item values and cannot find high utility itemsets with negative item values.

Lin et al. first developed the HAUP-tree structure and the HAUP-growth algorithm for mining high average-utility itemsets (HAUIs). In the HAUP-tree, each node at the end of a path stores the average-utility upper bound of the corresponding item as well as the quantities of the preceding items in the same path. This approach can thus be used to speed up the discovery of HAUIs [1]. Lan et al. [11] proposed a projection-based average-utility itemset mining (PAI) algorithm to reveal HAUIs using a level-wise approach. Based on the proposed upper-bound model, the number of unpromising candidates can be greatly reduced compared to previous work based on the TWU model.

III. PROPOSED METHODOLOGY

UP-Growth is one of the efficient algorithms for generating high utility itemsets. It uses a tree structure called the global UP-Tree. This tree, referred to here as the UPMAXTree, maintains the transaction information, so that there is no need to scan the database again and again. In a UPMAXTree, each node has the node's item name, a support count, a parent name, a node link that points to another node, and a set of child nodes.
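The node layout just described can be sketched in Java as follows. This is only a minimal illustration under stated assumptions: the class name UpMaxNode, the field names, and the helper methods childFor and insertPath are hypothetical and are not taken from the authors' implementation. The nodeUtility field is assumed because the pseudo code later refers to node utilities, and the node link is assumed to chain nodes that carry the same item, as in FP-tree style structures. For simplicity, insertPath adds the same utility value to every node on the path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative UPMAXTree node; all names and fields are assumptions of this sketch. */
class UpMaxNode {
    final String itemName;      // item stored at this node (null for the root)
    int supportCount;           // number of inserted paths passing through this node
    int nodeUtility;            // accumulated utility of those paths (assumed field)
    final UpMaxNode parent;     // parent node (null for the root)
    UpMaxNode nodeLink;         // next node holding the same item (assumed meaning)
    final List<UpMaxNode> children = new ArrayList<>();

    UpMaxNode(String itemName, UpMaxNode parent) {
        this.itemName = itemName;
        this.parent = parent;
    }

    /** Returns the child holding the given item, creating it if it does not exist yet. */
    UpMaxNode childFor(String item) {
        for (UpMaxNode child : children) {
            if (child.itemName.equals(item)) {
                return child;
            }
        }
        UpMaxNode created = new UpMaxNode(item, this);
        children.add(created);
        return created;
    }

    /** Inserts one path of items below this node; shared prefixes reuse existing nodes. */
    void insertPath(List<String> items, int utility) {
        UpMaxNode current = this;
        for (String item : items) {
            current = current.childFor(item);
            current.supportCount++;
            current.nodeUtility += utility;
        }
    }
}

class UpMaxNodeDemo {
    public static void main(String[] args) {
        UpMaxNode root = new UpMaxNode(null, null);
        // Two paths with arbitrary illustrative utilities; both share the prefix "a".
        root.insertPath(Arrays.asList("a", "b"), 5);
        root.insertPath(Arrays.asList("a", "c"), 7);
        UpMaxNode a = root.children.get(0);
        System.out.println(a.itemName + " count=" + a.supportCount
                + " utility=" + a.nodeUtility + " children=" + a.children.size());
    }
}
```

Because both paths start with the same item, they share a single node below the root, which is what allows the tree to hold the transaction information compactly.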
The UPMAXTree maintains a table named the header table. In the header table, each entry records an item name, an overestimated utility, and a link. The construction of a global UP-Tree is performed with two database scans. In the first scan, the Transaction Utility (TU) of every transaction and the Transaction Weighted Utility (TWU) of every item are calculated. An item whose TWU is less than the minimum utility threshold is said to be an unpromising item. An unpromising item and all its supersets are not high utility itemsets. The DGU strategy (Discarding Global Unpromising Items during Constructing a Global UP-Tree) discards global unpromising items and their actual utilities from the transactions and transaction utilities of the database.

The transactions are inserted into the UPMAXTree in the second scan. When a transaction is retrieved, the unpromising items are removed from the transaction and their utilities are also subtracted from the transaction utility (TU). Thus, a new TU is calculated after pruning unpromising items, which is called the reorganized transaction utility (abbreviated as RTU). Then the reorganized transactions are constructed with their RTU [1]. The DGN strategy (Decreasing Global Node Utilities during Constructing a Global UP-Tree) decreases the node utilities of the nodes of the global UP-Tree by the actual utilities of their descendant nodes. DGN is especially suitable for databases containing many long transactions. In other words, the more items a transaction contains, the more utilities can be discarded by DGN.

Potential high utility itemsets (PHUIs) are then retrieved from the tree. Like TWU, the utility of a PHUI is computed with the help of an estimated utility. Finally, the high utility itemsets are identified from the PHUIs. The global UP-Tree contains many sub-paths. Each path is considered starting from the bottom entry of the header table. Such a path is named a conditional pattern base (CPB).

Table I. Example Transaction Database

Transaction Id   Transaction                          Utility of Transaction
T1               (i1,1) (i2,3) (i3,2)                 19
T2               (i1,2) (i3,3) (i4,1)                 21
T3               (i2,2) (i3,1)                        10
T4               (i2,3) (i3,1) (i4,3) (i5,2)          32
T5               (i1,1) (i2,3) (i3,2) (i4,1) (i5,2)   28

Table II. Profit of Items

Item   Profit (utility)
i1     2
i2     3
i3     4
i4     5
i5     2

Table III. TWU of Items

Item   TWU
i1     68
i2     89
i3     110
i4     81
i5     60

The TWU of each item is calculated as the sum of the transaction utilities of the transactions in which the item appears. This is shown in Table III. Here the minimum utility threshold was chosen as 70; hence the unpromising items i1 and i5 are
removed from the database and the RTU is constructed, as shown in Table IV. The items are arranged in descending order with respect to their TWU. The UPMAXTree is then constructed, as given in Figure 1.

Bottom-up tracing: Each branch in the tree is traced from the leaf node. First, a pointer is set to the leftmost leaf node. The pointer then moves from the leaf node towards the root node. When checking a node N, if N's utility is larger than or equal to the minimum utility threshold (MinU), the itemset is put into the high utility itemset list. The ancestor nodes from N to the root are then also labeled as checked, i.e., these nodes do not need to be checked since they cannot be maximal. If N's utility is less than MinU, the pointer goes to the parent node of N. After all branches are checked, the process is finished. From the figure, the utility itemsets are {2}, {3}, {2, 3}, {4, 3}, {4, 2}, {4, 2, 3}. The maximal utility itemset is {4, 2, 3}.

Pseudo code for UPMaxUtilityItemsets

Algorithm: UPMaxUtilityItemsets
Input: UPMAXTree, header table Hx, itemset X, min_utility_threshold.
Output: MaximalHighUtilityItemsets.
1. For each entry i in Hx do
2.   Trace each node related to i and calculate nu_sum(i). // nu_sum(i) = sum of node utilities of i
3.   If (nu_sum(i) > min_utility_threshold) do
4.     Generate PHUI Y = X ∪ i.
5.     Set pu(i) as estimated utility of i.
6.     Construct Y-CPB.
7.     Put local promising items in Hv.
8.     Apply DLU.
9.     Apply DLN.
10.    Set a pointer pt to the leftmost leaf node of UPMAXTree.
11.    List = BottomUpTracing(UPMAXTree, min_utility_threshold, pt)
12.    Output MaximalHighUtilityItemsets from List.
13.    If (Tx ≠ NULL) then UPMaxUtilityItemsets(Tx, Hx, X)
14.  End if.
15. End for.

Table IV. Reorganized Transactions

Transaction Id   Reorganized Transaction    RTU
T1               (i3,2) (i2,3)              17
T2               (i3,3) (i4,1)              17
T3               (i3,1) (i2,2)              10
T4               (i3,1) (i2,3) (i4,3)       28
T5               (i3,2) (i2,3) (i4,1)       22

Fig. 1. UPMAX Tree

IV. EXPERIMENTAL EVALUATION

The algorithm is implemented in Java. The software tool used is NetBeans IDE 8.0.
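As a rough illustration of how the first database scan and the DGU step of Section III might be written in Java, the sketch below recomputes TU, TWU, and RTU for the five-transaction example. The class and method names (TwuSketch, computeTwu, reorganize, tx) and the map-based transaction representation are assumptions made for this illustration and are not taken from the authors' implementation. The threshold of 70 is the value under which items i1 and i5 are unpromising in the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the first scan (TU, TWU) and of DGU pruning; all names are illustrative. */
class TwuSketch {

    /** TU: sum of quantity * unit profit over the items of one transaction. */
    static int transactionUtility(Map<String, Integer> transaction, Map<String, Integer> profit) {
        int tu = 0;
        for (Map.Entry<String, Integer> e : transaction.entrySet()) {
            tu += e.getValue() * profit.get(e.getKey());
        }
        return tu;
    }

    /** TWU of an item: sum of the TU of every transaction that contains the item. */
    static Map<String, Integer> computeTwu(List<Map<String, Integer>> db, Map<String, Integer> profit) {
        Map<String, Integer> twuByItem = new LinkedHashMap<>();
        for (Map<String, Integer> transaction : db) {
            int tu = transactionUtility(transaction, profit);
            for (String item : transaction.keySet()) {
                twuByItem.merge(item, tu, Integer::sum);
            }
        }
        return twuByItem;
    }

    /** DGU: keep only items with TWU >= minUtil, ordered by descending TWU. */
    static Map<String, Integer> reorganize(Map<String, Integer> transaction,
                                           Map<String, Integer> twuByItem, int minUtil) {
        List<String> kept = new ArrayList<>();
        for (String item : transaction.keySet()) {
            if (twuByItem.get(item) >= minUtil) {
                kept.add(item);
            }
        }
        kept.sort((a, b) -> Integer.compare(twuByItem.get(b), twuByItem.get(a)));
        Map<String, Integer> reorganized = new LinkedHashMap<>();
        for (String item : kept) {
            reorganized.put(item, transaction.get(item));
        }
        return reorganized;
    }

    /** Builds an item-to-number map from alternating (String item, int value) arguments. */
    static Map<String, Integer> tx(Object... pairs) {
        Map<String, Integer> t = new LinkedHashMap<>();
        for (int i = 0; i < pairs.length; i += 2) {
            t.put((String) pairs[i], (Integer) pairs[i + 1]);
        }
        return t;
    }

    /** The five example transactions T1..T5 (item, quantity). */
    static List<Map<String, Integer>> exampleDb() {
        List<Map<String, Integer>> db = new ArrayList<>();
        db.add(tx("i1", 1, "i2", 3, "i3", 2));
        db.add(tx("i1", 2, "i3", 3, "i4", 1));
        db.add(tx("i2", 2, "i3", 1));
        db.add(tx("i2", 3, "i3", 1, "i4", 3, "i5", 2));
        db.add(tx("i1", 1, "i2", 3, "i3", 2, "i4", 1, "i5", 2));
        return db;
    }

    /** Unit profit of each item in the example. */
    static Map<String, Integer> exampleProfit() {
        return tx("i1", 2, "i2", 3, "i3", 4, "i4", 5, "i5", 2);
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> db = exampleDb();
        Map<String, Integer> profit = exampleProfit();
        Map<String, Integer> twu = computeTwu(db, profit);
        System.out.println("TWU = " + twu);
        for (Map<String, Integer> t : db) {
            Map<String, Integer> r = reorganize(t, twu, 70);
            System.out.println(r + " RTU = " + transactionUtility(r, profit));
        }
    }
}
```

Under these assumptions the sketch yields TWU values of 68, 89, 110, 81, and 60 for i1 to i5, and RTU values of 17, 17, 10, 28, and 22 for T1 to T5, with the remaining items ordered i3, i2, i4.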
The dataset used in the experiment is Foodmart. It is a dataset of customer transactions from a retail store, obtained and transformed from SQL-Server 2000. It consists of 4141 transactions with an average of 6 items per transaction.

Table V. Performance Measures

Database size: 1 Transactions, 2 Transactions, Transactions, Transactions
Columns: Min. Utility Threshold, Execution Time (ms), Memory Consumption (MB), High Utility Itemset Count, Generated Max. High Utility Itemsets

78    16.14   16      624
25    13.46   3       1
9     19.95   4219    1545
121   22.27   12958   5489
33    34.72   7245    4728
222   27.62   16      74
386   42.58   185     8925
444   48.2    3945    18524
396   81      987     3694
31    69.7    386     132
454   74.23   24848   12584
67    84.4    55714   2371
21    15.61   17439   5784
378   16.1    769     528
4     12.84   428     192
12 372 11.74 261 154

From Table V, it was found that the maximal high utility itemsets give a compact view of the high utility itemsets. They include only the supersets among the high utility itemsets, i.e., itemsets that are not contained in any other high utility itemset. Figure 2 shows the comparison of execution time for different numbers of transactions at different utility thresholds. Figure 3 gives the comparison of memory consumption for different numbers of transactions at different utility thresholds, and Figure 4 depicts the number of high utility itemsets retrieved. Figure 5 represents the number of maximal high utility itemsets retrieved.
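To make the idea of keeping only the supersets concrete, the following Java sketch filters a list of high utility itemsets down to the maximal ones, using the six itemsets listed in the Section III example. It is a plain quadratic post-filter written for illustration and is not the bottom-up tracing procedure used by the algorithm; the class name MaximalFilterSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative post-filter: keeps itemsets that have no proper superset in the input. */
class MaximalFilterSketch {

    static List<Set<Integer>> maximal(List<Set<Integer>> itemsets) {
        List<Set<Integer>> result = new ArrayList<>();
        for (Set<Integer> candidate : itemsets) {
            boolean hasProperSuperset = false;
            for (Set<Integer> other : itemsets) {
                // A strictly larger set containing every item of the candidate is a proper superset.
                if (other.size() > candidate.size() && other.containsAll(candidate)) {
                    hasProperSuperset = true;
                    break;
                }
            }
            if (!hasProperSuperset && !result.contains(candidate)) {
                result.add(candidate);
            }
        }
        return result;
    }

    /** Convenience constructor for a sorted itemset. */
    static Set<Integer> set(Integer... items) {
        return new TreeSet<>(Arrays.asList(items));
    }

    public static void main(String[] args) {
        List<Set<Integer>> highUtility = new ArrayList<>();
        highUtility.add(set(2));
        highUtility.add(set(3));
        highUtility.add(set(2, 3));
        highUtility.add(set(4, 3));
        highUtility.add(set(4, 2));
        highUtility.add(set(4, 2, 3));
        System.out.println(maximal(highUtility));
    }
}
```

For this input only the itemset containing items 2, 3, and 4 survives, since each of the other five itemsets is contained in it.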
International Journal of Electrical Electronics & Computer Science Engineering

Fig. 2. Comparison of Execution Time
Fig. 3. Comparison of Memory Consumed
Fig. 4. Comparison of Number of High Utility Items Retrieved
Fig. 5. Comparison of Number of Maximal High Utility Items Retrieved

V. CONCLUSION

Utility mining, which considers the utility factors of itemsets, is an emerging topic in data mining. It is very beneficial in several real-life applications. In this work, a novel algorithm for effectively mining maximal high utility itemsets from a transaction database is proposed. It builds the UPMAXTree with two database scans and does not need to scan the database again afterwards. The maximal high utility itemsets give a compact list of the high utility itemsets. In this work, the UPMAXTree was constructed and the maximal high utility itemsets were then retrieved from it. The algorithm also implements the DGU and DGN pruning strategies; hence it reduces the execution time and the number of candidates generated. The algorithm was implemented in Java and experiments were done using the Chess dataset. The execution time, memory space, number of maximal high utility itemsets, and number of high utility itemsets retrieved by the algorithm were measured.

VI. REFERENCES

[1] V. S. Tseng, B.-E. Shie, C.-W. Wu, and P. S. Yu, Efficient Algorithms for Mining High Utility Itemsets from Transactional Databases, IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 8, Aug. 2013.
[2] C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, and Y.-K. Lee, Efficient Tree Structures for High Utility Pattern Mining in Incremental Databases, IEEE Trans. Knowledge and Data Eng., vol. 21, no. 12, pp. 1708-1721, Dec. 2009.
[3] R. Agrawal, T. Imielinski, and A. Swami, Mining Association Rules between Sets of Items in Large Databases, Proc. ACM SIGMOD International Conference on Management of Data, pp. 207-216, 1993.
[4] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules, Proc. 20th International Conference on Very Large Databases, Santiago, Chile, pp. 487-499, 1994.
[5] J. Han, J. Pei, Y. Yin, and R. Mao, Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach, Data Mining and Knowledge Discovery, vol. 8, no. 1, pp. 53-87, 2004.
[6] Y. Liu, W. Liao, and A. Choudhary, A Fast High Utility Itemsets Mining Algorithm, Proc. Utility-Based Data Mining Workshop, Aug. 2005.
[7] Y.-C. Li, J.-S. Yeh, and C.-C. Chang, Isolated Items Discarding Strategy for Discovering High Utility Itemsets, Data and Knowledge Eng., vol. 64, no. 1, Jan. 2008.
[8] C. H. Cai, A. W. C. Fu, C. H. Cheng, and W. W. Kwong, Mining Association Rules with Weighted Items, Proc. Int'l Database Eng. and Applications Symp. (IDEAS '98), 1998.
[9] R. Chan, Q. Yang, and Y. Shen, Mining High Utility Itemsets, Proc. IEEE Third Int'l Conf. Data Mining, pp. 19-26, Nov. 2003.
[10] V. S. Tseng, C.-W. Wu, B.-E. Shie, and P. S. Yu, UP-Growth: An Efficient Algorithm for High Utility Itemsets Mining, Proc. 16th ACM SIGKDD Conf. Knowledge Discovery and Data Mining (KDD '10), 2010.
[11] H. Yao, H. J. Hamilton, and L. Geng, A Unified Framework for Utility-Based Measures for Mining Itemsets, Proc. ACM SIGKDD Second Workshop Utility-Based Data Mining, Aug. 2006.
[12] J. Han, H. Cheng, D. Xin, and X. Yan, Frequent Pattern Mining: Current Status and Future Directions, Data Mining and Knowledge Discovery, Jan. 2007.