Mining High Average-Utility Itemsets

Similar documents
AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR UTILITY MINING. Received April 2011; revised October 2011

A New Method for Mining High Average Utility Itemsets

Incrementally mining high utility patterns based on pre-large concept

A New Method for Mining High Average Utility Itemsets

Performance and Scalability: Apriori Implementa6on

A Two-Phase Algorithm for Fast Discovery of High Utility Itemsets

High Utility Web Access Patterns Mining from Distributed Databases

Generation of Potential High Utility Itemsets from Transactional Databases

DMSA TECHNIQUE FOR FINDING SIGNIFICANT PATTERNS IN LARGE DATABASE

CS570 Introduction to Data Mining

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

A Review on Mining Top-K High Utility Itemsets without Generating Candidates

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

An Efficient Algorithm for finding high utility itemsets from online sell

FHM: Faster High-Utility Itemset Mining using Estimated Utility Co-occurrence Pruning

Chapter 4: Mining Frequent Patterns, Associations and Correlations

An Approximate Approach for Mining Recently Frequent Itemsets from Data Streams *

Model for Load Balancing on Processors in Parallel Mining of Frequent Itemsets

An Algorithm for Frequent Pattern Mining Based On Apriori

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

Fast Algorithm for Finding the Value-Added Utility Frequent Itemsets Using Apriori Algorithm

Appropriate Item Partition for Improving the Mining Performance

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 6

Utility Mining Algorithm for High Utility Item sets from Transactional Databases

Improved Frequent Pattern Mining Algorithm with Indexing

Parallel Mining of Maximal Frequent Itemsets in PC Clusters

Roadmap. PCY Algorithm

DATA MINING II - 1DL460

Association Rule Mining from XML Data

Mining of Web Server Logs using Extended Apriori Algorithm

Utility Mining: An Enhanced UP Growth Algorithm for Finding Maximal High Utility Itemsets

RHUIET : Discovery of Rare High Utility Itemsets using Enumeration Tree

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Data Mining Techniques

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

CHUIs-Concise and Lossless representation of High Utility Itemsets

Maintenance of the Prelarge Trees for Record Deletion

Mining Temporal Association Rules in Network Traffic Data

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity

Product presentations can be more intelligently planned

An Efficient Generation of Potential High Utility Itemsets from Transactional Databases

Anju Singh Information Technology,Deptt. BUIT, Bhopal, India. Keywords- Data mining, Apriori algorithm, minimum support threshold, multiple scan.

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

Mining Quantitative Association Rules on Overlapped Intervals

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

AN ENHANCED SEMI-APRIORI ALGORITHM FOR MINING ASSOCIATION RULES

ETP-Mine: An Efficient Method for Mining Transitional Patterns

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

PLT- Positional Lexicographic Tree: A New Structure for Mining Frequent Itemsets

Tutorial on Assignment 3 in Data Mining 2009 Frequent Itemset and Association Rule Mining. Gyozo Gidofalvi Uppsala Database Laboratory

Efficient Mining of High Average-Utility Itemsets with Multiple Minimum Thresholds

Temporal Weighted Association Rule Mining for Classification

Finding frequent closed itemsets with an extended version of the Eclat algorithm

A Fast Algorithm for Mining Rare Itemsets

A Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases *

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar

Data Structure for Association Rule Mining: T-Trees and P-Trees

Enhancing the Performance of Mining High Utility Itemsets Based On Pattern Algorithm

Structure of Association Rule Classifiers: a Review

IMPROVING APRIORI ALGORITHM USING PAFI AND TDFI

Comparative Study of Subspace Clustering Algorithms

Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports

Applying Data Mining to Wireless Networks

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining

Optimized Frequent Pattern Mining for Classified Data Sets

Frequent Pattern Mining

A Review on High Utility Mining to Improve Discovery of Utility Item set

Discovering High Utility Change Points in Customer Transaction Data

A Novel method for Frequent Pattern Mining

Chapter 4: Association analysis:

A SURVEY OF DIFFERENT ASSOCIATIVE CLASSIFICATION ALGORITHMS

Mining of High Utility Itemsets in Service Oriented Computing

BCB 713 Module Spring 2011

Sensitive Rule Hiding and InFrequent Filtration through Binary Search Method

Ascending Frequency Ordered Prefix-tree: Efficient Mining of Frequent Patterns

ISSN: ISO 9001:2008 Certified International Journal of Engineering Science and Innovative Technology (IJESIT) Volume 2, Issue 2, March 2013

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA

A Graph-Based Approach for Mining Closed Large Itemsets

Implementation of CHUD based on Association Matrix

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)

Implementation of Efficient Algorithm for Mining High Utility Itemsets in Distributed and Dynamic Database

Maintenance of fast updated frequent pattern trees for record deletion

Advance Association Analysis

An Improved Algorithm for Mining Association Rules Using Multiple Support Values

A Modern Search Technique for Frequent Itemset using FP Tree

STUDY ON FREQUENT PATTEREN GROWTH ALGORITHM WITHOUT CANDIDATE KEY GENERATION IN DATABASES

Basic Concepts: Association Rules. What Is Frequent Pattern Analysis? COMP 465: Data Mining Mining Frequent Patterns, Associations and Correlations

Salah Alghyaline, Jun-Wei Hsieh, and Jim Z. C. Lai

SIMULATED ANALYSIS OF EFFICIENT ALGORITHMS FOR MINING TOP-K HIGH UTILITY ITEMSETS

Finding Generalized Path Patterns for Web Log Data Mining

A Literature Review of Modern Association Rule Mining Techniques

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Keywords: Frequent itemset, closed high utility itemset, utility mining, data mining, traverse path. I. INTRODUCTION

SETM*-MaxK: An Efficient SET-Based Approach to Find the Largest Itemset

Closed Non-Derivable Itemsets

Performance Analysis of Apriori Algorithm with Progressive Approach for Mining Data

Using Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment

Incremental Mining of Frequently Correlated, Associated- Correlated and Independent Patterns Synchronously by Removing Null Transactions

Transcription:

Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics San Antonio, TX, USA - October 2009 Mining High Itemsets Tzung-Pei Hong Dept of Computer Science and Information Engineering tphong@nukedutw Cho-Han Lee Institute of Electrical Engineering prescott2005@hotmailcom Shyue-Liang Wang Dept of Information Management slwang@nukedutw Abstract The average utility measure is adopted in this paper to reveal a better utility effect of combining several items than the original utility measure A mining algorithm is then proposed to efficiently find the high average-utility itemsets It uses the summation of the maximal utility among the items in each transaction including the target itemset as the upper bounds to overestimate the actual average utilities of the itemset and processes it in two phases As expected, the mined high averageutility itemsets in the proposed way will be fewer than the high utility itemset under the same threshold Experiments results also show the performance of the proposed algorithm Keywords utility mining, average utility, two-phase mining, downward closure I INTRODUCTION In the past, Liu et al then presented a two-phase algorithm for fast discovering all high utility itemsets [2, 3] In this paper, we proposed a new idea to evaluate the utilities of itemsets Traditionally, the utility of an itemset is the summation of the utilities of the itemset in all the transactions regardless of its length Thus, the utility of an itemset in a transaction will increase along with the increase of its length That is, longer itemsets in a transaction result in higher utility values Thus, using the same minimum threshold to judge itemsets with different lengths is not fair In order to alleviate the effect of the length of itemsets and identify really good utility itemsets, the average utility measure is adopted in this paper to reveal a better utility effect of combining several items than the original utility measure It is defined as the total utility of an itemset divided by its number of items within it The average utility of an itemset is then compared with a threshold to decide whether it is a high average-utility itemset An algorithm is also proposed to find all the high average-utility itemsets Like two-phase mining for high utility itemsets, the proposed mining algorithm for high average-utility itemsets uses average-utility upper bounds to overestimate the actual average utilities of itemsets for satisfying the downward closure property The average-utility upper bound of an itemset is designed here as the summation of the maximal utility among the items in each transaction including the itemset Only the combinations of the itemsets which have their averageutility upper bounds beyond the user-defined threshold are added into the candidate set in a level-wise way The downward closure property can thus be maintained in this way Finally, the performance of the proposed mining algorithm is verified by real-world market data II REVIEW OF RELATED MINING ALGORITHMS Agrawal and Srikant proposed the Apriori algorithm [1] to mine association rules from a set of transactions In each pass, Apriori employs the downward-closure (anti-monotone) property to prune impossible candidates, thus improving the efficiency of identifying frequent itemsets Many other algorithms based on the property have then been proposed to discover frequent itemsets rapidly [4-7] Traditional association-rule mining does not, however, consider the quantities sold in transactions and the profit of each item sold, which are important to some applications as well Yao et al thus proposed the utility model to measure how useful an itemset is by considering both the quantities and the profits of items [8] In utility mining, the downward-closure property no long exists since the utility of an itemset will grow monotonically and the frequency of an itemset will reduce monotonically along with the number of items in an itemset The two different monotonic properties make the downwardclosure property invalid in utility mining Thus, Barber and Hamilton proposed the approaches of Zero pruning (ZP) and Zero subset pruning (ZSP) to exhaustively search for all high utility itemsets in the database [9] Li et al then proposed the FSM, the ShFSM and the DCG methods [10, 11] to discover all high utility itemsets by taking advantage of the level-closure property Besides, Yao proposed a framework for mining high utility itemsets based on mathematical properties of utility constraints [12] Liu et al then presented a two-phase algorithm for fast discovering all high utility itemsets [2, 3] The proposed approach is based on the two-phased approach III MINING HIGH AVERAGE-UTILITY ITEMSETS Traditionally, the utility of an itemset is the summation of the utilities of the itemset in all the transactions regardless of its length Thus, the utility of an itemset in a transaction will increase along with the increase of its length That is, longer itemsets in a transaction result in higher utility values For example, assume a transaction is given as shown in Table 1 There are five items in the transaction, respectively denoted A to E The value attached to each item is the quantity sold in the transaction TABLE 1 A TRANSACTION AS THE EXAMPLE TID A B C D E t x 1 1 4 1 0 978-1-4244-2794-9/09/$2500 2009 IEEE 2600

Assume the predefined profit of each item is defined in Table 2 The utility of the 1-itemset {A} in the transaction is thus calculated as 1*3, which is 3, according to the above two tables The utility of the 2-itemset {AB} in the transaction is calculated as 1*3 + 1*10, which is 13 Similarly, the utility of the 3-itemset {ABC} is calculated as 1*3 + 1*10 + 4*1, which is 17 Accordingly, the utility of the 3-itemset {ABC} is larger than the 2-itemset {AB}, which is further larger than the 1- itemset {A} Longer itemsets result in higher utility values This property is very obvious since longer itemsets will include some more items than their proper subsets This effect will attenuate the judgment about whether an itemset is really better than its subsets TABLE 2 THE PREDEFINED PROFIT VALUES OF THE ITEMS Item Profit A 3 B 10 C 1 D 6 E 5 The mined itemsets in the proposed way will be fewer than those in the original way under the same threshold Our proposed approach can thus be executed under a larger threshold than the original, thus with a more significant and relevant criterion The approach for mining useful itemsets under the proposed criterion is stated below IV THE PROPOSED ALGORITHM FOR MINING HIGH AVERAGE-UTILITY ITEMSETS There are two phases in the proposed algorithm In phase 1, the average-utility upper bound is used to overestimate the itemsets In phase 2, we just need to scan the database once to check the result of phase 1 is actually high or not Two-Phase algorithms for mining high average-utility itemsets INPUT: 1 A set of m items I = {i 1, i 2,, i j,, i m }, each i j with a profit value p j, j = 1 to m; 2 A transaction database D = {T 1, T 2,, T n }, in which each transaction includes a subset of items with quantities; 3 The minimum average-utility threshold λ OUTPUT: A set of high average-utility itemsets STEP 1:Calculate the utility value u jk of each item i j in each transaction T k as u jk = q jk *p j, where q jk is the quantity of i j in T k for j = 1 to m and k = 1 to n STEP 2:Find the maximal utility value mu k in each transaction T k as mu k = max{u 1k, u 2k,, u mk } for k = 1 to n STEP 3:Calculate the average-utility upper bound ub j of each item i j as the summation of the maximal utilities of the transactions which include i j That is: ubj = muk ij Tk STEP 4:Check whether the average-utility upper bound of an item i j is larger than or equal to λ If i j satisfies the above condition, put it in the set of candidate averageutility 1-itemsets, C 1 That is: C1 = { ij ubj λ,1 j m} STEP 5:Set r = 1, where r is used to represent the number of items in the current candidate average-utility itemsets to be processed STEP 6:Generate the candidate set C r+1 from C r with all the r- subitemsets in each candidate in C r+1 must be contained in C r STEP 7:Calculate the average-utility upper bound ub s of each candidate average-utility (r+1)-itemset as the summation of the maximal utilities of the transactions which include s That is: ubs = muk s Tk STEP 8:Check whether the average-utility upper bound of each candidate (r+1)-itemsets s is larger than or equal to λ If s does not satisfy the above condition, remove it from C r+1 That is: New Cr+ 1 = { s ubs λ, s original Cr+ 1} STEP 9:IF C r+1 is null, do the next step; otherwise, set r = r + 1 and repeat STEPs 6 to 9 STEP 10: For each candidate average-utility itemset s, calculate its actual average-utility value au s as follows: au s = s Tk ij s u jk s, where u jk is the utility value of each item i j in transaction T k and s is the number of items in s STEP 11: Check whether the actual average-utility value au s of each candidate average-utility itemset s is larger than or equal to λ If s satisfies the above condition, put it in the set of high average-utility itemsets, H That is: H = { saus λ, s C}, where C is the set of all the candidate average-utility itemsets V AN EXAMPLE In this section, an example is given to demonstrate the proposed mining algorithm based on the average-utility of items Assume the ten transactions shown in Table 3 are used 2601

for mining Each transaction consists of two features, transaction identification (TID) and items purchased TABLE 3 THE SET OF TEN TRANSACTION DATA FOR THIS EXAMPLE TID A B C D E t 1 1 1 4 1 0 t 2 0 1 0 3 0 t 3 2 0 0 1 0 t 4 0 0 1 0 0 t 5 1 2 0 1 3 t 6 1 1 1 1 1 t 7 0 2 3 0 1 t 8 0 0 0 1 2 t 9 7 0 1 1 0 t 10 0 1 1 1 1 Also assume that the predefined profit value for each single item is defined in Table 4 TABLE 4 THE PREDEFINED PROFIT VALUES OF THE ITEMS Item Profit A 3 B 10 C 1 D 6 E 5 Moreover, the minimum average-utility threshold λ is set as 454, which is the 20% of total utility In order to find the high average-utility itemsets from the data in Table 3, the proposed mining algorithm proceeds as follows After STEPs 1 to 3, the upper-bound values of all the items are shown in Table 5 TABLE 5 THE AVERAGE-UTILITY UPPER BOUNDS OF 1-ITEMSETS Itemset A 67 B 88 C 72 D 105 E 70 Check whether the average-utility upper bound of 1- itemsets exceeds the minimum average-utility threshold λ All the items are recorded as candidate average-utility 1-itemsets, C 1, because their average-utility upper bounds are larger than or equal to the user-defined minimum average-utility threshold λ, which is 454 Then the candidate average-utility 2-itemsets (C 2 ) are generated from C 1 and the average-utility upper bound of each 2-itemset is calculated The upper-bound values of all the 2- itemsets are shown in Table 6 TABLE 6 THE AVERAGE-UTILITY UPPER BOUNDS OF THE 2-ITEMSETS 2-Itemset AB 40 AC 41 AD 67 AE 30 BC 50 BD 68 BE 60 CD 51 CE 40 DE 50 After this first phase, all the candidate average-utility itemsets are shown in Table 7 TABLE 7 ALL THE CANDIDATE AVERAGE-UTILITY ITEMSETS IN THE EXAMPLE Itemset A 67 B 88 C 72 D 105 E 70 AD 67 BC 50 BD 68 BE 60 CD 51 DE 50 Next, the second phase begins and the actual averageutility value au s of each candidate average-utility itemset is calculated The actual average-utility value of each candidate average-utility itemset is then compared with the user-defined minimum average-utility threshold λ In this example, the final results are shown in Table 8 TABLE 8 HIGH AVERAGE-UTILITY ITEMSETS High Average- Utility Itemset B 80 D 60 BD 51 2602

Three high average-utility itemsets are generated Note that if the traditional utility criterion is used, the results will be {B}, {D}, {AD}, {BC}, {BD}, {BE} and {DE} The number of the high average-utility itemsets is less than that of the high utility itemset VI EXPERIMENTAL RESULTS Experiments were made to show the performance of the proposed approach All the experiments were performed on an Intel Core 2 Duo E6550 (233GHz) PC with 2 GB main memory, running the Windows XP Professional operating systems The proposed algorithm was implemented in Visual C# 90 and applied to a real data set A real data set from a major grocery chain store in America was used for the experiments There were 21,556 transactions and 1,559 distinct items in the database Each transaction consisted of the products sold and their quantities The average transaction length is 403 The total profit of the dataset is $104,450,739 Figure 1 shows the number of candidate itemsets generated by our approach (TPAU) vs Liu s TP The minimum utility threshold varies from 0008% to 0012% Compared to TP, TPAU generates much fewer candidate itemsets The number of candidate itemsets generated by TPAU decreases substantially Number of Itemsets 250000 200000 150000 100000 50000 0 0008% 0009% 0010% 0011% 0012% Minimum Utility Threshold TPAU Figure 1 Number of candidate itemsets with varying minimum utility threshold of TPAU vs TP Table 9 presents the summary of the number of candidate itemsets (CI), high average-utility itemsets (HAUI), and high utility itemsets (HUI) of TPAU vstp In Phase I, TPAU generates much fewer candidate itemsets In Phase II, the number of high average-utility itemsets (HAUI) is much less than that of high utility itemsets (HUI) TPAU can discover high average-utility itemsets whose utility values are much closer to the minimum utility threshold compared to high utility itemsets TP TABLE 9 COMPARISON OF THE NUMBER OF CANDIDATE ITEMSETS (CI), HIGH AVERAGE-UTILITY ITEMSETS (HAUI), AND HIGH UTILITY ITEMSETS (HUI) OF TPAU VSTP CI (Phase I) Phase II Threshold TPAU TP HAUI HUI 0012% 1583 37707 1556 3497 0011% 1614 53324 1557 4557 0010% 1677 80735 1565 6486 0009% 1896 125920 1579 9997 0008% 2288 197251 1605 18005 Figure 2 shows the execution time of TPAU vs TP The execution time of TPAU is less than TP in all the compared minimum threshold setting Execution Time(sec) 3500 3000 2500 2000 1500 1000 500 0 0008% 0009% 0010% 0011% 0012% Minimum Utility Threshold TPAU Figure 2 Execution time with varying minimum utility threshold of TPAU vs TP VII CONCLUSIONS This paper defines a new mining measure called average utility and proposes a two-phase mining algorithm to discover high average-utility itemsets The proposed mining algorithm is divided into phases In phase I, this algorithm overestimates the utility of itemsets from the perspective of transactions, this process matins the downward closure property to efficiently prune impossible utility itemsets level by level In phase II, according to the candidate itemsets generated from phase I, one database scan is needed to determine the actual high averageutility itemsets Considering that the length of itemsets is a major factor to influence the utility values of itemsets The measure average-utility is used to evaluate the utility values The experimental results show that our algorithms can obtain fewer itemsets with purer and accurate utility REFERENCES [1] R Agrawal and R Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," in Proceedings of the 20th International Conference on Very Large Data Bases: Morgan Kaufmann Publishers Inc, 1994 TP 2603

[2] Y Liu, W Liao, and A Choudhary, "A Two-Phase Algorithm for Fast Discovery of High Utility Itemsets," in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2005 [3] Y Liu, W-k Liao, and A Choudhary, "A Fast High Utility Itemsets Mining Algorithm," in Proceedings of the 1st International Workshop on Utility-based Data Mining Chicago, Illinois: ACM, 2005 [4] M J Zaki, S Parthasarathy, M Ogihara, and W Li, "New Algorithms for Fast Discovery of Association Rules," University of Rochester, 1997 [5] S Brin, R Motwani, J D Ullman, and S Tsur, "Dynamic Itemset Counting and Implication Rules for Market Basket Data," in Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data Tucson, Arizona, United States: ACM, 1997 [6] J S Park, M-S Chen, and P S Yu, "An Effective Hash-based Algorithm for Mining Association Rules," in Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data San Jose, California, United States: ACM, 1995 [7] J Han, J Pei, and Y Yin, "Mining Frequent Patterns without Generation," in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data Dallas, Texas, United States: ACM, 2000 [8] H Yao, H Hamilton, and C Butz, "A Foundational Approach to Mining Itemset Utilities from Databases," in Proc of the 4 th SIAM International Conference on Data Mining, 2004, pp 211-225 [9] B Barber and H J Hamilton, "Algorithms for Mining Share Frequent Itemsets Containing Infrequent Subsets," in Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery: Springer-Verlag, 2000 [10] Y Li, J Yeh, and C Chang, "Efficient Algorithms for Mining Share- Frequent Itemsets," in Fuzzy Logic, Soft Computing and Computational Intelligence-11th World Congress of International Fuzzy Systems Association (IFSA 2005), 2005, pp 534 539 [11] Y Li, J Yeh, and C Chang, "Direct s Generation: A Novel Algorithm for Discovering Complete Share-Frequent Itemsets," Lecture Notes in Computer Science, vol 3614, p 551, 2005 [12] H Yao and H J Hamilton, "Mining Itemset Utilities from Transaction Databases," Data Knowl Eng, vol 59, pp 603-626, 2006 2604