Mining of High Utility Itemsets in Service Oriented Computing


Mamta Singh¹, D.R. Ingle²
¹,² Department of Computer Engineering, Bharati Vidyapeeth's College of Engineering, Kharghar, Navi Mumbai
¹ singhmamta86@gmail, ² dringleus@gmail.com

Abstract: Service Oriented Computing that offers knowledge as a service can make use of utility mining. We propose an architecture called Knowledge as a Service (KaaS), in which utility mining algorithms extract knowledge from data owners whenever knowledge consumers need it. The main motive behind the architecture is to provide utility mining as a service in a distributed computing environment, which can be applied in business settings such as cross-selling. Efficient discovery of frequent itemsets in large datasets is a crucial task in data mining. Several approaches proposed in recent years for generating high utility patterns produce a large number of candidate itemsets, which can degrade mining performance in terms of both speed and space. A recently proposed compact tree structure, the UP-Tree, maintains the information of transactions and itemsets, improves mining performance, and avoids repeated scans of the original database. In this paper the UP-Tree (Utility Pattern Tree) is adopted; it scans the database only twice to obtain candidate items and manages them in an efficient data structure. Applying the UP-Tree to UP-Growth takes more execution time in Phase II. This paper therefore presents a modified algorithm that aims to reduce execution time by identifying high utility itemsets effectively.

Keywords: Service Oriented Computing, Knowledge as a Service, Candidate Pruning, Utility Mining, Data Mining, Frequent Itemsets, Downward Closure Property, UP-Growth Algorithm, UP-Growth+ Algorithm.

I. INTRODUCTION

Data mining is the process of revealing nontrivial, previously unknown and potentially useful information from large databases. Discovering useful patterns hidden in a database plays an essential role in several data mining tasks, such as frequent pattern mining, weighted frequent pattern mining, and high utility pattern mining. Among them, frequent pattern mining is a fundamental research topic that has been applied to different kinds of databases, such as transactional databases, streaming databases and time series databases, and to various application domains, such as bioinformatics, Web clickstream analysis and mobile environments.

Nevertheless, the relative importance of each item is not considered in frequent pattern mining. To address this problem, weighted association rule mining was proposed. In this framework, weights of items, such as unit profits of items in transaction databases, are considered. With this concept, even if some items appear infrequently, they might still be found if they have high weights. However, this framework does not consider the quantities of items. It therefore cannot satisfy users who are interested in discovering itemsets with high sales profits, since profits are composed of unit profits (i.e., weights) and purchased quantities.

In view of this, utility mining has emerged as an important topic in data mining. Mining high utility itemsets from databases refers to finding the itemsets with high profits. Here, the utility of an itemset means its interestingness, importance, or profitability to users. The utility of items in a transaction database consists of two aspects: 1) the importance of distinct items, which is called external utility, and 2) the importance of items in transactions, which is called internal utility. The utility of an itemset is defined as the product of its external utility and its internal utility.
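As a rough illustration of the definition above, the following minimal Python sketch computes the utility of an item in a transaction as quantity (internal utility) times unit profit (external utility), and the utility of an itemset as the sum over every transaction that contains it. The item names, profits and quantities are hypothetical and are not taken from the paper's tables; the function names are likewise illustrative.

```python
from typing import Dict, Iterable

# Hypothetical data for illustration only; these are not the paper's tables.
UNIT_PROFIT: Dict[str, int] = {"A": 3, "B": 10, "C": 1}  # external utility per item
TRANSACTIONS: Dict[str, Dict[str, int]] = {              # internal utility: purchased quantities
    "T1": {"A": 2, "C": 5},
    "T2": {"A": 1, "B": 1, "C": 2},
    "T3": {"B": 3},
}


def item_utility(item: str, tid: str) -> int:
    """Utility of one item in one transaction: quantity times unit profit."""
    return TRANSACTIONS[tid].get(item, 0) * UNIT_PROFIT[item]


def itemset_utility(itemset: Iterable[str]) -> int:
    """Sum the item utilities over every transaction that contains the whole itemset."""
    items = set(itemset)
    total = 0
    for tid, quantities in TRANSACTIONS.items():
        if items.issubset(quantities):  # iterating a dict yields its keys (the items)
            total += sum(item_utility(item, tid) for item in items)
    return total


if __name__ == "__main__":
    print(itemset_utility({"A", "C"}))  # 16
    print(itemset_utility({"B"}))       # 40
```

In this toy data, {B} occurs in only two transactions yet has a higher utility than {A, C}, which is the kind of pattern a purely frequency-based measure would rank lower.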
An itemset is called a high utility itemset if its utility is no less than a user-specified minimum utility threshold; otherwise, it is called a low-utility itemset. Mining high utility itemsets from databases is an important task with a wide range of applications, such as website clickstream analysis, business promotion in chain hypermarkets, cross-marketing in retail stores, online e-commerce management, mobile commerce environment planning, and even finding important patterns in biomedical applications.

The Service-Oriented Computing (SOC) paradigm uses services to support the development of rapid, low-cost, interoperable, evolvable, and massively distributed applications. A web service is an effort to build a distributed computing platform for the Web. Services are self-governing, platform-independent units that can be described, published, discovered, and loosely coupled in novel ways. Services reflect a service-oriented approach to programming that is based on the idea of composing applications by discovering and invoking network-available services to achieve some task.

However, mining high utility itemsets from databases is not an easy task, since the downward closure property used in frequent itemset mining does not hold. In other words, pruning the search space for high utility itemset mining is difficult because a superset of a low-utility itemset may be a high utility itemset. A naive method is to enumerate all itemsets from the database exhaustively. This method suffers from a very large search space, especially when the database contains many long transactions or a low minimum utility threshold is set. Hence, how to prune the search space effectively and capture all high utility itemsets without missing any is a crucial challenge in utility mining.

Existing studies apply overestimation methods to improve the performance of utility mining. In these methods, potential high utility itemsets (PHUIs) are found first, and then an additional database scan is performed to identify their utilities. However, existing methods often generate a huge set of PHUIs, and their mining performance degrades as a consequence. The situation may become worse when databases contain many long transactions or low thresholds are set. The huge number of PHUIs is a challenge for mining performance, since the more PHUIs the algorithm generates, the more processing time it consumes. To address this issue, we propose two novel algorithms as well as a compact data structure for efficiently discovering high utility itemsets from transactional databases. The major contributions of this work are summarized as follows:

1. Two algorithms, named utility pattern growth (UP-Growth) and UP-Growth+, and a compact tree structure, called utility pattern tree (UP-Tree), are proposed for discovering high utility itemsets and maintaining important information related to utility patterns within databases. High utility itemsets can be generated from the UP-Tree efficiently with only two scans of the original database.
2. Several strategies are proposed to facilitate the mining processes of UP-Growth and UP-Growth+ by maintaining only essential information in the UP-Tree. With these strategies, the overestimated utilities of candidates can be reduced by discarding the utilities of items that cannot be high utility or are not involved in the search space. The proposed strategies not only decrease the overestimated utilities of PHUIs but also greatly reduce the number of candidates.
3. Different types of both real and synthetic data sets are used in a series of experiments to compare the performance of the proposed algorithms with state-of-the-art utility mining algorithms. Experimental results show that UP-Growth and UP-Growth+ outperform other algorithms substantially in terms of execution time, especially when databases contain many long transactions or low minimum utility thresholds are set.

II. LITERATURE REVIEW

Traditional approaches in data mining focus on support and confidence measures, which are purely statistical. Support and confidence are based on the frequency count of items and enable us to derive the frequent itemsets. The frequency of items, as a single factor, does not represent the interestingness of the items. Several studies have therefore been conducted to enhance data mining tasks by taking the value of the product into account.
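To make the frequency-based measures concrete, here is a minimal Python sketch of support and confidence under their usual textbook definitions: support is the fraction of transactions containing an itemset, and confidence of a rule is the support of the combined itemset divided by the support of the antecedent. The transactions and item names are hypothetical.

```python
from typing import Iterable, List, Set

# Hypothetical market-basket data for illustration only.
TRANSACTIONS: List[Set[str]] = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item of the itemset."""
    items = set(itemset)
    hits = sum(1 for tx in TRANSACTIONS if items.issubset(tx))
    return hits / len(TRANSACTIONS)


def confidence(antecedent: Iterable[str], consequent: Iterable[str]) -> float:
    """Confidence of the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both) / support(antecedent)


if __name__ == "__main__":
    print(support({"bread", "milk"}))       # 0.5
    print(confidence({"bread"}, {"milk"}))  # 2/3: two of the three bread baskets also hold milk
```

Both measures count occurrences only; neither the price of an item nor the quantity bought enters the calculation, which is the limitation the utility-based approaches address.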
Association rules were developed in the field of computer science and are often used in important applications such as market basket analysis, to measure the associations between products purchased by a particular customer, and web clickstream analysis, to measure the associations between pages viewed sequentially by a website visitor. In general, the objective is to identify groups of items that typically occur together in a set of transactions. Many recognized association rule mining (ARM) algorithms, such as Apriori [1] and FP-Growth, are used to generate association rules. Tao et al. [13] indicated that traditional ARM methods cannot discover significant rules with low support and heavy weight; instead, they generate less important rules with high support and light weight. To overcome this weakness of traditional association rule mining, the utility mining model and weighted association rule mining have been proposed.

Weighted Association Rule Mining (WARM) is an approach that considers the weight of each item. Unlike approaches that allow multiple minimum supports, WARM does not need to convert each item's weight into a minimum support threshold. However, WARM does not consider each item's quantity. WARM deals with the importance of individual items in a database [9]. WARM [2][12][10][14] is concerned with analysing the significance of items or transactions in a set of data. It is proposed to find different kinds of interesting patterns in data with item weights or transaction weights. The weights in these approaches may be thought of as an extension of traditional support in association rule mining.
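The idea that weights extend traditional support can be sketched as follows. The weighting scheme used here is an assumption chosen for illustration (a transaction's weight is the average weight of its items, and weighted support is the weight share of the transactions containing the itemset); it is not necessarily the exact definition used in the cited works. Item names and weights are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical item weights and transactions, for illustration only.
ITEM_WEIGHT: Dict[str, float] = {"pen": 0.1, "ink": 0.2, "laptop": 0.9, "dock": 0.8}
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"pen", "ink"}),
    frozenset({"pen", "ink"}),
    frozenset({"pen"}),
    frozenset({"laptop", "dock"}),
]


def support(itemset: Iterable[str]) -> float:
    """Plain support: fraction of transactions containing the itemset."""
    items = set(itemset)
    return sum(1 for tx in TRANSACTIONS if items.issubset(tx)) / len(TRANSACTIONS)


def transaction_weight(tx: FrozenSet[str]) -> float:
    """Assumed scheme: a transaction weighs the average of its item weights."""
    return sum(ITEM_WEIGHT[item] for item in tx) / len(tx)


def weighted_support(itemset: Iterable[str]) -> float:
    """Weight share of the transactions that contain the itemset."""
    items = set(itemset)
    total = sum(transaction_weight(tx) for tx in TRANSACTIONS)
    covered = sum(transaction_weight(tx) for tx in TRANSACTIONS if items.issubset(tx))
    return covered / total


if __name__ == "__main__":
    for candidate in ({"pen", "ink"}, {"laptop", "dock"}):
        print(sorted(candidate), support(candidate), round(weighted_support(candidate), 2))
```

With these numbers, {laptop, dock} has half the plain support of {pen, ink} but a much larger weighted support, which mirrors the "low support, heavy weight" case described above. Quantities still play no role in either measure.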

One of the key research areas in data mining is based on utility factors. Today, one of the most challenging tasks is mining high utility itemsets precisely. The identification of high utility itemsets is known as utility mining. Utility mining is a broad area that covers all aspects of economic utility in data mining. The utility value of an itemset can be computed in terms of cost, profit or other expressions of user preference. An itemset $x$ is said to be a high utility itemset if and only if $u(x) \ge \mathit{minutil}$, where $\mathit{minutil}$ is a user-defined minimum utility threshold.

Chan et al. proposed utility mining to discover high utility itemsets [3]. They considered both the individual profits and the quantities of products (items) in transactions, and used them to find the actual utility values of itemsets. Formally, local transaction utility and external utility are used to measure the utility of an item. The local transaction utility of an item is obtained directly from the information stored in a transaction dataset, such as the quantity of the item sold in a transaction. The external utility of an item is given by users, such as its profit [8]. Liu et al. proposed a two-phase algorithm to discover high utility itemsets from a database by adopting the downward closure property. They named their approach the Transaction-Weighted-Utilization (TWU) model. The model uses the whole utility of a transaction as the upper bound on the utility of an itemset in that transaction, which preserves the downward-closure property. Chun-Jung Chu et al. [4] proposed an innovative approach to mine HUIs from the [...] (iii) present the computational experiments for investigating genetic parameters and operators using three sizes of dataset [3].

Guangzhu Yu et al. [6] proposed a hybrid method composed of a row enumeration algorithm (Inter-transaction) and a column enumeration algorithm (Two-phase) to discover high utility itemsets from two directions: Two-phase seeks short high utility itemsets from the bottom, while Inter-transaction seeks long high utility itemsets from the top. In addition, an optimization technique was adopted to improve the performance of computing the intersection of transactions.

Contemporary research incorporates both attributes (weight and utility) for mining valuable association rules. M. Sulaiman Khan et al. [12] proposed Weighted Utility Association Rule Mining (WUARM), which can be considered an extension of weighted association rule mining in the sense that it treats item weights as their significance in the dataset and also deals with the frequency of occurrence of items in transactions. Weighted utility association rule mining is thus concerned with both the frequency and the significance of itemsets, and helps identify the most valuable and best-selling items, which contribute most to the company's profit.

In practice, finding high utility itemsets in a large set of stored data was difficult, and the computation of the itemsets incurred time delays. To overcome these delays, S. Shankar et al. [11] proposed a novel algorithm called Fast Utility Mining, with which a greater number of itemsets were identified as high utility itemsets within a given utility threshold, even when the number of distinct items in the database increases. Fast Utility Mining was reported experimentally to compute more effectively and faster than previous utility mining techniques. Kannimuthu et al. [7] improved on Fast Utility Mining (FUM); their system computes even faster than FUM at the minimum utility threshold and is termed Improved Fast Utility Mining (iFUM).

Xu et al. [15] proposed a methodology for Knowledge as a Service (KaaS) and knowledge breaching, which offers new types of services based on knowledge typically extracted from large volumes of data owned and maintained by different parties. It also focuses on the security aspect of the paradigm, particularly on the knowledge breaching attack, which may allow an adversary to recover the knowledge underlying a knowledge service.

In this work, we provide Knowledge as a Service through web services that make use of utility mining, and high utility itemsets are extracted from XML data to provide more versatility. Mining XML data, in both structure and semantics, has always been more challenging than mining traditional databases. Obtaining well-organized data with useful information from the Web has always been difficult, but XML overcomes this difficulty.

III. METHODOLOGY

The goal is to mine high utility itemsets from databases in Service Oriented Computing, finding the itemsets with high profits efficiently in terms of speed and memory cost on large databases composed of long transactions, which are difficult for existing high utility itemset mining algorithms to handle. The framework of the proposed methods consists of three steps: 1) scan the database twice to construct a global UP-Tree with the first two strategies; 2) recursively generate PHUIs from the global UP-Tree and local UP-Trees by UP-Growth with the third and fourth strategies, or by UP-Growth+ with the last two strategies; and 3) identify the actual high utility itemsets from the set of PHUIs. Note that we use the new term "potential high utility itemsets" to distinguish the patterns found by our methods from high transaction-weighted utilization itemsets (HTWUIs), since our methods are not based on the traditional TWU model. With our strategies, the set of PHUIs becomes much smaller than the set of HTWUIs.

UP-Growth is one of the efficient algorithms for generating high utility itemsets, and it depends on the construction of a global UP-Tree (Phase 1). The framework of the UP-Tree follows three steps: (i) construct the UP-Tree; (ii) generate PHUIs from the UP-Tree; (iii) identify high utility itemsets using the PHUIs. The global UP-Tree is constructed as follows: (i) Discarding global unpromising items (the DGU strategy) eliminates low utility items and their utilities from the transaction utilities. (ii) Discarding global node utilities (the DGN strategy) is applied during global UP-Tree construction; with the DGN strategy, the utilities of nodes nearer to the root of the UP-Tree are effectively reduced. PHUIs are similar to the itemsets found under the TWU model in that itemset utilities are computed with the help of an estimated utility. Finally, high utility itemsets (those with utility not less than min_sup) are identified from the PHUIs. The global UP-Tree contains many sub-paths. Each path is considered starting from the bottom node of the header table, and such a path is called a conditional pattern base (CPB).

3.1 Basic Concepts and Definitions

This subsection starts with the definition of a set of terms that leads to the formal definition of the utility mining problem given in [16]. $I = \{i_1, i_2, i_3, \ldots, i_m\}$ is a set of items, and $D = \{T_1, T_2, T_3, \ldots, T_n\}$ is the transaction database, where each transaction $T_i \in D$ is a subset of $I$. $o(i_p, T_q)$, the local transaction utility value, represents the quantity of item $i_p$ in transaction $T_q$. For example, $o(A, T_8) = 3$ in Table 1. $s(i_p)$, the external utility, is the value associated with item $i_p$ in the utility table. This value reflects the importance of an item, which is independent of transactions.
For example, in Table 2, the external utility of item A, $s(A)$, is 3. $u(i_p, T_q)$, the utility, is the quantitative measure of the utility of item $i_p$ in transaction $T_q$ and is defined as $o(i_p, T_q) \times s(i_p)$. For example, $u(A, T_8) = 3 \times 3$ in Table 1. $u(X, T_q)$, the utility of an itemset $X$ in transaction $T_q$, is defined as $\sum_{i_p \in X} u(i_p, T_q)$, where $X = \{i_1, i_2, i_3, \ldots, i_k\}$ is a $k$-itemset, $X \subseteq T_q$ and $1 \le k \le |I|$. $u(X)$, the utility of an itemset $X$, is defined as $\sum_{T_q \in D \,\wedge\, X \subseteq T_q} u(X, T_q)$.

Table 1. Transaction Table with 9 Transactions and 5 Distinct Items (columns: TID, A, B, C, D, E)

Table 2. Profit Table (columns: Item, Profit; items A, B, C, D, E)

We find all the high utility itemsets using utility mining. An itemset $X$ is a high utility itemset if $u(X) \ge \mathit{minutil}$, where $X \subseteq I$ and $\mathit{minutil}$ is the minimum utility threshold. For example, in Table 1, $u(A, T_8) = 3 \times 3 = 9$, $u(\{A,D,E\}, T_8) = u(A, T_8) + u(D, T_8) + u(E, T_8) = 32$, and $u(\{A,D,E\}) = u(\{A,D,E\}, T_4) + u(\{A,D,E\}, T_8) = 14 + 32 = 46$. If $\mathit{minutil} = 120$, then {A,D,E} is not a high utility itemset.

3.2 UP-Growth+ Algorithm

Although the DGU and DGN strategies efficiently reduce the number of candidates in the global UP-Tree, they cannot be applied during the construction of local UP-Trees. Instead, the DLU strategy (discarding local unpromising items) discards the utilities of low utility items from the path utilities of the paths, and the DLN strategy (discarding local node utilities) discards the item utilities of descendant nodes during local UP-Tree construction. Even so, the algorithm still faces performance issues in Phase II. To overcome this, the maximum transaction-weighted utilization (MTWU) is computed over all items, and min_sup is set to the product of MTWU and a user-specified threshold value, as shown in the algorithm below. This modification improves on the existing UP-Tree construction and also improves the performance of the UP-Growth algorithm. The improved utility pattern growth algorithm is abbreviated as IUPG.

Input: Transaction database D, user-specified threshold.
Output: High utility itemsets.
Begin
1) Scan the database of transactions Td ∈ D.
2) Determine the transaction utility of Td in D and the TWU of itemset X.
3) Compute min_sup = MTWU × user-specified threshold.
4) If TWU(X) < min_sup, then remove the items from the transaction database.
5) Else insert them into header table H, keeping the items in descending order.
6) Repeat steps 4 and 5 until the end of D.
7) Insert Td into the global UP-Tree.
8) Apply the DGU and DGN strategies on the global UP-Tree.
9) Reconstruct the UP-Tree.
10) For each item ai in H do
11) Generate a PHUI Y = X ∪ ai.
12) Set the estimated utility of Y to ai's utility value in H.
13) Put local promising items in Y-CPB into H.
14) Apply strategy DLU to reduce the path utilities of the paths.
15) Apply strategy DLN and insert the paths into Td.
16) If Td ≠ null, then call the for loop.
End for
End.

IV. EXPERIMENTATION: SHOPPING CART SYSTEM

The proposed system is implemented in a real-time application called the Online Shopping Cart System. The website displays the list of all available items along with their detailed descriptions. Based on their needs, customers may choose the required product and add it to the shopping cart. Once an item is added to the shopping cart, it is sent to the online shopping cart server. The online shopping cart server in turn calls the Utility Mining service to find the High Utility Itemsets. In this work, the UP-Growth+ algorithm is used to provide KaaS. The Utility Mining service is responsible for extracting knowledge from databases distributed across different locations: it gets data from data owners located anywhere in the distributed network and returns the HUIs to the Online Shopping Cart Server. The server then checks whether, in the HUIs, the added item is associated with any item of the organization that has low sales or with any product of a similar category that is new to the market.
If so, the associated item is offered at a discount or for free to increase its sales or to promote the product. This information is displayed on the website after the user selects an item, which encourages consumers to buy the item and the associated item together. In this way, the proposed framework is used effectively in e-commerce.

Table 3. Profit Table for the Items

Item Name    Profit
Webcam       5
Player       100
Printer      38
Laptop       1

Table 4. Transaction Dataset (columns: TID, Webcam, Player, Printer, Laptop)

Table 5. Support and Profit for all ItemSets (columns: ItemSets, Support, Profit; one row for each of the 15 non-empty combinations of Webcam, Player, Printer and Laptop)

The experiment is conducted on a sample shopping cart data set, and the comparison of the proposed approach with a support-based measure is presented in Table 5. Here we use $190 as the minimum utility threshold. The results show that some itemsets with low support values are High Utility Itemsets. Algorithms based on support-based measures fail to retrieve these High Utility Itemsets.

V. CONCLUSION AND FUTURE WORK

A novel computing paradigm called Knowledge as a Service (KaaS) has been proposed for mining High Utility Itemsets (HUIs) in a distributed environment, where all transformations are carried out through web services. The key idea behind working in a distributed computing environment is that the data integration cost is greatly reduced, because KaaS provides better data independence than a centralized environment. The KaaS paradigm proposed in this paper uses the UP-Growth+ algorithm in a distributed environment. Future work will focus on improving Phase I in terms of execution time and memory cost.

REFERENCES

[1] Agrawal, R., Imielinski, T., Swami, A.N., Mining association rules between sets of items in large databases, In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, DC, 1993.
[2] C. H. Cai, A. W. C. Fu, C. H. Cheng, and W. W. Kwong, Mining association rules with weighted items, The International Database Engineering and Applications Symposium (IDEAS).
[3] Chan, R., Yang, Q., and Shen, Y. D., Mining high utility itemsets, In Proceedings of the 2003 IEEE International Conference on Data Mining, Melbourne, FL, November 2003.
[4] Chun-Jung Chu, Vincent S. Tseng, Tyne Liang, An efficient algorithm for mining temporal high utility itemsets from data streams, Journal of Systems and Software, Vol. 81, No. 7.
[5] Chun-Jung Chu, Vincent S. Tseng, Tyne Liang, An efficient algorithm for mining high utility itemsets with negative item values in large databases, Applied Mathematics and Computation, Elsevier (2009), Vol. 215(2).
[6] Guangzhu Yu, Keqing Li, Shihuang Shao, Mining High Utility Itemsets in Large High Dimensional Data, International Workshop on Knowledge Discovery and Data Mining (WKDD).
[7] Kannimuthu S, Dr. K. Premalatha, S. Shankar, iFUM - Improved Fast Utility Mining, International Journal of Computer Applications (2011), Vol. 11(6).
[8] Y. Liu, W. Liao, and A. Choudhary, A fast high utility itemsets mining algorithm, The Utility-Based Data Mining Workshop.
[9] Liu, Y., Liao, W.K., and Choudhary, A., A two-phase algorithm for fast discovery of high utility itemsets, In Proc. of the Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD).
[10] S. Lu, H. Hu, and F. Li, Mining weighted association rules, Intelligent Data Analysis, Vol. 5, No. 3, 2001.
[11] S. Shankar, Dr. T. Purusothaman, S. Jayanthi, A Fast Algorithm for Mining High Utility Itemsets, IEEE International Advance Computing Conference (IACC 2009), Patiala, India, 6-7 March.
[12] M. Sulaiman Khan, M. Muyeba, and F. Coenen, A weighted utility framework for mining association rules, The 2008 Second UKSIM European Symposium on Computer Modeling and Simulation, 2008.
[13] Tao, F., Murtagh, F., and Farid, M., Weighted association rule mining using weighted support and significance framework, Proc. of International Conference on Knowledge Discovery and Data Mining.
[14] B. Y. Wang and S. M. Zhang, A mining algorithm for fuzzy weighted association rules, The Second International Conference on Machine Learning and Cybernetics.
[15] S. Xu and W. Zhang, Knowledge as service and knowledge breaching (full version), Technical report, Department of Computer Science, University of Texas at San Antonio, 2005.
[16] Yao, H., Hamilton, H.J. and Butz, C. J., A foundational approach to mining itemset utilities from databases, Proc. of the 4th SIAM International Conference on Data Mining, Florida, USA, 2004.
