INVESTIGATIONS ON MODERN ALGORITHM FOR UTILITY MINING


Volume 119 No. 16 2018, 4451-4460 ISSN: 1314-3395 (on-line version) url: http://www.acadpubl.

1 Nandhini S S, 2 Kannimuthu S
1 Assistant Professor/CSE, Bannari Amman Institute of Technology, Erode, TamilNadu, India
2 Associate Professor/CSE, Karpagam College of Engineering, Coimbatore, TamilNadu, India
nandhiniss@bitsathy.ac.in

Abstract

Data mining can generate millions of patterns from a given data set. Ironically, the resulting patterns themselves then need to be mined again. Most patterns identified by traditional data mining algorithms are either already known to the group that owns the data, or have no commercial value and bring no profit to that group. Because the data is uncertain, the results are cluttered with non-profitable, unwanted, infrequent and uninteresting patterns. Frequent itemset mining removes some of these unwanted patterns, and high utility itemset mining clears the remaining clutter by returning patterns that are both frequent and profitable. This survey reviews various algorithms proposed for mining high utility itemsets from uncertain databases and compares them by domain, data structure used and data set utilized. It also gives strategies for selecting an appropriate algorithm for an application and identifies opportunities for further development in utility mining.

1. Introduction

This article surveys popular algorithms for utility mining and their further development. Data mining algorithms mine patterns, and frequent itemset mining, an extension of data mining, produces frequent patterns. In the age of Big Data, uncertainty is very common in data. Data is constantly growing in volume, variety, velocity and uncertainty. Uncertain data is found in abundance today on the web, in sensor networks and within enterprises, in both structured and unstructured sources.
Mining such uncertain data is important for discovering interesting, highly profitable itemsets. As one of the most fundamental issues of uncertain data mining, the problem of mining uncertain frequent itemsets has attracted much attention in the database and data mining communities. Although some efficient approaches to mining uncertain frequent itemsets have been proposed, most of them only consider each item in a transaction as a random variable and ignore the utility of each item in real scenarios.

Frequent pattern mining is a popular problem in data mining, which consists of finding frequent patterns in transaction databases. The objective of frequent itemset mining is to find frequent itemsets. Many well-known algorithms are available to discover frequent itemsets, such as Apriori, FP-Growth, LCM and Eclat. Given a minimum support threshold (minsup), these algorithms return all the itemsets that appear in at least minsup transactions. For example, consider the following transactional database, given as the unit profit of each item and the items, with quantities, in each transaction.

| Item | Profit per unit |
|------|-----------------|
| a    | 5               |
| b    | 2               |
| c    | 1               |
| d    | 2               |
| e    | 3               |

| Transaction ID | Items with quantities        |
|----------------|------------------------------|
| P1             | a(1), b(5), c(1), d(3), e(1) |
| P2             | b(4), c(3), d(3), e(1)       |
| P3             | a(1), c(1), d(1)             |
| P4             | a(2), c(6), e(2)             |
| P5             | b(2), c(2), e(1)             |

For these sample transactions with minsup set to 2, c is identified as the most frequent item, since it is present in all the transactions. When the quantities and profits of the items are also considered, a high-utility itemset mining algorithm outputs all the high-utility itemsets, that is, the itemsets that generate at least minutil profit. For example, suppose that minutil is set to 25 by the user. The result of a high utility itemset mining algorithm would be the following.

High utility itemsets: {a,c}:28, {a,b,c,d,e}:25, {b,c,d}:34, {b,c,e}:37, {b,d,e}:36, {c,e}:27, {a,c,e}:31, {b,c}:28, {b,c,d,e}:40, {b,d}:30, {b,e}:31
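The example above can be checked with a few lines of Python. The following is a minimal brute-force sketch, not one of the surveyed algorithms: it enumerates every itemset, which is only feasible because the example has five items. The names `PROFIT`, `DATABASE`, `support`, `utility` and `high_utility_itemsets` are chosen here for illustration.

```python
from itertools import combinations

# Unit profit of each item, copied from the example table.
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3}

# Purchase quantities per transaction, copied from the example table.
DATABASE = {
    "P1": {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1},
    "P2": {"b": 4, "c": 3, "d": 3, "e": 1},
    "P3": {"a": 1, "c": 1, "d": 1},
    "P4": {"a": 2, "c": 6, "e": 2},
    "P5": {"b": 2, "c": 2, "e": 1},
}


def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in DATABASE.values() if all(i in t for i in itemset))


def utility(itemset):
    """Total profit of the itemset over the transactions that contain it."""
    total = 0
    for t in DATABASE.values():
        if all(i in t for i in itemset):
            total += sum(PROFIT[i] * t[i] for i in itemset)
    return total


def high_utility_itemsets(minutil):
    """Exhaustively test every non-empty itemset (assumes very few items)."""
    items = sorted(PROFIT)
    result = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            u = utility(itemset)
            if u >= minutil:
                result["".join(itemset)] = u
    return result


if __name__ == "__main__":
    # Support of each single item; c occurs in all five transactions.
    print({i: support((i,)) for i in sorted(PROFIT)})
    # Itemsets whose total profit reaches minutil = 25.
    for name, u in sorted(high_utility_itemsets(25).items()):
        print(name, u)
```

With minutil = 25 this sketch returns the eleven itemsets listed above and no others. No single item reaches the threshold on its own, which is why support alone does not identify the profitable itemsets.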

So, the limitation of frequent itemset mining is that itemsets with high actual profit may not be discovered as interesting or frequent, while some of the frequent itemsets it does find are not interesting. Ultimately, frequent itemset mining may miss rare patterns that are highly profitable in a transaction database. To address these limitations, the problem of frequent itemset mining has been redefined as the problem of high-utility itemset mining. In addition, frequent itemset mining prunes the search space with the Apriori property, which says that if an itemset is infrequent then all its supersets are also infrequent. No such property holds for utility: in the example above, {a} has a utility of 20, below minutil = 25, yet its superset {a,c} has a utility of 28. This makes high-utility itemset mining more interesting than frequent itemset mining.

2. Algorithms to mine high utility itemsets

2.1 Algorithm 1: A multi-objective evolutionary algorithm for mining frequent and high utility itemsets [6]

This algorithm aims at mining itemsets that are both frequent and of high utility. Existing algorithms such as FP-Growth, HUI-Miner, HUIM-ACS and TKU are discussed in terms of a weight parameter θ, which sets the importance of utility relative to support and can be chosen by the user. The multi-objective algorithm has two objectives, support and utility. As an evolutionary algorithm [5], it is posed as a maximization problem, since the ultimate aim is to find itemsets with maximum support and maximum utility. The irony is that the two measures (support, utility) conflict with each other: an itemset with high support often has low utility, and an itemset with high utility often has low support. Hence the algorithm is framed as an optimization problem that trades off the two measures. This can be represented as

Maximize F(X) = max {(supp(X), util(X))^T}

where F(X) is the optimization function, X represents the itemset and T refers to the transaction.
Note that min_sup and min_util values are not needed here. This differs from the other algorithms mentioned, which mine itemsets whose support and utility are greater than or equal to the min_sup and min_util thresholds specified by the user.
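The conflict between the two objectives, and the reason no thresholds are required, can be illustrated with the dominance relation that multi-objective methods generally rely on. The sketch below is not MOEA-FHUI and performs no evolutionary search; it only filters a fixed candidate list down to its non-dominated (support, utility) pairs. The six candidates are itemsets of the example database in the Introduction, with utilities as listed there and supports counted from its transactions; the function names are assumptions made for this illustration.

```python
# (support, utility) pairs for six itemsets of the Introduction's example database.
CANDIDATES = {
    "c": (5, 13),
    "ce": (4, 27),
    "bc": (3, 28),
    "bce": (3, 37),
    "bcd": (2, 34),
    "bcde": (2, 40),
}


def dominates(p, q):
    """p dominates q if it is no worse in both objectives and better in one."""
    return p[0] >= q[0] and p[1] >= q[1] and (p[0] > q[0] or p[1] > q[1])


def pareto_front(candidates):
    """Keep the itemsets whose (support, utility) pair no other pair dominates."""
    return {
        name: point
        for name, point in candidates.items()
        if not any(dominates(other, point) for other in candidates.values())
    }


if __name__ == "__main__":
    for name, (supp, util) in pareto_front(CANDIDATES).items():
        print(name, "support =", supp, "utility =", util)
```

Here {b,c} is dropped because {b,c,e} has the same support and a higher utility, and {b,c,d} is dropped for a similar reason. The four remaining itemsets trade support against utility, so none of them is better than another in both measures.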

Two more parameters are used to evaluate the quality of the itemsets recommended by this algorithm: Hypervolume (HV) and Coverage (Cov), which measure the convergence and diversity of the recommended itemsets. The algorithm has been applied to twelve real data sets and the comparison results have been plotted. According to these results, it performs better than the other algorithms considered at recommending itemsets with comparatively high support and high utility. The other algorithms, in contrast, produce itemsets with either high support and low utility or low support and high utility when compared with the itemsets produced by MOEA-FHUI.

2.2 Algorithm 2: RUP/FRUP-Growth: An efficient algorithm for mining high utility itemsets [3]

This algorithm is designed to mine frequent and high utility itemsets. The authors propose RUP-Growth as an improvement to the UP-Growth algorithm and then develop it into the FRUP-Growth algorithm, which considers both minimum support and minimum utility threshold values. Many existing algorithms are cited for mining such itemsets, but their performance is determined by the number of candidate itemsets to mine. The number of candidate itemsets increases as the minimum utility decreases and as the number of long transactions increases.

Here, the utility of an item is defined as the product of its internal and external utility. The internal utility of an item is the quantity of the item within the transaction. The profit value of an item, which is not available in the transactions, is defined as its external utility. Utility is represented as

u(i,T) = p(i) × q(i,T)

where u(i,T) is the utility of item i in transaction T, p(i) (external utility) is the profit of item i irrespective of the transaction, and q(i,T) (internal utility) is the quantity of item i in transaction T. This is extended to the utility of an itemset X in a transaction T by adding the utility of all the items of X in that transaction T.
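The definitions in this section can be written compactly as follows. This is a restatement in one consistent notation, not a formula taken from the surveyed paper: T denotes a transaction throughout, and the symbol D for the database is an assumption introduced here. The last line is the database-level utility, obtained by summing over the transactions that contain X.

```latex
\begin{align*}
u(i, T) &= p(i) \times q(i, T) \\
u(X, T) &= \sum_{i \in X} u(i, T), \qquad X \subseteq T \\
u(X)    &= \sum_{T \in D,\; X \subseteq T} u(X, T)
\end{align*}
% Worked example on the database from the Introduction, for X = {b, d}:
\begin{align*}
u(\{b,d\}, P1) &= 2 \times 5 + 2 \times 3 = 16 \\
u(\{b,d\}, P2) &= 2 \times 4 + 2 \times 3 = 14 \\
u(\{b,d\})     &= 16 + 14 = 30
\end{align*}
```

Only P1 and P2 contain both b and d, so the total of 30 agrees with the value listed for {b,d} in the Introduction.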
The utility of an itemset X in the given database is calculated by adding the utility of the itemset in all the transactions. This approach is divided into two phases. First, the UP-Growth algorithm is improved, giving the RUP-Growth algorithm. Then, by adopting minimum support and minimum utility threshold values to mine frequent and high utility itemsets, it evolves into the FRUP-Growth algorithm. Collectively, these two improved approaches have three steps:

(i) Construct a UP-Tree
(ii) Mine candidates for frequent and high utility itemsets based on the tree from (i)
(iii) Identify the actual frequent and high utility itemsets

The approach concludes that the number of candidate itemsets should be reduced before the actual high utility itemsets are identified. According to the results quoted, RUP-Growth outperforms the earlier algorithm.

2.3 Algorithm 3: High-utility itemset mining and privacy-preserving utility mining [2]

High utility itemset mining (HUIM) is the mining of high utility itemsets from the candidate itemsets within a given database. Its drawback is that the mined high utility itemsets may publish private or secure data. To overcome this, privacy-preserving utility mining (PPUM) is used to hide the private high utility itemsets mined from the candidate itemsets. The authors propose two evolutionary algorithms, one to find the high utility itemsets and the other to perform PPUM [3]. The evolutionary algorithm for mining high utility itemsets consists of four processes: pre-processing, particle encoding, fitness evaluation and updating. The evolutionary algorithm for PPUM then hides the sensitive private high utility itemsets identified by the first algorithm. It outperforms the HUPEumu-GRAM algorithm in runtime.

2.4 Algorithm 4: Efficiently mining of Effective web traversal patterns with average utility [7]

This algorithm deals with finding high average utility web patterns. The issue it overcomes is that existing algorithms calculate the transaction weighted utility of a pattern by adding the utility of all the transactions in which the pattern exists, without considering the prefix of the transaction. Usually, utility is calculated by combining the internal and external utility. Here, only the internal utility of the transaction is considered.
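The transaction weighted utility used by the existing algorithms, as described above, can be sketched in a few lines. The example below is an illustration under stated assumptions: it works on the itemset database from the Introduction rather than on web traversal sequences, it uses both profit and quantity, and it does not reproduce the prefix or projection handling of the surveyed algorithm. The function names are hypothetical.

```python
from itertools import combinations

# Example database from the Introduction: unit profits and quantities.
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3}
DATABASE = {
    "P1": {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1},
    "P2": {"b": 4, "c": 3, "d": 3, "e": 1},
    "P3": {"a": 1, "c": 1, "d": 1},
    "P4": {"a": 2, "c": 6, "e": 2},
    "P5": {"b": 2, "c": 2, "e": 1},
}


def transaction_utility(transaction):
    """Utility of a whole transaction: profit times quantity over all its items."""
    return sum(PROFIT[i] * q for i, q in transaction.items())


def twu(itemset):
    """Sum of the utilities of all transactions in which the itemset exists."""
    return sum(
        transaction_utility(t)
        for t in DATABASE.values()
        if all(i in t for i in itemset)
    )


def utility(itemset):
    """Exact utility: only the items of the itemset are counted."""
    return sum(
        sum(PROFIT[i] * t[i] for i in itemset)
        for t in DATABASE.values()
        if all(i in t for i in itemset)
    )


def all_itemsets():
    """Every non-empty itemset over the five example items."""
    items = sorted(PROFIT)
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


if __name__ == "__main__":
    for itemset in all_itemsets():
        print("".join(itemset), "TWU =", twu(itemset), "utility =", utility(itemset))
```

Because whole transactions are counted, the transaction weighted utility is never smaller than the exact utility. For {b,d}, for example, it is 45 against an exact utility of 30. This overestimation is the kind of looseness that restricting the calculation to part of the transaction is meant to reduce.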
Also, because utility increases with pattern length, a longer pattern with low utility may reach a total utility similar to that of a short pattern with high utility. Choosing the high average utility patterns is therefore a more effective way to find interesting web traversal patterns, since it takes pattern length into account. Ultimately, this algorithm reduces the search space for finding the effective web path traversal patterns. Similarly, the transaction weighted utility is calculated only over the projected sequence and not by adding the utility of all the transactions in which the pattern exists, which addresses the issue in the existing algorithms.

2.5 Algorithm 5: Mining of high utility itemsets of size-2 with pruning strategies [1]

The MHUIS-2wPS algorithm utilizes the transactional experiences of retail stores and outputs size-2 clubs. The MHUI-NIV algorithm caters for items with negative item values. The dissertation applies various pruning strategies for the discovery of high utility itemsets. This pruning helps avoid the unnecessary formation of low utility extensions. The proposed MHUIS-2wPS algorithm follows a sequential approach for finding the high utility itemsets. It builds the necessary data structures and parameters for the processing and initiates the finding of the clubs of items. The high utility itemsets are found using the utility list. Then, by applying the pruning concepts of EUCS and PUCS, the itemsets are made minimal, resulting in the high utility itemsets. Later, it checks the remaining itemset clubs, which can be classified as high utility or not at this stage. Lastly, the formed clubs are validated using the decisions of EUCS and PUCS.

3. A Comparative study on the algorithms

| S. No. | Author | Name of the algorithm | Objective | Parameters considered | Data set utilized | Advantages | Disadvantages |
|---|---|---|---|---|---|---|---|
| 1 | Lei Zhang, Guanglong Fu, Fan Cheng, Jianfeng Qiu, Yansen Su | MOEA-FHUI (Multi-Objective Evolutionary Algorithm for mining Frequent and High Utility Itemsets) | To mine both frequent and high utility itemsets (a maximization problem) | Hypervolume, Coverage, Support, Utility | 12 real data sets (USCensus_10%, BMS-Web-View-1, etc.) | a. No need for minimum support and minimum utility threshold values. b. Only one run is required for multiple itemsets. | a. It is not compared with algorithms that have a similar objective. b. Only frequency and quantity are considered as measures. |
| 2 | Jue Jin, Shui Wang | RUP/FRUP-Growth: An efficient algorithm for mining high utility itemsets | To mine frequent and high utility itemsets | Minimum support, minimum utility | Chainstore dataset (California) | a. Frequency, quantity and profit are considered as measures. b. Reduces the number of candidates for high utility itemsets. | a. It requires the user to fix threshold values for minimum support and minimum utility. b. Support is not directly dealt with in the approach. |
| 3 | Jerry Chun-Wei Lin, Wensheng Gan, Philippe Fournier-Viger, Lu Yang, Qiankun Liu, Jaroslav Frnda, Lukas Sevcik, Miroslav Voznak | High-utility itemset mining and privacy-preserving utility mining | To mine high utility itemsets and hide the sensitive high utility itemsets in PPUM | Support, minimum utility | Chess dataset, synthetic T10I4D100K dataset | a. Privacy in the high utility itemsets is preserved and sensitive itemsets are hidden. | a. Frequent itemsets are not mined. |
| 4 | Thilagu M, Nadarajan R | Efficiently mining of Effective web traversal patterns with average utility | To produce high average utility web traversal patterns | Time spent on a traversal, pattern length, minimum average utility | CTI, kosarak recommendation | a. Both longer patterns with less page utility and shorter patterns with high page utility are considered. b. Pattern length is considered as a parameter. | a. External utility is not considered. b. All pages are considered to have equal significance. c. Traversal patterns with backward references are not considered. |
| 5 | Gaurav Gahlot, Nagamma Patil | Mining of high utility itemsets of size-2 with pruning strategies | To find high utility itemsets by pruning | Transaction weighted utility, minimum utility | Synthetic dataset | a. Pruning is applied in identifying the high utility itemsets. | a. The comparison plot is based on only 9 transactions. |

4. Conclusion

Mining high utility itemsets from real-world datasets is gaining importance today. Because the utility of an item affects the interestingness of the resulting itemsets, utility mining emerged from data mining. In that context, itemsets with both high utility and high support deliver the expected interestingness. Many algorithms have been proposed to mine frequent itemsets, and since utility mining emerged, many more have been proposed to mine high utility itemsets based on quantity and profit. Here, we have analyzed a broad category of algorithms that compute frequent and high utility itemsets. Each algorithm has outperformed its reference algorithm either in running time or in finding better frequent high utility itemsets, with comparatively high support and high utility, among the candidate itemsets. So, in further work on mining frequent high utility itemsets, the various interestingness measures used by these algorithms can be used collectively to get better results. Some of the interestingness measures used here are HV, Cov, support, utility, internal utility, transactional utility, transactional weight, profit, quantity and time. Beyond combining the measures, weightage factors can be given to each interestingness measure, so that the importance of a measure can be changed depending on the application domain and user flexibility.

References

1. Gaurav Gahlot, Nagamma Patil, Mining of high utility itemsets of size-2 with pruning strategies and negative item values for B2C companies based on experiential marketing approach, Perspectives in Science, 8, 2016.
2. Jerry Chun-Wei Lin, Wensheng Gan, Philippe Fournier-Viger, Lu Yang, Qiankun Liu, Jaroslav Frnda, Lukas Sevcik, Miroslav Voznak, High-utility itemset mining and privacy-preserving utility mining, Perspectives in Science, 7, 2016.
3. Jue Jin, Shui Wang, RUP/FRUP-Growth: An efficient algorithm for mining high utility itemsets, Procedia Engineering, 174, 2017.
4. Kannimuthu, S., Premalatha, K., A fast perturbation algorithm using tree structure for privacy preserving utility mining, Expert Syst. Appl. 42 (3).
5. Kannimuthu, S., Premalatha, K., Discovery of high utility itemsets using genetic algorithm with ranked mutation, Appl. Artif. Intell. 28 (4).
6. Lei Zhang, Guanglong Fu, Fan Cheng, Jianfeng Qiu, Yansen Su, MOEA-FHUI (Multi-Objective Evolutionary Algorithm for mining Frequent and High Utility Itemsets), Applied Soft Computing, 62, 2018.
7. Thilagu M, Nadarajan R, Efficiently mining of Effective web traversal patterns with average utility, Procedia Technology, 6, 2012.
8. Vinod Kumar, Ramjeevan Singh Thakur, High Fuzzy Utility Strategy Based Webpages sets mining from weblog database, International Journal of Intelligent Engineering and Systems,
