High Utility Web Access Patterns Mining from Distributed Databases

Similar documents
Mining High Average-Utility Itemsets

A Two-Phase Algorithm for Fast Discovery of High Utility Itemsets

Generation of Potential High Utility Itemsets from Transactional Databases

RHUIET : Discovery of Rare High Utility Itemsets using Enumeration Tree

FHM: Faster High-Utility Itemset Mining using Estimated Utility Co-occurrence Pruning

A New Method for Mining High Average Utility Itemsets

Implementation of Efficient Algorithm for Mining High Utility Itemsets in Distributed and Dynamic Database

An Efficient Generation of Potential High Utility Itemsets from Transactional Databases

AN EFFICIENT GRADUAL PRUNING TECHNIQUE FOR UTILITY MINING. Received April 2011; revised October 2011

Information Sciences

Minig Top-K High Utility Itemsets - Report

Web Usage Mining: How to Efficiently Manage New Transactions and New Clients

Incremental Mining of Frequently Correlated, Associated- Correlated and Independent Patterns Synchronously by Removing Null Transactions

Enhancing the Performance of Mining High Utility Itemsets Based On Pattern Algorithm

Utility Mining: An Enhanced UP Growth Algorithm for Finding Maximal High Utility Itemsets

AN EFFECTIVE WAY OF MINING HIGH UTILITY ITEMSETS FROM LARGE TRANSACTIONAL DATABASES

Heuristics Rules for Mining High Utility Item Sets From Transactional Database

FHM: Faster High-Utility Itemset Mining using Estimated Utility Co-occurrence Pruning

Incrementally mining high utility patterns based on pre-large concept

A New Method for Mining High Average Utility Itemsets

WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity

Mining High Utility Itemsets from Large Transactions using Efficient Tree Structure

An Improved Apriori Algorithm for Association Rules

EFFICIENT TRANSACTION REDUCTION IN ACTIONABLE PATTERN MINING FOR HIGH VOLUMINOUS DATASETS BASED ON BITMAP AND CLASS LABELS

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA

Efficient Algorithm for Mining High Utility Itemsets from Large Datasets Using Vertical Approach

Mining High Utility Itemsets in Big Data

Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports

Efficient High Utility Itemset Mining using extended UP Growth on Educational Feedback Dataset

Efficient Tree Based Structure for Mining Frequent Pattern from Transactional Databases

CHUIs-Concise and Lossless representation of High Utility Itemsets

Mining of High Utility Itemsets in Service Oriented Computing

Utility Mining Algorithm for High Utility Item sets from Transactional Databases

AN ENHNACED HIGH UTILITY PATTERN APPROACH FOR MINING ITEMSETS

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Keywords: Frequent itemset, closed high utility itemset, utility mining, data mining, traverse path. I. INTRODUCTION

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

UP-Growth: An Efficient Algorithm for High Utility Itemset Mining

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

Infrequent Weighted Item Set Mining Using Frequent Pattern Growth

Efficient Algorithm for Frequent Itemset Generation in Big Data

Mining of Web Server Logs using Extended Apriori Algorithm

Discovery of High Utility Itemsets Using Genetic Algorithm

A Review on Mining Top-K High Utility Itemsets without Generating Candidates

Association-Rules-Based Recommender System for Personalization in Adaptive Web-Based Applications

Comparing the Performance of Frequent Itemsets Mining Algorithms

UTILITY, IMPORTANCE AND FREQUENCY DEPENDENT ALGORITHM FOR WEB PATH TRAVERSAL USING PREFIX TREE DATA STRUCTURE

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Implementation of Data Mining for Vehicle Theft Detection using Android Application

EFIM: A Fast and Memory Efficient Algorithm for High-Utility Itemset Mining

Mining Web Access Patterns with First-Occurrence Linked WAP-Trees

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey

FUFM-High Utility Itemsets in Transactional Database

Implementation of CHUD based on Association Matrix

An Efficient Algorithm for finding high utility itemsets from online sell

Design of Search Engine considering top k High Utility Item set (HUI) Mining

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

A Review on High Utility Mining to Improve Discovery of Utility Item set

Using Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Extracting Feature Sequences in Software Vulnerabilities Based on Closed Sequential Pattern Mining

Multi-Stage Rocchio Classification for Large-scale Multilabeled

An Algorithm for Mining Large Sequences in Databases

Web page recommendation using a stochastic process model

Available online at ScienceDirect. Procedia Computer Science 45 (2015 )

A Graph-Based Approach for Mining Closed Large Itemsets

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery

Parallel Popular Crime Pattern Mining in Multidimensional Databases

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management

Discovering High Utility Change Points in Customer Transaction Data

EFIM: A Fast and Memory Efficient Algorithm for High-Utility Itemset Mining

Efficient Mining of High-Utility Sequential Rules

UP-Hist Tree: An Efficient Data Structure for Mining High Utility Patterns from Transaction Databases

A Clustering Method with Efficient Number of Clusters Selected Automatically Based on Shortest Path

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining

Discovering Quasi-Periodic-Frequent Patterns in Transactional Databases

DISCOVERING ACTIVE AND PROFITABLE PATTERNS WITH RFM (RECENCY, FREQUENCY AND MONETARY) SEQUENTIAL PATTERN MINING A CONSTRAINT BASED APPROACH

Efficient Incremental Mining of Top-K Frequent Closed Itemsets

Efficiently Finding High Utility-Frequent Itemsets using Cutoff and Suffix Utility

REDUCTION OF LARGE DATABASE AND IDENTIFYING FREQUENT PATTERNS USING ENHANCED HIGH UTILITY MINING. VIT University,Chennai, India.

Chapter 4: Mining Frequent Patterns, Associations and Correlations

SIMULATED ANALYSIS OF EFFICIENT ALGORITHMS FOR MINING TOP-K HIGH UTILITY ITEMSETS

Systolic Tree Algorithms for Discovering High Utility Itemsets from Transactional Databases

EFIM: A Highly Efficient Algorithm for High-Utility Itemset Mining

Parallelizing Frequent Itemset Mining with FP-Trees

Concurrent Processing of Frequent Itemset Queries Using FP-Growth Algorithm

Efficient Mining of High Average-Utility Itemsets with Multiple Minimum Thresholds

Efficiently Mining Positive Correlation Rules

Improved UP Growth Algorithm for Mining of High Utility Itemsets from Transactional Databases Based on Mapreduce Framework on Hadoop.

A Technical Analysis of Market Basket by using Association Rule Mining and Apriori Algorithm

Mining Quantitative Association Rules on Overlapped Intervals

An Approximate Approach for Mining Recently Frequent Itemsets from Data Streams *

Mining Top-k High Utility Patterns Over Data Streams

FOSHU: Faster On-Shelf High Utility Itemset Mining with or without Negative Unit Profit

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL

Efficient High Utility Itemset Mining using Buffered Utility-Lists

Comparison of Different Navigation Prediction Techniques

Improved Aprori Algorithm Based on bottom up approach using Probability and Matrix

Fast Algorithm for Finding the Value-Added Utility Frequent Itemsets Using Apriori Algorithm

Transcription:

High Utility Web Access Patterns Mining from Distributed Databases Md.Azam Hosssain 1, Md.Mamunur Rashid 1, Byeong-Soo Jeong 1, Ho-Jin Choi 2 1 Database Lab, Department of Computer Engineering, Kyung Hee University 1 Seochun-dong, Kihung-gu, Youngin-si, Kyunggi-do, 446-701, Republic of Korea 2 Computer Science Dept., Korea Advanced Institute of Science and Technology 335 Guseong-dong, Yuseong-gu, Daejeon 305-701, Republic of Korea {azam, mamun, jeong)@khu.ac.kr, hojinc@kaist.ac.kr Abstract. Utility based web access pattern mining and knowledge discovery from database has become an interesting research domain in recent time. Traditional pattern mining algorithms deal with only binary occurrence of a web page and do not consider the weight or profit of the web page. Hence, utility-based web path traversal pattern mining technique has evolved and got much interest in recent time. However, the methods of mining high utility web access pattern from distributed databases have not explored yet. All of the existing utility pattern mining algorithms are based on centralized database. In this paper, we have been proposed a parallel and distributed method for mining high utility web access patterns. We also proved that our algorithm maintains downward closure property. Extensive experimental analyses reflect that our proposed technique is highly efficient on large databases in distributed environment Keywords: Web Access, High Utility Pattern, Distributed Database. 1 Introduction The World Wide Web offers huge, widely distributed global information system. It contains a rich and dynamic collection of web link, web page access and usage information [1]. Web access pattern mining algorithm discovers the usage patterns of user s frequent web page access. Therefore, mining frequent web access can be used to improve the webpage design, advertisement placement in most popular page and so on. 137

Moreover, today s internet era databases are inherently distributed. Most of the organizations operate business in global markets require to perform data mining on distributed data sources to turn them into realistic and meaningful knowledge for their future use and the volume of data available for usage is very high. This inherent distribution source of data and the voluminous in size emerges to develop scalable parallel and distributed algorithm for data mining. In recent time, the concept of utility-based web path traversal mining has been developed in [7]. The concept of utility for web path traversal first time was introduces in [5].They adopted the browsing time of a user at particular web site as a utility of that web site. They used the same definitions of utility from the high utility pattern mining algorithm [2-6]. Farhan et al. [8] proposed the EUWPTM algorithm for utility based web path traversal pattern mining that exploit the pattern growth sequential pattern mining approach. However, the technique of mining high utility web access pattern from distributed databases have not explored yet. All of the existing utility-based web path traverse pattern mining algorithms are considered the centralized database. In real scenario, web access data mining techniques need to handle the large database. Therefore, it is necessity to focus on large-scale parallel and distributed high utility-based web site traversal patterns mining. In summary, our main contributions in this work are We define the problem for high utility-based web access patterns mining in the parallel and distributed environment We have been proposed a distributed method for mining the high utility web path traversal pattern from large databases. We also proved that our proposed technique maintain the downward closure property. The rest of the paper is organized as follows. Section 2 presents the review of some recent related research works. Section 3 describes problem definition and Section 4 presents the proposed method in details. Section 5 elaborates the experimental results and finally Section 6 concludes the paper. 2 Related Work WEBMINER [1] was proposed to discover the information from the data of World Wide Web transactions. They focused to analyze the interesting and meaningful relationships from a large pole of web database. A web utilization miner (WUM) [9] finds the interesting web navigation pattern. They primarily focused on the MINT query language and its execution mechanism. The theoretical model and definitions of high utility pattern mining were given in [2]. This method, called MEU (mining with expected utility), cannot maintain the downward closure property. In this approach, they used heuristics to determine whether an itemset should be considered as a candidate itemset. This is impractical whenever the number of distinct items is large and the utility threshold is low. Later, the same authors proposed two new algorithms, UMining and UMining_H, to calculate high utility patterns [3]. In UMining, a pruning strategy based on the utility 138

upper bound property is used. Furthermore, these methods do not satisfy the downward closure property and may overestimate too many patterns. Based on the definitions of [2], the Two-Phase [4, 5] algorithm was developed to find high utility itemsets. The authors have defined the transaction weighted utilization (twu) [6] and using it they proved that it is possible to maintain the downward closure property. In recent time, the concept of utility-based web path traversal mining has been developed in [8]. The concept of utility for web path traversal first time was introduces in [7].They adopted the browsing time of a user at particular web site as a utility of that web site. They used the same definitions of utility from the high utility pattern mining algorithm [2-6]. Farhan et al. [8] proposed the EUWPTM algorithm for utility based web path traversal pattern mining that exploit the pattern growth sequential pattern mining approach. By using the pattern growth mining approach, their technique prunes a huge number of candidate items at first database scan. 3 Problem Definition Let S { s1, s2, s3,... sj} = be a j-sequence of traversal path and { T } D= T1, T2, T3,... m be web-log, where T i is traversal path. A traversal path database is presented in the table 1. Transaction T 1 contains the user browsing sequence B, C, and D websites. The quantity inside the parenthesis is the internal utility that indicates the time spent by the user at that website. Table 2 shows the external importance of a website. As for example, internal utility for website B is 3 where external utility is 10. Table 1 A web traversal Database Table 2 External Utility TID Traversed Path Transaction Utility T 1 B(3), C(2), D(3) 59 T 2 A(3), D(2), E(2) 42 T 3 B(3), E(2) 40 T 4 A(1), B(1), C(1) 20 T 5 A(2), B(3,), D?(5) 77 T 6 A(3), B(4) 58 T 7 E(1) 5 T 8 B(2), D(2) 34 Process Assigned Total = 335 The formal definition for high utility mining model for web path traversal pattern is as follows: Definition 1: Utility of a website ij in transaction Tk is the quantitative measure of the website which is multiplication of internal and external utility of a web page. u i j, T k = l i j, T k e i j P 0 P 1 ( ) ( ) ( ) For example, in table 1 u (D, T 1 ) = 3 7 = 21. Item Utility A 6 B 10 C 4 D 7 E 5 139

Definition 2: The utility of a website sequence S in transaction Tk sum of utility of the entire website that belongs to that sequence and it is defined by, u S, T u ij, Tk ( ) ( ) k = Definition 3: Transaction Utility (tu) of a transaction is total utility of that transaction and it is defined by, tu Tk u ij, Tk ij S = ( ) ( ) ij Tk For example, in table 1 tu (T4) = u (A, T4) + u (B, T4) + u (C, T4) = 6 + 10 + 4 = 20 Definition 4: Local transaction utility utilization of an itemset S, denoted by ltwu(s), is the sum of the transaction utilities of all transactions containing S of a particular site Pi and defined by, ltwu( S) = tu( Tk) X Tk D For example, ltwc (BD) = tu (T 1 ) = 59 is since BD appears only in T 1 in the local site P 0. Definition 5: Global transaction utility utilization of an itemset S, denoted by gtwu(s), is the sum of the transaction utilities of all transactions of all the site that containing S and defined by, gtwu S i= p = i= 1 X Tk D ( ) tui( Tk) As for example twc (BD) = tu (T 1 ) + tu (T 5 ) = 59 + 77 = 136. S = S1 S2 S3... Sq is a pattern where q is a positive integer Theorem 1: If { } and Sj is a sub-pattern of S and j [1, q], then global transaction utility utilization i= p i= 1 { 1 i 2 i 3 i j i} gtwu( S) min ltwu( S ), ltwu( S ), ltwu( S ),..., ltwu( S). where i [1, p] and p = number of nodes. 4 Proposed Method In this section, we present the proposed method for parallel and distributed high utility pattern mining. Our proposed algorithm works in the following steps. Master Node: 1. Broadcast the information of master site to all the Slave sites. 2. Wait for local transaction weighted utilization (ltwu) from the Slave node. 3. Compute global transaction utility utilization (gtwu). 4. Broadcast the gtwu. 5. Receive the global potential high utility pattern. 6. Calculate the high utility pattern from potential high utility pattern. Slave Node: 7. Scan the local database. 140

8. Calculate the local transaction utility utilization (ltwu) of each website. 9. Send (ltwu) to the Master node. 10. Wait for global transaction weighted utilization (gtwu). 11. Build the PHUP tree. 12. Find the potential global high utility patterns and send to Master Node. In the first database scan each node calculates the local transaction-weightedutilization (ltwu) as the definition 4 of each website. Global transaction weighted utilization is calculated by using the definition 5 and is broadcasted by Master node. Then each local site builds their global transaction utility table using global transaction-utility-utilization and pruned the items that do not satisfied the given threshold min_util ( ). As for example, total transaction utility value is 335. If is 25% than the minimum utility value will be min_util = 0.25 335 = 83.75. gtwu(c) < min_util ( ), so C is pruned. After that local site builds PHUP tree by scanning the local database and find the potential global high utility patterns that satisfied the given threshold min_util ( ). Each local site calculates the actual utility u(s) from potential high utility patterns by scanning the database and sends to Master node. Finally, global high utility pattern are accumulated by the Master node. 5 Experimental Results We experiment our algorithm in the system where all sites are identical. Each site consists of a 2.66 GHz CPU with 2 GB memory and running on Linux. Communications among site are established using Message Passing Interface and programs are implemented in C++ language. Experiments have been carried out on real BMS-Web View-1 datasets which contains several months world of click stream of an e-commerce web site. It has 59602 transactions and 497 distinct items. It is observed from the figure 1 that as the number of process increases, the execution time decreases. We also see that if the minimum threshold is higher than number of candidates is fewer. Therefore, our approach is highly scalable and efficient. (a) Execution Time (b) No. of Candidate Figure 1: Performance on BMS-Web View-1 Dataset 141

Figure 1 (b) demonstrates that our proposed algorithm generates less number of candidates than existing algorithm Utility [7] since our method maintains the downward closure property. 6 Conclusions The main contribution of this paper is to provide first time a distributed method for utility-based web sequence access mining. Our proposed maintain the downward closure property that helps to prune the website with low utility at the beginning. So, the proposed method is effectively suited for large databases which provide the high scalability and performance gain and require minimum communication among the process. Acknowledgement This work was supported by a grant from the NIPA (National IT Industry Promotion Agency, Korea) in 2012 (Global IT Talents Program) and partly supported by the National Research Foundation (NRF) grant (No. 2011-0018264) of Ministry of Education, Science and Technology (MEST) of Korea. References 1. Mobasher B, Jain N, Han et al.:web Mining: Pattern Discovery from World Wide Web Transactions. Tech Rep: Tr96-050 (1996) 2. Yao H., Hamilton H.J., Butz C. J.: A foundation approach to mining itemset utilities from databases. In Proceedings of the 4th International Conference on Data Mining, pp.482-486, (2004) 3. Yao H., Hamilton H.J.: Mining itemsets utilities from transaction databases. Data and Knowledge Engineering, vol. 59, no. 3, pp. 603-626 (2006) 4. Liu Y., W.-keng. Liao, Choudhary A. N.: A two- phase algorithm for fast discovery of high utility itemsets. In Proceedings of the 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp. 689-695 (2005) 5. Liu Y., W.-keng. Liao, Choudhary A. N.: A fast high utility itemsets mining algorithm. In Proceedings of the first International Conference on Utility-Based Data Mining, pp. 90-99, (2005) 6. Ahmed C. F., Tanbeer S. K., Jeong B.-S., Lee Y.-Koo.: Efficient Tree Structures for High Utility Pattern Mining in Incremental Databases. Transactions on Knowledge and Data Engineering, Vol. 21, No. 12, pp. 1708 1721 (2009) 7. Zhou L., Liu Y., Wang J., Shi Y.: Utility-Based Web Path Traversal Pattern Mining. In Proceedings 7 th International Conference on Data Mining Workshops, pp. 373-378 (2007) 8. Ahmed C. F., Tanbeer S. K., Jeong B.-S., Lee Y.-Koo: Efficient Mining of Utility-Based Web Path Traversal Pattern. In Proceeedings of the 11 th International Conference on Advanced Communication Technology (IEEE-ICACT 09), pp.719-724, (2009) Spiliopoulou M, Faulstich L.C.: Wum: A Web Utilization miner. EDBT Workshop WebDB98 Springer Verlag (1996) 142