Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey

Similar documents
Fast incremental mining of web sequential patterns with PLWAP tree

Mining Web Access Patterns with First-Occurrence Linked WAP-Trees

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA

Data Mining Part 3. Associations Rules

FM-WAP Mining: In Search of Frequent Mutating Web Access Patterns from Historical Web Usage Data

PROBLEM STATEMENTS. Dr. Suresh Jain 2 2 Department of Computer Engineering, Institute of

Mining Frequent Patterns without Candidate Generation

Appropriate Item Partition for Improving the Mining Performance

DATA MINING II - 1DL460

Adaption of Fast Modified Frequent Pattern Growth approach for frequent item sets mining in Telecommunication Industry

Salah Alghyaline, Jun-Wei Hsieh, and Jim Z. C. Lai

MINING FREQUENT MAX AND CLOSED SEQUENTIAL PATTERNS

Chapter 13, Sequence Data Mining

A Novel Method of Optimizing Website Structure

An Improved Frequent Pattern-growth Algorithm Based on Decomposition of the Transaction Database

Data Mining for Knowledge Management. Association Rules

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Monotone Constraints in Frequent Tree Mining

This paper proposes: Mining Frequent Patterns without Candidate Generation

Sequential PAttern Mining using A Bitmap Representation

An Efficient Method of Web Sequential Pattern Mining Based on Session Filter and Transaction Identification

Finding the boundaries of attributes domains of quantitative association rules using abstraction- A Dynamic Approach

Mining Recent Frequent Itemsets in Data Streams with Optimistic Pruning

An Approach for Finding Frequent Item Set Done By Comparison Based Technique

Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports

Using Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment

Association Rule Mining. Introduction 46. Study core 46

Data Structure. IBPS SO (IT- Officer) Exam 2017

An Algorithm for Mining Large Sequences in Databases

CS145: INTRODUCTION TO DATA MINING

Improving the Efficiency of Web Usage Mining Using K-Apriori and FP-Growth Algorithm

IJESRT. Scientific Journal Impact Factor: (ISRA), Impact Factor: [35] [Rana, 3(12): December, 2014] ISSN:

Binary Trees

Association Rule Mining

CSCI6405 Project - Association rules mining

EFFICIENT TRANSACTION REDUCTION IN ACTIONABLE PATTERN MINING FOR HIGH VOLUMINOUS DATASETS BASED ON BITMAP AND CLASS LABELS

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

A Study on Mining of Frequent Subsequences and Sequential Pattern Search- Searching Sequence Pattern by Subset Partition

Chapter 4: Association analysis:

A Comparative Study of Association Rules Mining Algorithms

Keshavamurthy B.N., Mitesh Sharma and Durga Toshniwal

BINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASES

MEIT: Memory Efficient Itemset Tree for Targeted Association Rule Mining

Chapter 4 Trees. Theorem A graph G has a spanning tree if and only if G is connected.

USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM

An undirected graph is a tree if and only of there is a unique simple path between any 2 of its vertices.

Efficient Frequent Pattern Mining on Web Log Data

Priority Queues and Binary Heaps

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

International Journal of Scientific Research and Reviews

An Improved Apriori Algorithm for Association Rules

Association Rule Mining

Mining High Average-Utility Itemsets

UMCS. Annales UMCS Informatica AI 7 (2007) Data mining techniques for portal participants profiling. Danuta Zakrzewska *, Justyna Kapka

EE 368. Weeks 5 (Notes)

DISCOVERING ACTIVE AND PROFITABLE PATTERNS WITH RFM (RECENCY, FREQUENCY AND MONETARY) SEQUENTIAL PATTERN MINING A CONSTRAINT BASED APPROACH

ISSN: (Online) Volume 2, Issue 7, July 2014 International Journal of Advance Research in Computer Science and Management Studies

Comparative Study of Techniques to Discover Frequent Patterns of Web Usage Mining

Available online at ScienceDirect. Procedia Computer Science 45 (2015 )

Trees. Truong Tuan Anh CSE-HCMUT

Tree. Virendra Singh Indian Institute of Science Bangalore Lecture 11. Courtesy: Prof. Sartaj Sahni. Sep 3,2010

Maintenance of fast updated frequent pattern trees for record deletion

A Comprehensive Survey on Sequential Pattern Mining

LECTURE 11 TREE TRAVERSALS

CS570 Introduction to Data Mining

Web Service Usage Mining: Mining For Executable Sequences

Performance Analysis of Apriori Algorithm with Progressive Approach for Mining Data

Parallelizing Frequent Itemset Mining with FP-Trees

Comparing the Performance of Frequent Itemsets Mining Algorithms

A New Fast Vertical Method for Mining Frequent Patterns

Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach

An Extended Byte Carry Labeling Scheme for Dynamic XML Data

Data and File Structures Laboratory

SeqIndex: Indexing Sequences by Sequential Pattern Analysis

Section 5.5. Left subtree The left subtree of a vertex V on a binary tree is the graph formed by the left child L of V, the descendents

APPLYING BIT-VECTOR PROJECTION APPROACH FOR EFFICIENT MINING OF N-MOST INTERESTING FREQUENT ITEMSETS

Product presentations can be more intelligently planned

Mining Frequent Patterns with Counting Inference at Multiple Levels

A Review on Mining Top-K High Utility Itemsets without Generating Candidates

Maintenance of the Prelarge Trees for Record Deletion

A NOVEL ALGORITHM FOR MINING CLOSED SEQUENTIAL PATTERNS

PTclose: A novel algorithm for generation of closed frequent itemsets from dense and sparse datasets

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

Association rules. Marco Saerens (UCL), with Christine Decaestecker (ULB)

International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2015)

Generation of Potential High Utility Itemsets from Transactional Databases

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

A Comparative study of CARM and BBT Algorithm for Generation of Association Rules

A Graph-Based Approach for Mining Closed Large Itemsets

Finding Frequent Patterns Using Length-Decreasing Support Constraints

Association Rules. A. Bellaachia Page: 1

Discovery of Multi Dimensional Quantitative Closed Association Rules by Attributes Range Method

ALGORITHM FOR MINING TIME VARYING FREQUENT ITEMSETS

Trees. Tree Structure Binary Tree Tree Traversals

Association Rules. Berlin Chen References:

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Information Sciences

Transcription:

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey G. Shivaprasad, N. V. Subbareddy and U. Dinesh Acharya Department of Computer Science and Engineering, Manipal Institute of Technology, Manipal University, Manipal, India Abstract Sequential pattern mining is the process of applying data mining techniques to a sequential database, to extract frequent subsequences to discover correlation that exists among the ordered list of events. Web Usage mining (WUM) discovers and extracts interesting knowledge/patterns from Web logs is one of the applications of Sequential Pattern Mining. In this paper, we present a survey of the sequential pattern mining techniques based Web Access Pattern (WAP) tree on Web logs. I. INTRODUCTION Sequential pattern mining, which finds high-frequency patterns with respect to time or other patterns, was first introduced by [3] as follows: given a sequence database where each sequence is a list of transactions ordered by transaction time and each transaction consists of a set of items, find all sequential patterns with a user specified minimum support, where the support is the number of data sequences that contain the pattern. Since the access patterns from web log take on obvious time sequence characteristic, it is natural to apply the technology of sequential pattern mining to web mining. Sequential pattern mining [4] discovers the existing frequent sequences in a given database, when the data to be mined has some sequential nature. These patterns may represent very valuable information. A sequential pattern (as applied to Web usage mining) is defined as an ordered set of pages that satisfies a given support and is maximal (i.e., it has no subsequence that is also frequent). Support is the percentage of the customers who have the pattern. Since a user may have many sessions, it is possible that a sequential pattern should span a lot of sessions. It also need not be contiguously accessed pages. There exists a great diversity of algorithms for sequential pattern mining. Sequential access pattern mining techniques are mainly based on two approaches: Apriori-based mining algorithms and WAP tree based mining algorithms. Most of the earlier algorithms for sequential pattern mining are based on the Apriori property proposed in association rule mining [3]. The apriori-like algorithms are not efficient when applied to long sequential patterns, which is an important drawback when working with Web logs. In [5], Pei et al proposed another category of algorithm based on Web access pattern (WAP) tree. Several improvements of WAP-tree algorithm, the Pre-order Linked WAP tree and Layer Coded Breadth-First WAP-tree were then proposed. In the next section, we present a brief discussion of these algorithms. In order to make it easier for us to compare those algorithms, we use the same sequential database, a web log (shown in Table I). Suppose the minimum support threshold is set at 75%, which means an access sequence, s should have a count of 3 out of 4 records in our example, to be considered frequent. The frequent sub-sequences are shown in the 3 rd column of Table I are used as input. II. A. WAP-Mine WAP TREE BASED MINING ALGORITHMS The WAP-mine algorithm[5] based on WAP-tree is specifically tailored for mining Web access patterns. The algorithm scans the access sequence database twice. The set of frequent events are determined in the first scan. Using these frequent events, a WAP-tree with count information is built in the next scan. The WAP-tree stores the web log data in a prefix tree format. Then WAP-mine recursively mine the WAPtree using conditional search to find all Web access patterns. Conditional search narrows the search space by looking for patterns with the same suffix and count frequent events in the set of prefixes with respect to condition as suffix. The process of recursive mining of a conditional suffix WAP tree ends when it has only one branch or is empty. The WAP-tree algorithm first stores the frequent items as header nodes for linking all the nodes of their type in the WAP-tree in the order the nodes are inserted. TABLE I SEQUENTIAL DATABASE ADAPTED FROM [5] Initially, a virtual root (Root) is first inserted in the WAP-tree. Then, for each frequent sequence in the transaction, a branch from the Root to a leaf node of the 319

tree is constructed. Each event in a sequence is inserted as a node with count 1 from Root if that node type does not yet exist. If the node type already exists, then the count of the node is increased by 1. Also, the head link for the inserted event is connected (in broken lines) to the newly inserted node from the last node of its type that was inserted or from the header node of its type if it is the very first node of that event type inserted. Fig.1 shows the complete WAP-tree for the sequential data of Table I. The basic structure of mining all Web access patterns in WAP-tree is as follows: If the WAP-tree has only one branch, return all (ordered) combinations events in the branch; Otherwise, for each frequent event ei in the WAP-tree by following the ei-queue started from header table, an ei-conditional access sequence base is constructed, denoted as PS ei, which contains all and only all prefix sequences of ei. Each prefix sequence in PS ei carries its count from the WAP-tree. For each prefix sequence of ei with count c, when it is inserted in to PS ei, all of its sub-prefix sequences of ei are inserted in to PS ei with count c, to prevent it contributing twice. It is easy to show that by accumulating counts of prefix sequences, a prefix sequence in PS ei holds its unsubsumed count. Then, the complete set of Web access patterns which are prefix sequence of ei can be mined by concatenating ei to all Web access patterns returned from mining the conditional WAP-tree, and ei itself. For the WAP-tree of Fig. 1, first the prefix sequence of the base c (the lowest frequent event) or the conditional sequence base of c is computed as follows: aba:2; ab:1; abca:1; ab: 1; baba:1; abac:1; aba: 1. The conditional sequence list of a suffix event is obtained by following the header link of the event and reading the path from the root to each node (excluding the node). The count for each conditional base path is subsequence that is also a conditional sequence of a node of the same base, the count of this new subsequence is subtracted because it has contributed before. Thus, the conditional sequence list above has two sequences ab and aba with counts of 1. This is because, when the subsequence, abca is added to the list, its subsequence ab was already in the list, from the first c on the same branch. Thus, the count of ab with 1 has to be added to prevent it from contributing twice. To qualify as a conditional frequent event, one event must have a count of 3. Therefore, after counting the events in sequences above, the conditional frequent events are a(4) and b(4) and c with a count of 2, which is less than the minimum support, is discarded. So, the conditional sequences based on c are: aba:2; ab:1; aba:1; ab: 1; baba:1; aba:1; aba: 1 Using these conditional sequences, a conditional WAP tree, WAP-tree c, is built using the same method as shown in Fig. 1. The new conditional WAP-tree is shown in Fig. 2(a). Recursively, based on this WAPtree, the next conditional sequence base for the next suffix subsequence, bc is computed as a(3),, ba(1). Since a is the only frequent pattern in this base, a(4) is used to construct the next WAP tree for the frequent sequence base of bc as shown in Fig. 2(b). Now the reconstruction of WAP trees that with suffix sequences c and bc is completed and the frequent patterns found along this line are c, bc and abc. Next, the recursion continues with the suffix path c, ac, bac, aac. Now, the conditional search of c is completed. The search for frequent patterns that have the suffix of other header frequent events are also mined in the same way. After mining the whole tree, the discovered frequent pattern set is: {c, aac, bac, abac, ac, abc, bc, b, ab, a, aa, ba, aba}. Figureas1.the WAP-tree in Table the same countfor onthe thefrequent suffixsequence node itself. TheI first sequence in the list above, aba represents the path to the first c node in the WAP tree. When a conditional sequence in a branch of a WAP-tree, has a prefix Conditional WAP-tree c (a) Conditional WAP-tree bc (b) Figure 2. Conditional WAP-tree for Conditional Sequence Base c 320

WAP-tree algorithm scans the original database only twice and avoids the problem of generating explosive candidate sets as in Apriori-like algorithms. So, the Mining efficiency is improved sharply. The main drawback of WAP-tree algorithm is that it requires recursive construction of large numbers of intermediate WAP-trees during mining. This necessitates storing intermediate patterns and hence consumes much time and space. mentioned position coding rule. Once the frequent sequence of the last database transaction is inserted in the PLWAP-tree, the tree is traversed in pre-order fashion to build the frequent header node linkages. Event-node queue links all the nodes in the tree with the same label. The event-node queue with label ei is also called ei-queue. There is one header table L for a PLWAP-tree, and the head of each event-node queue is registered. Fig. 3 shows the completely constructed PLWAP tree with the pre-order linkages for the sequence database of Table I. The algorithm starts by mining the tree in Fig. 3 for the first element in the header linkage list, a following the a-link to find the first occurrences of a nodes in a:3:1 and a:1:101 of the suffix trees of the Root since this is the first time the whole tree is passed for mining a frequent 1-sequence. Now the list of mined frequent patterns F is {a} since the count of event a in this current suffix trees is 4 (sum of a:3:1 and a:1:101 counts), and more than the minimum support. The mining of frequent 2-sequences that start with event a would continue with the next suffix trees of a rooted at {b:3:11, b:1:1011} shown in Fig. 4 as un-shadowed nodes. The objective here is to find if 2-sequences aa, ab and ac are frequent using these suffix trees. In order to confirm aa frequent, we need to confirm event a frequent in the current suffix tree set, and similarly, to confirm ab frequent, we should again follow the b link to confirm event b frequent using this suffix tree set, same for ac. The process continues to obtain the frequent sequence sets: {a, aa, aac, ab, aba, abac, abc, ac, b, ba, bac, bc, c} The PLWAP algorithm eliminates the need to store numerous intermediate WAP trees during mining, which drastically cuts off huge memory access costs. The pre-order linking of header nodes makes the search process more efficient. Thus, it saves the processing time and more so when the number of frequent patterns increases and the minimum support threshold is low. B. Pre-order Linked WAP (PLWAP) Pre-Order linked WAP tree algorithm [6] assigns unique binary position code to each node of the WAP tree. The header nodes are linked in pre-order fashion. Both the pre-order linkage and binary position codes avoids recursive re-construction of the intermediate WAP trees. The position code for each node are assigned as follows: the root has null code, and the leftmost child of any parent node has a code that appends 1 to the position code of its parent, while the position code of any other node has 0 appended to the position code of its nearest left sibling. In general, for the nth leftmost child, the position code is obtained by appending the binary number 2n 1 to the parent s code. In PLWAP, first the common prefix sequences are found. The main idea is to find a frequent pattern by progressively finding its common frequent subsequences starting with the first frequent event in a frequent pattern. For example, if pqrs is a frequent pattern to be discovered, the PLWAP technique, will find the prefix event p first, then, using the suffix trees of node p, it will find the next prefix subsequence pq and continuing with the suffix tree of q, it will find the next prefix subsequence pqr and finally, pqrs. The binary position codes are introduced for identifying the position of every node in the WAP tree. The PLWAP algorithm builds a prefix tree, PLWAP tree, by inserting the frequent sequence of each transaction in the tree in the same way as in WAPMine. The position codes are set by applying the above Figure 4. a Suffix tree Figure 3. PLWAP Tree adapted from [6] 321

PLWAP s performance seems to degrade with very long sequences having sequence length more than 20 because of the increase in the size of position code that it needs to build and process for very deep PLWAP tree C. Layer Coded Breadth-First WAP-tree The Layer Coded Breadth First WAP-tree algorithm [7] builds the frequent header node links of the original WAP-tree in a Breadth-First fashion and uses the layer code of each node to identify the ancestor-descendant relationships between nodes of the tree. It then, finds each frequent sequential pattern, through progressive Breadth-First sequence search, starting with its first Breadth-First subsequence event. Suppose that any ancestor node in WAP tree has not more than n (n<100) children node. Given a WAP-tree with some nodes, the layer code for each node is assigned as follows: the root has a layer code of 0 and the leftmost child of the root has a code of 001, the second leftmost node has a code of 002, and so on. So the layer code of a node in WAP tree is its parent node layer code plus its position (from left to right) 01-99. Also, a node α is an ancestor of another node β if and only if the layer code of α equals the leftmost n bits layer code of β, n is the length of layer code a. The Pre-Order event linkage [6] can t reflect ancestor-descendant relationship directly and clearly. In Fig. 4, by following the linkage started in head table, the chain of event a is {a:3:1, a:2:111, a:1:11101, a:1:101, a:1:10111}. In this chain, a:2:111 appears before a:1:101, but a:1:101 is the second level of root and a:2:111 is the third level of root. So in order to find all first descendant of root of event a, we must scan entire chain. Hence, this algorithm employs the Breadth-First traversal mechanism. Breadth-First traversal mechanism makes use of a Queue into which the nodes of the tree are inserted. First all child nodes of root enter the queue from left to right. Then, as per the FIFO (first in first out) rule, the first element of the queue is accessed. After adding it to the corresponding event queue, all of its children nodes are inserted into the queue and then it is deleted from the queue. The remaining elements of the queue are dealt in the same manner. Fig.5 shows the completely constructed BFWAP tree for the sequence database of Table I. The algorithm starts mining the tree in Fig. 5 for finding frequent 1-sequence by following the a link to find the first occurrences of a nodes in a:3:001 and a:1:00201 of the suffix trees of the Root. Since the count of event a, in this current suffix trees is 4 (sum of a:3:001 and a:1:00201 counts), and more than the minimum support {a} is listed as frequent pattern. Next, we need to confirm whether the 2-sequences aa, ab and ac are frequent using the suffix trees rooted at a i.e at { b:3:00101, b:1:0020101}. For example, to confirm aa is frequent, we need to confirm that event a is frequent in the current suffix tree set. The layer codes Figure 5. BFWAP-tree adapted from [7] of the nodes are used to identify which nodes of each suffix tree are descendants of the first event node ei in the same subtree. To mine all the frequent events in the suffix trees of a:3:001 and a:1:201, which are rooted at b:3: 00101 and b:1: 0020101 respectively, we find the first occurrence of 'a' on each suffix tree, as a:2:0010101, a: 1:001010201 and a:1:002010101. The total count of a is 4 making a, the next frequent event in sequence. Thus, a is appended to the last list of frequent sequence 'a', forming the new frequent sequence 'aa'. This way, we continue to mine other frequent events in the suffix trees. Backtracking in the order of previous conditional suffix BFWAP-tree mined, we search for other frequent events the same as above. Finally, we have the following frequent sequence set: { a, aa, aac, ab, aba, abac, abc, ac, b, ba, bac, bc, c}. The BFWAP algorithm makes use of the WAP-tree structure for storing frequent sequential patterns to be mined. However, to improve the mining efficiency, the algorithm finds layered patterns instead of the prefix patterns as done by PLWAP. The Breadth-First linkage provides a way to traverse the event queue without going backwards. The layer codes are used to identify the level of nodes in the BFWAP tree. With these two methods, the next frequent event in each suffix tree is found without traversing the whole WAP-tree. Thus, it avoids re-constructing WAP-tree recursively. It has been experimented in [7], that mining web log using BFWAP algorithm is much more efficient than with WAP-tree and GSP algorithms, especially when the average frequent sequence becomes longer and the original database becomes larger. 322

III. CONCLUSIONS Sequential Pattern Mining techniques based on Apriori require multiple scans of the sequence database and generate huge number of candidate sets for long web access sequences. This problem makes these algorithms ineffective and inefficient in dealing with large sequence sets with long sequences, especially with web logs. Experimental results on some limited data sets carried out by researchers indicated that the WAP-tree algorithm is an order of magnitude faster than conventional Apriori methods. Several modifications were then proposed in order to further improve the speed of the WAP-tree algorithm. This paper presents a study of different sequential pattern mining algorithms based on web access pattern tree. It is found that the efficiency and effectiveness of these algorithms will deteriorate rapidly when applied to larger log files with long clicking streams. This weakness will become even more serious as the numbers of web pages and users still keep increasing quickly. Therefore, there is a real need for improving the current web usage mining algorithms or developing new efficient ones. REFERENCES [1] [2] [3] [4] [5] [6] [7] R. Cooley, B. Mobasher, and J. Srivastava, Data Preparation for Mining World Wide Web Browsing Patterns, In Journal of Knowledge and Information Systems, Vol. 1,Issue 1, pp. 5-32, 1999. J. Han and M. Kamber, Data Mining Concepts and Techniques, Morgan Kaufmann, 2000. R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules, 20th International Conference on Very Large Databases (VLDB 94), pp. 487-499, 1994. R. Agrawal and R. Srikant, Mining sequential patterns, In 11th International Conference of Data Engineering (ICDE 95), pp. 314, 1995. J. Pei, J. Han, B. Mortazavi-Asl, and H. Zhu., Mining access patterns efficiently from web logs, In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2000. C. Ezeife & Y. Lu., Mining web log sequential patterns with position coded pre-order linked wap-tree, International Journal of Data Mining and Knowledge Discovery (DMKD), Springer Science + Business Media, Vol. 10, pp.5 38, 2005. Liu, L., & Liu, J., Mining maximal sequential patterns with layer coded Breadth-First linked WAP-tree., Asia-Pacific Conference on Computational Intelligence and Industrial Applications (PACIIA), IEEE, pp. 61-65, 2009. 323