SEQUENTIAL PATTERN MINING FROM WEB LOG DATA

Similar documents
Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey

Appropriate Item Partition for Improving the Mining Performance

Graph based Approach for Mining Frequent Sequential Access Patterns of Web pages

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Web page recommendation using a stochastic process model

Log Information Mining Using Association Rules Technique: A Case Study Of Utusan Education Portal

Data Mining of Web Access Logs Using Classification Techniques

A Survey on Web Personalization of Web Usage Mining

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Survey Paper on Web Usage Mining for Web Personalization

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

Finding Generalized Path Patterns for Web Log Data Mining

International Journal of Advanced Research in Computer Science and Software Engineering

Mining Rare Periodic-Frequent Patterns Using Multiple Minimum Supports

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining

Keywords Web Mining, Web Usage Mining, Web Structure Mining, Web Content Mining.

Mining of Web Server Logs using Extended Apriori Algorithm

An Algorithm for Mining Large Sequences in Databases

Mining Web Access Patterns with First-Occurrence Linked WAP-Trees

Comparing the Performance of Frequent Itemsets Mining Algorithms

I. Introduction II. Keywords- Pre-processing, Cleaning, Null Values, Webmining, logs

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

Mining Frequent Patterns without Candidate Generation

Probability Measure of Navigation pattern predition using Poisson Distribution Analysis

UMCS. Annales UMCS Informatica AI 7 (2007) Data mining techniques for portal participants profiling. Danuta Zakrzewska *, Justyna Kapka

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM

Concurrent Processing of Frequent Itemset Queries Using FP-Growth Algorithm

International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November ISSN

EFFECTIVELY USER PATTERN DISCOVER AND CLASSIFICATION FROM WEB LOG DATABASE

Performance Analysis of Apriori Algorithm with Progressive Approach for Mining Data

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Association-Rules-Based Recommender System for Personalization in Adaptive Web-Based Applications

USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS

Heuristics miner for e-commerce visitor access pattern representation

Web Usage Mining for Comparing User Access Behaviour using Sequential Pattern

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

Keshavamurthy B.N., Mitesh Sharma and Durga Toshniwal

Hierarchical Online Mining for Associative Rules

Pattern Classification based on Web Usage Mining using Neural Network Technique

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

A Novel Method of Optimizing Website Structure

Maintenance of the Prelarge Trees for Record Deletion

An Approximate Approach for Mining Recently Frequent Itemsets from Data Streams *

Improved Frequent Pattern Mining Algorithm with Indexing

ANALYSIS OF DENSE AND SPARSE PATTERNS TO IMPROVE MINING EFFICIENCY

Data Mining Part 3. Associations Rules

Mining High Average-Utility Itemsets

Improving the Efficiency of Web Usage Mining Using K-Apriori and FP-Growth Algorithm

Support System- Pioneering approach for Web Data Mining

Web Service Usage Mining: Mining For Executable Sequences

Web Usage Mining: An Incremental Positive and Negative Association Rule Mining Approach Anuradha veleti #, T.Nagalakshmi *

WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity

The Objectivity Measurement of Frequent Patterns

Proxy Server Systems Improvement Using Frequent Itemset Pattern-Based Techniques

PTclose: A novel algorithm for generation of closed frequent itemsets from dense and sparse datasets

Parallelizing Frequent Itemset Mining with FP-Trees

A Technical Analysis of Market Basket by using Association Rule Mining and Apriori Algorithm

STUDY ON FREQUENT PATTEREN GROWTH ALGORITHM WITHOUT CANDIDATE KEY GENERATION IN DATABASES

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

Inferring User Search for Feedback Sessions

Applying Data Mining to Wireless Networks

Generation of Potential High Utility Itemsets from Transactional Databases

ETP-Mine: An Efficient Method for Mining Transitional Patterns

Comparative Study of Techniques to Discover Frequent Patterns of Web Usage Mining

DISCOVERING ACTIVE AND PROFITABLE PATTERNS WITH RFM (RECENCY, FREQUENCY AND MONETARY) SEQUENTIAL PATTERN MINING A CONSTRAINT BASED APPROACH

An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction

Categorization of Sequential Data using Associative Classifiers

Mining Bibliographic Patterns

An Efficient Method of Web Sequential Pattern Mining Based on Session Filter and Transaction Identification

Association Rule Mining

Review Paper Approach to Recover CSGM Method with Higher Accuracy and Less Memory Consumption using Web Log Mining

A Comprehensive Survey on Sequential Pattern Mining

Efficient Frequent Pattern Mining on Web Log Data

Knowledge Discovery from Web Usage Data: Complete Preprocessing Methodology

Web Data mining-a Research area in Web usage mining

Fuzzy Cognitive Maps application for Webmining

Web Usage Mining: A Research Area in Web Mining

An Algorithm for Frequent Pattern Mining Based On Apriori

Ontology Generation from Session Data for Web Personalization

Web Usage Mining: How to Efficiently Manage New Transactions and New Clients

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

Monotone Constraints in Frequent Tree Mining

Frequent Pattern Mining

IMPLEMENTATION AND COMPARATIVE STUDY OF IMPROVED APRIORI ALGORITHM FOR ASSOCIATION PATTERN MINING

International Journal of Software and Web Sciences (IJSWS)

Using Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment

Mining Quantitative Association Rules on Overlapped Intervals

This paper proposes: Mining Frequent Patterns without Candidate Generation

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

Integration of Candidate Hash Trees in Concurrent Processing of Frequent Itemset Queries Using Apriori

FM-WAP Mining: In Search of Frequent Mutating Web Access Patterns from Historical Web Usage Data

Effectively Capturing User Navigation Paths in the Web Using Web Server Logs

Performance Analysis of Data Mining Algorithms

Maintenance of fast updated frequent pattern trees for record deletion

MINING FREQUENT MAX AND CLOSED SEQUENTIAL PATTERNS

Association Rule Mining from XML Data

A NOVEL ALGORITHM FOR MINING CLOSED SEQUENTIAL PATTERNS

Incremental Mining of Frequent Patterns Without Candidate Generation or Support Constraint

Survey: Efficent tree based structure for mining frequent pattern from transactional databases

Efficient GSP Implementation based on XML Databases

Transcription:

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA Rajashree Shettar 1 1 Associate Professor, Department of Computer Science, R. V College of Engineering, Karnataka, India, rajashreeshettar@rvce.edu.in Abstract Sequential Pattern Mining involves applying data mining methods to large web data repositories to extract usage patterns. The growing popularity of the World Wide Web, many websites typically experience thousands of visitors every day. Analysis of who browsed what, can give important insight into the buying pattern of existing customers. Correct and timely decisions made based on this knowledge have helped organizations in reaching new heights in the market. In this paper, the sequence tree algorithm is implemented for pattern mining and this is experimented on web log data. The web log data which is considered as secondary data of the web has been considered for the discovery of frequent sequential patterns. The results have shown that the sequence tree algorithm performs better than the well-known Generalized Sequential Pattern (GSP) algorithm. The experiment shows that the running time of sequence tree algorithm is faster than the standard GSP algorithm and also Sequence Tree algorithm discovers more number of patterns than the standard GSP algorithm. Index Terms: Web usage mining, sequential patterns, sequence tree, web log data. --------------------------------------------------------------------- *** ------------------------------------------------------------------------ 1. INTRODUCTION Knowledge extraction from the World Wide Web has become an important and challenging task as enormous amount of data in form of semi-structured nature is available. Web mining approaches such as web content mining, web structure mining and web usage mining are available [1, 2, 3] which help to extract and produce useful knowledge from the web. Web usage mining is used for applications such as customer shopping sequences, web clicks, biological sequences, creation of dynamic user profiles, etc. In this paper, the discovery of frequent patterns from web log data is being considered. A web server usually registers a log entry for every access of a web page. First, raw web log data needs to be cleaned, condensed and transformed in order to retrieve and analyze significant and useful information. Second, pattern mining can be performed on log records to find association patterns, sequential patterns and trends of web accessing. With the use of such patterns, studies have been conducted on analyzing system performance, improving system design by web caching, web page pre-fetching, understanding the nature of web traffic, etc. The process of web usage mining consists of three major steps [4] (1) data pre-processing, (2) pattern discovery, and (3) pattern analysis stages. Log files are stored on the server side, on the client side and on the proxy servers. In this work, only the server side web log data is being used which is the Click stream data set [5]. The data pre-processing phase includes the data cleaning, user identification, session identification and data transformation respectively. The pattern discovery phase involves the discovery of frequent sequences. The pattern analysis phase involves the analysis of the frequent patterns generated by the pattern discovery phase. Many different techniques for mining frequent sequential patterns from the log data have been proposed in the recent past. The authors in [10] discuss that the candidategeneration-and-test approach outperforms the pattern-growth approach on mining short patterns, while pattern-growth approach is better on mining long patterns. The authors in [11] discuss the process of web log mining. Sequence mining is accomplished in [12], where a so-called WAP-tree is used for storing the patterns efficiently. Tree-like topology patterns and frequent path traversals are searched by [13, 14, 15, 16]. 2. DATA PRE-PROCESSING Data pre-processing phase involves data cleaning, user identification and session identification and data transformation [4]. In data cleaning stage phase, the web log is examined and irrelevant or redundant items such as image, sound, video files, executable cgi files and HTML files which could be downloaded without an explicit user request are removed. Data cleaning stage also involves removal of Available online @ http://www.ijesat.org 204

HTTP errors, records created by crawlers, etc. Data cleaning is performed by checking for file extensions such as GIF, JPEG, jpg, mpg, avi, etc. These are removed from the log. Each record of the log contains the following 6 fields. They are the shop-id, time the page was visited, IP address of the user visiting the page, a unique session identifier, the current page that is visited and the last field is the referrer which refers to the page from which the current page was visited. The user identification phase involves identification of users from the log data. The following procedure is adopted for identifying the users. (1) A new IP identifies a new user. (2) If the same IP is used, but different web browsers or different operating system in terms of type and version is being used, then this is considered as new user. The session identification stage involves identifying sessions according to different users. A session is a group of activities performed by the user while navigating through a given site within a given time period. In this paper, a set of pages visited by a specific user is considered as a single user session if the pages are requested at a time interval not larger than a specified time period of 60 milliseconds. In the data transformation phase, the web log data is converted into a format needed by the mining algorithms. For every session that is identified, the log data is converted into two databases transactional database and sequential database. The transactional database consists of two fieldstransaction id and the set of pages visited in a session. A sequential database consists of sequence of pages visited in a session and the data set used is Click Stream data which is server side log data. The records are sorted in ascending order based on IP address and the number of users is identified. The boundaries of each user are marked and stored in a two dimensional array. The boundaries specify the starting index and the ending index of a particular user in a sorted log file. In the session identification phase of the pre-processing stage, a session timestamp value is set which specifies the total time of a particular session. The value of the session timestamp has been set as 60 milliseconds. The record obtained is checked to see if it belongs to a particular session. If it belongs to the same session, it is added to the session list and the next record is processed. If it does not belong to the same session, then the previous session records are put onto the file and then the session list is cleared. A new session is created and the current record is added to the new session and its timestamp value is initialized as the session start time. This process is repeated for every user and all sessions are identified. The records in every session are then transformed into item sets and sequences. In the data transformation phase, all the fields of the records are transformed into appropriate format. 3. FREQUENT SEQUENCES Sequential Pattern mining involves discovering frequent sequences from a database where data to be mined is in some sequential order, this was first introduced by Agrawal and Srikant [6], and from then on the goal of sequential pattern mining is to discover all frequent sequences of itemsets in a dataset. In particular, an itemset is a non-empty subset of elements from a set C, the item collection, called items. In this manner, an itemset represents the set of items that occur together. The itemset composed of items a and b is denoted by (ab). A sequence is an ordered list of itemsets. A sequence is maximal if it is not contained in any other sequence. A sequence with k items is called a k- sequence. The number of elements (itemsets) in a sequence s is the length of the sequence and is denoted by s. The i th itemset in the sequence is represented by s i and the set of considered sequences is usually designated by database (DB), and the number of sequences by database size ( DB ). A subsequence s of s is denoted by s s. Formally, a sequence a=<a 1, a 2, a 3,..a n > is a subsequence of b=<b 1, b 2, b 3,.b m >, if there exists integers 1 i 1, i 2, i 3,. m such that a 1 b i1, a 2 b i2,.. a n b in [8]. Let I = {a 1, a 2,...,a m } be a set of items, and a transaction database DB = <T 1, T 2,..., T n>, where T i (i [1...n]) is a transaction which contains a set of items in I. The support (or occurrence frequency) of a pattern A, is the number of transactions containing A in DB, where A is a set of items. A pattern A is frequent if A s support is no less than a predefined minimum support threshold,. Given a transaction database DB and a minimum support threshold, the problem of finding the complete set of frequent patterns is called the frequent-pattern mining problem. [9] 4. SEQUENCE TREE ALGORITHM Sequential tree algorithm is used to find the frequent sequences in the log file. It involves two stages- construction of the sequence tree and mining of the sequence tree for finding frequent sequences. It takes as input the sequential database records which specify the various sequences of pages visited by the user in a particular session and also threshold. The frequent sequences that have been identified from the log file based on the minimum support threshold are obtained as output. The sequence tree algorithm is described in section 3.1 from steps (1) to (6). Available online @ http://www.ijesat.org 205

4.1 Sequence Tree (1) Read the sequential database and create a map with key-value pairs. Key refers to the unique page name that was visited and sequence value refers to the frequency or number of times the page occurs. (2) Sort the map in descending order of frequency of the keys (3) Create a sequence tree using the following steps A root node is termed as null For every row of sequential database that is read Attach the sequence as a branch to the root. If the sequence is already present increment the counter a new branch is created with that sequence. (4) Include a header table with number of rows equal to sequences with value one. Each row of header table is a linked list of nodes which specify the position where nodes are present. (5) Perform mining process For every row in the header table If frequency < minimum support threshold Ignore the row of header table For every node in the particular linked list If frequency < minimum support threshold Ignore the path Traverse the tree up till the root and store the path in the file as frequent sequences. EndFor EndFor (6) The frequent sequences identified from the algorithm are stored in a file Table 1: Comparison between the running times of GSP and Sequence tree algorithms Threshold Values GSP (in ms) Sequence Tree (in ms) 10 44444 47 20 35146 47 30 38860 47 50 37690 32 80 37501 31 Table 2: Comparisons between number of patterns generated by GSP and Sequence tree algorithms Threshold Values GSP Sequence Tree 10 179 1568 20 70 658 30 34 374 50 11 183 80 5 107 Graph 1 shows the comparison of running time of GSP and Sequence tree algorithm for different threshold values. Graph 2 shows the number of patterns found by GSP and Sequence tree algorithms for different threshold values. Graph 1: Comparison of running time of GSP and Sequence tree algorithm for different threshold values 5. EXPERIMENTAL RESULTS The experimental results and analysis of frequent sequences discovered from web log data is described in this section. The experimental analysis is conducted on the Click Stream Data set [5] which is a server side web log data containing 12,000 records. Each record contains the following fields a shop identifier, time stamp, IP address, unique session identifier, page visited, and referrer. The performance is tested on a computer with a 1.41GHz processor. The program is developed using JDK 1.6. Table 1 shows the comparison between the running times of GSP[7] and Sequence tree algorithms for different threshold values. Table 2 shows the comparisons between number of patterns generated by GSP and Sequence tree algorithms for different threshold values. Available online @ http://www.ijesat.org 206

Graph 2: The number of patterns found by GSP and Sequence tree algorithms [5] Stream data downloaded from the ECML/PKDD 2005 Discovery Challenge2. http://lisp.vse.cz/challenge/current [6] Agrawal R and Srikant R, Mining Sequential Patterns, in Int'l. Conf. Data Engineering (ICDE 95), ( 1995) 3-14 [7] Srikant R, Agrawal R: Mining Sequential Patterns: Generalizations and Performance Improvements, in Int 'l Conf Extending Database Technology. Springer ( 1996) 3-17 It is observed that the running time taken by Sequence tree algorithm is lesser than that of GSP algorithm. The total number of patterns generated by Sequence tree algorithm is greater than the GSP algorithm. 6. CONCLUSION Web usage mining techniques has been applied to large web repositories to extract usage patterns. The sequence tree algorithm is one such web usage mining technique which extracts frequent sequential patterns by formation of a tree. The running time and number of patterns generated are observed. The experimental results show that the Sequence Tree algorithm shows faster running time than the standard GSP algorithm and it also discovers more number of patterns than the standard GSP algorithm. REFERENCES [1] Kosala and Blockeel, Web mining research: A survey, SIGKDD: SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery and Data Mining, ACM, Vol. 2, 2000 [2] S. K. Madria, S. S. Bhowmick, W. K. Ng, and E.-P. Lim, Research issues in web data mining, in Data Warehousing and Knowledge Discovery, pp. 303-312, 1999. [3] J. Borges and M. Levene, Data mining of user navigation patterns, in WEBKDD, pp. 92-111, 1999. [4] Cooley R., Mobasher B., and Srivastava J., Data Preparation for Mining World Wide Web Browsing Patterns, In J. Knowledge and Information Systems, pp. 5.32, vol. 1, no. 1, 1999. 15. R. Kosala, H. Blockeel, Web Mining Research: A Survey, In SIGKDD Explorations, ACM. [8] Cláudia Antunes and Arlindo L. Oliveira Sequential Pattern Mining s: Trade-offs between Speed and Memory. [9] Jiawei Han, Jian pei, Yiwen yin and Runying mao, Mining Frequent Patterns Without Candidate Generation: A Frequent Pattern Tree Approach, Data Mining and Knowledge Discovery, 8, 53 87, 2004. [10] Sun, L and Zhang, X 2004, efficient frequent pattern mining on web logs, in JX Yu et al, Advanced Web Technologies and Applications:6 th Asia-Pacific Web Conference, APWeb 2004, Berlin, March 2004. [11] Renáta Iváncsy, István Vajk, Frequent Pattern Mining in Web Log Data, pp.77-70, Vol. 3, No. 1, 2006. [12] J. Pei, J. Han, B. Mortazavi-Asl, and H. Zhu, Mining access patterns efficiently from web logs, in PADKK 00: Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Current Issues and New Applications London, UK: Springer-Verlag, 2000, pp. 396-407 [13] M. S. Chen, J. S. Park, and P. S. Yu, Data mining for path traversal patterns in a web environment, in Sixteenth International Conference on Distributed Computing Systems, 1996, pp. 385-392 [14] X. Lin, C. Liu, Y. Zhang, and X. Zhou, Efficiently computing frequent tree-like topology patterns in a web environment, in TOOLS 99: Proceedings of the 31st International Conference on Technology of Object-Oriented Language and Systems. Washington, DC, USA: IEEE Computer Society, 1999, pp. 440 [15] A. Nanopoulos and Y. Manolopoulos, Finding generalized path patterns for web log data mining, Available online @ http://www.ijesat.org 207

in ADBIS-DASFAA 00: Proceedings of the East- European Conference on Advances in Databases and Information Systems Held Jointly with International Conference on Database Systems for Advanced Applications. London, UK: Springer-Verlag, 2000, pp. 215-228 [16] A. Nanopoulos and Y. Manolopoulos, Mining patterns from graph traversals, Data and Knowledge Engineering, Vol. 37, No. 3, pp. 243-266, 2001 Available online @ http://www.ijesat.org 208