Data Mining of Web Access Logs Using Classification Techniques

Similar documents
SEQUENTIAL PATTERN MINING FROM WEB LOG DATA

Survey Paper on Web Usage Mining for Web Personalization

A Survey on Web Personalization of Web Usage Mining

The influence of caching on web usage mining

Pattern Classification based on Web Usage Mining using Neural Network Technique

Knowledge Discovery from Web Usage Data: Complete Preprocessing Methodology

I. Introduction II. Keywords- Pre-processing, Cleaning, Null Values, Webmining, logs

CHAPTER - 3 PREPROCESSING OF WEB USAGE DATA FOR LOG ANALYSIS

Support System- Pioneering approach for Web Data Mining

Log Information Mining Using Association Rules Technique: A Case Study Of Utusan Education Portal

Pre-processing of Web Logs for Mining World Wide Web Browsing Patterns

International Journal of Software and Web Sciences (IJSWS)

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

An Effective method for Web Log Preprocessing and Page Access Frequency using Web Usage Mining

EFFECTIVELY USER PATTERN DISCOVER AND CLASSIFICATION FROM WEB LOG DATABASE

International Journal of Computer Engineering and Applications, Volume VIII, Issue III, Part I, December 14

Data Preprocessing Method of Web Usage Mining for Data Cleaning and Identifying User navigational Pattern

Heuristics miner for e-commerce visitor access pattern representation

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

Association Rule Mining among web pages for Discovering Usage Patterns in Web Log Data L.Mohan 1

Semantic Web Mining and its application in Human Resource Management

Web page recommendation using a stochastic process model

Context-based Navigational Support in Hypermedia

Discovering Paths Traversed by Visitors in Web Server Access Logs

USER INTEREST LEVEL BASED PREPROCESSING ALGORITHMS USING WEB USAGE MINING

Similarity Matrix Based Session Clustering by Sequence Alignment Using Dynamic Programming

Keywords: Figure 1: Web Log File. 2013, IJARCSSE All Rights Reserved Page 1167

A Review Paper on Web Usage Mining and Pattern Discovery

Web Mining Team 11 Professor Anita Wasilewska CSE 634 : Data Mining Concepts and Techniques

INTRODUCTION. Chapter GENERAL

A SURVEY- WEB MINING TOOLS AND TECHNIQUE

Keywords Web Mining, Web Usage Mining, Web Structure Mining, Web Content Mining.

Improved Data Preparation Technique in Web Usage Mining

Ontology Generation from Session Data for Web Personalization

Effectively Capturing User Navigation Paths in the Web Using Web Server Logs

A New Web Usage Mining Approach for Website Recommendations Using Concept Hierarchy and Website Graph

Research/Review Paper: Web Personalization Using Usage Based Clustering Author: Madhavi M.Mali,Sonal S.Jogdand, Deepali P. Shinde Paper ID: V1-I3-002

Chapter 2 BACKGROUND OF WEB MINING

Web Mining and Knowledge Discovery of Usage Patterns

Analytical survey of Web Page Rank Algorithm

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

Web Usage Mining: A Research Area in Web Mining

A PRAGMATIC ALGORITHMIC APPROACH AND PROPOSAL FOR WEB MINING

THE STUDY OF WEB MINING - A SURVEY

CLASSIFICATION OF WEB LOG DATA TO IDENTIFY INTERESTED USERS USING DECISION TREES

A Framework for Personal Web Usage Mining

WEB USAGE MINING: ANALYSIS DENSITY-BASED SPATIAL CLUSTERING OF APPLICATIONS WITH NOISE ALGORITHM

Inferring User Search for Feedback Sessions

EXTRACTION OF INTERESTING PATTERNS THROUGH ASSOCIATION RULE MINING FOR IMPROVEMENT OF WEBSITE USABILITY

Keywords: Web Mining, Web Usage Mining, Web Structure Mining, Web Content Mining, Graph theory.

ARS: Web Page Recommendation System for Anonymous Users Based On Web Usage Mining

Weighted Page Content Rank for Ordering Web Search Result

Web Mining for Web Personalization

Web Usage Mining for Comparing User Access Behaviour using Sequential Pattern

An Approach To Web Content Mining

A Lime Light on the Emerging Trends of Web Mining

Web Mining Using Cloud Computing Technology

Data warehousing and Phases used in Internet Mining Jitender Ahlawat 1, Joni Birla 2, Mohit Yadav 3

Behaviour Recovery and Complicated Pattern Definition in Web Usage Mining

Fuzzy Cognitive Maps application for Webmining

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

IJMIE Volume 2, Issue 9 ISSN:

COMPREHENSIVE FRAMEWORK FOR PATTERN ANALYSIS THROUGH WEB LOGS USING WEB MINING: A REVIEW

IJITKMSpecial Issue (ICFTEM-2014) May 2014 pp (ISSN )

Text Document Clustering Using DPM with Concept and Feature Analysis

Mining of Web Server Logs using Extended Apriori Algorithm

A Comparative Study of Selected Classification Algorithms of Data Mining

Web Data mining-a Research area in Web usage mining

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

Identification of Navigational Paths of Users Routed through Proxy Servers for Web Usage Mining

Web Usage Mining. Overview Session 1. This material is inspired from the WWW 16 tutorial entitled Analyzing Sequential User Behavior on the Web

Association-Rules-Based Recommender System for Personalization in Adaptive Web-Based Applications

Web Log Data Cleaning For Enhancing Mining Process

Semantic Clickstream Mining

On the Effectiveness of Web Usage Mining for Page Recommendation and Restructuring

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Knowledge Discovery from Web Usage Data: An Efficient Implementation of Web Log Preprocessing Techniques

An Efficient Technique for Tag Extraction and Content Retrieval from Web Pages

Hybrid Approach for Improvement of Web page Response Time

Mohri, Kurukshetra, India

COMPARISON OF DIFFERENT CLASSIFICATION TECHNIQUES

A NOVEL APPROACH: VARIOUS CLASSIFICATIONS IN WEB MINING USING DIAGNOSIS OF FAULTS IN ELECTRO- MECHANICAL DEVICES FROM VIBRATION MEASUREMENTS

ANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, Comparative Study of Classification Algorithms Using Data Mining

Web Mining Evolution & Comparative Study with Data Mining

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

A Survey on Preprocessing Techniques in Web Usage Mining

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Web Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

Construction of Web Community Directories by Mining Usage Data

EXTRACTION AND ALIGNMENT OF DATA FROM WEB PAGES

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page

Modelling Structures in Data Mining Techniques

Mining for User Navigation Patterns Based on Page Contents

Chapter 27 Introduction to Information Retrieval and Web Search

Fault Identification from Web Log Files by Pattern Discovery

E-COMMERCE WEBSITE DESIGN IMPROVEMENT BASED ON WEB USAGE MINING

Keywords Web Usage, Clustering, Pattern Recognition

American International Journal of Research in Science, Technology, Engineering & Mathematics

Transcription:

Data Mining of Web Logs Using Classification Techniques Md. Azam 1, Asst. Prof. Md. Tabrez Nafis 2 1 M.Tech Scholar, Department of Computer Science & Engineering, Al-Falah School of Engineering & Technology, Dhauj, Faridabad, Haryana, INDIA 2 Asst. Prof. Department of Computer Science Faculty of Management & IT, Jamia Hamdard, Abstract: This paper focus on classification, technological advancement, business have gone online. Because of the growing popularity of the World Wide Web. Many website typically experience Lac of visitors every day. Data mining techniques such as association rules, classification, and clustering and attribute selection are considered very useful in web usage mining. Data Mining is gaining more popularity because of its power to extract knowledge from voluminous data where it is beyond the reach of traditional techniques of knowledge discovery and human comprehension. A data mining tool, WEKA is used. The WEKA is a collection of machine learning algorithms for solving real-world data mining problems. Keyword: Data Mining, Data Mining Techniques, Data Classification. I. INTRODUCTION Data Mining is ahead more recognition because of its power to extract knowledge from voluminous data where it is beyond the reach of traditional techniques of knowledge discovery and human understanding. The entire process of data mining can be divided into 4 phases: data collection, data preparation, pattern discovery and pattern analysis. The data collection phase involves collection of data from different sources and identifying desired features for mining. Data collected from different sources may be heterogeneous. The data preparation phase involves the process of normalizing the data and representing them in a structure so that they become more convenient. Features recognized in the previous phase are extracted and at last, formatting is applied to represent data in a format required by the data mining tool to be used. Mining Techniques Data mining techniques can be used to find access patterns hidden inside huge volumes of web access data. Once the data are prepared and formatted, the data mining techniques are applied to discover patterns. For the experiments of this research paper, a data mining tool, WEKA is used. The WEKA is a collection of machine learning algorithms for solving real-world data mining problems. Data mining tools for preprocessing, classification, clustering, attribute selection and visualization are implemented in the WEKA. Prototype Analysis Prototype analysis is the final phase of the knowledge discovery process. In this phase, the discovered patterns are judged for their interestingness. The exact methodology depends upon the application for which mining is done [3]. The analysis can be done using the content and structure of the website. Visualization techniques that represent the values at different positions using different colors are often useful in the Prototype analysis [3]. Network Mining Incredible growth of the World Wide Web, the raw web data has become a gigantic source of information. accordingly, this has turned researchers attention towards the use of data mining Page 1

techniques to this data. Network Mining is referred to as the application of Network mining technologies to the web data [4]. Network mining can be categorized into content mining, structure mining or usage mining depending upon which part of the web to mine [5]. Content Mining The web content is the authentic data the web page was deliberate to convey to the users. It consists of several types of data such as unstructured text, graphics, sound, video and semistructured hypertext. Content mining can be referred to as the application of data mining algorithms to the content of the web [3]. A conceptual schema can be created [6] that can describe the semantics of a large volume of unstructured web data to manage them [5]. Discussed various categories of the web content mining such as text mining which is mining of unstructured texts and multimedia data mining which is mining of multiple types of data such as unstructured and image data. The access log files of the June- and the March-2014 were used for the experiments. One experiment addresses one goal on one access log file. For example, experiment was conducted for the goal Nalanda Open Univeristy VsNot Nalanda Open University using the March-2014 log file. For each experiment, data mining techniques were used. These techniques were classification, association rules, clustering and attribute selection. A pattern discovery task for one of these data mining techniques was performed using one of four different feature sets extracted from the preprocessed data. The four different feature sets used for the experiments are,, and 20- Most-. As an example, a pattern discovery task can be performed to find association rules using the feature set. Net Logs and Net Usage Mining The Net access log is the main resource for Net usage mining because it stores data pertaining to accesses of the website. The usage data can be stored in Common Log Format (CLF) or Extended Log Format (ELF). The Net access log in CLF format has information of the IP address of a visitor's machine, the user id of visitor if available and date/time of the page request. The method is a means of page request. It can be GET, PUT, POST or HEAD. The URL is the page that is requested. The protocol is the means of communication used, HTTP for example. The status is the completion code. For example, 500 is the code for success. The size field shows the bytes transferred as a result of a page request. The Extended Log Format, in addition to these information, stores referrer, which is the page this request has come from and agent is the web browser used. Data Mining Experiments This section brief describes the organization of experiments conducted and the pattern discovery tasks performed for each experiment. A hierarchical view of the pattern discovery tasks performed in the experiments of this paper and is intended to show the organization of the experiments. A Hierarchical View of the Pattern Discovery Tasks Performed Web Log Data Information about web log data used for the experiments. The web crawler entries were not useful for the experiments. crawlers generally access robots.txt file for the website access permissions. Therefore, these entries were removed by preparing a list of crawlers that accessed the robots.txt file. However, this technique does not remove all crawler entries. The log entries of IP addresses that no longer existed were also not useful for the experiments. Hence, those entries were removed. Page 2

The image entries, entries by proxy servers and entries of bad requests, were not useful for the experiments of this paper. Therefore, they were removed. Web Logs File Number Entries of Period 2010 1000000 29/05/2010-03/06/2010 11391157 04/02/ 23/04/ Detail of Web Log Files Operational Identification An IP-Day consists of all entries of the pages visited from one IP address in a day. Once the extraneous entries were removed from the web log data, IP-Days were extracted based on the IP addresses. The transactions were identified from the IP-Days by the time duration between two consecutive visits. If the time duration between accesses of two pages X and Y in an IP- Day is more than 30 minutes then X was considered as the last page accessed in one transaction and Y was considered as the first page accessed in another transaction. A time period of 30 minutes was considered appropriate to distinguish two transactions. It may be possible that a visitor starts a visit at 12:55 pm and ends at 00:11 am in which case there will be two transactions generated instead of one. It was conjectured that transactions that have at least 5 pages visited would be useful for this data mining task. Transactions with fewer than 5 visited pages were not used. The number of transactions used for the experiments from each log file. Web Log file 2010 Number of Transaction 4591 55602 Number of Transaction Used from Web Log Files Experiment: IndVsOutsideInd The experiment was conducted to compare access patterns between visitors from within India and visitors from outside India using the web access log file access. Classification The results were analysed only if the classification accuracies were above 70 %. The output of the J48 and the 1R algorithms indicated the root as the most significant attribute. The partial decision tree that if visitors visit the root page at all then they are from within India. This result is consistent with the result obtained using the 20- Most- feature set. Data Mining Technique Classification Association Rules Clustering Web Log File Feature Used Set Significant Pattern Discovered Page 3

Attribute Selection Experiment2: IndVsOutsideInd - Summary of results The decision tree also shows that if visitors do not visit the root page and visit the /timetables" page then they are from within India. This may be because visitors from within India tend to look at the timetables of various subjects of their interest. It may be possible that this result is dominated by students of Nalanda Open University who know the URL of the /timetables page and hence visit this page directly. The decision tree also shows that if visitors do not visit the root page and they also do not visit the /timetables page but visit /students page then they are from within India. This result is obtained may be because the students of Nalanda Open University visit this page frequently for various activities such as checking the emails. The output of the 1R algorithm indicated /courses as the most significant attribute. The partial decision tree the root as the most significant attribute. The decision tree also shows that if visitors do not visit the root page and visit the /timetables" page then they are from within India. The decision tree also showed that if visitors do not visit the root page and they also do not visit the /timetables page but they visit /students page then they are from within India. This outcome is consistent with the result obtained using the feature set. Partial Decision Tree of the WEKA J48 Program Using the 20- Most- Feature Set in Experiment Association Rules The association rules generated by the Apriori algorithm were analysed to find their interestingness. The Apriori algorithm found one interesting rule which can be interpreted as if visitors visit the /timetable page then they are from within India. This result is consistent with the results obtained using the feature set. Attribute Selection The attributes selected by the process of Attribute Selection using cfsseteval" attribute evaluator together with BestFirst" search method were analysed for their interestingness. Attribute selection using this feature set indicated link0, which is the first page visited in a transaction, as the most significant attribute. This result is consistent with the results obtained the feature set. Page 4

Attribute selection using this feature set indicated link0 as the most significant attribute. This result is consistent with the result obtained using the feature set. Attribute selection using this feature set indicated the root page as the most significant attribute. This result is consistent with the result using the feature set. [5] B. Mobasher, R. Cooley, and J. Srivastava. Creating adaptive web sites through usage-based clustering of urls. In IEEE Knowledge and Data Engineering Workshop (KDEX'99), 1999. [6] C. R. Anderson. A machine learning approach to web personalization. PHD Thesis, University of Washington, 2002. Attribute selection using this feature set indicated the root page, /students" and /timetables" as the most significant attributes in order. This result is consistent with the result shown by decision trees using the and the 20- Most- feature sets. II. CONCLUSIONS The goal of the experiments was to compare access patterns between visitors from within India and visitors from outside India. Visitors from within India mostly visit the root page, whereas visitors from outside India mostly visit specific pages directly. This is probably because they use search engines. REFERENCES [1] J. Srivastava, R. Cooley, M. Deshpande, and P. Tan. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2):12{23, 2000. [2] J. Borges and M. Levene. Data mining of user navigation patterns. In Proc. of the Workshop on Web Usage Analysis and User Profiling (WEBKDD'99), pages 92{111, 2001. [3] R. Kosala and H. Blockeel. Web mining research: A survey. ACM SIGKDD, 2(1):1{15, 2000. [4] K.W. Tan, H. Han, and R. Elmasri. Web data cleansing and preparation for ontology extraction using WordNet. In First International Conference on Web Information Systems Engineering (WISE'00), volume 2. Page 5