Chapter 1
INTRODUCTION

1.1 GENERAL

The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which people communicate with each other as well as with machines. Since its inception in 1989, it has been connecting people from all walks of life, anywhere in the world, whose paths cross in one way or another. As a result, the Internet has become a global village [2], growing from 16 million web users in December 1995 to 2,095 million in March 2011 [3].

The ever increasing interest of people in the information spread across the WWW has led to the development of an interlinked field: Information Retrieval. Information Retrieval (IR) [4, 5] is the area concerned with retrieving information about a subject from a collection of data objects. IR differs from data retrieval, which in the context of documents consists mainly of finding which documents of the collection contain the keywords of a user query; IR, in contrast, deals with finding the information actually needed by the user.

The WWW has distinctive properties: it is extremely complex, massive in size, and highly dynamic in nature. Owing to this unique nature, searching and retrieving information from the Web efficiently and effectively is a challenging task, especially when the goal is to realize the Web's full potential. With powerful workstations and parallel processing technology, efficiency is not a bottleneck; indeed, some existing search tools sift through gigabyte-size precompiled web indexes in a fraction of a second. Effective retrieval of information, however, is still a developing research area.

Current search tools retrieve too many documents, of which only a small fraction are relevant to the user query. This is where information retrieval search engines have to focus. A brief overview of the search engine process is given in the next section.

1.2 SEARCH ENGINE: AN INFORMATION RETRIEVAL TOOL

In general, search engines [6] allow users to search for documents by submitting queries in the form of keywords in the search interface. The search engines in turn retrieve links to the relevant documents. Broadly, the working of the search engine components can be divided into two modules, as shown in Fig 1.1: a query independent module (crawling and indexing, which build the search engine's knowledge base) and a query dependent module (query processing and ranking).

[Fig 1.1 Classification of Search Engine: the query independent module (Crawling, Indexing) feeds the knowledge base; the query dependent module (Query Processor, Ranking) answers user queries.]

As can be seen from Fig 1.1, at the operational level search engines [7] comprise the following four major components:

Crawler, Indexer, Query Processor, and Ranking. A brief discussion of each component follows:

1. Crawler: An automated web browser [8] that follows every hyperlink on the various web sites of the WWW to retrieve web pages. These web pages are stored in databases at the search engine's side; the crawler is therefore a query independent module. The contents of each stored page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called metatags). Meta information about the web pages is stored in an index for later queries.

2. Indexer: Also a query independent module. Search engine indexing [10] is done after the web crawler has stored documents in the search engine's database. The indexer analyzes the documents, extracting the important terms and building an appropriate index for fast retrieval of documents against user queries.

3. Query Processor: A query dependent module [11] which uses the search engine's index to consult the database and retrieve related documents. The searching process also maintains log files to optimize both the indexer and the crawler by providing information about the active set of pages, i.e. those actually seen and sought by users.

4. Ranking: A query dependent process [12]. The active set of pages is ranked according to a ranking strategy before being presented to the user. Since different search engines follow different strategies, the search process may lead to different active sets of pages.
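To make the indexer's role concrete, here is a minimal sketch of an inverted index in Java (the language used for the implementation in Chapter 5). It assumes the crawler has already fetched the pages as plain strings; all class and method names are illustrative rather than taken from any actual search engine.

import java.util.*;

// Minimal inverted-index sketch: maps each term to the set of
// document ids containing it, enabling fast keyword lookup.
public class InvertedIndex {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Tokenize a crawled page and record which document each term occurs in.
    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            index.computeIfAbsent(token, t -> new HashSet<>()).add(docId);
        }
    }

    // Retrieve documents containing every keyword of the query.
    public Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String token : query.toLowerCase().split("\\W+")) {
            Set<Integer> docs = index.getOrDefault(token, Collections.emptySet());
            if (result == null) result = new HashSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addDocument(1, "Web mining extracts patterns from web documents");
        idx.addDocument(2, "Search engines index web pages for retrieval");
        System.out.println(idx.search("web index")); // prints [2]
    }
}

Running main prints [2]: only document 2 contains both query terms, mirroring how the query processor later intersects the index entries for the keywords of a query.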

Search engines are continuously being evolved to improve the quality of the active sets of web pages returned to the user for a submitted query. Yet even after adopting complex algorithms and strategies in both the query independent and query dependent modules, the user is still presented with a huge list of results. To tackle this problem of information overkill, the results presented to the user must be refined, which leads to the necessity of developing various automated tools at the server side or the client side. The following section highlights the role of web mining and web prefetching in retrieving the documents relevant to the user.

1.3 ROLE OF WEB MINING

With the constant increase in the amount of information present on the WWW, providing relevant information to the user in the least amount of time has become a challenging task. However, the latency perceived by the user* can be effectively reduced by employing web mining techniques in IR. Web mining [13] is the automatic retrieval, extraction and evaluation of information, in the form of interesting patterns, for knowledge discovery from web documents using data mining techniques.

* User perceived latency is the delay from the time a request is issued until the response is received.

In the web mining domain, data is paramount: the quality of the information to be mined depends on it. There are three types of web data that can be mined: content, usage, and structure. Content mining covers text and multimedia; usage mining covers web log mining, which further includes search logs and other usage data (a parsing sketch is given at the end of this section); and structure mining analyzes the link structure of the Web. These three types of web data determine the breadth of the web mining domain.

Web mining is consistently being improved to reduce the user perceived latency. A critical look at the available literature indicates that the remedy for this wait time is to prefetch web documents with the help of suitable prediction techniques.
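As an illustration of the raw material of web usage mining, the sketch below parses a single server log entry. It assumes the log follows the standard Common Log Format; the sample line and the class name are invented for the example.

import java.util.regex.*;

// Illustrative preprocessing step for web usage mining: extracting the
// fields of one Common Log Format entry that session identification needs.
public class LogEntryParser {
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) [^\"]*\" (\\d{3}) (\\S+)");

    public static void main(String[] args) {
        String line = "192.168.1.10 - - [15/Apr/2012:10:32:08 +0530] "
                    + "\"GET /index.html HTTP/1.1\" 200 2326";
        Matcher m = CLF.matcher(line);
        if (m.find()) {
            System.out.println("client = " + m.group(1)); // IP, used to identify users
            System.out.println("time   = " + m.group(2)); // timestamp, used to split sessions
            System.out.println("page   = " + m.group(4)); // requested URL, the mined item
            System.out.println("status = " + m.group(5)); // filter out failed requests
        }
    }
}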

1.4 ROLE OF WEB PREFETCHING

Although web performance can be improved by introducing caches at appropriate places, the advantage is limited by the dynamic content present on the WWW. The delay in bringing the required information can be reduced further by prefetching web documents precisely. Web prefetching [14] can be defined as the process of fetching web documents from the web servers even before the user has requested them. However, if predictions are poor, prefetching can greatly increase the load on already overloaded network bandwidth. Researchers have adopted various web prefetching strategies to minimize the user perceived latency. Some popular strategies are (a minimal sketch of the first follows the list):

Popularity based strategies: predictions are made based on the popularity of the web pages.

Semantic prefetching strategies: the content of the web pages is analyzed to make predictions.

Statistic prefetching strategies: predictions are made based on statistics derived from user sessions.
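The following is a minimal sketch of the popularity based strategy only: request counts are accumulated and the k most requested pages become prefetch candidates. The class name and the choice of k are assumptions made for illustration, not the strategy as implemented in any particular system.

import java.util.*;
import java.util.stream.*;

// Popularity-based prefetching sketch: count page requests and
// nominate the top-k most popular pages for prefetching.
public class PopularityPrefetcher {
    private final Map<String, Integer> hits = new HashMap<>();

    public void recordRequest(String page) {
        hits.merge(page, 1, Integer::sum);
    }

    // Pages a proxy might fetch into its cache ahead of demand.
    public List<String> candidates(int k) {
        return hits.entrySet().stream()
                   .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                   .limit(k)
                   .map(Map.Entry::getKey)
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        PopularityPrefetcher p = new PopularityPrefetcher();
        for (String req : new String[] {"/a", "/b", "/a", "/c", "/a", "/b"})
            p.recordRequest(req);
        System.out.println(p.candidates(2)); // [/a, /b]
    }
}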

The next section discusses the various problem areas and provides appropriate solutions.

1.5 PROBLEM IDENTIFICATION

The WWW considered here is the publicly indexable web. The following characteristics of the WWW present researchers with challenges in retrieving and mining information from it.

1. Digging out the relevant information: Search engines use crawlers to fetch pages from the WWW, which they then store and index. The indexed documents are ranked based on their popularity. The problem with current search engines, however, is that they consider only popularity, i.e. the forward and backward links of a page, so documents that are more relevant to the user's query but less popular may be left out. A technique therefore needs to be developed that takes the user's query into account when finding relevant information, since the user's query matters more than the links in the web pages.

Solution: This problem is solved by introducing a mechanism that considers not only the popularity of the web pages but also the much needed relevancy to the user's submitted query. The proposed mechanism in fact improves upon Google's PageRank method.
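The actual algorithm is developed in Chapter 3; the sketch below merely illustrates the underlying idea of weighting popularity against query relevancy. The weight ALPHA and the keyword-overlap relevance measure are assumptions made for the example.

import java.util.*;

// Illustrative sketch (not the thesis algorithm itself) of combining
// link-based popularity with query relevancy: a weighted sum of a
// precomputed PageRank value and a simple keyword-overlap score.
public class RelevancyPopularityRanker {
    // ALPHA balances popularity against relevancy; 0.5 is an assumption.
    private static final double ALPHA = 0.5;

    // Fraction of the query's terms that appear in the document.
    static double relevance(String query, String doc) {
        Set<String> docTerms = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\W+")));
        String[] q = query.toLowerCase().split("\\W+");
        long matched = Arrays.stream(q).filter(docTerms::contains).count();
        return (double) matched / q.length;
    }

    static double score(double pageRank, String query, String doc) {
        return ALPHA * pageRank + (1 - ALPHA) * relevance(query, doc);
    }

    public static void main(String[] args) {
        String query = "web prefetching latency";
        // A popular but off-topic page can be outscored by a relevant one.
        double popularOffTopic = score(0.9, query, "social network photo sharing");
        double lessPopularOnTopic = score(0.3, query, "web prefetching reduces user perceived latency");
        System.out.println(popularOffTopic + " vs " + lessPopularOnTopic); // 0.45 vs 0.65
    }
}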

2. High user perceived latency: The delay perceived by users in retrieving web objects, i.e. the time between the user submitting the query and receiving the results, is known as user perceived latency. As the size of the WWW increases, this delay increases, and users experience long waits for their requests over the web. Hence the need for a technique that can effectively reduce this latency.

Solution: In this thesis, a framework named Predictive Prefetching Engine (PPE) has been introduced at the search engine side and the proxy side, which reduces the user perceived latency by prefetching relevant web pages based on users' past browsing behaviour. An important feature of this framework is that it adds the least possible burden to the network bandwidth, since it makes its predictions from rules that are generated dynamically depending on the size of the database.

3. Lack of personalization of the WWW: With the exponential growth of the WWW and its users, it becomes very difficult to retrieve the information sought by particular groups of users. For example, employees of an organization may need similar types of information. A mechanism is therefore needed that can personalize the contents of the WWW according to groups of users.

Solution: A mechanism has been proposed that works on different groups of users in order to provide personalized information. This is done by designing an agent based mechanism that activates the agents for different groups after identifying the incoming user's request from a particular IP address.

4. Information overkill: Even with a prefetching mechanism that aims to reduce the user perceived latency, unsuccessful predictions made when prefetching pages result in information overkill. Thus a mechanism is required that makes credible predictions for only those pages that are most relevant, minimizing the problem of information overkill.

Solution: To minimize information overkill, k-order Markov predictors have been used in the proposed work. These predictors generate rules that are refined with each increasing level of k, producing progressively more relevant predictions of web pages (a minimal sketch follows).
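The following minimal sketch shows the mechanics of a k-order Markov predictor on page request sequences: the last k pages of a session form the state, and the page most frequently observed after that state is predicted next. Names and sessions are illustrative; the rule generation and refinement used in the thesis are detailed in Chapter 4.

import java.util.*;

// Minimal k-order Markov predictor: counts which page follows each
// sequence of k consecutive pages, then predicts the most frequent one.
public class MarkovPredictor {
    private final int k;
    private final Map<List<String>, Map<String, Integer>> counts = new HashMap<>();

    MarkovPredictor(int k) { this.k = k; }

    // Learn transitions from one preprocessed user session.
    void train(List<String> session) {
        for (int i = 0; i + k < session.size(); i++) {
            List<String> state = List.copyOf(session.subList(i, i + k));
            counts.computeIfAbsent(state, s -> new HashMap<>())
                  .merge(session.get(i + k), 1, Integer::sum);
        }
    }

    // Predict the most likely next page after the given k recent pages.
    Optional<String> predict(List<String> lastK) {
        Map<String, Integer> next = counts.get(lastK);
        if (next == null) return Optional.empty();
        return next.entrySet().stream()
                   .max(Map.Entry.comparingByValue())
                   .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        MarkovPredictor p = new MarkovPredictor(2);
        p.train(List.of("/home", "/search", "/results", "/doc1"));
        p.train(List.of("/home", "/search", "/results", "/doc1"));
        p.train(List.of("/about", "/search", "/results", "/doc2"));
        System.out.println(p.predict(List.of("/search", "/results"))); // Optional[/doc1]
    }
}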

5. Huge size: The web's growth has been, and continues to be, exponential: the number of its users increased from 16 million in 1995 to approximately 2 billion in 2010 [15]. This huge size has transformed the WWW into a vast repository of knowledge in which highly diverse information is linked in an extremely complex manner. Still, the WWW exhibits a particular order: hyperlinks form a web-like structure, so that when a user surfs a web site, the various documents are well arranged through internal hyperlinks. Moreover, meta information about the web documents is stored in the web logs, providing an inherent order among them. This ordering can be exploited to mine the desired information from the WWW.

Solution: Various data mining techniques, e.g. association rules, sequential patterns and clustering, have been applied in the proposed work to the repository of raw knowledge stored in the various logs at the proxy and server side.

1.6 ORGANIZATION OF THESIS

This thesis focuses on web prefetching, encompassing web mining in general and web usage mining in particular. Various algorithms have been proposed for designing an effective web prefetching mechanism. The aim of this work has been to design an effective prefetching technique that can predict web pages even before users ask for them, making and changing predictions dynamically depending on the database. The thesis is divided into six main chapters, as listed below:

Chapter 2 provides an overview of the WWW and of the search engines that users employ to search this publicly indexable sea of web documents. It also reviews the literature on the role of web mining, its application areas, web prefetching techniques and the various strategies followed for prefetching documents. The chapter thus provides the backdrop of existing work and identifies the challenging areas that need consideration.

Chapter 3 focuses on the lack of relevant pages returned to users by general search engines, which arises because page rank mechanisms give more importance to the popularity of documents than to their relevancy. It addresses this issue with a mechanism that considers not only the individual keywords of the user query but also the associations of those keywords within the documents, and proposes a novel algorithm that gives due weightage to both the relevancy and the popularity of web pages.

Chapter 4 proposes the framework for the Predictive Prefetching Engine (PPE). The framework is introduced both at the search engine side, where it is known as the Search engine side Predictive Prefetching Engine (SPPE), and at the proxy side, where it is known as the Proxy side Predictive Prefetching Engine (PPPE).

This framework carries out its task of making credible predictions for prefetching web pages in three phases. The first phase introduces a novel approach for clustering the user sessions obtained after preprocessing the user transactions present in the server/proxy logs. The second phase applies k-order Markov predictors to determine the rules that govern the predictions for prefetching web documents. The third phase, the Rule Activator, uses an agent based approach to fire the right set of rules, thus prefetching the web documents the user is likely to request.

Chapter 5 presents the implementation details and the analysis of the PPE. The PPE has been implemented in Java using Eclipse IDE for Java Developers version 1.2.1. The chapter also verifies the rules formed for the prediction of web pages using Zipf's Law, thereby establishing the accuracy of the rules formed.

Chapter 6 concludes the work, highlighting the major achievements and exploring possibilities for future research in this area.

Appendix A briefly explains Zipf's Law.

Bibliography includes references to publications in this area.
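For reference, Zipf's Law in its commonly used form for web page popularity (this is the general textbook statement; Appendix A gives the treatment actually applied in Chapter 5):

    f(r) = C / r^s

where r is the popularity rank of a page, f(r) is its request frequency, C is a normalizing constant, and s is an exponent often taken to be close to 1 for web request streams.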