INTRODUCTION. Chapter GENERAL


Blaze Reynolds

Chapter 1 INTRODUCTION

1.1 GENERAL

The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which people communicate with each other as well as with machines. Since its inception in 1989, it has connected people from all walks of life, anywhere in the world. As a result, the Internet has become a global village [2], growing from 16 million users in December 1995 to 2,095 million in March 2011 [3]. This ever-increasing interest in the information spread across the WWW has driven the development of a closely related field, Information Retrieval.

Information Retrieval (IR) [4, 5] is the area concerned with retrieving information about a subject from a collection of data objects. IR differs from Data Retrieval, which, in the context of documents, consists mainly of determining which documents in the collection contain the keywords of a user query. IR instead deals with finding the information the user needs.

The WWW has distinctive properties: it is extremely complex, massive in size, and highly dynamic. Owing to these properties, searching and retrieving information from the Web efficiently and effectively is a challenging task, especially when the goal is to realize its full potential. With powerful workstations and parallel processing technology, efficiency is not a bottleneck; some existing search tools sift through gigabyte-size precompiled web indexes in a fraction of a second. Effective retrieval of information, however, is still a developing research area.

Current search tools retrieve too many documents, of which only a small fraction are relevant to the user query. This is where information retrieval search engines have to focus. In general, search engines allow users to search for information by submitting queries in the form of keywords in the search interface. The search engines in turn retrieve links to the relevant information. A brief overview of the search engine process is given in the next section.

1.2 SEARCH ENGINE: AN INFORMATION RETRIEVAL TOOL

In general, search engines [6] allow users to search documents by submitting queries in the form of keywords in the search interface. The search engines in turn retrieve the links to relevant documents. Broadly, the working of the search engine components can be divided into two modules, as shown in Fig 1.1: the Query Independent Module and the Query Dependent Module.

[Figure labels: Search Engine; Knowledge Base; Query Dependent Module; Crawling; Indexing; Query Processor; Ranking]
Fig 1.1 Classification of Search Engine

As can be seen from Fig 1.1, at the operational level search engines [7] comprise the following four major components:

Crawler
Indexer
Query Processor
Ranking

A brief discussion of each component follows:

1. Crawler: It is an automated web browser [8] that follows every hyperlink on the various web sites of the WWW to retrieve web pages. These web pages are stored in databases at the search engine's side. The crawler is therefore a query independent module. The contents of each stored page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called metatags). Meta information about the web pages is stored in an index for later queries.

2. Indexer: It is also a query independent module. Search engine indexing [10] is done after the web crawler has stored documents in the search engine's database. The indexer analyzes the documents to extract important terms and builds an index that supports fast retrieval of documents against user queries.

3. Query Processor: It is a query dependent module [11] which uses the search engine's index to consult the database for retrieval of related documents. The searching process also maintains log files, which help optimize both the indexer and the crawler by providing information about the active set of pages that are actually seen and sought by users.

4. Ranking: It is a query dependent process [12]. The active sets of pages are ranked according to a ranking strategy before being presented to the users. Since different search engines follow different strategies, the search process may lead to different active sets of pages.

Search engines are continuously evolving to improve the effectiveness of the active sets of web pages returned to users for their submitted queries. Even with complex algorithms and strategies in both the query independent and query dependent modules, the user is still presented with a huge list of information for a submitted query. To tackle this problem of information overkill, the results presented to the user must be refined, which makes it necessary to develop automated tools at the server side or the client side. The following sections highlight the role of web mining and web prefetching in retrieving the documents relevant to the user.

1.3 ROLE OF WEB MINING

With the constant increase in the amount of information on the WWW, providing relevant information to the user in the least amount of time has become a challenging task. However, the user perceived latency* can be effectively reduced by employing web mining techniques in IR. Web mining [13] is the automatic retrieval, extraction and evaluation of information, in the form of interesting patterns, for knowledge discovery from web documents using data mining techniques. In the web mining domain, data is of prime importance, since the quality of the mined information depends on it. Three types of web data can be mined: content, usage, and structure. Content mining covers text and multimedia. Usage mining covers web logs, including search logs and other usage data. Structure mining analyzes the link structure of the Web. Together, the three types of web data determine the depth of the web mining domain.

Web mining is consistently being improved to reduce user perceived latency. A critical look at the available literature indicates that the remedy for this wait time is to prefetch web documents with the help of suitable prediction techniques.

* User perceived latency is the delay from the time a request is issued until the response is received.
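Web usage mining generally begins by grouping raw log entries into user sessions. The following is a minimal sketch of that step under stated assumptions, not the preprocessing method of this thesis: the `Sessionizer` and `Hit` names, the use of the IP address as the user identifier, and the inactivity timeout are all hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that groups log hits into sessions (illustrative only). */
final class Sessionizer {

    /** One assumed log record: requesting IP, time in epoch seconds, requested page. */
    static final class Hit {
        final String ip;
        final long time;
        final String page;

        Hit(String ip, long time, String page) {
            this.ip = ip;
            this.time = time;
            this.page = page;
        }
    }

    /**
     * Splits hits (assumed sorted by time) into per-IP sessions.
     * A gap longer than timeoutSeconds between two hits of the same IP
     * starts a new session. Sessions are returned in order of their first hit.
     */
    static List<List<String>> sessions(List<Hit> hits, long timeoutSeconds) {
        Map<String, List<String>> open = new HashMap<>();   // current session per IP
        Map<String, Long> lastSeen = new HashMap<>();       // time of last hit per IP
        List<List<String>> result = new ArrayList<>();
        for (Hit h : hits) {
            List<String> current = open.get(h.ip);
            Long last = lastSeen.get(h.ip);
            if (current == null || h.time - last > timeoutSeconds) {
                current = new ArrayList<>();
                open.put(h.ip, current);
                result.add(current);
            }
            current.add(h.page);
            lastSeen.put(h.ip, h.time);
        }
        return result;
    }
}
```

Each resulting session is an ordered list of page identifiers, which is the kind of input that clustering and prediction steps can consume. Identifying users by IP address alone is a simplification, since several users may share one address behind a proxy.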

1.4 ROLE OF WEB PREFETCHING

Although web performance can be improved by introducing caches at appropriate places, the advantage is limited by the dynamic content present on the WWW. The delay in bringing the required information can be reduced further by prefetching web documents precisely. Web prefetching [14] can be defined as the process of fetching web documents from the web servers even before they have been requested by the user. However, if the predictions are poor, prefetching can add considerable overhead to already overloaded network bandwidth.

Researchers have adopted various web prefetching strategies to minimize the user perceived latency. Some of the popular strategies are:

Popularity based strategies: Predictions are made based on the popularity of the web pages.
Semantic prefetching strategies: The content of the web pages is analyzed to make predictions.
Statistic prefetching strategies: Predictions are based on statistics formed from the user sessions.

The next section discusses the various problem areas and the solutions proposed for them.

1.5 PROBLEM IDENTIFICATION

The WWW is a publicly indexable web. The following characteristics of the WWW present researchers with challenges in retrieving and mining information from it.

1. To dig the relevant information: Search engines use crawlers to fetch pages from the WWW, which they then store and index. These indexed documents are ranked based on their popularity. The problem with current search engines is that they consider only popularity, i.e. the forward and backward links of the pages. As a result, documents that are more relevant to the user's query but less popular may be left out. Thus, a technique needs to be developed that takes the user's query into account in order to find more relevant information, since the user's query is more important than the links in the web pages.

Solution: This problem is solved by introducing a mechanism that considers not only the popularity of the web pages but also, significantly, their relevancy to the user's submitted query. The proposed mechanism has in fact improved over Google's PageRank method.

2. High User Perceived Latency: The delay in response perceived by users in retrieving web objects (i.e. the time between the user submitting a query and receiving the results for it) is known as User Perceived Latency. This delay increases with the size of the WWW. As a result, users experience long waits for their requests over the web. Hence there is a need to develop a technique that can effectively reduce this latency.

Solution: In this thesis, a framework named Predictive Prefetching Engine (PPE) has been introduced at the search engine side and the proxy side. It reduces the user's perceived latency by prefetching the relevant web pages based on the users' past browsing experience. An important feature of this framework is that it adds little additional burden on network bandwidth, because its predictions are based on rules that are generated dynamically depending on the size of the database.

3. Lack of personalization of WWW: With the exponential growth of the WWW and its users, it becomes very difficult to retrieve the information that is sought by particular groups of users. For example, employees of an organization may need similar types of information. Therefore, a need is strongly felt for a mechanism that can personalize the contents of the WWW according to groups of users.

Solution: To solve this problem, a mechanism has been proposed that works on different groups of users in order to provide personalized information. This is done by designing an agent based mechanism that activates the agents for different groups after identifying the incoming user's request from a particular IP address.

4. Information Overkill: Even with a prefetching mechanism that aims to reduce the user perceived latency, unsuccessful predictions made to prefetch pages may result in information overkill. Thus, a mechanism is required that makes credible predictions for only those pages that are more relevant, i.e. makes correct predictions to minimize the problem of information overkill.

Solution: To minimize this problem of information overkill, k-order Markov Predictors have been used in the proposed work. These predictors generate rules that are refined with each increasing level of k, thus generating more and more relevant predictions of web pages.

5. Huge size: The rate of the web's growth has been and continues to be exponential. The number of its users increased from 16 million in 1995 to approximately 2 billion in 2010 [15]. This huge size has transformed the WWW into a huge repository of knowledge in which highly diverse information is linked in an extremely complex manner. Still, the WWW shows a particular order in the sense that it follows a web-like structure of hyperlinks, i.e. when a web user surfs a web site, the various documents are well arranged through internal hyperlinks. Moreover, the meta information about the web documents is stored in the web logs, providing an inherent order among them. This ordering can therefore be exploited to mine the desired information from the WWW.

Solution: To solve the above problem, various data mining techniques, e.g. association rules, sequential patterns and clustering, have been applied in the proposed work on the repository of raw knowledge stored in various logs at the proxy and server side.

1.6 ORGANIZATION OF THESIS

This thesis focuses on web prefetching, encompassing web mining in general and web usage mining in particular. Within this framework, various algorithms have been proposed for designing an effective web prefetching mechanism. The aim of this work has been to design an effective prefetching technique that can predict web pages even before the users have asked for them, and that makes and changes its predictions dynamically depending on the database. The thesis has been divided into six main chapters; the remaining chapters are outlined below:

Chapter 2 provides an overview of the WWW and of the search engines that users rely on to search for information in this publicly indexable sea of web documents. It also reviews the literature on the role of web mining, its application areas, web prefetching techniques and the various strategies followed for prefetching documents. This chapter also provides the backdrop of the existing work and the challenging areas that need consideration.

Chapter 3 focuses on the lack of relevant pages returned to the user by general search engines for their submitted queries. This happens because the search engine's page rank mechanism gives more importance to the popularity of the documents than to their relevancy. The chapter addresses this issue by introducing a mechanism that considers not only the individual keywords of the user query but also the associations of those keywords within the documents. It proposes a novel algorithm that gives due weightage to both the relevancy and the popularity of the web pages.

Chapter 4 proposes the framework for the Predictive Prefetching Engine (PPE). This framework has been introduced at the search engine's side, where it is known as the Search engine side Predictive Prefetching Engine (SPPE), as well as at the proxy side, where it is known as the Proxy side Predictive Prefetching Engine (PPPE). The framework makes credible predictions for prefetching web pages in three phases. The first phase introduces a novel approach for clustering the user sessions obtained after preprocessing the user transactions present in the server/proxy logs. The second phase applies k-order Markov Predictors to determine the rules that govern the predictions for prefetching the web documents. The third phase is the Rule Activator phase, which uses an agent based approach to fire the right set of rules, thus prefetching the web documents likely to be used by the user.

Chapter 5 presents the implementation details and the analysis of PPE. The PPE has been implemented in Java using Eclipse IDE for Java Developers, version [...]. The chapter also verifies the rules formed for the prediction of web pages using Zipf's Law, thereby confirming the accuracy of the rules formed.

Chapter 6 concludes the outcome of the work. Major achievements are highlighted in this chapter. It also explores the possibilities for future research work in this area.

Appendix A briefly explains Zipf's Law.

Bibliography includes references to publications in this area.
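The second phase of the PPE relies on k-order Markov Predictors. As a rough illustration of the general idea only, and not of the rule generation proposed in this thesis, the sketch below counts which page most often follows each sequence of k consecutive pages in the training sessions. The class name `KOrderMarkovPredictor` and its methods are hypothetical, and sessions are assumed to be ordered lists of page identifiers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical k-order Markov predictor over page-visit sessions (illustrative only). */
final class KOrderMarkovPredictor {
    private final int k;
    // context (the last k pages) -> next page -> number of times it followed
    private final Map<List<String>, Map<String, Integer>> counts = new HashMap<>();

    KOrderMarkovPredictor(int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be >= 1");
        }
        this.k = k;
    }

    /** Counts every (k-page context, next page) pair found in one session. */
    void train(List<String> session) {
        for (int i = 0; i + k < session.size(); i++) {
            List<String> context = new ArrayList<>(session.subList(i, i + k));
            String next = session.get(i + k);
            counts.computeIfAbsent(context, c -> new HashMap<>())
                  .merge(next, 1, Integer::sum);
        }
    }

    /** Returns the most frequent successor of the last k pages, if that context was seen. */
    Optional<String> predict(List<String> recentPages) {
        if (recentPages.size() < k) {
            return Optional.empty();
        }
        List<String> context = new ArrayList<>(
                recentPages.subList(recentPages.size() - k, recentPages.size()));
        Map<String, Integer> successors = counts.get(context);
        if (successors == null) {
            return Optional.empty();
        }
        return successors.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

A larger k makes each context more specific, so fewer contexts match but the matching ones are more selective. A prefetcher built on such a predictor would request the predicted page only when a context has been seen; the sketch leaves ties between equally frequent successors unresolved.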

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

3.1 INTRODUCTION

Generally, web pages are retrieved with the help of search engines, which deploy crawlers to download them. Given a query,