A Random Walk Web Crawler with Orthogonally Coupled Heuristics

Andrew Walker and Michael P. Evans
School of Mathematics, Kingston University, Kingston-Upon-Thames, Surrey, UK
Applied Informatics and Semiotics Group, University of Reading, Reading, UK
M.Evans@computer.org

Abstract

We present a novel web crawling agent, the heuristics of which are orthogonally coupled to the enemy generation strategy of a computer game. Input from the player of the computer game provides genuine randomness to the web crawling agent, effectively driving a random walk across the web. The random walk, in turn, uses the structure of the Web Graph upon which it crawls to generate a stochastic process, which is used in the enemy generation strategy of the game to add novelty. We show the effectiveness of such unorthodox coupling to both the playability of the game and the heuristics of the web crawler, and present the results of the sample of web pages collected.

Keywords: web crawler, heuristics, computer game, Web Graph, power law

1. Introduction

Web crawlers rely on heuristics to perform their function of crawling the web. A web crawler is a software agent that downloads web pages referenced by URLs parsed from previously crawled web pages. Web crawlers that sample the web employ random walk heuristics to determine which page to crawl next. However, randomness is not an easy process for a computer to generate, and the random walk can be further compromised by the structure of the web conspiring to introduce further bias (Bar-Yossef et al., 2000).

Computer games, too, must rely on heuristics when controlling the enemy (or enemies) a player must fight. Getting the heuristics right is critical to the playability of the game, as they determine the challenge it poses. If the heuristics are too simple, the game will be easily conquered; too difficult, and the player will swiftly become fed up at her lack of progress, and will stop playing. In addition, heuristics need to add variety to maintain the interest of the player.

We noticed a commonality between both applications that we could exploit using orthogonal coupling. This is a method in which two components or applications that normally operate independently from one another are coupled to improve their overall efficiency or effectiveness.

As such, we coupled the heuristics of a web crawler with the enemy generation strategy of a computer game. Such unorthodox coupling benefits both applications by introducing true randomness into the web crawler's random walk, and a stochastic enemy generation process into the computer game that depends upon the results returned from the web crawler, and is thus different every time. In short, the coupling improves the effectiveness of both applications.

The paper proceeds as follows. Firstly, we discuss the principle behind the orthogonal coupling of a computer game and a web crawler, before moving on to present the architecture of AlienBot: a web crawler coupled to a computer game. Section 4 presents the results of our design, validating the heuristics used by the AlienBot web crawl, and also revealing some interesting statistics gained from a crawl of 160,000 web pages. Finally, the paper concludes by discussing some of the issues we faced, and some suggestions for further work.

2. Orthogonally Coupling Computer Game and Web Crawling Heuristics

2.1 Computer Game Heuristics

A simple example of a computer game's heuristics can be seen in a shoot-em-up game, in which the player controls a lone spacecraft (usually at the bottom of the screen) and attempts to defeat a number of enemy spacecraft that descend upon him. Early games, such as Namco's 1979 hit Galaxians, relied on heuristics in which the enemies would initially start at the top of the screen, but would intermittently swoop down towards the player in batches of two or three. Modern shoot-em-ups, such as Treasure's 2003 hit Ikaruga, are more sophisticated and offer superb visuals, but still rely on a fixed set of heuristics, in which the enemies attack in a standard pattern that can be discerned and predicted after only a few plays. As such, we recognized that such games could be made more enjoyable by introducing a stochastic process into the enemy's heuristics, thus generating a measure of surprise and making the game subtly different each time.

2.2 Randomly Walking the Web Graph

We found that such a process could be obtained by performing a random walk across the Web Graph. This is the (directed or undirected) graph G = (V, E), where V = {v1, v2, ..., vn} is the set of vertices representing web pages, and E is the collection of edges representing the hyperlinks (or links) that connect the web pages. Thus, G represents the structure of the web in terms of its pages and links. A random walk across G is therefore a stochastic process that iteratively visits the vertices of the graph G (Bar-Yossef et al., 2000).

However, as various experiments have discovered, the Web Graph G has unusual properties that introduce considerable bias into a traditional random walk. A random walk across G should generate a finite-state, discrete-time Markov chain, in which the next state depends only on the current state: the probability of reaching the next web page should be independent of the previously visited pages, given the current page (Baldi et al., 2003). However, G is neither undirected nor regular, and a straightforward walk will have a heavy bias towards pages with high in-degree (i.e. many links pointing to them) (Bar-Yossef et al., 2000). This leads to a dependence between pages, in which a page on the walk affects the probability that another page is visited. In particular, some pages that are close to one another may be repeatedly visited in quick succession due to the nature of the links between them and any intermediate pages (Henzinger et al., 2000).
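To make this bias concrete, the short Python sketch below (an illustration added here, not part of the original paper) simulates a uniform random walk on a small, hypothetical directed graph in which one page is linked to by every other page. The graph, the page names and the resulting visit counts are illustrative assumptions only.

    import random
    from collections import Counter

    # Hypothetical toy Web Graph (adjacency list of out-links). Every page links
    # to "hub", so "hub" has the highest in-degree in the graph.
    graph = {
        "a": ["hub", "b"],
        "b": ["hub", "c"],
        "c": ["hub", "d"],
        "d": ["hub", "a"],
        "hub": ["a", "b", "c", "d"],
    }

    def random_walk(graph, start, steps, seed=0):
        """Follow a uniformly random out-link at each step and count page visits."""
        rng = random.Random(seed)
        visits = Counter()
        page = start
        for _ in range(steps):
            visits[page] += 1
            page = rng.choice(graph[page])  # next page chosen uniformly from the out-links
        return visits

    counts = random_walk(graph, start="a", steps=100_000)
    for page, count in counts.most_common():
        print(page, count)
    # "hub" is visited roughly twice as often as any other page, i.e. the walk's
    # sample is biased towards the high in-degree page rather than being near-uniform.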

2.3 Orthogonally Coupling the Computer Game to the Web Graph

As described, a random walk across the web is a stochastic process that can contain discernible patterns. Although unwelcome for sampling the web, such a process is ideal for our computer game. In addition, the Web Graph is constantly evolving, with nodes being created and deleted all the time (Baldi et al., 2003). As such, the dynamics of the structure are such that novelty is virtually guaranteed. This is the reason we chose to couple the two applications.

We achieved this through the use of a web crawler, which performs the required random walk by parsing a web page for hyperlinks and following one at random. We coupled the two applications by mapping each hyperlink parsed by the crawler to the generation of one enemy in our computer game. In this way, the exact number of enemies that appear cannot be predicted in advance, but patterns may be discerned, as the sampling bias inherent within the Web Graph is reflected in the pattern of enemies generated from it.

Furthermore, we couple the two applications tighter still by making each enemy represent its associated hyperlink, and sending this link back to the crawler when the enemy is shot. In this way, the choice of enemy shot by the player determines the next web page crawled by the crawler, as the enemy represents the hyperlink. As each enemy is indistinguishable from its neighbour, the player has no reason to shoot one over another, and thus implicitly adds true randomness to the URL selection process. The player therefore blindly selects the hyperlink on behalf of the crawler, and so drives the random walk blindly across the web while she plays the game. The web crawler and the computer game, traditionally seen as orthogonal applications, are therefore tightly coupled to one another.

3. AlienBot: A Novel Design for a Web Crawler

3.1 A General Overview of AlienBot

Our crawler, called AlienBot, comprises two major components: a client-side game produced using Macromedia Flash, and a server-side script written in PHP. The client-side program handles the interaction with the user, whereas the server-side program is responsible for the bulk of the crawling work (see Section 3.2).

AlienBot is based on the game Galaxians. It runs in the user's web browser (see Figure 1), and interacts with the server-side script through HTTP requests. In the game, URLs (i.e. hyperlinks) are associated with aliens in a one-to-one mapping. When an alien is shot by the user, its physical representation on screen is removed and the URL it represents is added to a queue of URLs within the client. The client works through this list one URL at a time, on a first-in, first-out basis, sending each URL to the server (Figure 2).

Figure 1 - AlienBot
Figure 2 - Overview of the AlienBot Process
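The shape of this client-side coupling can be sketched as follows. This is an illustrative Python outline only: the actual AlienBot client is a Macromedia Flash game, and every class and method name used here (Alien, AlienBotClient, on_alien_shot, and so on) is hypothetical.

    from collections import deque
    from dataclasses import dataclass

    @dataclass
    class Alien:
        """One on-screen enemy, mapped one-to-one to a hyperlink."""
        url: str

    class AlienBotClient:
        """Sketch of the client-side coupling between the game and the crawler."""

        def __init__(self, seed_urls):
            self.aliens = [Alien(url) for url in seed_urls]  # enemies currently on screen
            self.shot_queue = deque()                        # URLs awaiting dispatch (FIFO)

        def on_alien_shot(self, alien):
            # Remove the enemy and remember the hyperlink it represented.
            self.aliens.remove(alien)
            self.shot_queue.append(alien.url)

        def next_url_for_server(self):
            # URLs are sent to the server one at a time, first in, first out.
            return self.shot_queue.popleft() if self.shot_queue else None

        def on_server_response(self, urls):
            # Each URL returned from the crawled page becomes one new enemy.
            self.aliens.extend(Alien(url) for url in urls)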

Upon receiving a new URL, the server retrieves the referenced page and parses it for new links. Meanwhile, the client listens for a response from the server, which is sent once the server-side script has finished processing the page. The response sent back from the server consists of a list of one or more URLs retrieved from the hyperlinks on the page being searched. The game can then create new aliens to represent the URLs it has received.

3.2 Processing a New Web Page

When the client sends a URL to the server indicating that an alien has been shot, the server performs the following operations (see Figure 3):

Step 1
- Download the page referenced by the URL supplied by the client.
- Search the page for links and other information about the page.
- URL resolver: resolve local addresses used in links to absolute URLs, i.e. convert links to the form http://www.hostname.com/path/file.html.
- Remove links deemed to be of no use (e.g. of an undesired file type, such as exe files).
- Database checks: check the database to see whether the page has already been crawled.

Step 2
- Record in the database all URLs found on the page.
- Select a random sample (where the randomness is generated by a random number generator) of the resolved URLs to send back to the client. (Note: this step is performed to prevent too many links being sent back to the client, as the user can only manage so many enemies at once! Hence AlienBot only returns a random subset of the links found, with the number returned calculated as R = (N mod 5) + 1, where R is the number of links sent back to the game and N is the number of resolved links on the page that remain after the database checks have been made.)

Step 3
- If there are no links of use (i.e. N = 0 in the previous step), a random jump is performed.
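A minimal sketch of these server-side steps is given below. It is illustrative only: the authors' implementation is a PHP script, whereas this outline uses Python, with requests and BeautifulSoup standing in for the page download and link parsing, and already_crawled / record_urls as stubs for the database checks. The unwanted-extension list and the random-jump pool are assumptions, and the subset size R = (N mod 5) + 1 is capped at N for pages with fewer than five usable links, a case the paper does not spell out.

    import random
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    UNWANTED_EXTENSIONS = (".exe", ".zip", ".jpg", ".gif")  # illustrative, not the authors' list

    def already_crawled(url):
        """Stub for the database check described in Step 1."""
        return False

    def record_urls(urls):
        """Stub for Step 2's 'record all URLs found on the page'."""
        pass

    def process_shot_url(url, random_jump_urls):
        """Rough outline of Steps 1-3 for one URL sent by the game client."""
        # Step 1: download the page and parse it for links.
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        resolved = {urljoin(url, a["href"]) for a in soup.find_all("a", href=True)}

        # Step 1 (continued): keep absolute http(s) URLs, drop unwanted file types
        # and pages the database says have already been crawled.
        candidates = [u for u in resolved
                      if urlparse(u).scheme in ("http", "https")
                      and not u.lower().endswith(UNWANTED_EXTENSIONS)
                      and not already_crawled(u)]

        # Step 2: record everything, then return a random subset of size R = (N mod 5) + 1.
        record_urls(candidates)
        n = len(candidates)
        if n == 0:
            # Step 3: no usable links, so perform a random jump instead (the paper does
            # not specify the jump target; here we assume a pool of previously seen URLs).
            return [random.choice(random_jump_urls)]
        r = min(n, (n % 5) + 1)  # capped at n; the paper leaves the small-page case unstated
        return random.sample(candidates, r)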

Each URL sent to the client is used to represent an enemy. When the player shoots an enemy, its associated link is returned to the server, where its page is crawled, and the process begins again. Thus, AlienBot selects its URLs using a combination of programmatic and user interaction.

Figure 3 - AlienBot's Random Walk

4. Results

4.1 Analyzing the Web Crawler's Performance

In order to test the web crawler, we ran it between 28/04/2003 and 29/07/2003. In all, some 160,000 URLs were collected. After the testing process was complete, we analyzed the web pages referenced by these URLs, and used the statistics obtained to compare the results generated by AlienBot with those of other web crawlers.

Figures 4a and 4b show the distribution of out-links (i.e. links on a web page that reference another web page) across the different web pages crawled by AlienBot, and give a good indication of the underlying structure of the Web Graph. Both results clearly show the power law that exists in the Web Graph, and compare well with similar results by Broder et al. (2000), Henzinger et al. (2000), Barabási and Bonabeau (2003), and Adamic and Huberman (2001). In particular, the line of best fit for Figure 4b reveals an exponent of 3.28, which compares well with Kumar et al.'s (2000) value of 2.72, obtained with a crawl of some 200 million web pages. These results therefore validate the effectiveness of our web crawling heuristics in accurately traversing the Web Graph.

Figure 4a - Distribution of Out-Links (out-link number vs. percentage of pages)
Figure 4b - Log-Log Plot of the Out-Link Distribution, revealing the Power Law (number of links vs. number of pages, both on log scales)
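For readers wishing to reproduce this kind of analysis, the sketch below (not taken from the paper) shows one common way to estimate a power-law exponent: fit a straight line to the log-log plot of the out-link distribution, as in Figure 4b. The data here is a synthetic placeholder, not the AlienBot crawl data.

    import numpy as np

    def powerlaw_exponent(out_link_counts, bins=20):
        """Estimate a power-law exponent via a least-squares fit on a log-log histogram."""
        counts = np.asarray(out_link_counts)
        hist, edges = np.histogram(counts, bins=bins)
        centres = 0.5 * (edges[:-1] + edges[1:])
        mask = hist > 0                      # empty bins cannot be log-transformed
        slope, intercept = np.polyfit(np.log10(centres[mask]), np.log10(hist[mask]), 1)
        return -slope                        # the exponent is the negated slope of the fit

    if __name__ == "__main__":
        # Placeholder data: out-link counts drawn from a heavy-tailed (Pareto) distribution.
        rng = np.random.default_rng(0)
        sample = (rng.pareto(a=2.3, size=10_000) + 1) * 10
        print(f"fitted exponent = {powerlaw_exponent(sample):.2f}")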

4.2 Analyzing the Game's Performance

The design of the AlienBot architecture introduced no detectable latency into the gameplay from the round-trips to the server, and the unpredictability of the number of aliens to be faced next certainly added to the game's playability. In particular, revealing the URL associated with an alien that the user had just shot added a new and fascinating dimension to the game, as it gave a real sense of the crawling process that the user was inadvertently driving (see Figure 5).

Figure 5 - Randomly Walking the World

From the game's perspective, the power law of the distribution of links crawled by AlienBot added to the surprise factor in terms of the number of enemies generated. Power law distributions are characterized by scale-free properties, in which a high percentage of one variable (e.g. number of links) gives rise to a low percentage of another (e.g. web pages), with the ratio of the percentages remaining (roughly) constant at all scales. Thus, a high number of links will be found on a small number of web pages, which produces the surprise (in terms of enemy generation) at unknown intervals. This was the stochastic process we aimed to capture by using the results from the crawl to drive the game's enemy generation strategy. The validation of the crawler shows that we accurately captured the stochastic process, while the (albeit subjective) playability of the game revealed the benefit of the whole approach.

4.3 Analysis of the URL Sample Set

In addition to validating the crawling heuristics, we also used the data from our URL sample to provide some results from the Web Graph, and to take some web metrics.

4.3.1 Comparison of the AlienBot Data Set with the Google Index

Results from random walks have been used in other studies to compare the coverage of various search engines by calculating the percentage of pages crawled by a random walk that have not been indexed by the search engine (Henzinger et al., 1999). As such, we compared the results of AlienBot with those of Google, and discovered that 36.85% of the pages we located are not indexed by Google, suggesting that Google's coverage represents 63.15% of the publicly available web. This compares with an estimate of 18.6% that Lawrence and Giles (1999) found in 1999. Given that Google's coverage extends to 3,307,998,701 pages (as of November 2003), we estimate the size of the web to be 5,238,319,400 pages, or 6.5 times its estimated size in 1998 (Lawrence and Giles, 1999).

                    Number of pages    % of pages
    In Google                 15296        63.15%
    Not in Google              8927        36.85%

Table 1 - AlienBot Pages Indexed in Google
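The size estimate above follows from dividing Google's reported index size by the coverage fraction implied by Table 1. The snippet below (added here as a check, not part of the paper) reproduces the arithmetic.

    # Reproduce the Section 4.3.1 estimate from the Table 1 counts and the stated index size.
    in_google, not_in_google = 15296, 8927
    coverage = in_google / (in_google + not_in_google)   # ~0.6315, i.e. 63.15%
    google_index = 3_307_998_701                         # Google's reported index size, Nov 2003
    web_size = google_index / round(coverage, 4)         # rounding to 63.15%, as in the paper
    print(f"coverage = {coverage:.2%}, estimated web size = {web_size:,.0f} pages")
    # -> roughly 5.24 billion pages, matching the 5,238,319,400 figure reported above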

                                   Number of pages    % of pages
    JavaScript                               15214        60.86%
    VBScript                                   222         0.89%
    Total pages with scripting               15436        61.74%
    Total pages using Flash                    477         1.91%

Table 2 - Web Page Statistics

4.3.2 Miscellaneous Page Statistics

Of the 160,000 URLs we collected during our random walk, we downloaded 25,000 at random to sample the technologies that are currently prevalent on the web. As can be seen in Table 2, 61.74% of all web pages use a scripting language. Of these pages, 99.11% use JavaScript, and only 0.89% use Microsoft's VBScript. Furthermore, we found that only 1.91% of pages use Macromedia's Flash.

7. Issues and Further Work

Ideally, AlienBot would have gathered a larger sample of documents. The small size of the sample means it has a bias towards pages in the neighbourhood of the seed set of URLs. Furthermore, the way in which AlienBot attempts to filter out undesirable content, using a list of unwanted file extensions, is not ideal. This has allowed some non-HTML pages to make it into the data set, thereby adding some unwanted noise to the results. A future version could make use of HTTP's Content-Type header in order to filter out pages that are not written in some HTML-based mark-up. However, the results from the crawl show that these limitations did not introduce a significant bias into the crawler when compared with results from other studies, particularly considering the relatively small number of web pages in our study.

8. Conclusion

We have presented a novel web crawling agent, the heuristics of which are orthogonally coupled to the enemy generation strategy of a computer game. The computer game adds genuine randomness to the web crawling agent, effectively driving a random walk across the web. The random walk, in turn, generates a stochastic process via the structure of the Web Graph upon which it crawls, thereby introducing novelty to the enemy generation strategy of the game. We have shown the effectiveness of such unorthodox coupling to both the playability of the game and the heuristics of the web crawler, and have presented some of the results of the sample of web pages we collected. We intend to analyze our data further, and to repeat the study with a larger data set.

References

Adamic, L.A. and Huberman, B.A. (2001), The Web's Hidden Order, Communications of the ACM 44(9):55-60, September 2001.

Albert, R., Jeong, H., and Barabási, A.-L. (1999), The Diameter of the World Wide Web, Nature 401:130.

Baldi, P., Frasconi, P., and Smyth, P. (2003), Modeling the Internet and the Web, John Wiley and Sons, England.

Barabási, A., and Bonabeau, E. (2003), Scale-Free Networks, Scientific American 288(5):50-59, May 2003.

Bar-Yossef, Z., Berg, A., Chien, S., Fakcharoenphol, J., and Weitz, D. (2000), Approximating Aggregate Queries about Web Pages via Random Walks, in Proc. 26th International Conference on Very Large Databases, 535-544.

Broder, A.Z., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J.L. (2000), Graph Structure in the Web, Ninth International World Wide Web Conference / Computer Networks 33(1-6):309-320.

Henzinger, M.R., Heydon, A., Mitzenmacher, M., and Najork, M. (1999), Measuring Index Quality Using Random Walks on the Web, in Proc. of the 8th International World Wide Web Conference, Toronto, Canada, pages 213-225, Elsevier Science, May 1999.

Henzinger, M.R., Heydon, A., Mitzenmacher, M., and Najork, M. (2000), On Near-Uniform URL Sampling, in Proc. of the 9th International World Wide Web Conference, Amsterdam, May 2000.

Kumar, R., Raghavan, P., Rajagopalan, S., Sivakumar, D., Tomkins, A.S., and Upfal, E. (2000), The Web as a Graph, in Proc. 19th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS).

Lawrence, S. and Giles, C.L. (1999), Accessibility of Information on the Web, Nature 400, 8 July 1999.