A Random Walk Web Crawler with Orthogonally Coupled Heuristics


Andrew Walker and Michael P. Evans
School of Mathematics, Kingston University, Kingston-Upon-Thames, Surrey, UK
Applied Informatics and Semiotics Group, University of Reading, Reading, UK

Abstract

We present a novel web crawling agent whose heuristics are orthogonally coupled to the enemy generation strategy of a computer game. Input from the player of the computer game provides genuine randomness to the web crawling agent, effectively driving a random walk across the web. The random walk, in turn, uses the structure of the Web Graph upon which it crawls to generate a stochastic process, which is used in the enemy generation strategy of the game to add novelty. We show the effectiveness of such unorthodox coupling for both the playability of the game and the heuristics of the web crawler, and present the results of the sample of web pages collected.

Keywords: Web crawler, heuristics, computer game, Web Graph, power law

1. Introduction

Web crawlers rely on heuristics to perform their function of crawling the web. A web crawler is a software agent that downloads web pages referenced by URLs parsed from previously crawled web pages. Web crawlers that sample the web employ random walk heuristics to determine which page to crawl next. However, randomness is not easy for a computer to generate, and the random walk can be further compromised by the structure of the web itself, which introduces additional bias (Bar-Yossef et al, 2000). Computer games, too, must rely on heuristics when controlling the enemy (or enemies) a player must fight. Getting the heuristics right is critical to the playability of the game, as they determine the challenge it poses. If the heuristics are too simple, the game will be easily conquered; if they are too difficult, the player will swiftly become frustrated by her lack of progress, and will stop playing.
In addition, heuristics need to add variety to maintain the interest of the player. We noticed a commonality between the two applications that we could exploit using orthogonal coupling. This is a method in which two components or applications that normally operate independently of one another are coupled to improve their overall efficiency or effectiveness.

As such, we coupled the heuristics of a web crawler with the enemy generation strategy of a computer game. Such unorthodox coupling benefits both applications by introducing true randomness to the web crawler's random walk, and a stochastic enemy generation process to the computer game that depends upon the results returned from the web crawler, and is thus different every time. In short, the coupling improves the effectiveness of both applications. The paper proceeds as follows. Firstly, we discuss the principle behind the orthogonal coupling of a computer game and a web crawler, before moving on to present the architecture of AlienBot: a web crawler coupled to a computer game. Section 4 presents the results of our design, validating the heuristics used by the AlienBot web crawler, and also revealing some interesting statistics gained from a crawl of 160,000 web pages. Finally, the paper concludes by discussing some of the issues we faced, and some suggestions for further work.

2. Orthogonally Coupling Computer Game and Web Crawling Heuristics

2.1 Computer Game Heuristics

A simple example of a computer game's heuristics can be seen in a shoot-em-up game, in which the player controls a lone spacecraft (usually at the bottom of the screen) and attempts to defeat a number of enemy spacecraft that descend upon it. Early games, such as Namco's 1979 hit Galaxian, relied on heuristics in which the enemies would initially start at the top of the screen, but would intermittently swoop down towards the player in batches of two or three. Modern shoot-em-ups, such as Treasure's 2003 hit Ikaruga, are more sophisticated and offer superb visuals, but still rely on a fixed set of heuristics, in which the enemies attack in a standard pattern that can be discerned and predicted after only a few plays.
As such, we recognized that such games could be made more enjoyable by introducing a stochastic process into the enemies' heuristics, thus generating a measure of surprise and making the game subtly different each time.

2.2 Randomly Walking the Web Graph

We found that such a process could be obtained by performing a random walk across the Web Graph. This is the (directed or undirected) graph G = (V, E), where V = {v1, v2, ..., vn} is the set of vertices representing web pages, and E is the collection of edges representing the hyperlinks (or links) that connect the web pages. Thus, G represents the structure of the web in terms of its pages and links. A random walk across G is therefore a stochastic process that iteratively visits the vertices of the graph G (Bar-Yossef et al, 2000). However, as various experiments have discovered, the Web Graph G has unusual properties that introduce considerable bias into a traditional random walk. A random walk across G should generate a finite-state, discrete-time Markov chain, in which the probability of reaching the next web page is independent of the previous web page, given the current state (Baldi et al, 2003). However, G is neither undirected nor regular, and a straightforward walk will have a heavy bias towards pages with a high in-degree (i.e. many links pointing to them) (Bar-Yossef et al, 2000). This leads to a dependence between pages, in which a page on the walk affects the probability that another page is visited. In particular, some pages that are close to one another may be repeatedly visited in quick succession, due to the nature of the links between them and any intermediate pages (Henzinger et al, 2000).

2.3 Orthogonally Coupling the Computer Game to the Web Graph

As described, a random walk across the web is a stochastic process that can contain discernible patterns. Although unwelcome when sampling the web, such a process is ideal for our computer game. In addition, the Web Graph is constantly evolving, with nodes being created and deleted all the time (Baldi et al, 2003). The dynamics of the structure are such that novelty is virtually guaranteed. This is why we chose to couple the two applications. We achieved the coupling through a web crawler, which performs the required random walk by parsing a web page for hyperlinks and following one at random. We coupled the two applications by mapping each hyperlink parsed by the crawler to the generation of one enemy in our computer game. In this way, the exact number of enemies that appear cannot be predicted in advance, but patterns may be discerned, as the sampling bias inherent within the Web Graph is reflected in the pattern of enemies generated from it. Furthermore, we couple the two applications tighter still by making each enemy represent its associated hyperlink, and sending this link back to the crawler when the enemy is shot. In this way, the choice of enemy shot by the player determines the next web page crawled. As each enemy is indistinguishable from its neighbours, the player has no reason to shoot one over another, and thus implicitly adds true randomness into the URL selection process.
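The in-degree bias that shapes these enemy patterns can be illustrated with a short simulation. The four-page graph below is hypothetical, not data from the paper; it shows that a page three others point to is visited more often than a uniform sample over pages would visit it:

```python
import random
from collections import Counter

# Toy directed web graph: page -> out-links. "hub" has in-degree 3.
graph = {
    "a": ["hub", "b"],
    "b": ["hub", "c"],
    "c": ["hub", "a"],
    "hub": ["a", "b", "c"],
}

def random_walk(graph, start, steps, rng):
    """Follow a uniformly random out-link at each step, counting visits."""
    visits = Counter()
    page = start
    for _ in range(steps):
        visits[page] += 1
        page = rng.choice(graph[page])
    return visits

visits = random_walk(graph, "a", 10_000, random.Random(42))
# The walk spends roughly a third of its time on "hub", well above the
# uniform share of one quarter, so pages visited this way are not a
# uniform sample of the graph.
```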
The player therefore blindly selects the hyperlink on behalf of the crawler, and thus drives the random walk blindly across the web while she plays the game. The web crawler and the computer game, traditionally seen as orthogonal applications, are thereby tightly coupled to one another.

3. AlienBot: A Novel Design for a Web Crawler

3.1 A General Overview of AlienBot

Our crawler, called AlienBot, comprises two major components: a client-side game produced using Macromedia Flash, and a server-side script written in PHP. The client-side program handles the interaction with the user, whereas the server-side program is responsible for the bulk of the crawling work (see section 3.2). AlienBot is based on the game Galaxian. It runs in the user's web browser (see Figure 1), and interacts with the server-side script through HTTP requests. In the game, URLs (i.e. hyperlinks) are associated with aliens in a one-to-one mapping. When an alien is shot by the user, its physical

representation on screen is removed and the URL it represents is added to a queue of URLs within the client. The client works through this list one at a time, on a first-in, first-out basis, sending each URL to the server (Figure 2).

Figure 1 - AlienBot. Figure 2 - Overview of the AlienBot Process.

Upon receiving a new URL, the server retrieves the referenced page and parses it for new links. Meanwhile, the client listens for a response from the server, which is sent once the server-side script has finished processing the page. The response sent back from the server consists of a list of one or more URLs retrieved from the hyperlinks on the page being searched. The game can then create new aliens to represent the URLs it has received.

3.2 Processing a New Web Page

When the client sends a URL to the server indicating that an alien has been shot, the server performs the following operations (see Figure 3):

Step 1: Download the page referenced by the URL supplied by the client, and search the page for links and other information about it. URL resolver: resolve local addresses used in links to absolute URLs; that is, convert relative links into fully qualified URLs. Remove links deemed to be of no use (e.g. of an undesired file type, such as .exe files). Database checks: check the database to see if the page has already been crawled.

Step 2: Record in the database all URLs found on the page. Select a random sample (where the randomness is generated by a random number generator) of the resolved URLs to send back to the client. (Note: this step is performed to prevent too many links being sent back to the client, as the user can only manage so many

enemies at once! Hence AlienBot only returns a random subset of the links found, with the number returned calculated as R = (N mod 5) + 1, where R is the number of links sent back to the game and N the number of resolved links on the page that remain after the database checks have been made.)

Step 3: If there are no links of use (i.e. N = 0 in the previous step), a random jump is performed.

Each URL sent to the client is used to represent an enemy. When the player shoots an enemy, its associated link is returned to the server, where its page is crawled, and the process begins again. Thus, AlienBot selects its URLs through a combination of programmatic selection and user interaction.

Figure 3 - AlienBot's Random Walk.

4. Results

4.1 Analyzing the Web Crawler's Performance

In order to test the web crawler, we ran it between 28/04/2003 and 29/07/2003. In all, some 160,000 URLs were collected. After the testing process was complete, we analyzed the web pages referenced by these URLs, and used the statistics obtained to compare the results generated by AlienBot with those of other web crawlers.
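The per-page processing in Steps 1-3 can be sketched as follows. This is a minimal illustration in Python rather than the paper's PHP; the extension blacklist and example URLs are hypothetical, and a small guard is added because R = (N mod 5) + 1 can exceed N when a page has very few links:

```python
import random
from urllib.parse import urljoin, urlparse

# Hypothetical blacklist; the paper only names "undesired file types such as exe".
UNDESIRED_EXTENSIONS = {".exe", ".zip", ".jpg", ".gif"}

def resolve_and_filter(base_url, hrefs):
    """Step 1: resolve local addresses to absolute URLs and drop unusable links."""
    resolved = []
    for href in hrefs:
        absolute = urljoin(base_url, href)           # local -> absolute URL
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):   # skip mailto:, javascript:, ...
            continue
        if any(parsed.path.lower().endswith(ext) for ext in UNDESIRED_EXTENSIONS):
            continue
        resolved.append(absolute)
    return resolved

def links_to_return(resolved_links, rng):
    """Steps 2-3: pick the random subset sent back to the game, R = (N mod 5) + 1.
    An empty result signals the caller to perform a random jump instead."""
    n = len(resolved_links)
    if n == 0:
        return []
    r = min((n % 5) + 1, n)   # guard: the formula can exceed N for small N
    return rng.sample(resolved_links, r)

links = resolve_and_filter(
    "http://example.com/dir/page.html",
    ["other.html", "/top.html", "file.exe", "mailto:someone@example.com"],
)
subset = links_to_return(links, random.Random(0))
```

The database bookkeeping (recording found URLs, skipping already-crawled pages) is omitted; only the link-selection logic that drives the game is shown.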

Figures 4a and 4b show the distribution of out-links (i.e. those on a web page that reference another web page) across the different web pages crawled by AlienBot, and give a good indication of the underlying structure of the Web Graph. Both results clearly show the power law that exists in the Web Graph, and compare well with similar results by Broder et al. (2000), Henzinger et al. (2000), Barabási and Bonabeau (2003), and Adamic and Huberman (2001). In particular, the line of best fit for Figure 4b reveals an exponent of 3.28, which compares well with Kumar et al.'s (2000) value of 2.72, obtained with a crawl of some 200 million web pages. These results therefore validate the effectiveness of our web crawling heuristics in accurately traversing the Web Graph.

Figure 4a - Distribution of out-links. Figure 4b - Log-log plot of the out-link distribution, revealing the power law.

4.2 Analyzing the Game's Performance

The design of the AlienBot architecture introduced no detectable latency to the gameplay from the round-trips to the server, and the unpredictability of the number of aliens to be faced next certainly added to the game's playability. In particular, revealing the URL associated with an alien that the user had just shot added a new and fascinating dimension to the game, as it gave a real sense of the crawling process that the user was inadvertently driving (see Figure 5).

Figure 5 - Randomly Walking the World.

From the game's perspective, the power law of the distribution of links crawled by AlienBot added to the surprise factor in terms of the number of enemies generated. Power law distributions are characterized by scale-free properties, in which a high percentage of one variable (e.g. number of links) corresponds to a low percentage of another (e.g. number of web pages), with the ratio of the percentages remaining (roughly) constant at all scales.
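As a sketch of how such an exponent is read off a log-log plot, the following fits a least-squares line to the log of count against the log of degree. The data here are synthetic, following an exact inverse-cube law, not the crawl's results:

```python
import math

def powerlaw_exponent(degree_counts):
    """Least-squares slope of log(count) vs log(degree); the power-law
    exponent is the negated slope of the log-log plot."""
    points = [(math.log(d), math.log(c)) for d, c in degree_counts if d > 0 and c > 0]
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope

# Synthetic histogram following count ~ degree^-3 exactly, so the fit
# recovers an exponent of 3.0.
data = [(d, 1_000_000 / d**3) for d in range(1, 50)]
exponent = powerlaw_exponent(data)
```

On real crawl data the points scatter around the line, so the recovered exponent is an estimate rather than exact as it is here.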
Thus, a high number of links

will be found on a small number of web pages, which produces the surprise (in terms of enemy generation) at unknown intervals. This was the stochastic process we aimed to capture by using the results from the crawl to drive the game's enemy generation strategy. The validation of the crawler shows that we accurately captured the stochastic process, while the (albeit subjective) playability of the game revealed the benefit of the whole approach.

4.3 Analysis of the URL Sample Set

In addition to validating the crawling heuristics, we also used the data from our URL sample to provide some results from the Web Graph, and to take some web metrics.

Comparison of the AlienBot Data Set with the Google Index

Results from random walks have been used in other studies to compare the coverage of various search engines, by calculating the percentage of pages crawled by a random walk that have not been indexed by the search engine (Henzinger et al, 1999). As such, we compared the results of AlienBot with those of Google, and discovered that 36.85% of the pages we located are not indexed by Google, suggesting that Google's coverage represents 63.15% of the publicly available web. This compares with the estimate of 18.6% coverage that Lawrence and Giles (1999) found in 1998. Given that Google's coverage extends to 3,307,998,701 pages (as of November 2003), we estimate the size of the web to be 5,238,319,400 pages, or 6.5 times its estimated size in 1998 (Lawrence and Giles, 1999).

Table 1 - AlienBot pages indexed in Google

                             Number of pages    % of pages
In Google                           -               63.15
Not in Google                       -               36.85

Table 2 - Web Page Statistics

                             Number of pages    % of pages
JavaScript                          -                 -
VBScript                            -                 -
Total pages with scripting          -               61.74
Total pages using Flash             -                1.91

Miscellaneous Page Statistics

Of the 160,000 URLs we collected during our random walk, we downloaded 25,000 at random to sample the technologies that are currently prevalent on the web. As can be seen in Table 2, 61.74% of all web pages use a scripting language.
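The size estimate above is simple arithmetic on the overlap measurement: the fraction of sampled pages found in the index estimates the index's coverage, and dividing the reported index size by that coverage estimates the size of the web:

```python
# Overlap-based size estimate: if a fraction p of randomly sampled pages is
# found in the index, and the index holds G pages, the web is roughly G / p.
not_in_google = 0.3685          # share of sampled pages missing from Google
coverage = 1.0 - not_in_google  # 0.6315
google_index = 3_307_998_701    # reported index size, November 2003

web_size_estimate = google_index / coverage
# ~5.24 billion pages, matching the paper's figure of 5,238,319,400
```

The estimate inherits the random walk's sampling bias: pages over-visited by the walk (high in-degree pages) are also more likely to be indexed, so the true web size is, if anything, underestimated.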
Of these pages, 99.11% use JavaScript, and only 0.89% use Microsoft's VBScript. Furthermore, we found that only 1.91% of pages use Macromedia's Flash.

7. Issues and Further Work

Ideally, AlienBot would have gathered a larger sample of documents. The small size of the sample means it has a bias towards pages in the neighbourhood of the seed set of URLs. Furthermore, the way in which AlienBot attempts to filter out undesirable content, using a list of unwanted file extensions, is not ideal. This has allowed some non-HTML pages to make it into

the data set, thereby adding some unwanted noise to the results. A future version could make use of HTTP's Content-Type header in order to filter out pages that are not written in some HTML-based mark-up. However, the results from the crawl show that these limitations did not introduce a significant bias into the crawler when compared with results from other studies, particularly considering the relatively small number of web pages in our study.

8. Conclusion

We have presented a novel web crawling agent, the heuristics of which are orthogonally coupled to the enemy generation strategy of a computer game. The computer game adds genuine randomness to the web crawling agent, effectively driving a random walk across the web. The random walk, in turn, generates a stochastic process via the structure of the Web Graph upon which it crawls, thereby introducing novelty to the enemy generation strategy of the game. We have shown the effectiveness of such unorthodox coupling for both the playability of the game and the heuristics of the web crawler, and have presented some of the results of the sample of web pages we collected. We intend to analyze our data further, and to repeat the study with a larger data set.

References

Adamic, L.A. and Huberman, B.A. (2001), The Web's Hidden Order, Communications of the ACM, 44(9).
Albert, R., Jeong, H., and Barabási, A.-L. (1999), The Diameter of the World Wide Web, Nature 401:130.
Baldi, P., Frasconi, P., and Smyth, P. (2003), Modeling the Internet and the Web, John Wiley and Sons, England.
Barabási, A. and Bonabeau, E. (2003), Scale-free Networks, Scientific American 288(5):50-59, May 2003.
Bar-Yossef, Z., Berg, A., Chien, S., Fakcharoenphol, J., and Weitz, D. (2000), Approximating Aggregate Queries about Web Pages via Random Walks. In Proc. 26th International Conference on Very Large Databases.
Broder, A.Z., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J.L. (2000), Graph Structure in the Web. Ninth International World Wide Web Conference / Computer Networks 33(1-6).
Henzinger, M.R., Heydon, A., Mitzenmacher, M., and Najork, M. (1999), Measuring Index Quality Using Random Walks on the Web. In Proc. of the 8th International World Wide Web Conference, Toronto, Canada. Elsevier Science, May 1999.
Henzinger, M.R., Heydon, A., Mitzenmacher, M., and Najork, M. (2000), On Near-Uniform URL Sampling. In Proc. of the 9th International World Wide Web Conference, Amsterdam, May 2000.
Kumar, R., Raghavan, P., Rajagopalan, S., Sivakumar, D., Tomkins, A.S., and Upfal, E. (2000), The Web as a Graph. In Proc. 19th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS).
Lawrence, S. and Giles, C.L. (1999), Accessibility of Information on the Web, Nature 400, 8 July 1999.


More information

Crawling the Infinite Web: Five Levels are Enough

Crawling the Infinite Web: Five Levels are Enough Crawling the Infinite Web: Five Levels are Enough Ricardo Baeza-Yates and Carlos Castillo Center for Web Research, DCC Universidad de Chile {rbaeza,ccastillo}@dcc.uchile.cl Abstract. A large amount of

More information

The Case for Browser Provenance

The Case for Browser Provenance The Case for Browser Provenance Daniel W. Margo and Margo Seltzer Harvard School of Engineering and Applied Sciences Abstract In our increasingly networked world, web browsers are important applications.

More information

Automated Path Ascend Forum Crawling

Automated Path Ascend Forum Crawling Automated Path Ascend Forum Crawling Ms. Joycy Joy, PG Scholar Department of CSE, Saveetha Engineering College,Thandalam, Chennai-602105 Ms. Manju. A, Assistant Professor, Department of CSE, Saveetha Engineering

More information

The Structure of E-Government - Developing a Methodology for Quantitative Evaluation -

The Structure of E-Government - Developing a Methodology for Quantitative Evaluation - The Structure of E-Government - Developing a Methodology for Quantitative Evaluation - Vaclav Petricek :: UCL Computer Science Tobias Escher :: UCL Political Science Ingemar Cox :: UCL Computer Science

More information

Project Report. An Introduction to Collaborative Filtering

Project Report. An Introduction to Collaborative Filtering Project Report An Introduction to Collaborative Filtering Siobhán Grayson 12254530 COMP30030 School of Computer Science and Informatics College of Engineering, Mathematical & Physical Sciences University

More information

Web Crawling. Introduction to Information Retrieval CS 150 Donald J. Patterson

Web Crawling. Introduction to Information Retrieval CS 150 Donald J. Patterson Web Crawling Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Robust Crawling A Robust Crawl Architecture DNS Doc.

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

An Improved Computation of the PageRank Algorithm 1

An Improved Computation of the PageRank Algorithm 1 An Improved Computation of the PageRank Algorithm Sung Jin Kim, Sang Ho Lee School of Computing, Soongsil University, Korea ace@nowuri.net, shlee@computing.ssu.ac.kr http://orion.soongsil.ac.kr/ Abstract.

More information

Information Retrieval. Lecture 11 - Link analysis

Information Retrieval. Lecture 11 - Link analysis Information Retrieval Lecture 11 - Link analysis Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 35 Introduction Link analysis: using hyperlinks

More information

Crawling CE-324: Modern Information Retrieval Sharif University of Technology

Crawling CE-324: Modern Information Retrieval Sharif University of Technology Crawling CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Sec. 20.2 Basic

More information

Web Crawling As Nonlinear Dynamics

Web Crawling As Nonlinear Dynamics Progress in Nonlinear Dynamics and Chaos Vol. 1, 2013, 1-7 ISSN: 2321 9238 (online) Published on 28 April 2013 www.researchmathsci.org Progress in Web Crawling As Nonlinear Dynamics Chaitanya Raveendra

More information

Review on Techniques of Collaborative Tagging

Review on Techniques of Collaborative Tagging Review on Techniques of Collaborative Tagging Ms. Benazeer S. Inamdar 1, Mrs. Gyankamal J. Chhajed 2 1 Student, M. E. Computer Engineering, VPCOE Baramati, Savitribai Phule Pune University, India benazeer.inamdar@gmail.com

More information

A Signaling Game Approach to Databases Querying

A Signaling Game Approach to Databases Querying A Signaling Game Approach to Databases Querying Ben McCamish 1, Arash Termehchy 1, Behrouz Touri 2, and Eduardo Cotilla-Sanchez 1 1 School of Electrical Engineering and Computer Science, Oregon State University

More information

Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler

Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.12, December 2008 349 Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler A.K. Sharma 1, Ashutosh

More information

Today s lecture. Information Retrieval. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

Today s lecture. Information Retrieval. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications Today s lecture Introduction to Information Retrieval Web Crawling (Near) duplicate detection CS276 Information Retrieval and Web Search Chris Manning, Pandu Nayak and Prabhakar Raghavan Crawling and Duplicates

More information

Evaluation of Long-Held HTTP Polling for PHP/MySQL Architecture

Evaluation of Long-Held HTTP Polling for PHP/MySQL Architecture Evaluation of Long-Held HTTP Polling for PHP/MySQL Architecture David Cutting University of East Anglia Purplepixie Systems David.Cutting@uea.ac.uk dcutting@purplepixie.org Abstract. When a web client

More information

Review: Searching the Web [Arasu 2001]

Review: Searching the Web [Arasu 2001] Review: Searching the Web [Arasu 2001] Gareth Cronin University of Auckland gareth@cronin.co.nz The authors of Searching the Web present an overview of the state of current technologies employed in the

More information

A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index.

A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. The major search engines on the Web all have such a program,

More information

Compact Encoding of the Web Graph Exploiting Various Power Laws

Compact Encoding of the Web Graph Exploiting Various Power Laws Compact Encoding of the Web Graph Exploiting Various Power Laws Statistical Reason Behind Link Database Yasuhito Asano, Tsuyoshi Ito 2, Hiroshi Imai 2, Masashi Toyoda 3, and Masaru Kitsuregawa 3 Department

More information

CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER

CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER 4.1 INTRODUCTION In 1994, the World Wide Web Worm (WWWW), one of the first web search engines had an index of 110,000 web pages [2] but

More information

Engineering Quality of Experience: A Brief Introduction

Engineering Quality of Experience: A Brief Introduction Engineering Quality of Experience: A Brief Introduction Neil Davies and Peter Thompson November 2012 Connecting the quality of user experience to parameters a network operator can directly measure and

More information

deseo: Combating Search-Result Poisoning Yu USF

deseo: Combating Search-Result Poisoning Yu USF deseo: Combating Search-Result Poisoning Yu Jin @MSCS USF Your Google is not SAFE! SEO Poisoning - A new way to spread malware! Why choose SE? 22.4% of Google searches in the top 100 results > 50% for

More information

Collection Building on the Web. Basic Algorithm

Collection Building on the Web. Basic Algorithm Collection Building on the Web CS 510 Spring 2010 1 Basic Algorithm Initialize URL queue While more If URL is not a duplicate Get document with URL [Add to database] Extract, add to queue CS 510 Spring

More information

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank Page rank computation HPC course project a.y. 2012-13 Compute efficient and scalable Pagerank 1 PageRank PageRank is a link analysis algorithm, named after Brin & Page [1], and used by the Google Internet

More information

On Veracious Search In Unsystematic Networks

On Veracious Search In Unsystematic Networks On Veracious Search In Unsystematic Networks K.Thushara #1, P.Venkata Narayana#2 #1 Student Of M.Tech(S.E) And Department Of Computer Science And Engineering, # 2 Department Of Computer Science And Engineering,

More information

Crawlability Metrics for Web Applications

Crawlability Metrics for Web Applications Crawlability Metrics for Web Applications N. Alshahwan 1, M. Harman 1, A. Marchetto 2, R. Tiella 2, P. Tonella 2 1 University College London, UK 2 Fondazione Bruno Kessler, Trento, Italy ICST 2012 - Montreal

More information

Analysis of Meta-Search engines using the Meta-Meta- Search tool SSIR

Analysis of Meta-Search engines using the Meta-Meta- Search tool SSIR 2010 International Journal of Computer Applications (0975 8887 Analysis of Meta-Search engines using the Meta-Meta- Search tool SSIR Manoj M Senior Research Fellow Computational Modelling and Simulation

More information

A METHODOLOGY FOR THE EVALUATION OF WEB GRAPH MODELS AND A TEST CASE. Antonios Kogias Dimosthenis Anagnostopoulos

A METHODOLOGY FOR THE EVALUATION OF WEB GRAPH MODELS AND A TEST CASE. Antonios Kogias Dimosthenis Anagnostopoulos Proceedings of the 2006 Winter Simulation Conference L. F. Perrone, F. P. Wieland, J. Liu, B. G. Lawson, D. M. Nicol, and R. M. Fujimoto, eds. A METHODOLOGY FOR THE EVALUATION OF WEB GRAPH MODELS AND A

More information

MURDOCH RESEARCH REPOSITORY

MURDOCH RESEARCH REPOSITORY MURDOCH RESEARCH REPOSITORY http://researchrepository.murdoch.edu.au/ This is the author s final version of the work, as accepted for publication following peer review but without the publisher s layout

More information

Finding Neighbor Communities in the Web using Inter-Site Graph

Finding Neighbor Communities in the Web using Inter-Site Graph Finding Neighbor Communities in the Web using Inter-Site Graph Yasuhito Asano 1, Hiroshi Imai 2, Masashi Toyoda 3, and Masaru Kitsuregawa 3 1 Graduate School of Information Sciences, Tohoku University

More information

Implementation of Enhanced Web Crawler for Deep-Web Interfaces

Implementation of Enhanced Web Crawler for Deep-Web Interfaces Implementation of Enhanced Web Crawler for Deep-Web Interfaces Yugandhara Patil 1, Sonal Patil 2 1Student, Department of Computer Science & Engineering, G.H.Raisoni Institute of Engineering & Management,

More information

THE WEB SEARCH ENGINE

THE WEB SEARCH ENGINE International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com

More information

Distributed Indexing of the Web Using Migrating Crawlers

Distributed Indexing of the Web Using Migrating Crawlers Distributed Indexing of the Web Using Migrating Crawlers Odysseas Papapetrou cs98po1@cs.ucy.ac.cy Stavros Papastavrou stavrosp@cs.ucy.ac.cy George Samaras cssamara@cs.ucy.ac.cy ABSTRACT Due to the tremendous

More information

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha Physics Institute,

More information

Keywords: web crawler, parallel, migration, web database

Keywords: web crawler, parallel, migration, web database ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: Design of a Parallel Migrating Web Crawler Abhinna Agarwal, Durgesh

More information

Browsing the Semantic Web

Browsing the Semantic Web Proceedings of the 7 th International Conference on Applied Informatics Eger, Hungary, January 28 31, 2007. Vol. 2. pp. 237 245. Browsing the Semantic Web Peter Jeszenszky Faculty of Informatics, University

More information

Fast Low-Cost Estimation of Network Properties Using Random Walks

Fast Low-Cost Estimation of Network Properties Using Random Walks Fast Low-Cost Estimation of Network Properties Using Random Walks Colin Cooper, Tomasz Radzik, and Yiannis Siantos Department of Informatics, King s College London, WC2R 2LS, UK Abstract. We study the

More information

Overlay (and P2P) Networks

Overlay (and P2P) Networks Overlay (and P2P) Networks Part II Recap (Small World, Erdös Rényi model, Duncan Watts Model) Graph Properties Scale Free Networks Preferential Attachment Evolving Copying Navigation in Small World Samu

More information

A Two-stage Crawler for Efficiently Harvesting Deep-Web Interfaces

A Two-stage Crawler for Efficiently Harvesting Deep-Web Interfaces A Two-stage Crawler for Efficiently Harvesting Deep-Web Interfaces Md. Nazeem Ahmed MTech(CSE) SLC s Institute of Engineering and Technology Adavelli ramesh Mtech Assoc. Prof Dep. of computer Science SLC

More information

Outline. Last 3 Weeks. Today. General background. web characterization ( web archaeology ) size and shape of the web

Outline. Last 3 Weeks. Today. General background. web characterization ( web archaeology ) size and shape of the web Web Structures Outline Last 3 Weeks General background Today web characterization ( web archaeology ) size and shape of the web What is the size of the web? Issues The web is really infinite Dynamic content,

More information

Crawling the Hidden Web Resources: A Review

Crawling the Hidden Web Resources: A Review Rosy Madaan 1, Ashutosh Dixit 2 and A.K. Sharma 2 Abstract An ever-increasing amount of information on the Web today is available only through search interfaces. The users have to type in a set of keywords

More information

Evaluation Methods for Focused Crawling

Evaluation Methods for Focused Crawling Evaluation Methods for Focused Crawling Andrea Passerini, Paolo Frasconi, and Giovanni Soda DSI, University of Florence, ITALY {passerini,paolo,giovanni}@dsi.ing.unifi.it Abstract. The exponential growth

More information

Chapter 16 Heuristic Search

Chapter 16 Heuristic Search Chapter 16 Heuristic Search Part I. Preliminaries Part II. Tightly Coupled Multicore Chapter 6. Parallel Loops Chapter 7. Parallel Loop Schedules Chapter 8. Parallel Reduction Chapter 9. Reduction Variables

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information

Frontiers in Web Data Management

Frontiers in Web Data Management Frontiers in Web Data Management Junghoo John Cho UCLA Computer Science Department Los Angeles, CA 90095 cho@cs.ucla.edu Abstract In the last decade, the Web has become a primary source of information

More information

Studying the Properties of Complex Network Crawled Using MFC

Studying the Properties of Complex Network Crawled Using MFC Studying the Properties of Complex Network Crawled Using MFC Varnica 1, Mini Singh Ahuja 2 1 M.Tech(CSE), Department of Computer Science and Engineering, GNDU Regional Campus, Gurdaspur, Punjab, India

More information

The 2011 IDN Homograph Attack Mitigation Survey

The 2011 IDN Homograph Attack Mitigation Survey Edith Cowan University Research Online ECU Publications 2012 2012 The 2011 IDN Homograph Attack Survey Peter Hannay Edith Cowan University Gregory Baatard Edith Cowan University This article was originally

More information

A Personalized Multimedia Web-Based Educational System with Automatic Indexing for Multimedia Courses

A Personalized Multimedia Web-Based Educational System with Automatic Indexing for Multimedia Courses A Personalized Multimedia Web-Based Educational System with Automatic Indexing for Multimedia Courses O. Shata Abstract This paper proposes a personalized web-based system with evolving automatic indexing

More information

Statistical Testing of Software Based on a Usage Model

Statistical Testing of Software Based on a Usage Model SOFTWARE PRACTICE AND EXPERIENCE, VOL. 25(1), 97 108 (JANUARY 1995) Statistical Testing of Software Based on a Usage Model gwendolyn h. walton, j. h. poore and carmen j. trammell Department of Computer

More information

A NEW PERFORMANCE EVALUATION TECHNIQUE FOR WEB INFORMATION RETRIEVAL SYSTEMS

A NEW PERFORMANCE EVALUATION TECHNIQUE FOR WEB INFORMATION RETRIEVAL SYSTEMS A NEW PERFORMANCE EVALUATION TECHNIQUE FOR WEB INFORMATION RETRIEVAL SYSTEMS Fidel Cacheda, Francisco Puentes, Victor Carneiro Department of Information and Communications Technologies, University of A

More information

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015 University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 2:00pm-3:30pm, Tuesday, December 15th Name: ComputingID: This is a closed book and closed notes exam. No electronic

More information

Indian Institute of Technology Kanpur. Visuomotor Learning Using Image Manifolds: ST-GK Problem

Indian Institute of Technology Kanpur. Visuomotor Learning Using Image Manifolds: ST-GK Problem Indian Institute of Technology Kanpur Introduction to Cognitive Science SE367A Visuomotor Learning Using Image Manifolds: ST-GK Problem Author: Anurag Misra Department of Computer Science and Engineering

More information

Competitive Intelligence and Web Mining:

Competitive Intelligence and Web Mining: Competitive Intelligence and Web Mining: Domain Specific Web Spiders American University in Cairo (AUC) CSCE 590: Seminar1 Report Dr. Ahmed Rafea 2 P age Khalid Magdy Salama 3 P age Table of Contents Introduction

More information

Deep Web Content Mining

Deep Web Content Mining Deep Web Content Mining Shohreh Ajoudanian, and Mohammad Davarpanah Jazi Abstract The rapid expansion of the web is causing the constant growth of information, leading to several problems such as increased

More information