A Random Walk Web Crawler with Orthogonally Coupled Heuristics


Andrew Walker and Michael P. Evans
School of Mathematics, Kingston University, Kingston-Upon-Thames, Surrey, UK
Applied Informatics and Semiotics Group, University of Reading, Reading, UK

Abstract

We present a novel web crawling agent whose heuristics are orthogonally coupled to the enemy generation strategy of a computer game. Input from the player of the computer game provides genuine randomness to the web crawling agent, effectively driving a random walk across the web. The random walk, in turn, uses the structure of the Web Graph upon which it crawls to generate a stochastic process, which is used in the enemy generation strategy of the game to add novelty. We show the effectiveness of such unorthodox coupling for both the playability of the game and the heuristics of the web crawler, and present the results of the sample of web pages collected.

Keywords: Web crawler, heuristics, computer game, Web Graph, power law

1. Introduction

Web crawlers rely on heuristics to perform their function of crawling the web. A web crawler is a software agent that downloads web pages referenced by URLs parsed from previously crawled web pages. Web crawlers that sample the web employ random walk heuristics to determine which page to crawl next. However, randomness is not easy for a computer to generate, and the random walk can be further compromised by the structure of the web itself, which introduces additional bias (Bar-Yossef et al, 2000). Computer games, too, must rely on heuristics when controlling the enemy (or enemies) a player must fight. Getting the heuristics right is critical to the playability of the game, as they determine the challenge it poses. If the heuristics are too simple, the game will be easily conquered; if they are too difficult, the player will swiftly become frustrated by her lack of progress, and will stop playing.
In addition, heuristics need to add variety to maintain the interest of the player. We noticed a commonality between the two applications that we could exploit using orthogonal coupling. This is a method in which two components or applications that normally operate independently of one another are coupled to improve their overall efficiency or effectiveness.

As such, we coupled the heuristics of a web crawler with the enemy generation strategy of a computer game. Such unorthodox coupling benefits both applications by introducing true randomness to the web crawler's random walk, and a stochastic enemy generation process to the computer game that depends upon the results returned from the web crawler, and is thus different every time. In short, the coupling improves the effectiveness of both applications. The paper proceeds as follows. Firstly, we discuss the principle behind the orthogonal coupling of a computer game and a web crawler, before moving on to present the architecture of AlienBot: a web crawler coupled to a computer game. Section 4 presents the results of our design, validating the heuristics used by the AlienBot web crawler, and also revealing some interesting statistics gained from a crawl of 160,000 web pages. Finally, the paper concludes by discussing some of the issues we faced, and some suggestions for further work.

2. Orthogonally Coupling Computer Game and Web Crawling Heuristics

2.1 Computer Game Heuristics

A simple example of a computer game's heuristics can be seen in a shoot-em-up game, in which the player controls a lone spacecraft (usually at the bottom of the screen) and attempts to defeat a number of enemy spacecraft that descend upon it. Early games, such as Namco's 1979 hit Galaxian, relied on heuristics in which the enemies would initially start at the top of the screen, but would intermittently swoop down towards the player in batches of two or three. Modern shoot-em-ups, such as Treasure's 2003 hit Ikaruga, are more sophisticated and offer superb visuals, but still rely on a fixed set of heuristics, in which the enemies attack in a standard pattern that can be discerned and predicted after only a few plays.
As such, we recognized that such games could be made more enjoyable by introducing a stochastic process into the enemies' heuristics, thus generating a measure of surprise and making the game subtly different each time.

2.2 Randomly Walking the Web Graph

We found that such a process could be obtained by performing a random walk across the Web Graph. This is the (directed or undirected) graph G = (V, E), where V = {v1, v2, ..., vn} is the set of vertices representing web pages, and E is the collection of edges representing the hyperlinks (or links) that connect the web pages. Thus, G represents the structure of the web in terms of its pages and links. A random walk across G is therefore a stochastic process that iteratively visits the vertices of the graph G (Bar-Yossef et al, 2000). However, as various experiments have discovered, the Web Graph G has unusual properties that introduce considerable bias into a traditional random walk. A random walk across G should generate a finite-state, discrete-time Markov chain, in which the probability of reaching the next web page is independent of the previous web page, given the current state (Baldi et al, 2003). However, G is neither undirected nor regular, and a straightforward walk will have a heavy bias towards pages with a high in-degree (i.e. many links pointing to them) (Bar-Yossef et al, 2000). This leads to a dependence between pages, in which a page on the walk affects the probability that another page is visited. In particular, some pages that are close to one another may be repeatedly visited in quick succession, due to the nature of the links between them and any intermediate pages (Henzinger et al, 2000).

2.3 Orthogonally Coupling the Computer Game to the Web Graph

As described, a random walk across the web is a stochastic process that can contain discernible patterns. Although unwelcome when sampling the web, such a process is ideal for our computer game. In addition, the Web Graph is constantly evolving, with nodes being created and deleted all the time (Baldi et al, 2003). The dynamics of the structure are such that novelty is virtually guaranteed. This is why we chose to couple the two applications. We achieved the coupling through a web crawler, which performs the required random walk by parsing a web page for hyperlinks and following one at random. We coupled the two applications by mapping each hyperlink parsed by the crawler to the generation of one enemy in our computer game. In this way, the exact number of enemies that appear cannot be predicted in advance, but patterns may be discerned, as the sampling bias inherent within the Web Graph is reflected in the pattern of enemies generated from it. Furthermore, we couple the two applications tighter still by making each enemy represent its associated hyperlink, and sending this link back to the crawler when the enemy is shot. In this way, the choice of enemy shot by the player determines the next web page crawled. As each enemy is indistinguishable from its neighbours, the player has no reason to shoot one over another, and thus implicitly adds true randomness into the URL selection process.
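The in-degree bias that shapes these enemy patterns can be illustrated with a short simulation. The four-page graph below is hypothetical, not data from the paper; it shows that a page three others point to is visited more often than a uniform sample over pages would visit it:

```python
import random
from collections import Counter

# Toy directed web graph: page -> out-links. "hub" has in-degree 3.
graph = {
    "a": ["hub", "b"],
    "b": ["hub", "c"],
    "c": ["hub", "a"],
    "hub": ["a", "b", "c"],
}

def random_walk(graph, start, steps, rng):
    """Follow a uniformly random out-link at each step, counting visits."""
    visits = Counter()
    page = start
    for _ in range(steps):
        visits[page] += 1
        page = rng.choice(graph[page])
    return visits

visits = random_walk(graph, "a", 10_000, random.Random(42))
# The walk spends roughly a third of its time on "hub", well above the
# uniform share of one quarter, so pages visited this way are not a
# uniform sample of the graph.
```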
The player therefore blindly selects the hyperlink on behalf of the crawler, and thus drives the random walk blindly across the web while she plays the game. The web crawler and the computer game, traditionally seen as orthogonal applications, are thereby tightly coupled to one another.

3. AlienBot: A Novel Design for a Web Crawler

3.1 A General Overview of AlienBot

Our crawler, called AlienBot, comprises two major components: a client-side game produced using Macromedia Flash, and a server-side script written in PHP. The client-side program handles the interaction with the user, whereas the server-side program is responsible for the bulk of the crawling work (see section 3.2). AlienBot is based on the game Galaxian. It runs in the user's web browser (see Figure 1), and interacts with the server-side script through HTTP requests. In the game, URLs (i.e. hyperlinks) are associated with aliens in a one-to-one mapping. When an alien is shot by the user, its physical

representation on screen is removed and the URL it represents is added to a queue of URLs within the client. The client works through this list one at a time, on a first-in, first-out basis, sending each URL to the server (Figure 2).

Figure 1 - AlienBot. Figure 2 - Overview of the AlienBot Process.

Upon receiving a new URL, the server retrieves the referenced page and parses it for new links. Meanwhile, the client listens for a response from the server, which is sent once the server-side script has finished processing the page. The response sent back from the server consists of a list of one or more URLs retrieved from the hyperlinks on the page being searched. The game can then create new aliens to represent the URLs it has received.

3.2 Processing a New Web Page

When the client sends a URL to the server indicating that an alien has been shot, the server performs the following operations (see Figure 3):

Step 1: Download the page referenced by the URL supplied by the client, and search the page for links and other information about it. URL resolver: resolve local addresses used in links to absolute URLs; that is, convert relative links into fully qualified URLs. Remove links deemed to be of no use (e.g. of an undesired file type, such as .exe files). Database checks: check the database to see if the page has already been crawled.

Step 2: Record in the database all URLs found on the page. Select a random sample (where the randomness is generated by a random number generator) of the resolved URLs to send back to the client. (Note: this step is performed to prevent too many links being sent back to the client, as the user can only manage so many

enemies at once! Hence AlienBot only returns a random subset of the links found, with the number returned calculated as R = (N mod 5) + 1, where R is the number of links sent back to the game and N the number of resolved links on the page that remain after the database checks have been made.)

Step 3: If there are no links of use (i.e. N = 0 in the previous step), a random jump is performed.

Each URL sent to the client is used to represent an enemy. When the player shoots an enemy, its associated link is returned to the server, where its page is crawled, and the process begins again. Thus, AlienBot selects its URLs through a combination of programmatic selection and user interaction.

Figure 3 - AlienBot's Random Walk.

4. Results

4.1 Analyzing the Web Crawler's Performance

In order to test the web crawler, we ran it between 28/04/2003 and 29/07/2003. In all, some 160,000 URLs were collected. After the testing process was complete, we analyzed the web pages referenced by these URLs, and used the statistics obtained to compare the results generated by AlienBot with those of other web crawlers.
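The per-page processing in Steps 1-3 can be sketched as follows. This is a minimal illustration in Python rather than the paper's PHP; the extension blacklist and example URLs are hypothetical, and a small guard is added because R = (N mod 5) + 1 can exceed N when a page has very few links:

```python
import random
from urllib.parse import urljoin, urlparse

# Hypothetical blacklist; the paper only names "undesired file types such as exe".
UNDESIRED_EXTENSIONS = {".exe", ".zip", ".jpg", ".gif"}

def resolve_and_filter(base_url, hrefs):
    """Step 1: resolve local addresses to absolute URLs and drop unusable links."""
    resolved = []
    for href in hrefs:
        absolute = urljoin(base_url, href)           # local -> absolute URL
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):   # skip mailto:, javascript:, ...
            continue
        if any(parsed.path.lower().endswith(ext) for ext in UNDESIRED_EXTENSIONS):
            continue
        resolved.append(absolute)
    return resolved

def links_to_return(resolved_links, rng):
    """Steps 2-3: pick the random subset sent back to the game, R = (N mod 5) + 1.
    An empty result signals the caller to perform a random jump instead."""
    n = len(resolved_links)
    if n == 0:
        return []
    r = min((n % 5) + 1, n)   # guard: the formula can exceed N for small N
    return rng.sample(resolved_links, r)

links = resolve_and_filter(
    "http://example.com/dir/page.html",
    ["other.html", "/top.html", "file.exe", "mailto:someone@example.com"],
)
subset = links_to_return(links, random.Random(0))
```

The database bookkeeping (recording found URLs, skipping already-crawled pages) is omitted; only the link-selection logic that drives the game is shown.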

Figures 4a and 4b show the distribution of out-links (i.e. those on a web page that reference another web page) across the different web pages crawled by AlienBot, and give a good indication of the underlying structure of the Web Graph. Both results clearly show the power law that exists in the Web Graph, and compare well with similar results by Broder et al. (2000), Henzinger et al. (2000), Barabási and Bonabeau (2003), and Adamic and Huberman (2001). In particular, the line of best fit for Figure 4b reveals an exponent of 3.28, which compares well with Kumar et al.'s (2000) value of 2.72, obtained with a crawl of some 200 million web pages. These results therefore validate the effectiveness of our web crawling heuristics in accurately traversing the Web Graph.

Figure 4a - Distribution of out-links. Figure 4b - Log-log plot of the out-link distribution, revealing the power law.

4.2 Analyzing the Game's Performance

The design of the AlienBot architecture introduced no detectable latency to the gameplay from the round-trips to the server, and the unpredictability of the number of aliens to be faced next certainly added to the game's playability. In particular, revealing the URL associated with an alien that the user had just shot added a new and fascinating dimension to the game, as it gave a real sense of the crawling process that the user was inadvertently driving (see Figure 5).

Figure 5 - Randomly Walking the World.

From the game's perspective, the power law of the distribution of links crawled by AlienBot added to the surprise factor in terms of the number of enemies generated. Power law distributions are characterized by scale-free properties, in which a high percentage of one variable (e.g. number of links) corresponds to a low percentage of another (e.g. number of web pages), with the ratio of the percentages remaining (roughly) constant at all scales.
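As a sketch of how such an exponent is read off a log-log plot, the following fits a least-squares line to the log of count against the log of degree. The data here are synthetic, following an exact inverse-cube law, not the crawl's results:

```python
import math

def powerlaw_exponent(degree_counts):
    """Least-squares slope of log(count) vs log(degree); the power-law
    exponent is the negated slope of the log-log plot."""
    points = [(math.log(d), math.log(c)) for d, c in degree_counts if d > 0 and c > 0]
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope

# Synthetic histogram following count ~ degree^-3 exactly, so the fit
# recovers an exponent of 3.0.
data = [(d, 1_000_000 / d**3) for d in range(1, 50)]
exponent = powerlaw_exponent(data)
```

On real crawl data the points scatter around the line, so the recovered exponent is an estimate rather than exact as it is here.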
Thus, a high number of links

will be found on a small number of web pages, which produces the surprise (in terms of enemy generation) at unknown intervals. This was the stochastic process we aimed to capture by using the results from the crawl to drive the game's enemy generation strategy. The validation of the crawler shows that we accurately captured the stochastic process, while the (albeit subjective) playability of the game revealed the benefit of the whole approach.

4.3 Analysis of the URL Sample Set

In addition to validating the crawling heuristics, we also used the data from our URL sample to provide some results from the Web Graph, and to take some web metrics.

Comparison of the AlienBot Data Set with the Google Index

Results from random walks have been used in other studies to compare the coverage of various search engines, by calculating the percentage of pages crawled by a random walk that have not been indexed by the search engine (Henzinger et al, 1999). As such, we compared the results of AlienBot with those of Google, and discovered that 36.85% of the pages we located are not indexed by Google, suggesting that Google's coverage represents 63.15% of the publicly available web. This compares with the estimate of 18.6% coverage that Lawrence and Giles (1999) found in 1998. Given that Google's coverage extends to 3,307,998,701 pages (as of November 2003), we estimate the size of the web to be 5,238,319,400 pages, or 6.5 times its estimated size in 1998 (Lawrence and Giles, 1999).

Table 1 - AlienBot pages indexed in Google

                             Number of pages    % of pages
In Google                           -               63.15
Not in Google                       -               36.85

Table 2 - Web Page Statistics

                             Number of pages    % of pages
JavaScript                          -                 -
VBScript                            -                 -
Total pages with scripting          -               61.74
Total pages using Flash             -                1.91

Miscellaneous Page Statistics

Of the 160,000 URLs we collected during our random walk, we downloaded 25,000 at random to sample the technologies that are currently prevalent on the web. As can be seen in Table 2, 61.74% of all web pages use a scripting language.
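The size estimate above is simple arithmetic on the overlap measurement: the fraction of sampled pages found in the index estimates the index's coverage, and dividing the reported index size by that coverage estimates the size of the web:

```python
# Overlap-based size estimate: if a fraction p of randomly sampled pages is
# found in the index, and the index holds G pages, the web is roughly G / p.
not_in_google = 0.3685          # share of sampled pages missing from Google
coverage = 1.0 - not_in_google  # 0.6315
google_index = 3_307_998_701    # reported index size, November 2003

web_size_estimate = google_index / coverage
# ~5.24 billion pages, matching the paper's figure of 5,238,319,400
```

The estimate inherits the random walk's sampling bias: pages over-visited by the walk (high in-degree pages) are also more likely to be indexed, so the true web size is, if anything, underestimated.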
Of these pages, 99.11% use JavaScript, and only 0.89% use Microsoft's VBScript. Furthermore, we found that only 1.91% of pages use Macromedia's Flash.

7. Issues and Further Work

Ideally, AlienBot would have gathered a larger sample of documents. The small size of the sample means it has a bias towards pages in the neighbourhood of the seed set of URLs. Furthermore, the way in which AlienBot attempts to filter out undesirable content, using a list of unwanted file extensions, is not ideal. This has allowed some non-HTML pages to make it into

the data set, thereby adding some unwanted noise to the results. A future version could make use of HTTP's Content-Type header in order to filter out pages that are not written in some HTML-based mark-up. However, the results from the crawl show that these limitations did not introduce a significant bias into the crawler when compared with results from other studies, particularly considering the relatively small number of web pages in our study.

8. Conclusion

We have presented a novel web crawling agent, the heuristics of which are orthogonally coupled to the enemy generation strategy of a computer game. The computer game adds genuine randomness to the web crawling agent, effectively driving a random walk across the web. The random walk, in turn, generates a stochastic process via the structure of the Web Graph upon which it crawls, thereby introducing novelty to the enemy generation strategy of the game. We have shown the effectiveness of such unorthodox coupling for both the playability of the game and the heuristics of the web crawler, and have presented some of the results of the sample of web pages we collected. We intend to analyze our data further, and to repeat the study with a larger data set.

References

Adamic, L.A. and Huberman, B.A. (2001), The Web's Hidden Order, Communications of the ACM, 44(9).
Albert, R., Jeong, H., and Barabási, A.-L. (1999), The Diameter of the World Wide Web, Nature 401:130.
Baldi, P., Frasconi, P., and Smyth, P. (2003), Modeling the Internet and the Web, John Wiley and Sons, England.
Barabási, A. and Bonabeau, E. (2003), Scale-free Networks, Scientific American 288(5):50-59, May 2003.
Bar-Yossef, Z., Berg, A., Chien, S., Fakcharoenphol, J., and Weitz, D. (2000), Approximating Aggregate Queries about Web Pages via Random Walks. In Proc. 26th International Conference on Very Large Databases.
Broder, A.Z., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J.L. (2000), Graph Structure in the Web. Ninth International World Wide Web Conference / Computer Networks 33(1-6).
Henzinger, M.R., Heydon, A., Mitzenmacher, M., and Najork, M. (1999), Measuring Index Quality Using Random Walks on the Web. In Proc. of the 8th International World Wide Web Conference, Toronto, Canada. Elsevier Science, May 1999.
Henzinger, M.R., Heydon, A., Mitzenmacher, M., and Najork, M. (2000), On Near-Uniform URL Sampling. In Proc. of the 9th International World Wide Web Conference, Amsterdam, May 2000.
Kumar, R., Raghavan, P., Rajagopalan, S., Sivakumar, D., Tomkins, A.S., and Upfal, E. (2000), The Web as a Graph. In Proc. 19th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS).
Lawrence, S. and Giles, C.L. (1999), Accessibility of Information on the Web, Nature 400, 8 July 1999.


More information

Crawling the Infinite Web: Five Levels are Enough

Crawling the Infinite Web: Five Levels are Enough Crawling the Infinite Web: Five Levels are Enough Ricardo Baeza-Yates and Carlos Castillo Center for Web Research, DCC Universidad de Chile {rbaeza,ccastillo}@dcc.uchile.cl Abstract. A large amount of

More information

The Case for Browser Provenance

The Case for Browser Provenance The Case for Browser Provenance Daniel W. Margo and Margo Seltzer Harvard School of Engineering and Applied Sciences Abstract In our increasingly networked world, web browsers are important applications.

More information

Automated Path Ascend Forum Crawling

Automated Path Ascend Forum Crawling Automated Path Ascend Forum Crawling Ms. Joycy Joy, PG Scholar Department of CSE, Saveetha Engineering College,Thandalam, Chennai-602105 Ms. Manju. A, Assistant Professor, Department of CSE, Saveetha Engineering

More information

The Structure of E-Government - Developing a Methodology for Quantitative Evaluation -

The Structure of E-Government - Developing a Methodology for Quantitative Evaluation - The Structure of E-Government - Developing a Methodology for Quantitative Evaluation - Vaclav Petricek :: UCL Computer Science Tobias Escher :: UCL Political Science Ingemar Cox :: UCL Computer Science

More information

Project Report. An Introduction to Collaborative Filtering

Project Report. An Introduction to Collaborative Filtering Project Report An Introduction to Collaborative Filtering Siobhán Grayson 12254530 COMP30030 School of Computer Science and Informatics College of Engineering, Mathematical & Physical Sciences University

More information

Web Crawling. Introduction to Information Retrieval CS 150 Donald J. Patterson

Web Crawling. Introduction to Information Retrieval CS 150 Donald J. Patterson Web Crawling Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Robust Crawling A Robust Crawl Architecture DNS Doc.

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

An Improved Computation of the PageRank Algorithm 1

An Improved Computation of the PageRank Algorithm 1 An Improved Computation of the PageRank Algorithm Sung Jin Kim, Sang Ho Lee School of Computing, Soongsil University, Korea ace@nowuri.net, shlee@computing.ssu.ac.kr http://orion.soongsil.ac.kr/ Abstract.

More information

Information Retrieval. Lecture 11 - Link analysis

Information Retrieval. Lecture 11 - Link analysis Information Retrieval Lecture 11 - Link analysis Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 35 Introduction Link analysis: using hyperlinks

More information

Crawling CE-324: Modern Information Retrieval Sharif University of Technology

Crawling CE-324: Modern Information Retrieval Sharif University of Technology Crawling CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford) Sec. 20.2 Basic

More information

Web Crawling As Nonlinear Dynamics

Web Crawling As Nonlinear Dynamics Progress in Nonlinear Dynamics and Chaos Vol. 1, 2013, 1-7 ISSN: 2321 9238 (online) Published on 28 April 2013 www.researchmathsci.org Progress in Web Crawling As Nonlinear Dynamics Chaitanya Raveendra

More information

Review on Techniques of Collaborative Tagging

Review on Techniques of Collaborative Tagging Review on Techniques of Collaborative Tagging Ms. Benazeer S. Inamdar 1, Mrs. Gyankamal J. Chhajed 2 1 Student, M. E. Computer Engineering, VPCOE Baramati, Savitribai Phule Pune University, India benazeer.inamdar@gmail.com

More information

A Signaling Game Approach to Databases Querying

A Signaling Game Approach to Databases Querying A Signaling Game Approach to Databases Querying Ben McCamish 1, Arash Termehchy 1, Behrouz Touri 2, and Eduardo Cotilla-Sanchez 1 1 School of Electrical Engineering and Computer Science, Oregon State University

More information

Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler

Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.12, December 2008 349 Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler A.K. Sharma 1, Ashutosh

More information

Today s lecture. Information Retrieval. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications

Today s lecture. Information Retrieval. Basic crawler operation. Crawling picture. What any crawler must do. Simple picture complications Today s lecture Introduction to Information Retrieval Web Crawling (Near) duplicate detection CS276 Information Retrieval and Web Search Chris Manning, Pandu Nayak and Prabhakar Raghavan Crawling and Duplicates

More information

Evaluation of Long-Held HTTP Polling for PHP/MySQL Architecture

Evaluation of Long-Held HTTP Polling for PHP/MySQL Architecture Evaluation of Long-Held HTTP Polling for PHP/MySQL Architecture David Cutting University of East Anglia Purplepixie Systems David.Cutting@uea.ac.uk dcutting@purplepixie.org Abstract. When a web client

More information

Review: Searching the Web [Arasu 2001]

Review: Searching the Web [Arasu 2001] Review: Searching the Web [Arasu 2001] Gareth Cronin University of Auckland gareth@cronin.co.nz The authors of Searching the Web present an overview of the state of current technologies employed in the

More information

A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index.

A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. The major search engines on the Web all have such a program,

More information

Compact Encoding of the Web Graph Exploiting Various Power Laws

Compact Encoding of the Web Graph Exploiting Various Power Laws Compact Encoding of the Web Graph Exploiting Various Power Laws Statistical Reason Behind Link Database Yasuhito Asano, Tsuyoshi Ito 2, Hiroshi Imai 2, Masashi Toyoda 3, and Masaru Kitsuregawa 3 Department

More information

CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER

CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER 4.1 INTRODUCTION In 1994, the World Wide Web Worm (WWWW), one of the first web search engines had an index of 110,000 web pages [2] but

More information

Engineering Quality of Experience: A Brief Introduction

Engineering Quality of Experience: A Brief Introduction Engineering Quality of Experience: A Brief Introduction Neil Davies and Peter Thompson November 2012 Connecting the quality of user experience to parameters a network operator can directly measure and

More information

deseo: Combating Search-Result Poisoning Yu USF

deseo: Combating Search-Result Poisoning Yu USF deseo: Combating Search-Result Poisoning Yu Jin @MSCS USF Your Google is not SAFE! SEO Poisoning - A new way to spread malware! Why choose SE? 22.4% of Google searches in the top 100 results > 50% for

More information

Collection Building on the Web. Basic Algorithm

Collection Building on the Web. Basic Algorithm Collection Building on the Web CS 510 Spring 2010 1 Basic Algorithm Initialize URL queue While more If URL is not a duplicate Get document with URL [Add to database] Extract, add to queue CS 510 Spring

More information

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank Page rank computation HPC course project a.y. 2012-13 Compute efficient and scalable Pagerank 1 PageRank PageRank is a link analysis algorithm, named after Brin & Page [1], and used by the Google Internet

More information

On Veracious Search In Unsystematic Networks

On Veracious Search In Unsystematic Networks On Veracious Search In Unsystematic Networks K.Thushara #1, P.Venkata Narayana#2 #1 Student Of M.Tech(S.E) And Department Of Computer Science And Engineering, # 2 Department Of Computer Science And Engineering,

More information

Crawlability Metrics for Web Applications

Crawlability Metrics for Web Applications Crawlability Metrics for Web Applications N. Alshahwan 1, M. Harman 1, A. Marchetto 2, R. Tiella 2, P. Tonella 2 1 University College London, UK 2 Fondazione Bruno Kessler, Trento, Italy ICST 2012 - Montreal

More information

Analysis of Meta-Search engines using the Meta-Meta- Search tool SSIR

Analysis of Meta-Search engines using the Meta-Meta- Search tool SSIR 2010 International Journal of Computer Applications (0975 8887 Analysis of Meta-Search engines using the Meta-Meta- Search tool SSIR Manoj M Senior Research Fellow Computational Modelling and Simulation

More information

A METHODOLOGY FOR THE EVALUATION OF WEB GRAPH MODELS AND A TEST CASE. Antonios Kogias Dimosthenis Anagnostopoulos

A METHODOLOGY FOR THE EVALUATION OF WEB GRAPH MODELS AND A TEST CASE. Antonios Kogias Dimosthenis Anagnostopoulos Proceedings of the 2006 Winter Simulation Conference L. F. Perrone, F. P. Wieland, J. Liu, B. G. Lawson, D. M. Nicol, and R. M. Fujimoto, eds. A METHODOLOGY FOR THE EVALUATION OF WEB GRAPH MODELS AND A

More information

MURDOCH RESEARCH REPOSITORY

MURDOCH RESEARCH REPOSITORY MURDOCH RESEARCH REPOSITORY http://researchrepository.murdoch.edu.au/ This is the author s final version of the work, as accepted for publication following peer review but without the publisher s layout

More information

Finding Neighbor Communities in the Web using Inter-Site Graph

Finding Neighbor Communities in the Web using Inter-Site Graph Finding Neighbor Communities in the Web using Inter-Site Graph Yasuhito Asano 1, Hiroshi Imai 2, Masashi Toyoda 3, and Masaru Kitsuregawa 3 1 Graduate School of Information Sciences, Tohoku University

More information

Implementation of Enhanced Web Crawler for Deep-Web Interfaces

Implementation of Enhanced Web Crawler for Deep-Web Interfaces Implementation of Enhanced Web Crawler for Deep-Web Interfaces Yugandhara Patil 1, Sonal Patil 2 1Student, Department of Computer Science & Engineering, G.H.Raisoni Institute of Engineering & Management,

More information

THE WEB SEARCH ENGINE

THE WEB SEARCH ENGINE International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com

More information

Distributed Indexing of the Web Using Migrating Crawlers

Distributed Indexing of the Web Using Migrating Crawlers Distributed Indexing of the Web Using Migrating Crawlers Odysseas Papapetrou cs98po1@cs.ucy.ac.cy Stavros Papastavrou stavrosp@cs.ucy.ac.cy George Samaras cssamara@cs.ucy.ac.cy ABSTRACT Due to the tremendous

More information

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha Physics Institute,

More information

Keywords: web crawler, parallel, migration, web database

Keywords: web crawler, parallel, migration, web database ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: Design of a Parallel Migrating Web Crawler Abhinna Agarwal, Durgesh

More information

Browsing the Semantic Web

Browsing the Semantic Web Proceedings of the 7 th International Conference on Applied Informatics Eger, Hungary, January 28 31, 2007. Vol. 2. pp. 237 245. Browsing the Semantic Web Peter Jeszenszky Faculty of Informatics, University

More information

Fast Low-Cost Estimation of Network Properties Using Random Walks

Fast Low-Cost Estimation of Network Properties Using Random Walks Fast Low-Cost Estimation of Network Properties Using Random Walks Colin Cooper, Tomasz Radzik, and Yiannis Siantos Department of Informatics, King s College London, WC2R 2LS, UK Abstract. We study the

More information

Overlay (and P2P) Networks

Overlay (and P2P) Networks Overlay (and P2P) Networks Part II Recap (Small World, Erdös Rényi model, Duncan Watts Model) Graph Properties Scale Free Networks Preferential Attachment Evolving Copying Navigation in Small World Samu

More information

A Two-stage Crawler for Efficiently Harvesting Deep-Web Interfaces

A Two-stage Crawler for Efficiently Harvesting Deep-Web Interfaces A Two-stage Crawler for Efficiently Harvesting Deep-Web Interfaces Md. Nazeem Ahmed MTech(CSE) SLC s Institute of Engineering and Technology Adavelli ramesh Mtech Assoc. Prof Dep. of computer Science SLC

More information

Outline. Last 3 Weeks. Today. General background. web characterization ( web archaeology ) size and shape of the web

Outline. Last 3 Weeks. Today. General background. web characterization ( web archaeology ) size and shape of the web Web Structures Outline Last 3 Weeks General background Today web characterization ( web archaeology ) size and shape of the web What is the size of the web? Issues The web is really infinite Dynamic content,

More information

Crawling the Hidden Web Resources: A Review

Crawling the Hidden Web Resources: A Review Rosy Madaan 1, Ashutosh Dixit 2 and A.K. Sharma 2 Abstract An ever-increasing amount of information on the Web today is available only through search interfaces. The users have to type in a set of keywords

More information

Evaluation Methods for Focused Crawling

Evaluation Methods for Focused Crawling Evaluation Methods for Focused Crawling Andrea Passerini, Paolo Frasconi, and Giovanni Soda DSI, University of Florence, ITALY {passerini,paolo,giovanni}@dsi.ing.unifi.it Abstract. The exponential growth

More information

Chapter 16 Heuristic Search

Chapter 16 Heuristic Search Chapter 16 Heuristic Search Part I. Preliminaries Part II. Tightly Coupled Multicore Chapter 6. Parallel Loops Chapter 7. Parallel Loop Schedules Chapter 8. Parallel Reduction Chapter 9. Reduction Variables

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information

Frontiers in Web Data Management

Frontiers in Web Data Management Frontiers in Web Data Management Junghoo John Cho UCLA Computer Science Department Los Angeles, CA 90095 cho@cs.ucla.edu Abstract In the last decade, the Web has become a primary source of information

More information

Studying the Properties of Complex Network Crawled Using MFC

Studying the Properties of Complex Network Crawled Using MFC Studying the Properties of Complex Network Crawled Using MFC Varnica 1, Mini Singh Ahuja 2 1 M.Tech(CSE), Department of Computer Science and Engineering, GNDU Regional Campus, Gurdaspur, Punjab, India

More information

The 2011 IDN Homograph Attack Mitigation Survey

The 2011 IDN Homograph Attack Mitigation Survey Edith Cowan University Research Online ECU Publications 2012 2012 The 2011 IDN Homograph Attack Survey Peter Hannay Edith Cowan University Gregory Baatard Edith Cowan University This article was originally

More information

A Personalized Multimedia Web-Based Educational System with Automatic Indexing for Multimedia Courses

A Personalized Multimedia Web-Based Educational System with Automatic Indexing for Multimedia Courses A Personalized Multimedia Web-Based Educational System with Automatic Indexing for Multimedia Courses O. Shata Abstract This paper proposes a personalized web-based system with evolving automatic indexing

More information

Statistical Testing of Software Based on a Usage Model

Statistical Testing of Software Based on a Usage Model SOFTWARE PRACTICE AND EXPERIENCE, VOL. 25(1), 97 108 (JANUARY 1995) Statistical Testing of Software Based on a Usage Model gwendolyn h. walton, j. h. poore and carmen j. trammell Department of Computer

More information

A NEW PERFORMANCE EVALUATION TECHNIQUE FOR WEB INFORMATION RETRIEVAL SYSTEMS

A NEW PERFORMANCE EVALUATION TECHNIQUE FOR WEB INFORMATION RETRIEVAL SYSTEMS A NEW PERFORMANCE EVALUATION TECHNIQUE FOR WEB INFORMATION RETRIEVAL SYSTEMS Fidel Cacheda, Francisco Puentes, Victor Carneiro Department of Information and Communications Technologies, University of A

More information

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015 University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 2:00pm-3:30pm, Tuesday, December 15th Name: ComputingID: This is a closed book and closed notes exam. No electronic

More information

Indian Institute of Technology Kanpur. Visuomotor Learning Using Image Manifolds: ST-GK Problem

Indian Institute of Technology Kanpur. Visuomotor Learning Using Image Manifolds: ST-GK Problem Indian Institute of Technology Kanpur Introduction to Cognitive Science SE367A Visuomotor Learning Using Image Manifolds: ST-GK Problem Author: Anurag Misra Department of Computer Science and Engineering

More information

Competitive Intelligence and Web Mining:

Competitive Intelligence and Web Mining: Competitive Intelligence and Web Mining: Domain Specific Web Spiders American University in Cairo (AUC) CSCE 590: Seminar1 Report Dr. Ahmed Rafea 2 P age Khalid Magdy Salama 3 P age Table of Contents Introduction

More information

Deep Web Content Mining

Deep Web Content Mining Deep Web Content Mining Shohreh Ajoudanian, and Mohammad Davarpanah Jazi Abstract The rapid expansion of the web is causing the constant growth of information, leading to several problems such as increased

More information