Distributed Indexing of the Web Using Migrating Crawlers


Odysseas Papapetrou cs98po1@cs.ucy.ac.cy
Stavros Papastavrou stavrosp@cs.ucy.ac.cy
George Samaras cssamara@cs.ucy.ac.cy

ABSTRACT

Due to the tremendous growth rate and the high change frequency of Web documents, maintaining an up-to-date index for searching purposes (search engines) is becoming a challenge. Traditional crawling methods can no longer keep up with the constantly changing and growing Web. Recognizing this problem, in this paper we suggest an alternative distributed crawling method based on mobile agents. Our goal is a scalable crawling scheme that minimizes network utilization, keeps up with document changes, employs time realization, and is easily upgradeable.

Keywords

Migrating Crawler, Search Engine, Mobile Agents, Web Crawling

1. INTRODUCTION

Indexing the Web has become a challenge due to its growing and dynamic nature. The directly accessible Web (also referred to as the surface Web) exceeds 2.5 billion documents, while the indirect Web (dynamically generated documents) is almost three orders of magnitude larger [8]. Moreover, 40 per cent of the Web's content changes every month [6], while no search engine achieves coverage of more than 16% of the estimated Web size [7].

Web crawling (or traditional crawling) has been the dominant practice for Web indexing by popular search engines and research organizations since 1993, but despite the vast computational and network resources thrown at it, traditional crawling cannot effectively keep up with the dynamic Web. More specifically, the traditional crawling model fails for the following reasons:

- The task of processing the crawled data introduces a severe processing bottleneck at the search engine site.
- The attempt to download thousands of documents per second creates a network and a DNS lookup bottleneck.
- Documents are usually downloaded by the crawlers uncompressed, further increasing the network load. In general, compression is not fully exploited, since it is independent of the crawling task and cannot be enforced by the crawlers.
- Crawlers download the entire contents of a document, including useless information such as scripting code and comments, which are rarely necessary for document indexing.

The absence of a scalable crawling method has triggered significant research in the past few years. Focused Crawling [2] was proposed as an alternative method, but it did not introduce any architectural innovations, since it relied on the same centralized practices as traditional crawling. As a first attempt to improve the centralized nature of traditional crawling, a number of distributed methods have been proposed (Harvest [1], Grub [5]).

In this ongoing work, we introduce UCYMicra, a crawling system that utilizes concepts similar to those of mobile and distributed crawling introduced in [3, 4]. UCYMicra extends these concepts and introduces new ones in order to build a more efficient model for distributed Web crawling, capable of keeping up with Web document changes in real time. UCYMicra proposes a complete distributed crawling strategy built on mobile agents technology. The goals are (a) to minimize network utilization, (b) to keep up with document changes by performing on-site monitoring, (c) to avoid unnecessary overloading of the Web servers by employing time realization, and (d) to be upgradeable at run time.

2. The UCYMicra Crawling System

The driving force behind UCYMicra is the utilization of mobile agents that migrate from the search engine to the Web servers, and remain there to crawl, process, and monitor Web documents for updates. Since UCYMicra requires that a specific mobile agents platform be running at each Web server to be crawled, it currently runs in a voluntary distributed academic environment spanning several continents.

UCYMicra (Figure 1) consists of three subsystems: (a) the Coordinator Subsystem, (b) the Mobile Agents Subsystem, and (c) a Public Search Engine that executes user queries on the database maintained by the Coordinator Subsystem. The Coordinator Subsystem resides at the search engine site and is responsible for (a) maintaining the search database, (b) providing online registration for new Web sites to participate in UCYMicra, and (c) administering the Mobile Agents Subsystem. The Mobile Agents Subsystem is responsible for crawling the Web and consists of two categories of mobile agents, namely the Migrating Crawlers (or Mobile Crawlers) and the Data Carriers. Figure 2 shows UCYMicra at work.
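To illustrate how the three subsystems fit together, the following minimal Java skeleton sketches their contracts. This is our own sketch for exposition only: all type and method names are hypothetical, as the paper does not specify UCYMicra's interfaces at this level of detail.

    // Hypothetical sketch of the subsystem contracts; all names are ours,
    // not UCYMicra's actual API.
    interface Coordinator {
        void registerSite(String siteUrl);            // online registration of a participating Web site
        void dispatchCrawler(String siteUrl);         // send a Migrating Crawler to the new site
        void integrateIndex(byte[] compressedIndex);  // merge a delivered index into the search database
    }

    interface MigratingCrawler {
        void onArrival();                             // crawl and index locally after migrating to the server
        void monitorForChanges();                     // on-site monitoring of document updates
    }

    interface DataCarrier {
        void deliver(byte[] compressedIndex);         // carry the compressed index back to the Coordinator
    }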

At the core of the UCYMicra crawling system are the Java-based Migrating Crawlers. Powered by their inherent mobility, the Migrating Crawlers can perform the following tasks (a sketch of tasks 3 through 6 is given after the list):

1. Dispatch: Be dispatched to a newly registered Web server that will participate in UCYMicra.
2. Crawling: A Migrating Crawler can perform a complete local crawl (either through HTTP or through the file system).
3. Processing: Crawled documents are stripped down into keywords, and keywords are ranked based on their visual properties (font and color), position, and occurrence frequency, in order to locally create a keyword index of the Web server contents.
4. Compression: The index of the Web server contents is compressed locally to minimize transmission time between the Migrating Crawler and the Coordinator Subsystem.
5. Data transmission: The compressed index is transmitted to the Coordinator Subsystem by the Data Carriers. There, it is uncompressed and integrated into the search database. Mobile agents were chosen for data transmission over other network APIs (such as RMI, CORBA, or sockets) because their asynchronicity, flexibility, and intelligence ensure the seamless transmission of the data.
6. Monitoring: The Migrating Crawler can detect changes in the Web server contents. Detected changes are instantly processed, compressed, and transmitted to the Coordinator Subsystem.
7. Real-time upgrades: New code for performing any of the above tasks can be easily deployed, since UCYMicra's crawling architecture is based on Java.

Figure 1. UCYMicra Architecture
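To make tasks 3 through 6 concrete, here is a minimal, self-contained Java sketch of the local processing, compression, and change-detection steps. It is illustrative only: the class and method names are our own, and the ranking is simplified to occurrence frequency, whereas UCYMicra also weighs font, color, and position.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.zip.GZIPOutputStream;

    // Illustrative sketch of a Migrating Crawler's local pipeline (tasks 3-6).
    // All names are hypothetical; UCYMicra's real implementation is not shown
    // in the paper.
    public class LocalIndexer {

        // Task 3 (simplified): build a keyword index ranked by occurrence
        // frequency only; the real crawler also weighs font, color, position.
        public static Map<String, Integer> buildIndex(String documentText) {
            Map<String, Integer> index = new HashMap<>();
            for (String token : documentText.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    index.merge(token, 1, Integer::sum); // count occurrences
                }
            }
            return index;
        }

        // Task 4: serialize and GZIP-compress the index so that only the
        // compressed keyword index, not the raw documents, crosses the network.
        public static byte[] compressIndex(Map<String, Integer> index) throws IOException {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(buffer))) {
                out.writeObject(new HashMap<>(index)); // HashMap is Serializable
            }
            return buffer.toByteArray();
        }

        // Task 6: keep a digest per document; re-process and re-transmit a
        // document only when its digest no longer matches.
        public static boolean hasChanged(String documentText, byte[] lastDigest)
                throws NoSuchAlgorithmException {
            byte[] current = MessageDigest.getInstance("MD5").digest(documentText.getBytes());
            return !MessageDigest.isEqual(current, lastDigest);
        }
    }

Task 5 would then hand the compressed byte array to a Data Carrier for delivery to the Coordinator Subsystem.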

3. EVALUATING UCYMicra

We compare the performance of the UCYMicra crawling system against traditional crawling in terms of (a) the size of data transmitted across the Internet, and (b) the total time required for the complete crawling of a given set of documents. For the time being, we study only these two obvious metrics and do not experiment with parameters such as document change frequency. Since it was not feasible to include commercial Web servers in our experiments, we employed a set of ten Web servers within our voluntary distributed academic environment, spanning several continents. Each Web server hosted an average of 200 Web documents with an average document size of 25 Kbytes, yielding 46.2 Mbytes of document data to be crawled by both the traditional crawling and the UCYMicra approach. Due to space limitations, we present our findings for the size of data moved (our findings for the total time required are analogous).

Methodology                                   Data Moved
Traditional Crawling                          46.9 MB
UCYMicra - no processing, no compression      48.1 MB
UCYMicra - w/processing, no compression       13.3 MB
UCYMicra - no processing, w/compression        8.1 MB
UCYMicra - w/processing and compression        2.6 MB

Table 1. Performance Results

The performance results (Table 1) show that UCYMicra (row 5) outperforms traditional crawling (row 1) by generating approximately 20 times less data (46.9 MB versus 2.6 MB, a factor of roughly 18). This is because the Migrating Crawlers process and compress the Web documents locally at the Web server; only the compressed, ranked keyword index of the Web server contents is transmitted to the Coordinator Subsystem. In the traditional crawling approach, the complete contents of a Web server have to be downloaded by the crawler for centralized processing. In addition, a traditional crawler may request, but cannot force, a Web server to compress its contents before download.

To get better insight into our performance results, we performed three more experiments, this time modifying the UCYMicra approach to perform (or not) local processing or compression. Our results showed that with either processing or compression alone, the performance gains still hold to some extent (13.3 MB and 8.1 MB, respectively). With neither processing nor compression enabled, however, those gains are eliminated, since UCYMicra then emulates traditional crawling (48.1 MB).

Figure 2. UCYMicra at work

4. ONGOING WORK

Our ongoing work focuses on extending UCYMicra to support a hybrid crawling mechanism that borrows from both the traditional and the fully distributed crawling approaches. Such a hybrid crawling system will support a hierarchical management structure that considers network locality. Efficient algorithms for work delegation, administration, and result integration are currently under development.
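As a purely hypothetical illustration of what locality-aware delegation might look like (the paper leaves these algorithms as ongoing work), one could group participating sites by a crude locality key and delegate each group to a mid-level coordinator in the hierarchy; all names below are our own.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch only: group participating sites by a crude network
    // locality key (here the top-level domain) so that each group can be
    // delegated to one mid-level coordinator in the hierarchy.
    public class LocalityDelegator {
        public static Map<String, List<String>> groupByLocality(List<String> siteHosts) {
            Map<String, List<String>> groups = new HashMap<>();
            for (String host : siteHosts) {
                String tld = host.substring(host.lastIndexOf('.') + 1);
                groups.computeIfAbsent(tld, k -> new ArrayList<>()).add(host);
            }
            return groups; // a real scheme might use measured round-trip times instead
        }
    }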

5. ACKNOWLEDGEMENTS

This work is partially funded by the Information Society Technologies programme of the European Commission under the IST-2001-32645 DBGlobe project.

6. REFERENCES

1. C. M. Bowman, P. B. Danzig, D. Hardy, U. Manber, and M. F. Schwartz. The Harvest Information Discovery and Access System. In Proc. of the Second International World Wide Web Conference (WWW2), Chicago, October 1994.
2. S. Chakrabarti, M. van den Berg, and B. Dom. Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery. WWW8 / Computer Networks 31(11-16): 1623-1640, 1999.
3. J. Fiedler and J. Hammer. Using the Web Efficiently: Mobile Crawling. In Proc. of the 7th Int'l Conf. of the Association of Management (AoM/IAoM) on Computer Science, pp. 324-329, San Diego, CA, August 1999.
4. J. Fiedler and J. Hammer. Using Mobile Crawlers to Search the Web Efficiently. International Journal of Computer and Information Science, pp. 36-58, 2000.
5. Grub: Distributed Internet Crawler. Available at http://www.grub.org.
6. B. Kahle. Archiving the Internet. Scientific American, 1996.
7. S. Lawrence and C. Lee Giles. Accessibility of Information on the Web. Nature, 400(6740):107-109, July 1999.
8. P. Lyman, H. Varian, J. Dunn, A. Strygin, and K. Swearingen. How Much Information? Available at http://www.sims.berkeley.edu/how-much-info.