A Hierarchical Web Page Crawler for Crawling the Internet Faster


Anirban Kundu, Ruma Dutta, Debajyoti Mukhopadhyay and Young-Chon Kim
Web Intelligence & Distributed Computing Research Lab, Techno India (West Bengal University of Technology)
EM 4/1, Sector V, Salt Lake, Calcutta 700091, India
{anik76in, rumadutta, debajyoti.mukhopadhyay}@gmail.com

Chonbuk National University, Division of Electronics & Information Engineering
561-756 Jeonju, Republic of Korea
yckim@chonbuk.ac.kr

Abstract: We propose a new hierarchical technique for downloading Web-pages rather than using Single Crawling or Parallel Crawling in every situation. A Hierarchical Crawler is a Crawler with a hierarchical view: a number of Crawlers are used, depending on the required level of the hierarchy. It is like a Parallel Crawler in the sense that more than one Crawler is used to download the Web-pages of a Website; looking more closely, however, these Crawlers are created and managed dynamically at run-time by a threaded program, based on the number of hyperlinks (out-links) of a particular Web-page. Focussed crawling is introduced as well. This paper proposes a hierarchical technique for dedicated crawling of Web-pages into different domains.

Index Terms: Seed URL Queue (SQ), Single Crawler (SCrw), Focussed Single Crawler (FSCrw), Parallel Crawler (PCrw), Focussed Parallel Crawler (FPCrw), Hierarchical Crawler (HCrw), Focussed Hierarchical Crawler (FHCrw).

I. INTRODUCTION

Web-page crawling is an important issue for downloading Web documents in order to facilitate the indexing, search, and retrieval of Web-pages on behalf of a Search Engine. Crawlers are also known as Spiders or Robots. The crawling technique has been utilized to accumulate Web-pages of different domains under separate databases. The enormous growth of the World Wide Web (WWW) in recent years has made it important to perform resource discovery efficiently [1-3]. In this paper, we work with different types of crawling mechanisms to download Web-pages from the WWW.

This paper is organized as follows: Section 2 presents the crawling concept. Our approach is described in Section 3 along with the algorithms. Experimental results are shown in Section 4. Finally, we conclude our work in Section 5.

II. CRAWLING CONCEPT

Most advanced Web Search Engine Crawlers go out to check the Web-pages of a Website. A Web Crawler is also known as a Spider or Robot, and is commonly used for Search Engine optimization. The Spider first finds the home-page and grabs the information of its head section, signified by the opening and closing head tags; then it looks for the description and keywords meta-tags. The basic idea of a Crawler is to crawl the Web in a directed or un-directed way, collecting all the important information about Web-pages on the fly [4]. The Crawler keeps a copy of visited Web-pages for further processing at the server side. On arriving at a Website, the Crawler looks in the root folder for a file called robots.txt, which lists the directories and files the Crawler is permitted to look at. In practice, some Crawlers ignore these instructions. Once Web-pages are found within a Website, each Web-page is read by the Crawler in the following sequence:

Title of the Web-page
Meta-tags
Contents of Web-page
URL linkage
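The reading sequence above can be illustrated with a short Python sketch (the paper itself gives no code). This is only an assumption-laden illustration, not the authors' implementation: the names PageReader and fetch_if_allowed are hypothetical, and only the Python standard library is used. It checks robots.txt first and then collects the title, meta-tags, contents and out-links of one page.

import urllib.request
import urllib.robotparser
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class PageReader(HTMLParser):
    """Collects the title, meta-tags, text contents and hyperlinks of one page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title, self.meta, self.links, self.text = "", {}, [], []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:          # e.g. description, keywords
            self.meta[attrs["name"].lower()] = attrs.get("content", "")
        elif tag == "a" and "href" in attrs:             # out-links of the page
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text.append(data.strip())               # page contents

def fetch_if_allowed(url, agent="HCrwBot"):
    """Honour robots.txt (this sketch does; some real crawlers do not)."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = urllib.robotparser.RobotFileParser(root + "/robots.txt")
    rp.read()
    if not rp.can_fetch(agent, url):
        return None                                      # disallowed by the site
    with urllib.request.urlopen(url) as resp:
        reader = PageReader(url)
        reader.feed(resp.read().decode("utf-8", errors="replace"))
        return reader                                    # title, meta, text, links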

III. OUR APPROACH

In our approach, we have designed a Crawler based system to download unique Web-pages depending on the hyperlinks in the Web-pages. The Crawler typically downloads the Web-pages to store into a Repository or Database. Initially, we have designed SCrw for our experiment using Algorithm 1. Two check points are introduced to validate the Web-page and the extracted URLs respectively, as shown in Figure 1. It has been found that a URL may have several synonymous addresses; to download unique Web-pages, we have to take care of this situation. If the downloaded Web-page has a URL which has already been visited through the Crawler, then the downloaded Web-page is removed from the system. For example, http://www.coke.com and http://www.cocacola.com both point to the same http://www.coca-cola.com/indexb.html.

Since a Crawler may download Web-pages which are not required for the specific work of interest, FSCrw is further introduced using Algorithm 2 to achieve better performance. In this algorithm, another check point is introduced, which checks whether the downloaded Web-page has a valid topic or not (Figure 2).

After that, we have modified our design to build PCrw to offer better performance at the time of downloading the Web-pages from the Internet. Using N (N >= 1) Crawlers, a system can download and further process the Web-pages N times faster than SCrw, and the system overhead is also much lower, since we have considered different computer machines for storing the Web-pages of different domains using Algorithm 4. So, in our approach, we have dispersed the whole network load into distinct domain loads. With the PCrw technique, Web coverage within a specific time is larger compared to SCrw. Since we have used distinct Crawlers to download and store Web-pages of various domains, there is no chance of overlapping (Figure 3). We have used a Transfer Program module to transfer data from one domain to another using Algorithm 3. After parsing the downloaded Web-pages, there can be many URLs containing the address of another domain; that is why this type of transfer module is a necessity while designing the whole system. So, dynamic assignment is done for allocating URLs within the SQ of distinct domains. Again, FPCrw is imposed on this situation using Algorithm 5 to obtain better results. The aim of our FPCrw is to selectively search for Web-pages that are relevant to a pre-defined set of topics. The FPCrw analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, as shown in Figure 4.
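The first check point introduced for SCrw above, detecting that two synonymous addresses (such as the coke.com and cocacola.com example) refer to the same Web-page, can be approximated by resolving every URL to a canonical form before comparing it with the visited set. The following is a minimal sketch, assuming that following redirects is enough to expose the synonym; the helper names are hypothetical and this is not the authors' code.

import urllib.request
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url):
    """Follow redirects and normalise the scheme, host and path of the final URL."""
    with urllib.request.urlopen(url) as resp:
        final = resp.geturl()                    # address after all redirects
    parts = urlsplit(final)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

visited = set()                                  # canonical URLs already downloaded

def is_duplicate(url):
    """True if a synonymous address of this page was already downloaded."""
    canon = canonical_url(url)
    if canon in visited:
        return True
    visited.add(canon)
    return False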
In the next phase of our experimentation, the Crawling system is further enhanced to achieve better performance with respect to time. This is known as the HCrw system. A set of Crawlers is generated dynamically at runtime depending on the number of URLs in the SQ. Each Crawler downloads a specific Web-page and then kills itself at the end of its life time, as described in Algorithm 6. HCrw is the combination of PCrw and a depth level of searching; here, the depth level sets the penetration limit. For example, if depth=0, then only the initial URLs of the SQ are downloaded through N dynamically created Crawlers. If depth=1, then the downloaded Web-pages of depth=0 are parsed by the Parser module and the new distinct URLs are saved within the SQ for downloading the next set of Web-pages through dynamically created Crawlers, as discussed in Algorithm 7 (Figure 5). In this way, at any depth, the HCrw system generates N Crawlers, does the needful work, and then kills the Crawlers. The downloading operation is done in parallel. Focussed Crawling (as discussed earlier) is again introduced in HCrw (Algorithm 8) for topic based searching, as shown in Figure 6.

Definition 1 : Seed URL Queue (SQ) - A queue containing a number of URLs as seeds for downloading Web-pages from the Internet is known as a Seed URL Queue.

Definition 2 : Valid Topic - A Web-page contains a valid topic whenever its contents are related to the specified topic.

Definition 3 : Single Crawler (SCrw) - A Single Crawler is a set of programs by which Web-pages are surfed by a Search Engine to download selected Web-pages.

Fig. 1. Pictorial Representation of Single Crawling

Algorithm 1: Single Crawling
Input : A set of URLs within the SQ
Output : Database of downloaded Web-pages within the server, irrespective of domain
Step 1 : Crawler starts crawling with the seed URLs
Step 2 : Crawler downloads the Web-page taking a URL from the SQ
Step 3 : Check whether the URL of the downloaded Web-page is already visited
Step 4 : If already visited, then goto Step 9
Step 5 : Else, save the Web-page
Step 6 : Hyperlinks of each Web-page are extracted using the Parser tool
Step 7 : Check whether the extracted URLs are already visited
Step 8 : Save those extracted hyperlinks, which are still not visited, within the SQ as well as in the URL storage
Step 9 : If the condition for continuing the downloading is satisfied
Step 10 : then Goto Step 2
Step 11 : Stop

Definition 4 : Focussed Single Crawler (FSCrw) - A Single Crawler being used for a specific topic is known as a Focussed Single Crawler.

Fig. 2. Pictorial Representation of Focussed Single Crawling

Algorithm 2: Focussed Single Crawling
Input : A set of URLs within the SQ
Output : Database of focussed downloaded URLs within the server, irrespective of domain
Step 1 : Crawler starts crawling with the seed URLs
Step 2 : Crawler downloads the Web-page taking a URL from the SQ
Step 3 : Check whether the URL of the downloaded Web-page is already visited
Step 4 : If already visited, then goto Step 11
Step 5 : Else check for valid topic
Step 6 : If downloaded Web-page does not contain valid topic, then discard the Web-page and goto Step 11
Step 7 : Else, save the Web-page
Step 8 : Hyperlinks of each Web-page are extracted using the Parser tool
Step 9 : Check whether the extracted URLs are already visited
Step 10 : Save those extracted hyperlinks, which are still not visited, within the SQ as well as in the URL storage
Step 11 : If the condition for continuing the downloading is satisfied
Step 12 : then Goto Step 2
Step 13 : Stop
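Algorithms 1 and 2 can be summarised in a single loop; a hedged Python sketch follows. It is not the authors' code: single_crawl, and the helpers canonical_url and fetch_if_allowed from the earlier sketches, are hypothetical names, and the "condition for continuing the downloading" is assumed here to be a simple page budget. Passing a topic_test turns the Single Crawler (Algorithm 1) into the Focussed Single Crawler (Algorithm 2).

from collections import deque

def single_crawl(seed_urls, max_pages=100, topic_test=None):
    sq = deque(seed_urls)                        # Seed URL Queue (SQ)
    visited, saved = set(), {}
    while sq and len(saved) < max_pages:         # condition for continuing
        url = sq.popleft()                       # Step 2: take a URL from the SQ
        try:
            canon = canonical_url(url)           # Steps 3-4: already visited?
            if canon in visited:
                continue
            visited.add(canon)
            page = fetch_if_allowed(url)
        except OSError:
            continue                             # unreachable or malformed URL
        if page is None:
            continue                             # disallowed by robots.txt
        if topic_test and not topic_test(page):  # Steps 5-6 of Algorithm 2 only
            continue                             # discard the off-topic page
        saved[canon] = page                      # Step 7: save the Web-page
        for link in page.links:                  # Steps 8-10: parsed hyperlinks
            if link not in visited:
                sq.append(link)                  # links not seen so far go back into the SQ
    return saved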

Fig. 3. Pictorial Representation of Parallel Crawling

Algorithm 3: Transfer Module
Input : Parsed URL and the corresponding downloaded Web-page of a particular domain
Output : Transfer of the URL to the domain-based SQ
Step 1 : If the parsed URL and the input Web-page belong to the same domain
Step 2 : then goto Step 6
Step 3 : Search for the specific domain
Step 4 : If the domain is found
Step 5 : then transfer the parsed URL to the specific SQ of the related domain
Step 6 : Stop

Definition 5 : Parallel Crawler (PCrw) - A set of Crawlers working concurrently is known as a Parallel Crawler.

Algorithm 4: Parallel Crawling
Input : A set of URLs of different domains within different SQs
Output : Database of downloaded Web-pages within the servers of the respective domains
Step 1 : A set of Crawlers start crawling with the URLs of their respective domains
Step 2 : Each Crawler downloads the Web-page taking a URL from its respective SQ
Step 3 : Check whether the URL of the downloaded Web-page is already visited
Step 4 : If already visited, then goto Step 13
Step 5 : Else, save the Web-page
Step 6 : Hyperlinks of each Web-page are extracted using the Parser tool
Step 7 : Check whether the extracted URLs are already visited
Step 8 : If not visited, then check for domain
Step 9 : If extracted hyperlinks belong to the same domain
Step 10 : then save the hyperlinks within the respective SQ and storage
Step 11 : If extracted hyperlinks belong to different domains
Step 12 : then use the Transfer Program to transfer the hyperlinks to their respective SQ and storage
Step 13 : If the condition for continuing the downloading is satisfied
Step 14 : then goto Step 2
Step 15 : Stop
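A hedged sketch of Algorithms 3 and 4 follows: one seed URL queue and one worker per top-level domain, with a transfer() routine standing in for the Transfer Program that hands a parsed URL over to the SQ of its own domain. The names, the choice of .com/.org/.net, and the download callback are assumptions for illustration, not the authors' implementation.

import queue
import threading
from urllib.parse import urlparse

DOMAINS = (".com", ".org", ".net")                   # one Crawler per domain
domain_sq = {d: queue.Queue() for d in DOMAINS}      # per-domain Seed URL Queues

def transfer(url):
    """Algorithm 3: route a parsed URL to the SQ of the domain it belongs to."""
    host = urlparse(url).netloc.lower()
    for d in DOMAINS:
        if host.endswith(d):
            domain_sq[d].put(url)                    # matching domain found
            return True
    return False                                     # domain not handled

def domain_crawler(domain, download, stop):
    """Algorithm 4 worker: crawls only URLs of its own domain."""
    visited = set()
    while not stop.is_set():                         # condition for continuing
        try:
            url = domain_sq[domain].get(timeout=1)
        except queue.Empty:
            continue
        if url in visited:
            continue
        visited.add(url)
        for link in download(url):                   # download() returns extracted out-links
            if urlparse(link).netloc.lower().endswith(domain):
                domain_sq[domain].put(link)          # same domain: keep it local
            else:
                transfer(link)                       # other domain: hand it over

# Usage (sketch): stop = threading.Event()
# for d in DOMAINS:
#     threading.Thread(target=domain_crawler, args=(d, my_download, stop)).start()

Running one such thread per domain mirrors the domain-wise split of the network load described above, with the Transfer Program keeping the domain-specific stores from overlapping.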

Fig. 4. Pictorial Representation of Focussed Parallel Crawling

Definition 6 : Focussed Parallel Crawler (FPCrw) - A Parallel Crawler being used for a specific set of topics is known as a Focussed Parallel Crawler.

Algorithm 5: Focussed Parallel Crawling
Input : A set of URLs of different domains within different SQs
Output : Database of focussed downloaded URLs within the servers of the respective domains
Step 1 : A set of Crawlers start crawling with the URLs of their respective domains
Step 2 : Each Crawler downloads the Web-page taking a URL from its respective SQ
Step 3 : Check whether the URL of the downloaded Web-page is already visited
Step 4 : If already visited, then goto Step 15
Step 5 : Else check for valid topic
Step 6 : If downloaded Web-page does not contain valid topic, then discard the Web-page and goto Step 15
Step 7 : Else, save the Web-page
Step 8 : Hyperlinks of each Web-page are extracted using the Parser tool
Step 9 : Check whether the extracted URLs are already visited
Step 10 : If not visited, then check for domain
Step 11 : If extracted hyperlinks belong to the same domain
Step 12 : then save the hyperlinks within the respective SQ and storage
Step 13 : If extracted hyperlinks belong to different domains
Step 14 : then use the Transfer Program to transfer the hyperlinks to their respective SQ and storage
Step 15 : If the condition for continuing the downloading is satisfied
Step 16 : then goto Step 2
Step 17 : Stop
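The "check for valid topic" step used by the focussed variants (Algorithms 2, 5 and 8) is not specified further in the paper; a simple keyword-overlap predicate is one possible stand-in, sketched below under that assumption. make_topic_test is a hypothetical name, and the page object is the one produced by the earlier PageReader sketch.

def make_topic_test(topic_keywords, threshold=2):
    """Return a predicate accepting pages that mention enough topic keywords."""
    keywords = {k.lower() for k in topic_keywords}

    def topic_test(page):
        text = " ".join(page.text).lower()           # page contents
        hits = sum(1 for k in keywords if k in text)
        return hits >= threshold                     # "valid topic" if enough keywords occur

    return topic_test

# Example: single_crawl(seeds, topic_test=make_topic_test(["crawler", "search engine", "indexing"]))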

Definition 7 : Life Cycle of Dynamic Crawler - The Life Cycle of a Crawler refers to the period of time between when a Crawler is created and when it is destroyed.

Algorithm 6: Life Cycle of Dynamic Crawler
Input : A set of URLs
Output : Downloaded Web-pages
Step 1 : Check the number of URLs within the SQ
Step 2 : Generate the same number of Crawlers at runtime
Step 3 : Assign each seed (URL) to a specific Crawler
Step 4 : Download all Web-pages
Step 5 : Kill all Crawlers
Step 6 : Stop

Fig. 5. Pictorial Representation of Hierarchical Crawling. (The figure shows the Crawler Creation Module and Crawler Destruction Module between the set of URLs S := {S1, S2, ..., Sn} and the set of Crawlers C := {C1, C2, ..., Cn} searching for and downloading Web-pages from the Internet. Note: n URLs are selected at a time by n dynamically created Crawlers, where n = 1, 2, 3, ...)

Definition 8 : Hierarchical Crawler (HCrw) - A Hierarchical Crawler is a set of dynamically created Crawlers whose number of instances depends on the number of URLs. One Crawler can download only one Web-page, using its assigned URL from the SQ, in its life time. After downloading the Web-pages, these Crawlers are destroyed automatically.

Definition 9 : Depth - The level of searching through the WWW for downloading the required Web-pages in a hierarchical fashion.

Algorithm 7: Hierarchical Crawling
Input : A set of URLs within the SQ and the Depth level of searching
Output : Database of downloaded Web-pages
Step 1 : Initialize i with 0
Step 2 : Continue the loop (Steps 3 to 10) while i <= Depth
Step 3 : Call Algorithm 6
Step 4 : Check whether the URLs of the downloaded Web-pages are already visited
Step 5 : Discard already visited Web-pages
Step 6 : Save new Web-pages
Step 7 : Hyperlinks of all saved Web-pages are extracted using the Parser tool
Step 8 : Check whether the extracted URLs are already visited
Step 9 : Save those extracted hyperlink(s), which are still not visited, within the SQ as well as in the URL storage
Step 10 : Increment i by 1
Step 11 : Stop

Fig. 6. Pictorial Representation of Focussed Hierarchical Crawling. (Note: n URLs are selected at a time by n dynamically created Crawlers, where n = 1, 2, 3, ...)

Definition 10 : Focussed Hierarchical Crawler (FHCrw) - A Hierarchical Crawler being used for a specific topic is known as a Focussed Hierarchical Crawler.
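Algorithms 6 and 7 together can be sketched as follows: at every depth level one short-lived thread (a dynamic Crawler) is created per URL in the SQ, each thread downloads exactly one page and is then destroyed, and the distinct unvisited out-links form the SQ of the next level. This is an illustrative sketch, not the authors' implementation; fetch_if_allowed is the hypothetical helper from the earlier sketch.

import threading

def crawl_level(sq, results):
    """Algorithm 6: one dynamically created Crawler per seed URL, run in parallel."""
    def one_crawler(url):                        # life cycle: created, downloads one page, dies
        try:
            page = fetch_if_allowed(url)
        except OSError:
            page = None
        if page is not None:
            results[url] = page
    crawlers = [threading.Thread(target=one_crawler, args=(u,)) for u in sq]
    for c in crawlers:
        c.start()
    for c in crawlers:
        c.join()                                 # all Crawlers of this level are destroyed

def hierarchical_crawl(seed_urls, depth):
    """Algorithm 7: repeat Algorithm 6 once for every depth level 0..Depth."""
    sq, visited, saved = list(seed_urls), set(), {}
    for _ in range(depth + 1):                   # i = 0 .. Depth
        sq = [u for u in sq if u not in visited] # discard already visited URLs
        visited.update(sq)
        level = {}
        crawl_level(sq, level)                   # N Crawlers for N URLs, in parallel
        saved.update(level)
        links = [l for page in level.values()    # parse hyperlinks of saved pages
                 for l in page.links if l not in visited]
        sq = list(dict.fromkeys(links))          # distinct new URLs become the next SQ
    return saved

Applying a topic_test hook, as in the earlier focussed sketch, to each downloaded page would turn this into the FHCrw of Algorithm 8 below.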

Algorithm 8: Focussed Hierarchical Crawling
Input : A set of URLs within the SQ and the Depth level of searching
Output : Database of focussed downloaded URLs
Step 1 : Initialize i with 0
Step 2 : Continue the loop (Steps 3 to 12) while i <= Depth
Step 3 : Call Algorithm 6
Step 4 : Check whether the URLs of the downloaded Web-pages are already visited
Step 5 : Discard already visited Web-pages
Step 6 : New Web-pages are checked for valid topic
Step 7 : Discard the Web-pages which do not contain valid topic
Step 8 : Save the remaining Web-pages
Step 9 : Hyperlinks of all saved Web-pages are extracted using the Parser tool
Step 10 : Check whether the extracted URLs are already visited
Step 11 : Save those extracted hyperlinks, which are still not visited, within the SQ as well as in the URL storage
Step 12 : Increment i by 1
Step 13 : Stop

IV. EXPERIMENTAL RESULTS

In this section, experimental results are shown in Table I and Table II. Table I lists a few Websites along with the number of downloaded Web-pages. These Web-pages are downloaded through SCrw, PCrw and HCrw separately, to compare the time taken by each type of Crawler as shown in Table II.

TABLE I: SAMPLE DOWNLOADED WEBSITES

Sl. No.   Website Name         Total Number of Web-pages downloaded
1         freshersworld.com    4994
2         theatrelinks.com     469
3         indiagsm.com         34
4         rediff.com           163
5         w3.org               2333
6         indiafm.com          7087
7         nokia-asia.com       193
8         amazon.com           349

TABLE II: COMPARATIVE STUDY ON TIME TAKEN BY SINGLE, PARALLEL & HIERARCHICAL CRAWLERS
(Note: Number of Parallel Crawlers = 2 for this experimentation.)

Sl. No.   Time taken by Single Crawler   Time taken by Parallel Crawler   Time taken by Hierarchical Crawler
1         9 hr. 30 min.                  4 hr. 45 min.                    58 min.
2         44 min.                        22 min.                          10 min.
3         3 min.                         1 min. 30 sec.                   30 sec.
4         3 min.                         1 min. 30 sec.                   20 sec.
5         6 hr.                          3 hr.                            37 min.
6         7 hr.                          3 hr. 30 min.                    53 min.
7         2 hr.                          1 hr.                            17 min.
8         2 hr. 58 min.                  1 hr. 29 min.                    19 min.

V. CONCLUSION

In a Search Engine, SCrw and/or PCrw are used for downloading the Web-pages of selected URLs submitted through a registration procedure. An enhanced methodology is discussed in this paper to minimize the time requirement while crawling the WWW using HCrw. These Crawlers are generated at runtime using the number of URLs present within the SQ at any depth level of the concerned hierarchy. After downloading its specific Web-page, each Crawler is destroyed automatically. At any time instance, the maximum number of Web-pages available for concurrent downloading depends on the allowable bandwidth of the system.

REFERENCES

[1] Sergey Brin, Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Proceedings of the Seventh International World Wide Web Conference, Brisbane, Australia, April 1998.
[2] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan, "Searching the Web," ACM Transactions on Internet Technology, Volume 1, Issue 1, August 2001.
[3] Soumen Chakrabarti, Byron E. Dom, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, David Gibson, Jon Kleinberg, "Mining the Web's Link Structure," IEEE Computer, 32(8), August 1999, pp. 60-67.
[4] Debajyoti Mukhopadhyay, Sajal Mukherjee, Soumya Ghosh, Saheli Kar, Young-Chon Kim, "Architecture of A Scalable Dynamic Parallel WebCrawler with High Speed Downloadable Capability for a Web Search Engine," The International Workshop MSPT 2006 Proceedings, Youngil Publication, Republic of Korea, November 2006, pp. 103-108.