Survey on Effective Web Crawling Techniques
Priyanka Bandagale, Neha Ravindra Sawantdesai, Rakshanda Umesh Paradkar, Piyusha Prakash Shirodkar
Finolex Academy of Management & Technology, Ratnagiri.

ABSTRACT: Nowadays, internet usage has increased greatly. Extracting relevant information from the web requires suitable search strategies, which led to the invention of web crawlers. A web crawler assists users by crawling a website one page at a time until all its pages have been indexed. Typical goals include finding missing links and detecting communities in complex networks. In this paper, we review web crawling techniques and the architectures of the respective web crawlers, and survey the advantages and disadvantages of each technique.

KEYWORDS: Web crawler, seed URLs, Best First Algorithm.

INTRODUCTION: The web contains a large bulk of information on various topics. Compared to traditional collection repositories such as libraries, the web has no centrally organized content structure. Its data can be downloaded using a web crawler: a program used to download and store web pages, also called a web robot or web spider. Web crawlers are used in various areas; most importantly, a crawler indexes a large set of pages and allows other people to search these indexes. This paper gives an overview of various crawling techniques. The rest of the paper is divided into sections: the first section gives a general idea of web crawling and its architecture; the second discusses each technique in detail with its pros and cons; the third presents the acknowledgement.

I. WEB CRAWLING

A web crawler is a program or automated script which browses the World Wide Web in a methodical, automated manner. This process is called web crawling or spidering.
Many legitimate sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing. Crawlers can also be used to automate maintenance tasks on a website, such as checking links or validating HTML code, and to gather specific types of information from web pages, such as harvesting addresses. The architecture of web crawling is shown below.

Fig 1: Web Crawling Architecture
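The basic loop behind the architecture in Fig. 1 can be sketched in a few lines. This is a minimal illustration, not the paper's own implementation: a frontier queue of URLs to visit, a visited set, and a fetch step. The tiny in-memory `web` dictionary stands in for real HTTP downloading and link extraction, which are assumed here.

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl: fetch a page, store it, queue its out-links."""
    frontier = deque(seed_urls)   # URLs waiting to be visited
    visited = set()               # URLs already downloaded
    pages = {}                    # url -> list of extracted links
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        links = fetch_links(url)  # download + parse; stubbed below
        pages[url] = links
        frontier.extend(l for l in links if l not in visited)
    return pages

# Tiny in-memory "web" standing in for real HTTP fetching.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
result = crawl(["a"], lambda u: web.get(u, []))
```

Starting from seed "a", the crawl reaches all three pages; real crawlers replace `fetch_links` with an HTTP client plus an HTML link extractor.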
II. WEB CRAWLING TECHNIQUES

Web crawling has grown from an emerging technology into an important part of many businesses. The first crawlers were developed for a much smaller web, but today some popular sites alone have millions of pages. Many crawling processes exist, combining different levels of crawling, which can be systematically described as follows:

1. Focused Crawling
2. Parallel Crawling
3. Distributed Crawling
4. Incremental Crawling

1] Focused Crawling

Focused crawling is a technique in which the crawler collects only those web pages that satisfy some specific property. The major task of a focused crawler is to seek out pages relevant to a predefined set of topics. Instead of indexing all accessible web documents, a focused crawler analyzes its crawl boundary to find the most relevant links and avoids unnecessary and irrelevant regions of the web. A focused crawler is also known as a topical crawler. It is economically feasible in terms of resources, and its major advantage is that it can also reduce network traffic. As shown in Figure 2, the architecture of a focused crawler has three major components:

Classifier: Makes the page-relevancy decision that controls link expansion.
Distiller: Identifies topic-related pages to determine the priority of pages.
Crawler: Fetches the pages suggested by the distiller.

Fig 2: Architecture of Focused crawler

The crawler consists of a single watchdog thread and many worker threads. The watchdog checks for new work from the crawl frontier, which is passed on to workers using shared memory buffers.
Workers save the details of newly explored pages privately. With respect to how they determine relevant web pages, focused crawler approaches are categorized into:

Ontology based focused crawler
Structure based focused crawler
Context based focused crawler
Priority based focused crawler
Learning based focused crawler

Advantages:
a) It acquires relevant pages steadily while other crawlers quickly lose their way, even though they start from the same point.
b) It can easily discover valuable web pages that are at a longer distance from the seed set, and also prune pages that lie within the same radius. Thus high-quality collections of web documents on a particular topic can be built.

Drawbacks:
a) It has to maintain a count of how frequently to revisit each page.
b) It must select the correct URL for extracting relevant information.
c) Ranking and ordering relevant URLs to determine the relevance of a web page must be done constantly.
d) Machines are unable to understand the information due to the lack of a universal format.
e) Uncertainty of the information.

2] Parallel Crawling

As the size of the web grows, it becomes necessary to run parallel crawling processes, which helps download a large number of pages in a reasonable amount of time. Parallel crawling is a web crawling technique that runs multiple processes, called 'C-procs', in parallel; each process can run on a different workstation in a network. Crawlers based on the parallel crawling technique mainly depend on page freshness and page selection. Such a crawler can be located on a local network or distributed at geographically distant locations.
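The C-proc idea can be sketched minimally with threads sharing one frontier; this is an illustrative stand-in for the intra-site case (real C-procs would be separate processes on separate workstations, and `fetch_links` is again a stub, not part of the original text):

```python
import queue
import threading

def parallel_crawl(seed_urls, fetch_links, n_procs=3):
    """Each 'C-proc' is modeled as a thread pulling URLs from a shared frontier."""
    frontier = queue.Queue()
    for u in seed_urls:
        frontier.put(u)
    visited, pages = set(), {}
    lock = threading.Lock()   # guards visited/pages to limit overlap

    def c_proc():
        while True:
            try:
                url = frontier.get(timeout=0.2)
            except queue.Empty:
                return            # no work left: this C-proc stops
            with lock:
                if url in visited:
                    continue      # another C-proc already took this page
                visited.add(url)
            links = fetch_links(url)   # downloads happen concurrently
            with lock:
                pages[url] = links
                for l in links:
                    if l not in visited:
                        frontier.put(l)

    threads = [threading.Thread(target=c_proc) for _ in range(n_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pages

web = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
pages = parallel_crawl(["a"], lambda u: web.get(u, []))
```

The shared visited set is what keeps the C-procs from downloading the same page twice; in a true distributed setting that coordination itself becomes the communication overhead discussed below.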
The architecture for parallel crawling is as follows:

Fig 3: Architecture of Parallel crawler

As shown in Figure 3, C-procs are the multiple crawling processes carried out by a parallel crawler. Each C-proc performs the basic tasks: it downloads pages from the web, stores them locally, extracts URLs from the downloaded pages, and follows the respective links. Each C-proc consists of two major parts:

Connected pages
Queues of URLs to be visited

Based on the location of the C-procs, parallel crawlers are classified into two categories:

Intra-site parallel crawler: In this type of crawler, all C-procs run on the same local network and communicate with each other over a high-speed interconnect, i.e. a LAN. Here, as all C-procs use the same local network, the network load from all of the C-procs is centralized at a single location.
Distributed crawler: In this type of crawler, the C-procs run at geographically distant locations connected by the internet. A distributed crawler can disperse, and thereby reduce, the load on the overall network.

Advantages:
a) Scalability: Considering the size of the web, a single crawler cannot download pages in a reasonable amount of time, whereas a parallel crawler can achieve the required download rate.
b) Network load reduction: Because it performs parallel processing, a parallel crawler can reduce the network load; this is its major advantage.

Drawbacks:
a) Overlap: While running multiple processes in parallel, different processes may download the same page multiple times. Such overlapping downloads should be minimized to save network bandwidth and to increase the crawler's effectiveness.
b) Quality: To increase the quality of the downloaded collection, a crawler first tries to download important pages. In a parallel crawler, each C-proc may not be aware of the whole image of the web downloaded so far, and may thus make poor crawling decisions.
c) Communication bandwidth: Crawling processes need to communicate with each other periodically to prevent overlap and to increase quality; however, this communication overhead may grow significantly.

3] Incremental Crawling

Today's web contains a vast number of pages, some more important than others. To provide valuable pages to the user, the incremental crawling technique is used. Incremental crawling refreshes the previously collected pages by visiting them frequently, and exchanges less important pages in the collection for new, more important ones.
It thereby also resolves the problem of content consistency. The architecture of incremental crawling is shown below:

Fig 4: Architecture of Incremental Crawling

In the incremental crawler architecture, the UpdateModule constantly extracts the top entry from AllUrls, requests the CrawlModule to fetch the page, and puts the crawled URL back into AllUrls. It records the last version of each page and compares it with the currently crawled page to detect changes. The CrawlModule crawls pages and updates them in the collection; it also extracts all URLs in each page and sends them to AllUrls.

Advantages:
a) Freshness: The crawler provides fresh pages.
b) Quality: This technique retains the more important data, so the crawler provides good-quality data.
c) Only valuable data is provided to the user, so network bandwidth is saved.
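The UpdateModule's record-and-compare step can be sketched with page checksums; this is a simplified illustration (the class and method names are ours, not from the paper's architecture, and `fetch_page` is a stub for the CrawlModule):

```python
import hashlib

class UpdateModule:
    """Keeps a checksum of each crawled page; re-crawls and reports changes."""

    def __init__(self, fetch_page):
        self.fetch_page = fetch_page
        self.collection = {}   # url -> checksum of the last crawled version

    def refresh(self, urls):
        """Revisit each URL, compare with the last version, update if changed."""
        changed = []
        for url in urls:                       # entries taken from the URL queue
            content = self.fetch_page(url)     # CrawlModule fetches the page
            digest = hashlib.sha256(content.encode()).hexdigest()
            if self.collection.get(url) != digest:
                self.collection[url] = digest  # replace stale copy in collection
                changed.append(url)
        return changed

# Simulated site: page "a" changes between two refresh cycles.
site = {"a": "v1", "b": "v1"}
um = UpdateModule(lambda u: site[u])
first = um.refresh(["a", "b"])    # everything is new on the first visit
site["a"] = "v2"
second = um.refresh(["a", "b"])   # only the modified page is reported
```

Real incremental crawlers additionally prioritize which URLs to revisit first (e.g. by estimated change rate) rather than scanning the whole queue each cycle.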
Drawbacks:
a) Slower recovery, as all important pages must be restored.
b) If one of the previous backups fails, recovery will be incomplete.

4] Distributed Crawling

Distributed web crawling is a distributed computing technique in which internet search engines employ many computers to index the internet via web crawling. Through such systems, users are allowed to voluntarily offer their own computing and bandwidth resources towards crawling web pages. There are two types of assignment policies:

1. Dynamic assignment

In this type of policy, a central server assigns new URLs to the different crawlers dynamically. This allows the central server, for instance, to balance the load of each crawler dynamically. With dynamic assignment, the system can also add or eliminate downloader processes. For large crawls, most of the workload must be transferred to the distributed crawling processes, as the central server may otherwise become the bottleneck. Shkapenyuk and Suel described two configurations of crawling architectures with dynamic assignment:

A small crawler configuration, in which there is a central DNS resolver, central queues per website, and distributed downloaders.
A large crawler configuration, in which the DNS resolver and the queues are also distributed.

2. Static assignment

In this type of policy, a fixed rule stated from the beginning of the crawl defines how to assign new URLs to the crawlers. For this type of assignment, a hashing function is used to transform URLs into a number that corresponds to the index of the corresponding crawling process. Since external links go from a website assigned to one crawling process to a website assigned to a different crawling process, some exchange of URLs must occur, and the overhead of this exchange needs to be reduced.
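A static assignment rule is typically just a hash of the host name, so that every URL of one site maps to the same crawling process. A minimal sketch (the function name and the choice of MD5 are illustrative assumptions, not from the source):

```python
import hashlib
from urllib.parse import urlsplit

def assign(url, n_procs):
    """Static assignment: hash the host so one site maps to one C-proc."""
    host = urlsplit(url).netloc                 # hash the site, not the full URL
    digest = hashlib.md5(host.encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_procs

# Both pages of example.com are assigned to the same crawling process.
a = assign("http://example.com/page1", 4)
b = assign("http://example.com/page2", 4)
```

Hashing the host rather than the full URL keeps per-site politeness (request spacing, robots.txt) local to a single process; only cross-site links then need to be exchanged between processes.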
Therefore, the exchange should be done in batches of several URLs at a time, and the most cited URLs in the collection should be known by all crawling processes before the crawl. The architecture for distributed crawling is given below:

Fig 5: Distributed Web Crawling Architecture

Advantages:
a) It is robust against system crashes and other events.
b) It is more scalable and memory efficient.
c) It also offers increased overall download speed and reliability.
Drawback: A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages.

III. ACKNOWLEDGEMENT

This research paper was made possible by the support of Dr. Vinayak Bharadi, HOD, IT Department, FAMT, and Ms. Priyanka Bandagale, Project Guide. We would like to express our great gratitude to Ms. Priyanka Bandagale for her kind advice on the project and precious information.

REFERENCES:
[1] H.S. Koppula, K.P. Leela, A. Agarwal, K.P. Chitrapura, S. Garg and A. Sasturkar, "Learning URL Patterns for Webpage De-Duplication", Proc. Third ACM Conf. Web Search and Data Mining.
[2] L. Zhang, B. Liu, S.H. Lim and E. O'Brien-Strain, "Extracting and Ranking Product Features in Opinion Documents", Proc. 23rd Int'l Conf. Computational Linguistics.
[3] X.Y. Song, J. Liu, Y.B. Cao and C.-Y. Lin, "Automatic Extraction of Web Data Records Containing User-Generated Content", Proc. 19th CIKM, pp. 39-48.
[4] Mohit Malhotra (2013), Web Crawler and Its Concepts.
[5] András Nemeslaki, Károly Pocsarovszky (2011), "Web crawler research methodology", 22nd European Regional Conference of the International Telecommunications Society.
[6] V. Shkapenyuk and T. Suel (2002), "Design and Implementation of a High-Performance Distributed Web Crawler", Proc. 18th International Conference on Data Engineering.
[7] Ahmed Patel, Nikita Schmidt, "Application of structured document parsing to focused web crawling", Elsevier, 33 (2011).
[8] B. Ganter, R. Wille, Formal Concept Analysis: Mathematical Foundations, Springer-Verlag, Berlin.
[9] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze (2008), Introduction to Information Retrieval, Cambridge University Press.
[10] David Vallet, Pablo Castells, Miriam Fernández, Phivos Mylonas and Yannis Avrithis (2007), "Personalized Content Retrieval in Context Using Ontological Knowledge", IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 3.
[11] Filippo Menczer, Gautam Pant, Padmini Srinivasan (2004), "Topical Web Crawlers: Evaluating Adaptive Algorithms", ACM Transactions on Internet Technology.
[12] F. Menczer, A.E. Monge, "Scalable web search by adaptive online agents: an InfoSpiders case study", in: Intelligent Information Agents: Agent-Based Information Discovery and Management on the Internet, Springer, Berlin, 1999.
[13] Geir Solskinnsbakk, Jon Atle Gulla, "Combining ontological profiles with context in information retrieval", Elsevier, 69 (2010).
[14] Ah Chung Tsoi, Daniele Forsali, Marco Gori, Markus Hagenbuchner, Franco Scarselli, "A Simple Focused Crawler", Proc. 12th International WWW Conference, 2003 (poster), pp. 1.
[15] Bharat Bhushan, Narender Kumar, "Intelligent Crawling on the Open Web for Business Prospects", IJCSNS International Journal of Computer Science and Network Security, vol. 12, no. 6, June.
More informationSmart Crawler: A Two-Stage Crawler for Efficiently Harvesting Deep-Web Interfaces
Smart Crawler: A Two-Stage Crawler for Efficiently Harvesting Deep-Web Interfaces Rahul Shinde 1, Snehal Virkar 1, Shradha Kaphare 1, Prof. D. N. Wavhal 2 B. E Student, Department of Computer Engineering,
More informationInformation Retrieval
Introduction to Information Retrieval CS3245 12 Lecture 12: Crawling and Link Analysis Information Retrieval Last Time Chapter 11 1. Probabilistic Approach to Retrieval / Basic Probability Theory 2. Probability
More informationAn Efficient Method for Deep Web Crawler based on Accuracy
An Efficient Method for Deep Web Crawler based on Accuracy Pranali Zade 1, Dr. S.W Mohod 2 Master of Technology, Dept. of Computer Science and Engg, Bapurao Deshmukh College of Engg,Wardha 1 pranalizade1234@gmail.com
More informationDATA MINING II - 1DL460. Spring 2014"
DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,
More informationA LITERATURE SURVEY ON WEB CRAWLERS
A LITERATURE SURVEY ON WEB CRAWLERS V. Rajapriya School of Computer Science and Engineering, Bharathidasan University, Trichy, India rajpriyavaradharajan@gmail.com ABSTRACT: The web contains large data
More informationEstimating Page Importance based on Page Accessing Frequency
Estimating Page Importance based on Page Accessing Frequency Komal Sachdeva Assistant Professor Manav Rachna College of Engineering, Faridabad, India Ashutosh Dixit, Ph.D Associate Professor YMCA University
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 Web Search Engine Crawling Indexing Computing
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 CS 347 Notes 12 5 Web Search Engine Crawling
More informationEnhanced Retrieval of Web Pages using Improved Page Rank Algorithm
Enhanced Retrieval of Web Pages using Improved Page Rank Algorithm Rekha Jain 1, Sulochana Nathawat 2, Dr. G.N. Purohit 3 1 Department of Computer Science, Banasthali University, Jaipur, Rajasthan ABSTRACT
More informationINDEXED SEARCH USING SEMANTIC ASSOCIATION GRAPH
INDEXED SEARCH USING SEMANTIC ASSOCIATION GRAPH Kiran P 1, Shreyasi A N 2 1 Associate Professor, Department of CSE, RNS Institute of Technology, Bengaluru, Karnataka, India. 2 PG Scholar, Department of
More informationImplementation of Enhanced Web Crawler for Deep-Web Interfaces
Implementation of Enhanced Web Crawler for Deep-Web Interfaces Yugandhara Patil 1, Sonal Patil 2 1Student, Department of Computer Science & Engineering, G.H.Raisoni Institute of Engineering & Management,
More informationAn Integrated Framework to Enhance the Web Content Mining and Knowledge Discovery
An Integrated Framework to Enhance the Web Content Mining and Knowledge Discovery Simon Pelletier Université de Moncton, Campus of Shippagan, BGI New Brunswick, Canada and Sid-Ahmed Selouani Université
More information[Banjare*, 4.(6): June, 2015] ISSN: (I2OR), Publication Impact Factor: (ISRA), Journal Impact Factor: 2.114
IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY THE CONCEPTION OF INTEGRATING MUTITHREDED CRAWLER WITH PAGE RANK TECHNIQUE :A SURVEY Ms. Amrita Banjare*, Mr. Rohit Miri * Dr.
More informationSearch Engines. Information Retrieval in Practice
Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly
More informationRecommendation on the Web Search by Using Co-Occurrence
Recommendation on the Web Search by Using Co-Occurrence S.Jayabalaji 1, G.Thilagavathy 2, P.Kubendiran 3, V.D.Srihari 4. UG Scholar, Department of Computer science & Engineering, Sree Shakthi Engineering
More informationOleksandr Kuzomin, Bohdan Tkachenko
International Journal "Information Technologies Knowledge" Volume 9, Number 2, 2015 131 INTELLECTUAL SEARCH ENGINE OF ADEQUATE INFORMATION IN INTERNET FOR CREATING DATABASES AND KNOWLEDGE BASES Oleksandr
More informationAn Ameliorated Methodology to Eliminate Redundancy in Databases Using SQL
An Ameliorated Methodology to Eliminate Redundancy in Databases Using SQL Praveena M V 1, Dr. Ajeet A. Chikkamannur 2 1 Department of CSE, Dr Ambedkar Institute of Technology, VTU, Karnataka, India 2 Department
More informationTERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES
TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.
More informationIJESRT. [Hans, 2(6): June, 2013] ISSN:
IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY Web Crawlers and Search Engines Ritika Hans *1, Gaurav Garg 2 *1,2 AITM Palwal, India Abstract In large distributed hypertext
More informationChapter 2: Literature Review
Chapter 2: Literature Review 2.1 Introduction Literature review provides knowledge, understanding and familiarity of the research field undertaken. It is a critical study of related reviews from various
More informationOntology Based Searching For Optimization Used As Advance Technology in Web Crawlers
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 19, Issue 6, Ver. II (Nov.- Dec. 2017), PP 68-75 www.iosrjournals.org Ontology Based Searching For Optimization
More informationFocused Web Crawling Using Neural Network, Decision Tree Induction and Naïve Bayes Classifier
IJCST Vo l. 5, Is s u e 3, Ju l y - Se p t 2014 ISSN : 0976-8491 (Online) ISSN : 2229-4333 (Print) Focused Web Crawling Using Neural Network, Decision Tree Induction and Naïve Bayes Classifier 1 Prabhjit
More informationEnhanced Web Usage Mining Using Fuzzy Clustering and Collaborative Filtering Recommendation Algorithms
International Journal of Mathematics and Statistics Invention (IJMSI) E-ISSN: 2321 4767 P-ISSN: 2321-4759 Volume 4 Issue 10 December. 2016 PP-09-13 Enhanced Web Usage Mining Using Fuzzy Clustering and
More informationPattern Classification based on Web Usage Mining using Neural Network Technique
International Journal of Computer Applications (975 8887) Pattern Classification based on Web Usage Mining using Neural Network Technique Er. Romil V Patel PIET, VADODARA Dheeraj Kumar Singh, PIET, VADODARA
More informationURL ORDERING POLICIES FOR DISTRIBUTED CRAWLERS: A REVIEW
ORDERING POLICIES FOR DISTRIBUTED S: A REVIEW Deepika Assistant Professor, Computer Engineering Department, YMCAUST, Faridabad 121006 Email: deepikapunj@gmailcom Dr Ashutosh Dixit Associate Professor,
More informationA Novel Architecture of Ontology-based Semantic Web Crawler
A Novel Architecture of Ontology-based Semantic Web Crawler Ram Kumar Rana IIMT Institute of Engg. & Technology, Meerut, India Nidhi Tyagi Shobhit University, Meerut, India ABSTRACT Finding meaningful
More informationRanking Techniques in Search Engines
Ranking Techniques in Search Engines Rajat Chaudhari M.Tech Scholar Manav Rachna International University, Faridabad Charu Pujara Assistant professor, Dept. of Computer Science Manav Rachna International
More informationOverview of Web Mining Techniques and its Application towards Web
Overview of Web Mining Techniques and its Application towards Web *Prof.Pooja Mehta Abstract The World Wide Web (WWW) acts as an interactive and popular way to transfer information. Due to the enormous
More informationMining Internet Web Forums through intelligent Crawler employing implicit navigation paths approach
Mining Internet Web Forums through intelligent Crawler employing implicit navigation paths approach Jidugu Charishma M.Tech Student, Department Of Computer Science Engineering, NRI Institute Of Technology,
More informationRELEVANT UPDATED DATA RETRIEVAL ARCHITECTURAL MODEL FOR CONTINUOUS TEXT EXTRACTION
RELEVANT UPDATED DATA RETRIEVAL ARCHITECTURAL MODEL FOR CONTINUOUS TEXT EXTRACTION Srivatsan Sridharan 1, Kausal Malladi 1 and Yamini Muralitharan 2 1 Department of Computer Science, International Institute
More informationIntroduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.
Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How
More informationDesign and Development of an Automatic Online Newspaper Archiving System
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 4, Ver. I (Jul.-Aug. 2016), PP 59-65 www.iosrjournals.org Design and Development of an Automatic Online
More informationP2P Contents Distribution System with Routing and Trust Management
The Sixth International Symposium on Operations Research and Its Applications (ISORA 06) Xinjiang, China, August 8 12, 2006 Copyright 2006 ORSC & APORC pp. 319 326 P2P Contents Distribution System with
More informationA Hierarchical Web Page Crawler for Crawling the Internet Faster
A Hierarchical Web Page Crawler for Crawling the Internet Faster Anirban Kundu, Ruma Dutta, Debajyoti Mukhopadhyay and Young-Chon Kim Web Intelligence & Distributed Computing Research Lab, Techno India
More information