A Hierarchical Web Page Crawler for Crawling the Internet Faster

Anirban Kundu, Ruma Dutta, Debajyoti Mukhopadhyay and Young-Chon Kim
Web Intelligence & Distributed Computing Research Lab, Techno India (West Bengal University of Technology), EM 4/1, Sector V, Salt Lake, Calcutta 700091, India
{anik76in, rumadutta, debajyoti.mukhopadhyay}@gmail.com
Chonbuk National University, Division of Electronics & Information Engineering, 561-756 Jeonju, Republic of Korea
yckim@chonbuk.ac.kr

Abstract - We propose a new hierarchical technique for downloading Web-pages, rather than using Single Crawling or Parallel Crawling in every situation. A Hierarchical Crawler is a Crawler with a hierarchical view: a number of Crawlers are used, depending on the required level of the hierarchy. It resembles a Parallel Crawler in the sense that more than one Crawler is used to download the Web-pages of a Website. On closer inspection, however, these Crawlers are created and managed dynamically at run-time by a threaded program, based on the number of hyperlinks (out-links) of a particular Web-page. Focussed crawling is introduced further. This paper proposes a hierarchical technique for dedicated crawling of Web-pages into different domains.

Index Terms - Seed-Queue (SQ), Single Crawler (SCrw), Focussed Single Crawler (FSCrw), Parallel Crawler (PCrw), Focussed Parallel Crawler (FPCrw), Hierarchical Crawler (HCrw), Focussed Hierarchical Crawler (FHCrw).

I. INTRODUCTION

Web-page crawling is an important issue for downloading Web documents to facilitate the indexing, search, and retrieval of Web-pages on behalf of a Search Engine. Crawlers are also known as Spiders or Robots. The crawling technique has been utilized to accumulate Web-pages of different domains under separate databases. The enormous growth of the World Wide Web (WWW) in recent years has made it important to perform resource discovery efficiently [1-3].
In this paper, we have worked with different types of crawling mechanisms to download Web-pages from the WWW. This paper is organized as follows: In Section 2, we present the crawling concept. Our approach is described in Section 3, along with algorithms. In Section 4, experimental results are shown. Finally, we conclude our work in Section 5.

II. CRAWLING CONCEPT

Most advanced Web Search Engine Crawlers go out to check the Web-pages of a Website. A Web Crawler is also known as a Spider or Robot. A Web Crawler generally performs Search Engine optimization. The Spider first finds the home-page and grabs the information of the head section, signified by the <title> and <meta> tags. Then it looks for the description and keywords. The basic idea of a Crawler is to crawl the Web in a directed or un-directed way. The Crawler collects all the important information about Web-pages on the fly [4]. The Crawler keeps a copy of visited Web-pages for further processing at the server side. The Crawler arrives at the Website and looks into the root folder for a file called robots.txt. The robots.txt file contains information about the permitted directories and files the Crawler is allowed to look at. In the real field, some Crawlers ignore these instructions. Once Web-pages are found within a Website, each and every Web-page is read by the Crawler as per the following sequence.
Title of the Web-page
Meta-tags
Contents of Web-page
URL linkage

III. OUR APPROACH

In our approach, we have designed a Crawler based system to download unique Web-pages depending on the hyperlinks in the Web-pages. A Crawler typically downloads the Web-pages to store into a Repository or Database. Initially, we have designed SCrw for our experiment using Algorithm 1. Two check points are introduced to validate the Web-page and the extracted URLs respectively, as shown in Figure 1. It has been found that a URL may have several synonymous addresses. To download unique Web-pages, we have to take care of this situation. If the downloaded Web-page has a URL which has already been visited by the Crawler, then the downloaded Web-page is removed from the system. For example, http://www.coke.com and http://www.cocacola.com both point to the same http://www.coca-cola.com/indexb.html. Since a Crawler may download Web-pages which are not required for the specific work of interest, FSCrw is further introduced using Algorithm 2 to achieve better performance. In this algorithm, another check point is introduced. This check point is used to check whether the downloaded Web-page has a valid topic or not (Figure 2). After that, we have modified our design to build PCrw to offer better performance at the time of downloading the Web-pages from the Internet. Using N (N > 1) Crawlers, a system can download and further process the Web-pages N times faster than SCrw, and the system overhead is also much lower, since we have used different computer machines for storing the Web-pages of different domains using Algorithm 4. So, in our approach, we have dispersed the whole network load into distinct domain loads. By the PCrw technique, Web coverage within a specific time is larger compared to SCrw. Since we have used distinct Crawlers to download and store the Web-pages of various domains, there is no chance of overlapping (Figure 3).
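The "already visited" check described above depends on recognising synonymous addresses of the same Web-page. A minimal sketch of one way to do this (an assumption for illustration, not part of the paper's system) normalises every URL to a canonical form before consulting the visited set; the function names and sample URLs below are hypothetical:

```python
# Sketch: canonicalise URLs so that synonymous addresses collapse to one
# form before the "already visited" check. Uses only the standard library.
from urllib.parse import urlsplit, urlunsplit
import posixpath

def canonical(url):
    scheme, netloc, path, query, _frag = urlsplit(url)
    netloc = netloc.lower()                  # host names are case-insensitive
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]                 # drop the default HTTP port
    # Resolve "." and ".." segments (this also drops a trailing slash).
    path = posixpath.normpath(path) if path else "/"
    if path == ".":
        path = "/"
    return urlunsplit((scheme, netloc, path, query, ""))

visited = set()

def is_new(url):
    """Return True only the first time a canonical URL is seen."""
    c = canonical(url)
    if c in visited:
        return False
    visited.add(c)
    return True

print(is_new("HTTP://Example.COM:80/a/../index.html"))  # True
print(is_new("http://example.com/index.html"))          # False (duplicate)
```

Note that syntactic normalisation alone cannot detect aliases like the coke.com / coca-cola.com example in the text; catching those additionally requires following HTTP redirects and comparing the final addresses.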
We have used a Transfer Program module to transfer data from one domain to another using Algorithm 3. After parsing the downloaded Web-pages, there can be many URLs containing the addresses of other domains. That is why this type of transfer module is a necessity while designing the whole system. So, dynamic assignment is done for allocating URLs within the SQs of the distinct domains. Again, FPCrw is imposed on this situation to obtain better results using Algorithm 5. The aim of our FPCrw is to selectively search for Web-pages that are relevant to a pre-defined set of topics. The FPCrw analyzes its crawl limit to search the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, as shown in Figure 4. In the next phase of our experimentation, the Crawling system is further enhanced to achieve better performance with respect to time. This is known as the HCrw system. A set of Crawlers is generated dynamically at runtime depending on the number of URLs in SQ. Each Crawler downloads a specific Web-page and then kills itself within its life time, as described in Algorithm 6. HCrw is the combination of PCrw and a depth level of searching. Here, the depth level ensures the penetration limit. For example, if depth=0, then only the initial URLs of SQ would be downloaded through N dynamic Crawlers. If depth=1, then the downloaded Web-pages of depth=0 would be parsed by the Parser module and the new distinct URLs would be saved within SQ for downloading the next set of Web-pages through dynamically created Crawlers, as discussed in Algorithm 7 (Figure 5). Like this, at any depth, the HCrw system would generate N Crawlers, do the needful work, and then kill the Crawlers. The downloading operation is done in parallel. Focussed Crawling (as discussed earlier) is again introduced in HCrw (Algorithm 8) for topic based searching, as shown in Figure 6.
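The HCrw behaviour described above — a fresh batch of short-lived Crawlers per depth level, one per URL in SQ, each destroyed after its single download — can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: `fetch()` and the tiny in-memory link graph `WEB` stand in for real HTTP downloads and the Parser module.

```python
# Sketch of hierarchical crawling: at each depth, one short-lived crawler
# thread per URL in the seed queue (SQ); the distinct, unvisited out-links
# of the downloaded pages become the next depth's SQ.
import threading

WEB = {  # hypothetical link graph used instead of real downloads
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": [],
    "d": [],
}

def fetch(url):
    """Stand-in for downloading a page and parsing out its hyperlinks."""
    return WEB.get(url, [])

def hierarchical_crawl(seeds, depth):
    sq, visited = list(seeds), set()
    for _ in range(depth + 1):            # depth=0 downloads only the seeds
        results = {}

        def crawler(url):                 # one dynamically created crawler
            results[url] = fetch(url)     # download once, then the thread dies

        threads = [threading.Thread(target=crawler, args=(u,)) for u in sq]
        for t in threads:
            t.start()                     # all crawlers of this depth run
        for t in threads:
            t.join()                      # ... and are then "killed"
        visited.update(sq)
        # Distinct, unvisited out-links form the next depth's seed queue.
        sq = sorted({link for links in results.values() for link in links}
                    - visited)
        if not sq:
            break
    return visited

print(sorted(hierarchical_crawl(["a"], depth=1)))  # ['a', 'b', 'c']
```

One thread per URL mirrors the paper's one-page-per-Crawler life cycle; a production system would instead cap concurrency (e.g., with a thread pool) to respect the bandwidth limit mentioned in the conclusion.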
Definition 1 : Seed-Queue (SQ) - A queue, containing a number of URLs as seeds for downloading the Web-pages from the Internet, is known as Seed-Queue.
Definition 2 : Valid Topic - A Web-page contains a valid topic whenever its contents are related to the specified topic.
Definition 3 : Single Crawler (SCrw) - A Single Crawler is a set of programs by which Web-pages are surfed by a Search Engine to download selected Web-pages.
Algorithm 1: Single Crawling
Fig. 1. Pictorial Representation of Single Crawling

Input : A set of URLs within SQ
Output : Database of downloaded Web-pages within the server, irrespective of domain
Step 1 : Crawler starts crawling with URLs
Step 2 : Crawler downloads the Web-page taking a URL from the SQ
Step 3 : Check whether the URL of the downloaded Web-page is already visited
Step 4 : If visited, then discard the Web-page and goto Step 9
Step 5 : Else, save the Web-page
Step 6 : Hyperlinks of each Web-page are extracted using the Parser tool
Step 7 : Check whether the extracted URLs are already visited
Step 8 : Save those extracted hyperlinks, which are still not visited, within SQ as well as in URL storage
Step 9 : If (condition for continuing the downloading is satisfied)
Step 10 : then Goto Step 2
Step 11 : Stop

Definition 4 : Focussed Single Crawler (FSCrw) - A Single Crawler being used for a specific topic is known as Focussed Single Crawler.

Fig. 2. Pictorial Representation of Focussed Single Crawling

Algorithm 2: Focussed Single Crawling
Input : A set of URLs within SQ
Output : Database of focussed downloaded URLs within the server, irrespective of domain
Step 1 : Crawler starts crawling with URLs
Step 2 : Crawler downloads the Web-page taking a URL from the SQ
Step 3 : Check whether the URL of the downloaded Web-page is already visited
Step 4 : If visited, then discard the Web-page and goto Step 11
Step 5 : Else check for valid topic
Step 6 : If the downloaded Web-page does not contain a valid topic, then discard the Web-page and goto Step 11
Step 7 : Else, save the Web-page
Step 8 : Hyperlinks of each Web-page are extracted using the Parser tool
Step 9 : Check whether the extracted URLs are already visited
Step 10 : Save those extracted hyperlinks, which
are still not visited, within SQ as well as in URL storage
Step 11 : If (condition for continuing the downloading is satisfied)
Step 12 : then Goto Step 2
Step 13 : Stop

Fig. 3. Pictorial Representation of Parallel Crawling

Algorithm 3: Transfer Module
Input : Parsed URL and the corresponding downloaded Web-page of a particular domain
Output : Transfer of URL to the domain based SQ
Step 1 : If (parsed URL and input Web-page belong to the same domain)
Step 2 : then goto Step 6
Step 3 : Search for the specific domain
Step 4 : If (domain found)
Step 5 : then transfer the parsed URL to the specific SQ of the related domain
Step 6 : Stop

Definition 5 : Parallel Crawler (PCrw) - A set of Crawlers working concurrently is known as Parallel Crawler.

Algorithm 4: Parallel Crawling
Input : A set of URLs of different domains within different SQs
Output : Databases of downloaded Web-pages within the servers of respective domains
Step 1 : A set of Crawlers start crawling with URLs of their respective domains
Step 2 : Each Crawler downloads the Web-page taking a URL from its respective SQ
Step 3 : Check whether the URL of the downloaded Web-page is already visited
Step 4 : If visited, then discard the Web-page and goto Step 13
Step 5 : Else, save the Web-page
Step 6 : Hyperlinks of each Web-page are extracted using the Parser tool
Step 7 : Check whether the extracted URLs are already visited
Step 8 : If not visited, then check for domain
Step 9 : If extracted hyperlinks belong to the same domain
Step 10 : then save the hyperlinks within the respective SQ and URL storage
Step 11 : If extracted hyperlinks belong to different domains
Step 12 : then use the Transfer Program to transfer the hyperlinks to their respective SQ and URL storage
Step 13 : If (condition for continuing the downloading is satisfied)
Step 14 : then goto Step 2
Step 15 : Stop

Definition 6 : Focussed Parallel Crawler (FPCrw) - A Parallel Crawler being used for a specific set of topics is known as Focussed Parallel Crawler.

Fig. 4. Pictorial Representation of Focussed Parallel Crawling

Algorithm 5: Focussed Parallel Crawling
Input : A set of URLs of different domains within different SQs
Output : Databases of focussed downloaded URLs within the servers of respective domains
Step 1 : A set of Crawlers start crawling with URLs of their respective domains
Step 2 : Each Crawler downloads the Web-page taking a URL from its respective SQ
Step 3 : Check whether the URL of the downloaded Web-page is already visited
Step 4 : If visited, then discard the Web-page and goto Step 15
Step 5 : Else check for valid topic
Step 6 : If the downloaded Web-page does not contain a valid topic, then discard the Web-page and goto Step 15
Step 7 : Else, save the Web-page
Step 8 : Hyperlinks of each Web-page are extracted using the Parser tool
Step 9 : Check whether the extracted URLs are already visited
Step 10 : If not visited, then check for domain
Step 11 : If extracted hyperlinks belong to the same domain
Step 12 : then save the hyperlinks within the respective SQ and URL storage
Step 13 : If extracted hyperlinks belong to different domains
Step 14 : then use the Transfer Program to transfer the hyperlinks to their respective SQ and URL storage
Step 15 : If (condition for continuing the downloading is satisfied)
Step 16 : then goto Step 2
Step 17 : Stop

Definition 7 : Life Cycle of Dynamic Crawler - The Life Cycle of a Crawler refers to the period of time between when a Crawler is created and when it is destroyed.

Algorithm 6: Life Cycle of Dynamic Crawler
Input : A set of URLs
Output : Downloaded Web-pages
Step 1 : Check the number of URLs within SQ
Step 2 : Generate the same number of Crawlers at runtime
Step 3 : Assign each seed (URL) to a specific Crawler
Step 4 : Download all Web-pages
Step 5 : Kill all Crawlers
Step 6 : Stop

Definition 8 : Hierarchical Crawler (HCrw) - A Hierarchical Crawler is a set of dynamically created Crawlers whose number of instances depends on the number of URLs. One Crawler can download only one Web-page, using its assigned URL from the SQ, in its life time. After downloading the Web-pages, these Crawlers are destroyed automatically.

Definition 9 : Depth - The level of searching through the WWW for downloading the required Web-pages in a hierarchical fashion.

Algorithm 7: Hierarchical Crawling
Input : A set of URLs within SQ and the Depth level of searching
Output : Database of downloaded Web-pages
Step 1 : Initialize i with 0
Step 2 : Continue loop while i <= Depth
Step 3 : Call Algorithm 6
Step 4 : Check whether the URLs of the downloaded Web-pages are already visited
Step 5 : Discard already visited Web-pages
Step 6 : Save new Web-pages
Step 7 : Hyperlinks of all saved Web-pages are extracted using the Parser tool
Step 8 : Check whether the extracted URLs are already visited
Step 9 : Save those extracted hyperlinks, which are still not visited, within SQ as well as in URL storage
Step 10 : Increment i by 1
Step 11 : Stop

Fig. 5. Pictorial Representation of Hierarchical Crawling. A Crawler Creation Module generates a set of Crawlers ( C := {C1, C2, ..., Cn} ) for the n URLs ( S := {S1, S2, ..., Sn} ) in SQ; after the Web-pages are downloaded, a Crawler Destruction Module kills them. Note: n URLs are selected at a time by n dynamically created Crawlers, where n = 1, 2, 3, ...

Fig. 6. Pictorial Representation of Focussed Hierarchical Crawling. Note: n URLs are selected at a time by n dynamically created Crawlers, where n = 1, 2, 3, ...

Definition 10 : Focussed Hierarchical Crawler
(FHCrw) - A Hierarchical Crawler being used for a specific topic is known as Focussed Hierarchical Crawler.

Algorithm 8: Focussed Hierarchical Crawling
Input : A set of URLs within SQ and the Depth level of searching
Output : Database of focussed downloaded URLs
Step 1 : Initialize i with 0
Step 2 : Continue loop while i <= Depth
Step 3 : Call Algorithm 6
Step 4 : Check whether the URLs of the downloaded Web-pages are already visited
Step 5 : Discard already visited Web-pages
Step 6 : New Web-pages are checked for a valid topic
Step 7 : Discard the Web-pages which do not contain a valid topic
Step 8 : Save the remaining Web-pages
Step 9 : Hyperlinks of all saved Web-pages are extracted using the Parser tool
Step 10 : Check whether the extracted URLs are already visited
Step 11 : Save those extracted hyperlinks, which are still not visited, within SQ as well as in URL storage
Step 12 : Increment i by 1
Step 13 : Stop

IV. EXPERIMENTAL RESULTS

In this section, experimental results are shown in Tables 1 and 2. In Table 1, a few Websites are listed along with the number of downloaded Web-pages. These Web-pages were downloaded through SCrw, PCrw and HCrw separately, for comparing the time taken by each type of Crawler as shown in Table 2.

TABLE I
SAMPLE DOWNLOADED WEBSITES

Sl. No.   Website Name         Total Number of Web-pages downloaded
1         freshersworld.com    4994
2         theatrelinks.com     469
3         indiagsm.com         34
4         rediff.com           163
5         w3.org               2333
6         indiafm.com          7087
7         nokia-asia.com       193
8         amazon.com           349

TABLE II
COMPARATIVE STUDY ON TIME TAKEN BY SINGLE, PARALLEL & HIERARCHICAL CRAWLERS

Sl. No.   Time taken by        Time taken by        Time taken by
          Single Crawler       Parallel Crawler     Hierarchical Crawler
1         9 hr. 30 min.        4 hr. 45 min.        58 min.
2         44 min.              22 min.              10 min.
3         3 min.               1 min. 30 sec.       30 sec.
4         3 min.               1 min. 30 sec.       20 sec.
5         6 hr.                3 hr.                37 min.
6         7 hr.                3 hr. 30 min.        53 min.
7         2 hr.                1 hr.                17 min.
8         2 hr. 58 min.        1 hr. 29 min.        19 min.

Note: Number of Parallel Crawlers = 2 for this experimentation.
V. CONCLUSION

In a Search Engine, SCrw and/or PCrw are used for downloading the Web-pages of selected URLs submitted through a registration procedure. An enhanced methodology is discussed in this paper to minimize the time requirement while crawling through the WWW using HCrw. These Crawlers are generated at runtime using the number of URLs present within SQ at any depth level of the concerned hierarchy. After downloading its specific Web-page, each Crawler is destroyed automatically. At any time instance, the maximum number of Web-pages available for concurrent downloading depends on the allowable bandwidth of the system.

REFERENCES

[1] Sergey Brin, Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Proceedings of the Seventh International World Wide Web Conference, Brisbane, Australia, April 1998
[2] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan, "Searching the Web," ACM Transactions on Internet Technology, Volume 1, Issue 1, August 2001
[3] Soumen Chakrabarti, Byron E. Dom, Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, David Gibson, Jon Kleinberg, "Mining the Web's Link Structure," IEEE Computer, 32(8): August 1999, pp. 60-67
[4] Debajyoti Mukhopadhyay, Sajal Mukherjee, Soumya Ghosh, Saheli Kar, Young-Chon Kim, "Architecture of a Scalable Dynamic Parallel WebCrawler with High Speed Downloadable Capability for a Web Search Engine," The International Workshop MSPT 2006 Proceedings, Youngil Publication, Republic of Korea, November 2006, pp. 103-108