WEBTracker: A Web Crawler for Maximizing Bandwidth Utilization

SUST Journal of Science and Technology, Vol. 16, No. 2, 2012; pp. 32-40
(Submitted: February 13, 2011; Accepted for Publication: July 30, 2012)

Md. Ruhul Amin 1, Mohiul Alam Prince 2 and Md. Akter Hussain 1
1 Dept. of Computer Science and Engineering, Shahjalal University of Science and Technology, Sylhet, Bangladesh.
2 Software Engineer, Structured Data Systems Ltd, Dhaka, Bangladesh.
shajib-cse@sust.edu, maprince@gmail.com, akter.1985@yahoo.com

Abstract

The most challenging parts of a web crawler are downloading content at the fastest possible rate, so that the allocated bandwidth is fully utilized, and processing the downloaded data so that the downloader never starves. Our scalable web crawling system, named WEBTracker, has been designed to meet this challenge. It can be used efficiently in a distributed environment to maximize downloading. WEBTracker has a Central Crawler Server that administers all the crawler nodes. At each crawler node, a Crawler Manager runs the downloader and manages the downloaded content. The Central Crawler Server and its Crawler Managers are members of a Distributed File System, which ensures synchronized distributed operation of the system. In this paper we concentrate only on the architecture of a web crawling node, which is owned by the Crawler Manager. We show that our crawler architecture makes efficient use of the allocated bandwidth, keeps the processor lightly loaded while processing the downloaded content, and makes efficient use of run-time memory.

Keywords: WEBTracker, Web Crawler, Information Retrieval, World Wide Web.

1. Introduction

A web crawler collects web content from the World Wide Web automatically and stores it [1-4]. Web crawlers are mainly used to create a copy of all visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast search services [5-8]. Crawlers can also be used to automate maintenance tasks on a website, such as checking links or validating HTML code, to harvest e-mail addresses (usually for spam), and to collect information on a particular topic. The main steps of web crawling [1-4, 9-11] are:

1. Initialize a URL queue with some seed URLs.
2. Remove a URL from the queue and download that page.
3. Extract the links from that page and store the page in secondary storage.
4. Store in the queue the extracted links that have to be fetched in future.
5. Repeat steps 2, 3 and 4 until the queue is empty.

Developing a web crawler that downloads a few pages per second for a short period of time is an easy task, but building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges. It requires extensive experimental work on the system design to ensure I/O and network efficiency and manageability of the downloaded content. Hence, to develop a high-performance web crawler we need to address the following challenges [1-4]:

1. An efficient multi-threaded web content downloader.
2. Maintaining politeness according to the robots exclusion protocol.
3. Efficient URL extraction, normalization and duplicate URL detection.
4. Detecting duplicate web content during crawling.
5. Web content management for fast storage and retrieval.
6. Bandwidth management and a well-organized distribution policy.

To achieve these goals we have developed a scalable distributed crawling system, WEBTracker, in which the Central Crawler Server provides the domain links to the Crawler Manager running at each crawling node. The Crawler Manager runs the downloader, which is never interrupted unless its URL queue is empty. The downloaded web content is processed in batch mode for URL extraction, URL-seen checking and content storage. All of our algorithms require little run-time memory and are optimized for file-based operations. The implementation is now under experimental observation, and the performance measurements of a single crawling node are reported in this paper. In a nutshell, the implemented crawler shows that a high-performance web crawler should have two properties:

1. A downloader that is, as far as possible, never paused.
2. All other work of the crawler is batch processed.

2. Architecture of a Web Crawling Node

The Crawler Manager owns the web crawling node and controls the following four major modules:

1. Downloader
2. LinkExtractorManager
3. URLSeen
4. HostHandler

The basic architecture of a WEBTracker crawler node is shown in Figure 1.

Figure 1. Basic Architecture of a Web Crawling Node.
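The paper gives no code for the node as a whole; the following Python skeleton is only a sketch of how the Crawler Manager could coordinate the four modules of Figure 1 so that the downloader is never paused. All method names used here (next_unseen_url, extract_for, download) are our own inventions for illustration, not part of WEBTracker.

```python
import threading
from queue import Queue

class CrawlingNode:
    """Sketch of one crawling node: the Crawler Manager pops a domain from the
    HostQueue, hands one URL of that domain to a Downloader thread, and leaves
    link extraction, URL-seen checking and host handling to batch jobs so that
    the downloader itself is never interrupted."""

    def __init__(self, seed_domains, downloader, link_extractor, url_seen, host_handler):
        self.host_queue = Queue()
        for domain in seed_domains:
            self.host_queue.put(domain)
        self.downloader = downloader            # module of Section 2.2
        self.link_extractor = link_extractor    # module of Section 2.3
        self.url_seen = url_seen                # module of Section 2.4
        self.host_handler = host_handler        # module of Section 2.5

    def run(self):
        while True:
            domain = self.host_queue.get()                 # next domain ready for download
            url = self.link_extractor.next_unseen_url(domain)
            if url is None:
                # No unseen URL ready: request batch link extraction in the background;
                # the extractor re-queues the domain when it acknowledges.
                threading.Thread(target=self.link_extractor.extract_for,
                                 args=(domain,)).start()
                continue
            # One URL of one domain at a time, each in its own thread (politeness).
            threading.Thread(target=self.downloader.download,
                             args=(url, domain, self.host_queue)).start()
```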

Table 1 below defines several terms that are needed to understand how our implemented web crawler works.

Table 1. Terms used by our crawler.

Term                  | Function                                               | Producer Module                | Consumer Module        | Elements/Block
Host Queue            | Maintains a queue of the downloading domains           | Central Crawler Server or User | Crawler Manager        | Maximum 100
RepositoryPath        | Saves domain contents                                  | Crawler Manager                | Downloader             | -
WebPagePathBlockRepo  | Saves the physical storage paths of web pages          | Crawler Manager                | Link Extractor Manager | 100
ExtractedURLBlockRepo | Saves extracted URLs (internal) from downloaded pages  | Link Extractor Manager         | URL Seen               | 5000
ExternalURLBlockRepo  | Saves extracted URLs (external) from downloaded pages  | Link Extractor Manager         | Host Handler           | 1000
UnseenURLBlockRepo    | Saves unseen URLs from unchecked extracted URLs        | URL Seen                       | Crawler Manager        | -

The producer modules maintain their own configuration files, which keep track of how many blocks have been created in each repository, and the consumer modules maintain their own configuration files to keep track of how many blocks have been processed from the repositories. The file structure of the WEBTracker repository is shown in Figure 2.

Figure 2. File structure used by our crawler: the Root directory splits into Domain Contents and DB; each contains one directory per domain, holding the downloaded WebPage and URL data on the Domain Contents side, and the WebPage Path, Extracted URL, Unseen URL and External URL blocks on the DB side.
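The repositories in Table 1 are file-based block queues with a producer side and a consumer side. A minimal Python sketch of such a block repository is shown below; the file names and JSON configuration format are our assumptions, and WEBTracker actually keeps separate configuration files for the producer and the consumer, which this sketch simplifies into one shared file.

```python
import json
import os

class BlockRepository:
    """File-based block queue: the producer appends fixed-size blocks of items
    (e.g. URLs, one per line) and the consumer reads them in order; progress is
    recorded in a small JSON configuration file (format assumed)."""

    def __init__(self, root, block_size):
        self.root = root
        self.block_size = block_size          # e.g. 5000 for ExtractedUrlBlockRepo
        self.config_path = os.path.join(root, "config.json")
        os.makedirs(root, exist_ok=True)
        if os.path.exists(self.config_path):
            with open(self.config_path, encoding="utf-8") as f:
                self.config = json.load(f)
        else:
            self.config = {"produced_blocks": 0, "consumed_blocks": 0}

    def _save_config(self):
        with open(self.config_path, "w", encoding="utf-8") as f:
            json.dump(self.config, f)

    def write_block(self, items):
        """Producer side: persist one block of at most block_size items."""
        block_id = self.config["produced_blocks"]
        path = os.path.join(self.root, f"block_{block_id:06d}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(items[: self.block_size]))
        self.config["produced_blocks"] += 1
        self._save_config()

    def read_next_block(self):
        """Consumer side: return the next unprocessed block, or None if caught up."""
        if self.config["consumed_blocks"] >= self.config["produced_blocks"]:
            return None
        block_id = self.config["consumed_blocks"]
        path = os.path.join(self.root, f"block_{block_id:06d}.txt")
        with open(path, encoding="utf-8") as f:
            items = [line for line in f.read().splitlines() if line]
        self.config["consumed_blocks"] += 1
        self._save_config()
        return items
```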

2.1 Crawler Manager

This module is responsible for maintaining communication with the Central Crawler Server and for controlling the four modules Downloader, LinkExtractorManager, URLSeen and HostHandler. In a distributed crawling system the Crawler Manager is initiated by a command from the Central Crawler Server; in a single-node crawling system it can be initiated directly. The Crawler Manager then searches the crawling node for information about any previous crawl. If such information is found, it first adjusts itself to the previous configuration and resumes crawling exactly from the point where the last crawl ended. Otherwise it only reads the list of domains that have to be downloaded and starts crawling. The important functions of the Crawler Manager are given below.

Reading the robots.txt file

For each domain, the Crawler Manager reads the robots.txt file from the domain URL [1-4, 12, 13] and creates an object representing the configuration written in that file. The Crawler Manager uses an individual thread for each domain, and this configuration is consulted every time a URL of that domain is crawled.

Managing downloads

The list of domains is inserted into the HostQueue for crawling. For a brand-new domain, an unseen-URL repository file named UnseenUrlBlockRepo is created and the domain name itself is used as the starting URL. The Crawler Manager pops a domain name from the HostQueue and reads a URL of this domain from the corresponding unseen-URL block repository. It then assigns this URL to a downloader running in a separate thread. If no URL is available, the Crawler Manager asks the Link Extractor Manager to provide unseen links for that domain, which is also done in a separate thread. Meanwhile the Crawler Manager continues its work in its own independent thread and is never interrupted by the other management work. The maximum number of downloading threads is set according to the Crawler Manager's configuration file.

Post-download processing

When the downloader finishes downloading a URL, it sends an acknowledgement to the Crawler Manager. The Crawler Manager then adds this domain to the HostQueue again, writes the physical path of the downloaded page to the repository named WebPagePathBlockRepo and updates the corresponding configuration files as required.

2.2 Downloader

The Downloader module is the simplest module of this project. The Crawler Manager calls this module with a URL and a repository name (RepositoryPath) for saving the downloaded page. For a particular domain, only one URL (page) is downloaded at a time, in an individual thread. After completing the download, the Downloader sends an acknowledgement to the Crawler Manager, which updates the management files and starts another download from that domain.

Table 2. Pseudocode of the Downloader module.

Downloader(url)
1. request url
2. save page
3. push this domain to the CrawlerManager queue

Figure 3. Downloader Architecture.
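The paper gives only the pseudocode of Table 2 for the Downloader. A minimal Python sketch of the same behaviour, using the standard-library urllib module, might look as follows; the callback name and the file-naming scheme are assumptions made for illustration, not part of WEBTracker.

```python
import hashlib
import os
import urllib.request

def download(url, repository_path, notify_crawler_manager):
    """Fetch one URL, save the page under repository_path, then acknowledge.

    notify_crawler_manager is a callback (assumed) that pushes the URL's domain
    back onto the Crawler Manager's host queue, as in step 3 of Table 2.
    """
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            body = response.read()
    except Exception:
        body = b""  # failed requests are simply counted as errors by the crawler

    # Derive a stable file name from the URL (naming scheme is our assumption).
    file_name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    os.makedirs(repository_path, exist_ok=True)
    with open(os.path.join(repository_path, file_name), "wb") as f:
        f.write(body)

    # Acknowledge the Crawler Manager so the domain re-enters the HostQueue.
    notify_crawler_manager(url)
```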

2.3 Link Extractor Manager

The Link Extractor Manager is an essential module of this web crawler. Its main task is to extract links from the downloaded web pages of a particular domain. It produces two lists of unseen URLs for each domain: the internal URLs (within the same domain), saved in a repository named ExtractedUrlBlockRepo, and the external URLs (pointing to other domains), saved in a repository named ExternalUrlBlockRepo. For the link extraction of each domain, the Link Extractor Manager creates an individual thread to serve the unseen URLs. When the Crawler Manager needs URLs for a particular domain, it calls the Link Extractor Manager, which first tries to serve the request immediately by providing unseen URLs that have already been extracted from the downloaded pages of that domain. Otherwise it assigns a thread to extract links from the unprocessed downloaded pages of that domain, whose path locations are found in the repository named WebPagePathBlockRepo. The URLs extracted from those pages are written to the repository named ExtractedUrlBlockRepo. As soon as the link extraction for a web page is completed, the Link Extractor Manager calls the URL Seen module to obtain unseen URLs from ExtractedUrlBlockRepo and sends an acknowledgement to the Crawler Manager. The Link Extractor Manager generally parses at most 5000 URLs from the unprocessed web pages and stores them as one block in ExtractedUrlBlockRepo; the URL Seen module then processes this block in batch mode to obtain the unseen URLs.

Table 3. Pseudocode of the LinkExtractorManager and LinkExtract modules.

LinkExtractorManager()
1.  for each element in the queue
2.      if the request is from the CrawlerManager
3.          if this host has extracted links
4.              send them to UrlSeen
5.              open a new block to store links
6.          else
7.              host = requested host
8.      else
9.          host = pop(queue)
10.     if availPage(host)
11.         LinkExtract(hostPage)
12. wait
13. if anything is pushed into the queue
14.     run from step 1

LinkExtract(page)
1. extract the links of that page
2. filter the links
3. save own-domain links in the domain's block
4. save links of other domains in a different block
5. if the block has more than 5000 links
6.     send it to UrlSeen
7.     open a new block
8. push this host to the LinkExtractor queue
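As an illustration of the LinkExtract step in Table 3, the sketch below extracts anchor links with Python's standard html.parser, resolves them against the page URL and splits them into internal and external lists. The function name and return convention are ours; the paper does not describe WEBTracker's actual HTML parser.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class _AnchorCollector(HTMLParser):
    """Collects href values of <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_links(page_url, html_text):
    """Return (internal_links, external_links) for one downloaded page."""
    collector = _AnchorCollector()
    collector.feed(html_text)

    base_host = urlparse(page_url).netloc
    internal, external = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)               # normalize relative links
        if urlparse(absolute).scheme not in ("http", "https"):
            continue                                      # drop mailto:, javascript:, ...
        if urlparse(absolute).netloc == base_host:
            internal.append(absolute)                     # goes to ExtractedUrlBlockRepo
        else:
            external.append(absolute)                     # goes to ExternalUrlBlockRepo
    return internal, external
```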

Figure 4. Architectural Diagram of the Link Extractor Manager.

2.4 URL Seen

This module identifies the unseen URLs of a particular domain and filters out the URLs that are already in the queue or have already been downloaded. On a request from the Link Extractor Manager, this module assigns a thread to each block of unchecked URLs in the ExtractedUrlBlockRepo repository. It builds a red-black tree from the unique URLs retrieved from UnseenUrlBlockRepo, which contains all the unique URLs of that domain from the start of the crawl. For a block of unchecked extracted URLs, the module looks up each URL in the red-black tree to test its uniqueness; if the link is found in the tree, it is filtered out. At the end of this process, the module saves the unseen URLs to the UnseenUrlBlockRepo repository, updates the total numbers of already downloaded links and unique links in the configuration files of the corresponding domain, and sends an acknowledgement to the Link Extractor Manager.

Table 4. Pseudocode of the URL Seen module.

UrlSeen(domain)
1. read the unique-URL block of the domain
2. build an RB-tree from the unique-URL block
3. read a block of unchecked URLs
4. for each URL
5.     if the URL is not found in the RB-tree
6.         insert it into the RB-tree
7.         save it in the checked-URL block
8. push this host to the CrawlerManager queue

Figure 5. URL Seen Architecture.
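The paper uses a red-black tree for the uniqueness test; Python's standard library has no red-black tree, so the sketch below uses a set, which gives the same membership semantics (the balanced-tree ordering is not reproduced). The file layout (one URL per line) and the argument names are assumptions for illustration.

```python
def url_seen(unique_url_file, unchecked_url_file, out_unseen_file):
    """Batch-filter one block of unchecked URLs against the known unique URLs.

    unique_url_file    : block of all unique URLs of the domain so far
                         (UnseenUrlBlockRepo in the paper)
    unchecked_url_file : one block of freshly extracted URLs
                         (ExtractedUrlBlockRepo in the paper)
    out_unseen_file    : file to which newly discovered unseen URLs are appended
    """
    # Load the known URLs into an in-memory search structure (set instead of RB-tree).
    with open(unique_url_file, encoding="utf-8") as f:
        seen = {line.strip() for line in f if line.strip()}

    # Keep only URLs that are not already known.
    new_urls = []
    with open(unchecked_url_file, encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if url and url not in seen:
                seen.add(url)
                new_urls.append(url)

    # Persist the unseen URLs so the Crawler Manager can schedule them later.
    with open(out_unseen_file, "a", encoding="utf-8") as f:
        for url in new_urls:
            f.write(url + "\n")
    return len(new_urls)
```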

2.5 Host Handler

This module is an important part of our system. In a single-node web crawler it provides the unseen external domain URLs to the Crawler Manager, whereas in a distributed crawling system it sends those URLs to the Central Crawler Server instead. The Host Handler collects the unseen host URLs from the repository named ExternalUrlBlockRepo.

Table 5. Pseudocode of the HostHandler module.

HostHandler()
1. read the block of hosts for download
2. if it is sufficient for the requested hosts
3.     send them
4. else
5.     read all the other host blocks
6.     find the hosts that have not been downloaded
7.     merge all hosts for download
8.     send them
9.     store the rest of the hosts for later download

Figure 6. Host Handler Architecture.
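A compact Python sketch of the HostHandler behaviour in Table 5 follows. The block file layout (one host per line) and the name of the already-downloaded host list are assumptions made for illustration only.

```python
def host_handler(external_block_files, downloaded_hosts_file, requested_count):
    """Return up to requested_count hosts that have not been downloaded yet.

    external_block_files : block files produced by the Link Extractor Manager
                           (ExternalUrlBlockRepo in the paper), one host per line
    downloaded_hosts_file: list of hosts already crawled (format assumed)
    """
    with open(downloaded_hosts_file, encoding="utf-8") as f:
        downloaded = {line.strip() for line in f if line.strip()}

    selected, leftover = [], []
    for block_file in external_block_files:
        with open(block_file, encoding="utf-8") as f:
            for line in f:
                host = line.strip()
                if not host or host in downloaded:
                    continue                    # skip hosts already crawled
                if len(selected) < requested_count:
                    selected.append(host)       # send these now (steps 1-3, 8)
                else:
                    leftover.append(host)       # store the rest for later (step 9)
    return selected, leftover
```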

3. Performance Analysis

We have used a single-node web crawler to measure the performance of our implemented system. After a couple of test runs and bug fixes, we have taken the data of the latest experiment. Our crawler machine has 2 GB of RAM, a Core2Duo 2.8 GHz processor and 2 Mbps (15 MB per minute) of bandwidth for crawling. The experiment was started with 40 seed URLs, and bandwidth utilization was recorded for 1000 minutes. Figures 7 to 10 show the outcomes of this experiment.

Figure 7 shows the number of pages downloaded in each minute, and Figure 8 shows the download size in KB per minute; the average number of pages downloaded per minute and the average download size per minute were also measured, whereas the maximum bandwidth provided to our system is 15 MB per minute.

Now, from Figure 7 we see that from minute 551 to minute 951 the number of pages downloaded per minute decreases, while Figure 8 shows that the downloaded data size is larger over that span. In other words, the product of the average page size and the number of pages downloaded per minute remains approximately constant, so we can conclude that the crawler's throughput is stable.

Figure 9 shows the number of HTTP requests sent in each minute by the implemented crawler, and Figure 10 shows the number of HTTP errors received in each minute; the average values of both were recorded as well. The experimental results show no unusual spikes, which indicates that the implemented crawler runs without glitches. Therefore, as we deploy more crawler nodes, the total bandwidth consumption should scale up accordingly.

Figure 7. Crawler statistics: total pages downloaded per minute (X-axis: time in minutes, Y-axis: number of downloaded pages).
Figure 8. Crawler statistics: total download size in KB per minute (X-axis: time in minutes, Y-axis: downloaded page size).
Figure 9. Crawler statistics: total number of HTTP requests per minute (X-axis: time in minutes, Y-axis: number of HTTP requests).
Figure 10. Crawler statistics: total number of errors per minute (X-axis: time in minutes, Y-axis: number of errors).
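Figures 7 to 10 are per-minute aggregates of the crawl activity. The paper does not describe how this log is stored; assuming a simple record of (timestamp in seconds, bytes downloaded, HTTP status) per request, which is purely our assumption, the per-minute statistics could be computed as in the sketch below.

```python
from collections import defaultdict

def per_minute_stats(records):
    """Aggregate crawl records into per-minute series like those of Figures 7-10.

    records: iterable of (timestamp_seconds, bytes_downloaded, http_status) tuples;
             this record format is an assumption, not taken from the paper.
    Returns {minute: {"pages": n, "kilobytes": kb, "requests": r, "errors": e}}.
    """
    stats = defaultdict(lambda: {"pages": 0, "kilobytes": 0.0, "requests": 0, "errors": 0})
    for timestamp, size_bytes, status in records:
        minute = int(timestamp // 60)
        bucket = stats[minute]
        bucket["requests"] += 1                      # Figure 9: HTTP requests per minute
        if 200 <= status < 300:
            bucket["pages"] += 1                     # Figure 7: pages per minute
            bucket["kilobytes"] += size_bytes / 1024.0   # Figure 8: KB per minute
        else:
            bucket["errors"] += 1                    # Figure 10: errors per minute
    return dict(stats)
```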

4. Conclusion

In this paper we have discussed how bandwidth utilization can be maximized for a web crawler. The most important technique we have used is that the downloading threads are never interrupted; all other post-download management work is done in independent threads. Moreover, all tasks of link extraction, URL-seen checking and host handling are completed in batch processing mode. Hence, most of the time the processor of our system remains lightly loaded, and since most of our algorithms are file based, RAM utilization is also low. All these considerations ensure that the downloader can use the system to its full extent most of the time. That is why we could achieve 10 MB per minute of web page downloading over a 15 MB per minute bandwidth allocation. In our implementation we maintain politeness explicitly: only one URL of a domain may be downloaded at a time. We have not presented any module for content-seen checking, since that is outside the scope of this paper. Our future work is to deploy a distributed web crawling system with 20 web crawling nodes and at least 1 Gbps of bandwidth.

Acknowledgment

We are grateful to the Dept. of CSE, Shahjalal University of Science and Technology, for providing the technological support to conduct this research work in the IR Lab.

References

[1] Cho, J., Garcia-Molina, H. and Page, L. Efficient Crawling Through URL Ordering. Seventh International World-Wide Web Conference (WWW 1998), April 14-18, 1998, Brisbane, Australia.
[2] Boldi, P., Codenotti, B., Santini, M. and Vigna, S. UbiCrawler: Scalability and fault-tolerance issues. Poster Proceedings of the 11th International World Wide Web Conference, Honolulu, HI. ACM Press, New York.
[3] Boldi, P., Codenotti, B., Santini, M. and Vigna, S. UbiCrawler: A scalable fully distributed web crawler. Software: Practice & Experience, 34(8).
[4] Shkapenyuk, V. and Suel, T. Design and Implementation of a High-Performance Distributed Web Crawler. Proceedings of the 18th International Conference on Data Engineering, p. 357, February 26 - March 1.
[5] Brin, S. and Page, L. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International World-Wide Web Conference, Brisbane, Australia, April 1998.
[6] Google, http://www.google.com.
[7] Boldi, P., Santini, M. and Vigna, S. PageRank: Functional dependencies. ACM Transactions on Information Systems (TOIS), 27(4), pp. 1-23, November 2009.
[8] Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A. and Raghavan, S. Searching the web. ACM Transactions on Internet Technology, 1(1), pp. 2-43, 2001.
[9] Boldi, P., Codenotti, B., Santini, M. and Vigna, S. Trovatore: Towards a highly scalable distributed web crawler. In Poster Proc. of the Tenth International World Wide Web Conference, Hong Kong, China.
[10] Yuwono, B., Lam, S. L., Ying, J. H. and Lee, D. L. A world wide web resource discovery system. In Proceedings of the Fourth International World-Wide Web Conference, Darmstadt, Germany, April 1995.
[11] Bar-Yossef, Z., Berg, A., Chien, S., Fakcharoenphol, J. and Weitz, D. Approximating Aggregate Queries about Web Pages via Random Walks. Proceedings of the 26th International Conference on Very Large Data Bases, September 10-14, 2000.
[12] The Robots Exclusion Standard.
[13] Robots exclusion protocol.
[14] Talim, J., Liu, Z., Nain, Ph. and Coffman, E. G., Jr. Controlling the robots of Web search engines. ACM SIGMETRICS Performance Evaluation Review, 29(1), June 2001.
[15] Witten, I. H., Moffat, A. and Bell, T. C. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, second edition, 1999.
