A Scalable, Distributed Web-Crawler*

Ankit Jain, Abhishek Singh, Ling Liu
Technical Report GIT-CC
College of Computing, Atlanta, Georgia

* The manuscript is still under progress.

In this paper we present the design and implementation of a scalable, distributed web-crawler. The motivation for the design of such a system is to effectively distribute crawling tasks across different machines in a peer-to-peer distributed network. Such an architecture leads to scalability and helps tame the exponential growth of the crawl space in the World Wide Web. With experiments on a prototype implementation of the system, we demonstrate its scalability and efficiency.

1. Introduction

A web crawler forms an integral part of any search engine. The basic task of a crawler is to fetch pages, parse them to extract more URLs, and then fetch those URLs to obtain even more URLs. In this process the crawler can also log the pages or perform other operations on the fetched pages according to the requirements of the search engine; most of these auxiliary tasks are orthogonal to the design of the crawler itself. The explosive growth of the web has rendered the simple task of crawling the web non-trivial, and with this rapid increase in the search space, crawling the web is becoming more difficult day by day. But all is not lost: newer computational models are being introduced to make resource-intensive tasks more manageable. The price of computing is decreasing monotonically, and it has become very economical to use several cheap computation units in a distributed fashion to achieve high throughput. The challenge when using such a distributed model is to distribute the computation tasks efficiently, avoiding the overheads of synchronization and consistency maintenance; scalability is also essential for such a model to be usable. In this project, the architecture of a scalable, distributed web crawler has been proposed and implemented.
It has been designed to make use of cheap resources and tries to remove some of the bottlenecks of present crawlers in a novel way. For the sake of simplicity and focus, we worked only on the crawling part, logging only the URLs; other functions can easily be integrated into the design. Section 2 describes the salient features of our design. Section 3 gives an overview of the proposed architecture. Section 4 goes into the details of a crawler entity in our architecture. Section 5 explains the probabilistic hybrid search model. Section 6 briefly describes the implementation of our system. Section 7 discusses the experimental results and their interpretation. In the later sections we present our conclusions and describe the learning experience gained during this project.

2. Salient features of the design

Our major objectives while designing the crawler were:
- Increased resource utilization (multithreaded programming to increase concurrency)
- Effective distribution of crawling tasks with no central bottleneck
- Easy portability
- Limiting the request load imposed on any web server
- Configurability of the crawling tasks
Besides catering to these capabilities, our design also includes a probabilistic hybrid search model. This is implemented using a probabilistic hybrid of the stack and queue ADTs (Abstract Data Types) for maintaining the pending URL lists; details are presented in Section 5. This is a peer-to-peer distributed crawler with no central entity. By using a distributed crawling model we overcome bottlenecks in network throughput, processing capability, database capacity, and storage capacity. The database bottleneck is avoided by dividing the URL space into disjoint subsets, each of which is handled by a separate crawler. Each crawler parses and logs only the URLs that lie in its own URL subset and forwards the rest to the corresponding crawler entity. Every crawler has prior knowledge of a lookup table relating each URL subset to the [IP:PORT] combination identifying the corresponding crawler entity.

3. Distributed crawler

The crawler system consists of a number of crawler entities, which run on distributed sites and interact in a peer-to-peer fashion. Each crawler entity knows its own URL subset as well as the mapping from URL subsets to the network addresses of the corresponding peer crawler entities. Whenever a crawler entity encounters a URL from a different URL subset, the URL is forwarded to the appropriate peer based on this subset-to-entity lookup. Each crawler entity maintains its own database, which stores only the URLs from the subset assigned to that entity. The databases are disjoint and can be combined offline when the crawling task is complete.
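The subset-to-entity lookup described above can be sketched as follows. This is an illustrative sketch in Java (the paper's platform), not the authors' code: the class name, the addresses, and the bit count are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the URL-subset lookup: the low-order bits of the domain-name
// hash select the owning crawler entity via a static subset-to-[IP:PORT]
// table known to every peer.
public class SubsetLookup {
    private final Map<Integer, String> subsetToAddress = new HashMap<>();
    private final int bits; // number of low-order hash bits used

    public SubsetLookup(int bits) {
        this.bits = bits;
        // Hypothetical configuration read at start-up: subset id -> peer address.
        for (int i = 0; i < (1 << bits); i++) {
            subsetToAddress.put(i, "10.0.0." + (i + 1) + ":4000");
        }
    }

    // Extract the domain name from an http(s) URL.
    static String domainOf(String url) {
        String rest = url.substring(url.indexOf("//") + 2);
        int slash = rest.indexOf('/');
        return slash < 0 ? rest : rest.substring(0, slash);
    }

    // Subset id = last 'bits' bits of the domain-name hash
    // (masking also discards the sign of hashCode()).
    public int subsetOf(String url) {
        return domainOf(url).hashCode() & ((1 << bits) - 1);
    }

    public String ownerOf(String url) {
        return subsetToAddress.get(subsetOf(url));
    }
}
```

Because the hash is taken over the domain name only, all URLs under one domain map to the same subset, which is what makes most links "local" to a crawler entity.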
4. Crawler Entity

Each crawler entity consists of several crawler threads, a URL handling thread, a URL packet dispatcher thread, and a URL packet receiver thread. The URL set assigned to each crawler entity is further divided into subsets, one per crawler thread. Each crawler thread has its own pending URL list: the thread picks an element from the list, issues an HTTP fetch request, gets the page, parses it to extract any URLs, and finally puts these URLs on the job queue of the URL handling thread. During initialization the URL handling thread reads the hash-to-[IP:PORT] mapping. The handling thread takes a URL from its job queue and checks whether the URL belongs to the URL set of this crawler entity, based on the last few bits of the hash of the domain name in the URL, in conjunction with the hash-to-[IP:PORT] mapping. If the URL belongs to another entity, the thread puts it on the dispatcher queue and takes the next URL from its job queue. If the URL belongs to its own set, the thread first checks the URL-seen cache; on a cache miss it queries the URL database to check whether the URL has been seen, and inserts the URL into the database. It then puts the URL into the pending list of one of the crawler threads. URLs are assigned to crawler threads based on domain names: each domain name is serviced by exactly one thread, hence only one connection is maintained with any given server, ensuring that the crawler does not overload a slow server. A different hash is used for distributing jobs among the crawler threads than for determining the URL subset. The objective is to isolate the two operations so that there is no correlation between the crawler entity a URL is assigned to and the thread that services it, thus balancing the load evenly across the threads.
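The "different hash" for thread assignment can be sketched as below. This is our illustration, not the authors' implementation: salting the input before hashing is one simple way to get a hash statistically independent of the entity-selection hash.

```java
// Sketch of the two-hash scheme: one hash (over the bare domain) picks
// the crawler entity; an independent, salted hash picks the crawler
// thread within that entity, so the two assignments are decorrelated.
public class ThreadAssigner {
    private final int numThreads;

    public ThreadAssigner(int numThreads) {
        this.numThreads = numThreads;
    }

    // Mix the domain with a salt before hashing so the result is
    // uncorrelated with domain.hashCode() used for entity selection.
    public int threadFor(String domain) {
        int h = ("thread-salt:" + domain).hashCode();
        return Math.floorMod(h, numThreads); // floorMod keeps the index non-negative
    }
}
```

Since the function is deterministic in the domain name, every URL of a given domain lands on the same thread, preserving the one-connection-per-server property.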
The decision to divide the URL space on the basis of domain names was based on the observation that many pages on the web link to pages under the same domain name. Hence, if all URLs with a particular domain name lie in the same URL subset, these URLs need not be forwarded to other crawler entities. This scheme therefore provides an effective strategy to divide the crawl task among the peer-to-peer nodes of the distributed system; we validate this argument in the experiments described in Section 7. The URL dispatcher thread communicates URLs to their corresponding crawler entities. The URL receiver thread collects the URLs received from other crawler entities (i.e., sent by the dispatcher threads of those entities) and puts them on the job queue of the URL handling thread.

5. Probabilistic Search Model

We use a search model that can be configured to behave as DFS, BFS, or a hybrid of both: it behaves as DFS a given fraction of the time and as BFS the rest of the time. We use a probabilistic hybrid of the stack and queue abstract data types to store the list of pending URLs. DFS can be modeled by using a stack for the pending list; a stack maintains last-in-first-out order because elements are pushed and popped at the same end of the list. Similarly, BFS can be modeled by using a queue, which maintains first-in-first-out order by pushing elements at one end of the list and popping them from the other. In short, if we push and pop at the same end we get DFS, and if we push and pop at different ends we get BFS. We use this fact to obtain a hybrid of DFS and BFS: we push elements at one end of the list, and pop elements from the same end with probability p and from the other end with probability 1 - p.
Now, if p = 1 the system behaves as pure DFS, and if p = 0 it behaves as pure BFS. For p anywhere between 0 and 1, the system behaves as DFS p*100% of the time and as BFS the rest of the time: each time we need to pop an element, we decide with probability p whether to take it from the top of the list or from the bottom.
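The push/pop rule above can be sketched with a double-ended queue. This is a minimal sketch; the class name and the use of ArrayDeque are ours, not the authors' JobQueue implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Random;

// Probabilistic hybrid of stack and queue: push always at the head;
// pop from the head with probability p (DFS) and from the tail with
// probability 1 - p (BFS).
public class HybridList<T> {
    private final Deque<T> deque = new ArrayDeque<>();
    private final double p;
    private final Random rng;

    public HybridList(double p, long seed) {
        this.p = p;
        this.rng = new Random(seed);
    }

    public void push(T item) {
        deque.addFirst(item);
    }

    public T pop() {
        if (rng.nextDouble() < p) {
            return deque.pollFirst(); // same end as push: stack / DFS behaviour
        }
        return deque.pollLast();      // opposite end: queue / BFS behaviour
    }

    public boolean isEmpty() {
        return deque.isEmpty();
    }
}
```

With p = 1 the coin always selects the head (nextDouble() is always below 1.0), giving LIFO order; with p = 0 it always selects the tail, giving FIFO order, matching the two limiting cases described in the text.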
Varying the value of p changes the search characteristics, which affects the cache-hit ratio and the coverage of the search. We intended to find the optimum value of p, the one yielding the highest hit rate for both the DNS cache and the URL-seen cache; such a study could lead to a significant improvement in crawler performance. We have implemented this hybrid structure, but due to time constraints we could not perform this study within this project.

6. The Implementation

The system was implemented on the Java platform for portability reasons, and MySQL was used for the URL database. Even though Java is less efficient than languages that compile to native machine code, and none of the team members were proficient with it, we selected Java for this prototype. The reasons behind this decision were to keep the software architecture modular, to make the system portable, and to manage the complexity of such a system. In retrospect this turned out to be a good decision, as we might not have been able to complete the project in time had we implemented it in another language such as C. The comprehensive libraries provided with Java allowed us to concentrate our efforts on the design of the system and its software architecture. A Java class was written for each component of the system (the different kinds of threads, the database interface, synchronized job queues, caches, etc.). First we wrote generic classes for the infrastructure components of the system, such as the synchronized job queues and caches. The LRUCache class implements an approximate LRU cache based on a hash table with overlapping buckets. The JobQueue class implements a generic synchronized job queue with an option for the probabilistic hybrid of the stack and queue ADTs. The main Crawler class performs the initialization by reading the configuration files, spawning the various threads accordingly, and initializing the job queues; it then behaves as the handler thread.
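For reference, a minimal exact LRU cache in Java can be built on LinkedHashMap's access order, as sketched below. Note this only illustrates the role the cache plays (URL-seen / DNS caching); the authors' LRUCache is an *approximate* LRU over a hash table with overlapping buckets, a different data structure.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal exact LRU cache: LinkedHashMap with accessOrder = true keeps
// entries ordered by most-recent access, and removeEldestEntry evicts
// the least-recently-used entry once capacity is exceeded.
public class SimpleLRUCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public SimpleLRUCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true gives LRU ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

A cache of this kind sits in front of the URL database: only on a cache miss does the handler thread pay the cost of a database query.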
A class named CrawlerThread performs the operation of a crawler thread: it gets a URL from its job queue and passes it to the URLlist class. The URLlist class then spawns a new thread that fetches the page, parses it for URL links, and returns the list of these URLs to the CrawlerThread. In Java the URL fetch operation is not guaranteed to return, and in the case of a malicious web server the whole thread could hang waiting for the operation to complete. This is why the URLlist class spawns a new thread for every fetch: the thread runs with a certain time-out, so if the fetch operation does not complete in time, the thread is stopped and normal operation resumes. Spawning a new thread to fetch each page does add overhead, but it is essential for the robustness of the system. The Sender and Receiver classes implement the sender and receiver threads respectively: the Receiver class opens a UDP socket on a pre-determined port and waits for packets, while the Sender class transmits URLs via UDP packets to the appropriate remote node. Besides the classes that form the system architecture described above, we added a Probe thread and a Measurement class: the relevant classes report their measurements to the Measurement class, and the Probe thread asks the Measurement class to output the measurements at configurable periodic intervals.

7. Evaluation and Results

We performed experiments to evaluate the performance and scalability of the system. Our experimental setup consisted of four Sun Ultra-30 machines; one crawler entity ran on each machine, and each entity was configured with 12 crawler threads. During the design we decided to store all the queues in memory, since the cost of memory is low and several cheap computer systems come equipped with 2 GB of RAM; we expected our program never to require more memory.
2 GB of RAM can accommodate about 20 million URLs in the various queues of each entity, and we do not expect the queue size of any particular node to exceed this when the URL space is divided among several nodes. Unfortunately,
we could not arrange such machines for our experiments; instead we ran them on machines with only 128 MB of RAM, with even less memory available to our process. In our experiment [Figure 2] we therefore faced problems due to the unavailability of the required memory: the nodes failed after memory overflow. Arrows in the graph depict node failures. The first of the systems went down after about 12 minutes due to memory overflow; by this time the system had crawled about pages, giving a throughput of documents per second. The second node went down after about twenty minutes; the throughput at this time was pages per second. The third and fourth nodes went down at about 57 minutes with a throughput of about 31.4 pages per second. This result, although not straightforward to interpret due to the failure of the nodes, is still very promising: at about 74 documents per second, one billion pages could be crawled in less than six months. Of course, tests on machines with the required amount of memory need to be performed to corroborate this throughput.
In Figure 3 we show the queue sizes and the number of pages crawled for one of the four systems in the above experiment. As seen in the graph, the number of pages crawled grows fairly linearly, indicating an almost constant throughput throughout the run. The graph also justifies our decision to keep one handler thread per crawler entity: except for a few temporary bursts, the handler-thread queue stays fairly short, so it can be inferred that one handler thread is enough to execute its functions quickly even for multiple crawler threads. The worker queue length identifies the culprit for the memory requirement: it increases at a much higher rate than the rate at which pages are crawled. To study the scalability of the system we computed the scalability factor for 4 nodes:

Scalability Factor = (throughput with 4 nodes working together) / (throughput with 4 nodes working independently)

We calculated the scalability factor after the first ten minutes of the crawler's execution and found it to be 97.9%. This shows extremely good scalability, as the system exhibits only about 2% overhead for distributing the task; i.e., the distribution of the task was quite effective. We also measured the number of URLs that needed to be forwarded to other peer nodes. For this purpose we introduce the distribution factor:

Distribution Factor = (number of local pages found) / (number of pages found)

Here local pages are pages that belong to the same URL subset as their parents; such pages are not forwarded to other peer nodes and do not generate network traffic. Needless to say, the higher the distribution factor the better, as a high value indicates an effective distribution of the crawl space. If the web were a random hyperlink structure, the expected value of the distribution factor would be 25% in our case of 4 nodes. In our experiments we found the distribution factor to be 65% (averaged over more than 100,000 crawled pages).
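The two evaluation metrics can be written out directly as code; the sketch below uses illustrative inputs, not the measured values from the experiment.

```java
// The two evaluation metrics defined above, as plain functions.
public class CrawlMetrics {
    // Throughput of the cooperating nodes divided by the summed
    // throughput of the same nodes running independently; a value
    // near 1.0 means near-zero distribution overhead.
    public static double scalabilityFactor(double togetherThroughput,
                                           double independentThroughput) {
        return togetherThroughput / independentThroughput;
    }

    // Fraction of discovered pages whose URL subset matches their
    // parent's, i.e. pages that need no forwarding to a peer node.
    public static double distributionFactor(long localPages, long totalPages) {
        return (double) localPages / totalPages;
    }
}
```

For example, a distribution factor of 0.65 on 4 nodes, against a 0.25 baseline for random links, is the quantitative form of the domain-locality argument made in Section 4.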
This again validates our claim that dividing the URLs into subsets based on domain names, and then assigning one subset to each node, is an effective distribution of the crawl task. Our next experiment explores the level of concurrency attainable within a single crawler entity of the system. In this experiment we use only one node and measure the performance of the system after 5 minutes while varying the number of crawler threads.
This graph [Figure 4], although not very smooth, provides a clear indication of increased throughput as the number of crawler threads grows, validating our claim of increased resource utilization. Beyond about 48 threads the throughput starts to decrease because of the synchronization overheads of the system; the graph suggests that around 32 to 48 crawler threads per crawler entity may provide optimum performance. In these experiments a single node achieved a throughput of 32 documents per second, again a very promising figure in terms of system performance.

8. Contributions of this project

The biggest contribution of this project is the concept of distributing crawl tasks based on disjoint subsets of the URL crawl space. We also presented a scalable, multi-threaded, peer-to-peer distributed architecture for a web crawler based on this concept. Another interesting contribution is the proposed probabilistic hybrid of depth-first and breadth-first traversal, although we were unable to study its advantages or disadvantages during this project; this traversal strategy achieves a hybrid of the two traditional strategies without any extra book-keeping and is very easy to implement. Finally, we implemented a complete web crawler that demonstrates all of the above concepts.

9. The learning experience

The foundations of this project were laid in discussions on web crawlers and the challenges in their design. Since the web space is growing exponentially, a proposed solution should be scalable, and it should be capable of making good use of a cluster of computers rather than depending on a single large-capacity machine. Discussions about the DFS and BFS navigation strategies for efficient crawling of the web prompted us to experiment with a probabilistic navigation strategy. The papers referred to in class, especially [2], also gave us insights into the design and implementation of such a system.
The design and implementation of this project was a very fruitful hands-on experience and turned out to be a very good design exercise in which we had to deal with real-world system issues. The project was initiated by a task-distribution idea, but to demonstrate the usefulness of the concept we had to design a whole system that exploited this idea in its architecture. While designing the architecture we faced the challenges, decisions, and trade-offs associated with Internet applications. Thus this project covered designing an Internet application from scratch: from design principles to system architecture, and then the implementation and evaluation of the system. We implemented the whole project on our own, using only the Java libraries, which was in itself very useful. Due to the nature of Internet applications such as this one, it is always important to emphasize efficient implementation as well as portability of the system. The system also included components from various domains: a multi-threaded architecture, synchronized job queues, LRU caches, crawlers, other networking components, database query components, and so on. Even though we had studied these components before, in this project we implemented all of them ourselves, which gave us insight into their implementation issues; beyond implementing the individual components, we also gained experience in integrating them to make the whole system work. During the evaluation we designed and executed experiments to validate our claims, which also gave us insight into the proper interpretation of experimental results and the logical derivation of conclusions. Throughout this project we experienced the fact mentioned in class: developing a web crawler is easy, but developing an efficient web crawler is very difficult.
10. Future extensions

Future extensions of the project include implementing the DNS cache in the crawler thread and studying the effect of the hybrid traversal strategy on the various cache-hit rates. A number of issues need to be addressed to make this system usable in the real world. The crawler needs to conform to the robot exclusion protocol, and we need to handle partial failure: although at present the failure of one node will not stop the other components, it would be desirable for another node to take over the task of the node that failed. Dynamic reconfiguration and dynamic load balancing would also be desirable.

11. Related work

The Google [1] web crawler is written in Python; it is single-threaded and uses asynchronous I/O to fetch data over several concurrent connections. The crawler transmits downloaded pages to a single store-server process, which compresses the data and stores it in a repository. Another famous web crawler is Mercator [2], a multithreaded web crawler written in Java. Although Mercator is not distributed, it does divide the URL space as our design does, to guarantee that only one thread will contact a given server. We do not deal with storing web pages or with indexing in this project. Our architecture is both distributed and multithreaded; in this way we increase concurrency within a single machine as well as across the entire system of several computers. We also have a distributed database with no central bottleneck, and we make use of a probabilistic search model for crawling web pages. This combination of features, improved resource utilization, and scalability distinguishes our work from related previous work.

12. Conclusion

Overall, the performance results of the crawler are very promising. We achieved a throughput of 75 documents per second. This is an encouraging result, as even at 31.7 pages per second one billion documents can be crawled in one year.
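The crawl-time arithmetic in the conclusion is easy to check (an illustrative helper, not part of the system):

```java
// Pages crawlable in one year at a sustained per-second throughput.
public class CrawlEstimate {
    public static double pagesPerYear(double pagesPerSecond) {
        return pagesPerSecond * 60 * 60 * 24 * 365; // seconds in a year
    }
}
```

At 31.7 pages per second this gives roughly 1.0 billion pages per year, matching the statement above; at the measured 75 documents per second the same billion would take well under six months.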
We have also validated our claims of scalability and improved resource utilization with the experimental results. Although the results are encouraging, more tests need to be conducted to find out whether such a system can really be useful in real-world situations.

References

[1] Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proceedings of the 7th International World Wide Web Conference, April 1998.
[2] Allan Heydon and Marc Najork. Mercator: A Scalable, Extensible Web Crawler. Compaq Systems Research Center.
[3] Class notes.
More informationCLIENT SERVER ARCHITECTURE:
CLIENT SERVER ARCHITECTURE: Client-Server architecture is an architectural deployment style that describe the separation of functionality into layers with each segment being a tier that can be located
More informationPROCESS VIRTUAL MEMORY. CS124 Operating Systems Winter , Lecture 18
PROCESS VIRTUAL MEMORY CS124 Operating Systems Winter 2015-2016, Lecture 18 2 Programs and Memory Programs perform many interactions with memory Accessing variables stored at specific memory locations
More informationSharePoint 2010 Technical Case Study: Microsoft SharePoint Server 2010 Enterprise Intranet Collaboration Environment
SharePoint 2010 Technical Case Study: Microsoft SharePoint Server 2010 Enterprise Intranet Collaboration Environment This document is provided as-is. Information and views expressed in this document, including
More informationMultiprocessor scheduling
Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.
More informationAddressed Issue. P2P What are we looking at? What is Peer-to-Peer? What can databases do for P2P? What can databases do for P2P?
Peer-to-Peer Data Management - Part 1- Alex Coman acoman@cs.ualberta.ca Addressed Issue [1] Placement and retrieval of data [2] Server architectures for hybrid P2P [3] Improve search in pure P2P systems
More informationCLUSTERING HIVEMQ. Building highly available, horizontally scalable MQTT Broker Clusters
CLUSTERING HIVEMQ Building highly available, horizontally scalable MQTT Broker Clusters 12/2016 About this document MQTT is based on a publish/subscribe architecture that decouples MQTT clients and uses
More information12 Abstract Data Types
12 Abstract Data Types 12.1 Foundations of Computer Science Cengage Learning Objectives After studying this chapter, the student should be able to: Define the concept of an abstract data type (ADT). Define
More informationCOMP3121/3821/9101/ s1 Assignment 1
Sample solutions to assignment 1 1. (a) Describe an O(n log n) algorithm (in the sense of the worst case performance) that, given an array S of n integers and another integer x, determines whether or not
More informationLecture 1: January 23
CMPSCI 677 Distributed and Operating Systems Spring 2019 Lecture 1: January 23 Lecturer: Prashant Shenoy Scribe: Jonathan Westin (2019), Bin Wang (2018) 1.1 Introduction to the course The lecture started
More informationChapter 1 Computer System Overview
Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Ninth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides
More informationInternational Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine
International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains
More informationP2P. 1 Introduction. 2 Napster. Alex S. 2.1 Client/Server. 2.2 Problems
P2P Alex S. 1 Introduction The systems we will examine are known as Peer-To-Peer, or P2P systems, meaning that in the network, the primary mode of communication is between equally capable peers. Basically
More informationCluster-Based Scalable Network Services
Cluster-Based Scalable Network Services Suhas Uppalapati INFT 803 Oct 05 1999 (Source : Fox, Gribble, Chawathe, and Brewer, SOSP, 1997) Requirements for SNS Incremental scalability and overflow growth
More informationCS5412: TRANSACTIONS (I)
1 CS5412: TRANSACTIONS (I) Lecture XVII Ken Birman Transactions 2 A widely used reliability technology, despite the BASE methodology we use in the first tier Goal for this week: in-depth examination of
More informationCrawling the Web. Web Crawling. Main Issues I. Type of crawl
Web Crawling Crawling the Web v Retrieve (for indexing, storage, ) Web pages by using the links found on a page to locate more pages. Must have some starting point 1 2 Type of crawl Web crawl versus crawl
More informationMarket Data Publisher In a High Frequency Trading Set up
Market Data Publisher In a High Frequency Trading Set up INTRODUCTION The main theme behind the design of Market Data Publisher is to make the latest trade & book data available to several integrating
More informationOverview Computer Networking Lecture 16: Delivering Content: Peer to Peer and CDNs Peter Steenkiste
Overview 5-44 5-44 Computer Networking 5-64 Lecture 6: Delivering Content: Peer to Peer and CDNs Peter Steenkiste Web Consistent hashing Peer-to-peer Motivation Architectures Discussion CDN Video Fall
More informationSUMMARY OF DATABASE STORAGE AND QUERYING
SUMMARY OF DATABASE STORAGE AND QUERYING 1. Why Is It Important? Usually users of a database do not have to care the issues on this level. Actually, they should focus more on the logical model of a database
More informationLecture 8: February 19
CMPSCI 677 Operating Systems Spring 2013 Lecture 8: February 19 Lecturer: Prashant Shenoy Scribe: Siddharth Gupta 8.1 Server Architecture Design of the server architecture is important for efficient and
More informationRunning Head: HOW A SEARCH ENGINE WORKS 1. How a Search Engine Works. Sara Davis INFO Spring Erika Gutierrez.
Running Head: 1 How a Search Engine Works Sara Davis INFO 4206.001 Spring 2016 Erika Gutierrez May 1, 2016 2 Search engines come in many forms and types, but they all follow three basic steps: crawling,
More informationDesign and Implementation of A P2P Cooperative Proxy Cache System
Design and Implementation of A PP Cooperative Proxy Cache System James Z. Wang Vipul Bhulawala Department of Computer Science Clemson University, Box 40974 Clemson, SC 94-0974, USA +1-84--778 {jzwang,
More informationRAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE
RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting
More informationOPERATING SYSTEM. Chapter 12: File System Implementation
OPERATING SYSTEM Chapter 12: File System Implementation Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management
More informationDistributed OrcaFlex. 1. Introduction. 2. What s New. Distributed OrcaFlex
1. Introduction is a suite of programs that enables a collection of networked, OrcaFlex licensed, computers to run OrcaFlex jobs as background tasks using spare processor time. consists of four separate
More informationIntroduction. Table of Contents
Introduction This is an informal manual on the gpu search engine 'gpuse'. There are some other documents available, this one tries to be a practical how-to-use manual. Table of Contents Introduction...
More informationSearching the Deep Web
Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index
More informationLecture 7: February 10
CMPSCI 677 Operating Systems Spring 2016 Lecture 7: February 10 Lecturer: Prashant Shenoy Scribe: Tao Sun 7.1 Server Design Issues 7.1.1 Server Design There are two types of server design choices: Iterative
More informationDistributed Systems. 05r. Case study: Google Cluster Architecture. Paul Krzyzanowski. Rutgers University. Fall 2016
Distributed Systems 05r. Case study: Google Cluster Architecture Paul Krzyzanowski Rutgers University Fall 2016 1 A note about relevancy This describes the Google search cluster architecture in the mid
More informationRACS: Extended Version in Java Gary Zibrat gdz4
RACS: Extended Version in Java Gary Zibrat gdz4 Abstract Cloud storage is becoming increasingly popular and cheap. It is convenient for companies to simply store their data online so that they don t have
More informationCHAPTER 5 GENERATING TEST SCENARIOS AND TEST CASES FROM AN EVENT-FLOW MODEL
CHAPTER 5 GENERATING TEST SCENARIOS AND TEST CASES FROM AN EVENT-FLOW MODEL 5.1 INTRODUCTION The survey presented in Chapter 1 has shown that Model based testing approach for automatic generation of test
More informationFinding a needle in Haystack: Facebook's photo storage
Finding a needle in Haystack: Facebook's photo storage The paper is written at facebook and describes a object storage system called Haystack. Since facebook processes a lot of photos (20 petabytes total,
More informationBIG-IP Local Traffic Management: Basics. Version 12.1
BIG-IP Local Traffic Management: Basics Version 12.1 Table of Contents Table of Contents Introduction to Local Traffic Management...7 About local traffic management...7 About the network map...7 Viewing
More informationPROJECT REPORT Thang Tran ( ), Sandeep Wadhwani ( ) and Vineet Kumar ( ) Introduction
PROJECT REPORT Thang Tran (264635), Sandeep Wadhwani (26476122) and Vineet Kumar (26461756) Introduction Zeromq Chat Application Our Zeromq Chat Application provides peer-to-peer instant messaging for
More informationA crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index.
A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. The major search engines on the Web all have such a program,
More informationA Capabilities Based Communication Model for High-Performance Distributed Applications: The Open HPC++ Approach
A Capabilities Based Communication Model for High-Performance Distributed Applications: The Open HPC++ Approach Shridhar Diwan, Dennis Gannon Department of Computer Science Indiana University Bloomington,
More informationA SURVEY ON WEB FOCUSED INFORMATION EXTRACTION ALGORITHMS
INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 A SURVEY ON WEB FOCUSED INFORMATION EXTRACTION ALGORITHMS Satwinder Kaur 1 & Alisha Gupta 2 1 Research Scholar (M.tech
More informationECE519 Advanced Operating Systems
IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (10 th Week) (Advanced) Operating Systems 10. Multiprocessor, Multicore and Real-Time Scheduling 10. Outline Multiprocessor
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY Computer Systems Engineering: Spring Quiz I
Department of Electrical Engineering and Computer Science MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.033 Computer Systems Engineering: Spring 2016 Quiz I There are 15 questions and 13 pages in this quiz booklet.
More informationFor use by students enrolled in #71251 CSE430 Fall 2012 at Arizona State University. Do not use if not enrolled.
Operating Systems: Internals and Design Principles Chapter 4 Threads Seventh Edition By William Stallings Operating Systems: Internals and Design Principles The basic idea is that the several components
More informationChapter 1 Computer System Overview
Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Seventh Edition By William Stallings Objectives of Chapter To provide a grand tour of the major computer system components:
More informationWhite paper ETERNUS Extreme Cache Performance and Use
White paper ETERNUS Extreme Cache Performance and Use The Extreme Cache feature provides the ETERNUS DX500 S3 and DX600 S3 Storage Arrays with an effective flash based performance accelerator for regions
More informationCSC 553 Operating Systems
CSC 553 Operating Systems Lecture 1- Computer System Overview Operating System Exploits the hardware resources of one or more processors Provides a set of services to system users Manages secondary memory
More informationCS 344/444 Computer Network Fundamentals Final Exam Solutions Spring 2007
CS 344/444 Computer Network Fundamentals Final Exam Solutions Spring 2007 Question 344 Points 444 Points Score 1 10 10 2 10 10 3 20 20 4 20 10 5 20 20 6 20 10 7-20 Total: 100 100 Instructions: 1. Question
More informationCHAPTER THREE INFORMATION RETRIEVAL SYSTEM
CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost
More informationUser Manual. Version 1.0. Submitted in partial fulfillment of the Masters of Software Engineering degree.
User Manual For KDD-Research Entity Search Tool (KREST) Version 1.0 Submitted in partial fulfillment of the Masters of Software Engineering degree. Eric Davis CIS 895 MSE Project Department of Computing
More informationThis project must be done in groups of 2 3 people. Your group must partner with one other group (or two if we have add odd number of groups).
1/21/2015 CS 739 Distributed Systems Fall 2014 PmWIki / Project1 PmWIki / Project1 The goal for this project is to implement a small distributed system, to get experience in cooperatively developing a
More informationLab 2: Threads and Processes
CS333: Operating Systems Lab Lab 2: Threads and Processes Goal The goal of this lab is to get you comfortable with writing basic multi-process / multi-threaded applications, and understanding their performance.
More informationApoidea: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web
Apoidea: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web Aameek Singh, Mudhakar Srivatsa, Ling Liu, and Todd Miller College of Computing, Georgia Institute of Technology, Atlanta,
More informationWHITE PAPER: BEST PRACTICES. Sizing and Scalability Recommendations for Symantec Endpoint Protection. Symantec Enterprise Security Solutions Group
WHITE PAPER: BEST PRACTICES Sizing and Scalability Recommendations for Symantec Rev 2.2 Symantec Enterprise Security Solutions Group White Paper: Symantec Best Practices Contents Introduction... 4 The
More informationA Unix process s address space appears to be three regions of memory: a read-only text region (containing executable code); a read-write region
A Unix process s address space appears to be three regions of memory: a read-only text region (containing executable code); a read-write region consisting of initialized data (simply called data), uninitialized
More informationSoftware Architecture
Software Architecture Mestrado em Engenharia Informática e de Computadores COMPANION TO THE FIRST EXAM ON JANUARY 8TH, 2016 VERSION: A (You do not need to turn in this set of pages with your exam) 1. Consider
More informationPERFORMANCE MEASUREMENT OF WORLD WIDE WEB SERVERS
PERFORMANCE MEASUREMENT OF WORLD WIDE WEB SERVERS Cristina Hava & Liam Murphy 1 Abstract The World Wide Web (WWW, or Web) is one of the most important Internet services, and has been largely responsible
More informationFAWN as a Service. 1 Introduction. Jintian Liang CS244B December 13, 2017
Liang 1 Jintian Liang CS244B December 13, 2017 1 Introduction FAWN as a Service FAWN, an acronym for Fast Array of Wimpy Nodes, is a distributed cluster of inexpensive nodes designed to give users a view
More informationQ.1 Explain Computer s Basic Elements
Q.1 Explain Computer s Basic Elements Ans. At a top level, a computer consists of processor, memory, and I/O components, with one or more modules of each type. These components are interconnected in some
More informationCSci 4061 Introduction to Operating Systems. (Thread-Basics)
CSci 4061 Introduction to Operating Systems (Thread-Basics) Threads Abstraction: for an executing instruction stream Threads exist within a process and share its resources (i.e. memory) But, thread has
More informationSelf Adjusting Refresh Time Based Architecture for Incremental Web Crawler
IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.12, December 2008 349 Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler A.K. Sharma 1, Ashutosh
More informationAdvanced Database Systems
Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed
More informationWeb Crawling. Introduction to Information Retrieval CS 150 Donald J. Patterson
Web Crawling Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Robust Crawling A Robust Crawl Architecture DNS Doc.
More informationScalable overlay Networks
overlay Networks Dr. Samu Varjonen 1 Lectures MO 15.01. C122 Introduction. Exercises. Motivation. TH 18.01. DK117 Unstructured networks I MO 22.01. C122 Unstructured networks II TH 25.01. DK117 Bittorrent
More informationSSIM Collection & Archiving Infrastructure Scaling & Performance Tuning Guide
SSIM Collection & Archiving Infrastructure Scaling & Performance Tuning Guide April 2013 SSIM Engineering Team Version 3.0 1 Document revision history Date Revision Description of Change Originator 03/20/2013
More informationChapter 11: Implementing File Systems
Chapter 11: Implementing File Systems Operating System Concepts 99h Edition DM510-14 Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory Implementation Allocation
More informationAutomated Path Ascend Forum Crawling
Automated Path Ascend Forum Crawling Ms. Joycy Joy, PG Scholar Department of CSE, Saveetha Engineering College,Thandalam, Chennai-602105 Ms. Manju. A, Assistant Professor, Department of CSE, Saveetha Engineering
More informationSearching the Deep Web
Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index
More information