A Scalable, Distributed Web-Crawler*

Ankit Jain, Abhishek Singh, Ling Liu
Technical Report GIT-CC
College of Computing, Atlanta, Georgia

* The manuscript is still under progress.

In this paper we present the design and implementation of a scalable, distributed web-crawler. The motivation for the design of such a system is to effectively distribute crawling tasks across different machines in a peer-to-peer distributed network. Such an architecture leads to scalability and helps tame the exponential growth of the crawl space in the World Wide Web. With experiments on a prototype implementation of the system, we demonstrate its scalability and efficiency.

1. Introduction

A web crawler forms an integral part of any search engine. The basic task of a crawler is to fetch pages, parse them to extract more URLs, and then fetch those URLs to obtain even more URLs. In this process the crawler can also log the pages or perform other operations on the fetched pages according to the requirements of the search engine; most of these auxiliary tasks are orthogonal to the design of the crawler itself. The explosive growth of the web has rendered the simple task of crawling the web non-trivial, and with this rapid increase in the search space, crawling the web is becoming more difficult day by day. But all is not lost: newer computational models are being introduced to make resource-intensive tasks more manageable. The price of computing is decreasing monotonically, and it has become very economical to use several cheap computation units in a distributed fashion to achieve high throughput. The challenge when using such a distributed model is to distribute the computation tasks efficiently, avoiding the overheads of synchronization and consistency maintenance; scalability is also essential for such a model to be usable. In this project, the architecture of a scalable, distributed web crawler has been proposed and implemented.
It has been designed to make use of cheap resources and tries to remove some of the bottlenecks of present crawlers in a novel way. For the sake of simplicity and focus, we worked only on the crawling part, logging only the URLs; other functions can easily be integrated into the design. Section 2 describes the salient features of our design. Section 3 gives an overview of the proposed architecture. Section 4 goes into the details of a crawler entity in our architecture. Section 5 explains the probabilistic hybrid search model. Section 6 briefly describes the implementation of our system. Section 7 discusses the experimental results and their interpretation. In the later sections we present our conclusions and describe the learning experience gained during this project.

2. Salient features of the design

Our major objectives while designing the crawler were:
- Increased resource utilization (multithreaded programming to increase concurrency)
- Effective distribution of crawling tasks with no central bottleneck
- Easy portability
- Limiting the request load imposed on any web server
- Configurability of the crawling tasks
Besides catering to these capabilities, our design also includes a probabilistic hybrid search model. This is implemented using a probabilistic hybrid of the stack and queue ADTs (Abstract Data Types) for maintaining the pending URL lists; details are presented in Section 5. This is a peer-to-peer distributed crawler with no central entity. By using a distributed crawling model we overcome bottlenecks in network throughput, processing capability, database capacity, and storage capacity. The database bottleneck is avoided by dividing the URL space into disjoint subsets, each of which is handled by a separate crawler. Each crawler parses and logs only the URLs that lie in its own URL subset and forwards the rest to the corresponding crawler entity. Every crawler has prior knowledge of a lookup table relating each URL subset to the [IP:PORT] combination identifying the corresponding crawler entity.

3. Distributed crawler

The crawler system consists of a number of crawler entities, which run on distributed sites and interact in a peer-to-peer fashion. Each crawler entity knows its own URL subset as well as the mapping from URL subsets to the network addresses of the corresponding peer crawler entities. Whenever a crawler entity encounters a URL from a different URL subset, the URL is forwarded to the appropriate peer based on this subset-to-entity lookup. Each crawler entity maintains its own database, which stores only the URLs from the subset assigned to that entity. The databases are disjoint and can be combined offline when the crawling task is complete.
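The subset-to-entity lookup described above can be sketched as follows. This is an illustrative sketch in Java (the paper's platform), not the authors' code: the class name, the addresses, and the bit count are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the URL-subset lookup: the low-order bits of the domain-name
// hash select the owning crawler entity via a static subset-to-[IP:PORT]
// table known to every peer.
public class SubsetLookup {
    private final Map<Integer, String> subsetToAddress = new HashMap<>();
    private final int bits; // number of low-order hash bits used

    public SubsetLookup(int bits) {
        this.bits = bits;
        // Hypothetical configuration read at start-up: subset id -> peer address.
        for (int i = 0; i < (1 << bits); i++) {
            subsetToAddress.put(i, "10.0.0." + (i + 1) + ":4000");
        }
    }

    // Extract the domain name from an http(s) URL.
    static String domainOf(String url) {
        String rest = url.substring(url.indexOf("//") + 2);
        int slash = rest.indexOf('/');
        return slash < 0 ? rest : rest.substring(0, slash);
    }

    // Subset id = last 'bits' bits of the domain-name hash
    // (masking also discards the sign of hashCode()).
    public int subsetOf(String url) {
        return domainOf(url).hashCode() & ((1 << bits) - 1);
    }

    public String ownerOf(String url) {
        return subsetToAddress.get(subsetOf(url));
    }
}
```

Because the hash is taken over the domain name only, all URLs under one domain map to the same subset, which is what makes most links "local" to a crawler entity.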
4. Crawler Entity

Each crawler entity consists of several crawler threads, a URL handling thread, a URL packet dispatcher thread, and a URL packet receiver thread. The URL set assigned to each crawler entity is further divided into subsets, one per crawler thread. Each crawler thread has its own pending URL list: the thread picks an element from the list, issues an HTTP fetch request, gets the page, parses it to extract any URLs, and finally puts these URLs on the job queue of the URL handling thread. During initialization the URL handling thread reads the hash-to-[IP:PORT] mapping. The handling thread takes a URL from its job queue and checks whether the URL belongs to the URL set of this crawler entity, based on the last few bits of the hash of the domain name in the URL, in conjunction with the hash-to-[IP:PORT] mapping. If the URL belongs to another entity, the thread puts it on the dispatcher queue and takes the next URL from its job queue. If the URL belongs to its own set, the thread first checks the URL-seen cache; on a cache miss it queries the URL database to check whether the URL has been seen, and inserts the URL into the database. It then puts the URL into the pending list of one of the crawler threads. URLs are assigned to crawler threads based on domain names: each domain name is serviced by exactly one thread, hence only one connection is maintained with any given server, ensuring that the crawler does not overload a slow server. A different hash is used for distributing jobs among the crawler threads than for determining the URL subset. The objective is to isolate the two operations so that there is no correlation between the crawler entity a URL is assigned to and the thread that services it, thus balancing the load evenly across the threads.
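The "different hash" for thread assignment can be sketched as below. This is our illustration, not the authors' implementation: salting the input before hashing is one simple way to get a hash statistically independent of the entity-selection hash.

```java
// Sketch of the two-hash scheme: one hash (over the bare domain) picks
// the crawler entity; an independent, salted hash picks the crawler
// thread within that entity, so the two assignments are decorrelated.
public class ThreadAssigner {
    private final int numThreads;

    public ThreadAssigner(int numThreads) {
        this.numThreads = numThreads;
    }

    // Mix the domain with a salt before hashing so the result is
    // uncorrelated with domain.hashCode() used for entity selection.
    public int threadFor(String domain) {
        int h = ("thread-salt:" + domain).hashCode();
        return Math.floorMod(h, numThreads); // floorMod keeps the index non-negative
    }
}
```

Since the function is deterministic in the domain name, every URL of a given domain lands on the same thread, preserving the one-connection-per-server property.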
The decision to divide the URL space on the basis of domain names was based on the observation that many pages on the web link to pages under the same domain name. Hence, if all URLs with a particular domain name lie in the same URL subset, these URLs need not be forwarded to other crawler entities. This scheme therefore provides an effective strategy to divide the crawl task among the peer-to-peer nodes of the distributed system; we validate this argument in the experiments described in Section 7. The URL dispatcher thread communicates URLs to their corresponding crawler entities. The URL receiver thread collects the URLs received from other crawler entities (i.e., sent by the dispatcher threads of those entities) and puts them on the job queue of the URL handling thread.

5. Probabilistic Search Model

We use a search model that can be configured to behave as DFS, BFS, or a hybrid of both: it behaves as DFS a given fraction of the time and as BFS the rest of the time. We use a probabilistic hybrid of the stack and queue abstract data types to store the list of pending URLs. DFS can be modeled by using a stack for the pending list; a stack maintains last-in-first-out order because elements are pushed and popped at the same end of the list. Similarly, BFS can be modeled by using a queue, which maintains first-in-first-out order by pushing elements at one end of the list and popping them from the other. In short, if we push and pop at the same end we get DFS, and if we push and pop at different ends we get BFS. We use this fact to obtain a hybrid of DFS and BFS: we push elements at one end of the list, and pop elements from the same end with probability p and from the other end with probability 1 - p.
Now, if p = 1 the system behaves as pure DFS, and if p = 0 it behaves as pure BFS. For p anywhere between 0 and 1, the system behaves as DFS p*100% of the time and as BFS the rest of the time: each time we need to pop an element, we decide with probability p whether to take it from the top of the list or from the bottom.
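The push/pop rule above can be sketched with a double-ended queue. This is a minimal sketch; the class name and the use of ArrayDeque are ours, not the authors' JobQueue implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Random;

// Probabilistic hybrid of stack and queue: push always at the head;
// pop from the head with probability p (DFS) and from the tail with
// probability 1 - p (BFS).
public class HybridList<T> {
    private final Deque<T> deque = new ArrayDeque<>();
    private final double p;
    private final Random rng;

    public HybridList(double p, long seed) {
        this.p = p;
        this.rng = new Random(seed);
    }

    public void push(T item) {
        deque.addFirst(item);
    }

    public T pop() {
        if (rng.nextDouble() < p) {
            return deque.pollFirst(); // same end as push: stack / DFS behaviour
        }
        return deque.pollLast();      // opposite end: queue / BFS behaviour
    }

    public boolean isEmpty() {
        return deque.isEmpty();
    }
}
```

With p = 1 the coin always selects the head (nextDouble() is always below 1.0), giving LIFO order; with p = 0 it always selects the tail, giving FIFO order, matching the two limiting cases described in the text.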
Varying the value of p changes the search characteristics, which affects the cache-hit ratio and the coverage of the search. We intended to find the optimum value of p, the one yielding the highest hit rate for both the DNS cache and the URL-seen cache; such a study could lead to a significant improvement in crawler performance. We have implemented this hybrid structure, but due to time constraints we could not perform this study within this project.

6. The Implementation

The system was implemented on the Java platform for portability reasons, and MySQL was used for the URL database. Even though Java is less efficient than languages that compile to native machine code, and none of the team members were proficient with it, we selected Java for this prototype. The reasons behind this decision were to keep the software architecture modular, to make the system portable, and to manage the complexity of such a system. In retrospect this turned out to be a good decision, as we might not have been able to complete the project in time had we implemented it in another language such as C. The comprehensive libraries provided with Java allowed us to concentrate our efforts on the design of the system and its software architecture. A Java class was written for each component of the system (the different kinds of threads, the database interface, synchronized job queues, caches, etc.). First we wrote generic classes for the infrastructure components of the system, such as the synchronized job queues and caches. The LRUCache class implements an approximate LRU cache based on a hash table with overlapping buckets. The JobQueue class implements a generic synchronized job queue with an option for the probabilistic hybrid of the stack and queue ADTs. The main Crawler class performs the initialization by reading the configuration files, spawning the various threads accordingly, and initializing the job queues; it then behaves as the handler thread.
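For reference, a minimal exact LRU cache in Java can be built on LinkedHashMap's access order, as sketched below. Note this only illustrates the role the cache plays (URL-seen / DNS caching); the authors' LRUCache is an *approximate* LRU over a hash table with overlapping buckets, a different data structure.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal exact LRU cache: LinkedHashMap with accessOrder = true keeps
// entries ordered by most-recent access, and removeEldestEntry evicts
// the least-recently-used entry once capacity is exceeded.
public class SimpleLRUCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public SimpleLRUCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true gives LRU ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

A cache of this kind sits in front of the URL database: only on a cache miss does the handler thread pay the cost of a database query.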
A class named CrawlerThread performs the operation of a crawler thread: it gets a URL from its job queue and passes it to the URLlist class. The URLlist class then spawns a new thread that fetches the page, parses it for URL links, and returns the list of these URLs to the CrawlerThread. In Java the URL fetch operation is not guaranteed to return, and in the case of a malicious web server the whole thread could hang waiting for the operation to complete. This is why the URLlist class spawns a new thread for every fetch: the thread runs with a certain time-out, so if the fetch operation does not complete in time, the thread is stopped and normal operation resumes. Spawning a new thread to fetch each page does add overhead, but it is essential for the robustness of the system. The Sender and Receiver classes implement the sender and receiver threads respectively: the Receiver class opens a UDP socket on a pre-determined port and waits for packets, while the Sender class transmits URLs via UDP packets to the appropriate remote node. Besides the classes that form the system architecture described above, we added a Probe thread and a Measurement class: the relevant classes report their measurements to the Measurement class, and the Probe thread asks the Measurement class to output the measurements at configurable periodic intervals.

7. Evaluation and Results

We performed experiments to evaluate the performance and scalability of the system. Our experimental setup consisted of four Sun Ultra-30 machines; one crawler entity ran on each machine, and each entity was configured with 12 crawler threads. During the design we decided to store all the queues in memory, since the cost of memory is low and several cheap computer systems come equipped with 2 GB of RAM; we expected our program never to require more memory.
2 GB of RAM can accommodate about 20 million URLs in the various queues of each entity, and we do not expect the queue size of any particular node to exceed this when the URL space is divided among several nodes. Unfortunately,
we could not arrange such machines for our experiments; instead we ran them on machines with only 128 MB of RAM, with even less memory available to our process. In our experiment [Figure 2] we therefore faced problems due to the unavailability of the required memory: the nodes failed after memory overflow. Arrows in the graph depict node failures. The first of the systems went down after about 12 minutes due to memory overflow; by this time the system had crawled about pages, giving a throughput of documents per second. The second node went down after about twenty minutes; the throughput at this time was pages per second. The third and fourth nodes went down at about 57 minutes with a throughput of about 31.4 pages per second. This result, although not straightforward to interpret due to the failure of the nodes, is still very promising: at about 74 documents per second, one billion pages could be crawled in less than six months. Of course, tests on machines with the required amount of memory need to be performed to corroborate this throughput.
In Figure 3 we show the queue sizes and the number of pages crawled for one of the four systems in the above experiment. As seen in the graph, the number of pages crawled grows fairly linearly, indicating an almost constant throughput throughout the run. The graph also justifies our decision to keep one handler thread per crawler entity: except for a few temporary bursts, the handler-thread queue stays fairly short, so it can be inferred that one handler thread is enough to execute its functions quickly even for multiple crawler threads. The worker queue length identifies the culprit for the memory requirement: it increases at a much higher rate than the rate at which pages are crawled. To study the scalability of the system we computed the scalability factor for 4 nodes:

Scalability Factor = (throughput with 4 nodes working together) / (throughput with 4 nodes working independently)

We calculated the scalability factor after the first ten minutes of the crawler's execution and found it to be 97.9%. This shows extremely good scalability, as the system exhibits only about 2% overhead for distributing the task; i.e., the distribution of the task was quite effective. We also measured the number of URLs that needed to be forwarded to other peer nodes. For this purpose we introduce the distribution factor:

Distribution Factor = (number of local pages found) / (number of pages found)

Here local pages are pages that belong to the same URL subset as their parents; such pages are not forwarded to other peer nodes and do not generate network traffic. Needless to say, the higher the distribution factor the better, as a high value indicates an effective distribution of the crawl space. If the web were a random hyperlink structure, the expected value of the distribution factor would be 25% in our case of 4 nodes. In our experiments we found the distribution factor to be 65% (averaged over more than 100,000 crawled pages).
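The two evaluation metrics can be written out directly as code; the sketch below uses illustrative inputs, not the measured values from the experiment.

```java
// The two evaluation metrics defined above, as plain functions.
public class CrawlMetrics {
    // Throughput of the cooperating nodes divided by the summed
    // throughput of the same nodes running independently; a value
    // near 1.0 means near-zero distribution overhead.
    public static double scalabilityFactor(double togetherThroughput,
                                           double independentThroughput) {
        return togetherThroughput / independentThroughput;
    }

    // Fraction of discovered pages whose URL subset matches their
    // parent's, i.e. pages that need no forwarding to a peer node.
    public static double distributionFactor(long localPages, long totalPages) {
        return (double) localPages / totalPages;
    }
}
```

For example, a distribution factor of 0.65 on 4 nodes, against a 0.25 baseline for random links, is the quantitative form of the domain-locality argument made in Section 4.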
This again validates our claim that dividing the URLs into subsets based on domain names, and then assigning one subset to each node, is an effective distribution of the crawl task. Our next experiment explores the level of concurrency attainable within a single crawler entity of the system. In this experiment we use only one node and measure the performance of the system after 5 minutes while varying the number of crawler threads.
This graph [Figure 4], although not very smooth, provides a clear indication of increased throughput as the number of crawler threads grows, validating our claim of increased resource utilization. Beyond about 48 threads the throughput starts to decrease because of the synchronization overheads of the system; the graph suggests that around 32 to 48 crawler threads per crawler entity may provide optimum performance. In these experiments a single node achieved a throughput of 32 documents per second, again a very promising figure in terms of system performance.

8. Contributions of this project

The biggest contribution of this project is the concept of distributing crawl tasks based on disjoint subsets of the URL crawl space. We also presented a scalable, multi-threaded, peer-to-peer distributed architecture for a web crawler based on this concept. Another interesting contribution is the proposed probabilistic hybrid of depth-first and breadth-first traversal, although we were unable to study its advantages or disadvantages during this project; this traversal strategy achieves a hybrid of the two traditional strategies without any extra book-keeping and is very easy to implement. Finally, we implemented a complete web crawler that demonstrates all of the above concepts.

9. The learning experience

The foundations of this project were laid in discussions on web crawlers and the challenges in their design. Since the web space is growing exponentially, a proposed solution should be scalable, and it should be capable of making good use of a cluster of computers rather than depending on a single large-capacity machine. Discussions about the DFS and BFS navigation strategies for efficient crawling of the web prompted us to experiment with a probabilistic navigation strategy. The papers referred to in class, especially [2], also gave us insights into the design and implementation of such a system.
The design and implementation of this project was a very fruitful hands-on experience and turned out to be a very good design exercise in which we had to deal with real-world system issues. The project was initiated by a task-distribution idea, but to demonstrate the usefulness of the concept we had to design a whole system that exploited this idea in its architecture. While designing the architecture we faced the challenges, decisions, and trade-offs associated with Internet applications. Thus this project covered designing an Internet application from scratch: from design principles to system architecture, and then the implementation and evaluation of the system. We implemented the whole project on our own, using only the Java libraries, which was in itself very useful. Due to the nature of Internet applications such as this one, it is always important to emphasize efficient implementation as well as portability of the system. The system also included components from various domains: a multi-threaded architecture, synchronized job queues, LRU caches, crawlers, other networking components, database query components, and so on. Even though we had studied these components before, in this project we implemented all of them ourselves, which gave us insight into their implementation issues; beyond implementing the individual components, we also gained experience in integrating them to make the whole system work. During the evaluation we designed and executed experiments to validate our claims, which also gave us insight into the proper interpretation of experimental results and the logical derivation of conclusions. Throughout this project we experienced the fact mentioned in class: developing a web crawler is easy, but developing an efficient web crawler is very difficult.
10. Future extensions

Future extensions of the project include implementing the DNS cache in the crawler thread and studying the effect of the hybrid traversal strategy on the various cache-hit rates. A number of issues need to be addressed to make this system usable in the real world. The crawler needs to conform to the robot exclusion protocol, and we need to handle partial failure: although at present the failure of one node will not stop the other components, it would be desirable for another node to take over the task of the node that failed. Dynamic reconfiguration and dynamic load balancing would also be desirable.

11. Related work

The Google [1] web crawler is written in Python; it is single-threaded and uses asynchronous I/O to fetch data over several concurrent connections. The crawler transmits downloaded pages to a single store-server process, which compresses the data and stores it in a repository. Another famous web crawler is Mercator [2], a multithreaded web crawler written in Java. Although Mercator is not distributed, it does divide the URL space as our design does, to guarantee that only one thread will contact a given server. We do not deal with storing web pages or with indexing in this project. Our architecture is both distributed and multithreaded; in this way we increase concurrency within a single machine as well as across the entire system of several computers. We also have a distributed database with no central bottleneck, and we make use of a probabilistic search model for crawling web pages. This combination of features, improved resource utilization, and scalability distinguishes our work from related previous work.

12. Conclusion

Overall, the performance results of the crawler are very promising. We achieved a throughput of 75 documents per second. This is an encouraging result, as even at 31.7 pages per second one billion documents can be crawled in one year.
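The crawl-time arithmetic in the conclusion is easy to check (an illustrative helper, not part of the system):

```java
// Pages crawlable in one year at a sustained per-second throughput.
public class CrawlEstimate {
    public static double pagesPerYear(double pagesPerSecond) {
        return pagesPerSecond * 60 * 60 * 24 * 365; // seconds in a year
    }
}
```

At 31.7 pages per second this gives roughly 1.0 billion pages per year, matching the statement above; at the measured 75 documents per second the same billion would take well under six months.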
We have also validated our claims of scalability and improved resource utilization with the experimental results. Although the results are encouraging, more tests need to be conducted to find out whether such a system can really be useful in real-world situations.

References

[1] Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proceedings of the 7th International World Wide Web Conference, April 1998.
[2] Allan Heydon and Marc Najork. Mercator: A Scalable, Extensible Web Crawler. Compaq Systems Research Center.
[3] Class notes.
More informationCLIENT SERVER ARCHITECTURE:
CLIENT SERVER ARCHITECTURE: Client-Server architecture is an architectural deployment style that describe the separation of functionality into layers with each segment being a tier that can be located
More informationPROCESS VIRTUAL MEMORY. CS124 Operating Systems Winter , Lecture 18
PROCESS VIRTUAL MEMORY CS124 Operating Systems Winter 2015-2016, Lecture 18 2 Programs and Memory Programs perform many interactions with memory Accessing variables stored at specific memory locations
More informationSharePoint 2010 Technical Case Study: Microsoft SharePoint Server 2010 Enterprise Intranet Collaboration Environment
SharePoint 2010 Technical Case Study: Microsoft SharePoint Server 2010 Enterprise Intranet Collaboration Environment This document is provided as-is. Information and views expressed in this document, including
More informationMultiprocessor scheduling
Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.
More informationAddressed Issue. P2P What are we looking at? What is Peer-to-Peer? What can databases do for P2P? What can databases do for P2P?
Peer-to-Peer Data Management - Part 1- Alex Coman acoman@cs.ualberta.ca Addressed Issue [1] Placement and retrieval of data [2] Server architectures for hybrid P2P [3] Improve search in pure P2P systems
More informationCLUSTERING HIVEMQ. Building highly available, horizontally scalable MQTT Broker Clusters
CLUSTERING HIVEMQ Building highly available, horizontally scalable MQTT Broker Clusters 12/2016 About this document MQTT is based on a publish/subscribe architecture that decouples MQTT clients and uses
More information12 Abstract Data Types
12 Abstract Data Types 12.1 Foundations of Computer Science Cengage Learning Objectives After studying this chapter, the student should be able to: Define the concept of an abstract data type (ADT). Define
More informationCOMP3121/3821/9101/ s1 Assignment 1
Sample solutions to assignment 1 1. (a) Describe an O(n log n) algorithm (in the sense of the worst case performance) that, given an array S of n integers and another integer x, determines whether or not
More informationLecture 1: January 23
CMPSCI 677 Distributed and Operating Systems Spring 2019 Lecture 1: January 23 Lecturer: Prashant Shenoy Scribe: Jonathan Westin (2019), Bin Wang (2018) 1.1 Introduction to the course The lecture started
More informationChapter 1 Computer System Overview
Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Ninth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides
More informationInternational Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine
International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains
More informationP2P. 1 Introduction. 2 Napster. Alex S. 2.1 Client/Server. 2.2 Problems
P2P Alex S. 1 Introduction The systems we will examine are known as Peer-To-Peer, or P2P systems, meaning that in the network, the primary mode of communication is between equally capable peers. Basically
More informationCluster-Based Scalable Network Services
Cluster-Based Scalable Network Services Suhas Uppalapati INFT 803 Oct 05 1999 (Source : Fox, Gribble, Chawathe, and Brewer, SOSP, 1997) Requirements for SNS Incremental scalability and overflow growth
More informationCS5412: TRANSACTIONS (I)
1 CS5412: TRANSACTIONS (I) Lecture XVII Ken Birman Transactions 2 A widely used reliability technology, despite the BASE methodology we use in the first tier Goal for this week: in-depth examination of
More informationCrawling the Web. Web Crawling. Main Issues I. Type of crawl
Web Crawling Crawling the Web v Retrieve (for indexing, storage, ) Web pages by using the links found on a page to locate more pages. Must have some starting point 1 2 Type of crawl Web crawl versus crawl
More informationMarket Data Publisher In a High Frequency Trading Set up
Market Data Publisher In a High Frequency Trading Set up INTRODUCTION The main theme behind the design of Market Data Publisher is to make the latest trade & book data available to several integrating
More informationOverview Computer Networking Lecture 16: Delivering Content: Peer to Peer and CDNs Peter Steenkiste
Overview 5-44 5-44 Computer Networking 5-64 Lecture 6: Delivering Content: Peer to Peer and CDNs Peter Steenkiste Web Consistent hashing Peer-to-peer Motivation Architectures Discussion CDN Video Fall
More informationSUMMARY OF DATABASE STORAGE AND QUERYING
SUMMARY OF DATABASE STORAGE AND QUERYING 1. Why Is It Important? Usually users of a database do not have to care the issues on this level. Actually, they should focus more on the logical model of a database
More informationLecture 8: February 19
CMPSCI 677 Operating Systems Spring 2013 Lecture 8: February 19 Lecturer: Prashant Shenoy Scribe: Siddharth Gupta 8.1 Server Architecture Design of the server architecture is important for efficient and
More informationRunning Head: HOW A SEARCH ENGINE WORKS 1. How a Search Engine Works. Sara Davis INFO Spring Erika Gutierrez.
Running Head: 1 How a Search Engine Works Sara Davis INFO 4206.001 Spring 2016 Erika Gutierrez May 1, 2016 2 Search engines come in many forms and types, but they all follow three basic steps: crawling,
More informationDesign and Implementation of A P2P Cooperative Proxy Cache System
Design and Implementation of A PP Cooperative Proxy Cache System James Z. Wang Vipul Bhulawala Department of Computer Science Clemson University, Box 40974 Clemson, SC 94-0974, USA +1-84--778 {jzwang,
More informationRAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE
RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting
More informationOPERATING SYSTEM. Chapter 12: File System Implementation
OPERATING SYSTEM Chapter 12: File System Implementation Chapter 12: File System Implementation File-System Structure File-System Implementation Directory Implementation Allocation Methods Free-Space Management
More informationDistributed OrcaFlex. 1. Introduction. 2. What s New. Distributed OrcaFlex
1. Introduction is a suite of programs that enables a collection of networked, OrcaFlex licensed, computers to run OrcaFlex jobs as background tasks using spare processor time. consists of four separate
More informationIntroduction. Table of Contents
Introduction This is an informal manual on the gpu search engine 'gpuse'. There are some other documents available, this one tries to be a practical how-to-use manual. Table of Contents Introduction...
More informationSearching the Deep Web
Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index
More informationLecture 7: February 10
CMPSCI 677 Operating Systems Spring 2016 Lecture 7: February 10 Lecturer: Prashant Shenoy Scribe: Tao Sun 7.1 Server Design Issues 7.1.1 Server Design There are two types of server design choices: Iterative
More informationDistributed Systems. 05r. Case study: Google Cluster Architecture. Paul Krzyzanowski. Rutgers University. Fall 2016
Distributed Systems 05r. Case study: Google Cluster Architecture Paul Krzyzanowski Rutgers University Fall 2016 1 A note about relevancy This describes the Google search cluster architecture in the mid
More informationRACS: Extended Version in Java Gary Zibrat gdz4
RACS: Extended Version in Java Gary Zibrat gdz4 Abstract Cloud storage is becoming increasingly popular and cheap. It is convenient for companies to simply store their data online so that they don t have
More informationCHAPTER 5 GENERATING TEST SCENARIOS AND TEST CASES FROM AN EVENT-FLOW MODEL
CHAPTER 5 GENERATING TEST SCENARIOS AND TEST CASES FROM AN EVENT-FLOW MODEL 5.1 INTRODUCTION The survey presented in Chapter 1 has shown that Model based testing approach for automatic generation of test
More informationFinding a needle in Haystack: Facebook's photo storage
Finding a needle in Haystack: Facebook's photo storage The paper is written at facebook and describes a object storage system called Haystack. Since facebook processes a lot of photos (20 petabytes total,
More informationBIG-IP Local Traffic Management: Basics. Version 12.1
BIG-IP Local Traffic Management: Basics Version 12.1 Table of Contents Table of Contents Introduction to Local Traffic Management...7 About local traffic management...7 About the network map...7 Viewing
More informationPROJECT REPORT Thang Tran ( ), Sandeep Wadhwani ( ) and Vineet Kumar ( ) Introduction
PROJECT REPORT Thang Tran (264635), Sandeep Wadhwani (26476122) and Vineet Kumar (26461756) Introduction Zeromq Chat Application Our Zeromq Chat Application provides peer-to-peer instant messaging for
More informationA crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index.
A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. The major search engines on the Web all have such a program,
More informationA Capabilities Based Communication Model for High-Performance Distributed Applications: The Open HPC++ Approach
A Capabilities Based Communication Model for High-Performance Distributed Applications: The Open HPC++ Approach Shridhar Diwan, Dennis Gannon Department of Computer Science Indiana University Bloomington,
More informationA SURVEY ON WEB FOCUSED INFORMATION EXTRACTION ALGORITHMS
INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 A SURVEY ON WEB FOCUSED INFORMATION EXTRACTION ALGORITHMS Satwinder Kaur 1 & Alisha Gupta 2 1 Research Scholar (M.tech
More informationECE519 Advanced Operating Systems
IT 540 Operating Systems ECE519 Advanced Operating Systems Prof. Dr. Hasan Hüseyin BALIK (10 th Week) (Advanced) Operating Systems 10. Multiprocessor, Multicore and Real-Time Scheduling 10. Outline Multiprocessor
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY Computer Systems Engineering: Spring Quiz I
Department of Electrical Engineering and Computer Science MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.033 Computer Systems Engineering: Spring 2016 Quiz I There are 15 questions and 13 pages in this quiz booklet.
More informationFor use by students enrolled in #71251 CSE430 Fall 2012 at Arizona State University. Do not use if not enrolled.
Operating Systems: Internals and Design Principles Chapter 4 Threads Seventh Edition By William Stallings Operating Systems: Internals and Design Principles The basic idea is that the several components
More informationChapter 1 Computer System Overview
Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Seventh Edition By William Stallings Objectives of Chapter To provide a grand tour of the major computer system components:
More informationWhite paper ETERNUS Extreme Cache Performance and Use
White paper ETERNUS Extreme Cache Performance and Use The Extreme Cache feature provides the ETERNUS DX500 S3 and DX600 S3 Storage Arrays with an effective flash based performance accelerator for regions
More informationCSC 553 Operating Systems
CSC 553 Operating Systems Lecture 1- Computer System Overview Operating System Exploits the hardware resources of one or more processors Provides a set of services to system users Manages secondary memory
More informationCS 344/444 Computer Network Fundamentals Final Exam Solutions Spring 2007
CS 344/444 Computer Network Fundamentals Final Exam Solutions Spring 2007 Question 344 Points 444 Points Score 1 10 10 2 10 10 3 20 20 4 20 10 5 20 20 6 20 10 7-20 Total: 100 100 Instructions: 1. Question
More informationCHAPTER THREE INFORMATION RETRIEVAL SYSTEM
CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost
More informationUser Manual. Version 1.0. Submitted in partial fulfillment of the Masters of Software Engineering degree.
User Manual For KDD-Research Entity Search Tool (KREST) Version 1.0 Submitted in partial fulfillment of the Masters of Software Engineering degree. Eric Davis CIS 895 MSE Project Department of Computing
More informationThis project must be done in groups of 2 3 people. Your group must partner with one other group (or two if we have add odd number of groups).
1/21/2015 CS 739 Distributed Systems Fall 2014 PmWIki / Project1 PmWIki / Project1 The goal for this project is to implement a small distributed system, to get experience in cooperatively developing a
More informationLab 2: Threads and Processes
CS333: Operating Systems Lab Lab 2: Threads and Processes Goal The goal of this lab is to get you comfortable with writing basic multi-process / multi-threaded applications, and understanding their performance.
More informationApoidea: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web
Apoidea: A Decentralized Peer-to-Peer Architecture for Crawling the World Wide Web Aameek Singh, Mudhakar Srivatsa, Ling Liu, and Todd Miller College of Computing, Georgia Institute of Technology, Atlanta,
More informationWHITE PAPER: BEST PRACTICES. Sizing and Scalability Recommendations for Symantec Endpoint Protection. Symantec Enterprise Security Solutions Group
WHITE PAPER: BEST PRACTICES Sizing and Scalability Recommendations for Symantec Rev 2.2 Symantec Enterprise Security Solutions Group White Paper: Symantec Best Practices Contents Introduction... 4 The
More informationA Unix process s address space appears to be three regions of memory: a read-only text region (containing executable code); a read-write region
A Unix process s address space appears to be three regions of memory: a read-only text region (containing executable code); a read-write region consisting of initialized data (simply called data), uninitialized
More informationSoftware Architecture
Software Architecture Mestrado em Engenharia Informática e de Computadores COMPANION TO THE FIRST EXAM ON JANUARY 8TH, 2016 VERSION: A (You do not need to turn in this set of pages with your exam) 1. Consider
More informationPERFORMANCE MEASUREMENT OF WORLD WIDE WEB SERVERS
PERFORMANCE MEASUREMENT OF WORLD WIDE WEB SERVERS Cristina Hava & Liam Murphy 1 Abstract The World Wide Web (WWW, or Web) is one of the most important Internet services, and has been largely responsible
More informationFAWN as a Service. 1 Introduction. Jintian Liang CS244B December 13, 2017
Liang 1 Jintian Liang CS244B December 13, 2017 1 Introduction FAWN as a Service FAWN, an acronym for Fast Array of Wimpy Nodes, is a distributed cluster of inexpensive nodes designed to give users a view
More informationQ.1 Explain Computer s Basic Elements
Q.1 Explain Computer s Basic Elements Ans. At a top level, a computer consists of processor, memory, and I/O components, with one or more modules of each type. These components are interconnected in some
More informationCSci 4061 Introduction to Operating Systems. (Thread-Basics)
CSci 4061 Introduction to Operating Systems (Thread-Basics) Threads Abstraction: for an executing instruction stream Threads exist within a process and share its resources (i.e. memory) But, thread has
More informationSelf Adjusting Refresh Time Based Architecture for Incremental Web Crawler
IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.12, December 2008 349 Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler A.K. Sharma 1, Ashutosh
More informationAdvanced Database Systems
Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed
More informationWeb Crawling. Introduction to Information Retrieval CS 150 Donald J. Patterson
Web Crawling Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Robust Crawling A Robust Crawl Architecture DNS Doc.
More informationScalable overlay Networks
overlay Networks Dr. Samu Varjonen 1 Lectures MO 15.01. C122 Introduction. Exercises. Motivation. TH 18.01. DK117 Unstructured networks I MO 22.01. C122 Unstructured networks II TH 25.01. DK117 Bittorrent
More informationSSIM Collection & Archiving Infrastructure Scaling & Performance Tuning Guide
SSIM Collection & Archiving Infrastructure Scaling & Performance Tuning Guide April 2013 SSIM Engineering Team Version 3.0 1 Document revision history Date Revision Description of Change Originator 03/20/2013
More informationChapter 11: Implementing File Systems
Chapter 11: Implementing File Systems Operating System Concepts 99h Edition DM510-14 Chapter 11: Implementing File Systems File-System Structure File-System Implementation Directory Implementation Allocation
More informationAutomated Path Ascend Forum Crawling
Automated Path Ascend Forum Crawling Ms. Joycy Joy, PG Scholar Department of CSE, Saveetha Engineering College,Thandalam, Chennai-602105 Ms. Manju. A, Assistant Professor, Department of CSE, Saveetha Engineering
More informationSearching the Deep Web
Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index
More information